209 Commits

Author SHA1 Message Date
ForwardXu
282aa68552 [HUDI-3099] Purge drop partition for spark sql (#4436) 2021-12-28 09:38:26 +08:00
xuzifu666
032b883bd1 [HUDI-3014] Add table option to set utc timezone (#4306) 2021-12-23 16:27:45 +08:00
ForwardXu
5d93edc539 [HUDI-3060] drop table for spark sql (#4364) 2021-12-22 19:17:43 +08:00
harshal patil
7d046f914a [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields 2021-12-21 11:54:52 +05:30
xuzifu666
3ca92108b2 remove unused import (#4349) 2021-12-20 16:32:41 +08:00
Sivabalan Narayanan
03f71ef1a2 [HUDI-2970] Adding tests for archival of replace commit actions (#4268) 2021-12-18 23:59:39 -08:00
xiarixiaoyao
9246b16492 [HUDI-2958] Automatically set spark.sql.parquet.writelegacyformat, when using bulkinsert to insert data which contains decimalType (#4253) 2021-12-17 08:58:02 -05:00
xiarixiaoyao
294d712948 [HUDI-3001] Clean up the marker directory when finish bootstrap operation. (#4298) 2021-12-16 12:36:01 -08:00
ForwardXu
dd96129191 [HUDI-2990] Sync to HMS when deleting partitions (#4291) 2021-12-13 20:40:06 +08:00
Alexey Kudinkin
2d864f7524 [HUDI-2814] Make Z-index more generic Column-Stats Index (#4106) 2021-12-10 14:56:09 -08:00
xiarixiaoyao
68f8597b12 [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished. (#4265)
* [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.
2021-12-09 19:51:49 +08:00
Yann Byron
2f96f4300b Revert "[HUDI-2495] Resolve inconsistent key generation for timestamp types by GenericRecord and Row (#3944)" (#4201) 2021-12-03 11:13:38 -05:00
Alexey Kudinkin
bed7f9897a [HUDI-2911] Removing default value for PARTITIONPATH_FIELD_NAME resulting in incorrect KeyGenerator configuration (#4195) 2021-12-03 07:33:38 -05:00
Yann Byron
ca427240c0 [MINOR] use catalog schema if can not find table schema (#4182) 2021-12-03 00:37:13 -08:00
zzzhy
61a03bc072 [MINOR] Fix the wrong usage of timestamp length variable bug (#4179)
Signed-off-by: zzzhy <candle_1667@163.com>
2021-12-02 22:47:31 +08:00
董可伦
a398aad1fc [HUDI-2642] Add support ignoring case in update sql operation (#3882) 2021-11-29 22:36:36 -08:00
董可伦
3433f00cb5 [MINOR] Fix typo,rename 'getUrlEncodePartitoning' to 'getUrlEncodePartitioning' (#4130) 2021-11-29 18:31:22 -08:00
Sivabalan Narayanan
38e75ea806 Removing rfc from release package and fixing release validation script (#4147) 2021-11-29 13:18:35 +08:00
yuzhao.cyz
a1d0ff4209 Moving to 0.11.0-SNAPSHOT on master branch. 2021-11-27 17:22:10 +08:00
xiarixiaoyao
780a2ac5b2 [HUDI-2102] Support hilbert curve for hudi (#3952)
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
2021-11-26 23:20:19 -08:00
Raymond Xu
3a8d64e584 [HUDI-2868] Fix skipped HoodieSparkSqlWriterSuite (#4125)
- Co-authored-by: Yann Byron <biyan900116@gmail.com>
2021-11-26 22:59:20 -05:00
Y Ethan Guo
d1e83e4ba0 [HUDI-2767] Enabling timeline-server-based marker as default (#4112)
- Changes the default config of marker type (HoodieWriteConfig.MARKERS_TYPE or hoodie.write.markers.type) from DIRECT to TIMELINE_SERVER_BASED for Spark Engine.
- Adds engine-specific marker type configs: Spark -> TIMELINE_SERVER_BASED, Flink -> DIRECT, Java -> DIRECT.
- Uses DIRECT markers as well for Spark structured streaming due to timeline server only available for the first mini-batch.
- Fixes the marker creation method for non-partitioned table in TimelineServerBasedWriteMarkers.
- Adds the fallback to direct markers even when TIMELINE_SERVER_BASED is configured, in WriteMarkersFactory: when HDFS is used, or embedded timeline server is disabled, the fallback to direct markers happens.
- Fixes the closing of timeline service.
- Fixes tests that depend on markers, mainly by starting the timeline service for each test.
2021-11-26 16:41:05 -05:00
Alexey Kudinkin
5755ff25a4 [HUDI-2814] Addressing issues w/ Z-order Layout Optimization (#4060)
* `ZCurveOptimizeHelper` > `ZOrderingIndexHelper`;
Moved Z-index helper under `hudi.index.zorder` package

* Tidying up `ZOrderingIndexHelper`

* Fixing compilation

* Fixed index new/original table merging sequence to always prefer values from new index;
Cleaned up `HoodieSparkUtils`

* Added test for `mergeIndexSql`

* Abstracted Z-index name composition w/in `ZOrderingIndexHelper`;

* Fixed `DataSkippingUtils` to interrupt pruning in case data filter contains non-indexed column reference

* Properly handle exceptions originating during pruning in `HoodieFileIndex`

* Make sure no errors are logged upon encountering `AnalysisException`

* Cleaned up Z-index updating sequence;
Tidying up comments, java-docs;

* Fixed Z-index to properly handle changes of the list of clustered columns

* Tidying up

* `lint`

* Suppressing `JavaDocStyle` first sentence check

* Fixed compilation

* Fixing incorrect `DecimalType` conversion

* Refactored test `TestTableLayoutOptimization`
  - Added Z-index table composition test (against fixtures)
  - Separated out GC test;
Tidying up

* Fixed tests re-shuffling column order for Z-Index table `DataFrame` to align w/ the one loaded from JSON

* Scaffolded `DataTypeUtils` to do basic checks of Spark types;
Added proper compatibility checking b/w old/new index-tables

* Added test for Z-index tables merging

* Fixed import being shaded by creating internal `hudi.util` package

* Fixed packaging for `TestOptimizeTable`

* Revised `updateMetadataIndex` seq to provide Z-index updating process w/ source table schema

* Make sure existing Z-index table schema is sync'd to source table's one

* Fixed shaded refs

* Fixed tests

* Fixed type conversion of Parquet provided metadata values into Spark expected schemas

* Fixed `composeIndexSchema` utility to propose proper schema

* Added more tests for Z-index:
  - Checking that Z-index table is built correctly
  - Checking that Z-index tables are merged correctly (during update)

* Fixing source table

* Fixing tests to read from Parquet w/ proper schema

* Refactored `ParquetUtils` utility reading stats from Parquet footers

* Fixed incorrect handling of Decimals extracted from Parquet footers

* Worked around issues in javac failing to compile stream's collection

* Fixed handling of `Date` type

* Fixed handling of `DateType` to be parsed as `LocalDate`

* Updated fixture;
Make sure test loads Z-index fixture using proper schema

* Removed superfluous schema adjusting when reading from Parquet, since Spark is actually able to perfectly restore schema (given Parquet was previously written by Spark as well)

* Fixing race-condition in Parquet's `DateStringifier` trying to share `SimpleDateFormat` object which is inherently not thread-safe

* Tidying up

* Make sure schema is used upon reading to validate input files are in the appropriate format;
Tidying up;

* Worked around javac (1.8) inability to infer expression type properly

* Updated fixtures;
Tidying up

* Fixing compilation after rebase

* Assert clustering have in Z-order layout optimization testing

* Tidying up exception messages

* XXX

* Added test validating Z-index lookup filter correctness

* Added more test-cases;
Tidying up

* Added tests for string expressions

* Fixed incorrect Z-index filter lookup translations

* Added more test-cases

* Added proper handling on complex negations of AND/OR expressions by pushing NOT operator down into inner expressions for appropriate handling

* Added `-target:jvm-1.8` for `hudi-spark` module

* Adding more tests

* Added tests for non-indexed columns

* Properly handle non-indexed columns by falling back to a re-write of containing expression as  `TrueLiteral` instead

* Fixed tests

* Removing the parquet test files and disabling corresponding tests

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-11-26 10:02:15 -08:00
Danny Chan
a2eb2b0b0a [HUDI-2480] FileSlice after pending compaction-requested instant-time… (#3703)
* [HUDI-2480] FileSlice after pending compaction-requested instant-time is ignored by MOR snapshot reader

* include file slice after a pending compaction for spark reader

Co-authored-by: garyli1019 <yanjia.gary.li@gmail.com>
2021-11-25 22:30:09 +08:00
Alexey Kudinkin
60b23b9797 [HUDI-2788] Fixing issues w/ Z-order Layout Optimization (#4026)
* Simplifying, tidying up

* Fixed packaging for `TestOptimizeTable`

* Cleaned up `HoodieFileIndex` file filtering seq;
Removed optimization manually reading Parquet table circumventing Spark

* Refactored `DataSkippingUtils`:
  - Fixed checks to validate all statistics cols are present
  - Fixed some predicates being constructed incorrectly
  - Rewrote comments for easier comprehension, added more notes
  - Tidying up

* Tidying up tests

* `lint`

* Fixing compilation

* `TestOptimizeTable` > `TestTableLayoutOptimization`;
Added assertions to test data skipping paths

* Fixed tests to properly hit data-skipping path

* Fixed pruned files candidates lookup seq to conservatively include all non-indexed files

* Added java-doc

* Fixed compilation
2021-11-24 10:10:28 -08:00
Yann Byron
a234833f0a [HUDI-2759] extract HoodieCatalogTable to coordinate spark catalog table and hoodie table (#3998) 2021-11-24 02:12:38 -08:00
Yann Byron
0cf2f103e0 [HUDI-2838] refresh table after drop partition (#4084) 2021-11-23 19:46:48 -08:00
Alexey Kudinkin
18cf59507f [HUDI-2831] Securing usages of SimpleDateFormat to be thread-safe (#4073) 2021-11-23 20:25:11 -05:00
董可伦
969a5bf11e [MINOR] Fix typo,rename 'HooodieAvroDeserializer' to 'HoodieAvroDeserializer' (#4064) 2021-11-23 19:10:57 +08:00
xiarixiaoyao
9de9951348 [HUDI-2778] Optimize statistics collection related code, add some docs for z-order, and fix some bugs (#4013)
* [HUDI-2778] Optimize statistics collection related codes and add more docs for z-order.

* add test code for multi-thread parquet footer read
2021-11-22 21:46:02 -08:00
Sivabalan Narayanan
fc9ca6a07a [HUDI-2559] Converting commit timestamp format to millisecs (#4024)
- Adds support for generating commit timestamps with millisecs granularity. 
- Older commit timestamps (in secs granularity) will be suffixed with 999 and parsed with millisecs format.
2021-11-22 11:44:38 -05:00
Raymond Xu
02f7ca2b05 [HUDI-1870] Add more Spark CI build tasks (#4022)
* [HUDI-1870] Add more Spark CI build tasks

- build for spark3.0.x
- build for spark-shade-unbundle-avro
- fix build failures
  - delete unnecessary assertion for spark 3.0.x
  - use AvroConversionUtils#convertAvroSchemaToStructType instead of calling SchemaConverters#toSqlType directly to solve the compilation failures with spark-shade-unbundle-avro (#5)

Co-authored-by: Yann <biyan900116@gmail.com>
2021-11-22 02:16:45 -08:00
wenningd
3dc6262437 [HUDI-2242] Add configuration inference logic for few options (#3359)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2021-11-19 19:38:38 -08:00
wenningd
24def0b30d [HUDI-2362] Add external config file support (#3416)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2021-11-18 01:59:26 -08:00
董可伦
4d884bdaa9 [MINOR] Fix typo,'Hooide' corrected to 'Hoodie' (#4007) 2021-11-17 16:50:04 +08:00
Yann Byron
0bb6d8ff80 [HUDI-2706] refactor spark-sql to make consistent with DataFrame api (#3936) 2021-11-14 15:44:39 -08:00
Sivabalan Narayanan
93fd3517e3 [HUDI-2741] Fixing instantiating metadata table config in HoodieFileIndex (#3974) 2021-11-12 17:28:25 -05:00
Yann Byron
90529aa552 [HUDI-2495] Resolve inconsistent key generation for timestamp types by GenericRecord and Row (#3944) 2021-11-11 11:54:34 +08:00
Sagar Sumit
187bedf795 [HUDI-2442] Change default values for certain clustering configs (#3875) 2021-11-10 14:23:24 +05:30
Shawy Geng
cf2ecd77ba [HUDI-2679] Fix the TestMergeIntoLogOnlyTable typo. (#3918) 2021-11-07 10:19:17 -08:00
manasaks
e0285800fb HUDI-1827 : Add ORC support in Bootstrap Op (#3457)
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2021-11-06 12:23:20 -04:00
董可伦
844346c3ab [HUDI-2471] Add support ignoring case in merge into (#3700) 2021-11-05 07:50:16 -07:00
Prashant Wason
b7ee341e14 [HUDI-1794] Moved static COMMIT_FORMATTER to thread local variable as SimpleDateFormat is not thread safe. (#2819) 2021-11-05 09:31:42 -04:00
Yann Byron
6351e5f4d0 [HUDI-2538] persist some configs to hoodie.properties when the first write (#3823) 2021-11-03 10:04:23 +08:00
Yann Byron
1f17467f73 [HUDI-1869] Upgrading Spark3 To 3.1 (#3844)
Co-authored-by: pengzhiwei <pengzhiwei2015@icloud.com>
2021-11-02 18:25:12 -07:00
Carl-Zhou-CN
dee3a14aae [HUDI-2582] Support concurrent key gen for different tables with row writer path (#3817)
Co-authored-by: yao.zhou <yao.zhou@linkflowtech.com>
2021-11-02 18:05:09 -04:00
xiarixiaoyao
d194643b49 [HUDI-2101][RFC-28] support z-order for hudi (#3330)
* [HUDI-2101]support z-order for hudi

* Renaming some configs for consistency/simplicity.

* Minor code cleanups

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-11-02 09:31:57 -07:00
Y Ethan Guo
0223c442ec [HUDI-2502] Refactor index in hudi-client module (#3778)
- Refactor Index to reduce Line of Code and re-use across engines.
2021-10-28 04:16:00 -04:00
vinoyang
b1c4acf0ae [HUDI-2614] Remove duplicated hadoop-hdfs with tests classifier exists in bundles (#3864) 2021-10-26 22:36:10 +08:00
Yann Byron
1e2be85a0f [HUDI-2482] support 'drop partition' sql (#3754) 2021-10-19 22:09:53 +08:00