lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Raymond Xu	2467c137e4	[HUDI-3100] Add config for hive conditional sync (#4440 )	2022-01-06 23:26:35 -08:00
Vinish Reddy	eee715b3ff	[HUDI-3168] Fixing null schema with empty commit in incremental relation (#4513 )	2022-01-05 11:43:10 -05:00
Sivabalan Narayanan	7329d229d5	Adding tests to validate different key generators (#4473 )	2022-01-04 10:48:04 +05:30
leesf	29ab6fb9ad	[HUDI-3140] Fix bulk_insert failure on Spark 3.2.0 (#4498 )	2022-01-04 09:59:59 +08:00
harshal	2b2ae34cb9	[HUDI-2558] Fixing Clustering w/ sort columns with null values fails (#4404 )	2022-01-03 12:19:43 +05:30
Yann Byron	fe9406dd33	[HUDI-3131] fix ctas error in spark3.1.1 (#4476 )	2022-01-02 03:06:55 -08:00
Yann Byron	1622b52c9c	[HUDI-3136] Fix merge/insert/show partitions error on Spark3.2 (#4490 )	2022-01-02 02:42:10 -08:00
Shawy Geng	a4e622ac61	[HUDI-1951] Add bucket hash index, compatible with the hive bucket (#3173 ) * [HUDI-2154] Add index key field to HoodieKey * [HUDI-2157] Add the bucket index and its read/write implemention of Spark engine. * revert HUDI-2154 add index key field to HoodieKey * fix all comments and introduce a new tricky way to get index key at runtime support double insert for bucket index * revert spark read optimizer based on bucket index * add the storage layout * index tag, hash function and add ut * fix ut * address partial comments * Code review feedback * add layout config and docs * fix ut * rename hoodie.layout and rebase master Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-12-30 12:38:26 -08:00
ForwardXu	504747ecf4	[HUDI-3108] Fix Purge Drop MOR Table Cause error (#4455 )	2021-12-29 20:23:23 +08:00
Yann Byron	05942e018c	[HUDI-2811] Support Spark 3.2 (#4270 )	2021-12-28 00:12:44 -08:00
Yann Byron	1f7afba5e4	[HUDI-3093] fix spark-sql query table that write with TimestampBasedKeyGenerator (#4416 )	2021-12-27 21:39:52 -08:00
ForwardXu	282aa68552	[HUDI-3099] Purge drop partition for spark sql (#4436 )	2021-12-28 09:38:26 +08:00
xuzifu666	032b883bd1	[HUDI-3014] Add table option to set utc timezone (#4306 )	2021-12-23 16:27:45 +08:00
ForwardXu	5d93edc539	[HUDI-3060] drop table for spark sql (#4364 )	2021-12-22 19:17:43 +08:00
harshal patil	7d046f914a	[HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields	2021-12-21 11:54:52 +05:30
xuzifu666	3ca92108b2	remove unused import (#4349 )	2021-12-20 16:32:41 +08:00
Sivabalan Narayanan	03f71ef1a2	[HUDI-2970] Adding tests for archival of replace commit actions (#4268 )	2021-12-18 23:59:39 -08:00
xiarixiaoyao	9246b16492	[HUDI-2958] Automatically set spark.sql.parquet.writelegacyformat, when using bulkinsert to insert data which contains decimalType (#4253 )	2021-12-17 08:58:02 -05:00
xiarixiaoyao	294d712948	[HUDI-3001] Clean up the marker directory when finish bootstrap operation. (#4298 )	2021-12-16 12:36:01 -08:00
ForwardXu	dd96129191	[HUDI-2990] Sync to HMS when deleting partitions (#4291 )	2021-12-13 20:40:06 +08:00
Alexey Kudinkin	2d864f7524	[HUDI-2814] Make Z-index more generic Column-Stats Index (#4106 )	2021-12-10 14:56:09 -08:00
xiarixiaoyao	68f8597b12	[HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished. (#4265 ) * [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.	2021-12-09 19:51:49 +08:00
Yann Byron	2f96f4300b	Revert "[HUDI-2495] Resolve inconsistent key generation for timestamp types by GenericRecord and Row (#3944 )" (#4201 )	2021-12-03 11:13:38 -05:00
Alexey Kudinkin	bed7f9897a	[HUDI-2911] Removing default value for `PARTITIONPATH_FIELD_NAME` resulting in incorrect `KeyGenerator` configuration (#4195 )	2021-12-03 07:33:38 -05:00
Yann Byron	ca427240c0	[MINOR] use catalog schema if can not find table schema (#4182 )	2021-12-03 00:37:13 -08:00
zzzhy	61a03bc072	[MINOR] Fix the wrong usage of timestamp length variable bug (#4179 ) Signed-off-by: zzzhy <candle_1667@163.com>	2021-12-02 22:47:31 +08:00
董可伦	a398aad1fc	[HUDI-2642] Add support ignoring case in update sql operation (#3882 )	2021-11-29 22:36:36 -08:00
董可伦	3433f00cb5	[MINOR] Fix typo,rename 'getUrlEncodePartitoning' to 'getUrlEncodePartitioning' (#4130 )	2021-11-29 18:31:22 -08:00
Sivabalan Narayanan	38e75ea806	Removing rfc from release package and fixing release validation script (#4147 )	2021-11-29 13:18:35 +08:00
yuzhao.cyz	a1d0ff4209	Moving to 0.11.0-SNAPSHOT on master branch.	2021-11-27 17:22:10 +08:00
xiarixiaoyao	780a2ac5b2	[HUDI-2102] Support hilbert curve for hudi (#3952 ) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2021-11-26 23:20:19 -08:00
Raymond Xu	3a8d64e584	[HUDI-2868] Fix skipped HoodieSparkSqlWriterSuite (#4125 ) - Co-authored-by: Yann Byron <biyan900116@gmail.com>	2021-11-26 22:59:20 -05:00
Y Ethan Guo	d1e83e4ba0	[HUDI-2767] Enabling timeline-server-based marker as default (#4112 ) - Changes the default config of marker type (HoodieWriteConfig.MARKERS_TYPE or hoodie.write.markers.type) from DIRECT to TIMELINE_SERVER_BASED for Spark Engine. - Adds engine-specific marker type configs: Spark -> TIMELINE_SERVER_BASED, Flink -> DIRECT, Java -> DIRECT. - Uses DIRECT markers as well for Spark structured streaming due to timeline server only available for the first mini-batch. - Fixes the marker creation method for non-partitioned table in TimelineServerBasedWriteMarkers. - Adds the fallback to direct markers even when TIMELINE_SERVER_BASED is configured, in WriteMarkersFactory: when HDFS is used, or embedded timeline server is disabled, the fallback to direct markers happens. - Fixes the closing of timeline service. - Fixes tests that depend on markers, mainly by starting the timeline service for each test.	2021-11-26 16:41:05 -05:00
Alexey Kudinkin	5755ff25a4	[HUDI-2814] Addressing issues w/ Z-order Layout Optimization (#4060 ) * `ZCurveOptimizeHelper` > `ZOrderingIndexHelper`; Moved Z-index helper under `hudi.index.zorder` package * Tidying up `ZOrderingIndexHelper` * Fixing compilation * Fixed index new/original table merging sequence to always prefer values from new index; Cleaned up `HoodieSparkUtils` * Added test for `mergeIndexSql` * Abstracted Z-index name composition w/in `ZOrderingIndexHelper`; * Fixed `DataSkippingUtils` to interrupt prunning in case data filter contains non-indexed column reference * Properly handle exceptions origination during pruning in `HoodieFileIndex` * Make sure no errors are logged upon encountering `AnalysisException` * Cleaned up Z-index updating sequence; Tidying up comments, java-docs; * Fixed Z-index to properly handle changes of the list of clustered columns * Tidying up * `lint` * Suppressing `JavaDocStyle` first sentence check * Fixed compilation * Fixing incorrect `DecimalType` conversion * Refactored test `TestTableLayoutOptimization` - Added Z-index table composition test (against fixtures) - Separated out GC test; Tidying up * Fixed tests re-shuffling column order for Z-Index table `DataFrame` to align w/ the one by one loaded from JSON * Scaffolded `DataTypeUtils` to do basic checks of Spark types; Added proper compatibility checking b/w old/new index-tables * Added test for Z-index tables merging * Fixed import being shaded by creating internal `hudi.util` package * Fixed packaging for `TestOptimizeTable` * Revised `updateMetadataIndex` seq to provide Z-index updating process w/ source table schema * Make sure existing Z-index table schema is sync'd to source table's one * Fixed shaded refs * Fixed tests * Fixed type conversion of Parquet provided metadata values into Spark expected schemas * Fixed `composeIndexSchema` utility to propose proper schema * Added more tests for Z-index: - Checking that Z-index table is built correctly - Checking that Z-index tables are merged correctly (during update) * Fixing source table * Fixing tests to read from Parquet w/ proper schema * Refactored `ParquetUtils` utility reading stats from Parquet footers * Fixed incorrect handling of Decimals extracted from Parquet footers * Worked around issues in javac failign to compile stream's collection * Fixed handling of `Date` type * Fixed handling of `DateType` to be parsed as `LocalDate` * Updated fixture; Make sure test loads Z-index fixture using proper schema * Removed superfluous scheme adjusting when reading from Parquet, since Spark is actually able to perfectly restore schema (given Parquet was previously written by Spark as well) * Fixing race-condition in Parquet's `DateStringifier` trying to share `SimpleDataFormat` object which is inherently not thread-safe * Tidying up * Make sure schema is used upon reading to validate input files are in the appropriate format; Tidying up; * Worked around javac (1.8) inability to infer expression type properly * Updated fixtures; Tidying up * Fixing compilation after rebase * Assert clustering have in Z-order layout optimization testing * Tidying up exception messages * XXX * Added test validating Z-index lookup filter correctness * Added more test-cases; Tidying up * Added tests for string expressions * Fixed incorrect Z-index filter lookup translations * Added more test-cases * Added proper handling on complex negations of AND/OR expressions by pushing NOT operator down into inner expressions for appropriate handling * Added `-target:jvm-1.8` for `hudi-spark` module * Adding more tests * Added tests for non-indexed columns * Properly handle non-indexed columns by falling back to a re-write of containing expression as `TrueLiteral` instead * Fixed tests * Removing the parquet test files and disabling corresponding tests Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-11-26 10:02:15 -08:00
Danny Chan	a2eb2b0b0a	[HUDI-2480] FileSlice after pending compaction-requested instant-time… (#3703 ) * [HUDI-2480] FileSlice after pending compaction-requested instant-time is ignored by MOR snapshot reader * include file slice after a pending compaction for spark reader Co-authored-by: garyli1019 <yanjia.gary.li@gmail.com>	2021-11-25 22:30:09 +08:00
Alexey Kudinkin	60b23b9797	[HUDI-2788] Fixing issues w/ Z-order Layout Optimization (#4026 ) * Simplyfying, tidying up * Fixed packaging for `TestOptimizeTable` * Cleaned up `HoodiFileIndex` file filtering seq; Removed optimization manually reading Parquet table circumventing Spark * Refactored `DataSkippingUtils`: - Fixed checks to validate all statistics cols are present - Fixed some predicates being constructed incorrectly - Rewrote comments for easier comprehension, added more notes - Tidying up * Tidying up tests * `lint` * Fixing compilation * `TestOptimizeTable` > `TestTableLayoutOptimization`; Added assertions to test data skipping paths * Fixed tests to properly hit data-skipping path * Fixed pruned files candidates lookup seq to conservatively included all non-indexed files * Added java-doc * Fixed compilation	2021-11-24 10:10:28 -08:00
Yann Byron	a234833f0a	[HUDI-2759] extract HoodieCatalogTable to coordinate spark catalog table and hoodie table (#3998 )	2021-11-24 02:12:38 -08:00
Yann Byron	0cf2f103e0	[HUDI-2838] refresh table after drop partition (#4084 )	2021-11-23 19:46:48 -08:00
Alexey Kudinkin	18cf59507f	[HUDI-2831] Securing usages of `SimpleDateFormat` to be thread-safe (#4073 )	2021-11-23 20:25:11 -05:00
董可伦	969a5bf11e	[MINOR] Fix typo,rename 'HooodieAvroDeserializer' to 'HoodieAvroDeserializer' (#4064 )	2021-11-23 19:10:57 +08:00
xiarixiaoyao	9de9951348	[HUDI-2778] Optimize statistics collection related codes and add some docs for z-order add fix some bugs (#4013 ) * [HUDI-2778] Optimize statistics collection related codes and add more docs for z-order. * add test code for multi-thread parquet footer read	2021-11-22 21:46:02 -08:00
Sivabalan Narayanan	fc9ca6a07a	[HUDI-2559] Converting commit timestamp format to millisecs (#4024 ) - Adds support for generating commit timestamps with millisecs granularity. - Older commit timestamps (in secs granularity) will be suffixed with 999 and parsed with millisecs format.	2021-11-22 11:44:38 -05:00
Raymond Xu	02f7ca2b05	[HUDI-1870] Add more Spark CI build tasks (#4022 ) * [HUDI-1870] Add more Spark CI build tasks - build for spark3.0.x - build for spark-shade-unbundle-avro - fix build failures - delete unnecessary assertion for spark 3.0.x - use AvroConversionUtils#convertAvroSchemaToStructType instead of calling SchemaConverters#toSqlType directly to solve the compilation failures with spark-shade-unbundle-avro (#5) Co-authored-by: Yann <biyan900116@gmail.com>	2021-11-22 02:16:45 -08:00
wenningd	3dc6262437	[HUDI-2242] Add configuration inference logic for few options (#3359 ) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2021-11-19 19:38:38 -08:00
wenningd	24def0b30d	[HUDI-2362] Add external config file support (#3416 ) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2021-11-18 01:59:26 -08:00
董可伦	4d884bdaa9	[MINOR] Fix typo,'Hooide' corrected to 'Hoodie' (#4007 )	2021-11-17 16:50:04 +08:00
Yann Byron	0bb6d8ff80	[HUDI-2706] refactor spark-sql to make consistent with DataFrame api (#3936 )	2021-11-14 15:44:39 -08:00
Sivabalan Narayanan	93fd3517e3	[HUDI-2741] Fixing instantiating metadata table config in HoodieFileIndex (#3974 )	2021-11-12 17:28:25 -05:00
Yann Byron	90529aa552	[HUDI-2495] Resolve inconsistent key generation for timestamp types by GenericRecord and Row (#3944 )	2021-11-11 11:54:34 +08:00
Sagar Sumit	187bedf795	[HUDI-2442] Change default values for certin clustering configs (#3875 )	2021-11-10 14:23:24 +05:30

220 Commits