lanyuanxiaoyao/hudi - hudi - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
xiarixiaoyao	780a2ac5b2	[HUDI-2102] Support hilbert curve for hudi (#3952) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2021-11-26 23:20:19 -08:00
Raymond Xu	3a8d64e584	[HUDI-2868] Fix skipped HoodieSparkSqlWriterSuite (#4125) Co-authored-by: Yann Byron <biyan900116@gmail.com>	2021-11-26 22:59:20 -05:00
Y Ethan Guo	d1e83e4ba0	[HUDI-2767] Enabling timeline-server-based marker as default (#4112) - Changes the default config of marker type (HoodieWriteConfig.MARKERS_TYPE or hoodie.write.markers.type) from DIRECT to TIMELINE_SERVER_BASED for the Spark engine. - Adds engine-specific marker type configs: Spark -> TIMELINE_SERVER_BASED, Flink -> DIRECT, Java -> DIRECT. - Uses DIRECT markers for Spark structured streaming as well, because the timeline server is only available for the first mini-batch. - Fixes the marker creation method for non-partitioned tables in TimelineServerBasedWriteMarkers. - Adds a fallback to direct markers in WriteMarkersFactory even when TIMELINE_SERVER_BASED is configured: the fallback happens when HDFS is used or the embedded timeline server is disabled. - Fixes the closing of the timeline service. - Fixes tests that depend on markers, mainly by starting the timeline service for each test.	2021-11-26 16:41:05 -05:00
Alexey Kudinkin	5755ff25a4	[HUDI-2814] Addressing issues w/ Z-order Layout Optimization (#4060) * `ZCurveOptimizeHelper` > `ZOrderingIndexHelper`; Moved Z-index helper under `hudi.index.zorder` package * Tidying up `ZOrderingIndexHelper` * Fixing compilation * Fixed index new/original table merging sequence to always prefer values from new index; Cleaned up `HoodieSparkUtils` * Added test for `mergeIndexSql` * Abstracted Z-index name composition w/in `ZOrderingIndexHelper`; * Fixed `DataSkippingUtils` to interrupt pruning in case data filter contains non-indexed column reference * Properly handle exceptions originating during pruning in `HoodieFileIndex` * Make sure no errors are logged upon encountering `AnalysisException` * Cleaned up Z-index updating sequence; Tidying up comments, java-docs; * Fixed Z-index to properly handle changes of the list of clustered columns * Tidying up * `lint` * Suppressing `JavaDocStyle` first sentence check * Fixed compilation * Fixing incorrect `DecimalType` conversion * Refactored test `TestTableLayoutOptimization` - Added Z-index table composition test (against fixtures) - Separated out GC test; Tidying up * Fixed tests re-shuffling column order for Z-Index table `DataFrame` to align w/ the one by one loaded from JSON * Scaffolded `DataTypeUtils` to do basic checks of Spark types; Added proper compatibility checking b/w old/new index-tables * Added test for Z-index tables merging * Fixed import being shaded by creating internal `hudi.util` package * Fixed packaging for `TestOptimizeTable` * Revised `updateMetadataIndex` seq to provide Z-index updating process w/ source table schema * Make sure existing Z-index table schema is sync'd to source table's one * Fixed shaded refs * Fixed tests * Fixed type conversion of Parquet provided metadata values into Spark expected schemas * Fixed `composeIndexSchema` utility to propose proper schema * Added more tests for Z-index: - Checking that Z-index table is built correctly - Checking that Z-index tables are merged correctly (during update) * Fixing source table * Fixing tests to read from Parquet w/ proper schema * Refactored `ParquetUtils` utility reading stats from Parquet footers * Fixed incorrect handling of Decimals extracted from Parquet footers * Worked around issues in javac failing to compile stream's collection * Fixed handling of `Date` type * Fixed handling of `DateType` to be parsed as `LocalDate` * Updated fixture; Make sure test loads Z-index fixture using proper schema * Removed superfluous schema adjusting when reading from Parquet, since Spark is actually able to perfectly restore schema (given Parquet was previously written by Spark as well) * Fixing race-condition in Parquet's `DateStringifier` trying to share `SimpleDateFormat` object which is inherently not thread-safe * Tidying up * Make sure schema is used upon reading to validate input files are in the appropriate format; Tidying up; * Worked around javac (1.8) inability to infer expression type properly * Updated fixtures; Tidying up * Fixing compilation after rebase * Assert clustering have in Z-order layout optimization testing * Tidying up exception messages * XXX * Added test validating Z-index lookup filter correctness * Added more test-cases; Tidying up * Added tests for string expressions * Fixed incorrect Z-index filter lookup translations * Added more test-cases * Added proper handling of complex negations of AND/OR expressions by pushing NOT operator down into inner expressions for appropriate handling * Added `-target:jvm-1.8` for `hudi-spark` module * Adding more tests * Added tests for non-indexed columns * Properly handle non-indexed columns by falling back to a re-write of containing expression as `TrueLiteral` instead * Fixed tests * Removing the parquet test files and disabling corresponding tests Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-11-26 10:02:15 -08:00
Danny Chan	a2eb2b0b0a	[HUDI-2480] FileSlice after pending compaction-requested instant-time… (#3703) * [HUDI-2480] FileSlice after pending compaction-requested instant-time is ignored by MOR snapshot reader * include file slice after a pending compaction for spark reader Co-authored-by: garyli1019 <yanjia.gary.li@gmail.com>	2021-11-25 22:30:09 +08:00
Alexey Kudinkin	60b23b9797	[HUDI-2788] Fixing issues w/ Z-order Layout Optimization (#4026) * Simplifying, tidying up * Fixed packaging for `TestOptimizeTable` * Cleaned up `HoodieFileIndex` file filtering seq; Removed optimization manually reading Parquet table circumventing Spark * Refactored `DataSkippingUtils`: - Fixed checks to validate all statistics cols are present - Fixed some predicates being constructed incorrectly - Rewrote comments for easier comprehension, added more notes - Tidying up * Tidying up tests * `lint` * Fixing compilation * `TestOptimizeTable` > `TestTableLayoutOptimization`; Added assertions to test data skipping paths * Fixed tests to properly hit data-skipping path * Fixed pruned files candidates lookup seq to conservatively include all non-indexed files * Added java-doc * Fixed compilation	2021-11-24 10:10:28 -08:00
Yann Byron	a234833f0a	[HUDI-2759] extract HoodieCatalogTable to coordinate spark catalog table and hoodie table (#3998)	2021-11-24 02:12:38 -08:00
Yann Byron	0cf2f103e0	[HUDI-2838] refresh table after drop partition (#4084)	2021-11-23 19:46:48 -08:00
Alexey Kudinkin	18cf59507f	[HUDI-2831] Securing usages of `SimpleDateFormat` to be thread-safe (#4073)	2021-11-23 20:25:11 -05:00
董可伦	969a5bf11e	[MINOR] Fix typo, rename 'HooodieAvroDeserializer' to 'HoodieAvroDeserializer' (#4064)	2021-11-23 19:10:57 +08:00
xiarixiaoyao	9de9951348	[HUDI-2778] Optimize statistics collection related code, add some docs for z-order, and fix some bugs (#4013) * [HUDI-2778] Optimize statistics collection related code and add more docs for z-order. * add test code for multi-thread parquet footer read	2021-11-22 21:46:02 -08:00
Sivabalan Narayanan	fc9ca6a07a	[HUDI-2559] Converting commit timestamp format to millisecs (#4024) - Adds support for generating commit timestamps with millisecs granularity. - Older commit timestamps (in secs granularity) will be suffixed with 999 and parsed with millisecs format.	2021-11-22 11:44:38 -05:00
Raymond Xu	02f7ca2b05	[HUDI-1870] Add more Spark CI build tasks (#4022) * [HUDI-1870] Add more Spark CI build tasks - build for spark3.0.x - build for spark-shade-unbundle-avro - fix build failures - delete unnecessary assertion for spark 3.0.x - use AvroConversionUtils#convertAvroSchemaToStructType instead of calling SchemaConverters#toSqlType directly to solve the compilation failures with spark-shade-unbundle-avro (#5) Co-authored-by: Yann <biyan900116@gmail.com>	2021-11-22 02:16:45 -08:00
wenningd	3dc6262437	[HUDI-2242] Add configuration inference logic for few options (#3359) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2021-11-19 19:38:38 -08:00
wenningd	24def0b30d	[HUDI-2362] Add external config file support (#3416) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2021-11-18 01:59:26 -08:00
董可伦	4d884bdaa9	[MINOR] Fix typo, 'Hooide' corrected to 'Hoodie' (#4007)	2021-11-17 16:50:04 +08:00
Yann Byron	0bb6d8ff80	[HUDI-2706] refactor spark-sql to make consistent with DataFrame api (#3936)	2021-11-14 15:44:39 -08:00
Sivabalan Narayanan	93fd3517e3	[HUDI-2741] Fixing instantiating metadata table config in HoodieFileIndex (#3974)	2021-11-12 17:28:25 -05:00
Yann Byron	90529aa552	[HUDI-2495] Resolve inconsistent key generation for timestamp types by GenericRecord and Row (#3944)	2021-11-11 11:54:34 +08:00
Sagar Sumit	187bedf795	[HUDI-2442] Change default values for certain clustering configs (#3875)	2021-11-10 14:23:24 +05:30
Shawy Geng	cf2ecd77ba	[HUDI-2679] Fix the TestMergeIntoLogOnlyTable typo. (#3918)	2021-11-07 10:19:17 -08:00
manasaks	e0285800fb	[HUDI-1827] Add ORC support in Bootstrap Op (#3457) Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>	2021-11-06 12:23:20 -04:00
董可伦	844346c3ab	[HUDI-2471] Add support for ignoring case in merge into (#3700)	2021-11-05 07:50:16 -07:00
Prashant Wason	b7ee341e14	[HUDI-1794] Moved static COMMIT_FORMATTER to thread local variable as SimpleDateFormat is not thread safe. (#2819)	2021-11-05 09:31:42 -04:00
Yann Byron	6351e5f4d0	[HUDI-2538] persist some configs to hoodie.properties on the first write (#3823)	2021-11-03 10:04:23 +08:00
Yann Byron	1f17467f73	[HUDI-1869] Upgrading Spark3 To 3.1 (#3844) Co-authored-by: pengzhiwei <pengzhiwei2015@icloud.com>	2021-11-02 18:25:12 -07:00
Carl-Zhou-CN	dee3a14aae	[HUDI-2582] Support concurrent key gen for different tables with row writer path (#3817) Co-authored-by: yao.zhou <yao.zhou@linkflowtech.com>	2021-11-02 18:05:09 -04:00
xiarixiaoyao	d194643b49	[HUDI-2101][RFC-28] support z-order for hudi (#3330) * [HUDI-2101] support z-order for hudi * Renaming some configs for consistency/simplicity. * Minor code cleanups Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-11-02 09:31:57 -07:00
Y Ethan Guo	0223c442ec	[HUDI-2502] Refactor index in hudi-client module (#3778) - Refactor Index to reduce Line of Code and re-use across engines.	2021-10-28 04:16:00 -04:00
vinoyang	b1c4acf0ae	[HUDI-2614] Remove duplicated hadoop-hdfs with tests classifier exists in bundles (#3864)	2021-10-26 22:36:10 +08:00
Yann Byron	1e2be85a0f	[HUDI-2482] support 'drop partition' sql (#3754)	2021-10-19 22:09:53 +08:00
Danny Chan	abf3e3fe71	[HUDI-2548] Flink streaming reader misses the rolling over file handles (#3787)	2021-10-14 10:36:18 +08:00
Sivabalan Narayanan	8a487eafa7	[HUDI-2494] Fixing glob pattern to skip all hoodie meta paths (#3768)	2021-10-12 14:06:40 -04:00
董可伦	10e3a9a3fb	[MINOR] Fix typo, 'properites' corrected to 'properties' (#3738)	2021-10-06 20:37:01 -04:00
Yann Byron	e91e611afb	[HUDI-2456] support 'show partitions' sql (#3693)	2021-10-06 15:46:49 +08:00
Sivabalan Narayanan	5f32162a2f	[HUDI-2285][HUDI-2476] Metadata table synchronous design. Rebased and Squashed from pull/3426 (#3590) * [HUDI-2285] Adding synchronous updates to metadata before completion of commits in data timeline. - This patch adds synchronous updates to the metadata table. In other words, every write is first committed to the metadata table, followed by the data table. While reading the metadata table, we ignore any delta commits that are present only in the metadata table and not in the data table timeline. - Compaction of the metadata table is fenced by the condition that we trigger compaction only when there are no inflight requests in the data table. This ensures that all base files in the metadata table are always in sync with the data table (w/o any holes) and that there could only be some extra invalid commits among delta log files in the metadata table. - Due to this, archival of the data table also fences itself up until the compacted instant in the metadata table. All writes to the metadata table happen within the data table lock. So, the metadata table works in one-writer mode only. This might be tough to loosen since all writers write to the same FILES partition and so will result in a conflict anyway. - As part of this, have added acquiring locks in the data table for those operations which did not take them before while committing (rollback, clean, compaction, cluster). To note, we were not doing any conflict resolution. All we are doing here is to commit by taking a lock, so that all writes to the metadata table always come from a single writer. - Also added a building block to add buckets for partitions, which will be leveraged by other indexes like record level index, etc. For now, the FILES partition has only one bucket. In general, any number of buckets per partition is allowed, and each partition has a fixed fileId prefix with an incremental suffix for each bucket within each partition. Have fixed [HUDI-2476]. This fix is about retrying a failed compaction if it succeeded in metadata the first time, but failed w/ the data table. - Enabling metadata table by default. - Adding more tests for metadata table Co-authored-by: Prashant Wason <pwason@uber.com>	2021-10-06 00:17:52 -04:00
董可伦	2f07e1267f	[MINOR] Fix typo Hooodie corrected to Hoodie & reuqired corrected to required (#3730)	2021-09-30 09:55:32 +08:00
Sagar Sumit	bc4966ea73	[HUDI-2484] Fix hive sync mode setting in Deltastreamer (#3712)	2021-09-24 13:05:42 -04:00
Danny Chan	5515a0d319	[HUDI-2479] HoodieFileIndex throws NPE for FileSlice with pure log files (#3702)	2021-09-23 15:14:30 +08:00
董可伦	5a94043f38	[HUDI-2343] Fix the exception for mergeInto when the primaryKey and preCombineField of source table and target table differ in case only (#3517)	2021-09-21 22:11:52 +08:00
liujinhui	76554aa31a	[MINOR] Add document for DataSourceReadOptions (#3653)	2021-09-15 14:33:43 +08:00
liujinhui	9f3c4a2a7f	[HUDI-2410] Fix getDefaultBootstrapIndexClass logical error (#3633)	2021-09-13 16:10:17 +08:00
vinoth chandar	ea59a7ff5f	[HUDI-2080] Move to ubuntu-18.04 for Azure CI (#3409) Update Azure CI ubuntu from 16.04 to 18.04 because 16.04 will be removed soon. Fixed some consistently failing tests * fix TestCOWDataSourceStorage TestMORDataSourceStorage * reset mocks Also update readme badge Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2021-09-07 09:44:30 -07:00
wenningd	69cbcc9516	Merge pull request #3541 from rahil-c/rahil-c/HUDI-2359 [HUDI-2359] Add basic "hoodie_is_deleted" unit tests to TestDataSource classes	2021-08-27 16:28:51 -07:00
Satish M	55a80a817d	[HUDI-2264] Refactor HoodieSparkSqlWriterSuite to add setup and teardown (#3544)	2021-08-26 10:01:48 -04:00
pengzhiwei	cc5256a7d8	[HUDI-2357] MERGE INTO doesn't work for tables created using CTAS (#3534)	2021-08-26 16:54:41 +08:00
Rahil Chertara	694300477f	[HUDI-2359] Add basic "hoodie_is_deleted" unit tests to TestDataSource classes	2021-08-25 16:35:35 -07:00
zhangyue19921010	de94787a85	[HUDI-2345] Hoodie columns sort partitioner for bulk insert (#3523) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2021-08-24 21:45:17 +08:00
董可伦	be8c1e499f	Support referencing subquery with column aliases by table alias in merge into (#3380)	2021-08-21 21:53:16 +08:00
Udit Mehrotra	e39d0a2f28	Keep non-conflicting names for common configs between DataSourceOptions and HoodieWriteConfig (#3511)	2021-08-20 02:42:59 -07:00
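
The [HUDI-2559] entry in this log states that older commit timestamps in seconds granularity are suffixed with 999 and parsed with the millisecond format. The rule can be illustrated with a minimal sketch. It assumes instants are digit strings laid out as `yyyyMMddHHmmss` (14 digits) or `yyyyMMddHHmmssSSS` (17 digits); the function names `to_millis_instant` and `parse_instant` are hypothetical and this is not Hudi's own code.

```python
from datetime import datetime, timedelta

SECS_LEN = 14    # assumed layout: yyyyMMddHHmmss
MILLIS_LEN = 17  # assumed layout: yyyyMMddHHmmssSSS


def to_millis_instant(instant: str) -> str:
    """Suffix a second-granularity instant with "999"; keep 17-digit instants as-is."""
    if not instant.isdigit():
        raise ValueError(f"instant must contain only digits: {instant!r}")
    if len(instant) == SECS_LEN:
        return instant + "999"
    if len(instant) == MILLIS_LEN:
        return instant
    raise ValueError(f"unexpected instant length: {instant!r}")


def parse_instant(instant: str) -> datetime:
    """Parse either layout by normalizing to the millisecond layout first."""
    normalized = to_millis_instant(instant)
    # Parse the seconds part, then add the three millisecond digits explicitly.
    base = datetime.strptime(normalized[:SECS_LEN], "%Y%m%d%H%M%S")
    return base + timedelta(milliseconds=int(normalized[SECS_LEN:]))
```

One consequence of padding with 999 rather than 000 is that an old second-granularity instant compares as the last millisecond of its second, so it sorts after any millisecond-granularity instant that falls within that same second.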

190 Commits