lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Y Ethan Guo	b8601a9f58	[HUDI-2656] Generalize HoodieIndex for flexible record data type (#3893 ) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-02-03 20:24:04 -08:00
Alexey Kudinkin	819e8018ff	[HUDI-3322][HUDI-3343] Fixing Metadata Table Records Duplication Issues (#4716 ) This change is addressing issues in regards to Metadata Table observing ingesting duplicated records leading to it persisting incorrect file-sizes for the files referred to in those records. There are multiple issues that were leading to that: - [HUDI-3322] Incorrect Rollback Plan generation: Rollback Plan generated for MOR tables was overly expansively listing all log-files with the latest base-instant as the ones that have been affected by the rollback, leading to invalid MT records being ingested referring to those. - [HUDI-3343] Metadata Table including Uncommitted Log Files during Bootstrap: Since MT is bootstrapped at the end of the commit operation execution (after FS activity, but before committing to the timeline), it was actually incorrectly ingesting some files that were part of the intermediate state of the operation being committed. This change will unblock Stack of PRs based off #4556	2022-02-02 16:10:51 -05:00
Alexey Kudinkin	a68e1dc2db	[HUDI-431] Adding support for Parquet in MOR `LogBlock`s (#4333 ) - Adding support for Parquet in MOR tables Log blocks Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>	2022-02-02 14:35:05 -05:00
Raymond Xu	0bd38f26ca	[HUDI-2596] Make class names consistent in hudi-client (#4680 )	2022-01-27 17:05:08 -08:00
Manoj Govindassamy	f87c47352a	[HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field (#4449 ) * [HUDI-2763] Metadata table records - support for key deduplication and virtual keys - The backing log format for the metadata table is HFile, a KeyValue type. Since the key field in the metadata record payload is a duplicate of the Key in the Cell, the redundant key field in the record can be emptied to save on the cost. - HoodieHFileWriter and HoodieHFileDataBlock will now serialize records with the key field emptied by default. HFile writer tries to find if the record has metadata payload schema field 'key' and if so it does the key trimming from the record payload. - When reading the serialized records back from disk, HoodieHFileReader materializes the missing keyFields, if any. HFile reader tries to find if the record has metadata payload schema field 'key' and if so it does the key materialization in the record payload. - Tests have been added to verify the default virtual keys and key deduplication support for the metadata table records. Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2022-01-26 13:34:04 -05:00
Y Ethan Guo	397795c7d0	[HUDI-3007] Fix issues in HoodieRepairTool (#4564 )	2022-01-12 09:03:27 -08:00
YueZhang	9fe28e56b4	[HUDI-3045] New clustering regex match config to choose partitions when building clustering plan (#4346 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-01-11 23:23:55 -08:00
Manoj Govindassamy	251d4eb3b6	[HUDI-3030] InProcessLockProvider as default when any async services enabled with no lock provider override (#4406 ) * [HUDI-3030] InProcessLockProvider as default when any async services enabled with no lock provider override - Making InProcessLockProvider as the default lock provider when any async services are enabled and when no lock provider is explicitly set. - This is the workaround for metadata table updates racing with async table service operations * [HUDI-3030] InProcessLockProvider as default when any async services enabled with no lock provider override - Renaming isAnyTableServicesInline/Async() to areAnyTableServicesInline/Async() * [HUDI-3030] InProcessLockProvider as default when any async services enabled with no lock provider override - Additionally checking for write config properties when verifying the lock provider override. Updated the unit test for this case.	2022-01-10 08:40:24 +05:30
Yuwei XIAO	2444f40a4b	[HUDI-3095] abstract partition filter logic to enable code reuse (#4454 ) * [HUDI-3095] abstract partition filter logic to enable code reuse * [HUDI-3095] address reviews	2021-12-31 11:07:52 +05:30
Manoj Govindassamy	4a48f99a59	[HUDI-3064][HUDI-3054] FileSystemBasedLockProviderTestClass tryLock fix and TestHoodieClientMultiWriter test fixes (#4384 ) - Made FileSystemBasedLockProviderTestClass thread safe and fixed the tryLock retry logic. - Made TestHoodieClientMultiWriter.testHoodieClientBasicMultiWriter deterministic in verifying the HoodieWriteConflictException.	2021-12-19 13:31:02 -05:00
Sivabalan Narayanan	dc40397fa9	[HUDI-3064] Fixing a bug in TransactionManager and FileSystemTestLock (#4372 )	2021-12-18 11:52:11 -05:00
Manoj Govindassamy	733732be7b	[HUDI-3029] Transaction manager: avoid deadlock when doing begin and end transactions (#4373 )	2021-12-18 11:43:10 -05:00
Manoj Govindassamy	d1d48ed494	[HUDI-3029] Transaction manager: avoid deadlock when doing begin and end transactions (#4363 ) * [HUDI-3029] Transaction manager: avoid deadlock when doing begin and end transactions - Transaction manager has begin and end transactions as synchronized methods. Depending on the lock provider implementation, this can lead to a deadlock when the underlying lock() calls are blocking or have a long timeout. - Fixing transaction manager begin and end transactions to not get to deadlock and to not assume anything on the lock provider implementation.	2021-12-18 09:43:17 -05:00
Manoj Govindassamy	7784249e55	[HUDI-2962] InProcess lock provider to guard single writer process with async table operations (#4259 ) - Adding Local JVM process based lock provider implementation - This local lock provider can be used by a single writer process with async table operations to guard the metadata table against concurrent updates.	2021-12-17 20:18:46 -05:00
Alexey Kudinkin	2d864f7524	[HUDI-2814] Make Z-index more generic Column-Stats Index (#4106 )	2021-12-10 14:56:09 -08:00
Sivabalan Narayanan	e483f7c776	[HUDI-2902] Fixing populate meta fields with Hfile writers and Disabling virtual keys by default for metadata table (#4194 )	2021-12-03 07:20:21 -05:00
xiarixiaoyao	780a2ac5b2	[HUDI-2102] Support hilbert curve for hudi (#3952 ) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2021-11-26 23:20:19 -08:00
Y Ethan Guo	d1e83e4ba0	[HUDI-2767] Enabling timeline-server-based marker as default (#4112 ) - Changes the default config of marker type (HoodieWriteConfig.MARKERS_TYPE or hoodie.write.markers.type) from DIRECT to TIMELINE_SERVER_BASED for Spark Engine. - Adds engine-specific marker type configs: Spark -> TIMELINE_SERVER_BASED, Flink -> DIRECT, Java -> DIRECT. - Uses DIRECT markers as well for Spark structured streaming due to timeline server only available for the first mini-batch. - Fixes the marker creation method for non-partitioned table in TimelineServerBasedWriteMarkers. - Adds the fallback to direct markers even when TIMELINE_SERVER_BASED is configured, in WriteMarkersFactory: when HDFS is used, or embedded timeline server is disabled, the fallback to direct markers happens. - Fixes the closing of timeline service. - Fixes tests that depend on markers, mainly by starting the timeline service for each test.	2021-11-26 16:41:05 -05:00
Alexey Kudinkin	6f5d8d04cd	[HUDI-2840] Fixed DeltaStreamer to properly respect configuration passed through properties file (#4090 ) * Rebased `DFSPropertiesConfiguration` to access Hadoop config in lieu of FS to avoid confusion * Fixed `readConfig` to take Hadoop's `Configuration` instead of FS; Fixing usages * Added test for local FS access * Rebase to use `FSUtils.getFs` * Combine properties provided as a file along w/ overrides provided from the CLI * Added helper utilities to `HoodieClusteringConfig`; Make sure corresponding config methods fallback to defaults; * Fixed DeltaStreamer usage to respect properly combined configuration; Abstracted `HoodieClusteringConfig.from` convenience utility to init Clustering config from `Properties` * Tidying up * `lint` * Reverting changes to `HoodieWriteConfig` * Tidying up * Fixed incorrect merge of the props * Converted `HoodieConfig` to wrap around `Properties` into `TypedProperties` * Fixed compilation * Fixed compilation	2021-11-25 14:48:22 -08:00
Udit Mehrotra	e0125a7911	[HUDI-2801] Add Amazon CloudWatch metrics reporter (#4081 )	2021-11-25 13:33:16 -08:00
Sivabalan Narayanan	435ea1543c	[HUDI-2793] Fixing deltastreamer checkpoint fetch/copy over (#4034 ) - Removed the copy over logic in transaction utils. Deltastreamer will go back to previous commits and get the checkpoint value.	2021-11-24 18:26:40 -05:00
Manoj Govindassamy	973f78f5ca	[HUDI-2443] Hudi KVComparator for all HFile writer usages (#3889 ) * [HUDI-2443] Hudi KVComparator for all HFile writer usages - Hudi relies on custom class shading for Hbase's KeyValue.KVComparator to avoid versioning and class loading issues. There are a few places which are still using the Hbase's comparator class directly and version upgrades would make them obsolete. Refactoring the HoodieKVComparator and making all HFile writer creation using the same shaded class. * [HUDI-2443] Hudi KVComparator for all HFile writer usages - Moving HoodieKVComparator from common.bootstrap.index to common.util * [HUDI-2443] Hudi KVComparator for all HFile writer usages - Retaining the old HoodieKVComparatorV2 for bootstrap case. Adding the new comparator as HoodieKVComparatorV2 to differentiate from the old one. * [HUDI-2443] Hudi KVComparator for all HFile writer usages - Renamed HoodieKVComparatorV2 to HoodieMetadataKVComparator and moved it under the package org.apache.hudi.metadata. * Make comparator classname configurable * Revert new config and address other review comments Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2021-11-24 10:05:36 -08:00
Raymond Xu	5078d29eb4	[HUDI-2818] Fix 2to3 upgrade when set `hoodie.table.keygenerator.class` (#4077 )	2021-11-23 19:43:34 -08:00
Y Ethan Guo	ca9bfa2a40	[HUDI-2332] Add clustering and compaction in Kafka Connect Sink (#3857 ) * [HUDI-2332] Add clustering and compaction in Kafka Connect Sink * Disable validation check on instant time for compaction and adjust configs * Add javadocs * Add clustering and compaction config * Fix transaction causing missing records in the target table * Add debugging logs * Fix kafka offset sync in participant * Adjust how clustering and compaction are configured in kafka-connect * Fix clustering strategy * Remove irrelevant changes from other published PRs * Update clustering logic and others * Update README * Fix test failures * Fix indentation * Fix clustering config * Add JavaCustomColumnsSortPartitioner and make async compaction enabled by default * Add test for JavaCustomColumnsSortPartitioner * Add more changes after IDE sync * Update README with clarification * Fix clustering logic after rebasing * Remove unrelated changes	2021-11-23 14:23:28 +05:30
Manoj Govindassamy	7f3b89fad7	[HUDI-2472] Enabling metadata table for TestHoodieIndex test case (#4045 ) - Enabling the metadata table for testSimpleGlobalIndexTagLocationWhenShouldUpdatePartitionPath. This is more of a test issue.	2021-11-22 07:21:24 -05:00
Manoj Govindassamy	c8617d9390	[HUDI-2472] Enabling metadata table for TestHoodieMergeOnReadTable and TestHoodieCompactor (#4023 )	2021-11-19 20:02:21 -05:00
davehagman	dfe3b84715	[HUDI-2579] Make deltastreamer checkpoint state merging more explicit (#3820 ) Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>	2021-11-09 17:37:59 -05:00
Sivabalan Narayanan	6d109c6de5	[HUDI-2595] Fixing metadata table updates such that only regular writes from data table can trigger table services in metadata table (#3900 )	2021-11-08 22:12:32 -05:00
xiarixiaoyao	d194643b49	[HUDI-2101][RFC-28] support z-order for hudi (#3330 ) * [HUDI-2101]support z-order for hudi * Renaming some configs for consistency/simplicity. * Minor code cleanups Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-11-02 09:31:57 -07:00
Sivabalan Narayanan	69ee790a47	[HUDI-1294] Adding inline read and seek based read(batch get) for hfile log blocks in metadata table (#3762 )	2021-10-29 12:12:44 -04:00
Manoj Govindassamy	b68c5a68f9	[HUDI-2472] Fix few Cleaner tests with metadata table enabled (#3825 )	2021-10-20 18:57:00 -04:00
Sivabalan Narayanan	5f32162a2f	[HUDI-2285][HUDI-2476] Metadata table synchronous design. Rebased and Squashed from pull/3426 (#3590 ) * [HUDI-2285] Adding Synchronous updates to metadata before completion of commits in data timeline. - This patch adds synchronous updates to metadata table. In other words, every write is first committed to metadata table followed by data table. While reading metadata table, we ignore any delta commits that are present only in metadata table and not in data table timeline. - Compaction of metadata table is fenced by the condition that we trigger compaction only when there are no inflight requests in datatable. This ensures that all base files in metadata table are always in sync with data table (w/o any holes) and only there could be some extra invalid commits among delta log files in metadata table. - Due to this, archival of data table also fences itself up until compacted instant in metadata table. All writes to metadata table happen within the datatable lock. So, metadata table works in one writer mode only. This might be tough to loosen since all writers write to same FILES partition and so, will result in a conflict anyways. - As part of this, have added acquiring locks in data table for those operations which were not before while committing (rollback, clean, compaction, cluster). To note, we were not doing any conflict resolution. All we are doing here is to commit by taking a lock. So that all writes to metadata table is always a single writer. - Also added building block to add buckets for partitions, which will be leveraged by other indexes like record level index, etc. For now, FILES partition has only one bucket. In general, any number of buckets per partition is allowed and each partition has a fixed fileId prefix with incremental suffix for each bucket within each partition. Have fixed [HUDI-2476]. This fix is about retrying a failed compaction if it succeeded in metadata for first time, but failed w/ data table. - Enabling metadata table by default. - Adding more tests for metadata table Co-authored-by: Prashant Wason <pwason@uber.com>	2021-10-06 00:17:52 -04:00
liujinhui	61d0096088	[HUDI-2434] Make periodSeconds of GraphiteReporter configurable (#3667 )	2021-09-17 19:39:55 +08:00
liujinhui	2791fb9a96	[HUDI-2423] Separate some config logic from HoodieMetricsConfig into HoodieMetricsGraphiteConfig HoodieMetricsJmxConfig (#3652 )	2021-09-16 15:08:10 +08:00
vinoth chandar	ea59a7ff5f	[HUDI-2080] Move to ubuntu-18.04 for Azure CI (#3409 ) Update Azure CI ubuntu from 16.04 to 18.04 due to 16.04 will be removed soon Fixed some consistently failed tests * fix TestCOWDataSourceStorage TestMORDataSourceStorage * reset mocks Also update readme badge Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2021-09-07 09:44:30 -07:00
Raymond Xu	6bd3ca98d6	[HUDI-1989] Fix flakiness in TestHoodieMergeOnReadTable (#3574 ) * [HUDI-1989] Refactor clustering tests for MoR table * refactor assertion helper * add CheckedFunction * SparkClientFunctionalTestHarness.java * put back original test case * move testcases out from TestHoodieMergeOnReadTable.java * add TestHoodieSparkMergeOnReadTableRollback.java * use SparkClientFunctionalTestHarness * add tag	2021-09-03 13:17:17 -07:00
Udit Mehrotra	e39d0a2f28	Keep non-conflicting names for common configs between DataSourceOptions and HoodieWriteConfig (#3511 )	2021-08-20 02:42:59 -07:00
Udit Mehrotra	c350d05dd3	Restore 0.8.0 config keys with deprecated annotation (#3506 ) Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-08-19 13:36:40 -07:00
ayachi_nene	99663d370b	[HUDI-2301] fix FileSliceMetrics utils bug (#3487 )	2021-08-17 11:09:53 -07:00
Sagar Sumit	0544d70d8f	[MINOR] Deprecate older configs (#3464 ) Rename and deprecate props in HoodieWriteConfig Rename and deprecate older props	2021-08-12 20:31:04 -07:00
wenningd	91bb0d1318	[HUDI-2255] Refactor Datasource options (#3373 ) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2021-08-03 17:50:30 -07:00
Sivabalan Narayanan	61148c1c43	[HUDI-2176, 2178, 2179] Adding virtual key support to COW table (#3306 )	2021-07-26 17:21:04 -04:00
Xuedong Luan	71e14cf866	[HUDI-2213] Remove unnecessary parameter for HoodieMetrics constructor and fix NPE in UT (#3333 )	2021-07-23 19:57:35 +08:00
Xuedong Luan	6d592c5896	[HUDI-2211] Fix NullPointerException in TestHoodieConsoleMetrics (#3331 )	2021-07-23 11:22:54 +08:00
Jintao Guan	2debb9b3ed	[HUDI-1828] Update unit tests to support ORC as the base file format (#3237 )	2021-07-15 00:05:42 +08:00
Sagar Sumit	b0089b894a	[MINOR] Fix EXTERNAL_RECORD_AND_SCHEMA_TRANSFORMATION config (#3250 )	2021-07-13 00:24:40 -04:00
wenningd	d412fb2fe6	[HUDI-89] Add configOption & refactor all configs based on that (#2833 ) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2021-06-30 14:26:30 -07:00
Jintao Guan	b8fe5b91d5	[HUDI-764] [HUDI-765] ORC reader writer Implementation (#2999 ) Co-authored-by: Qingyun (Teresa) Kang <kteresa@uber.com>	2021-06-15 15:21:43 -07:00
wangxianghu	7261f08507	[HUDI-1929] Support configure KeyGenerator by type (#2993 )	2021-06-08 09:26:10 -04:00
Vinay Patil	cf90f17732	[HUDI-1281] Add deltacommit to ActionType (#3018 ) Co-authored-by: veenaypatil <vinay18.patil@gmail.com>	2021-06-04 22:30:48 -07:00

66 Commits