lanyuanxiaoyao/hudi - hudi - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Alexey Kudinkin	9e87d164b3	[HUDI-3760] Adding capability to fetch Metadata Records by prefix (#5208) - Adding capability to fetch Metadata Records by key prefix so that Data Skipping could fetch only the Column Stats Index records pertaining to the columns being queried, instead of reading out the whole Index. - Fixed usages of HFileScanner in HFileReader: a few code paths use the cached scanner if available; other code paths use their own HFileScanner w/ positional read. Brief change log - Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading through Spark DS - Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader - Wiring key-prefix lookup through LogRecordScanner impls - Cleaning up HoodieHFileReader impl Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-06 12:11:08 -04:00
Sivabalan Narayanan	84064a9b08	[HUDI-3772] Fixing auto adjustment of lock configs for deltastreamer (#5207)	2022-04-02 23:44:10 -07:00
ForwardXu	80011df995	[HUDI-3135] Make delete partitions lazy to be executed by the cleaner (#4489) As of now, delete partitions will ensure all file groups are deleted, but the partition as such is not deleted. So, get all partitions might return the deleted partitions as well, but no data will be served since all file groups are deleted. With this patch, we are fixing it. We are letting the cleaner take care of deleting the partitions when all file groups pertaining to a partition are deleted. - Fixed the CleanPlanActionExecutor to return meta info about list of partitions to be deleted. If there are no valid file groups for a partition, clean planner will include the partition to be deleted. - Fixed HoodieCleanPlan avro schema to include the list of partitions to be deleted - CleanActionExecutor is fixed to delete partitions if any (as per clean plan) - Same info is added to HoodieCleanMetadata - Metadata table when applying clean metadata, will check for partitions to be deleted and will update the "all_partitions" record for the deleted partitions. Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-03-31 03:35:39 -04:00
xiarixiaoyao	2d73c8ae86	[HUDI-3355] Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy (#4962)	2022-03-30 15:54:25 -07:00
Y Ethan Guo	eaa4c4f2e2	[HUDI-1180] Upgrade HBase to 2.4.9 (#5004) Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-03-24 19:04:53 -07:00
Y Ethan Guo	9b6e138af2	[HUDI-3640] Set SimpleKeyGenerator as default in 2to3 table upgrade for Spark engine (#5075)	2022-03-21 20:35:06 -04:00
Y Ethan Guo	95e6e53810	[HUDI-3404] Automatically adjust write configs based on metadata table and write concurrency mode (#4975)	2022-03-17 01:25:04 -04:00
Sagar Sumit	575bc63468	[HUDI-3356][HUDI-3203] HoodieData for metadata index records; BloomFilter construction from index based on the type param (#4848) Rework of #4761 This diff introduces the following changes: - Write stats are converted to metadata index records during the commit. Making them use the HoodieData type so that the record generation scales up with needs. - Metadata index init support for bloom filter and column stats partitions. - When building the BloomFilter from the index records, using the type param stored in the payload instead of hardcoded type. - Delta writes can change column ranges and the column stats index needs to be properly updated with new ranges to be consistent with the table dataset. This fix adds column stats index update support for the delta writes. Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com>	2022-03-08 10:39:04 -05:00
shibei	62f534d002	[HUDI-3445] Support Clustering Command Based on Call Procedure Command for Spark SQL (#4901) * [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL Co-authored-by: shibei <huberylee.li@alibaba-inc.com>	2022-03-04 09:33:16 +08:00
YueZhang	3363c66468	[HUDI-3394] Check isWriteLockedByCurrentThread before unlock for InProcessLockProvider (#4819) Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2022-02-15 22:41:25 -08:00
Raymond Xu	538ec44fa8	[HUDI-2931] Add config to disable table services (#4777)	2022-02-15 09:49:53 -05:00
Raymond Xu	27bd7b538e	[HUDI-1576] Make archiving an async service (#4795)	2022-02-14 21:15:06 -05:00
Sivabalan Narayanan	0ababcfaa7	[HUDI-1847] Adding inline scheduling support for spark datasource path for compaction and clustering (#4420) - This adds support in spark-datasource to just schedule table services inline so that users can leverage async execution w/o the need for lock service providers.	2022-02-10 08:04:55 -05:00
Sivabalan Narayanan	6a32cfe020	[HUDI-3091] Making SIMPLE index as the default index type (#4659) * [HUDI-3091] Making SIMPLE index as the default index type * Fixing tests * Triaging timeouts * disable SIMPLE index for bootstrap tests * removing test run start and end log statements * Fixing simple index parallelism for some tests * Disabling failing test for now * reverting previous disable * Reverting all changes * fixing azure pipeline script	2022-02-08 15:02:18 +05:30
Y Ethan Guo	b8601a9f58	[HUDI-2656] Generalize HoodieIndex for flexible record data type (#3893) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-02-03 20:24:04 -08:00
Alexey Kudinkin	819e8018ff	[HUDI-3322][HUDI-3343] Fixing Metadata Table Records Duplication Issues (#4716) This change is addressing issues in regards to Metadata Table observing ingesting duplicated records leading to it persisting incorrect file-sizes for the files referred to in those records. There are multiple issues that were leading to that: - [HUDI-3322] Incorrect Rollback Plan generation: Rollback Plan generated for MOR tables was overly expansively listing all log-files with the latest base-instant as the ones that have been affected by the rollback, leading to invalid MT records being ingested referring to those. - [HUDI-3343] Metadata Table including Uncommitted Log Files during Bootstrap: Since MT is bootstrapped at the end of the commit operation execution (after FS activity, but before committing to the timeline), it was actually incorrectly ingesting some files that were part of the intermediate state of the operation being committed. This change will unblock Stack of PRs based off #4556	2022-02-02 16:10:51 -05:00
Alexey Kudinkin	a68e1dc2db	[HUDI-431] Adding support for Parquet in MOR `LogBlock`s (#4333) - Adding support for Parquet in MOR tables Log blocks Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>	2022-02-02 14:35:05 -05:00
Raymond Xu	0bd38f26ca	[HUDI-2596] Make class names consistent in hudi-client (#4680)	2022-01-27 17:05:08 -08:00
Manoj Govindassamy	f87c47352a	[HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field (#4449) * [HUDI-2763] Metadata table records - support for key deduplication and virtual keys - The backing log format for the metadata table is HFile, a KeyValue type. Since the key field in the metadata record payload is a duplicate of the Key in the Cell, the redundant key field in the record can be emptied to save on the cost. - HoodieHFileWriter and HoodieHFileDataBlock will now serialize records with the key field emptied by default. HFile writer tries to find if the record has metadata payload schema field 'key' and if so it does the key trimming from the record payload. - HoodieHFileReader, when reading the serialized records back from disk, materializes the missing keyFields if any. HFile reader tries to find if the record has metadata payload schema field 'key' and if so it does the key materialization in the record payload. - Tests have been added to verify the default virtual keys and key deduplication support for the metadata table records. Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2022-01-26 13:34:04 -05:00
Y Ethan Guo	397795c7d0	[HUDI-3007] Fix issues in HoodieRepairTool (#4564)	2022-01-12 09:03:27 -08:00
YueZhang	9fe28e56b4	[HUDI-3045] New clustering regex match config to choose partitions when building clustering plan (#4346) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-01-11 23:23:55 -08:00
Manoj Govindassamy	251d4eb3b6	[HUDI-3030] InProcessLockProvider as default when any async services enabled with no lock provider override (#4406) * [HUDI-3030] InProcessLockProvider as default when any async services enabled with no lock provider override - Making InProcessLockProvider as the default lock provider when any async services are enabled and when no lock provider is explicitly set. - This is the workaround for metadata table updates racing with async table service operations * [HUDI-3030] InProcessLockProvider as default when any async services enabled with no lock provider override - Renaming isAnyTableServicesInline/Async() to areAnyTableServicesInline/Async() * [HUDI-3030] InProcessLockProvider as default when any async services enabled with no lock provider override - Additionally checking for write config properties when verifying the lock provider override. Updated the unit test for this case.	2022-01-10 08:40:24 +05:30
Yuwei XIAO	2444f40a4b	[HUDI-3095] abstract partition filter logic to enable code reuse (#4454) * [HUDI-3095] abstract partition filter logic to enable code reuse * [HUDI-3095] address reviews	2021-12-31 11:07:52 +05:30
Manoj Govindassamy	4a48f99a59	[HUDI-3064][HUDI-3054] FileSystemBasedLockProviderTestClass tryLock fix and TestHoodieClientMultiWriter test fixes (#4384) - Made FileSystemBasedLockProviderTestClass thread safe and fixed the tryLock retry logic. - Made TestHoodieClientMultiWriter.testHoodieClientBasicMultiWriter deterministic in verifying the HoodieWriteConflictException.	2021-12-19 13:31:02 -05:00
Sivabalan Narayanan	dc40397fa9	[HUDI-3064] Fixing a bug in TransactionManager and FileSystemTestLock (#4372)	2021-12-18 11:52:11 -05:00
Manoj Govindassamy	733732be7b	[HUDI-3029] Transaction manager: avoid deadlock when doing begin and end transactions (#4373)	2021-12-18 11:43:10 -05:00
Manoj Govindassamy	d1d48ed494	[HUDI-3029] Transaction manager: avoid deadlock when doing begin and end transactions (#4363) * [HUDI-3029] Transaction manager: avoid deadlock when doing begin and end transactions - Transaction manager has begin and end transactions as synchronized methods. Depending on the lock provider implementation, this can lead to a deadlock when the underlying lock() calls are blocking or have a long timeout. - Fixing transaction manager begin and end transactions to not get into a deadlock and to not assume anything about the lock provider implementation.	2021-12-18 09:43:17 -05:00
Manoj Govindassamy	7784249e55	[HUDI-2962] InProcess lock provider to guard single writer process with async table operations (#4259) - Adding Local JVM process based lock provider implementation - This local lock provider can be used by a single writer process with async table operations to guard the metadata table against concurrent updates.	2021-12-17 20:18:46 -05:00
Alexey Kudinkin	2d864f7524	[HUDI-2814] Make Z-index more generic Column-Stats Index (#4106)	2021-12-10 14:56:09 -08:00
Sivabalan Narayanan	e483f7c776	[HUDI-2902] Fixing populate meta fields with Hfile writers and Disabling virtual keys by default for metadata table (#4194)	2021-12-03 07:20:21 -05:00
xiarixiaoyao	780a2ac5b2	[HUDI-2102] Support hilbert curve for hudi (#3952) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2021-11-26 23:20:19 -08:00
Y Ethan Guo	d1e83e4ba0	[HUDI-2767] Enabling timeline-server-based marker as default (#4112) - Changes the default config of marker type (HoodieWriteConfig.MARKERS_TYPE or hoodie.write.markers.type) from DIRECT to TIMELINE_SERVER_BASED for Spark Engine. - Adds engine-specific marker type configs: Spark -> TIMELINE_SERVER_BASED, Flink -> DIRECT, Java -> DIRECT. - Uses DIRECT markers as well for Spark structured streaming due to timeline server only available for the first mini-batch. - Fixes the marker creation method for non-partitioned table in TimelineServerBasedWriteMarkers. - Adds the fallback to direct markers even when TIMELINE_SERVER_BASED is configured, in WriteMarkersFactory: when HDFS is used, or embedded timeline server is disabled, the fallback to direct markers happens. - Fixes the closing of timeline service. - Fixes tests that depend on markers, mainly by starting the timeline service for each test.	2021-11-26 16:41:05 -05:00
Alexey Kudinkin	6f5d8d04cd	[HUDI-2840] Fixed DeltaStreamer to properly respect configuration passed through properties file (#4090) * Rebased `DFSPropertiesConfiguration` to access Hadoop config in lieu of FS to avoid confusion * Fixed `readConfig` to take Hadoop's `Configuration` instead of FS; Fixing usages * Added test for local FS access * Rebase to use `FSUtils.getFs` * Combine properties provided as a file along w/ overrides provided from the CLI * Added helper utilities to `HoodieClusteringConfig`; Make sure corresponding config methods fallback to defaults; * Fixed DeltaStreamer usage to respect properly combined configuration; Abstracted `HoodieClusteringConfig.from` convenience utility to init Clustering config from `Properties` * Tidying up * `lint` * Reverting changes to `HoodieWriteConfig` * Tidying up * Fixed incorrect merge of the props * Converted `HoodieConfig` to wrap around `Properties` into `TypedProperties` * Fixed compilation * Fixed compilation	2021-11-25 14:48:22 -08:00
Udit Mehrotra	e0125a7911	[HUDI-2801] Add Amazon CloudWatch metrics reporter (#4081)	2021-11-25 13:33:16 -08:00
Sivabalan Narayanan	435ea1543c	[HUDI-2793] Fixing deltastreamer checkpoint fetch/copy over (#4034) - Removed the copy over logic in transaction utils. Deltastreamer will go back to previous commits and get the checkpoint value.	2021-11-24 18:26:40 -05:00
Manoj Govindassamy	973f78f5ca	[HUDI-2443] Hudi KVComparator for all HFile writer usages (#3889) * [HUDI-2443] Hudi KVComparator for all HFile writer usages - Hudi relies on custom class shading for Hbase's KeyValue.KVComparator to avoid versioning and class loading issues. There are a few places which are still using the Hbase's comparator class directly and version upgrades would make them obsolete. Refactoring the HoodieKVComparator and making all HFile writer creation use the same shaded class. * [HUDI-2443] Hudi KVComparator for all HFile writer usages - Moving HoodieKVComparator from common.bootstrap.index to common.util * [HUDI-2443] Hudi KVComparator for all HFile writer usages - Retaining the old HoodieKVComparatorV2 for bootstrap case. Adding the new comparator as HoodieKVComparatorV2 to differentiate from the old one. * [HUDI-2443] Hudi KVComparator for all HFile writer usages - Renamed HoodieKVComparatorV2 to HoodieMetadataKVComparator and moved it under the package org.apache.hudi.metadata. * Make comparator classname configurable * Revert new config and address other review comments Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2021-11-24 10:05:36 -08:00
Raymond Xu	5078d29eb4	[HUDI-2818] Fix 2to3 upgrade when set `hoodie.table.keygenerator.class` (#4077)	2021-11-23 19:43:34 -08:00
Y Ethan Guo	ca9bfa2a40	[HUDI-2332] Add clustering and compaction in Kafka Connect Sink (#3857) * [HUDI-2332] Add clustering and compaction in Kafka Connect Sink * Disable validation check on instant time for compaction and adjust configs * Add javadocs * Add clustering and compaction config * Fix transaction causing missing records in the target table * Add debugging logs * Fix kafka offset sync in participant * Adjust how clustering and compaction are configured in kafka-connect * Fix clustering strategy * Remove irrelevant changes from other published PRs * Update clustering logic and others * Update README * Fix test failures * Fix indentation * Fix clustering config * Add JavaCustomColumnsSortPartitioner and make async compaction enabled by default * Add test for JavaCustomColumnsSortPartitioner * Add more changes after IDE sync * Update README with clarification * Fix clustering logic after rebasing * Remove unrelated changes	2021-11-23 14:23:28 +05:30
Manoj Govindassamy	7f3b89fad7	[HUDI-2472] Enabling metadata table for TestHoodieIndex test case (#4045) - Enabling the metadata table for testSimpleGlobalIndexTagLocationWhenShouldUpdatePartitionPath. This is more of a test issue.	2021-11-22 07:21:24 -05:00
Manoj Govindassamy	c8617d9390	[HUDI-2472] Enabling metadata table for TestHoodieMergeOnReadTable and TestHoodieCompactor (#4023)	2021-11-19 20:02:21 -05:00
davehagman	dfe3b84715	[HUDI-2579] Make deltastreamer checkpoint state merging more explicit (#3820) Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>	2021-11-09 17:37:59 -05:00
Sivabalan Narayanan	6d109c6de5	[HUDI-2595] Fixing metadata table updates such that only regular writes from data table can trigger table services in metadata table (#3900)	2021-11-08 22:12:32 -05:00
xiarixiaoyao	d194643b49	[HUDI-2101][RFC-28] support z-order for hudi (#3330) * [HUDI-2101]support z-order for hudi * Renaming some configs for consistency/simplicity. * Minor code cleanups Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-11-02 09:31:57 -07:00
Sivabalan Narayanan	69ee790a47	[HUDI-1294] Adding inline read and seek based read(batch get) for hfile log blocks in metadata table (#3762)	2021-10-29 12:12:44 -04:00
Manoj Govindassamy	b68c5a68f9	[HUDI-2472] Fix few Cleaner tests with metadata table enabled (#3825)	2021-10-20 18:57:00 -04:00
Sivabalan Narayanan	5f32162a2f	[HUDI-2285][HUDI-2476] Metadata table synchronous design. Rebased and Squashed from pull/3426 (#3590) * [HUDI-2285] Adding Synchronous updates to metadata before completion of commits in data timeline. - This patch adds synchronous updates to metadata table. In other words, every write is first committed to metadata table followed by data table. While reading metadata table, we ignore any delta commits that are present only in metadata table and not in data table timeline. - Compaction of metadata table is fenced by the condition that we trigger compaction only when there are no inflight requests in datatable. This ensures that all base files in metadata table are always in sync with data table (w/o any holes) and there could only be some extra invalid commits among delta log files in metadata table. - Due to this, archival of data table also fences itself up until compacted instant in metadata table. All writes to metadata table happen within the datatable lock. So, metadata table works in one writer mode only. This might be tough to loosen since all writers write to same FILES partition and so, will result in a conflict anyways. - As part of this, have added acquiring locks in data table for those operations which were not before while committing (rollback, clean, compaction, cluster). To note, we were not doing any conflict resolution. All we are doing here is to commit by taking a lock. So that all writes to metadata table are always from a single writer. - Also added building block to add buckets for partitions, which will be leveraged by other indexes like record level index, etc. For now, FILES partition has only one bucket. In general, any number of buckets per partition is allowed and each partition has a fixed fileId prefix with incremental suffix for each bucket within each partition. Have fixed [HUDI-2476]. This fix is about retrying a failed compaction if it succeeded in metadata for first time, but failed w/ data table. - Enabling metadata table by default. - Adding more tests for metadata table Co-authored-by: Prashant Wason <pwason@uber.com>	2021-10-06 00:17:52 -04:00
liujinhui	61d0096088	[HUDI-2434] Make periodSeconds of GraphiteReporter configurable (#3667)	2021-09-17 19:39:55 +08:00
liujinhui	2791fb9a96	[HUDI-2423] Separate some config logic from HoodieMetricsConfig into HoodieMetricsGraphiteConfig HoodieMetricsJmxConfig (#3652)	2021-09-16 15:08:10 +08:00
vinoth chandar	ea59a7ff5f	[HUDI-2080] Move to ubuntu-18.04 for Azure CI (#3409) Update Azure CI ubuntu from 16.04 to 18.04 because 16.04 will be removed soon. Fixed some consistently failing tests * fix TestCOWDataSourceStorage TestMORDataSourceStorage * reset mocks Also update readme badge Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2021-09-07 09:44:30 -07:00
Raymond Xu	6bd3ca98d6	[HUDI-1989] Fix flakiness in TestHoodieMergeOnReadTable (#3574) * [HUDI-1989] Refactor clustering tests for MoR table * refactor assertion helper * add CheckedFunction * SparkClientFunctionalTestHarness.java * put back original test case * move testcases out from TestHoodieMergeOnReadTable.java * add TestHoodieSparkMergeOnReadTableRollback.java * use SparkClientFunctionalTestHarness * add tag	2021-09-03 13:17:17 -07:00

80 Commits