- Add a new action called INDEX, whose state transition is described in the RFC.
- Changes in timeline to support the new action.
- Add an index planner in ScheduleIndexActionExecutor.
- Add index plan executor in RunIndexActionExecutor.
- Add 3 APIs in HoodieTableMetadataWriter:
  a) scheduleIndex: generates an index plan based on the latest completed instant, initializes file groups, and adds a requested INDEX instant.
  b) index: executes the index plan and also takes care of writes that happened after indexing was requested.
  c) dropIndex: drops the index by removing the given metadata partition.
- Add 2 new table configs to serve as the source of truth for inflight and completed indexes.
- Support upgrade/downgrade taking care of the newly added configs.
- Add tool to trigger indexing in HoodieIndexer.
- Handle corner cases related to partial failures.
- Abort gracefully after deleting partition and instant.
- Handle other timeline actions that need to be considered before catching up
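The scheduleIndex/index lifecycle above can be sketched as a small state machine. This is a minimal illustration only; `IndexAction`, its `State` enum, and the method bodies are stand-ins for the description above, not the actual Hudi classes:

```java
import java.util.ArrayList;
import java.util.List;

public class IndexAction {
    public enum State { NONE, REQUESTED, INFLIGHT, COMPLETED }

    private State state = State.NONE;
    private final List<String> partitionsToIndex = new ArrayList<>();

    // scheduleIndex: build a plan from the latest completed instant and
    // record a requested INDEX instant on the timeline.
    public State scheduleIndex(List<String> partitions) {
        if (state != State.NONE) {
            throw new IllegalStateException("index already scheduled");
        }
        partitionsToIndex.addAll(partitions);
        state = State.REQUESTED;
        return state;
    }

    // index: execute the plan, then catch up on writes that landed after
    // indexing was requested, before marking the instant complete.
    public State index() {
        if (state != State.REQUESTED) {
            throw new IllegalStateException("no index plan to execute");
        }
        state = State.INFLIGHT;
        // ... build index file groups, then replay post-schedule commits ...
        state = State.COMPLETED;
        return state;
    }

    // dropIndex: drop the index by removing the given metadata partition.
    public State dropIndex(String partition) {
        partitionsToIndex.remove(partition);
        return state;
    }
}
```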
As of now, delete-partition operations ensure all file groups are deleted, but the partition itself is not deleted. So listing all partitions may return the deleted partitions as well, although no data will be served since all their file groups are gone. This patch fixes that: we let the cleaner take care of deleting a partition once all file groups pertaining to it are deleted.
- Fixed the CleanPlanActionExecutor to return meta info about list of partitions to be deleted. If there are no valid file groups for a partition, clean planner will include the partition to be deleted.
- Fixed HoodieCleanPlan avro schema to include the list of partitions to be deleted
- CleanActionExecutor is fixed to delete partitions if any (as per clean plan)
- Same info is added to HoodieCleanMetadata
- When applying clean metadata, the metadata table will check for partitions to be deleted and update the "all_partitions" record for the deleted partitions.
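The clean-planner change above can be sketched in a few lines. This is an illustrative model, not the actual `CleanPlanActionExecutor` code: a partition whose surviving file-group list is empty gets flagged for deletion in the plan:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CleanPlanSketch {
    // Input maps each partition to the file groups that remain valid after
    // the clean. Any partition with no surviving file groups is returned so
    // the CleanActionExecutor can delete the partition path itself.
    public static List<String> partitionsToDelete(Map<String, List<String>> survivingFileGroups) {
        List<String> toDelete = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : survivingFileGroups.entrySet()) {
            if (e.getValue().isEmpty()) {
                toDelete.add(e.getKey()); // no valid file groups left
            }
        }
        return toDelete;
    }
}
```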
Co-authored-by: sivabalan <n.siva.b@gmail.com>
- This adds a restore plan and serializes it to the restore.requested meta file in the timeline. This also means we are introducing schedule and execution phases for restore, which were not present before.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Today, base files store their bloom filter in the footer, so index lookups
have to load the base file to perform any bloom checks. Although we have
interval-tree-based file pruning, we still end up with a significant amount
of base file reads just to fetch the bloom filters for the final key
lookups. This index lookup operation can be made more performant by keeping
all the bloom filters in a new metadata partition and doing pointed
lookups based on keys.
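The contrast above can be illustrated with a toy model; the class below is not the Hudi API, and the bloom filter is modeled as a plain key set for brevity. The point is that the filters live in one keyed metadata partition, so a membership check is a pointed read instead of opening each base file's footer:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class BloomMetaIndexSketch {
    // The "bloom_filters" metadata partition, modeled as a map from
    // (partitionPath, fileName) key to the keys the filter may contain.
    private final Map<String, Set<String>> bloomPartition = new HashMap<>();

    private static String indexKey(String partitionPath, String fileName) {
        return partitionPath + "/" + fileName;
    }

    public void putFilter(String partitionPath, String fileName, Set<String> mayContain) {
        bloomPartition.put(indexKey(partitionPath, fileName), mayContain);
    }

    // Pointed lookup by key: no base-file footer read is needed.
    public boolean mightContain(String partitionPath, String fileName, String recordKey) {
        Set<String> filter = bloomPartition.get(indexKey(partitionPath, fileName));
        return filter != null && filter.contains(recordKey);
    }
}
```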
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Adding indexing support for clean, restore and rollback operations.
Each of these operations will now be converted to index records for
bloom filter and column stats additionally.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Making hoodie key consistent for both column stats and bloom index by
including fileId instead of fileName, in both read and write paths.
- Performance optimization for looking up records in the metadata table.
- Avoiding multi column sorting needed for HoodieBloomMetaIndexBatchCheckFunction
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- HoodieBloomMetaIndexBatchCheckFunction cleanup to remove unused classes
- Base file checking before reading the file footer for bloom or column stats
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Updating the bloom index and column stats index to have full file name
included in the key instead of just file id.
- Minor test fixes.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Fixed flink commit method to handle metadata table all partition update records
- TestBloomIndex fixes
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- SparkHoodieBloomIndexHelper code simplification for various config modes
- Signature change for getBloomFilters() and getColumnStats(). Callers just
pass in the partition and file names of interest; the index key is then
constructed internally from the passed-in parameters.
- KeyLookupHandle and KeyLookupResults code refactoring
- Metadata schema changes - removed the reserved field
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Removing HoodieColumnStatsMetadata and using HoodieColumnRangeMetadata instead.
Fixed the users of the removed class.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Extending meta index test to cover deletes, compactions, clean
and restore table operations. Also, fixed the getBloomFilters()
and getColumnStats() to account for deleted entries.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Addressing review comments: javadoc for new classes, key sorting for
lookups, index method renames.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Consolidated the bloom filter checking for keys into one
HoodieMetadataBloomIndexCheckFunction instead of separate batch
and lazy modes. Removed all the configs around it.
- Made the metadata table partition file group count configurable.
- Fixed the HoodieKeyLookupHandle to have auto closable file reader
when checking bloom filter and range keys.
- Config property renames. Test fixes.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Enabling column stats indexing for all columns by default
- Handling column stat generation errors and test update
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Metadata table partition file group count taken from the slices when
the table is bootstrapped.
- Prep records for the commit refactored to the base class
- HoodieFileReader interface changes for filtering keys
- Multi-column and data type support for the column stats index
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Rebased to latest master and merged fixes for the build and test failures
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Extending the metadata column stats type payload schema to include
more statistics about the column ranges to help query integration.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Addressing review comments
This change addresses issues where the Metadata Table ingested duplicated records, leading it to persist incorrect file sizes for the files referred to in those records.
There were multiple issues leading to that:
- [HUDI-3322] Incorrect Rollback Plan generation: Rollback Plan generated for MOR tables was overly expansively listing all log-files with the latest base-instant as the ones that have been affected by the rollback, leading to invalid MT records being ingested referring to those.
- [HUDI-3343] Metadata Table including Uncommitted Log Files during Bootstrap: Since MT is bootstrapped at the end of the commit operation execution (after FS activity, but before committing to the timeline), it was actually incorrectly ingesting some files that were part of the intermediate state of the operation being committed.
This change unblocks the stack of PRs based off #4556
- This patch introduces rollback plan and rollback.requested instant. Rollback will be done in two phases, namely rollback plan and rollback action. In planning, we prepare the rollback plan and serialize it to rollback.requested. In the rollback action phase, we fetch details from the plan and just delete the files as per the plan. This will ensure final rollback commit metadata will contain all files that got rolled back even if rollback failed midway and retried again.
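The two-phase flow above can be sketched as follows; `RollbackSketch` and its method names are illustrative, not actual Hudi classes. The key property being demonstrated is that execution deletes exactly what the plan lists, so a retried rollback still reports the full file set even if a prior attempt already deleted some files:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class RollbackSketch {
    // Phase 1 (planning): decide which files to delete; the result is what
    // gets serialized into the rollback.requested instant.
    public static List<String> planRollback(List<String> filesWrittenByFailedCommit) {
        return new ArrayList<>(filesWrittenByFailedCommit);
    }

    // Phase 2 (action): delete per the plan. Deleting an already-missing file
    // is a no-op, so a retry after a mid-way failure is safe, and the final
    // rollback metadata still lists every planned file.
    public static List<String> executeRollback(List<String> plan, Set<String> fs) {
        List<String> rolledBack = new ArrayList<>();
        for (String f : plan) {
            fs.remove(f);      // idempotent delete
            rolledBack.add(f); // recorded regardless of whether it still existed
        }
        return rolledBack;
    }
}
```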
- Fix problem of archiving replace commits
- Fix problem when getting an empty replacecommit.requested
- Improved the logic of handling empty and non-empty requested/inflight commit files. Added unit tests to cover both empty and non-empty inflight files cases and cleaned up some unused test util methods
Co-authored-by: yorkzero831 <yorkzero8312@gmail.com>
Co-authored-by: zheren.yu <zheren.yu@paypay-corp.co.jp>
- Adds a field to RollbackMetadata that captures the log files written for rollback blocks
- Adds a field to RollbackMetadata that captures new log files written by unsynced deltacommits
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
- Introduced an internal metadata table that stores file listings.
- Metadata table is kept up to date with each instant on the data timeline.
- Fixed handling of CleanerPlan.
- [HUDI-842] Reduce parallelism to speed up the test.
- [HUDI-842] Implementation of CLI commands for metadata operations and lookups.
- [HUDI-842] Extend rollback metadata to include the files which have been appended to.
- [HUDI-842] Support for rollbacks in MOR Table.
- MarkerBasedRollbackStrategy needs to correctly provide the list of files for which rollback blocks were appended.
- [HUDI-842] Added unit test for rollback of partial commits (inflight but not completed yet).
- [HUDI-842] Handled the error case where metadata update succeeds but dataset commit fails.
- [HUDI-842] Schema evolution strategy for Metadata Table. Each type of metadata saved (FilesystemMetadata, ColumnIndexMetadata, etc.) will be a separate field with default null. The type of the record will identify the valid field. This way, we can grow the schema when new types of information are saved, while still keeping it backward compatible.
- [HUDI-842] Fix non-partitioned case and speed up initial creation of metadata table. Choose only 1 partition for the jsc as the number of records is low (hundreds to thousands). There is more overhead in creating a large number of partitions for the JavaRDD, and it slows down operations like WorkloadProfile.
For the non-partitioned case, use "." as the name of the partition to prevent empty keys in HFile.
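The non-partitioned fix above amounts to a one-line key normalization; the helper below is a sketch of the idea (names are illustrative), mapping an empty partition path to "." so no HFile key is ever empty:

```java
public class PartitionKeySketch {
    // Non-partitioned tables have an empty partition path; store it as "."
    // so the resulting HFile key is never empty.
    public static String toMetadataKey(String partitionPath) {
        return (partitionPath == null || partitionPath.isEmpty()) ? "." : partitionPath;
    }
}
```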
- [HUDI-842] Reworked metrics publishing.
- Code has been split into reader and writer side. HoodieMetadata code is accessed via HoodieTable.metadata(), which returns the metadata instance for the table.
Code is serializable to allow executors to use the functionality.
- [RFC-15] Add metrics to track the time for each file system call.
- [RFC-15] Added a distributed metrics registry for spark which can be used to collect metrics from executors. This helps create a stats dashboard which shows the metadata table improvements in real-time for production tables.
- [HUDI-1321] Created HoodieMetadataConfig to specify configuration for the metadata table. This is safer than full-fledged properties for the metadata table (like HoodieWriteConfig), which would make tuning the metadata table burdensome. With a limited configuration surface, we can control the performance of the metadata table closely.
[HUDI-1319][RFC-15] Adding interfaces for HoodieMetadata, HoodieMetadataWriter (apache#2266)
- moved MetadataReader to HoodieBackedTableMetadata, under the HoodieTableMetadata interface
- moved MetadataWriter to HoodieBackedTableMetadataWriter, under the HoodieTableMetadataWriter
- Pulled all the metrics into HoodieMetadataMetrics
- Writer now wraps the metadata, instead of extending it
- New enum for MetadataPartitionType
- Streamlined code flow inside HoodieBackedTableMetadataWriter w.r.t initializing metadata state
- [HUDI-1319] Make async operations work with metadata table (apache#2332)
- Changes the syncing model to only move over completed instants on data timeline
- Syncing happens postCommit and on writeClient initialization
- Latest delta commit on the metadata table is sufficient as the watermark for data timeline archival
- Cleaning/Compaction use a suffix to the last instant written to the metadata table, such that we keep the 1-1 mapping between data and metadata timelines.
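The suffixing scheme above can be sketched as simple instant-time manipulation. This is a toy model: the fixed 3-digit "001" suffix and the class name are assumptions for illustration, not the exact Hudi convention:

```java
public class InstantSuffixSketch {
    private static final String TABLE_SERVICE_SUFFIX = "001"; // illustrative value

    // A metadata-table compaction reuses the last synced data instant time
    // plus a suffix, so it never collides with a data-timeline instant.
    public static String compactionInstant(String lastSyncedDataInstant) {
        return lastSyncedDataInstant + TABLE_SERVICE_SUFFIX;
    }

    // Stripping the suffix recovers the originating data instant, which
    // preserves the 1-1 mapping between the two timelines.
    public static String dataInstantFor(String metadataInstant) {
        return metadataInstant.length() > TABLE_SERVICE_SUFFIX.length()
            ? metadataInstant.substring(0, metadataInstant.length() - TABLE_SERVICE_SUFFIX.length())
            : metadataInstant;
    }
}
```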
- Got rid of a lot of the complexity around checking for valid commits during open of base/log files
- Tests now use local FS, to simulate more failure scenarios
- Some failure scenarios exposed HUDI-1434, which is needed for MOR to work correctly
co-authored by: Vinoth Chandar <vinoth@apache.org>
- [HUDI-418] Bootstrap Index Implementation using HFile with unit-test
- [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests
- [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices
- [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
- [HUDI-421] Bootstrap Write Client with tests
- [HUDI-425] Added HoodieDeltaStreamer support
- [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap
- [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly
- [HUDI-424] Simplify Record reader implementation
- [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
- [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables
Co-authored-by: Mehrotra <uditme@amazon.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
Before this change, the Cleaner performed cleaning of old file versions and then stored the deleted files in .clean files.
With this setup, we could not track file deletions if the cleaner failed after deleting files but before writing the .clean metadata.
This is fine for regular file-system view generation but Incremental timeline syncing relies on clean/commit/compaction metadata to keep a consistent file-system view.
Cleaner state transitions are now similar to those of compaction.
1. Requested : HoodieWriteClient.scheduleClean() selects the list of files that needs to be deleted and stores them in metadata
2. Inflight : HoodieWriteClient marks the state to be inflight before it starts deleting
3. Completed : HoodieWriteClient marks the state after completing the deletion according to the cleaner plan
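The three steps above can be sketched as a tiny state machine; the enum and class names below are illustrative, not the actual Hudi timeline classes:

```java
public class CleanerStateSketch {
    public enum State { REQUESTED, INFLIGHT, COMPLETED }

    // Valid transitions mirror compaction: REQUESTED -> INFLIGHT -> COMPLETED.
    // A completed clean has no further transition.
    public static State next(State s) {
        switch (s) {
            case REQUESTED: return State.INFLIGHT;   // deletion is starting
            case INFLIGHT:  return State.COMPLETED;  // plan fully executed
            default: throw new IllegalStateException("clean already completed");
        }
    }
}
```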