lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Raymond Xu	0938f55a2b	[HUDI-3458] Fix BulkInsertPartitioner generic type (#4854 )	2022-02-20 13:51:58 -05:00
RexAn	5009138d04	[HUDI-3438] Avoid getSmallFiles if hoodie.parquet.small.file.limit is 0 (#4823 ) Co-authored-by: Hui An <hui.an@shopee.com>	2022-02-18 08:57:04 -05:00
YueZhang	76e2faa28d	[HUDI-3370] The files recorded in the commit may not match the actual ones for MOR Compaction (#4753 ) * use HoodieCommitMetadata to replace writeStatuses computation Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-02-14 11:12:52 +08:00
Sivabalan Narayanan	e7ec3a82dc	[HUDI-2432] Adding restore.requested instant and restore plan for restore action (#4605 ) - This adds a restore plan and serializes it to restore.requested meta file in timeline. This also means that we are introducing schedule and execution phases for restore which was not present before.	2022-02-10 08:06:23 -05:00
Y Ethan Guo	b8601a9f58	[HUDI-2656] Generalize HoodieIndex for flexible record data type (#3893 ) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-02-03 20:24:04 -08:00
Raymond Xu	0bd38f26ca	[HUDI-2596] Make class names consistent in hudi-client (#4680 )	2022-01-27 17:05:08 -08:00
董可伦	1f7b6b2154	[HUDI-2417] Add support allowDuplicateInserts in HoodieJavaClient (#3644 )	2022-01-24 14:26:27 -05:00
Sivabalan Narayanan	e00a9042e9	[HUDI-3072] Fixing conflict resolution in transaction management code path for auto commit code path (#4588 ) * Fixing conflict resolution in transaction management code path for auto commit code path * Addressing comments * Fixing test failures	2022-01-24 16:13:28 +05:30
Sagar Sumit	827549949c	[HUDI-2909] Handle logical type in TimestampBasedKeyGenerator (#4203 ) * The timestamp-based key generator was returning different values for the row-writer and non-row-writer paths. This patch fixes that; the fix is guarded by a config flag (`hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled`)	2022-01-08 10:22:44 -05:00
Yuwei XIAO	2444f40a4b	[HUDI-3095] abstract partition filter logic to enable code reuse (#4454 ) * [HUDI-3095] abstract partition filter logic to enable code reuse * [HUDI-3095] address reviews	2021-12-31 11:07:52 +05:30
Shawy Geng	a4e622ac61	[HUDI-1951] Add bucket hash index, compatible with the hive bucket (#3173 ) * [HUDI-2154] Add index key field to HoodieKey * [HUDI-2157] Add the bucket index and its read/write implementation for the Spark engine. * revert HUDI-2154 add index key field to HoodieKey * fix all comments and introduce a new tricky way to get index key at runtime support double insert for bucket index * revert spark read optimizer based on bucket index * add the storage layout * index tag, hash function and add ut * fix ut * address partial comments * Code review feedback * add layout config and docs * fix ut * rename hoodie.layout and rebase master Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-12-30 12:38:26 -08:00
Alexey Kudinkin	5755ff25a4	[HUDI-2814] Addressing issues w/ Z-order Layout Optimization (#4060 ) * `ZCurveOptimizeHelper` > `ZOrderingIndexHelper`; Moved Z-index helper under `hudi.index.zorder` package * Tidying up `ZOrderingIndexHelper` * Fixing compilation * Fixed index new/original table merging sequence to always prefer values from new index; Cleaned up `HoodieSparkUtils` * Added test for `mergeIndexSql` * Abstracted Z-index name composition w/in `ZOrderingIndexHelper`; * Fixed `DataSkippingUtils` to interrupt pruning in case data filter contains non-indexed column reference * Properly handle exceptions originating during pruning in `HoodieFileIndex` * Make sure no errors are logged upon encountering `AnalysisException` * Cleaned up Z-index updating sequence; Tidying up comments, java-docs; * Fixed Z-index to properly handle changes of the list of clustered columns * Tidying up * `lint` * Suppressing `JavaDocStyle` first sentence check * Fixed compilation * Fixing incorrect `DecimalType` conversion * Refactored test `TestTableLayoutOptimization` - Added Z-index table composition test (against fixtures) - Separated out GC test; Tidying up * Fixed tests re-shuffling column order for Z-Index table `DataFrame` to align w/ the one by one loaded from JSON * Scaffolded `DataTypeUtils` to do basic checks of Spark types; Added proper compatibility checking b/w old/new index-tables * Added test for Z-index tables merging * Fixed import being shaded by creating internal `hudi.util` package * Fixed packaging for `TestOptimizeTable` * Revised `updateMetadataIndex` seq to provide Z-index updating process w/ source table schema * Make sure existing Z-index table schema is sync'd to source table's one * Fixed shaded refs * Fixed tests * Fixed type conversion of Parquet provided metadata values into Spark expected schemas * Fixed `composeIndexSchema` utility to propose proper schema * Added more tests for Z-index: - Checking that Z-index table is built correctly - Checking that Z-index tables are merged correctly (during update) * Fixing source table * Fixing tests to read from Parquet w/ proper schema * Refactored `ParquetUtils` utility reading stats from Parquet footers * Fixed incorrect handling of Decimals extracted from Parquet footers * Worked around issues in javac failing to compile stream's collection * Fixed handling of `Date` type * Fixed handling of `DateType` to be parsed as `LocalDate` * Updated fixture; Make sure test loads Z-index fixture using proper schema * Removed superfluous scheme adjusting when reading from Parquet, since Spark is actually able to perfectly restore schema (given Parquet was previously written by Spark as well) * Fixing race-condition in Parquet's `DateStringifier` trying to share `SimpleDateFormat` object which is inherently not thread-safe * Tidying up * Make sure schema is used upon reading to validate input files are in the appropriate format; Tidying up; * Worked around javac (1.8) inability to infer expression type properly * Updated fixtures; Tidying up * Fixing compilation after rebase * Assert clustering have in Z-order layout optimization testing * Tidying up exception messages * XXX * Added test validating Z-index lookup filter correctness * Added more test-cases; Tidying up * Added tests for string expressions * Fixed incorrect Z-index filter lookup translations * Added more test-cases * Added proper handling on complex negations of AND/OR expressions by pushing NOT operator down into inner expressions for appropriate handling * Added `-target:jvm-1.8` for `hudi-spark` module * Adding more tests * Added tests for non-indexed columns * Properly handle non-indexed columns by falling back to a re-write of containing expression as `TrueLiteral` instead * Fixed tests * Removing the parquet test files and disabling corresponding tests Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-11-26 10:02:15 -08:00
Y Ethan Guo	ca9bfa2a40	[HUDI-2332] Add clustering and compaction in Kafka Connect Sink (#3857 ) * [HUDI-2332] Add clustering and compaction in Kafka Connect Sink * Disable validation check on instant time for compaction and adjust configs * Add javadocs * Add clustering and compaction config * Fix transaction causing missing records in the target table * Add debugging logs * Fix kafka offset sync in participant * Adjust how clustering and compaction are configured in kafka-connect * Fix clustering strategy * Remove irrelevant changes from other published PRs * Update clustering logic and others * Update README * Fix test failures * Fix indentation * Fix clustering config * Add JavaCustomColumnsSortPartitioner and make async compaction enabled by default * Add test for JavaCustomColumnsSortPartitioner * Add more changes after IDE sync * Update README with clarification * Fix clustering logic after rebasing * Remove unrelated changes	2021-11-23 14:23:28 +05:30
Sivabalan Narayanan	2d3f2a3275	[HUDI-2734] Setting default metadata enable as false for Java (#4003 )	2021-11-17 14:43:00 -05:00
Sivabalan Narayanan	04eb5fdc65	[HUDI-2753] Ensure list based rollback strategy is used for restore (#3983 )	2021-11-17 10:06:55 +05:30
Sagar Sumit	bb6a19e7d7	[HUDI-1877] Support records staying in same fileId after clustering (#3833 ) * [HUDI-1877] Support records staying in same fileId after clustering Add plan strategy * Ensure same filegroup id and refactor based on comments	2021-11-09 20:17:50 -08:00
Sivabalan Narayanan	6d109c6de5	[HUDI-2595] Fixing metadata table updates such that only regular writes from data table can trigger table services in metadata table (#3900 )	2021-11-08 22:12:32 -05:00
xiarixiaoyao	d194643b49	[HUDI-2101][RFC-28] support z-order for hudi (#3330 ) * [HUDI-2101]support z-order for hudi * Renaming some configs for consistency/simplicity. * Minor code cleanups Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-11-02 09:31:57 -07:00
Sivabalan Narayanan	29574af239	[HUDI-2573] Fixing double locking with multi-writers (#3827 ) - There are two code paths where locks are taken twice. This was introduced when data table locks were added for updating the metadata table. This change fixes those flows to avoid taking a lock if a parent transaction has already acquired one.	2021-10-29 12:14:39 -04:00
Y Ethan Guo	0223c442ec	[HUDI-2502] Refactor index in hudi-client module (#3778 ) - Refactor Index to reduce Line of Code and re-use across engines.	2021-10-28 04:16:00 -04:00
Y Ethan Guo	5ed35bff83	[HUDI-2501] Add HoodieData abstraction and refactor compaction actions in hudi-client module (#3741 )	2021-10-22 15:58:51 -04:00
Sivabalan Narayanan	5f32162a2f	[HUDI-2285][HUDI-2476] Metadata table synchronous design. Rebased and Squashed from pull/3426 (#3590 ) * [HUDI-2285] Adding Synchronous updates to metadata before completion of commits in data timeline. - This patch adds synchronous updates to metadata table. In other words, every write is first committed to metadata table followed by data table. While reading metadata table, we ignore any delta commits that are present only in metadata table and not in data table timeline. - Compaction of metadata table is fenced by the condition that we trigger compaction only when there are no inflight requests in datatable. This ensures that all base files in metadata table are always in sync with data table (w/o any holes) and only there could be some extra invalid commits among delta log files in metadata table. - Due to this, archival of data table also fences itself up until compacted instant in metadata table. All writes to metadata table happen within the datatable lock. So, metadata table works in one writer mode only. This might be tough to loosen since all writers write to same FILES partition and so, will result in a conflict anyways. - As part of this, have added acquiring locks in data table for those operations which were not before while committing (rollback, clean, compaction, cluster). To note, we were not doing any conflict resolution. All we are doing here is to commit by taking a lock. So that all writes to metadata table is always a single writer. - Also added building block to add buckets for partitions, which will be leveraged by other indexes like record level index, etc. For now, FILES partition has only one bucket. In general, any number of buckets per partition is allowed and each partition has a fixed fileId prefix with incremental suffix for each bucket within each partition. Have fixed [HUDI-2476]. This fix is about retrying a failed compaction if it succeeded in metadata for first time, but failed w/ data table. - Enabling metadata table by default. - Adding more tests for metadata table Co-authored-by: Prashant Wason <pwason@uber.com>	2021-10-06 00:17:52 -04:00
Y Ethan Guo	46808dcb1f	[HUDI-2497] Refactor clean and restore actions in hudi-client module (#3734 )	2021-09-30 18:20:25 -04:00
Sivabalan Narayanan	f0585facd6	[HUDI-2474] Refreshing timeline for every operation in Hudi when metadata is enabled (#3698 )	2021-09-28 05:16:52 -04:00
vinoth chandar	57d5da68aa	[HUDI-2330][HUDI-2335] Adding support for merge-on-read tables (#3679 ) - Inserts go into logs, hashed by Kafka and Hudi partitions - Fixed issues with the setupKafka script - Bumped up the default commit interval to 300 seconds - Minor renaming	2021-09-16 15:24:34 -07:00
Sivabalan Narayanan	b8dad628e5	[HUDI-2422] Adding rollback plan and rollback requested instant (#3651 ) - This patch introduces rollback plan and rollback.requested instant. Rollback will be done in two phases, namely rollback plan and rollback action. In planning, we prepare the rollback plan and serialize it to rollback.requested. In the rollback action phase, we fetch details from the plan and just delete the files as per the plan. This will ensure final rollback commit metadata will contain all files that got rolled back even if rollback failed midway and retried again.	2021-09-16 11:16:06 -04:00
Y Ethan Guo	916f12b7dd	[HUDI-2433] Refactor rollback actions in hudi-client module (#3664 )	2021-09-15 18:52:43 -04:00
董可伦	dbcf60f370	[MINOR] fix typo (#3640 )	2021-09-11 15:45:49 +08:00
rmahindra123	e528dd798a	[HUDI-2394] Implement Kafka Sink Protocol for Hudi for Ingesting Immutable Data (#3592 ) - Fixing packaging, naming of classes - Use of log4j over slf4j for uniformity - More follow-on fixes - Added a version to control/coordinator events. - Eliminated the config added to write config - Fixed fetching of checkpoints based on table type - Clean up of naming, code placement Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local> Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-09-10 18:20:26 -07:00
Y Ethan Guo	9056c68744	[HUDI-2305] Add MARKERS.type and fix marker-based rollback (#3472 ) - Rollback infers the directory structure and does rollback based on the strategy used while markers were written. "write markers type" in write config is used to determine marker strategy only for new writes.	2021-08-14 08:18:49 -04:00
Prashant Wason	76bc686a77	[HUDI-1292] Created a config to enable/disable syncing of metadata table. (#3427 ) * [HUDI-1292] Created a config to enable/disable syncing of metadata table. - Metadata Table should only be synced from a single pipeline to prevent conflicts. - Skip syncing metadata table for clustering and compaction - Renamed useFileListingMetadata Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-08-12 15:45:57 -07:00
Y Ethan Guo	4783176554	[HUDI-1138] Add timeline-server-based marker file strategy for improving marker-related latency (#3233 ) - Can be enabled for cloud stores like S3. Not supported for hdfs yet, due to partial write failures.	2021-08-11 11:48:13 -04:00
Sagar Sumit	70b6bd485f	[HUDI-1468] Support custom clustering strategies and preserve commit metadata as part of clustering (#3419 ) Co-authored-by: Satish Kotha <satishkotha@uber.com>	2021-08-06 22:53:08 -04:00
Sivabalan Narayanan	61148c1c43	[HUDI-2176, 2178, 2179] Adding virtual key support to COW table (#3306 )	2021-07-26 17:21:04 -04:00
pengzhiwei	f760ec543e	[HUDI-1659] Basic Implement Of Spark Sql Support For Hoodie (#2645 ) Main functions: Support create table for hoodie. Support CTAS. Support Insert for hoodie, including dynamic partition and static partition insert. Support MergeInto for hoodie. Support DELETE. Support UPDATE. Both spark2 & spark3 are supported, based on DataSourceV1. Main changes: Add sql parser for spark2. Add HoodieAnalysis for sql resolution and logical plan rewrite. Add command implementations for CREATE TABLE, INSERT, MERGE INTO & CTAS. In order to push down the update & insert logic to the HoodieRecordPayload for MergeInto, some changes were made to the HoodieWriteHandler and other related classes. 1. Add the inputSchema for parsing the incoming record. This is because the inputSchema for MergeInto is different from writeSchema, as there are some transforms in the update & insert expressions. 2. Add WRITE_SCHEMA to HoodieWriteConfig to pass the write schema for merge into. 3. Pass properties to HoodieRecordPayload#getInsertValue to pass the insert expression and table schema. Verify this pull request: Add TestCreateTable to test creating hoodie tables and CTAS. Add TestInsertTable to test inserting into hoodie tables. Add TestMergeIntoTable to test merging hoodie tables. Add TestUpdateTable to test updating hoodie tables. Add TestDeleteTable to test deleting from hoodie tables. Add TestSqlStatement to test the currently supported ddl/dml.	2021-06-07 23:24:32 -07:00
Roc Marshal	fcedbfcb58	[MINOR][hudi-client] Code-cleanup,remove redundant variable declarations (#2956 )	2021-05-17 13:34:42 +08:00
TeRS-K	be9db2c4f5	[HUDI-1055] Remove hardcoded parquet in tests (#2740 ) * Remove hardcoded parquet in tests * Use DataFileUtils.getInstance * Renaming DataFileUtils to BaseFileUtils Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-05-11 10:01:45 -07:00
leo-Iamok	8bc65b9318	[HUDI-1731] Rename UpsertPartitioner in hudi-java-client (#2734 ) Co-authored-by: lei.zhu <lei.zhu@envisioncn.com>	2021-03-31 11:06:04 +08:00
Shen Hong	ecbd389a3f	[HUDI-1478] Introduce HoodieBloomIndex to hudi-java-client (#2608 )	2021-03-28 20:28:40 +08:00
n3nash	74241947c1	[HUDI-845] Added locking capability to allow multiple writers (#2374 ) * [HUDI-845] Added locking capability to allow multiple writers 1. Added LockProvider API for pluggable lock methodologies 2. Added Resolution Strategy API to allow for pluggable conflict resolution 3. Added TableService client API to schedule table services 4. Added Transaction Manager for wrapping actions within transactions	2021-03-16 16:43:53 -07:00
Shen Hong	2efd0760ac	[HUDI-1477] Support copyOnWriteTable in java client (#2382 )	2021-02-23 20:50:55 +08:00
n3nash	ffcfb58bac	[HUDI-1486] Remove inline inflight rollback in hoodie writer (#2359 ) 1. Refactor rollback and move cleaning failed commits logic into cleaner 2. Introduce hoodie heartbeat to ascertain failed commits 3. Fix test cases	2021-02-19 20:12:22 -08:00
Sivabalan Narayanan	c9fcf964b2	[HUDI-1315] Adding builder for HoodieTableMetaClient initialization (#2534 )	2021-02-20 09:54:26 +08:00
Danny Chan	4c5b6923cc	[HUDI-1557] Make Flink write pipeline write task scalable (#2506 ) This is step 2 of RFC-24: https://cwiki.apache.org/confluence/display/HUDI/RFC+-+24%3A+Hoodie+Flink+Writer+Proposal This PR introduces a BucketAssigner that assigns a bucket ID (partition path & fileID) to each stream record. The downstream pipeline no longer needs to look up the index and partition these records: the write target location is decided before the write, and each record's location is computed when the BucketAssigner receives it, so indexing happens in a streaming style. Computing locations for a whole batch of records at once is resource-intensive and puts pressure on the engine, which should be avoided in a streaming system.	2021-02-06 22:03:52 +08:00
Shen Hong	c4afd179c1	[HUDI-1476] Introduce unit test infra for java client (#2478 )	2021-01-24 11:17:19 -08:00
teeyog	c931dc5406	[MINOR] Remove redundant judgments (#2466 )	2021-01-20 20:41:09 +08:00
Udit Mehrotra	17df517b81	[HUDI-1510] Move HoodieEngineContext and its dependencies to hudi-common (#2410 )	2021-01-07 11:34:06 -08:00
vinoth chandar	5ff8e88d58	[HUDI-1513] Introduce WriteClient#preWrite() and relocate metadata table syncing (#2413 ) - Syncing to metadata table, setting operation type, starting async cleaner done in preWrite() - Fixes an issues where delete() was not starting async cleaner correctly - Fixed tests and enabled metadata table for TestAsyncCompaction	2021-01-06 23:08:30 -08:00
Prashant Wason	298808baaf	[HUDI-842] Implementation of HUDI RFC-15. - Introduced an internal metadata table, that stores file listings. - metadata table is kept up to date with - Fixed handling of CleanerPlan. - [HUDI-842] Reduce parallelism to speed up the test. - [HUDI-842] Implementation of CLI commands for metadata operations and lookups. - [HUDI-842] Extend rollback metadata to include the files which have been appended to. - [HUDI-842] Support for rollbacks in MOR Table. - MarkerBasedRollbackStrategy needs to correctly provide the list of files for which rollback blocks were appended. - [HUDI-842] Added unit test for rollback of partial commits (inflight but not completed yet). - [HUDI-842] Handled the error case where metadata update succeeds but dataset commit fails. - [HUDI-842] Schema evolution strategy for Metadata Table. Each type of metadata saved (FilesystemMetadata, ColumnIndexMetadata, etc.) will be a separate field with default null. The type of the record will identify the valid field. This way, we can grow the schema when new type of information is saved within it while still keeping it backward compatible. - [HUDI-842] Fix non-partitioned case and speedup initial creation of metadata table. Choose only 1 partition for jsc as the number of records is low (hundreds to thousands). There is more overhead of creating large number of partitions for JavaRDD and it slows down operations like WorkloadProfile. For the non-partitioned case, use "." as the name of the partition to prevent empty keys in HFile. - [HUDI-842] Reworked metrics publishing. - Code has been split into reader and writer side. HoodieMetadata code to be accessed by using HoodieTable.metadata() to get instance of metadata for the table. Code is serializable to allow executors to use the functionality. - [RFC-15] Add metrics to track the time for each file system call. - [RFC-15] Added a distributed metrics registry for spark which can be used to collect metrics from executors. This helps create a stats dashboard which shows the metadata table improvements in real-time for production tables. - [HUDI-1321] Created HoodieMetadataConfig to specify configuration for the metadata table. This is safer than full-fledged properties for the metadata table (like HoodieWriteConfig) as it makes burdensome to tune the metadata. With limited configuration, we can control the performance of the metadata table closely. [HUDI-1319][RFC-15] Adding interfaces for HoodieMetadata, HoodieMetadataWriter (apache#2266) - moved MetadataReader to HoodieBackedTableMetadata, under the HoodieTableMetadata interface - moved MetadataWriter to HoodieBackedTableMetadataWriter, under the HoodieTableMetadataWriter - Pulled all the metrics into HoodieMetadataMetrics - Writer now wraps the metadata, instead of extending it - New enum for MetadataPartitionType - Streamlined code flow inside HoodieBackedTableMetadataWriter w.r.t initializing metadata state - [HUDI-1319] Make async operations work with metadata table (apache#2332) - Changes the syncing model to only move over completed instants on data timeline - Syncing happens postCommit and on writeClient initialization - Latest delta commit on the metadata table is sufficient as the watermark for data timeline archival - Cleaning/Compaction use a suffix to the last instant written to metadata table, such that we keep the 1-1 - .. mapping between data and metadata timelines. - Got rid of a lot of the complexity around checking for valid commits during open of base/log files - Tests now use local FS, to simulate more failure scenarios - Some failure scenarios exposed HUDI-1434, which is needed for MOR to work correctly co-authored by: Vinoth Chandar <vinoth@apache.org>	2021-01-04 07:59:47 -08:00
Shen Hong	ff8313caf1	[HUDI-1423] Support delete in hudi-java-client (#2353 )	2021-01-03 20:38:45 +08:00

57 Commits