lanyuanxiaoyao/hudi - hudi - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Sivabalan Narayanan	7bb90e8caf	[HUDI-2794] Guarding table service commits within a single lock to commit to both data table and metadata table (#4037 ) * Fixing a single lock to commit table services across metadata table and data table * Addressing comments * rebasing with master	2021-11-25 11:19:30 -08:00
Sagar Sumit	b972aa5bf2	[HUDI-2800] Remove rdd.isEmpty() validation to prevent CreateHandle being called twice (#4121 )	2021-11-25 23:46:36 +05:30
Sivabalan Narayanan	a9bd20804b	[HUDI-2792] Configure metadata payload consistency check (#4035 ) - Relax metadata payload consistency check to consider spark task failures with spurious deletes	2021-11-24 21:56:31 -05:00
Sagar Sumit	ff94d92980	[HUDI-2766] Cluster update strategy should not be fenced by write config (#4093 ) Fix pending clustering rollback test	2021-11-24 10:15:40 -08:00
Alexey Kudinkin	60b23b9797	[HUDI-2788] Fixing issues w/ Z-order Layout Optimization (#4026 ) * Simplifying, tidying up * Fixed packaging for `TestOptimizeTable` * Cleaned up `HoodieFileIndex` file filtering seq; Removed optimization manually reading Parquet table circumventing Spark * Refactored `DataSkippingUtils`: - Fixed checks to validate all statistics cols are present - Fixed some predicates being constructed incorrectly - Rewrote comments for easier comprehension, added more notes - Tidying up * Tidying up tests * `lint` * Fixing compilation * `TestOptimizeTable` > `TestTableLayoutOptimization`; Added assertions to test data skipping paths * Fixed tests to properly hit data-skipping path * Fixed pruned files candidates lookup seq to conservatively include all non-indexed files * Added java-doc * Fixed compilation	2021-11-24 10:10:28 -08:00
Y Ethan Guo	ca9bfa2a40	[HUDI-2332] Add clustering and compaction in Kafka Connect Sink (#3857 ) * [HUDI-2332] Add clustering and compaction in Kafka Connect Sink * Disable validation check on instant time for compaction and adjust configs * Add javadocs * Add clustering and compaction config * Fix transaction causing missing records in the target table * Add debugging logs * Fix kafka offset sync in participant * Adjust how clustering and compaction are configured in kafka-connect * Fix clustering strategy * Remove irrelevant changes from other published PRs * Update clustering logic and others * Update README * Fix test failures * Fix indentation * Fix clustering config * Add JavaCustomColumnsSortPartitioner and make async compaction enabled by default * Add test for JavaCustomColumnsSortPartitioner * Add more changes after IDE sync * Update README with clarification * Fix clustering logic after rebasing * Remove unrelated changes	2021-11-23 14:23:28 +05:30
xiarixiaoyao	9de9951348	[HUDI-2778] Optimize statistics collection related code, add some docs for z-order, and fix some bugs (#4013 ) * [HUDI-2778] Optimize statistics collection related code and add more docs for z-order. * add test code for multi-thread parquet footer read	2021-11-22 21:46:02 -08:00
Sagar Sumit	e22150fe15	[HUDI-1937] Rollback unfinished replace commit to allow updates (#3869 ) * [HUDI-1937] Rollback unfinished replace commit to allow updates while clustering * Revert and delete requested replacecommit too * Rollback pending clustering instants transactionally * No double locking and add a config to enable rollback * Update config to be clear about rollback only on conflict	2021-11-23 07:29:03 +05:30
Alexey Kudinkin	3bdab01a49	[HUDI-2550] Expand File-Group candidates list for appending for MOR tables (#3986 )	2021-11-22 19:19:59 -05:00
Sivabalan Narayanan	fc9ca6a07a	[HUDI-2559] Converting commit timestamp format to millisecs (#4024 ) - Adds support for generating commit timestamps with millisecs granularity. - Older commit timestamps (in secs granularity) will be suffixed with 999 and parsed with millisecs format.	2021-11-22 11:44:38 -05:00
Raymond Xu	02f7ca2b05	[HUDI-1870] Add more Spark CI build tasks (#4022 ) * [HUDI-1870] Add more Spark CI build tasks - build for spark3.0.x - build for spark-shade-unbundle-avro - fix build failures - delete unnecessary assertion for spark 3.0.x - use AvroConversionUtils#convertAvroSchemaToStructType instead of calling SchemaConverters#toSqlType directly to solve the compilation failures with spark-shade-unbundle-avro (#5) Co-authored-by: Yann <biyan900116@gmail.com>	2021-11-22 02:16:45 -08:00
Manoj Govindassamy	459b34240b	[HUDI-2593] Virtual keys support for metadata table (#3968 ) - Metadata table today has virtual keys disabled, thereby populating the metafields for each record written out and increasing the overall storage space used. Hereby adding virtual keys support for metadata table so that metafields are disabled for metadata table records. - Adding a custom KeyGenerator for Metadata table so as to not rely on the default Base/SimpleKeyGenerators which currently look for record key and partition field set in the table config. - AbstractHoodieLogRecordReader's version of processing next data block and createHoodieRecord() will be a generic version, with the derived class HoodieMetadataMergedLogRecordReader taking care of the special creation of records from explicitly passed in partition names.	2021-11-19 18:11:29 -05:00
Sagar Sumit	eba354e922	[HUDI-2731] Make clustering work regardless of whether there are base… (#3970 )	2021-11-19 11:09:08 -05:00
Sivabalan Narayanan	04eb5fdc65	[HUDI-2753] Ensure list based rollback strategy is used for restore (#3983 )	2021-11-17 10:06:55 +05:30
Prashant Wason	77b0440eb4	[HUDI-2634] Improved the metadata table bootstrap for very large tables. (#3873 ) * [HUDI-2634] Improved the metadata table bootstrap for very large tables. Following improvements are implemented: 1. Memory overhead reduction: - Existing code caches FileStatus for each file in memory. - Created a new class DirectoryInfo which is used to cache a directory's file list with parts of the FileStatus (only filename and file len). This reduces the memory requirements. 2. Improved parallelism: - Existing code collects all the listing to the Driver and then creates HoodieRecord on the Driver. - This takes a long time for large tables (11 million HoodieRecords to be created) - Created a new function in SparkRDDWriteClient specifically for bootstrap commit. In it, the HoodieRecord creation is parallelized across executors so it completes fast. 3. Fixed setting to limit the number of parallel listings: - Existing code had a bug wherein 1500 executors were hardcoded to perform listing. This leads to an exception due to the limit on Spark's result memory. - Corrected the use of the config. Result: Dataset has 1299 partitions and 12 million files. file listing time=1.5mins HoodieRecord creation time=13seconds deltacommit duration=2.6mins Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>	2021-11-10 22:37:48 -05:00
Shawy Geng	fd0f5df26d	[HUDI-2297] Estimate available memory size for spillable map accurately. (#3455 )	2021-11-10 14:05:12 +08:00
Sagar Sumit	bb6a19e7d7	[HUDI-1877] Support records staying in same fileId after clustering (#3833 ) * [HUDI-1877] Support records staying in same fileId after clustering Add plan strategy * Ensure same filegroup id and refactor based on comments	2021-11-09 20:17:50 -08:00
Prashant Wason	2f95967dfe	[HUDI-2591] Bootstrap metadata table only if upgrade / downgrade is not required. (#3836 )	2021-11-09 10:26:20 -05:00
Sivabalan Narayanan	6d109c6de5	[HUDI-2595] Fixing metadata table updates such that only regular writes from data table can trigger table services in metadata table (#3900 )	2021-11-08 22:12:32 -05:00
manasaks	e0285800fb	HUDI-1827 : Add ORC support in Bootstrap Op (#3457 ) Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>	2021-11-06 12:23:20 -04:00
Sagar Sumit	08c35a55b3	[HUDI-2526] Make spark.sql.parquet.writeLegacyFormat configurable (#3917 )	2021-11-05 13:03:41 -04:00
Prashant Wason	b7ee341e14	[HUDI-1794] Moved static COMMIT_FORMATTER to thread local variable as SimpleDateFormat is not thread safe. (#2819 )	2021-11-05 09:31:42 -04:00
Yann Byron	6351e5f4d0	[HUDI-2538] persist some configs to hoodie.properties on the first write (#3823 )	2021-11-03 10:04:23 +08:00
Yann Byron	1f17467f73	[HUDI-1869] Upgrading Spark3 To 3.1 (#3844 ) Co-authored-by: pengzhiwei <pengzhiwei2015@icloud.com>	2021-11-02 18:25:12 -07:00
xiarixiaoyao	d194643b49	[HUDI-2101][RFC-28] support z-order for hudi (#3330 ) * [HUDI-2101]support z-order for hudi * Renaming some configs for consistency/simplicity. * Minor code cleanups Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-11-02 09:31:57 -07:00
Sivabalan Narayanan	29574af239	[HUDI-2573] Fixing double locking with multi-writers (#3827 ) - There are two code paths where we are taking double locks. This was added as part of adding data table locks to update the metadata table. Fixing those flows to avoid taking locks if a parent transaction has already acquired a lock.	2021-10-29 12:14:39 -04:00
Y Ethan Guo	0223c442ec	[HUDI-2502] Refactor index in hudi-client module (#3778 ) - Refactor Index to reduce Line of Code and re-use across engines.	2021-10-28 04:16:00 -04:00
Manoj Govindassamy	c9d641cc30	[HUDI-2468] Metadata table support for rolling back the first commit (#3843 ) - Fix is to make Metadata table writer creation aware of the currently inflight action so that it can make some informed decision about whether bootstrapping is needed for the table and whether any pending action on the data timeline can be ignored.	2021-10-23 10:07:09 -04:00
Y Ethan Guo	5ed35bff83	[HUDI-2501] Add HoodieData abstraction and refactor compaction actions in hudi-client module (#3741 )	2021-10-22 15:58:51 -04:00
zhangyue19921010	e6711b171a	[HUDI-2435][BUG]Fix clustering handle errors (#3666 ) * done * remove unused imports * code reviewed * code reviewed Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2021-10-12 15:24:48 -07:00
Sivabalan Narayanan	8a487eafa7	[HUDI-2494] Fixing glob pattern to skip all hoodie meta paths (#3768 )	2021-10-12 14:06:40 -04:00
Ilias Antoniou	ceace1c653	[HUDI-2496] Insert duplicate records when precombined is deactivated for "insert" operation (#3740 )	2021-10-10 21:33:16 -04:00
Danny Chan	ad63938890	[HUDI-2537] Fix metadata table for flink (#3774 )	2021-10-10 09:30:39 +08:00
Y Ethan Guo	2e152177fb	[HUDI-2513] Refactor table upgrade and downgrade actions in hudi-client module (#3743 )	2021-10-06 20:20:41 -04:00
Yann Byron	e91e611afb	[HUDI-2456] support 'show partitions' sql (#3693 )	2021-10-06 15:46:49 +08:00
Sivabalan Narayanan	5f32162a2f	[HUDI-2285][HUDI-2476] Metadata table synchronous design. Rebased and Squashed from pull/3426 (#3590 ) * [HUDI-2285] Adding Synchronous updates to metadata before completion of commits in data timeline. - This patch adds synchronous updates to metadata table. In other words, every write is first committed to metadata table followed by data table. While reading metadata table, we ignore any delta commits that are present only in metadata table and not in data table timeline. - Compaction of metadata table is fenced by the condition that we trigger compaction only when there are no inflight requests in data table. This ensures that all base files in metadata table are always in sync with data table (w/o any holes) and there could only be some extra invalid commits among delta log files in metadata table. - Due to this, archival of data table also fences itself up until the compacted instant in metadata table. All writes to metadata table happen within the data table lock. So, metadata table works in single-writer mode only. This might be tough to loosen since all writers write to the same FILES partition and so will result in a conflict anyway. - As part of this, have added acquiring locks in data table for those operations which did not take them before while committing (rollback, clean, compaction, cluster). To note, we were not doing any conflict resolution. All we are doing here is to commit by taking a lock, so that all writes to metadata table always come from a single writer. - Also added building block to add buckets for partitions, which will be leveraged by other indexes like record level index, etc. For now, FILES partition has only one bucket. In general, any number of buckets per partition is allowed and each partition has a fixed fileId prefix with incremental suffix for each bucket within each partition. Have fixed [HUDI-2476]. This fix is about retrying a failed compaction if it succeeded in metadata for first time, but failed w/ data table. - Enabling metadata table by default. - Adding more tests for metadata table Co-authored-by: Prashant Wason <pwason@uber.com>	2021-10-06 00:17:52 -04:00
Y Ethan Guo	46808dcb1f	[HUDI-2497] Refactor clean and restore actions in hudi-client module (#3734 )	2021-09-30 18:20:25 -04:00
Sivabalan Narayanan	f0585facd6	[HUDI-2474] Refreshing timeline for every operation in Hudi when metadata is enabled (#3698 )	2021-09-28 05:16:52 -04:00
Shawy Geng	b4b23e401b	[HUDI-2383] Clean the marker files after compaction (#3576 )	2021-09-23 15:40:58 -04:00
Sivabalan Narayanan	b8dad628e5	[HUDI-2422] Adding rollback plan and rollback requested instant (#3651 ) - This patch introduces rollback plan and rollback.requested instant. Rollback will be done in two phases, namely rollback plan and rollback action. In planning, we prepare the rollback plan and serialize it to rollback.requested. In the rollback action phase, we fetch details from the plan and just delete the files as per the plan. This will ensure final rollback commit metadata will contain all files that got rolled back even if rollback failed midway and retried again.	2021-09-16 11:16:06 -04:00
Y Ethan Guo	916f12b7dd	[HUDI-2433] Refactor rollback actions in hudi-client module (#3664 )	2021-09-15 18:52:43 -04:00
董可伦	dbcf60f370	[MINOR] fix typo (#3640 )	2021-09-11 15:45:49 +08:00
zhangyue19921010	de94787a85	[HUDI-2345] Hoodie columns sort partitioner for bulk insert (#3523 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2021-08-24 21:45:17 +08:00
Udit Mehrotra	e39d0a2f28	Keep non-conflicting names for common configs between DataSourceOptions and HoodieWriteConfig (#3511 )	2021-08-20 02:42:59 -07:00
Udit Mehrotra	c350d05dd3	Restore 0.8.0 config keys with deprecated annotation (#3506 ) Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-08-19 13:36:40 -07:00
Y Ethan Guo	23dca6c237	[HUDI-2268] Add upgrade and downgrade to and from 0.9.0 (#3470 ) - Added upgrade and downgrade step to and from 0.9.0. Upgrade adds few table properties. Downgrade recreates timeline server based marker files if any.	2021-08-14 20:20:23 -04:00
Y Ethan Guo	9056c68744	[HUDI-2305] Add MARKERS.type and fix marker-based rollback (#3472 ) - Rollback infers the directory structure and does rollback based on the strategy used while markers were written. "write markers type" in write config is used to determine marker strategy only for new writes.	2021-08-14 08:18:49 -04:00
Prashant Wason	8eed440694	[HUDI-2119] Ensure the rolled-back instance was previously synced to the Metadata Table when syncing a Rollback Instant. (#3210 ) * [HUDI-2119] Ensure the rolled-back instance was previously synced to the Metadata Table when syncing a Rollback Instant. If the rolled-back instant was synced to the Metadata Table, a corresponding deltacommit with the same timestamp should have been created on the Metadata Table timeline. To ensure we can always perform this check, the Metadata Table instants should not be archived until their corresponding instants are present in the dataset timeline. But ensuring this requires a large number of instants to be kept on the metadata table. In this change, the metadata table will keep at least the number of instants that the main dataset is keeping. If the instant being rolled back was before the metadata table timeline, the code will throw an exception and the metadata table will have to be re-bootstrapped. This should be a very rare occurrence and should occur only when the dataset is being repaired by rolling back multiple commits or restoring to a much older time. * Fixed checkstyle * Improvements from review comments. Fixed checkstyle Replaced explicit null check with Option.ofNullable Removed redundant function getSynedInstantTime * Renamed getSyncedInstantTime and getSyncedInstantTimeForReader. Sync is confusing so renamed to getUpdateTime() and getReaderTime(). * Removed getReaderTime which is only for testing as the same method can be accessed during testing differently without making it part of the public interface. * Fix compilation error * Reverting changes to HoodieMetadataFileSystemView Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-08-13 21:23:34 -07:00
Sagar Sumit	0544d70d8f	[MINOR] Deprecate older configs (#3464 ) Rename and deprecate props in HoodieWriteConfig Rename and deprecate older props	2021-08-12 20:31:04 -07:00
Prashant Wason	76bc686a77	[HUDI-1292] Created a config to enable/disable syncing of metadata table. (#3427 ) * [HUDI-1292] Created a config to enable/disable syncing of metadata table. - Metadata Table should only be synced from a single pipeline to prevent conflicts. - Skip syncing metadata table for clustering and compaction - Renamed useFileListingMetadata Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-08-12 15:45:57 -07:00
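
Two entries in this log describe timestamp handling concretely enough to sketch: [HUDI-2559] says older commit timestamps with seconds granularity are suffixed with 999 and parsed with the millisecond format, and [HUDI-1794] moves the formatter into a thread-local because SimpleDateFormat is not thread safe. The Java below is only an illustration of those two ideas together, not the code from either commit. The class name `CommitTimestamps`, the method names, and the exact pattern string `yyyyMMddHHmmssSSS` are assumptions made for this example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper for illustration; not a class from the Hudi code base.
public class CommitTimestamps {
    // Assumed millisecond-granularity pattern (17 digits).
    public static final String MILLIS_FORMAT = "yyyyMMddHHmmssSSS";
    // Length of an assumed seconds-granularity timestamp, "yyyyMMddHHmmss".
    public static final int SECS_LENGTH = 14;

    // SimpleDateFormat is not thread safe, so each thread gets its own instance.
    private static final ThreadLocal<SimpleDateFormat> FORMATTER =
        ThreadLocal.withInitial(() -> new SimpleDateFormat(MILLIS_FORMAT));

    // Suffix a seconds-granularity timestamp with 999; leave other inputs unchanged.
    public static String toMillisGranularity(String timestamp) {
        return timestamp.length() == SECS_LENGTH ? timestamp + "999" : timestamp;
    }

    // Parse either granularity using the millisecond format.
    public static Date parse(String timestamp) {
        try {
            return FORMATTER.get().parse(toMillisGranularity(timestamp));
        } catch (ParseException e) {
            throw new IllegalArgumentException("Bad commit timestamp: " + timestamp, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toMillisGranularity("20211122114438"));
        System.out.println(parse("20211122114438").equals(parse("20211122114438999")));
    }
}
```

Padding with 999 rather than 000 places an old seconds-granularity timestamp at the end of its second, so it still sorts after any millisecond timestamp that shares the same 14-digit prefix.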

185 Commits