lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
liujinhui	126b88b48d	[HUDI-2150] Rename/Restructure configs for better modularity (#6061 ) - Move clean related configuration to HoodieCleanConfig - Move Archival related configuration to HoodieArchivalConfig - hoodie.compaction.payload.class move this to HoodiePayloadConfig	2022-07-09 20:00:48 +05:30
miomiocat	397fd30142	[HUDI-3984] Remove mandatory check of partition path for cli command (#5458 )	2022-06-30 10:00:13 -07:00
YueZhang	e71f04768e	[MINOR] Make CLI 'commit rollback' using rollbackUsingMarkers false as default (#5174 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-06-29 10:12:46 -07:00
Alexey Kudinkin	c86edfc28e	[HUDI-4319] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting (#5966 ) * Fixed Dictionary encoding config not being properly propagated to Parquet writer (making it unable to apply it, substantially bloating the storage footprint)	2022-06-24 23:52:28 -04:00
Shawn Chang	5c204f1416	[HUDI-4177] Fix hudi-cli rollback with rollbackUsingMarkers method call (#5734 ) * Fix hudi-cli rollback with rollbackUsingMarkers method call * Add test for hudi-cli rollbackUsingMarkers Co-authored-by: Shawn Chang <yxchang@amazon.com>	2022-06-21 10:54:12 +08:00
Forus	ba4d5bd847	[HUDI-4251] Fix the problem that the command 'commits sync' description does not match. (#5881 )	2022-06-20 16:03:58 -07:00
ForwardXu	c5c4cfec91	[HUDI-3507] Support export command based on Call Produce Command (#5901 )	2022-06-19 18:48:22 +08:00
Shiyan Xu	5aaac21d1d	[HUDI-4224] Fix CI issues (#5842 ) - Upgrade junit to 5.7.2 - Downgrade surefire and failsafe to 2.22.2 - Fix test failures that were previously not reported - Improve azure pipeline configs Co-authored-by: liujinhui1994 <965147871@qq.com> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2022-06-12 11:44:18 -07:00
Danny Chan	7f8630cc57	[HUDI-4167] Remove the timeline refresh with initializing hoodie table (#5716 ) The timeline refresh on table initialization invokes the fs view #sync, which has two actions now: 1. reload the timeline of the fs view, so that the next fs view request is based on this timeline metadata 2. if this is a local fs view, clear all the local states; if this is a remote fs view, send request to sync the remote fs view But, looking at the construction, the meta client is instantiated freshly so the timeline is already the latest, the table is also constructed freshly, so the fs view has no local states, which means the #sync is entirely unnecessary. In this patch, the metadata lifecycle and data set fs view are kept in sync, when the fs view is refreshed, the underlying metadata is also refreshed synchronously. The freshness of the metadata follows the same rules as data fs view: 1. if the fs view is local, the visibility is based on the client table metadata client's latest commit 2. if the fs view is remote, the timeline server would #sync the fs view and metadata together based on the lagging server local timeline From the perspective of the client, there is no need to care about the refresh action anymore, no matter whether the metadata table is enabled or not. That makes the client logic clearer and less error-prone. Removing the timeline refresh has another benefit: it avoids unnecessary #refresh of the remote fs view; if all the clients send requests to #sync the remote fs view, the server would encounter conflicts and the client would encounter a response error.	2022-06-02 09:48:48 +08:00
luoyajun	f30b3aef3e	[MINOR] Fix a potential NPE and some finer points of hudi cli (#5656 )	2022-05-24 11:13:18 -07:00
Heap	47b764ec33	[HUDI-4134] Fix Method naming consistency issues in FSUtils (#5655 )	2022-05-23 15:28:48 -07:00
Alexey Kudinkin	4a8589f222	[HUDI-4038] Avoid calling `getDataSize` after every record written (#5497 ) - getDataSize has non-trivial overhead in the current ParquetWriter impl, requiring traversal of already composed Column Groups in memory. Instead we can sample these calls to getDataSize to amortize its cost. Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-05-11 08:08:31 -04:00
Danny Chan	0281725c6b	[MINOR] Inline the partition path logic into the builder (#5310 )	2022-04-13 16:54:39 +05:30
Alexey Kudinkin	7a9d48d126	[HUDI-3834] Fixing performance hits in reading Column Stats Index (#5266 ) Fixing performance hits in reading Column Stats Index: [HUDI-3834] There's substantial performance degradation in Avro 1.10 default generated Builder classes: they by default rely on SpecificData.getForSchema that load corresponding model's class using reflection, which takes a hit when executed on the hot-path (this was bringing overall runtime to read full Column Stats Index of 800k records to 60s, whereas now it's taking mere 3s) Addressing memory churn by over-used Hadoop's Path creation: Path ctor is not a lightweight sequence and produces quite a bit of memory churn adding pressure on GC. Cleaning such avoidable allocations up to make sure there's no unnecessarily added pressure on GC.	2022-04-10 13:42:06 -04:00
Sagar Sumit	df87095ef0	[HUDI-3454] Fix partition name in all code paths for LogRecordScanner (#5252 ) * Depend on FSUtils#getRelativePartitionPath(basePath, logFilePath.getParent) to get the partition. * If the list of log file paths in the split is empty, then fallback to usual behaviour.	2022-04-08 09:59:36 +05:30
Alexey Kudinkin	9e87d164b3	[HUDI-3760] Adding capability to fetch Metadata Records by prefix (#5208 ) - Adding capability to fetch Metadata Records by key prefix so that Data Skipping could fetch only Column Stats Index records pertaining to the columns being queried by, instead of reading out whole Index. - Fixed usages of HFileScanner in HFileReader. A few code paths use the cached scanner if available. Other code paths use their own HFileScanner w/ positional read. Brief change log - Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading t/h Spark DS - Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader - Wiring key-prefix lookup t/h LogRecordScanner impls - Cleaning up HoodieHFileReader impl Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-06 12:11:08 -04:00
Yann Byron	3195f51562	[HUDI-3748] write and select hudi table when enable hoodie.datasource.write.drop.partition.columns (#5201 )	2022-04-05 01:31:41 -07:00
Prashant Wason	b28f0d6ceb	[HUDI-3290] Different file formats for the partition metadata file. (#5179 ) * [HUDI-3290] Different file formats for the partition metadata file. Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition. Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset. For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting. * Code review comments - Adding a new command to migrate from text to base file formats for meta file. - Reimplementing readFromFS() to first read the text format, then base format - Avoid extra exists() checks in readFromFS() - Added unit tests, enabled parquet format across hoodie-hadoop-mr - Code cleanup, restructuring, naming consistency. * Wiring in all the other Spark code paths to respect this config - Turned on parquet meta format for COW data source tests - Removed the deltastreamer command line to keep it shorter * populate HoodiePartitionMetadata#format after readFromFS() Co-authored-by: Vinoth Chandar <vinoth@apache.org> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-04 08:08:20 -07:00
Alexey Kudinkin	cc3737be50	[HUDI-3664] Fixing Column Stats Index composition (#5181 ) Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-02 17:15:52 -07:00
Sagar Sumit	28dafa774e	[HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693 ) - Add a new action called INDEX, whose state transition is described in the RFC. - Changes in timeline to support the new action. - Add an index planner in ScheduleIndexActionExecutor. - Add index plan executor in RunIndexActionExecutor. - Add 3 APIs in HoodieTableMetadataWriter; a) scheduleIndex: will generate an index plan based on latest completed instant, initialize file groups and add a requested INDEX instant, b) index: executes the index plan and also takes care of writes that happened after indexing was requested, c) dropIndex: will drop index by removing the given metadata partition. - Add 2 new table configs to serve as the source of truth for inflight and completed indexes. - Support upgrade/downgrade taking care of the newly added configs. - Add tool to trigger indexing in HoodieIndexer. - Handle corner cases related to partial failures. - Abort gracefully after deleting partition and instant. - Handle other actions in timeline to consider before catching up	2022-04-01 01:33:12 +05:30
Sivabalan Narayanan	3cdb590e15	[HUDI-3733] Adding HoodieFailedWritesCleaningPolicy for restore with hudi-cli (#5158 ) * Adding HoodieFailedWritesCleaningPolicy for restore with hudi-cli Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-03-31 13:00:49 +05:30
YueZhang	2dbb273d26	[HUDI-3721] Delete MDT if necessary when trigger rollback to savepoint (#5173 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-03-30 20:26:37 -07:00
Y Ethan Guo	484b3407e0	[HUDI-3604] Adjust the order of timeline changes in rollbacks (#5114 )	2022-03-26 22:37:44 -07:00
ForwardXu	6c8224cae6	[HUDI-3501] Support savepoints command based on Call Produce Command (#5025 )	2022-03-13 16:58:21 +04:00
YueZhang	53826d69e4	[HUDI-2747] support set --sparkMaster for MDT cli (#4964 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-03-07 16:57:03 -05:00
Alexey Kudinkin	85f47b53df	[HUDI-3469] Refactor `HoodieTestDataGenerator` to provide for reproducible Builds (#4866 )	2022-03-01 22:15:26 -08:00
Bo Cui	3fdc9332e5	[HUDI-3516] Implement record iterator for HoodieDataBlock (#4909 ) * Use iterator to avoid eager materialization to be memory friendly	2022-03-02 10:19:36 +08:00
ForwardXu	a81a6326d5	[HUDI-3441] Add support for "marker delete" in hudi-cli (#4922 )	2022-03-01 16:03:53 +08:00
wenningd	18dc89cf79	[HUDI-3450] Avoid passing empty string spark master to hudi cli (#4844 ) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2022-02-28 11:37:24 -05:00
YueZhang	3694485609	[HUDI-3429] Support clustering scheduleAndExecute for hudi-cli and add clustering-cli Tests (#4817 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-02-24 23:28:38 -05:00
Pratyaksh Sharma	bf16bc122a	[HUDI-349]: Added new cleaning policy based on number of hours (#3646 )	2022-02-21 09:04:42 -05:00
YueZhang	359fbfde79	[HUDI-2648] Retry FileSystem action instead of failed directly. (#3887 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-02-20 15:31:31 -05:00
Sagar Sumit	ed106f671e	[HUDI-2809] Introduce a checksum mechanism for validating hoodie.properties (#4712 ) Fix dependency conflict Fix repairs command Implement putIfAbsent for DDB lock provider Add upgrade step and validate while fetching configs Validate checksum for latest table version only while fetching config Move generateChecksum to BinaryUtil Rebase and resolve conflict Fix table version check	2022-02-18 10:17:06 +05:30
Y Ethan Guo	9a05940a74	[HUDI-3366] Remove hardcoded logic of disabling metadata table in tests (#4792 )	2022-02-15 16:41:47 -05:00
Raymond Xu	27bd7b538e	[HUDI-1576] Make archiving an async service (#4795 )	2022-02-14 21:15:06 -05:00
Y Ethan Guo	6aba00e84f	[MINOR] Fix typos in Spark client related classes (#4781 )	2022-02-13 06:41:58 -08:00
satishkotha	89ed6f062e	[HUDI-3362] Fix restore to rollback pending clustering operations followed by rolling back other commits (#4772 )	2022-02-11 14:12:45 -05:00
wenningd	1c778590d1	[HUDI-3395] Allow pass rollbackUsingMarkers to Hudi CLI rollback command (#4557 ) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2022-02-10 09:41:22 -05:00
YueZhang	de206acbae	[HUDI-3369] New ScheduleAndExecute mode for HoodieCompactor and hudi-cli (#4750 ) Schedule and execute compaction plan in one single mode.	2022-02-07 15:01:34 +05:30
Alexey Kudinkin	a68e1dc2db	[HUDI-431] Adding support for Parquet in MOR `LogBlock`s (#4333 ) - Adding support for Parquet in MOR tables Log blocks Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>	2022-02-02 14:35:05 -05:00
peanut-chenzhong	c0e8b03d93	[HUDI-1977] Fix Hudi CLI tempview query issue (#4626 )	2022-01-29 10:39:08 +08:00
Raymond Xu	0bd38f26ca	[HUDI-2596] Make class names consistent in hudi-client (#4680 )	2022-01-27 17:05:08 -08:00
YueZhang	b2b23f5d3a	[HUDI-3183] Wrong result of HoodieArchivedTimeline loadInstants with TimeRangeFilter (#4521 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-01-06 21:16:29 -05:00
Sivabalan Narayanan	2954027b92	[HUDI-52] Enabling savepoint and restore for MOR table (#4507 ) * Enabling restore for MOR table * Fixing savepoint for compaction commits in MOR	2022-01-06 21:26:08 +05:30
Aimiyoo	57f43de1ea	[MINOR] Fix DedupeSparkJob typo (#4418 )	2021-12-22 11:51:26 -08:00
Sivabalan Narayanan	3ce0526924	Adding verbose output for metadata validate files command (#4166 )	2021-12-10 09:38:38 -08:00
Manoj Govindassamy	445208a0d2	[HUDI-2845] Metadata CLI - files/partition file listing fix and new validate option (#4092 ) - Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>	2021-11-26 16:44:16 -05:00
Y Ethan Guo	d1e83e4ba0	[HUDI-2767] Enabling timeline-server-based marker as default (#4112 ) - Changes the default config of marker type (HoodieWriteConfig.MARKERS_TYPE or hoodie.write.markers.type) from DIRECT to TIMELINE_SERVER_BASED for Spark Engine. - Adds engine-specific marker type configs: Spark -> TIMELINE_SERVER_BASED, Flink -> DIRECT, Java -> DIRECT. - Uses DIRECT markers as well for Spark structured streaming due to timeline server only available for the first mini-batch. - Fixes the marker creation method for non-partitioned table in TimelineServerBasedWriteMarkers. - Adds the fallback to direct markers even when TIMELINE_SERVER_BASED is configured, in WriteMarkersFactory: when HDFS is used, or embedded timeline server is disabled, the fallback to direct markers happens. - Fixes the closing of timeline service. - Fixes tests that depend on markers, mainly by starting the timeline service for each test.	2021-11-26 16:41:05 -05:00
Manoj Govindassamy	3d75aca40d	[HUDI-2850] Fixing Clustering CLI - schedule and run command fixes to avoid NumberFormatException (#4101 )	2021-11-26 07:17:23 -05:00
Alexey Kudinkin	6f5d8d04cd	[HUDI-2840] Fixed DeltaStreamer to properly respect configuration passed t/h properties file (#4090 ) * Rebased `DFSPropertiesConfiguration` to access Hadoop config in lieu of FS to avoid confusion * Fixed `readConfig` to take Hadoop's `Configuration` instead of FS; Fixing usages * Added test for local FS access * Rebase to use `FSUtils.getFs` * Combine properties provided as a file along w/ overrides provided from the CLI * Added helper utilities to `HoodieClusteringConfig`; Make sure corresponding config methods fallback to defaults; * Fixed DeltaStreamer usage to respect properly combined configuration; Abstracted `HoodieClusteringConfig.from` convenience utility to init Clustering config from `Properties` * Tidying up * `lint` * Reverting changes to `HoodieWriteConfig` * Tidying up * Fixed incorrect merge of the props * Converted `HoodieConfig` to wrap around `Properties` into `TypedProperties` * Fixed compilation * Fixed compilation	2021-11-25 14:48:22 -08:00

210 Commits