lanyuanxiaoyao/hudi - hudi - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Alexey Kudinkin	9e87d164b3	[HUDI-3760] Adding capability to fetch Metadata Records by prefix (#5208 ) - Adding capability to fetch Metadata Records by key prefix so that Data Skipping could fetch only Column Stats - Index records pertaining to the columns being queried by, instead of reading out whole Index. - Fixed usages of HFileScanner in HFileReader. few code paths uses cached scanner if available. Other code paths uses its own HFileScanner w/ positional read. Brief change log - Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading t/h Spark DS - Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader - Wiring key-prefix lookup t/h LogRecordScanner impls - Cleaning up HoodieHFileReader impl Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-06 12:11:08 -04:00
Raymond Xu	e96f08f355	Moving to 0.12.0-SNAPSHOT on master branch.	2022-04-06 15:24:10 +08:00
Yann Byron	3195f51562	[HUDI-3748] write and select hudi table when enable hoodie.datasource.write.drop.partition.columns (#5201 )	2022-04-05 01:31:41 -07:00
Prashant Wason	b28f0d6ceb	[HUDI-3290] Different file formats for the partition metadata file. (#5179 ) * [HUDI-3290] Different file formats for the partition metadata file. Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition. Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset. For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting. * Code review comments - Adding a new command to migrate from text to base file formats for meta file. - Reimplementing readFromFS() to first read the text format, then base format - Avoid extra exists() checks in readFromFS() - Added unit tests, enabled parquet format across hoodie-hadoop-mr - Code cleanup, restructuring, naming consistency. * Wiring in all the other Spark code paths to respect this config - Turned on parquet meta format for COW data source tests - Removed the deltastreamer command line to keep it shorter * populate HoodiePartitionMetadata#format after readFromFS() Co-authored-by: Vinoth Chandar <vinoth@apache.org> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-04 08:08:20 -07:00
Alexey Kudinkin	cc3737be50	[HUDI-3664] Fixing Column Stats Index composition (#5181 ) Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-02 17:15:52 -07:00
Sagar Sumit	28dafa774e	[HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693 ) - Add a new action called INDEX, whose state transition is described in the RFC. - Changes in timeline to support the new action. - Add an index planner in ScheduleIndexActionExecutor. - Add index plan executor in RunIndexActionExecutor. - Add 3 APIs in HoodieTableMetadataWriter; a) scheduleIndex: will generate an index plan based on latest completed instant, initialize file groups and add a requested INDEX instant, b) index: executes the index plan and also takes care of writes that happened after indexing was requested, c) dropIndex: will drop index by removing the given metadata partition. - Add 2 new table configs to serve as the source of truth for inflight and completed indexes. - Support upgrade/downgrade taking care of the newly added configs. - Add tool to trigger indexing in HoodieIndexer. - Handle corner cases related to partial failures. - Abort gracefully after deleting partition and instant. - Handle other actions in timeline to consider before catching up	2022-04-01 01:33:12 +05:30
Sivabalan Narayanan	3cdb590e15	[HUDI-3733] Adding HoodieFailedWritesCleaningPolicy for restore with hudi-cli (#5158 ) * Adding HoodieFailedWritesCleaningPolicy for restore with hudi-cli Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-03-31 13:00:49 +05:30
YueZhang	2dbb273d26	[HUDI-3721] Delete MDT if necessary when trigger rollback to savepoint (#5173 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-03-30 20:26:37 -07:00
Alexey Kudinkin	e5a2baeed0	[HUDI-3549] Removing dependency on "spark-avro" (#4955 ) Hudi will be taking on promise for its bundles to stay compatible with Spark minor versions (for ex 2.4, 3.1, 3.2): meaning that single build of Hudi (for ex "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc) To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root-cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR). Instead of bundling "spark-avro" as dependency, we will be copying over some of the classes Hudi depends on and maintain them along the Hudi code-base to make sure we're able to provide for the aforementioned guarantee. To workaround arising compatibility issues we will be applying local patches to guarantee compatibility of Hudi bundles w/in the Spark minor version branches. Following Hudi modules to Spark minor branches is currently maintained: "hudi-spark3" -> 3.2.x "hudi-spark3.1.x" -> 3.1.x "hudi-spark2" -> 2.4.x Following classes hierarchies (borrowed from "spark-avro") are maintained w/in these Spark-specific modules to guarantee compatibility with respective minor version branches: AvroSerializer AvroDeserializer AvroUtils Each of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules. SchemaConverters class in turn is shared across all those modules given its relative stability (there're only cosmetic changes from 2.4.4 to 3.2.1). All of the aforementioned classes have their corresponding scope of visibility limited to corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure broader code-base does not become dependent on them and instead relies on facades abstracting them. Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in aforementioned minor versions branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI. Brief change log: - Removing spark-avro bundling from Hudi by default - Scaffolded Spark 3.2.x hierarchy - Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy - Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy - Moved ExpressionCodeGen,ExpressionPayload into hudi-spark module - Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0 - Modified bot.yml to build full matrix of support Spark versions - Removed "spark-avro" dependency from all modules - Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.	2022-03-29 14:44:47 -04:00
Y Ethan Guo	484b3407e0	[HUDI-3604] Adjust the order of timeline changes in rollbacks (#5114 )	2022-03-26 22:37:44 -07:00
ForwardXu	6c8224cae6	[HUDI-3501] Support savepoints command based on Call Produce Command (#5025 )	2022-03-13 16:58:21 +04:00
YueZhang	53826d69e4	[HUDI-2747] support set --sparkMaster for MDT cli (#4964 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-03-07 16:57:03 -05:00
Alexey Kudinkin	85f47b53df	[HUDI-3469] Refactor `HoodieTestDataGenerator` to provide for reproducible Builds (#4866 )	2022-03-01 22:15:26 -08:00
Bo Cui	3fdc9332e5	[HUDI-3516] Implement record iterator for HoodieDataBlock (#4909 ) * Use iterator to avoid eager materialization to be memory friendly	2022-03-02 10:19:36 +08:00
ForwardXu	a81a6326d5	[HUDI-3441] Add support for "marker delete" in hudi-cli (#4922 )	2022-03-01 16:03:53 +08:00
wenningd	18dc89cf79	[HUDI-3450] Avoid passing empty string spark master to hudi cli (#4844 ) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2022-02-28 11:37:24 -05:00
YueZhang	3694485609	[HUDI-3429] Support clustering scheduleAndExecute for hudi-cli and add clustering-cli Tests (#4817 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-02-24 23:28:38 -05:00
Pratyaksh Sharma	bf16bc122a	[HUDI-349]: Added new cleaning policy based on number of hours (#3646 )	2022-02-21 09:04:42 -05:00
YueZhang	359fbfde79	[HUDI-2648] Retry FileSystem action instead of failed directly. (#3887 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-02-20 15:31:31 -05:00
Sagar Sumit	ed106f671e	[HUDI-2809] Introduce a checksum mechanism for validating hoodie.properties (#4712 ) Fix dependency conflict Fix repairs command Implement putIfAbsent for DDB lock provider Add upgrade step and validate while fetching configs Validate checksum for latest table version only while fetching config Move generateChecksum to BinaryUtil Rebase and resolve conflict Fix table version check	2022-02-18 10:17:06 +05:30
Y Ethan Guo	9a05940a74	[HUDI-3366] Remove hardcoded logic of disabling metadata table in tests (#4792 )	2022-02-15 16:41:47 -05:00
Raymond Xu	27bd7b538e	[HUDI-1576] Make archiving an async service (#4795 )	2022-02-14 21:15:06 -05:00
Y Ethan Guo	6aba00e84f	[MINOR] Fix typos in Spark client related classes (#4781 )	2022-02-13 06:41:58 -08:00
satishkotha	89ed6f062e	[HUDI-3362] Fix restore to rollback pending clustering operations followed by other rolling back other commits (#4772 )	2022-02-11 14:12:45 -05:00
wenningd	1c778590d1	[HUDI-3395] Allow pass rollbackUsingMarkers to Hudi CLI rollback command (#4557 ) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2022-02-10 09:41:22 -05:00
YueZhang	de206acbae	[HUDI-3369] New ScheduleAndExecute mode for HoodieCompactor and hudi-cli (#4750 ) Schedule and execute compaction plan in one single mode.	2022-02-07 15:01:34 +05:30
Alexey Kudinkin	a68e1dc2db	[HUDI-431] Adding support for Parquet in MOR `LogBlock`s (#4333 ) - Adding support for Parquet in MOR tables Log blocks Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>	2022-02-02 14:35:05 -05:00
peanut-chenzhong	c0e8b03d93	[HUDI-1977] Fix Hudi CLI tempview query issue (#4626 )	2022-01-29 10:39:08 +08:00
Raymond Xu	0bd38f26ca	[HUDI-2596] Make class names consistent in hudi-client (#4680 )	2022-01-27 17:05:08 -08:00
YueZhang	b2b23f5d3a	[HUDI-3183] Wrong result of HoodieArchivedTimeline loadInstants with TimeRangeFilter (#4521 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-01-06 21:16:29 -05:00
Sivabalan Narayanan	2954027b92	[HUDI-52] Enabling savepoint and restore for MOR table (#4507 ) * Enabling restore for MOR table * Fixing savepoint for compaction commits in MOR	2022-01-06 21:26:08 +05:30
Aimiyoo	57f43de1ea	[MINOR] Fix DedupeSparkJob typo (#4418 )	2021-12-22 11:51:26 -08:00
Sivabalan Narayanan	3ce0526924	Adding verbose output for metadata validate files command (#4166 )	2021-12-10 09:38:38 -08:00
yuzhao.cyz	a1d0ff4209	Moving to 0.11.0-SNAPSHOT on master branch.	2021-11-27 17:22:10 +08:00
huleilei	8402cac407	[HUDI-2848] Exclude guava from hudi-cli pom (#4100 )	2021-11-26 16:56:03 -05:00
Manoj Govindassamy	445208a0d2	[HUDI-2845] Metadata CLI - files/partition file listing fix and new validate option (#4092 ) - Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>	2021-11-26 16:44:16 -05:00
Y Ethan Guo	d1e83e4ba0	[HUDI-2767] Enabling timeline-server-based marker as default (#4112 ) - Changes the default config of marker type (HoodieWriteConfig.MARKERS_TYPE or hoodie.write.markers.type) from DIRECT to TIMELINE_SERVER_BASED for Spark Engine. - Adds engine-specific marker type configs: Spark -> TIMELINE_SERVER_BASED, Flink -> DIRECT, Java -> DIRECT. - Uses DIRECT markers as well for Spark structured streaming due to timeline server only available for the first mini-batch. - Fixes the marker creation method for non-partitioned table in TimelineServerBasedWriteMarkers. - Adds the fallback to direct markers even when TIMELINE_SERVER_BASED is configured, in WriteMarkersFactory: when HDFS is used, or embedded timeline server is disabled, the fallback to direct markers happens. - Fixes the closing of timeline service. - Fixes tests that depend on markers, mainly by starting the timeline service for each test.	2021-11-26 16:41:05 -05:00
Manoj Govindassamy	3d75aca40d	[HUDI-2850] Fixing Clustering CLI - schedule and run command fixes to avoid NumberFormatException (#4101 )	2021-11-26 07:17:23 -05:00
Alexey Kudinkin	6f5d8d04cd	[HUDI-2840] Fixed DeltaStreamer to properly respect configuration passed t/h properties file (#4090 ) * Rebased `DFSPropertiesConfiguration` to access Hadoop config in lieu of FS to avoid confusion * Fixed `readConfig` to take Hadoop's `Configuration` instead of FS; Fixing usages * Added test for local FS access * Rebase to use `FSUtils.getFs` * Combine properties provided as a file along w/ overrides provided from the CLI * Added helper utilities to `HoodieClusteringConfig`; Make sure corresponding config methods fallback to defaults; * Fixed DeltaStreamer usage to respect properly combined configuration; Abstracted `HoodieClusteringConfig.from` convenience utility to init Clustering config from `Properties` * Tidying up * `lint` * Reverting changes to `HoodieWriteConfig` * Tidying up * Fixed incorrect merge of the props * Converted `HoodieConfig` to wrap around `Properties` into `TypedProperties` * Fixed compilation * Fixed compilation	2021-11-25 14:48:22 -08:00
Sivabalan Narayanan	fc9ca6a07a	[HUDI-2559] Converting commit timestamp format to millisecs (#4024 ) - Adds support for generating commit timestamps with millisecs granularity. - Older commit timestamps (in secs granularity) will be suffixed with 999 and parsed with millisecs format.	2021-11-22 11:44:38 -05:00
vinoth chandar	ae0c67d9fc	[HUDI-2795] Add mechanism to safely update,delete and recover table properties (#4038 ) * [HUDI-2795] Add mechanism to safely update,delete and recover table properties - Fail safe mechanism, that lets queries succeed off a backup file - Readers who are not upgraded to this version of code will just fail until recovery is done. - Added unit tests that exercises all these scenarios. - Adding CLI for recovery, updation to table command. - [Pending] Add some hash based verification to ensure any rare partial writes for HDFS * Fixing upgrade/downgrade infrastructure to use new updation method	2021-11-20 08:07:40 -08:00
wenningd	24def0b30d	[HUDI-2362] Add external config file support (#3416 ) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2021-11-18 01:59:26 -08:00
Sivabalan Narayanan	ce7d233307	[HUDI-2151] Part3 Enabling marker based rollback as default rollback strategy (#3950 ) * Enabling timeline server based markers * Enabling timeline server based markers and marker based rollback * Removing constraint that timeline server can be enabled only for hdfs * Fixing tests	2021-11-17 11:51:28 +05:30
Prashant Wason	b7ee341e14	[HUDI-1794] Moved static COMMIT_FORMATTER to thread local variable as SimpleDateFormat is not thread safe. (#2819 )	2021-11-05 09:31:42 -04:00
董可伦	48a3906ccc	[MINOR] Fix typo,'paritition' corrected to 'partition' (#3764 )	2021-10-11 14:07:34 -04:00
Y Ethan Guo	2e152177fb	[HUDI-2513] Refactor table upgrade and downgrade actions in hudi-client module (#3743 )	2021-10-06 20:20:41 -04:00
Sivabalan Narayanan	5f32162a2f	[HUDI-2285][HUDI-2476] Metadata table synchronous design. Rebased and Squashed from pull/3426 (#3590 ) * [HUDI-2285] Adding Synchronous updates to metadata before completion of commits in data timeline. - This patch adds synchronous updates to metadata table. In other words, every write is first committed to metadata table followed by data table. While reading metadata table, we ignore any delta commits that are present only in metadata table and not in data table timeline. - Compaction of metadata table is fenced by the condition that we trigger compaction only when there are no inflight requests in datatable. This ensures that all base files in metadata table is always in sync with data table(w/o any holes) and only there could be some extra invalid commits among delta log files in metadata table. - Due to this, archival of data table also fences itself up until compacted instant in metadata table. All writes to metadata table happens within the datatable lock. So, metadata table works in one writer mode only. This might be tough to loosen since all writers write to same FILES partition and so, will result in a conflict anyways. - As part of this, have added acquiring locks in data table for those operations which were not before while committing (rollback, clean, compaction, cluster). To note, we were not doing any conflict resolution. All we are doing here is to commit by taking a lock. So that all writes to metadata table is always a single writer. - Also added building block to add buckets for partitions, which will be leveraged by other indexes like record level index, etc. For now, FILES partition has only one bucket. In general, any number of buckets per partition is allowed and each partition has a fixed fileId prefix with incremental suffix for each bucket within each partition. Have fixed [HUDI-2476]. This fix is about retrying a failed compaction if it succeeded in metadata for first time, but failed w/ data table. - Enabling metadata table by default. - Adding more tests for metadata table Co-authored-by: Prashant Wason <pwason@uber.com>	2021-10-06 00:17:52 -04:00
Carl-Zhou-CN	aa546554ff	[HUDI-2451] On windows client with hdfs server for wrong file separator (#3687 ) Co-authored-by: yao.zhou <yao.zhou@linkflowtech.com>	2021-09-26 21:51:27 +08:00
liujinhui	eb5e7eec0a	MINOR_CHECKSTYLE (#3616 ) Fix checkstyle	2021-09-07 18:19:39 +08:00
Raymond Xu	cf002b6918	[HUDI-2079] Make CLI command tests functional (#3601 ) Make all tests in org.apache.hudi.cli.commands extend org.apache.hudi.cli.functional.CLIFunctionalTestHarness and tag as "functional". This also resolves a blocker where DFS init consistently failed when moving to ubuntu 18.04	2021-09-06 15:53:53 -07:00

224 Commits