1
0
Commit Graph

275 Commits

Author SHA1 Message Date
xiarixiaoyao
444ff496a4 [RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark (#4910)
* [HUDI-2560] introduce id_based schema to support full schema evolution.

* add test for FileBasedInternalSchemaStorageManger and rebase code

* add support for change column type and fix some test case

* fix some bugs encountered in the production env and delete useless code

* fix test error

* rebase code

* fixed some nested schema change bugs

* [HUDI-2429][Stacked On HUDI-2560]Support full schema evolution for spark

* [use dummyInternalSchema instead of null]

* add support for spark3.1.x

* remove support for spark3.1.x , sicne some compile fail

* support spark3.1.x

* rebase and prepare solve all comments

* address all comments

* rebase code

* fixed the count(*) bug

* try to get internalSchema by parser commit file/history file directly, not use metaclient which is time cost
address some comments

* fixed all comments

* fix new comments

* rebase code,fix UT failed

* fixed mistake

* rebase code ,fixed new comments

* rebase code , and prepare for address new comments

* address commits

* address new comments

* fix new issues

* control fallback original write logical
2022-04-01 13:20:24 -07:00
Danny Chan
6df14f15a3 [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC (#4880) 2022-04-01 20:46:51 +08:00
Sagar Sumit
28dafa774e [HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693)
- Add a new action called INDEX, whose state transition is described in the RFC.
- Changes in timeline to support the new action.
- Add an index planner in ScheduleIndexActionExecutor.
- Add index plan executor in RunIndexActionExecutor.
- Add 3 APIs in HoodieTableMetadataWriter; a) scheduleIndex: will generate an index plan based on latest completed instant, initialize file groups and add a requested INDEX instant, b) index: executes the index plan and also takes care of writes that happened after indexing was requested, c) dropIndex: will drop index by removing the given metadata partition.
- Add 2 new table configs to serve as the source of truth for inflight and completed indexes.
- Support upgrade/downgrade taking care of the newly added configs.
- Add tool to trigger indexing in HoodieIndexer.
- Handle corner cases related to partial failures.
- Abort gracefully after deleting partition and instant.
- Handle other actions in timeline to consider before catching up
2022-04-01 01:33:12 +05:30
ForwardXu
80011df995 [HUDI-3135] Make delete partitions lazy to be executed by the cleaner (#4489)
As of now, delete partitions will ensure all file groups are deleted, but the partition as such is not deleted. So, get all partitions might be returning the deleted partitions as well. but no data will be served since all file groups are deleted. With this patch, we are fixing it. We are letting cleaner take care of deleting the partitions when all file groups pertaining to a partitions are deleted.

- Fixed the CleanPlanActionExecutor to return meta info about list of partitions to be deleted. If there are no valid file groups for a partition, clean planner will include the partition to be deleted.
- Fixed HoodieCleanPlan avro schema to include the list of partitions to be deleted
- CleanActionExecutor is fixed to delete partitions if any (as per clean plan)
- Same info is added to HoodieCleanMetadata
- Metadata table when applying clean metadata, will check for partitions to be deleted and will update the "all_partitions" record for the deleted partitions.

Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-03-31 03:35:39 -04:00
xiarixiaoyao
2d73c8ae86 [HUDI-3355] Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy (#4962) 2022-03-30 15:54:25 -07:00
Raymond Xu
6ccbae4d2a [HUDI-2757] Implement Hudi AWS Glue sync (#5076) 2022-03-28 14:54:59 -04:00
Y Ethan Guo
eaa4c4f2e2 [HUDI-1180] Upgrade HBase to 2.4.9 (#5004)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-03-24 19:04:53 -07:00
Pratyaksh Sharma
ca0931d332 [HUDI-1436]: Provide an option to trigger clean every nth commit (#4385)
- Provided option to trigger clean every nth commit with default number of commits as 1 so that existing users are not affected.
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-03-21 20:06:30 -04:00
Y Ethan Guo
15d1c18625 [MINOR] Remove flaky assert in TestInLineFileSystem (#5069) 2022-03-20 18:58:30 -04:00
Raymond Xu
7446ff95a7 [HUDI-2439] Replace RDD with HoodieData in HoodieSparkTable and commit executors (#4856)
- Adopt HoodieData in Spark action commit executors
- Make Spark independent DeleteHelper, WriteHelper, MergeHelper in hudi-client-common
- Make HoodieTable in WriteClient APIs have raw type to decouple with Client's generic types
2022-03-17 04:17:56 -07:00
Y Ethan Guo
5ba2d9ab2f [HUDI-3494] Consider triggering condition of MOR compaction during archival (#4974) 2022-03-17 01:28:11 -04:00
Sagar Sumit
d514570e90 [HUDI-3633] Allow non-string values to be set in TypedProperties (#5045)
* [HUDI-3633] Allow non-string values to be set in TypedProperties

* Override getProperty to ignore instanceof string check
2022-03-15 22:33:22 +04:00
Alexey Kudinkin
5e8ff8d793 [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index (#4948) 2022-03-15 10:38:36 -07:00
Sagar Sumit
eee96e9af3 [HUDI-3593] Restore TypedProperties and flush checksum in table config (#5013)
Create new TypedProperties while performing clustering

Add OrderedProperties and minor refactoring

Add javadoc and remove getters from OrderedProperties
2022-03-13 07:58:55 +05:30
RexAn
93277b2bcd [HUDI-3592] Fix NPE of DefaultHoodieRecordPayload if Property is empty (#4999)
Co-authored-by: Rex An <bonean131@gmail.com>
2022-03-11 17:45:40 -05:00
苏承祥
faed6996ee [HUDI-3566] Add thread factory in BoundedInMemoryExecutor (#4926)
Co-authored-by: 苏承祥 <sucx@tuya.com>
2022-03-11 18:58:49 +08:00
MrSleeping123
8859b48b2a [HUDI-3383] Sync column comments while syncing a hive table (#4960)
Desc: Add a hive sync config(hoodie.datasource.hive_sync.sync_comment). This config defaults to false.
While syncing data source to hudi, add column comments to source avro schema, and the sync_comment is true, syncing column comments to the hive table.
2022-03-10 09:44:39 +08:00
Alexey Kudinkin
a66fd40692 [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS (#4739)
- This change makes sure MT records are updated appropriately on HDFS: previously after Log File append operations MT records were updated w/ just the size of the deltas being appended to the original files, which have been found to be the cause of issues in case of Rollbacks that were instead updating MT with records bearing the full file-size.

- To make sure that we hedge against similar issues going f/w, this PR alleviates this discrepancy and streamlines the flow of MT table always ingesting records bearing full file-sizes.
2022-03-07 15:38:27 -05:00
Aditya Tiwari
051ad0b033 [HUDI-3130] Fixing Hive getSchema for RT tables addressing different partitions having different schemas (#4468)
* Fixing Hive getSchema for RT tables

* Addressing feedback

* temp diff

* fixing tests after spark datasource read support for metadata table is merged to master

* Adding multi-partition schema evolution tests to HoodieRealTimeRecordReader

Co-authored-by: Aditya Tiwari <aditya.tiwari@flipkart.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-03-06 07:51:35 +05:30
Alexey Kudinkin
85f47b53df [HUDI-3469] Refactor HoodieTestDataGenerator to provide for reproducible Builds (#4866) 2022-03-01 22:15:26 -08:00
Bo Cui
3fdc9332e5 [HUDI-3516] Implement record iterator for HoodieDataBlock (#4909)
*  Use iterator to void eager materialization to be memory friendly
2022-03-02 10:19:36 +08:00
stayrascal
8f1e4f5b3e [HUDI-3528] Fix String convert issue and overwrite putAll method in TypedProperties.java (#4920) 2022-02-28 10:45:47 -05:00
YueZhang
742810070b [HUDI-3421]Pending clustering may break AbstractTableFileSystemView#getxxBaseFile() (#4810) 2022-02-25 16:46:27 +05:30
Alexey Kudinkin
85e8a5c4de [HUDI-1296] Support Metadata Table in Spark Datasource (#4789)
* Bootstrapping initial support for Metadata Table in Spark Datasource

- Consolidated Avro/Row conversion utilities to center around Spark's AvroDeserializer ; removed duplication
- Bootstrapped HoodieBaseRelation
- Updated HoodieMergeOnReadRDD to be able to handle Metadata Table
- Modified MOR relations to be able to read different Base File formats (Parquet, HFile)
2022-02-24 16:23:13 -05:00
YueZhang
359fbfde79 [HUDI-2648] Retry FileSystem action instead of failed directly. (#3887)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-02-20 15:31:31 -05:00
Bo Cui
83279971a1 [HUDI-3446] Supports batch reader in BootstrapOperator#loadRecords (#4837)
* [HUDI-3446] Supports batch Reader in BootstrapOperator#loadRecords
2022-02-19 21:21:48 +08:00
Sagar Sumit
ed106f671e [HUDI-2809] Introduce a checksum mechanism for validating hoodie.properties (#4712)
Fix dependency conflict

Fix repairs command

Implement putIfAbsent for DDB lock provider

Add upgrade step and validate while fetching configs

Validate checksum for latest table version only while fetching config

Move generateChecksum to BinaryUtil

Rebase and resolve conflict

Fix table version check
2022-02-18 10:17:06 +05:30
Y Ethan Guo
9a05940a74 [HUDI-3366] Remove hardcoded logic of disabling metadata table in tests (#4792) 2022-02-15 16:41:47 -05:00
Y Ethan Guo
b8601a9f58 [HUDI-2656] Generalize HoodieIndex for flexible record data type (#3893)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-02-03 20:24:04 -08:00
Alexey Kudinkin
819e8018ff [HUDI-3322][HUDI-3343] Fixing Metadata Table Records Duplication Issues (#4716)
This change is addressing issues in regards to Metadata Table observing ingesting duplicated records leading to it persisting incorrect file-sizes for the files referred to in those records.

There are multiple issues that were leading to that:

- [HUDI-3322] Incorrect Rollback Plan generation: Rollback Plan generated for MOR tables was overly expansively listing all log-files with the latest base-instant as the ones that have been affected by the rollback, leading to invalid MT records being ingested referring to those.
- [HUDI-3343] Metadata Table including Uncommitted Log Files during Bootstrap: Since MT is bootstrapped at the end of the commit operation execution (after FS activity, but before committing to the timeline), it was actually incorrectly ingesting some files that were part of the intermediate state of the operation being committed.

This change will unblock Stack of PRs based off #4556
2022-02-02 16:10:51 -05:00
Alexey Kudinkin
a68e1dc2db [HUDI-431] Adding support for Parquet in MOR LogBlocks (#4333)
- Adding support for Parquet in MOR tables Log blocks

Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2022-02-02 14:35:05 -05:00
Raymond Xu
caef3d5c58 [HUDI-3330] Remove fixture test tables for multi writer tests (#4704) 2022-02-02 17:50:10 +05:30
董可伦
56cd8ffae0 [HUDI-2837] Add support for using database name in incremental query (#4083) 2022-01-22 22:11:27 -08:00
Alexey Kudinkin
4bea758738 [HUDI-3191] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex (#4531) 2022-01-18 14:54:51 -08:00
0x574C
28b3b6ad8f [MINOR] Remove org.apache.directory.api.util.Strings import (#4601) 2022-01-16 16:58:18 +08:00
Y Ethan Guo
7d163ee3de [MINOR] Fix local flaky test in TestFSUtils (#4596)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-01-13 22:48:57 -08:00
Sagar Sumit
195dac90fa [MINOR] Disable flaky tests to unlock CI (#4592) 2022-01-13 19:43:27 -08:00
Y Ethan Guo
397795c7d0 [HUDI-3007] Fix issues in HoodieRepairTool (#4564) 2022-01-12 09:03:27 -08:00
Sivabalan Narayanan
7a8b94c82d [HUDI-3180] Include files from completed commits while bootstrapping metadata table (#4519) 2022-01-10 15:33:15 -05:00
YueZhang
cf362fb2d5 [MINOR] Fix some code style issues based on check-style plugin (#4532)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-01-09 01:14:56 -08:00
Sagar Sumit
827549949c [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator (#4203)
* [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator

Timestampbased key generator was returning diff values for row writer and non row writer path. this patch fixes it and is guarded by a config flag (`hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled`)
2022-01-08 10:22:44 -05:00
Sagar Sumit
518488c633 [HUDI-3185] HoodieConfig#getBoolean should return false when default not set (#4536)
Remove unnecessary config
2022-01-07 16:20:11 -05:00
董可伦
436becf3ea [HUDI-2675] Fix the exception 'Not an Avro data file' when archive and clean (#4016) 2021-12-29 22:53:17 -05:00
Danny Chan
7b07aac286 [HUDI-3101] Excluding compaction instants from pending rollback info (#4443) 2021-12-25 14:10:45 +08:00
Raymond Xu
32a44bbe06 [HUDI-2970] Add test for archiving replace commit (#4345) 2021-12-21 00:01:59 -05:00
Danny Chan
ea2eba1a55 [HUDI-3015] Implement #reset and #sync for metadata filesystem view (#4307) 2021-12-16 15:26:16 +08:00
Alexey Kudinkin
2d864f7524 [HUDI-2814] Make Z-index more generic Column-Stats Index (#4106) 2021-12-10 14:56:09 -08:00
zhangyue19921010
3ba2909690 [HUDI-2892][BUG] Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results (#4172)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-12-10 09:57:01 -08:00
Sivabalan Narayanan
be368264f4 [HUDI-2952] Fixing metadata table for non-partitioned dataset (#4243) 2021-12-10 11:11:42 -05:00
leesf
5ac9ce7289 [MINOR] Fix Compile broken (#4263) 2021-12-09 13:12:18 +08:00