1
0
Commit Graph

423 Commits

Author SHA1 Message Date
Raymond Xu
e96f08f355 Moving to 0.12.0-SNAPSHOT on master branch. 2022-04-06 15:24:10 +08:00
Sagar Sumit
898be6174a [HUDI-3782] Fixing table config when any of the index is disabled (#5222) 2022-04-05 23:06:52 -04:00
Prashant Wason
b28f0d6ceb [HUDI-3290] Different file formats for the partition metadata file. (#5179)
* [HUDI-3290] Different file formats for the partition metadata file.

Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition.

Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset.

For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting.

* Code review comments

- Adding a new command to migrate from text to base file formats for meta file.
- Reimplementing readFromFS() to first read the text format, then base format
- Avoid extra exists() checks in readFromFS()
- Added unit tests, enabled parquet format across hoodie-hadoop-mr
- Code cleanup, restructuring, naming consistency.

* Wiring in all the other Spark code paths to respect this config

 - Turned on parquet meta format for COW data source tests
 - Removed the deltastreamer command line to keep it shorter

* populate HoodiePartitionMetadata#format after readFromFS()

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-04 08:08:20 -07:00
Y Ethan Guo
c34eb07598 [MINOR] Reuse deleteMetadataTable for disabling metadata table (#5217) 2022-04-03 16:12:14 +05:30
Sivabalan Narayanan
84064a9b08 [HUDI-3772] Fixing auto adjustment of lock configs for deltastreamer (#5207) 2022-04-02 23:44:10 -07:00
Alexey Kudinkin
cc3737be50 [HUDI-3664] Fixing Column Stats Index composition (#5181)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-04-02 17:15:52 -07:00
Sagar Sumit
74eb09be9b [HUDI-3776] Fix BloomIndex incorrectly using ColStats to lookup records locations (#5213) 2022-04-02 18:22:57 -04:00
YueZhang
020786a5f9 [HUDI-3451] Delete metadata table when the write client disables MDT (#5186)
* Add checks for metadata table init to avoid possible out-of-sync

* Revise the logic to reuse existing table config

* Revise docs and naming

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
2022-04-02 16:31:06 +05:30
Y Ethan Guo
fb45fc9cb9 [HUDI-3773] Fix parallelism used for metadata table bloom filter index (#5209) 2022-04-01 20:14:07 -07:00
xiarixiaoyao
444ff496a4 [RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark (#4910)
* [HUDI-2560] introduce id_based schema to support full schema evolution.

* add test for FileBasedInternalSchemaStorageManger and rebase code

* add support for change column type and fix some test case

* fix some bugs encountered in the production env and delete useless code

* fix test error

* rebase code

* fixed some nested schema change bugs

* [HUDI-2429][Stacked On HUDI-2560]Support full schema evolution for spark

* [use dummyInternalSchema instead of null]

* add support for spark3.1.x

* remove support for spark3.1.x , sicne some compile fail

* support spark3.1.x

* rebase and prepare solve all comments

* address all comments

* rebase code

* fixed the count(*) bug

* try to get internalSchema by parser commit file/history file directly, not use metaclient which is time cost
address some comments

* fixed all comments

* fix new comments

* rebase code,fix UT failed

* fixed mistake

* rebase code ,fixed new comments

* rebase code , and prepare for address new comments

* address commits

* address new comments

* fix new issues

* control fallback original write logical
2022-04-01 13:20:24 -07:00
董可伦
23b31225df [HUDI-3769] Optimize the logs of HoodieMergeHandle and BufferedConnectWriter (#5200) 2022-04-01 21:17:49 +08:00
Danny Chan
6df14f15a3 [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC (#4880) 2022-04-01 20:46:51 +08:00
ForwardXu
98b4e9796e [HUDI-3406] Rollback incorrectly relying on FS listing instead of Com… (#4957)
* [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata

* [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata

* [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata

* fix comments

* fix comments

* fix comments
2022-04-01 10:01:41 +08:00
Sagar Sumit
a048e940fd [HUDI-3743] Support DELETE_PARTITION for metadata table (#5169)
In order to drop any metadata partition (index), we can reuse the DELETE_PARTITION operation in metadata table. Subsequent to this, we can support drop index (with table config update) for async metadata indexer.

- Add a new API in HoodieTableMetadataWriter
- Current only supported for Spark metadata writer
2022-03-31 21:29:17 -04:00
Sagar Sumit
28dafa774e [HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693)
- Add a new action called INDEX, whose state transition is described in the RFC.
- Changes in timeline to support the new action.
- Add an index planner in ScheduleIndexActionExecutor.
- Add index plan executor in RunIndexActionExecutor.
- Add 3 APIs in HoodieTableMetadataWriter; a) scheduleIndex: will generate an index plan based on latest completed instant, initialize file groups and add a requested INDEX instant, b) index: executes the index plan and also takes care of writes that happened after indexing was requested, c) dropIndex: will drop index by removing the given metadata partition.
- Add 2 new table configs to serve as the source of truth for inflight and completed indexes.
- Support upgrade/downgrade taking care of the newly added configs.
- Add tool to trigger indexing in HoodieIndexer.
- Handle corner cases related to partial failures.
- Abort gracefully after deleting partition and instant.
- Handle other actions in timeline to consider before catching up
2022-04-01 01:33:12 +05:30
Sivabalan Narayanan
73a21092f8 [HUDI-3732] Fixing rollback validation (#5157)
* Fixing rollback validation

* Adding tests
2022-03-31 17:25:24 +05:30
ForwardXu
80011df995 [HUDI-3135] Make delete partitions lazy to be executed by the cleaner (#4489)
As of now, delete partitions will ensure all file groups are deleted, but the partition as such is not deleted. So, get all partitions might be returning the deleted partitions as well. but no data will be served since all file groups are deleted. With this patch, we are fixing it. We are letting cleaner take care of deleting the partitions when all file groups pertaining to a partitions are deleted.

- Fixed the CleanPlanActionExecutor to return meta info about list of partitions to be deleted. If there are no valid file groups for a partition, clean planner will include the partition to be deleted.
- Fixed HoodieCleanPlan avro schema to include the list of partitions to be deleted
- CleanActionExecutor is fixed to delete partitions if any (as per clean plan)
- Same info is added to HoodieCleanMetadata
- Metadata table when applying clean metadata, will check for partitions to be deleted and will update the "all_partitions" record for the deleted partitions.

Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-03-31 03:35:39 -04:00
Sivabalan Narayanan
4569734d60 [HUDI-3713] Guarding archival for multi-writer (#5138) 2022-03-31 01:44:31 -04:00
YueZhang
2dbb273d26 [HUDI-3721] Delete MDT if necessary when trigger rollback to savepoint (#5173)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-03-30 20:26:37 -07:00
xiarixiaoyao
2d73c8ae86 [HUDI-3355] Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy (#4962) 2022-03-30 15:54:25 -07:00
Sivabalan Narayanan
4fed8dd319 [HUDI-3485] Adding scheduler pool configs for async clustering (#5043) 2022-03-29 21:27:45 -04:00
Alexey Kudinkin
e5a2baeed0 [HUDI-3549] Removing dependency on "spark-avro" (#4955)
Hudi will be taking on promise for it bundles to stay compatible with Spark minor versions (for ex 2.4, 3.1, 3.2): meaning that single build of Hudi (for ex "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc)

To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root-cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR).

Instead of bundling "spark-avro" as dependency, we will be copying over some of the classes Hudi depends on and maintain them along the Hudi code-base to make sure we're able to provide for the aforementioned guarantee. To workaround arising compatibility issues we will be applying local patches to guarantee compatibility of Hudi bundles w/in the Spark minor version branches.

Following Hudi modules to Spark minor branches is currently maintained:

"hudi-spark3" -> 3.2.x
"hudi-spark3.1.x" -> 3.1.x
"hudi-spark2" -> 2.4.x
Following classes hierarchies (borrowed from "spark-avro") are maintained w/in these Spark-specific modules to guarantee compatibility with respective minor version branches:

AvroSerializer
AvroDeserializer
AvroUtils
Each of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules.

SchemaConverters class in turn is shared across all those modules given its relative stability (there're only cosmetical changes from 2.4.4 to 3.2.1).
All of the aforementioned classes have their corresponding scope of visibility limited to corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure broader code-base does not become dependent on them and instead relies on facades abstracting them.

Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in aforementioned minor versions branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI.

Brief change log:
- Removing spark-avro bundling from Hudi by default
- Scaffolded Spark 3.2.x hierarchy
- Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy
- Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy
- Moved ExpressionCodeGen,ExpressionPayload into hudi-spark module
- Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0
- Modified bot.yml to build full matrix of support Spark versions
- Removed "spark-avro" dependency from all modules
- Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.
2022-03-29 14:44:47 -04:00
Sivabalan Narayanan
d074089c62 [HUDI-2566] Adding multi-writer test support to integ test (#5065) 2022-03-28 17:05:00 -04:00
Y Ethan Guo
4ed84b216d [HUDI-3720] Fix the logic of reattempting pending rollback (#5148) 2022-03-28 14:54:31 -04:00
Y Ethan Guo
484b3407e0 [HUDI-3604] Adjust the order of timeline changes in rollbacks (#5114) 2022-03-26 22:37:44 -07:00
Alexey Kudinkin
189d5297b8 [HUDI-3709] Fixing ParquetWriter impls not respecting Parquet Max File Size limit (#5129) 2022-03-26 17:51:36 -04:00
RexAn
57b4f39c31 [HUDI-3612] Clustering strategy should create new TypedProperties when modifying it (#5027) 2022-03-26 16:16:03 +05:30
Danny Chan
0c09a973fb [HUDI-3435] Do not throw exception when instant to rollback does not exist in metadata table active timeline (#4821) 2022-03-26 11:42:54 +08:00
Danny Chan
8896864d7b [HUDI-3678] Fix record rewrite of create handle when 'preserveMetadata' is true (#5088) 2022-03-25 11:48:50 +08:00
Y Ethan Guo
eaa4c4f2e2 [HUDI-1180] Upgrade HBase to 2.4.9 (#5004)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-03-24 19:04:53 -07:00
Danny Chan
5e86cdd1e9 [HUDI-3701] Flink bulk_insert support bucket hash index (#5118) 2022-03-25 09:01:42 +08:00
Y Ethan Guo
608d4bf32d [HUDI-3638] Make ZookeeperBasedLockProvider serializable (#5112) 2022-03-24 17:59:47 -07:00
Sagar Sumit
f96ba7abf0 [HUDI-3642] Handle NPE due to empty requested replacecommit metadata (#5090) 2022-03-23 12:13:02 -07:00
Rajesh Mahindra
5f570ea151 [HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs (#4175)
- Refactor hive sync tool / config to use reflection and standardize configs

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-03-21 22:56:31 -04:00
Y Ethan Guo
9b6e138af2 [HUDI-3640] Set SimpleKeyGenerator as default in 2to3 table upgrade for Spark engine (#5075) 2022-03-21 20:35:06 -04:00
Pratyaksh Sharma
ca0931d332 [HUDI-1436]: Provide an option to trigger clean every nth commit (#4385)
- Provided option to trigger clean every nth commit with default number of commits as 1 so that existing users are not affected.
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-03-21 20:06:30 -04:00
Raymond Xu
7446ff95a7 [HUDI-2439] Replace RDD with HoodieData in HoodieSparkTable and commit executors (#4856)
- Adopt HoodieData in Spark action commit executors
- Make Spark independent DeleteHelper, WriteHelper, MergeHelper in hudi-client-common
- Make HoodieTable in WriteClient APIs have raw type to decouple with Client's generic types
2022-03-17 04:17:56 -07:00
Y Ethan Guo
5ba2d9ab2f [HUDI-3494] Consider triggering condition of MOR compaction during archival (#4974) 2022-03-17 01:28:11 -04:00
Y Ethan Guo
95e6e53810 [HUDI-3404] Automatically adjust write configs based on metadata table and write concurrency mode (#4975) 2022-03-17 01:25:04 -04:00
liujinhui
e60acc1258 [HUDI-3583] Fix MarkerBasedRollbackStrategy NoSuchElementException (#4984)
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
2022-03-12 23:00:50 -08:00
Sivabalan Narayanan
e7bb0413af [HUDI-3556] Re-use rollback instant for rolling back of clustering and compaction if rollback failed mid-way (#4971) 2022-03-11 18:40:13 -05:00
Danny Chan
ec24407191 [HUDI-3581] Reorganize some clazz for hudi flink (#4983) 2022-03-10 15:55:15 +08:00
Alexey Kudinkin
034addaef5 [HUDI-3396] Make sure BaseFileOnlyViewRelation only reads projected columns (#4818)
NOTE: This change is first part of the series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them, implementations are consistent and performant

This PR is making sure that BaseFileOnlyViewRelation only reads projected columns as well as avoiding unnecessary serde from Row to InternalRow

Brief change log
- Introduced HoodieBaseRDD as a base for all custom RDD impls
- Extracted common fields/methods to HoodieBaseRelation
- Cleaned up and streamlined HoodieBaseFileViewOnlyRelation
- Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions
2022-03-09 21:45:25 -05:00
Sivabalan Narayanan
4324e874ae [HUDI-3587] Making SupportsUpgradeDowngrade serializable (#4991) 2022-03-09 00:04:42 -05:00
Sagar Sumit
575bc63468 [HUDI-3356][HUDI-3203] HoodieData for metadata index records; BloomFilter construction from index based on the type param (#4848)
Rework of #4761 
This diff introduces following changes:

- Write stats are converted to metadata index records during the commit. Making them use the HoodieData type so that the record generation scales up with needs. 
- Metadata index init support for bloom filter and column stats partitions.
- When building the BloomFilter from the index records, using the type param stored in the payload instead of hardcoded type.
- Delta writes can change column ranges and the column stats index need to be properly updated with new ranges to be consistent with the table dataset. This fix add column stats index update support for the delta writes.

Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com>
2022-03-08 10:39:04 -05:00
Sivabalan Narayanan
29040762fa [HUDI-3576] Configuring timeline refreshes based on latest commit (#4973) 2022-03-07 17:01:49 -05:00
Alexey Kudinkin
a66fd40692 [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS (#4739)
- This change makes sure MT records are updated appropriately on HDFS: previously after Log File append operations MT records were updated w/ just the size of the deltas being appended to the original files, which have been found to be the cause of issues in case of Rollbacks that were instead updating MT with records bearing the full file-size.

- To make sure that we hedge against similar issues going f/w, this PR alleviates this discrepancy and streamlines the flow of MT table always ingesting records bearing full file-sizes.
2022-03-07 15:38:27 -05:00
Alexey Kudinkin
f0bcee3c01 [HUDI-3561] Avoid including whole MultipleSparkJobExecutionStrategy object into the closure for Spark to serialize (#4954)
- Avoid including whole MultipleSparkJobExecutionStrategy object into the closure for Spark to serialize
2022-03-07 13:42:03 -05:00
Sivabalan Narayanan
3539578ccb [HUDI-3213] Making commit preserve metadata to true for compaction (#4811)
* Making commit preserve metadata to true

* Fixing integ tests

* Fixing preserve commit metadata for metadata table

* fixed bootstrap tests

* temp diff

* Fixing merge handle

* renaming fallback record

* fixing build issue

* Fixing test failures
2022-03-07 18:02:05 +05:30
shibei
62f534d002 [HUDI-3445] Support Clustering Command Based on Call Procedure Command for Spark SQL (#4901)
* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

Co-authored-by: shibei <huberylee.li@alibaba-inc.com>
2022-03-04 09:33:16 +08:00