322 Commits

Author SHA1 Message Date
Raymond Xu
e96f08f355 Moving to 0.12.0-SNAPSHOT on master branch. 2022-04-06 15:24:10 +08:00
Alexey Kudinkin
8baeb816d5 [HUDI-3723] Fixed stack overflows in Record Iterators (#5235) 2022-04-05 23:12:13 -04:00
Yann Byron
3195f51562 [HUDI-3748] write and select hudi table when enable hoodie.datasource.write.drop.partition.columns (#5201) 2022-04-05 01:31:41 -07:00
ForwardXu
3449e86989 [HUDI-3780] improve drop partitions (#5178) 2022-04-05 11:52:33 +08:00
Prashant Wason
b28f0d6ceb [HUDI-3290] Different file formats for the partition metadata file. (#5179)
* [HUDI-3290] Different file formats for the partition metadata file.

Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition.

Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset.

For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting.

* Code review comments

- Adding a new command to migrate from text to base file formats for meta file.
- Reimplementing readFromFS() to first read the text format, then base format
- Avoid extra exists() checks in readFromFS()
- Added unit tests, enabled parquet format across hoodie-hadoop-mr
- Code cleanup, restructuring, naming consistency.

* Wiring in all the other Spark code paths to respect this config

 - Turned on parquet meta format for COW data source tests
 - Removed the deltastreamer command line to keep it shorter

* populate HoodiePartitionMetadata#format after readFromFS()

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-04 08:08:20 -07:00
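
The read order described in the commit above (text format first, then base file format, without separate exists() checks) could be sketched roughly as follows. This is an illustration only, not Hudi's implementation: the function name, the metafile name, and the ".parquet"/".orc" suffixes are assumptions made for the example.

```python
from pathlib import Path

# Assumed names for illustration; the commit message does not spell them out.
TEXT_METAFILE = ".hoodie_partition_metadata"
BASE_FORMAT_SUFFIXES = (".parquet", ".orc")


def read_partition_metafile(partition_dir):
    """Return (format, path) for the partition metafile, or None if absent."""
    partition_dir = Path(partition_dir)
    text_path = partition_dir / TEXT_METAFILE
    try:
        # Open directly instead of calling exists() first: one filesystem
        # round trip in the common case.
        with open(text_path, "rb"):
            return ("properties", text_path)
    except FileNotFoundError:
        pass
    # Fall back to a metafile stored in the same format as the base files.
    for suffix in BASE_FORMAT_SUFFIXES:
        candidate = partition_dir / (TEXT_METAFILE + suffix)
        try:
            with open(candidate, "rb"):
                return (suffix.lstrip("."), candidate)
        except FileNotFoundError:
            continue
    return None
```

Trying the open and handling the failure keeps the lookup to a single call per candidate file, which matters on object stores where every existence check is a remote request.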
Sivabalan Narayanan
84064a9b08 [HUDI-3772] Fixing auto adjustment of lock configs for deltastreamer (#5207) 2022-04-02 23:44:10 -07:00
Alexey Kudinkin
cc3737be50 [HUDI-3664] Fixing Column Stats Index composition (#5181)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-04-02 17:15:52 -07:00
xiarixiaoyao
444ff496a4 [RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark (#4910)
* [HUDI-2560] introduce id_based schema to support full schema evolution.

* add test for FileBasedInternalSchemaStorageManager and rebase code

* add support for change column type and fix some test case

* fix some bugs encountered in the production env and delete useless code

* fix test error

* rebase code

* fixed some nested schema change bugs

* [HUDI-2429][Stacked On HUDI-2560]Support full schema evolution for spark

* [use dummyInternalSchema instead of null]

* add support for spark3.1.x

* remove support for spark3.1.x, since compilation fails

* support spark3.1.x

* rebase and prepare solve all comments

* address all comments

* rebase code

* fixed the count(*) bug

* try to get internalSchema by parsing the commit file/history file directly, instead of using metaclient, which is time-consuming
address some comments

* fixed all comments

* fix new comments

* rebase code,fix UT failed

* fixed mistake

* rebase code ,fixed new comments

* rebase code , and prepare for address new comments

* address commits

* address new comments

* fix new issues

* control fallback to original write logic
2022-04-01 13:20:24 -07:00
Danny Chan
6df14f15a3 [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC (#4880) 2022-04-01 20:46:51 +08:00
Sagar Sumit
a048e940fd [HUDI-3743] Support DELETE_PARTITION for metadata table (#5169)
In order to drop any metadata partition (index), we can reuse the DELETE_PARTITION operation in metadata table. Subsequent to this, we can support drop index (with table config update) for async metadata indexer.

- Add a new API in HoodieTableMetadataWriter
- Currently only supported for Spark metadata writer
2022-03-31 21:29:17 -04:00
liuhe0702
1da196c1e8 [HUDI-2777] Improve HoodieSparkSqlWriter write performance (#5187) 2022-03-31 12:48:47 -07:00
xiarixiaoyao
7889c7852f [HUDI-3729][SPARK] fixed the perf regression by enabling vectorizeReader for parquet file (#5168)
* [MINOR][SPARK] fixed the perf regression by enabling vectorizeReader for parquet file

* address comments

* add perf result
2022-03-31 20:09:26 +08:00
ForwardXu
80011df995 [HUDI-3135] Make delete partitions lazy to be executed by the cleaner (#4489)
As of now, delete partitions ensures that all file groups are deleted, but the partition itself is not deleted. So, getting all partitions might return the deleted partitions as well, although no data will be served since all file groups are deleted. This patch fixes that: the cleaner takes care of deleting a partition once all file groups pertaining to it are deleted.

- Fixed the CleanPlanActionExecutor to return meta info about list of partitions to be deleted. If there are no valid file groups for a partition, clean planner will include the partition to be deleted.
- Fixed HoodieCleanPlan avro schema to include the list of partitions to be deleted
- CleanActionExecutor is fixed to delete partitions if any (as per clean plan)
- Same info is added to HoodieCleanMetadata
- When applying clean metadata, the metadata table will check for partitions to be deleted and will update the "all_partitions" record for the deleted partitions.

Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-03-31 03:35:39 -04:00
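
The lazy deletion flow described in the commit above might be pictured with the following sketch. It is a simplified model, not the actual CleanPlanActionExecutor or metadata table code: the function names and the dictionary shapes are hypothetical.

```python
def build_clean_plan(valid_file_groups_by_partition):
    """Hypothetical planner: a partition with no valid file groups left
    is listed for deletion in the clean plan."""
    partitions_to_delete = sorted(
        partition
        for partition, file_groups in valid_file_groups_by_partition.items()
        if not file_groups
    )
    return {"partitions_to_delete": partitions_to_delete}


def apply_clean_metadata(all_partitions, clean_plan):
    """Hypothetical metadata update: drop deleted partitions from the
    list that backs the "all_partitions" record."""
    deleted = set(clean_plan["partitions_to_delete"])
    return [p for p in all_partitions if p not in deleted]


# Usage: "2022/03/30" has had every file group deleted, so the planner
# schedules the partition itself for removal.
state = {"2022/03/30": set(), "2022/03/31": {"fg-1", "fg-2"}}
plan = build_clean_plan(state)
remaining = apply_clean_metadata(list(state), plan)
```

Deferring the partition removal to the cleaner means the delete-partition operation stays cheap, and partition listings only stop returning the partition after the clean has been applied.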
KnightChess
2c4554fada [HUDI-3750] Fix NPE when build HoodieFileIndex (#5134)
Co-authored-by: wulingqi <wulingqi@baijiahulian.com>
2022-03-30 19:19:05 -07:00
Alexey Kudinkin
8b796e9686 [HUDI-3653] Cleaning up bespoke Column Stats Index implementation (#5062) 2022-03-30 10:01:43 -07:00
Sagar Sumit
04478a45d9 [MINOR] Fix dates as per UTC in TestDataSkippingUtils (#5166)
* Fix timezone in test
2022-03-30 20:03:14 +05:30
Sivabalan Narayanan
4fed8dd319 [HUDI-3485] Adding scheduler pool configs for async clustering (#5043) 2022-03-29 21:27:45 -04:00
ForwardXu
941c254c33 [HUDI-2520] Fix CTAS statement issue when sync to hive (#5145) 2022-03-29 12:25:31 -07:00
Alexey Kudinkin
e5a2baeed0 [HUDI-3549] Removing dependency on "spark-avro" (#4955)
Hudi will be taking on a promise for its bundles to stay compatible with Spark minor versions (for ex 2.4, 3.1, 3.2): meaning that a single build of Hudi (for ex "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc)

To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root-cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR).

Instead of bundling "spark-avro" as dependency, we will be copying over some of the classes Hudi depends on and maintain them along the Hudi code-base to make sure we're able to provide for the aforementioned guarantee. To workaround arising compatibility issues we will be applying local patches to guarantee compatibility of Hudi bundles w/in the Spark minor version branches.

The following mapping of Hudi modules to Spark minor branches is currently maintained:

"hudi-spark3" -> 3.2.x
"hudi-spark3.1.x" -> 3.1.x
"hudi-spark2" -> 2.4.x

The following class hierarchies (borrowed from "spark-avro") are maintained w/in these Spark-specific modules to guarantee compatibility with respective minor version branches:

AvroSerializer
AvroDeserializer
AvroUtils

Each of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules.

SchemaConverters class in turn is shared across all those modules given its relative stability (there are only cosmetic changes from 2.4.4 to 3.2.1).
All of the aforementioned classes have their corresponding scope of visibility limited to corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure broader code-base does not become dependent on them and instead relies on facades abstracting them.

Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in aforementioned minor versions branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI.

Brief change log:
- Removing spark-avro bundling from Hudi by default
- Scaffolded Spark 3.2.x hierarchy
- Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy
- Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy
- Moved ExpressionCodeGen,ExpressionPayload into hudi-spark module
- Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0
- Modified bot.yml to build full matrix of supported Spark versions
- Removed "spark-avro" dependency from all modules
- Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.
2022-03-29 14:44:47 -04:00
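
The module-to-branch mapping listed in the commit above can be expressed as a small lookup. This is only an illustration of the compatibility promise (any patch version within a minor branch resolves to the same module); the helper name is hypothetical and nothing like it is claimed to exist in the Hudi build.

```python
# Mapping taken from the commit message: Spark minor branch -> Hudi module.
SPARK_MINOR_TO_HUDI_MODULE = {
    "3.2": "hudi-spark3",
    "3.1": "hudi-spark3.1.x",
    "2.4": "hudi-spark2",
}


def hudi_module_for(spark_version):
    """Hypothetical helper: resolve a full Spark version such as "3.2.1"
    to the Spark-specific Hudi module for its minor branch."""
    parts = spark_version.split(".")
    if len(parts) < 2:
        raise ValueError(f"not a Spark version: {spark_version!r}")
    minor = ".".join(parts[:2])
    try:
        return SPARK_MINOR_TO_HUDI_MODULE[minor]
    except KeyError:
        raise ValueError(f"no Hudi module listed for Spark {minor}.x") from None


# Both patch versions of the 3.2.x branch resolve to the same module.
modules = {hudi_module_for("3.2.0"), hudi_module_for("3.2.1")}
```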
Alexey Kudinkin
fcb003ec76 [HUDI-3731] Fixing Column Stats Index record Merging sequence missing columnName (#5159)
* Added `DataSkippingFailureMode` to control how DS handles failures in the flow (either "strict", when exception would be thrown, or "fallback" when it will just fallback to the full-scan)

* Make sure tests execute in `DataSkippingFailureMode.Strict`

* Fixed Column Stats Index record merging sequence missing `columnName`
2022-03-29 21:09:56 +05:30
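
The two failure modes named in the commit above ("strict" throws, "fallback" reverts to a full scan) could behave along these lines. The sketch borrows the name `DataSkippingFailureMode` from the commit message, but the Python enum, the `candidate_files` function, and its parameters are assumptions for illustration, not Hudi's API.

```python
from enum import Enum


class DataSkippingFailureMode(Enum):
    STRICT = "strict"      # propagate the index failure to the caller
    FALLBACK = "fallback"  # ignore the failure and scan every file


def candidate_files(all_files, lookup_index, mode):
    """Hypothetical pruning step: ask the index which files may match;
    on failure, either re-raise or fall back to the full file list."""
    try:
        return lookup_index(all_files)
    except Exception:
        if mode is DataSkippingFailureMode.STRICT:
            raise
        return list(all_files)
```

Running tests in the strict mode, as the commit does, keeps index bugs from being hidden behind a silent full scan.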
leesf
8f8a8158e2 [HUDI-2520] Fix drop table issue when sync to Hive (#5143) 2022-03-28 19:34:12 -07:00
ForwardXu
72e0b52b18 [HUDI-3722] Fix truncate hudi table's error (#5140) 2022-03-29 09:44:18 +08:00
Raymond Xu
6ccbae4d2a [HUDI-2757] Implement Hudi AWS Glue sync (#5076) 2022-03-28 14:54:59 -04:00
huberylee
1d0f4ccfe0 [HUDI-3538] Support Compaction Command Based on Call Procedure Command for Spark SQL (#4945)
* Support Compaction Command Based on Call Procedure Command for Spark SQL

* Addressed review comments
2022-03-28 14:11:35 +08:00
ForwardXu
d31cde284c [MINOR] Fix call command parser use spark3.2 (#5144) 2022-03-28 11:13:44 +08:00
Sivabalan Narayanan
f2a93ead3b [HUDI-3724] Fixing closure of ParquetReader (#5141) 2022-03-28 09:36:15 +08:00
xiarixiaoyao
9da2dd416e [HUDI-3719] High performance costs of AvroSerializer in DataSource wr… (#5137)
* [HUDI-3719] High performance costs of AvroSerializer in DataSource writing

* add benchmark framework adapted from Spark
add avroSerDerBenchmark
2022-03-27 11:01:43 -07:00
Alexey Kudinkin
51034fecf1 [HUDI-3396] Refactoring MergeOnReadRDD to avoid duplication, fetch only projected columns (#4888) 2022-03-25 09:32:03 -07:00
ForwardXu
12cc8e715b [MINOR] fix QuickstartUtils move (#5133) 2022-03-25 07:34:35 -07:00
ForwardXu
e5c3f9089b [HUDI-3563] Make quickstart examples covered by CI tests (#5082) 2022-03-25 01:37:17 -07:00
Alexey Kudinkin
8b38ddedc2 [HUDI-3594] Supporting Composite Expressions over Data Table Columns in Data Skipping flow (#4996) 2022-03-24 22:27:15 -07:00
Rajesh Mahindra
5f570ea151 [HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs (#4175)
- Refactor hive sync tool / config to use reflection and standardize configs

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-03-21 22:56:31 -04:00
Alexey Kudinkin
099c2c099a [HUDI-3457] Refactored Spark DataSource Relations to avoid code duplication (#4877)
Refactoring Spark DataSource Relations to avoid code duplication. 

Following Relations were in scope:

- BaseFileOnlyViewRelation
- MergeOnReadSnapshotRelation
- MergeOnReadIncrementalRelation
2022-03-18 22:32:16 -07:00
RexAn
9ece77561a [MINOR] HoodieFileScanRDD could print null path (#5056)
Co-authored-by: Rex An <bonean131@gmail.com>
2022-03-17 12:53:45 -07:00
Alexey Kudinkin
5e8ff8d793 [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index (#4948) 2022-03-15 10:38:36 -07:00
peanut-chenzhong
4b75cb6f23 fix NPE when running schedule using spark-sql if the commits time < hoodie.compact.inline.max.delta.commits (#4976)
* Update CompactionHoodiePathCommand.scala

fix NPE when running schedule using spark-sql if the commits time < hoodie.compact.inline.max.delta.commits

* Update CompactionHoodiePathCommand.scala

fix IndexOutOfBoundsException when there's no schedule for compaction

* Update CompactionHoodiePathCommand.scala

fix CI issue
2022-03-14 16:40:38 +08:00
ForwardXu
6c8224cae6 [HUDI-3501] Support savepoints command based on Call Procedure Command (#5025) 2022-03-13 16:58:21 +04:00
huberylee
56cb49485d [HUDI-3567] Refactor HoodieCommonUtils to make code more reasonable (#4982) 2022-03-11 13:23:19 -08:00
Alexey Kudinkin
034addaef5 [HUDI-3396] Make sure BaseFileOnlyViewRelation only reads projected columns (#4818)
NOTE: This change is first part of the series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them, implementations are consistent and performant

This PR is making sure that BaseFileOnlyViewRelation only reads projected columns as well as avoiding unnecessary serde from Row to InternalRow

Brief change log
- Introduced HoodieBaseRDD as a base for all custom RDD impls
- Extracted common fields/methods to HoodieBaseRelation
- Cleaned up and streamlined HoodieBaseFileViewOnlyRelation
- Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions
2022-03-09 21:45:25 -05:00
MrSleeping123
8859b48b2a [HUDI-3383] Sync column comments while syncing a hive table (#4960)
Desc: Add a hive sync config (hoodie.datasource.hive_sync.sync_comment). This config defaults to false.
When syncing a data source to Hudi, if column comments are added to the source Avro schema and sync_comment is true, the column comments are synced to the Hive table.
2022-03-10 09:44:39 +08:00
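
The gating behaviour of the config in the commit above could look roughly like this. The config key and its false default come from the commit message; the function name is hypothetical, and reading comments from the Avro field `doc` attribute is an assumption about where the column comments live.

```python
SYNC_COMMENT_KEY = "hoodie.datasource.hive_sync.sync_comment"


def comments_to_sync(avro_schema, options):
    """Hypothetical helper: return {column: comment} to push to Hive,
    or an empty dict when comment syncing is disabled (the default)."""
    enabled = str(options.get(SYNC_COMMENT_KEY, "false")).lower() == "true"
    if not enabled:
        return {}
    # Assumption: column comments are carried in each Avro field's "doc".
    return {
        field["name"]: field["doc"]
        for field in avro_schema.get("fields", [])
        if field.get("doc")
    }


schema = {
    "type": "record",
    "name": "trip",
    "fields": [
        {"name": "rider", "type": "string", "doc": "rider identifier"},
        {"name": "fare", "type": "double"},
    ],
}
synced = comments_to_sync(schema, {SYNC_COMMENT_KEY: "true"})
```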
ForwardXu
08fd80c913 [HUDI-3221] Support querying a table as of a savepoint (#4720) 2022-03-08 10:02:34 -08:00
ForwardXu
25385805aa [HUDI-3574] Improve maven module configs for different spark profiles (#4970) 2022-03-08 01:01:05 -08:00
Alexey Kudinkin
a66fd40692 [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS (#4739)
- This change makes sure MT records are updated appropriately on HDFS: previously, after Log File append operations, MT records were updated w/ just the size of the deltas being appended to the original files, which was found to be the cause of issues in case of Rollbacks, which were instead updating MT with records bearing the full file-size.

- To make sure that we hedge against similar issues going f/w, this PR alleviates this discrepancy and streamlines the flow of MT table always ingesting records bearing full file-sizes.
2022-03-07 15:38:27 -05:00
Sivabalan Narayanan
3539578ccb [HUDI-3213] Making commit preserve metadata to true for compaction (#4811)
* Making commit preserve metadata to true

* Fixing integ tests

* Fixing preserve commit metadata for metadata table

* fixed bootstrap tests

* temp diff

* Fixing merge handle

* renaming fallback record

* fixing build issue

* Fixing test failures
2022-03-07 18:02:05 +05:30
Aditya Tiwari
051ad0b033 [HUDI-3130] Fixing Hive getSchema for RT tables addressing different partitions having different schemas (#4468)
* Fixing Hive getSchema for RT tables

* Addressing feedback

* temp diff

* fixing tests after spark datasource read support for metadata table is merged to master

* Adding multi-partition schema evolution tests to HoodieRealTimeRecordReader

Co-authored-by: Aditya Tiwari <aditya.tiwari@flipkart.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-03-06 07:51:35 +05:30
shibei
62f534d002 [HUDI-3445] Support Clustering Command Based on Call Procedure Command for Spark SQL (#4901)
* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

Co-authored-by: shibei <huberylee.li@alibaba-inc.com>
2022-03-04 09:33:16 +08:00
RexAn
be9a264885 [HUDI-3548] Fix async clustering not working if user specifies key "hoodie.datasource.clustering.async.enable" directly (#4905)
Co-authored-by: Rex An <bonean131@gmail.com>
2022-03-03 19:14:07 -05:00
Alexey Kudinkin
85f47b53df [HUDI-3469] Refactor HoodieTestDataGenerator to provide for reproducible Builds (#4866) 2022-03-01 22:15:26 -08:00
Sivabalan Narayanan
d5444ff7ff [HUDI-3018] Adding validation to dataframe schema to ensure reserved field does not have diff data type (#4852) 2022-02-27 11:59:23 -05:00
todd5167
45d1216e91 [HUDI-3401] fix NPE caused by incorrect beforeKeyGenClassName validation (#4774) 2022-02-24 23:31:29 -05:00