430 Commits

Author SHA1 Message Date
Alexey Kudinkin
4b296f79cc [HUDI-3935] Adding config to fallback to enabled Partition Values extraction from Partition path (#5377) 2022-04-21 01:36:19 -07:00
Y Ethan Guo
28fdddfee0 [HUDI-3920] Fix partition path construction in metadata table validator (#5365) 2022-04-19 19:40:09 -04:00
Sagar Sumit
4f44e6aeb5 [HUDI-3899] Drop index to delete pending index instants from timeline if applicable (#5342)
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-04-18 22:28:46 -04:00
Sagar Sumit
1718bcab84 [HUDI-3707] Fix target schema handling in HoodieSparkUtils while creating RDD (#5347) 2022-04-18 13:34:04 -04:00
Sivabalan Narayanan
05dfc39c29 Fixing async clustering job test in TestHoodieDeltaStreamer (#5317) 2022-04-18 17:38:33 +05:30
董可伦
b8e465fdfc [MINOR] Fix typos in log4j-surefire.properties (#5212) 2022-04-15 13:33:37 -07:00
Raymond Xu
9e8664f4d2 [HOTFIX] add missing license (#5322) (#5324) 2022-04-14 12:35:20 -07:00
Vinoth Govindarajan
2d46d5287e [HUDI-3838] Moved the getPartitionColumns logic to driver. (#5303) 2022-04-12 18:03:00 -04:00
Vinoth Govindarajan
d16740976e [HUDI-3838] Implemented drop partition column feature for delta streamer code path (#5294)
* [HUDI-3838] Implemented drop partition column feature for delta streamer code path

* Ensure drop partition table config is updated in hoodie.props

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-04-12 18:10:30 +05:30
Sagar Sumit
3d8fc78c66 [HUDI-3844] Update props in indexer based on table config (#5293) 2022-04-11 18:16:06 -04:00
Y Ethan Guo
63a099c5b7 [HUDI-3847] Fix NPE due to null schema in HoodieMetadataTableValidator (#5284) 2022-04-10 17:59:29 -07:00
Alexey Kudinkin
81b25c543a [HUDI-3825] Fixing Column Stats Index updating sequence (#5267) 2022-04-08 23:14:08 -07:00
Y Ethan Guo
cd2c346df6 [HUDI-3637] Exclude uncommitted log files from metadata table validation (#5234) 2022-04-07 13:03:03 -07:00
Raymond Xu
e96f08f355 Moving to 0.12.0-SNAPSHOT on master branch. 2022-04-06 15:24:10 +08:00
Prashant Wason
b28f0d6ceb [HUDI-3290] Different file formats for the partition metadata file. (#5179)
* [HUDI-3290] Different file formats for the partition metadata file.

Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition.

Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset.

For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting.

* Code review comments

- Adding a new command to migrate from text to base file formats for meta file.
- Reimplementing readFromFS() to first read the text format, then base format
- Avoid extra exists() checks in readFromFS()
- Added unit tests, enabled parquet format across hoodie-hadoop-mr
- Code cleanup, restructuring, naming consistency.

* Wiring in all the other Spark code paths to respect this config

 - Turned on parquet meta format for COW data source tests
 - Removed the deltastreamer command line to keep it shorter

* populate HoodiePartitionMetadata#format after readFromFS()

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-04 08:08:20 -07:00
Sivabalan Narayanan
84064a9b08 [HUDI-3772] Fixing auto adjustment of lock configs for deltastreamer (#5207) 2022-04-02 23:44:10 -07:00
Alexey Kudinkin
cc3737be50 [HUDI-3664] Fixing Column Stats Index composition (#5181)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-04-02 17:15:52 -07:00
Y Ethan Guo
c19f505b5a [HUDI-3784] Improve docs and logs of HoodieMetadataTableValidator (#5216) 2022-04-02 13:16:17 -07:00
Sagar Sumit
28dafa774e [HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693)
- Add a new action called INDEX, whose state transition is described in the RFC.
- Changes in timeline to support the new action.
- Add an index planner in ScheduleIndexActionExecutor.
- Add index plan executor in RunIndexActionExecutor.
- Add 3 APIs in HoodieTableMetadataWriter; a) scheduleIndex: will generate an index plan based on latest completed instant, initialize file groups and add a requested INDEX instant, b) index: executes the index plan and also takes care of writes that happened after indexing was requested, c) dropIndex: will drop index by removing the given metadata partition.
- Add 2 new table configs to serve as the source of truth for inflight and completed indexes.
- Support upgrade/downgrade taking care of the newly added configs.
- Add tool to trigger indexing in HoodieIndexer.
- Handle corner cases related to partial failures.
- Abort gracefully after deleting partition and instant.
- Handle other actions in timeline to consider before catching up
2022-04-01 01:33:12 +05:30
YueZhang
2b60641d17 [HUDI-3635] Fix HoodieMetadataTableValidator around comparison of partition path listing (#5100)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-03-30 17:23:37 -04:00
YueZhang
eae8488536 [HUDI-3647] HoodieMetadataTableValidator: check MDT was initialized at first (#5152)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-03-30 17:18:08 -04:00
harshal
7fa363923c [HUDI-3745] Support for spark datasource options in S3EventsHoodieIncrSource (#5170) 2022-03-30 01:34:49 -04:00
Sivabalan Narayanan
4fed8dd319 [HUDI-3485] Adding scheduler pool configs for async clustering (#5043) 2022-03-29 21:27:45 -04:00
Alexey Kudinkin
e5a2baeed0 [HUDI-3549] Removing dependency on "spark-avro" (#4955)
Hudi will be taking on a promise for its bundles to stay compatible with Spark minor versions (e.g. 2.4, 3.1, 3.2), meaning that a single build of Hudi (e.g. "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc.).

To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root cause of incompatibility between consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR).

Instead of bundling "spark-avro" as a dependency, we will be copying over some of the classes Hudi depends on and maintaining them alongside the Hudi code-base to make sure we're able to provide the aforementioned guarantee. To work around any compatibility issues that arise, we will be applying local patches to guarantee compatibility of Hudi bundles within the Spark minor version branches.

The following mapping of Hudi modules to Spark minor branches is currently maintained:

"hudi-spark3" -> 3.2.x
"hudi-spark3.1.x" -> 3.1.x
"hudi-spark2" -> 2.4.x
The following class hierarchies (borrowed from "spark-avro") are maintained within these Spark-specific modules to guarantee compatibility with the respective minor version branches:

AvroSerializer
AvroDeserializer
AvroUtils
Each of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules.

The SchemaConverters class, in turn, is shared across all those modules given its relative stability (there are only cosmetic changes from 2.4.4 to 3.2.1).
All of the aforementioned classes have their corresponding scope of visibility limited to corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure broader code-base does not become dependent on them and instead relies on facades abstracting them.

Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in aforementioned minor versions branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI.

Brief change log:
- Removing spark-avro bundling from Hudi by default
- Scaffolded Spark 3.2.x hierarchy
- Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy
- Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy
- Moved ExpressionCodeGen,ExpressionPayload into hudi-spark module
- Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0
- Modified bot.yml to build full matrix of supported Spark versions
- Removed "spark-avro" dependency from all modules
- Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.
2022-03-29 14:44:47 -04:00
Raymond Xu
6ccbae4d2a [HUDI-2757] Implement Hudi AWS Glue sync (#5076) 2022-03-28 14:54:59 -04:00
wangxianghu
f20c9867d7 [HUDI-3711] Fix typo in MaxwellJsonKafkaSourcePostProcessor.Config#PRECOMBINE_FIELD_TYPE_PROP (#5096) 2022-03-25 00:02:54 -07:00
Y Ethan Guo
4ddd094ba2 [HUDI-3689] Disable flaky tests in TestHoodieDeltaStreamer (#5127) 2022-03-24 16:42:44 -07:00
Raymond Xu
ff136658a0 [HUDI-3689] Fix delta streamer tests (#5124) 2022-03-24 14:19:53 -07:00
Raymond Xu
686da41696 [HUDI-3689] Fix UT failures in TestHoodieDeltaStreamer (#5120) 2022-03-24 09:10:33 -07:00
Alexey Kudinkin
ccc3728002 [HUDI-3684] Fixing NPE in ParquetUtils (#5102)
* Make sure nulls are properly handled in `HoodieColumnRangeMetadata`
2022-03-24 17:37:38 +05:30
Sagar Sumit
fe2c3989e3 [HUDI-3689] Fix glob path and hive sync in deltastreamer tests (#5117)
* Remove glob pattern basePath from the deltastreamer tests.

* [HUDI-3689] Fix file scheme config

for CI failure in TestHoodieRealTimeRecordReader

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-03-24 15:48:35 +05:30
Rajesh Mahindra
5f570ea151 [HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs (#4175)
- Refactor hive sync tool / config to use reflection and standardize configs

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-03-21 22:56:31 -04:00
YueZhang
8ca9a54db0 [Hudi-3376] Add an option to skip under deletion files for HoodieMetadataTableValidator (#4994)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-03-16 18:31:00 -07:00
wangxianghu
3b59b76952 [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string (#4987)
* [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string

* add ut

* Address comment
2022-03-15 15:06:30 +04:00
Sivabalan Narayanan
22c3ce73db [HUDI-3621] Fixing NullPointerException in DeltaStreamer (#5039) 2022-03-14 18:34:17 -04:00
wangxianghu
e8918b6c2c [HUDI-3569] Introduce ChainedJsonKafkaSourePostProcessor to support setting multi processors at once (#4969) 2022-03-11 17:49:30 -05:00
wangxianghu
b00180342e [HUDI-3575] Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema in TestSchemaPostProcessor (#5019) 2022-03-11 15:03:42 +04:00
wangxianghu
83cff3afee [HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop columns from schema (#4972)
* [HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop columns from schema

* Fix case sensitivity
2022-03-11 09:30:37 +04:00
Sivabalan Narayanan
9dc6df5dca [HUDI-3595] Fixing NULL schema provider for empty batch (#5002) 2022-03-10 22:52:55 -05:00
Alexey Kudinkin
034addaef5 [HUDI-3396] Make sure BaseFileOnlyViewRelation only reads projected columns (#4818)
NOTE: This change is the first part of a series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them and that the implementations are consistent and performant

This PR is making sure that BaseFileOnlyViewRelation only reads projected columns as well as avoiding unnecessary serde from Row to InternalRow

Brief change log
- Introduced HoodieBaseRDD as a base for all custom RDD impls
- Extracted common fields/methods to HoodieBaseRelation
- Cleaned up and streamlined HoodieBaseFileViewOnlyRelation
- Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions
2022-03-09 21:45:25 -05:00
wangxianghu
548000b0d6 [HUDI-3568] Introduce ChainedSchemaPostProcessor to support setting multi processors at once (#4968) 2022-03-09 11:16:22 +04:00
Sagar Sumit
575bc63468 [HUDI-3356][HUDI-3203] HoodieData for metadata index records; BloomFilter construction from index based on the type param (#4848)
Rework of #4761 
This diff introduces the following changes:

- Write stats are converted to metadata index records during the commit. They now use the HoodieData type so that record generation scales with need.
- Metadata index init support for bloom filter and column stats partitions.
- When building the BloomFilter from the index records, the type param stored in the payload is used instead of a hardcoded type.
- Delta writes can change column ranges, and the column stats index needs to be properly updated with the new ranges to stay consistent with the table dataset. This fix adds column stats index update support for delta writes.

Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com>
2022-03-08 10:39:04 -05:00
Sivabalan Narayanan
3539578ccb [HUDI-3213] Making commit preserve metadata to true for compaction (#4811)
* Making commit preserve metadata to true

* Fixing integ tests

* Fixing preserve commit metadata for metadata table

* fixed bootstrap tests

* temp diff

* Fixing merge handle

* renaming fallback record

* fixing build issue

* Fixing test failures
2022-03-07 18:02:05 +05:30
wangxianghu
c9ffdc493e [HUDI-3525] Introduce JsonkafkaSourceProcessor to support data preprocess before it is transformed to DataSet (#4930) 2022-03-06 15:41:01 -05:00
wangxianghu
4b471772aa [HUDI-3520] Introduce DeleteSupportSchemaPostProcessor to support adding _hoodie_is_deleted column to schema (#4921) 2022-03-06 15:37:09 -05:00
Pratyaksh Sharma
907e60c252 [HUDI-3264]: made schema registry urls configurable with MTDS (#4779) 2022-03-02 15:30:41 -05:00
Alexey Kudinkin
85f47b53df [HUDI-3469] Refactor HoodieTestDataGenerator to provide for reproducible Builds (#4866) 2022-03-01 22:15:26 -08:00
Sivabalan Narayanan
f7088a957c [HUDI-3497] Adding Datatable validator tool (#4902) 2022-02-28 22:46:32 -05:00
Y Ethan Guo
257052a94d [HUDI-3465] Add validation of column stats and bloom filters in HoodieMetadataTableValidator (#4878) 2022-02-28 18:49:30 -08:00
Sivabalan Narayanan
2f99e8458a [HUDI-3521] Fixing Kafka key and value serializer value type from class to string (#4919) 2022-02-27 11:13:13 -05:00