456 Commits

Author SHA1 Message Date
Y Ethan Guo
fbda4ad5bd [HUDI-4360] Fix HoodieDropPartitionsTool based on refactored meta sync (#6043) 2022-07-04 23:37:21 -07:00
YueZhang
45fdcf68a1 [HUDI-3116]Add a new HoodieDropPartitionsTool to let users drop table partitions through a standalone job. (#4459)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-07-04 19:24:18 -07:00
Shiyan Xu
c0e1587966 [HUDI-3730] Improve meta sync class design and hierarchies (#5854)
* [HUDI-3730] Improve meta sync class design and hierarchies (#5754)
* Implements class design proposed in RFC-55

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
2022-07-03 14:47:25 +05:30
bschell
fd7d25ab63 [HUDI-1176] Upgrade hudi to log4j2 (#5366)
* Move to log4j2

cr: https://code.amazon.com/reviews/CR-71010705

* Upgrade unit tests to log4j2

* update exclusion

Co-authored-by: Brandon Scheller <bschelle@amazon.com>
2022-06-28 12:54:23 -07:00
董可伦
7689e62cd9 [HUDI-4265] Deprecate useless targetTableName parameter in HoodieMultiTableDeltaStreamer (#5883) 2022-06-17 16:57:14 +08:00
董可伦
c291b05699 [HUDI-4218] Expose the real exception information when an exception occurs in the tableExists method (#5827) 2022-06-15 18:10:35 +08:00
Qi Ji
4774c4248f [HUDI-4006] failOnDataLoss on delta-streamer kafka sources (#5718)
add new config key hoodie.deltastreamer.source.kafka.enable.failOnDataLoss
when failOnDataLoss=false (current behaviour, the default), log a warning instead of seeking to earliest silently
when failOnDataLoss is set, fail explicitly
2022-06-13 10:31:57 -04:00
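
The failOnDataLoss behaviour described in the HUDI-4006 entry above can be illustrated with a minimal sketch. This is not Hudi's code: the function name, the exception type, and the integer offset model are all hypothetical, and only the two outcomes (warn and seek to earliest, or fail explicitly) come from the commit message.

```python
import logging

logger = logging.getLogger("kafka_offsets_sketch")


class DataLossError(RuntimeError):
    """Hypothetical error raised when checkpointed offsets are no longer available."""


def resolve_start_offset(checkpoint_offset, earliest_available, fail_on_data_loss=False):
    """Pick the offset to resume from (illustrative model, not the Hudi implementation)."""
    if checkpoint_offset >= earliest_available:
        # Nothing was lost: resume exactly where the last checkpoint left off.
        return checkpoint_offset

    message = (
        f"offsets {checkpoint_offset}..{earliest_available - 1} are no longer available"
    )
    if fail_on_data_loss:
        # failOnDataLoss set: fail explicitly instead of skipping data.
        raise DataLossError(message)

    # failOnDataLoss=false (the default): still seek to earliest, but log a warning.
    logger.warning("Possible data loss, %s; seeking to earliest", message)
    return earliest_available
```

With the flag left at its default, a stale checkpoint resumes from the earliest available offset and emits a warning; with the flag set, the same situation raises an error.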
luoyajun
0d859fe58b [HUDI-3863] Add UT for drop partition column in deltastreamer testsuite (#5727) 2022-06-13 10:29:32 -04:00
Shiyan Xu
5aaac21d1d [HUDI-4224] Fix CI issues (#5842)
- Upgrade junit to 5.7.2
- Downgrade surefire and failsafe to 2.22.2
- Fix test failures that were previously not reported
- Improve azure pipeline configs

Co-authored-by: liujinhui1994 <965147871@qq.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
2022-06-12 11:44:18 -07:00
Sivabalan Narayanan
21b903fddb [HUDI-4197] Fix Async indexer to support building FILES partition (#5766)
- When the async indexer is invoked with only the "FILES" partition, it fails; this change fixes that. Also, if the metadata table itself is not initialized, anyone building indexes via AsyncIndexer is expected to index the "FILES" partition first, followed by other partitions. In general, AsyncIndexer is limited to building only one index at a time, so guards have been added to ensure these conditions are met.
2022-06-06 15:47:11 -04:00
Qi Ji
7276d0eaa6 [HUDI-3670] free temp views in sql transformers (#5080) 2022-06-01 07:35:40 -07:00
Kumud Kumar Srivatsava Tirupati
795a99ba73 [HUDI-4107] Added --sync-tool-classes config option in HoodieMultiTableDeltaStreamer (#5597)
* added --sync-tool-classes config option in multitable delta streamer

* added a testcase to assert if syncClientToolClassNames is getting picked to the deltastreamer execution context
2022-05-31 20:27:50 +05:30
wangxianghu
58014c147a [HUDI-4160] Make database regex of MaxwellJsonKafkaSourcePostProcessor optional (#5697) 2022-05-28 11:13:24 +04:00
Sagar Sumit
31e13db1f0 [HUDI-4023] Decouple hudi-spark from hudi-utilities-slim-bundle (#5641) 2022-05-26 11:28:49 +05:30
Sivabalan Narayanan
10363c1412 [HUDI-4132] Fixing determining target table schema for delta sync with empty batch (#5648) 2022-05-24 08:17:15 -04:00
Heap
47b764ec33 [HUDI-4134] Fix Method naming consistency issues in FSUtils (#5655) 2022-05-23 15:28:48 -07:00
wangxianghu
2af98303d3 [HUDI-4122] Fix NPE caused by adding kafka nodes (#5632) 2022-05-21 11:12:53 +08:00
Sivabalan Narayanan
7d02b1fd3c [MINOR] Minor fixes to exception log and removing unwanted metrics flush in integ test (#5646) 2022-05-21 07:27:35 +08:00
wqwl611
52e63b39d6 [HUDI-4097] add table info to jobStatus (#5529)
Co-authored-by: wqwl611 <wqwl611@gmail.com>
2022-05-13 21:01:15 -04:00
Sivabalan Narayanan
5c4813f101 [HUDI-4072] Fix NULL schema for empty batches in deltastreamer (#5543) 2022-05-13 17:56:47 +05:30
Sivabalan Narayanan
b10ca7e69f [HUDI-4085] Fixing flakiness with parquet empty batch tests in TestHoodieDeltaStreamer (#5559) 2022-05-11 16:02:54 -04:00
Sivabalan Narayanan
569a76a9a5 [MINOR] fixing flaky tests in deltastreamer tests (#5521) 2022-05-07 15:37:20 -04:00
Sivabalan Narayanan
52fe1c9fae [HUDI-3675] Adding post write termination strategy to deltastreamer continuous mode (#5073)
- Added a postWriteTerminationStrategy to deltastreamer continuous mode. One can enable by setting the appropriate termination strategy using DeltastreamerConfig.postWriteTerminationStrategyClass. If not, continuous mode is expected to run forever.
- Added one concrete impl for termination strategy as NoNewDataTerminationStrategy which shuts down deltastreamer if there is no new data to consume from source for N consecutive rounds.
2022-05-06 09:27:29 -04:00
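
The idea behind NoNewDataTerminationStrategy in the HUDI-3675 entry above (stop after N consecutive rounds with no new data) can be sketched as follows. The class and method names here are hypothetical stand-ins, not the Hudi API; only the "N consecutive empty rounds" rule is taken from the commit message.

```python
class NoNewDataTerminationSketch:
    """Illustrative counter: request shutdown after N consecutive empty rounds."""

    def __init__(self, max_rounds_without_data):
        if max_rounds_without_data < 1:
            raise ValueError("max_rounds_without_data must be at least 1")
        self.max_rounds_without_data = max_rounds_without_data
        self.rounds_without_data = 0

    def should_shutdown(self, records_written):
        """Call once per sync round with the number of records that round wrote."""
        if records_written > 0:
            # Any new data resets the streak of empty rounds.
            self.rounds_without_data = 0
        else:
            self.rounds_without_data += 1
        return self.rounds_without_data >= self.max_rounds_without_data
```

A continuous-mode loop would consult such a check after each write and exit once it returns true; without any strategy configured, the loop runs indefinitely, as the commit message notes.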
qianchutao
d794f4fbf9 [MINOR] Optimize code logic (#5499) 2022-05-05 09:33:06 -07:00
Y Ethan Guo
a1d82b4dc5 [MINOR] Fix CI by ignoring SparkContext error (#5468)
Sets spark.driver.allowMultipleContexts = true when constructing Spark conf in UtilHelpers
2022-04-29 11:19:07 -07:00
watermelon12138
cacbd98687 [HUDI-3945] After the async compaction operation is complete, the task should exit. (#5391)
Co-authored-by: y00617041 <yangxuan42@huawei.com>
2022-04-27 21:16:09 +08:00
Alexey Kudinkin
4b296f79cc [HUDI-3935] Adding config to fallback to enabled Partition Values extraction from Partition path (#5377) 2022-04-21 01:36:19 -07:00
Y Ethan Guo
28fdddfee0 [HUDI-3920] Fix partition path construction in metadata table validator (#5365) 2022-04-19 19:40:09 -04:00
Sagar Sumit
4f44e6aeb5 [HUDI-3899] Drop index to delete pending index instants from timeline if applicable (#5342)
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-04-18 22:28:46 -04:00
Sagar Sumit
1718bcab84 [HUDI-3707] Fix target schema handling in HoodieSparkUtils while creating RDD (#5347) 2022-04-18 13:34:04 -04:00
Sivabalan Narayanan
05dfc39c29 Fixing async clustering job test in TestHoodieDeltaStreamer (#5317) 2022-04-18 17:38:33 +05:30
董可伦
b8e465fdfc [MINOR] Fix typos in log4j-surefire.properties (#5212) 2022-04-15 13:33:37 -07:00
Raymond Xu
9e8664f4d2 [HOTFIX] add missing license (#5322) (#5324) 2022-04-14 12:35:20 -07:00
Vinoth Govindarajan
2d46d5287e [HUDI-3838] Moved the getPartitionColumns logic to driver. (#5303) 2022-04-12 18:03:00 -04:00
Vinoth Govindarajan
d16740976e [HUDI-3838] Implemented drop partition column feature for delta streamer code path (#5294)
* [HUDI-3838] Implemented drop partition column feature for delta streamer code path

* Ensure drop partition table config is updated in hoodie.props

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-04-12 18:10:30 +05:30
Sagar Sumit
3d8fc78c66 [HUDI-3844] Update props in indexer based on table config (#5293) 2022-04-11 18:16:06 -04:00
Y Ethan Guo
63a099c5b7 [HUDI-3847] Fix NPE due to null schema in HoodieMetadataTableValidator (#5284) 2022-04-10 17:59:29 -07:00
Alexey Kudinkin
81b25c543a [HUDI-3825] Fixing Column Stats Index updating sequence (#5267) 2022-04-08 23:14:08 -07:00
Y Ethan Guo
cd2c346df6 [HUDI-3637] Exclude uncommitted log files from metadata table validation (#5234) 2022-04-07 13:03:03 -07:00
Raymond Xu
e96f08f355 Moving to 0.12.0-SNAPSHOT on master branch. 2022-04-06 15:24:10 +08:00
Prashant Wason
b28f0d6ceb [HUDI-3290] Different file formats for the partition metadata file. (#5179)
* [HUDI-3290] Different file formats for the partition metadata file.

Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition.

Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset.

For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting.

* Code review comments

- Adding a new command to migrate from text to base file formats for meta file.
- Reimplementing readFromFS() to first read the text format, then base format
- Avoid extra exists() checks in readFromFS()
- Added unit tests, enabled parquet format across hoodie-hadoop-mr
- Code cleanup, restructuring, naming consistency.

* Wiring in all the other Spark code paths to respect this config

 - Turned on parquet meta format for COW data source tests
 - Removed the deltastreamer command line to keep it shorter

* populate HoodiePartitionMetadata#format after readFromFS()

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-04 08:08:20 -07:00
Sivabalan Narayanan
84064a9b08 [HUDI-3772] Fixing auto adjustment of lock configs for deltastreamer (#5207) 2022-04-02 23:44:10 -07:00
Alexey Kudinkin
cc3737be50 [HUDI-3664] Fixing Column Stats Index composition (#5181)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-04-02 17:15:52 -07:00
Y Ethan Guo
c19f505b5a [HUDI-3784] Improve docs and logs of HoodieMetadataTableValidator (#5216) 2022-04-02 13:16:17 -07:00
Sagar Sumit
28dafa774e [HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693)
- Add a new action called INDEX, whose state transition is described in the RFC.
- Changes in timeline to support the new action.
- Add an index planner in ScheduleIndexActionExecutor.
- Add index plan executor in RunIndexActionExecutor.
- Add 3 APIs in HoodieTableMetadataWriter; a) scheduleIndex: will generate an index plan based on latest completed instant, initialize file groups and add a requested INDEX instant, b) index: executes the index plan and also takes care of writes that happened after indexing was requested, c) dropIndex: will drop index by removing the given metadata partition.
- Add 2 new table configs to serve as the source of truth for inflight and completed indexes.
- Support upgrade/downgrade taking care of the newly added configs.
- Add tool to trigger indexing in HoodieIndexer.
- Handle corner cases related to partial failures.
- Abort gracefully after deleting partition and instant.
- Handle other actions in timeline to consider before catching up
2022-04-01 01:33:12 +05:30
YueZhang
2b60641d17 [HUDI-3635] Fix HoodieMetadataTableValidator around comparison of partition path listing (#5100)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-03-30 17:23:37 -04:00
YueZhang
eae8488536 [HUDI-3647] HoodieMetadataTableValidator: check MDT was initialized at first (#5152)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-03-30 17:18:08 -04:00
harshal
7fa363923c [HUDI-3745] Support for spark datasource options in S3EventsHoodieIncrSource (#5170) 2022-03-30 01:34:49 -04:00
Sivabalan Narayanan
4fed8dd319 [HUDI-3485] Adding scheduler pool configs for async clustering (#5043) 2022-03-29 21:27:45 -04:00
Alexey Kudinkin
e5a2baeed0 [HUDI-3549] Removing dependency on "spark-avro" (#4955)
Hudi will be taking on the promise that its bundles stay compatible with Spark minor versions (for ex 2.4, 3.1, 3.2): meaning that a single build of Hudi (for ex "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc)

To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root-cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR).

Instead of bundling "spark-avro" as a dependency, we will be copying over some of the classes Hudi depends on and maintaining them alongside the Hudi code-base to make sure we're able to provide the aforementioned guarantee. To work around arising compatibility issues we will be applying local patches to guarantee compatibility of Hudi bundles w/in the Spark minor version branches.

The following mapping of Hudi modules to Spark minor branches is currently maintained:

"hudi-spark3" -> 3.2.x
"hudi-spark3.1.x" -> 3.1.x
"hudi-spark2" -> 2.4.x
The following class hierarchies (borrowed from "spark-avro") are maintained w/in these Spark-specific modules to guarantee compatibility with respective minor version branches:

AvroSerializer
AvroDeserializer
AvroUtils
Each of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules.

The SchemaConverters class in turn is shared across all those modules given its relative stability (there are only cosmetic changes from 2.4.4 to 3.2.1).
All of the aforementioned classes have their corresponding scope of visibility limited to corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure broader code-base does not become dependent on them and instead relies on facades abstracting them.

Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in aforementioned minor versions branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI.

Brief change log:
- Removing spark-avro bundling from Hudi by default
- Scaffolded Spark 3.2.x hierarchy
- Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy
- Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy
- Moved ExpressionCodeGen,ExpressionPayload into hudi-spark module
- Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0
- Modified bot.yml to build full matrix of supported Spark versions
- Removed "spark-avro" dependency from all modules
- Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.
2022-03-29 14:44:47 -04:00
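
The module-to-branch mapping listed in the HUDI-3549 entry above implies that any patch version within a minor branch resolves to the same Hudi module. A minimal sketch of that lookup, assuming a hypothetical helper name (the mapping values are copied from the commit message; nothing else is Hudi's own code):

```python
# Mapping as listed in the commit message: Spark minor branch -> Hudi module.
SPARK_MINOR_TO_HUDI_MODULE = {
    "3.2": "hudi-spark3",
    "3.1": "hudi-spark3.1.x",
    "2.4": "hudi-spark2",
}


def hudi_module_for(spark_version):
    """Resolve a full Spark version such as '3.2.1' to its module (hypothetical helper)."""
    parts = spark_version.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected a version like '3.2.1', got {spark_version!r}")
    minor = ".".join(parts[:2])  # the patch component is deliberately ignored
    if minor not in SPARK_MINOR_TO_HUDI_MODULE:
        raise ValueError(f"no module listed for Spark {minor}.x")
    return SPARK_MINOR_TO_HUDI_MODULE[minor]
```

Ignoring the patch component mirrors the stated guarantee: 3.2.0 and 3.2.1 both resolve to the same module.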