241 Commits

Author SHA1 Message Date
Kumud Kumar Srivatsava Tirupati
795a99ba73 [HUDI-4107] Added --sync-tool-classes config option in HoodieMultiTableDeltaStreamer (#5597)
* added --sync-tool-classes config option in multitable delta streamer

* added a testcase to assert if syncClientToolClassNames is getting picked to the deltastreamer execution context
2022-05-31 20:27:50 +05:30
wangxianghu
58014c147a [HUDI-4160] Make database regex of MaxwellJsonKafkaSourcePostProcessor optional (#5697) 2022-05-28 11:13:24 +04:00
Heap
47b764ec33 [HUDI-4134] Fix Method naming consistency issues in FSUtils (#5655) 2022-05-23 15:28:48 -07:00
Sivabalan Narayanan
5c4813f101 [HUDI-4072] Fix NULL schema for empty batches in deltastreamer (#5543) 2022-05-13 17:56:47 +05:30
Sivabalan Narayanan
b10ca7e69f [HUDI-4085] Fixing flakiness with parquet empty batch tests in TestHoodieDeltaStreamer (#5559) 2022-05-11 16:02:54 -04:00
Sivabalan Narayanan
569a76a9a5 [MINOR] fixing flaky tests in deltastreamer tests (#5521) 2022-05-07 15:37:20 -04:00
Sivabalan Narayanan
52fe1c9fae [HUDI-3675] Adding post write termination strategy to deltastreamer continuous mode (#5073)
- Added a postWriteTerminationStrategy to deltastreamer continuous mode. It can be enabled by setting the appropriate termination strategy class via DeltastreamerConfig.postWriteTerminationStrategyClass. If none is set, continuous mode is expected to run forever.
- Added one concrete impl for termination strategy as NoNewDataTerminationStrategy which shuts down deltastreamer if there is no new data to consume from source for N consecutive rounds.
2022-05-06 09:27:29 -04:00
Sagar Sumit
4f44e6aeb5 [HUDI-3899] Drop index to delete pending index instants from timeline if applicable (#5342)
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-04-18 22:28:46 -04:00
Sagar Sumit
1718bcab84 [HUDI-3707] Fix target schema handling in HoodieSparkUtils while creating RDD (#5347) 2022-04-18 13:34:04 -04:00
Sivabalan Narayanan
05dfc39c29 Fixing async clustering job test in TestHoodieDeltaStreamer (#5317) 2022-04-18 17:38:33 +05:30
董可伦
b8e465fdfc [MINOR] Fix typos in log4j-surefire.properties (#5212) 2022-04-15 13:33:37 -07:00
Raymond Xu
9e8664f4d2 [HOTFIX] add missing license (#5322) (#5324) 2022-04-14 12:35:20 -07:00
Sagar Sumit
3d8fc78c66 [HUDI-3844] Update props in indexer based on table config (#5293) 2022-04-11 18:16:06 -04:00
Sagar Sumit
28dafa774e [HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693)
- Add a new action called INDEX, whose state transition is described in the RFC.
- Changes in timeline to support the new action.
- Add an index planner in ScheduleIndexActionExecutor.
- Add index plan executor in RunIndexActionExecutor.
- Add 3 APIs in HoodieTableMetadataWriter; a) scheduleIndex: will generate an index plan based on latest completed instant, initialize file groups and add a requested INDEX instant, b) index: executes the index plan and also takes care of writes that happened after indexing was requested, c) dropIndex: will drop index by removing the given metadata partition.
- Add 2 new table configs to serve as the source of truth for inflight and completed indexes.
- Support upgrade/downgrade taking care of the newly added configs.
- Add tool to trigger indexing in HoodieIndexer.
- Handle corner cases related to partial failures.
- Abort gracefully after deleting partition and instant.
- Handle other actions in timeline to consider before catching up
2022-04-01 01:33:12 +05:30
Raymond Xu
6ccbae4d2a [HUDI-2757] Implement Hudi AWS Glue sync (#5076) 2022-03-28 14:54:59 -04:00
Y Ethan Guo
4ddd094ba2 [HUDI-3689] Disable flaky tests in TestHoodieDeltaStreamer (#5127) 2022-03-24 16:42:44 -07:00
Raymond Xu
ff136658a0 [HUDI-3689] Fix delta streamer tests (#5124) 2022-03-24 14:19:53 -07:00
Raymond Xu
686da41696 [HUDI-3689] Fix UT failures in TestHoodieDeltaStreamer (#5120) 2022-03-24 09:10:33 -07:00
Sagar Sumit
fe2c3989e3 [HUDI-3689] Fix glob path and hive sync in deltastreamer tests (#5117)
* Remove glob pattern basePath from the deltastreamer tests.

* [HUDI-3689] Fix file scheme config

for CI failure in TestHoodieRealTimeRecordReader

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-03-24 15:48:35 +05:30
Rajesh Mahindra
5f570ea151 [HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs (#4175)
- Refactor hive sync tool / config to use reflection and standardize configs

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-03-21 22:56:31 -04:00
wangxianghu
3b59b76952 [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string (#4987)
* [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string

* add ut

* Address comment
2022-03-15 15:06:30 +04:00
wangxianghu
e8918b6c2c [HUDI-3569] Introduce ChainedJsonKafkaSourePostProcessor to support setting multi processors at once (#4969) 2022-03-11 17:49:30 -05:00
wangxianghu
b00180342e [HUDI-3575] Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema in TestSchemaPostProcessor (#5019) 2022-03-11 15:03:42 +04:00
wangxianghu
83cff3afee [HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop columns from schema (#4972)
* [HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop columns from schema

* Fix case sensitivity
2022-03-11 09:30:37 +04:00
Sivabalan Narayanan
9dc6df5dca [HUDI-3595] Fixing NULL schema provider for empty batch (#5002) 2022-03-10 22:52:55 -05:00
Alexey Kudinkin
034addaef5 [HUDI-3396] Make sure BaseFileOnlyViewRelation only reads projected columns (#4818)
NOTE: This change is first part of the series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them, implementations are consistent and performant

This PR is making sure that BaseFileOnlyViewRelation only reads projected columns as well as avoiding unnecessary serde from Row to InternalRow

Brief change log
- Introduced HoodieBaseRDD as a base for all custom RDD impls
- Extracted common fields/methods to HoodieBaseRelation
- Cleaned up and streamlined HoodieBaseFileViewOnlyRelation
- Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions
2022-03-09 21:45:25 -05:00
wangxianghu
548000b0d6 [HUDI-3568] Introduce ChainedSchemaPostProcessor to support setting multi processors at once (#4968) 2022-03-09 11:16:22 +04:00
wangxianghu
c9ffdc493e [HUDI-3525] Introduce JsonkafkaSourceProcessor to support data preprocess before it is transformed to DataSet (#4930) 2022-03-06 15:41:01 -05:00
wangxianghu
4b471772aa [HUDI-3520] Introduce DeleteSupportSchemaPostProcessor to support adding _hoodie_is_deleted column to schema (#4921) 2022-03-06 15:37:09 -05:00
Pratyaksh Sharma
907e60c252 [HUDI-3264]: made schema registry urls configurable with MTDS (#4779) 2022-03-02 15:30:41 -05:00
Alexey Kudinkin
85f47b53df [HUDI-3469] Refactor HoodieTestDataGenerator to provide for reproducible Builds (#4866) 2022-03-01 22:15:26 -08:00
Sivabalan Narayanan
14dbbdf4c7 [HUDI-2189] Adding delete partitions support to DeltaStreamer (#4787) 2022-02-22 00:01:30 -05:00
Sivabalan Narayanan
d36fe24c9e [HUDI-3455] Fixing checkpoint management in hoodie incr source (#4850) 2022-02-21 08:19:57 -05:00
Sivabalan Narayanan
66ac1446dd [MINOR] Moving spark scheduling configs out of DataSourceOptions (#4843) 2022-02-20 13:49:18 -05:00
Y Ethan Guo
9a05940a74 [HUDI-3366] Remove hardcoded logic of disabling metadata table in tests (#4792) 2022-02-15 16:41:47 -05:00
冯健
55777fec05 [HUDI-2413] fix Sql source's checkpoint issue (#3648)
* [HUDI-2413] fix Sql source's checkpoint

* Fixing sql source checkpoint handling

* Fixing docs

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-02-14 08:07:48 +05:30
Y Ethan Guo
6aba00e84f [MINOR] Fix typos in Spark client related classes (#4781) 2022-02-13 06:41:58 -08:00
Sivabalan Narayanan
60831d6906 [HUDI-3361] Fixing missing begin checkpoint in HoodieIncremental pull (#4755) 2022-02-08 12:03:07 -05:00
Sivabalan Narayanan
ab73047958 Adding support for custom scheduler configs with streaming sink (#4762) 2022-02-08 14:44:10 +05:30
Y Ethan Guo
b8601a9f58 [HUDI-2656] Generalize HoodieIndex for flexible record data type (#3893)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-02-03 20:24:04 -08:00
Raymond Xu
caef3d5c58 [HUDI-3330] Remove fixture test tables for multi writer tests (#4704) 2022-02-02 17:50:10 +05:30
jsbali
7ce0f4522b [HUDI-2711] Fallback to fulltable scan for IncrementalRelation if underlying files have been cleared or moved by cleaner (#3946)
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-01-31 23:03:18 -05:00
Sagar Sumit
e78b2f1b55 [HUDI-2943] Complete pending clustering before deltastreamer sync (#4572) 2022-01-28 21:28:04 -05:00
Manoj Govindassamy
f87c47352a [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field (#4449)
* [HUDI-2763] Metadata table records - support for key deduplication and virtual keys
- The backing log format for the metadata table is HFile, a KeyValue type.
Since the key field in the metadata record payload is a duplicate of the
Key in the Cell, the redundant key field in the record can be emptied
to save on the cost.

- HoodieHFileWriter and HoodieHFileDataBlock will now serialize records
with the key field emptied by default. HFile writer tries to find if
the record has metadata payload schema field 'key' and if so it does
the key trimming from the record payload.

- HoodieHFileReader, when reading the serialized records back from disk,
materializes the missing key fields, if any. The HFile reader checks
whether the record has the metadata payload schema field 'key' and, if so,
materializes the key in the record payload.

- Tests have been added to verify the default virtual keys and key
deduplication support for the metadata table records.

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2022-01-26 13:34:04 -05:00
Y Ethan Guo
397795c7d0 [HUDI-3007] Fix issues in HoodieRepairTool (#4564) 2022-01-12 09:03:27 -08:00
Pratyaksh Sharma
a392e9ba46 [HUDI-485] Corrected the check for incremental sql (#2768)
* [HUDI-485]: corrected the check for incremental sql

* [HUDI-485]: added tests

* code review comments addressed

* [HUDI-485]: added happy flow test case
2022-01-12 08:22:07 +05:30
Sivabalan Narayanan
2e561defe9 [HUDI-2947] Fixing checkpoint fetch in deltastreamer (#4485)
* Fixing checkpoint fetch in deltastreamer

* Addressing comments
2022-01-07 22:08:58 +05:30
Sivabalan Narayanan
8718c30324 [HUDI-3165] Enabling InProcessLockProvider for all multi-writer tests instead of FileSystemBasedLockProviderTestClass (#4427) 2022-01-06 13:04:10 -05:00
harshal
6409fc733d [HUDI-2374] Fixing AvroDFSSource does not use the overridden schema to deserialize Avro binaries (#4353) 2021-12-27 23:01:21 -05:00
Sivabalan Narayanan
1a5f8693aa [HUDI-3011] Adding ability to read entire data with HoodieIncrSource with empty checkpoint (#4334)
* Adding ability to read entire data with HoodieIncrSource with empty checkpoint

* Addressing comments
2021-12-22 15:43:06 +05:30