229 Commits

Author SHA1 Message Date
Sagar Sumit
3d8fc78c66 [HUDI-3844] Update props in indexer based on table config (#5293) 2022-04-11 18:16:06 -04:00
Sagar Sumit
28dafa774e [HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693)
- Add a new action called INDEX, whose state transition is described in the RFC.
- Changes in timeline to support the new action.
- Add an index planner in ScheduleIndexActionExecutor.
- Add index plan executor in RunIndexActionExecutor.
- Add 3 APIs in HoodieTableMetadataWriter: a) scheduleIndex: generates an index plan based on the latest completed instant, initializes file groups and adds a requested INDEX instant; b) index: executes the index plan and also takes care of writes that happened after indexing was requested; c) dropIndex: drops an index by removing the given metadata partition.
- Add 2 new table configs to serve as the source of truth for inflight and completed indexes.
- Support upgrade/downgrade taking care of the newly added configs.
- Add tool to trigger indexing in HoodieIndexer.
- Handle corner cases related to partial failures.
- Abort gracefully after deleting partition and instant.
- Handle other actions in timeline to consider before catching up
2022-04-01 01:33:12 +05:30
Raymond Xu
6ccbae4d2a [HUDI-2757] Implement Hudi AWS Glue sync (#5076) 2022-03-28 14:54:59 -04:00
Y Ethan Guo
4ddd094ba2 [HUDI-3689] Disable flaky tests in TestHoodieDeltaStreamer (#5127) 2022-03-24 16:42:44 -07:00
Raymond Xu
ff136658a0 [HUDI-3689] Fix delta streamer tests (#5124) 2022-03-24 14:19:53 -07:00
Raymond Xu
686da41696 [HUDI-3689] Fix UT failures in TestHoodieDeltaStreamer (#5120) 2022-03-24 09:10:33 -07:00
Sagar Sumit
fe2c3989e3 [HUDI-3689] Fix glob path and hive sync in deltastreamer tests (#5117)
* Remove glob pattern basePath from the deltastreamer tests.

* [HUDI-3689] Fix file scheme config

for CI failure in TestHoodieRealTimeRecordReader

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-03-24 15:48:35 +05:30
Rajesh Mahindra
5f570ea151 [HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs (#4175)
- Refactor hive sync tool / config to use reflection and standardize configs

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-03-21 22:56:31 -04:00
wangxianghu
3b59b76952 [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string (#4987)
* [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string

* add ut

* Address comment
2022-03-15 15:06:30 +04:00
wangxianghu
e8918b6c2c [HUDI-3569] Introduce ChainedJsonKafkaSourePostProcessor to support setting multi processors at once (#4969) 2022-03-11 17:49:30 -05:00
wangxianghu
b00180342e [HUDI-3575] Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema in TestSchemaPostProcessor (#5019) 2022-03-11 15:03:42 +04:00
wangxianghu
83cff3afee [HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop columns from schema (#4972)
* [HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop columns from schema

* Fix case sensitivity
2022-03-11 09:30:37 +04:00
Sivabalan Narayanan
9dc6df5dca [HUDI-3595] Fixing NULL schema provider for empty batch (#5002) 2022-03-10 22:52:55 -05:00
Alexey Kudinkin
034addaef5 [HUDI-3396] Make sure BaseFileOnlyViewRelation only reads projected columns (#4818)
NOTE: This change is the first part of a series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them and that the implementations are consistent and performant.

This PR makes sure that BaseFileOnlyViewRelation only reads projected columns and avoids unnecessary serde from Row to InternalRow.

Brief change log
- Introduced HoodieBaseRDD as a base for all custom RDD impls
- Extracted common fields/methods to HoodieBaseRelation
- Cleaned up and streamlined BaseFileOnlyViewRelation
- Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions
2022-03-09 21:45:25 -05:00
wangxianghu
548000b0d6 [HUDI-3568] Introduce ChainedSchemaPostProcessor to support setting multi processors at once (#4968) 2022-03-09 11:16:22 +04:00
wangxianghu
c9ffdc493e [HUDI-3525] Introduce JsonkafkaSourceProcessor to support data preprocess before it is transformed to DataSet (#4930) 2022-03-06 15:41:01 -05:00
wangxianghu
4b471772aa [HUDI-3520] Introduce DeleteSupportSchemaPostProcessor to support adding _hoodie_is_deleted column to schema (#4921) 2022-03-06 15:37:09 -05:00
Pratyaksh Sharma
907e60c252 [HUDI-3264]: made schema registry urls configurable with MTDS (#4779) 2022-03-02 15:30:41 -05:00
Alexey Kudinkin
85f47b53df [HUDI-3469] Refactor HoodieTestDataGenerator to provide for reproducible Builds (#4866) 2022-03-01 22:15:26 -08:00
Sivabalan Narayanan
14dbbdf4c7 [HUDI-2189] Adding delete partitions support to DeltaStreamer (#4787) 2022-02-22 00:01:30 -05:00
Sivabalan Narayanan
d36fe24c9e [HUDI-3455] Fixing checkpoint management in hoodie incr source (#4850) 2022-02-21 08:19:57 -05:00
Sivabalan Narayanan
66ac1446dd [MINOR] Moving spark scheduling configs out of DataSourceOptions (#4843) 2022-02-20 13:49:18 -05:00
Y Ethan Guo
9a05940a74 [HUDI-3366] Remove hardcoded logic of disabling metadata table in tests (#4792) 2022-02-15 16:41:47 -05:00
冯健
55777fec05 [HUDI-2413] fix Sql source's checkpoint issue (#3648)
* [HUDI-2413] fix Sql source's checkpoint

* Fixing sql source checkpoint handling

* Fixing docs

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-02-14 08:07:48 +05:30
Y Ethan Guo
6aba00e84f [MINOR] Fix typos in Spark client related classes (#4781) 2022-02-13 06:41:58 -08:00
Sivabalan Narayanan
60831d6906 [HUDI-3361] Fixing missing begin checkpoint in HoodieIncremental pull (#4755) 2022-02-08 12:03:07 -05:00
Sivabalan Narayanan
ab73047958 Adding support for custom scheduler configs with streaming sink (#4762) 2022-02-08 14:44:10 +05:30
Y Ethan Guo
b8601a9f58 [HUDI-2656] Generalize HoodieIndex for flexible record data type (#3893)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-02-03 20:24:04 -08:00
Raymond Xu
caef3d5c58 [HUDI-3330] Remove fixture test tables for multi writer tests (#4704) 2022-02-02 17:50:10 +05:30
jsbali
7ce0f4522b [HUDI-2711] Fallback to full table scan for IncrementalRelation if underlying files have been cleared or moved by cleaner (#3946)
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-01-31 23:03:18 -05:00
Sagar Sumit
e78b2f1b55 [HUDI-2943] Complete pending clustering before deltastreamer sync (#4572) 2022-01-28 21:28:04 -05:00
Manoj Govindassamy
f87c47352a [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field (#4449)
* [HUDI-2763] Metadata table records - support for key deduplication and virtual keys
- The backing log format for the metadata table is HFile, a KeyValue type.
Since the key field in the metadata record payload is a duplicate of the
Key in the Cell, the redundant key field in the record can be emptied
to save on the cost.

- HoodieHFileWriter and HoodieHFileDataBlock will now serialize records
with the key field emptied by default. HFile writer tries to find if
the record has metadata payload schema field 'key' and if so it does
the key trimming from the record payload.

- HoodieHFileReader, when reading the serialized records back from disk,
materializes the missing keyFields if any. HFile reader tries to
find if the record has metadata payload schema field 'key' and if so
it does the key materialization in the record payload.

- Tests have been added to verify the default virtual keys and key
   deduplication support for the metadata table records.

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2022-01-26 13:34:04 -05:00
Y Ethan Guo
397795c7d0 [HUDI-3007] Fix issues in HoodieRepairTool (#4564) 2022-01-12 09:03:27 -08:00
Pratyaksh Sharma
a392e9ba46 [HUDI-485] Corrected the check for incremental sql (#2768)
* [HUDI-485]: corrected the check for incremental sql

* [HUDI-485]: added tests

* code review comments addressed

* [HUDI-485]: added happy flow test case
2022-01-12 08:22:07 +05:30
Sivabalan Narayanan
2e561defe9 [HUDI-2947] Fixing checkpoint fetch in deltastreamer (#4485)
* Fixing checkpoint fetch in deltastreamer

* Addressing comments
2022-01-07 22:08:58 +05:30
Sivabalan Narayanan
8718c30324 [HUDI-3165] Enabling InProcessLockProvider for all multi-writer tests instead of FileSystemBasedLockProviderTestClass (#4427) 2022-01-06 13:04:10 -05:00
harshal
6409fc733d [HUDI-2374] Fixing AvroDFSSource does not use the overridden schema to deserialize Avro binaries (#4353) 2021-12-27 23:01:21 -05:00
Sivabalan Narayanan
1a5f8693aa [HUDI-3011] Adding ability to read entire data with HoodieIncrSource with empty checkpoint (#4334)
* Adding ability to read entire data with HoodieIncrSource with empty checkpoint

* Addressing comments
2021-12-22 15:43:06 +05:30
Raymond Xu
bb99836841 [HUDI-3052] Fix flaky testJsonKafkaSourceResetStrategy (#4381) 2021-12-18 20:58:51 -05:00
Sivabalan Narayanan
77abb5ccb9 [HUDI-3054] Fixing default lock configs for FileSystemBasedLock and fixing a flaky test (#4374) 2021-12-18 16:15:48 -05:00
Sivabalan Narayanan
47852446e8 [HUDI-3043] De-coupling multi writer tests (#4362) 2021-12-17 21:37:45 -05:00
Sivabalan Narayanan
6eba8345cb [HUDI-3043] Adding some test fixes to continuous mode multi writer tests (#4356) 2021-12-17 15:45:05 -05:00
Sagar Sumit
c7473a7b0c [HUDI-2936] Add data count checks in async clustering tests (#4236) 2021-12-10 09:25:37 -05:00
Sagar Sumit
6dab307e6f [MINOR] Remove redundant and conflicting spark-hive dependency (#4228)
Disable TestHiveSchemaProvider
2021-12-06 17:48:32 -08:00
冯健
734c9f5f2d [HUDI-2418] Support HiveSchemaProvider (#3671)
Co-authored-by: jian.feng <fengjian428@gmial.com>
2021-12-05 00:10:13 -08:00
Sivabalan Narayanan
e483f7c776 [HUDI-2902] Fixing populate meta fields with Hfile writers and Disabling virtual keys by default for metadata table (#4194) 2021-12-03 07:20:21 -05:00
Alexey Kudinkin
6f5d8d04cd [HUDI-2840] Fixed DeltaStreamer to properly respect configuration passed through properties file (#4090)
* Rebased `DFSPropertiesConfiguration` to access Hadoop config in lieu of FS to avoid confusion

* Fixed `readConfig` to take Hadoop's `Configuration` instead of FS;
Fixing usages

* Added test for local FS access

* Rebase to use `FSUtils.getFs`

* Combine properties provided as a file along w/ overrides provided from the CLI

* Added helper utilities to `HoodieClusteringConfig`;
Make sure corresponding config methods fallback to defaults;

* Fixed DeltaStreamer usage to respect properly combined configuration;
Abstracted `HoodieClusteringConfig.from` convenience utility to init Clustering config from `Properties`

* Tidying up

* `lint`

* Reverting changes to `HoodieWriteConfig`

* Tidying up

* Fixed incorrect merge of the props

* Converted `HoodieConfig` to wrap around `Properties` into `TypedProperties`

* Fixed compilation

* Fixed compilation
2021-11-25 14:48:22 -08:00
Sivabalan Narayanan
6a0f079866 [HUDI-2858] Fixing handling of cluster update reject exception in deltastreamer (#4120) 2021-11-26 01:04:07 +05:30
satishm
264e1ce63c [HUDI-1290] fixing mysql debezium source (#4119) 2021-11-25 11:26:59 -05:00
rmahindra123
83f8ed2ae3 [HUDI-1290] Add Debezium Source for deltastreamer (#4063)
* add source for postgres debezium

* Add tests for debezium payload

* Fix test

* Fix test

* Add tests for debezium source

* Add tests for debezium source

* Fix schema for debezium

* Fix checkstyle issues

* Fix config issue for schema registry

* Add mysql source for debezium

* Fix checkstyle issues and tests

* Improve code for merging toasted values

* Improve code for merging toasted values

Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-11-24 17:57:02 -08:00