1
0
Commit Graph

221 Commits

Author SHA1 Message Date
wangxianghu
3b59b76952 [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string (#4987)
* [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string

* add ut

* Address comment
2022-03-15 15:06:30 +04:00
wangxianghu
e8918b6c2c [HUDI-3569] Introduce ChainedJsonKafkaSourePostProcessor to support setting multi processors at once (#4969) 2022-03-11 17:49:30 -05:00
wangxianghu
b00180342e [HUDI-3575] Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema in TestSchemaPostProcessor (#5019) 2022-03-11 15:03:42 +04:00
wangxianghu
83cff3afee [HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop columns from schema (#4972)
* [HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop columns from schema

* Fix case sensitivity
2022-03-11 09:30:37 +04:00
Sivabalan Narayanan
9dc6df5dca [HUDI-3595] Fixing NULL schema provider for empty batch (#5002) 2022-03-10 22:52:55 -05:00
Alexey Kudinkin
034addaef5 [HUDI-3396] Make sure BaseFileOnlyViewRelation only reads projected columns (#4818)
NOTE: This change is first part of the series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them, implementations are consistent and performant

This PR is making sure that BaseFileOnlyViewRelation only reads projected columns as well as avoiding unnecessary serde from Row to InternalRow

Brief change log
- Introduced HoodieBaseRDD as a base for all custom RDD impls
- Extracted common fields/methods to HoodieBaseRelation
- Cleaned up and streamlined HoodieBaseFileViewOnlyRelation
- Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions
2022-03-09 21:45:25 -05:00
wangxianghu
548000b0d6 [HUDI-3568] Introduce ChainedSchemaPostProcessor to support setting multi processors at once (#4968) 2022-03-09 11:16:22 +04:00
wangxianghu
c9ffdc493e [HUDI-3525] Introduce JsonkafkaSourceProcessor to support data preprocess before it is transformed to DataSet (#4930) 2022-03-06 15:41:01 -05:00
wangxianghu
4b471772aa [HUDI-3520] Introduce DeleteSupportSchemaPostProcessor to support adding _hoodie_is_deleted column to schema (#4921) 2022-03-06 15:37:09 -05:00
Pratyaksh Sharma
907e60c252 [HUDI-3264]: made schema registry urls configurable with MTDS (#4779) 2022-03-02 15:30:41 -05:00
Alexey Kudinkin
85f47b53df [HUDI-3469] Refactor HoodieTestDataGenerator to provide for reproducible Builds (#4866) 2022-03-01 22:15:26 -08:00
Sivabalan Narayanan
14dbbdf4c7 [HUDI-2189] Adding delete partitions support to DeltaStreamer (#4787) 2022-02-22 00:01:30 -05:00
Sivabalan Narayanan
d36fe24c9e [HUDI-3455] Fixing checkpoint management in hoodie incr source (#4850) 2022-02-21 08:19:57 -05:00
Sivabalan Narayanan
66ac1446dd [MINOR] Moving spark scheduling configs out of DataSourceOptions (#4843) 2022-02-20 13:49:18 -05:00
Y Ethan Guo
9a05940a74 [HUDI-3366] Remove hardcoded logic of disabling metadata table in tests (#4792) 2022-02-15 16:41:47 -05:00
冯健
55777fec05 [HUDI-2413] fix Sql source's checkpoint issue (#3648)
* [HUDI-2413] fix Sql source's checkpoint

* Fixing sql source checkpoint handling

* Fixing docs

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-02-14 08:07:48 +05:30
Y Ethan Guo
6aba00e84f [MINOR] Fix typos in Spark client related classes (#4781) 2022-02-13 06:41:58 -08:00
Sivabalan Narayanan
60831d6906 [HUDI-3361] Fixing missing begin checkpoint in HoodieIncremental pull (#4755) 2022-02-08 12:03:07 -05:00
Sivabalan Narayanan
ab73047958 Adding support for custom scheduler configs with streaming sink (#4762) 2022-02-08 14:44:10 +05:30
Y Ethan Guo
b8601a9f58 [HUDI-2656] Generalize HoodieIndex for flexible record data type (#3893)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-02-03 20:24:04 -08:00
Raymond Xu
caef3d5c58 [HUDI-3330] Remove fixture test tables for multi writer tests (#4704) 2022-02-02 17:50:10 +05:30
jsbali
7ce0f4522b [HUDI-2711] Fallback to fulltable scan for IncrementalRelation if underlying files have been cleared or moved by cleaner (#3946)
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-01-31 23:03:18 -05:00
Sagar Sumit
e78b2f1b55 [HUDI-2943] Complete pending clustering before deltastreamer sync (#4572) 2022-01-28 21:28:04 -05:00
Manoj Govindassamy
f87c47352a [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field (#4449)
* [HUDI-2763] Metadata table records - support for key deduplication and virtual keys
- The backing log format for the metadata table is HFile, a KeyValue type.
Since the key field in the metadata record payload is a duplicate of the
Key in the Cell, the redundant key field in the record can be emptied
to save on the cost.

- HoodieHFileWriter and HoodieHFileDataBlock will now serialize records
with the key field emptied by default. HFile writer tries to find if
the record has metadata payload schema field 'key' and if so it does
the key trimming from the record payload.

- HoodieHFileReader when reading the serialized records back from disk,
it materializes the missing keyFields if any. HFile reader tries to
find if the record has metadata payload schema fiels 'key' and if so
it does the key materialization in the record payload.

- Tests have been added to verify the default virtual keys and key
   deduplication support for the metadata table records.

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2022-01-26 13:34:04 -05:00
Y Ethan Guo
397795c7d0 [HUDI-3007] Fix issues in HoodieRepairTool (#4564) 2022-01-12 09:03:27 -08:00
Pratyaksh Sharma
a392e9ba46 [HUDI-485] Corrected the check for incremental sql (#2768)
* [HUDI-485]: corrected the check for incremental sql

* [HUDI-485]: added tests

* code review comments addressed

* [HUDI-485]: added happy flow test case
2022-01-12 08:22:07 +05:30
Sivabalan Narayanan
2e561defe9 [HUDI-2947] Fixing checkpoint fetch in detlastreamer (#4485)
* Fixing checkpoint fetch in detlastreamer

* Addressing comments
2022-01-07 22:08:58 +05:30
Sivabalan Narayanan
8718c30324 [HUDI-3165] Enabling InProcessLockProvider for all multi-writer tests instead of FileSystemBasedLockProviderTestClass (#4427) 2022-01-06 13:04:10 -05:00
harshal
6409fc733d [HUDI-2374] Fixing AvroDFSSource does not use the overridden schema to deserialize Avro binaries (#4353) 2021-12-27 23:01:21 -05:00
Sivabalan Narayanan
1a5f8693aa [HUDI-3011] Adding ability to read entire data with HoodieIncrSource with empty checkpoint (#4334)
* Adding ability to read entire data with HoodieIncrSource with empty checkpoint

* Addressing comments
2021-12-22 15:43:06 +05:30
Raymond Xu
bb99836841 [HUDI-3052] Fix flaky testJsonKafkaSourceResetStrategy (#4381) 2021-12-18 20:58:51 -05:00
Sivabalan Narayanan
77abb5ccb9 [HUDI-3054] Fixing default lock configs for FileSystemBasedLock and fixing a flaky test (#4374) 2021-12-18 16:15:48 -05:00
Sivabalan Narayanan
47852446e8 [HUDI-3043] De-coupling multi writer tests (#4362) 2021-12-17 21:37:45 -05:00
Sivabalan Narayanan
6eba8345cb [HUDI-3043] Adding some test fixes to continuous mode multi writer tests (#4356) 2021-12-17 15:45:05 -05:00
Sagar Sumit
c7473a7b0c [HUDI-2936] Add data count checks in async clustering tests (#4236) 2021-12-10 09:25:37 -05:00
Sagar Sumit
6dab307e6f [MINOR] Remove redundant and conflicting spark-hive dependency (#4228)
Disable TestHiveSchemaProvider
2021-12-06 17:48:32 -08:00
冯健
734c9f5f2d [HUDI-2418] Support HiveSchemaProvider (#3671)
Co-authored-by: jian.feng <fengjian428@gmial.com>
2021-12-05 00:10:13 -08:00
Sivabalan Narayanan
e483f7c776 [HUDI-2902] Fixing populate meta fields with Hfile writers and Disabling virtual keys by default for metadata table (#4194) 2021-12-03 07:20:21 -05:00
Alexey Kudinkin
6f5d8d04cd [HUDI-2840] Fixed DeltaStreaemer to properly respect configuration passed t/h properties file (#4090)
* Rebased `DFSPropertiesConfiguration` to access Hadoop config in liue of FS to avoid confusion

* Fixed `readConfig` to take Hadoop's `Configuration` instead of FS;
Fixing usages

* Added test for local FS access

* Rebase to use `FSUtils.getFs`

* Combine properties provided as a file along w/ overrides provided from the CLI

* Added helper utilities to `HoodieClusteringConfig`;
Make sure corresponding config methods fallback to defaults;

* Fixed DeltaStreamer usage to respect properly combined configuration;
Abstracted `HoodieClusteringConfig.from` convenience utility to init Clustering config from `Properties`

* Tidying up

* `lint`

* Reverting changes to `HoodieWriteConfig`

* Tdiying up

* Fixed incorrect merge of the props

* Converted `HoodieConfig` to wrap around `Properties` into `TypedProperties`

* Fixed compilation

* Fixed compilation
2021-11-25 14:48:22 -08:00
Sivabalan Narayanan
6a0f079866 [HUDI-2858] Fixing handling of cluster update reject exception in deltastreamer (#4120) 2021-11-26 01:04:07 +05:30
satishm
264e1ce63c [HUDI-1290] fixing mysql debezium source (#4119) 2021-11-25 11:26:59 -05:00
rmahindra123
83f8ed2ae3 [HUDI-1290] Add Debezium Source for deltastreamer (#4063)
* add source for postgres debezium

* Add tests for debezium payload

* Fix test

* Fix test

* Add tests for debezium source

* Add tests for debezium source

* Fix schema for debezium

* Fix checkstyle issues

* Fix config issue for schema registry

* Add mysql source for debezium

* Fix checkstyle issues an tests

* Improve code for merging toasted values

* Improve code for merging toasted values

Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-11-24 17:57:02 -08:00
Y Ethan Guo
bef373fa1d [MINOR] Fix build failure due to checkstyle issues (#4111) 2021-11-24 17:17:46 -08:00
Sivabalan Narayanan
435ea1543c [HUDI-2793] Fixing deltastreamer checkpoint fetch/copy over (#4034)
- Removed the copy over logic in transaction utils. Deltastreamer will go back to previous commits and get the checkpoint value.
2021-11-24 18:26:40 -05:00
Y Ethan Guo
772af935d5 [HUDI-2737] Use earliest instant by default for async compaction and clustering jobs (#3991)
Address review comments

Fix test failures

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2021-11-23 06:49:41 +05:30
Sivabalan Narayanan
fc9ca6a07a [HUDI-2559] Converting commit timestamp format to millisecs (#4024)
- Adds support for generating commit timestamps with millisecs granularity. 
- Older commit timestamps (in secs granularity) will be suffixed with 999 and parsed with millisecs format.
2021-11-22 11:44:38 -05:00
zhangyue19921010
a2c91a7a9b [HUDI-2533] New option for hoodieClusteringJob to check, rollback and re-execute the last failed clustering job (#3765)
* coding finished and need to do uts

* add uts

* code review

* code review

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-11-22 16:30:33 +05:30
Manoj Govindassamy
459b34240b [HUDI-2593] Virtual keys support for metadata table (#3968)
- Metadata table today has virtual keys disabled, thereby populating the metafields
  for each record written out and increasing the overall storage space used. Hereby
  adding virtual keys support for metadata table so that metafields are disabled
  for metadata table records.

- Adding a custom KeyGenerator for Metadata table so as to not rely on the
  default Base/SimpleKeyGenerators which currently look for record key
  and partition field set in the table config.

- AbstractHoodieLogRecordReader's version of processing next data block and
  createHoodieRecord() will be a generic version and making the derived class
  HoodieMetadataMergedLogRecordReader take care of the special creation of
  records from explictly passed in partition names.
2021-11-19 18:11:29 -05:00
wenningd
24def0b30d [HUDI-2362] Add external config file support (#3416)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2021-11-18 01:59:26 -08:00
davehagman
dfe3b84715 [HUDI-2579] Make deltastreamer checkpoint state merging more explicit (#3820)
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2021-11-09 17:37:59 -05:00