
353 Commits

Author SHA1 Message Date
wangxianghu
b7a79aa943 [HUDI-3283] Bootstrap support overwrite existing table (#4647) 2022-01-20 14:42:52 +04:00
wangxianghu
3d93e857cc [MINOR] Minor improvement in JsonKafkaSource (#4620) 2022-01-18 11:13:05 +04:00
leesf
5ce45c440b [HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation (#4514)
* Introduce hudi-spark3-common and hudi-spark2-common modules to place classes that would be reused in different spark versions, also introduce hudi-spark3.1.x to support spark 3.1.x.
* Introduce hudi format under hudi-spark2, hudi-spark3, hudi-spark3.1.x modules and change the hudi format in original hudi-spark module to hudi_v1 format.
* Manually tested on Spark 3.1.2 and Spark 3.2.0 SQL.
* Added a README.md file under hudi-spark-datasource module.
2022-01-14 13:42:35 +08:00
Y Ethan Guo
397795c7d0 [HUDI-3007] Fix issues in HoodieRepairTool (#4564) 2022-01-12 09:03:27 -08:00
Sagar Sumit
12e95771ee [HUDI-3235] Fix ClassNotFoundException due to log4j-core dependency (#4574)
- Move log4j-core to top level pom
2022-01-12 11:53:43 -05:00
董可伦
017ddbbfac [MINOR] Fix typos (#4567) 2022-01-11 23:17:10 -08:00
Pratyaksh Sharma
a392e9ba46 [HUDI-485] Corrected the check for incremental sql (#2768)
* [HUDI-485]: corrected the check for incremental sql

* [HUDI-485]: added tests

* code review comments addressed

* [HUDI-485]: added happy flow test case
2022-01-12 08:22:07 +05:30
Sivabalan Narayanan
604d9885f1 [HUDI-3009] making some fixes to S3 incremental source (#4517) 2022-01-09 12:46:52 -05:00
Sagar Sumit
827549949c [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator (#4203)
* [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator

The timestamp-based key generator was returning different values for the row-writer and non-row-writer paths. This patch fixes that; the fix is guarded by a config flag (`hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled`).
2022-01-08 10:22:44 -05:00
Sivabalan Narayanan
2e561defe9 [HUDI-2947] Fixing checkpoint fetch in deltastreamer (#4485)
* Fixing checkpoint fetch in deltastreamer

* Addressing comments
2022-01-07 22:08:58 +05:30
Sivabalan Narayanan
8718c30324 [HUDI-3165] Enabling InProcessLockProvider for all multi-writer tests instead of FileSystemBasedLockProviderTestClass (#4427) 2022-01-06 13:04:10 -05:00
hehexiaoduantui
50fa5a6aa7 Update HiveIncrementalPuller to configure filesystem (#4431)
* Update HiveIncrementalPuller.java

fix get FileSystem bug

* Update HiveIncrementalPuller.java

fix error

* Update HiveIncrementalPuller.java

fix error
2022-01-06 13:19:30 +05:30
Vinish Reddy
eee715b3ff [HUDI-3168] Fixing null schema with empty commit in incremental relation (#4513) 2022-01-05 11:43:10 -05:00
harshal
6409fc733d [HUDI-2374] Fixing AvroDFSSource does not use the overridden schema to deserialize Avro binaries (#4353) 2021-12-27 23:01:21 -05:00
Sivabalan Narayanan
1a5f8693aa [HUDI-3011] Adding ability to read entire data with HoodieIncrSource with empty checkpoint (#4334)
* Adding ability to read entire data with HoodieIncrSource with empty checkpoint

* Addressing comments
2021-12-22 15:43:06 +05:30
Raymond Xu
bb99836841 [HUDI-3052] Fix flaky testJsonKafkaSourceResetStrategy (#4381) 2021-12-18 20:58:51 -05:00
Sivabalan Narayanan
77abb5ccb9 [HUDI-3054] Fixing default lock configs for FileSystemBasedLock and fixing a flaky test (#4374) 2021-12-18 16:15:48 -05:00
Sivabalan Narayanan
47852446e8 [HUDI-3043] De-coupling multi writer tests (#4362) 2021-12-17 21:37:45 -05:00
Sivabalan Narayanan
6eba8345cb [HUDI-3043] Adding some test fixes to continuous mode multi writer tests (#4356) 2021-12-17 15:45:05 -05:00
Y Ethan Guo
b5f05fd153 [HUDI-2906] Add a repair util to clean up dangling data and log files (#4278) 2021-12-11 00:16:05 -08:00
rmahindra123
9797fdfbb2 [HUDI-2974] Make the prefix for metrics name configurable (#4274)
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-12-10 19:42:20 -08:00
Yuwei XIAO
f194566ed4 [HUDI-2849] Improve SparkUI job description for write path (#4222) 2021-12-10 23:22:37 +08:00
Sagar Sumit
c7473a7b0c [HUDI-2936] Add data count checks in async clustering tests (#4236) 2021-12-10 09:25:37 -05:00
Sagar Sumit
6dab307e6f [MINOR] Remove redundant and conflicting spark-hive dependency (#4228)
Disable TestHiveSchemaProvider
2021-12-06 17:48:32 -08:00
冯健
734c9f5f2d [HUDI-2418] Support HiveSchemaProvider (#3671)
Co-authored-by: jian.feng <fengjian428@gmial.com>
2021-12-05 00:10:13 -08:00
ForwardXu
63b15607ff [HUDI-2937] Introduce a pulsar implementation of hoodie write commit … (#4217)
* [HUDI-2937] Introduce a pulsar implementation of hoodie write commit callback
2021-12-05 11:51:06 +04:00
vinoth chandar
36b69d8033 [HUDI-2935] Remove special casing of clustering in deltastreamer checkpoint retrieval (#4216)
- We now seek backwards to find the checkpoint
- No need to return empty anymore
2021-12-04 17:16:11 +08:00
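A minimal sketch of what "seek backwards to find the checkpoint" could look like, in plain Java. The class name, the list-of-maps representation of commit metadata, and the key name are assumptions made for illustration; this is not the deltastreamer's actual code.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: walks commit metadata from newest to oldest and
// returns the first checkpoint value it finds.
final class CheckpointLookupSketch {
    // Assumed metadata key name, used here only for illustration.
    static final String CHECKPOINT_KEY = "deltastreamer.checkpoint.key";

    // commitsOldestFirst: one "extra metadata" map per commit, oldest first.
    static Optional<String> latestCheckpoint(List<Map<String, String>> commitsOldestFirst) {
        for (int i = commitsOldestFirst.size() - 1; i >= 0; i--) {
            String value = commitsOldestFirst.get(i).get(CHECKPOINT_KEY);
            if (value != null && !value.isEmpty()) {
                return Optional.of(value); // newest commit that carries a checkpoint
            }
        }
        return Optional.empty(); // no commit on the timeline carries a checkpoint
    }
}
```

With this shape, a commit that carries no checkpoint (for example one written by a table service) is simply skipped rather than treated as a special case.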
Sivabalan Narayanan
e483f7c776 [HUDI-2902] Fixing populate meta fields with Hfile writers and Disabling virtual keys by default for metadata table (#4194) 2021-12-03 07:20:21 -05:00
yuzhao.cyz
a1d0ff4209 Moving to 0.11.0-SNAPSHOT on master branch. 2021-11-27 17:22:10 +08:00
Manoj Govindassamy
3d75aca40d [HUDI-2850] Fixing Clustering CLI - schedule and run command fixes to avoid NumberFormatException (#4101) 2021-11-26 07:17:23 -05:00
Alexey Kudinkin
6f5d8d04cd [HUDI-2840] Fixed DeltaStreamer to properly respect configuration passed through properties file (#4090)
* Rebased `DFSPropertiesConfiguration` to access Hadoop config in lieu of FS to avoid confusion

* Fixed `readConfig` to take Hadoop's `Configuration` instead of FS;
Fixing usages

* Added test for local FS access

* Rebase to use `FSUtils.getFs`

* Combine properties provided as a file along w/ overrides provided from the CLI

* Added helper utilities to `HoodieClusteringConfig`;
Make sure corresponding config methods fallback to defaults;

* Fixed DeltaStreamer usage to respect properly combined configuration;
Abstracted `HoodieClusteringConfig.from` convenience utility to init Clustering config from `Properties`

* Tidying up

* `lint`

* Reverting changes to `HoodieWriteConfig`

* Tidying up

* Fixed incorrect merge of the props

* Converted `HoodieConfig` to wrap around `Properties` into `TypedProperties`

* Fixed compilation
2021-11-25 14:48:22 -08:00
Sivabalan Narayanan
6a0f079866 [HUDI-2858] Fixing handling of cluster update reject exception in deltastreamer (#4120) 2021-11-26 01:04:07 +05:30
satishm
264e1ce63c [HUDI-1290] fixing mysql debezium source (#4119) 2021-11-25 11:26:59 -05:00
rmahindra123
83f8ed2ae3 [HUDI-1290] Add Debezium Source for deltastreamer (#4063)
* add source for postgres debezium

* Add tests for debezium payload

* Fix test

* Add tests for debezium source

* Fix schema for debezium

* Fix checkstyle issues

* Fix config issue for schema registry

* Add mysql source for debezium

* Fix checkstyle issues and tests

* Improve code for merging toasted values

Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-11-24 17:57:02 -08:00
Y Ethan Guo
bef373fa1d [MINOR] Fix build failure due to checkstyle issues (#4111) 2021-11-24 17:17:46 -08:00
Sivabalan Narayanan
435ea1543c [HUDI-2793] Fixing deltastreamer checkpoint fetch/copy over (#4034)
- Removed the copy over logic in transaction utils. Deltastreamer will go back to previous commits and get the checkpoint value.
2021-11-24 18:26:40 -05:00
Y Ethan Guo
ca9bfa2a40 [HUDI-2332] Add clustering and compaction in Kafka Connect Sink (#3857)
* [HUDI-2332] Add clustering and compaction in Kafka Connect Sink

* Disable validation check on instant time for compaction and adjust configs

* Add javadocs

* Add clustering and compaction config

* Fix transaction causing missing records in the target table

* Add debugging logs

* Fix kafka offset sync in participant

* Adjust how clustering and compaction are configured in kafka-connect

* Fix clustering strategy

* Remove irrelevant changes from other published PRs

* Update clustering logic and others

* Update README

* Fix test failures

* Fix indentation

* Fix clustering config

* Add JavaCustomColumnsSortPartitioner and make async compaction enabled by default

* Add test for JavaCustomColumnsSortPartitioner

* Add more changes after IDE sync

* Update README with clarification

* Fix clustering logic after rebasing

* Remove unrelated changes
2021-11-23 14:23:28 +05:30
Y Ethan Guo
772af935d5 [HUDI-2737] Use earliest instant by default for async compaction and clustering jobs (#3991)
Address review comments

Fix test failures

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2021-11-23 06:49:41 +05:30
Sivabalan Narayanan
fc9ca6a07a [HUDI-2559] Converting commit timestamp format to millisecs (#4024)
- Adds support for generating commit timestamps with millisecs granularity. 
- Older commit timestamps (in secs granularity) will be suffixed with 999 and parsed with millisecs format.
2021-11-22 11:44:38 -05:00
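The timestamp handling described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not Hudi's implementation: the class and method names are hypothetical, and the `yyyyMMddHHmmss` layout for seconds-granularity instants is assumed. Only the "suffix with 999, then parse with the millisecond format" rule comes from the commit message.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

// Hypothetical helper illustrating the seconds-to-millis instant conversion.
final class InstantTimeSketch {
    static final int SECS_LENGTH = 14;   // assumed layout: yyyyMMddHHmmss
    static final int MILLIS_LENGTH = 17; // assumed layout: yyyyMMddHHmmssSSS

    // Built with a builder so the three-digit millisecond field is fixed width.
    static final DateTimeFormatter MILLIS_FORMAT = new DateTimeFormatterBuilder()
            .appendPattern("yyyyMMddHHmmss")
            .appendValue(ChronoField.MILLI_OF_SECOND, 3)
            .toFormatter();

    // Older, seconds-granularity instants get "999" appended.
    static String toMillisGranularity(String instant) {
        if (instant.length() == SECS_LENGTH) {
            return instant + "999";
        }
        if (instant.length() == MILLIS_LENGTH) {
            return instant;
        }
        throw new IllegalArgumentException("Unexpected instant length: " + instant);
    }

    // Every instant is then parsed with the single millisecond format.
    static LocalDateTime parse(String instant) {
        return LocalDateTime.parse(toMillisGranularity(instant), MILLIS_FORMAT);
    }
}
```

Normalizing first means the rest of the code only ever deals with one timestamp format.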
Sagar Sumit
89452063b4 [MINOR] Fix instant parsing in HoodieClusteringJob (#4071) 2021-11-22 08:57:44 -05:00
zhangyue19921010
a2c91a7a9b [HUDI-2533] New option for hoodieClusteringJob to check, rollback and re-execute the last failed clustering job (#3765)
* Coding finished; unit tests still to be added

* Add unit tests

* code review

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-11-22 16:30:33 +05:30
董可伦
2533a9cc17 [MINOR] Fix typos (#4053) 2021-11-21 16:34:59 +08:00
dufeng1010
305d160081 [MINOR] optimize in constructor of inputbatch class (#4040)
Co-authored-by: 闫杜峰 <yandufeng@sinochem.com>
2021-11-21 10:11:01 +08:00
Harsha Teja Kanna
f4b974ac7b [HUDI-2742] Added S3 object filter to support multiple S3EventsHoodieIncrSources single S3 meta table (#4025) 2021-11-20 14:54:21 +05:30
Manoj Govindassamy
459b34240b [HUDI-2593] Virtual keys support for metadata table (#3968)
- Metadata table today has virtual keys disabled, thereby populating the metafields
  for each record written out and increasing the overall storage space used. Hereby
  adding virtual keys support for metadata table so that metafields are disabled
  for metadata table records.

- Adding a custom KeyGenerator for Metadata table so as to not rely on the
  default Base/SimpleKeyGenerators which currently look for record key
  and partition field set in the table config.

- AbstractHoodieLogRecordReader's version of processing next data block and
  createHoodieRecord() will be a generic version and making the derived class
  HoodieMetadataMergedLogRecordReader take care of the special creation of
  records from explicitly passed in partition names.
2021-11-19 18:11:29 -05:00
wenningd
24def0b30d [HUDI-2362] Add external config file support (#3416)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2021-11-18 01:59:26 -08:00
davehagman
dfe3b84715 [HUDI-2579] Make deltastreamer checkpoint state merging more explicit (#3820)
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2021-11-09 17:37:59 -05:00
Prashant Wason
b7ee341e14 [HUDI-1794] Moved static COMMIT_FORMATTER to thread local variable as SimpleDateFormat is not thread safe. (#2819) 2021-11-05 09:31:42 -04:00
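As a rough illustration of the change named in the commit above: `SimpleDateFormat` keeps mutable state and is not safe to share across threads, so one common fix is to give each thread its own instance through a `ThreadLocal`. The class name and the date pattern below are assumptions for the sketch, not the project's actual code.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical holder for a per-thread commit timestamp formatter.
final class CommitFormatterSketch {
    // Each thread lazily creates and then reuses its own SimpleDateFormat.
    private static final ThreadLocal<SimpleDateFormat> COMMIT_FORMATTER =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyyMMddHHmmss")); // assumed pattern

    static String format(Date date) {
        return COMMIT_FORMATTER.get().format(date);
    }
}
```

Callers keep a single static entry point, while no formatter instance is ever touched by two threads at once.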
Sagar Sumit
5b1992a92d [HUDI-1500] Support replace commit in DeltaSync with commit metadata preserved (#3802) 2021-10-29 13:09:09 -04:00
Raymond Xu
d8560377c3 [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter (#3849)
Remove the logic of using deltastreamer to prep test table. Use fixture (compressed test table) instead.
2021-10-24 21:14:39 -07:00