
325 Commits

Author SHA1 Message Date
yuzhao.cyz
a1d0ff4209 Moving to 0.11.0-SNAPSHOT on master branch. 2021-11-27 17:22:10 +08:00
Manoj Govindassamy
3d75aca40d [HUDI-2850] Fixing Clustering CLI - schedule and run command fixes to avoid NumberFormatException (#4101) 2021-11-26 07:17:23 -05:00
Alexey Kudinkin
6f5d8d04cd [HUDI-2840] Fixed DeltaStreamer to properly respect configuration passed through properties file (#4090)
* Rebased `DFSPropertiesConfiguration` to access Hadoop config in lieu of FS to avoid confusion

* Fixed `readConfig` to take Hadoop's `Configuration` instead of FS;
Fixing usages

* Added test for local FS access

* Rebase to use `FSUtils.getFs`

* Combine properties provided as a file along w/ overrides provided from the CLI

* Added helper utilities to `HoodieClusteringConfig`;
Make sure corresponding config methods fallback to defaults;

* Fixed DeltaStreamer usage to respect properly combined configuration;
Abstracted `HoodieClusteringConfig.from` convenience utility to init Clustering config from `Properties`

* Tidying up

* `lint`

* Reverting changes to `HoodieWriteConfig`

* Tidying up

* Fixed incorrect merge of the props

* Converted `HoodieConfig` to wrap around `Properties` into `TypedProperties`

* Fixed compilation

* Fixed compilation
2021-11-25 14:48:22 -08:00
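
The property merge described in 6f5d8d04cd (a properties file combined with overrides from the CLI) can be illustrated with a minimal Python sketch. This is not Hudi's Java implementation: the function names and the `key=value` parsing rules are assumptions made for illustration, and so is the precedence rule (CLI overrides win over file values), which is inferred from the word "overrides" in the commit message.

```python
def parse_properties(text):
    """Parse simple `key=value` lines, skipping blank lines and `#` comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def combine_properties(file_text, cli_overrides):
    """Start from the file's properties, then apply CLI overrides on top.

    Assumption: an override replaces a file value with the same key.
    """
    merged = parse_properties(file_text)
    for override in cli_overrides:
        key, _, value = override.partition("=")
        merged[key.strip()] = value.strip()
    return merged
```

For example, `combine_properties("a=1\nb=2\n", ["b=3"])` keeps `a` from the file and takes `b` from the override.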
Sivabalan Narayanan
6a0f079866 [HUDI-2858] Fixing handling of cluster update reject exception in deltastreamer (#4120) 2021-11-26 01:04:07 +05:30
satishm
264e1ce63c [HUDI-1290] fixing mysql debezium source (#4119) 2021-11-25 11:26:59 -05:00
rmahindra123
83f8ed2ae3 [HUDI-1290] Add Debezium Source for deltastreamer (#4063)
* add source for postgres debezium

* Add tests for debezium payload

* Fix test

* Fix test

* Add tests for debezium source

* Add tests for debezium source

* Fix schema for debezium

* Fix checkstyle issues

* Fix config issue for schema registry

* Add mysql source for debezium

* Fix checkstyle issues and tests

* Improve code for merging toasted values

* Improve code for merging toasted values

Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-11-24 17:57:02 -08:00
Y Ethan Guo
bef373fa1d [MINOR] Fix build failure due to checkstyle issues (#4111) 2021-11-24 17:17:46 -08:00
Sivabalan Narayanan
435ea1543c [HUDI-2793] Fixing deltastreamer checkpoint fetch/copy over (#4034)
- Removed the copy over logic in transaction utils. Deltastreamer will go back to previous commits and get the checkpoint value.
2021-11-24 18:26:40 -05:00
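
The lookup described in 435ea1543c (going back through previous commits to find the checkpoint value) might look roughly like the Python sketch below. The commit representation, the `extra_metadata` field and the `checkpoint` key are all hypothetical stand-ins, not Hudi's actual data model.

```python
def latest_checkpoint(commits, key="checkpoint"):
    """Return the checkpoint from the most recent commit that carries one.

    `commits` is ordered oldest to newest. Each commit is a dict with an
    optional `extra_metadata` dict (hypothetical field names).
    """
    for commit in reversed(commits):
        value = commit.get("extra_metadata", {}).get(key)
        if value:
            return value
    return None  # no commit in the timeline carries a checkpoint
```

The design point is that commits without a checkpoint do not need it copied over; the reader simply walks back until it finds one.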
Y Ethan Guo
ca9bfa2a40 [HUDI-2332] Add clustering and compaction in Kafka Connect Sink (#3857)
* [HUDI-2332] Add clustering and compaction in Kafka Connect Sink

* Disable validation check on instant time for compaction and adjust configs

* Add javadocs

* Add clustering and compaction config

* Fix transaction causing missing records in the target table

* Add debugging logs

* Fix kafka offset sync in participant

* Adjust how clustering and compaction are configured in kafka-connect

* Fix clustering strategy

* Remove irrelevant changes from other published PRs

* Update clustering logic and others

* Update README

* Fix test failures

* Fix indentation

* Fix clustering config

* Add JavaCustomColumnsSortPartitioner and make async compaction enabled by default

* Add test for JavaCustomColumnsSortPartitioner

* Add more changes after IDE sync

* Update README with clarification

* Fix clustering logic after rebasing

* Remove unrelated changes
2021-11-23 14:23:28 +05:30
Y Ethan Guo
772af935d5 [HUDI-2737] Use earliest instant by default for async compaction and clustering jobs (#3991)
Address review comments

Fix test failures

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2021-11-23 06:49:41 +05:30
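
The default in 772af935d5 (use the earliest instant when none is specified) can be sketched as follows. This is an illustrative Python sketch, not the Hudi code: `pick_instant` is a hypothetical helper, and it assumes instant times are fixed-width timestamp strings so that string order matches chronological order.

```python
def pick_instant(pending_instants, requested=None):
    """Choose which pending instant an async job should work on.

    An explicitly requested instant wins; otherwise fall back to the
    earliest pending one. Returns None when nothing is pending.
    """
    if requested is not None:
        return requested
    if not pending_instants:
        return None
    # Fixed-width timestamp strings sort chronologically.
    return min(pending_instants)
```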
Sivabalan Narayanan
fc9ca6a07a [HUDI-2559] Converting commit timestamp format to millisecs (#4024)
- Adds support for generating commit timestamps with millisecs granularity. 
- Older commit timestamps (in secs granularity) will be suffixed with 999 and parsed with millisecs format.
2021-11-22 11:44:38 -05:00
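
The compatibility rule in fc9ca6a07a (older timestamps in seconds granularity are suffixed with 999 and parsed in the milliseconds format) can be sketched in Python as below. The 14-digit `yyyyMMddHHmmss` and 17-digit `yyyyMMddHHmmssSSS` layouts and both function names are assumptions made for this illustration; this is not the Hudi implementation.

```python
from datetime import datetime

SECS_LEN = 14    # assumed layout: yyyyMMddHHmmss
MILLIS_LEN = 17  # assumed layout: yyyyMMddHHmmssSSS


def to_millis_instant(ts):
    """Normalize a commit timestamp string to milliseconds granularity."""
    if len(ts) == MILLIS_LEN:
        return ts
    if len(ts) == SECS_LEN:
        # Older, seconds-granularity timestamps get the suffix 999.
        return ts + "999"
    raise ValueError(f"unexpected timestamp length: {ts!r}")


def parse_instant(ts):
    """Parse either granularity into a datetime."""
    padded = to_millis_instant(ts)
    base = datetime.strptime(padded[:SECS_LEN], "%Y%m%d%H%M%S")
    return base.replace(microsecond=int(padded[SECS_LEN:]) * 1000)
```

Padding with 999 rather than 000 places an old seconds-granularity instant at the end of its second.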
Sagar Sumit
89452063b4 [MINOR] Fix instant parsing in HoodieClusteringJob (#4071) 2021-11-22 08:57:44 -05:00
zhangyue19921010
a2c91a7a9b [HUDI-2533] New option for hoodieClusteringJob to check, rollback and re-execute the last failed clustering job (#3765)
* coding finished and need to do uts

* add uts

* code review

* code review

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-11-22 16:30:33 +05:30
董可伦
2533a9cc17 [MINOR] Fix typos (#4053) 2021-11-21 16:34:59 +08:00
dufeng1010
305d160081 [MINOR] optimize in constructor of inputbatch class (#4040)
Co-authored-by: 闫杜峰 <yandufeng@sinochem.com>
2021-11-21 10:11:01 +08:00
Harsha Teja Kanna
f4b974ac7b [HUDI-2742] Added S3 object filter to support multiple S3EventsHoodieIncrSources single S3 meta table (#4025) 2021-11-20 14:54:21 +05:30
Manoj Govindassamy
459b34240b [HUDI-2593] Virtual keys support for metadata table (#3968)
- Metadata table today has virtual keys disabled, thereby populating the metafields
  for each record written out and increasing the overall storage space used. Hereby
  adding virtual keys support for metadata table so that metafields are disabled
  for metadata table records.

- Adding a custom KeyGenerator for Metadata table so as to not rely on the
  default Base/SimpleKeyGenerators which currently look for record key
  and partition field set in the table config.

- AbstractHoodieLogRecordReader's processing of the next data block and
  createHoodieRecord() become generic versions, and the derived class
  HoodieMetadataMergedLogRecordReader takes care of the special creation of
  records from explicitly passed-in partition names.
2021-11-19 18:11:29 -05:00
wenningd
24def0b30d [HUDI-2362] Add external config file support (#3416)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2021-11-18 01:59:26 -08:00
davehagman
dfe3b84715 [HUDI-2579] Make deltastreamer checkpoint state merging more explicit (#3820)
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2021-11-09 17:37:59 -05:00
Prashant Wason
b7ee341e14 [HUDI-1794] Moved static COMMIT_FORMATTER to thread local variable as SimpleDateFormat is not thread safe. (#2819) 2021-11-05 09:31:42 -04:00
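
The pattern behind b7ee341e14 (one formatter instance per thread instead of a shared static one) can be shown with a Python analogue using `threading.local`. `CommitFormatter` is a hypothetical stand-in for a formatter that must not be shared across threads; Python's own `strftime` has no such restriction, so this only illustrates the structure of the fix.

```python
import threading
from datetime import datetime


class CommitFormatter:
    """Stand-in for a formatter with internal state that is unsafe to share."""

    def __init__(self, pattern):
        self.pattern = pattern

    def format(self, moment):
        return moment.strftime(self.pattern)


_local = threading.local()


def commit_formatter():
    """Return this thread's formatter, creating it on first use."""
    formatter = getattr(_local, "formatter", None)
    if formatter is None:
        formatter = CommitFormatter("%Y%m%d%H%M%S")
        _local.formatter = formatter
    return formatter
```

Each thread pays the construction cost once and never sees another thread's instance, so no locking is needed around `format`.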
Sagar Sumit
5b1992a92d [HUDI-1500] Support replace commit in DeltaSync with commit metadata preserved (#3802) 2021-10-29 13:09:09 -04:00
Raymond Xu
d8560377c3 [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter (#3849)
Remove the logic of using deltastreamer to prep test table. Use fixture (compressed test table) instead.
2021-10-24 21:14:39 -07:00
Raymond Xu
f5d7362ee8 [HUDI-2077] Fix flakiness in TestHoodieDeltaStreamer (#3829) 2021-10-20 23:57:12 -04:00
zhangyue19921010
e6711b171a [HUDI-2435][BUG]Fix clustering handle errors (#3666)
* done

* remove unused imports

* code reviewed

* code reviewed

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-10-12 15:24:48 -07:00
Sivabalan Narayanan
5f32162a2f [HUDI-2285][HUDI-2476] Metadata table synchronous design. Rebased and Squashed from pull/3426 (#3590)
* [HUDI-2285] Adding Synchronous updates to metadata before completion of commits in data timeline.

- This patch adds synchronous updates to the metadata table. In other words, every write is first committed to the metadata table and then to the data table. While reading the metadata table, we ignore any delta commits that are present only in the metadata table and not in the data table timeline.
- Compaction of the metadata table is fenced: we trigger it only when there are no inflight requests in the data table. This ensures that all base files in the metadata table are always in sync with the data table (without any holes); the only discrepancy can be some extra invalid commits among delta log files in the metadata table.
- Due to this, archival of the data table also fences itself up to the compacted instant in the metadata table.
- All writes to the metadata table happen within the data table lock, so the metadata table works in single-writer mode only. This might be tough to loosen, since all writers write to the same FILES partition and would conflict anyway.
- As part of this, added lock acquisition in the data table for operations that previously committed without one (rollback, clean, compaction, cluster). Note that no conflict resolution is done here; all we do is commit while holding a lock, so that writes to the metadata table always come from a single writer.
- Also added a building block for adding buckets to partitions, which will be leveraged by other indexes such as the record-level index. For now, the FILES partition has only one bucket. In general, any number of buckets per partition is allowed, and each partition has a fixed fileId prefix with an incremental suffix for each bucket within it.
- Fixed [HUDI-2476]: retry a failed compaction if it succeeded in the metadata table the first time but failed in the data table.
- Enabling metadata table by default.
- Adding more tests for metadata table

Co-authored-by: Prashant Wason <pwason@uber.com>
2021-10-06 00:17:52 -04:00
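
Two ideas from 5f32162a2f can be sketched in Python: ignoring metadata-table delta commits that never completed in the data table, and naming buckets with a fixed prefix plus an incremental suffix. Both helpers are hypothetical, and the instant representation and the zero-padded suffix width are assumptions for illustration, not Hudi's actual scheme.

```python
def valid_metadata_instants(metadata_delta_commits, data_completed_instants):
    """Keep only metadata delta commits that also completed in the data table.

    A write is committed to the metadata table first, so a delta commit that
    exists only there belongs to a write that did not finish.
    """
    completed = set(data_completed_instants)
    return [t for t in metadata_delta_commits if t in completed]


def bucket_file_ids(prefix, bucket_count):
    """Build one fileId per bucket: fixed prefix plus incremental suffix.

    The four-digit zero-padded suffix is an assumption of this sketch.
    """
    return [f"{prefix}{index:04d}" for index in range(bucket_count)]
```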
zhangyue19921010
dd1bd62684 [HUDI-2277] HoodieDeltaStreamer reading ORC files directly using ORCDFSSource (#3413)
* add ORCDFSSource to support reading orc file into hudi format && add UTs

* remove unused import

* simplify test

* code review

* code review

* code review

* code review

* code review

* code review

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-09-29 08:54:12 -07:00
qianchutao
9067657a5f [HUDI-2487] Fix JsonKafkaSource cannot filter empty messages from kafka (#3715) 2021-09-28 13:47:15 +08:00
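
The kind of filtering that 9067657a5f refers to can be illustrated with a short Python sketch. `drop_empty_messages` is a hypothetical helper, and treating `None` and whitespace-only values as empty is an assumption of this sketch rather than a statement about what JsonKafkaSource does.

```python
def drop_empty_messages(values):
    """Return only the message values that carry some content.

    Assumption: None and whitespace-only strings count as empty.
    """
    return [value for value in values if value is not None and value.strip()]
```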
董可伦
36be287121 [MINOR] Fix typo,'Kakfa' corrected to 'Kafka' & 'parquest' corrected to 'parquet' (#3717) 2021-09-26 21:53:39 +08:00
qianchutao
7e887b54d7 [MINOR] fix typo,'SPAKR' corrected to 'SPARK' (#3721) 2021-09-26 21:52:35 +08:00
zhangyue19921010
2d5ac55195 [HUDI-2355][Bug]Archive service executed after cleaner finished. (#3545)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-09-15 19:00:04 -04:00
liujinhui
35a04c43a5 [HUDI-2425] TestHoodieMultiTableDeltaStreamer CI failed due to exception (#3654) 2021-09-13 06:57:04 -07:00
K.I. (Dennis) Jung
c79017cb74 [HUDI-2397] Add --enable-sync parameter (#3608)
* add meta-sync config

* update test

* keep enableMetaSync same with enableHiveSync

* Switch check logic to use `enableMetaSync`
2021-09-13 12:04:49 +05:30
rmahindra123
e528dd798a [HUDI-2394] Implement Kafka Sink Protocol for Hudi for Ingesting Immutable Data (#3592)
- Fixing packaging, naming of classes
 - Use of log4j over slf4j for uniformity
- More follow-on fixes
 - Added a version to control/coordinator events.
 - Eliminated the config added to write config
 - Fixed fetching of checkpoints based on table type
 - Clean up of naming, code placement

Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-09-10 18:20:26 -07:00
Raymond Xu
57c8113ee1 [HUDI-2408] Deprecate FunctionalTestHarness to avoid init DFS (#3628) 2021-09-09 11:29:04 -04:00
liujinhui
eb5e7eec0a MINOR_CHECKSTYLE (#3616)
Fix checkstyle
2021-09-07 18:19:39 +08:00
Raymond Xu
073c318d9f [HUDI-1989] Disable HDFSParquetImporter related tests (#3597)
Also mark HDFSParquetImportCommand and HDFSParquetImporter as deprecated.
2021-09-03 23:08:11 -04:00
董可伦
bf5a52e51b [HUDI-2320] Add support ByteArrayDeserializer in AvroKafkaSource (#3502) 2021-08-30 10:01:15 +08:00
Udit Mehrotra
c350d05dd3 Restore 0.8.0 config keys with deprecated annotation (#3506)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-08-19 13:36:40 -07:00
Udit Mehrotra
3e301196bf Moving to 0.10.0-SNAPSHOT on master branch. 2021-08-14 18:51:09 -07:00
Y Ethan Guo
23dca6c237 [HUDI-2268] Add upgrade and downgrade to and from 0.9.0 (#3470)
- Added upgrade and downgrade step to and from 0.9.0. Upgrade adds few table properties. Downgrade recreates timeline server based marker files if any.
2021-08-14 20:20:23 -04:00
Sagar Sumit
5cc96e85c1 [HUDI-1897] Deltastreamer source for AWS S3 (#3433)
- Added two sources for two stage pipeline. a. S3EventsSource that fetches events from SQS and ingests to a meta hoodie table. b. S3EventsHoodieIncrSource reads S3 events from this meta hoodie table, fetches actual objects from S3 and ingests to sink hoodie table. 
- Added selectors to assist in S3EventsSource. 

Co-authored-by: Satish M <84978833+satishmittal1111@users.noreply.github.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-08-14 08:25:10 -04:00
Sagar Sumit
0544d70d8f [MINOR] Deprecate older configs (#3464)
Rename and deprecate props in HoodieWriteConfig

Rename and deprecate older props
2021-08-12 20:31:04 -07:00
Sivabalan Narayanan
b651336454 [HUDI-2294] Adding virtual keys support to deltastreamer (#3450) 2021-08-12 08:02:39 -04:00
liujinhui
c0fc9cdaf3 MINOR (#3459)
Move Hoodie DeltaStreamer to hudi-utilities
2021-08-12 18:19:05 +08:00
vinoyang
dc3cbb28e7 [MINOR] Correct TestKafkaSource class and comment (#3451) 2021-08-12 09:11:00 +08:00
Y Ethan Guo
4783176554 [HUDI-1138] Add timeline-server-based marker file strategy for improving marker-related latency (#3233)
- Can be enabled for cloud stores like S3. Not supported for hdfs yet, due to partial write failures.
2021-08-11 11:48:13 -04:00
Sivabalan Narayanan
1196736185 [HUDI-1129] Improving schema evolution support in hudi (#2927)
* Adding support to ingest records with old schema after table's schema is evolved

* Rebasing against latest master

- Trimming test file to be < 800 lines
- Renaming config names

* Addressing feedback

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-08-10 09:15:37 -07:00
wenningd
91bb0d1318 [HUDI-2255] Refactor Datasource options (#3373)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2021-08-03 17:50:30 -07:00
rmahindra123
245e1fd17d [HUDI-2272] Pass base file format to sync clients (#3397)
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-08-03 14:46:02 -07:00
zhangyue19921010
dde57b293c [HUDI-2164] Let users build cluster plan and execute this plan at once using HoodieClusteringJob for async clustering (#3259)
* add --mode schedule/execute/scheduleandexecute

* fix checkstyle

* add UT testHoodieAsyncClusteringJobWithScheduleAndExecute

* log changed

* try to make ut success

* try to fix ut

* modify ut

* review changed

* code review

* code review

* code review

* code review

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-08-02 08:07:59 +08:00
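
The `--mode schedule/execute/scheduleandexecute` option added in dde57b293c can be illustrated with a small Python sketch of the dispatch. Only the three mode names come from the commit message; the function names, the use of `argparse`, and the step ordering for the combined mode (schedule first, then execute) are assumptions of this sketch, not HoodieClusteringJob's code.

```python
import argparse

MODES = ("schedule", "execute", "scheduleandexecute")


def parse_mode(argv):
    """Read the --mode flag, rejecting anything outside the three modes."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=MODES, required=True)
    return parser.parse_args(argv).mode


def plan_steps(mode):
    """Map a mode to the ordered steps the job would run."""
    if mode == "schedule":
        return ["schedule"]
    if mode == "execute":
        return ["execute"]
    if mode == "scheduleandexecute":
        # Assumed order: build the plan first, then run it in the same job.
        return ["schedule", "execute"]
    raise ValueError(f"unknown mode: {mode!r}")
```

With this shape, `plan_steps(parse_mode(["--mode", "scheduleandexecute"]))` yields both steps, which matches the commit's goal of building and executing a plan at once.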