251 Commits

Author SHA1 Message Date
YueZhang
359fbfde79 [HUDI-2648] Retry FileSystem action instead of failed directly. (#3887)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-02-20 15:31:31 -05:00
Bo Cui
83279971a1 [HUDI-3446] Supports batch reader in BootstrapOperator#loadRecords (#4837)
* [HUDI-3446] Supports batch Reader in BootstrapOperator#loadRecords
2022-02-19 21:21:48 +08:00
Sagar Sumit
ed106f671e [HUDI-2809] Introduce a checksum mechanism for validating hoodie.properties (#4712)
Fix dependency conflict

Fix repairs command

Implement putIfAbsent for DDB lock provider

Add upgrade step and validate while fetching configs

Validate checksum for latest table version only while fetching config

Move generateChecksum to BinaryUtil

Rebase and resolve conflict

Fix table version check
2022-02-18 10:17:06 +05:30
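
The [HUDI-2809] entry above names a checksum mechanism for validating hoodie.properties but does not describe how the checksum is computed. The following is only a minimal sketch of the general idea, assuming a CRC32 over the sorted key/value pairs; the key name `example.checksum` and all function names here are hypothetical and are not taken from the Hudi code base.

```python
import zlib

# Hypothetical property key used to store the checksum (not a real Hudi key).
CHECKSUM_KEY = "example.checksum"


def generate_checksum(props: dict) -> int:
    """CRC32 over the sorted key=value pairs, ignoring the checksum entry itself."""
    payload = "\n".join(
        f"{key}={value}"
        for key, value in sorted(props.items())
        if key != CHECKSUM_KEY
    )
    return zlib.crc32(payload.encode("utf-8"))


def with_checksum(props: dict) -> dict:
    """Return a copy of props with the checksum entry added."""
    stamped = dict(props)
    stamped[CHECKSUM_KEY] = str(generate_checksum(props))
    return stamped


def is_valid(props: dict) -> bool:
    """True only if a checksum is present and matches the other entries."""
    stored = props.get(CHECKSUM_KEY)
    return stored is not None and stored == str(generate_checksum(props))
```

A reader would call `is_valid` right after loading the file and treat a mismatch as a partially written or corrupted properties file.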
Y Ethan Guo
9a05940a74 [HUDI-3366] Remove hardcoded logic of disabling metadata table in tests (#4792) 2022-02-15 16:41:47 -05:00
Y Ethan Guo
b8601a9f58 [HUDI-2656] Generalize HoodieIndex for flexible record data type (#3893)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-02-03 20:24:04 -08:00
Alexey Kudinkin
819e8018ff [HUDI-3322][HUDI-3343] Fixing Metadata Table Records Duplication Issues (#4716)
This change addresses issues where the Metadata Table ingested duplicated records, causing it to persist incorrect file sizes for the files referred to in those records.

There are multiple issues that were leading to that:

- [HUDI-3322] Incorrect Rollback Plan generation: The Rollback Plan generated for MOR tables was too broad. It listed all log files with the latest base instant as affected by the rollback, so invalid MT records referring to those files were ingested.
- [HUDI-3343] Metadata Table including Uncommitted Log Files during Bootstrap: Since MT is bootstrapped at the end of the commit operation execution (after FS activity, but before committing to the timeline), it was actually incorrectly ingesting some files that were part of the intermediate state of the operation being committed.

This change will unblock the stack of PRs based on #4556
2022-02-02 16:10:51 -05:00
Alexey Kudinkin
a68e1dc2db [HUDI-431] Adding support for Parquet in MOR LogBlocks (#4333)
- Adding support for Parquet in MOR tables Log blocks

Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2022-02-02 14:35:05 -05:00
Raymond Xu
caef3d5c58 [HUDI-3330] Remove fixture test tables for multi writer tests (#4704) 2022-02-02 17:50:10 +05:30
董可伦
56cd8ffae0 [HUDI-2837] Add support for using database name in incremental query (#4083) 2022-01-22 22:11:27 -08:00
Alexey Kudinkin
4bea758738 [HUDI-3191] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex (#4531) 2022-01-18 14:54:51 -08:00
0x574C
28b3b6ad8f [MINOR] Remove org.apache.directory.api.util.Strings import (#4601) 2022-01-16 16:58:18 +08:00
Y Ethan Guo
7d163ee3de [MINOR] Fix local flaky test in TestFSUtils (#4596)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-01-13 22:48:57 -08:00
Sagar Sumit
195dac90fa [MINOR] Disable flaky tests to unlock CI (#4592) 2022-01-13 19:43:27 -08:00
Y Ethan Guo
397795c7d0 [HUDI-3007] Fix issues in HoodieRepairTool (#4564) 2022-01-12 09:03:27 -08:00
Sivabalan Narayanan
7a8b94c82d [HUDI-3180] Include files from completed commits while bootstrapping metadata table (#4519) 2022-01-10 15:33:15 -05:00
YueZhang
cf362fb2d5 [MINOR] Fix some code style issues based on check-style plugin (#4532)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-01-09 01:14:56 -08:00
Sagar Sumit
827549949c [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator (#4203)
* [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator

The timestamp-based key generator was returning different values for the row-writer and non-row-writer paths. This patch fixes that, and the fix is guarded by a config flag (`hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled`)
2022-01-08 10:22:44 -05:00
Sagar Sumit
518488c633 [HUDI-3185] HoodieConfig#getBoolean should return false when default not set (#4536)
Remove unnecessary config
2022-01-07 16:20:11 -05:00
董可伦
436becf3ea [HUDI-2675] Fix the exception 'Not an Avro data file' when archive and clean (#4016) 2021-12-29 22:53:17 -05:00
Danny Chan
7b07aac286 [HUDI-3101] Excluding compaction instants from pending rollback info (#4443) 2021-12-25 14:10:45 +08:00
Raymond Xu
32a44bbe06 [HUDI-2970] Add test for archiving replace commit (#4345) 2021-12-21 00:01:59 -05:00
Danny Chan
ea2eba1a55 [HUDI-3015] Implement #reset and #sync for metadata filesystem view (#4307) 2021-12-16 15:26:16 +08:00
Alexey Kudinkin
2d864f7524 [HUDI-2814] Make Z-index more generic Column-Stats Index (#4106) 2021-12-10 14:56:09 -08:00
zhangyue19921010
3ba2909690 [HUDI-2892][BUG] Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results (#4172)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-12-10 09:57:01 -08:00
Sivabalan Narayanan
be368264f4 [HUDI-2952] Fixing metadata table for non-partitioned dataset (#4243) 2021-12-10 11:11:42 -05:00
leesf
5ac9ce7289 [MINOR] Fix Compile broken (#4263) 2021-12-09 13:12:18 +08:00
guanziyue
9c8ad0f0fa [HUDI-2665] Fix overflow of huge log file in HoodieLogFormatWriter (#3912)
Co-authored-by: guanziyue.gzy <guanziyue.gzy@bytedance.com>
2021-12-09 10:47:13 +08:00
Ron
84b531ae75 [HUDI-2900] Fix corrupt block end position (#4181)
* [HUDI-2900] Fix corrupt block end position

* add a test
2021-12-06 20:38:39 +08:00
Sivabalan Narayanan
1d4fb827e7 [HUDI-2923] Fixing metadata table reader when metadata compaction is inflight (#4206)
* [HUDI-2923] Fixing metadata table reader when metadata compaction is inflight

* Fixing retry of pending compaction in metadata table and enhancing tests
2021-12-03 21:44:50 -08:00
Sivabalan Narayanan
f4c25ba3fd [HUDI-2880] Fixing loading of props from default dir (#4167)
* Fixing loading of props from default dir

* addressing comments
2021-12-01 16:02:30 +08:00
Sivabalan Narayanan
f8e0176eb0 [HUDI-2861] Re-use same rollback instant time for failed rollbacks (#4123) 2021-11-26 16:36:42 -05:00
Sivabalan Narayanan
8340ccb503 [HUDI-2005] Removing direct fs call in HoodieLogFileReader (#3865) 2021-11-25 18:51:38 -05:00
Alexey Kudinkin
6f5d8d04cd [HUDI-2840] Fixed DeltaStreamer to properly respect configuration passed through properties file (#4090)
* Rebased `DFSPropertiesConfiguration` to access Hadoop config in lieu of FS to avoid confusion

* Fixed `readConfig` to take Hadoop's `Configuration` instead of FS;
Fixing usages

* Added test for local FS access

* Rebase to use `FSUtils.getFs`

* Combine properties provided as a file along w/ overrides provided from the CLI

* Added helper utilities to `HoodieClusteringConfig`;
Make sure corresponding config methods fallback to defaults;

* Fixed DeltaStreamer usage to respect properly combined configuration;
Abstracted `HoodieClusteringConfig.from` convenience utility to init Clustering config from `Properties`

* Tidying up

* `lint`

* Reverting changes to `HoodieWriteConfig`

* Tidying up

* Fixed incorrect merge of the props

* Converted `HoodieConfig` to wrap around `Properties` into `TypedProperties`

* Fixed compilation

* Fixed compilation
2021-11-25 14:48:22 -08:00
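
One bullet in the [HUDI-2840] entry above combines properties read from a file with overrides given on the CLI. A minimal sketch of that merge, assuming CLI overrides arrive as `key=value` strings and take precedence over the file; the function names and the sample keys are hypothetical and do not mirror the `DFSPropertiesConfiguration` API.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line; ignored in this sketch
        props[key.strip()] = value.strip()
    return props


def combine(file_props: dict, cli_overrides: list) -> dict:
    """Start from the file's properties, then apply CLI overrides on top."""
    merged = dict(file_props)
    for item in cli_overrides:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        merged[key.strip()] = value.strip()
    return merged


file_text = "# sample file\nexample.a=1\nexample.b = 2\n"
merged = combine(parse_properties(file_text), ["example.b=3", "example.c=4"])
print(merged)
```

Copying the file's properties first and applying the overrides afterwards keeps the precedence rule in one place, so every consumer sees the same combined configuration.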
Sivabalan Narayanan
f692078d32 [HUDI-2671] Making error -> warn logs from timeline server with concurrent writers for inconsistent state (#4088)
* Making error -> warn logs from timeline server with concurrent writers for inconsistent state

* Fixing bad request response exception for timeline out of sync

* Addressing feedback. Removed write concurrency mode dependency
2021-11-25 11:21:32 -08:00
Sivabalan Narayanan
a9bd20804b [HUDI-2792] Configure metadata payload consistency check (#4035)
- Relax metadata payload consistency check to consider spark task failures with spurious deletes
2021-11-24 21:56:31 -05:00
rmahindra123
83f8ed2ae3 [HUDI-1290] Add Debezium Source for deltastreamer (#4063)
* add source for postgres debezium

* Add tests for debezium payload

* Fix test

* Fix test

* Add tests for debezium source

* Add tests for debezium source

* Fix schema for debezium

* Fix checkstyle issues

* Fix config issue for schema registry

* Add mysql source for debezium

* Fix checkstyle issues and tests

* Improve code for merging toasted values

* Improve code for merging toasted values

Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-11-24 17:57:02 -08:00
Manoj Govindassamy
973f78f5ca [HUDI-2443] Hudi KVComparator for all HFile writer usages (#3889)
* [HUDI-2443] Hudi KVComparator for all HFile writer usages

- Hudi relies on custom class shading for Hbase's KeyValue.KVComparator to
  avoid versioning and class loading issues. There are a few places which
  still use Hbase's comparator class directly, and version upgrades
  would make them obsolete. This refactors HoodieKVComparator and makes
  all HFile writer creation use the same shaded class.

* [HUDI-2443] Hudi KVComparator for all HFile writer usages

- Moving HoodieKVComparator from common.bootstrap.index to common.util

* [HUDI-2443] Hudi KVComparator for all HFile writer usages

- Retaining the old HoodieKVComparatorV2 for bootstrap case. Adding the
  new comparator as HoodieKVComparatorV2 to differentiate from the old
  one.

* [HUDI-2443] Hudi KVComparator for all HFile writer usages

 - Renamed HoodieKVComparatorV2 to HoodieMetadataKVComparator and moved it
   under the package org.apache.hudi.metadata.

* Make comparator classname configurable

* Revert new config and address other review comments

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2021-11-24 10:05:36 -08:00
Y Ethan Guo
772af935d5 [HUDI-2737] Use earliest instant by default for async compaction and clustering jobs (#3991)
Address review comments

Fix test failures

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2021-11-23 06:49:41 +05:30
Sivabalan Narayanan
fc9ca6a07a [HUDI-2559] Converting commit timestamp format to millisecs (#4024)
- Adds support for generating commit timestamps with millisecs granularity. 
- Older commit timestamps (in secs granularity) will be suffixed with 999 and parsed with millisecs format.
2021-11-22 11:44:38 -05:00
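
The [HUDI-2559] entry above says that older, seconds-granularity commit timestamps are suffixed with 999 and then parsed in the milliseconds format. A minimal sketch of that rule, assuming the two layouts are `yyyyMMddHHmmss` (14 digits) and `yyyyMMddHHmmssSSS` (17 digits); these lengths and the function names are assumptions for illustration, not taken from the entry.

```python
from datetime import datetime

SECS_LEN = 14    # assumed layout: yyyyMMddHHmmss
MILLIS_LEN = 17  # assumed layout: yyyyMMddHHmmssSSS


def normalize_instant(ts: str) -> str:
    """Pad a seconds-granularity timestamp with 999 so it fits the millis layout."""
    if len(ts) == SECS_LEN:
        return ts + "999"
    if len(ts) == MILLIS_LEN:
        return ts
    raise ValueError(f"unexpected timestamp length: {ts!r}")


def parse_instant(ts: str) -> datetime:
    """Parse either layout into a datetime with millisecond precision."""
    full = normalize_instant(ts)
    base = datetime.strptime(full[:SECS_LEN], "%Y%m%d%H%M%S")
    return base.replace(microsecond=int(full[SECS_LEN:]) * 1000)


print(normalize_instant("20211122114438"))  # 20211122114438999
```

With the 999 suffix, an old timestamp sorts after every millisecond-granularity timestamp that falls within the same second.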
vinoth chandar
ae0c67d9fc [HUDI-2795] Add mechanism to safely update,delete and recover table properties (#4038)
* [HUDI-2795] Add mechanism to safely update,delete and recover table properties

  - Fail safe mechanism, that lets queries succeed off a backup file
  - Readers who are not upgraded to this version of code will just fail until recovery is done.
  - Added unit tests that exercise all these scenarios.
  - Adding CLI for recovery, updation to table command.
  - [Pending] Add some hash based verification to ensure any rare partial writes for HDFS

* Fixing upgrade/downgrade infrastructure to use new updation method
2021-11-20 08:07:40 -08:00
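
The [HUDI-2795] entry above mentions a fail-safe mechanism that lets queries succeed off a backup file while table properties are being updated. One way such a scheme can work is sketched below, assuming the backup is written before the main file is touched and deleted only after the update completes. The step order, file handling, and function names are all hypothetical and are not taken from the commit.

```python
import shutil
from pathlib import Path


def update_with_backup(props: Path, backup: Path, new_text: str) -> None:
    """Rewrite props while always leaving one readable copy on disk."""
    shutil.copyfile(props, backup)  # 1. keep a readable copy of the old state
    props.unlink()                  # 2. main file is briefly missing
    props.write_text(new_text)      # 3. write the new state
    backup.unlink()                 # 4. update is complete, drop the backup


def read_with_fallback(props: Path, backup: Path) -> str:
    """Readers prefer the main file and fall back to the backup."""
    if props.exists():
        return props.read_text()
    return backup.read_text()


def recover(props: Path, backup: Path) -> None:
    """A leftover backup means an update did not finish: restore the old state."""
    if backup.exists():
        shutil.copyfile(backup, props)
        backup.unlink()
```

In this sketch a reader that knows about the backup keeps working throughout the update, whereas a reader that only looks at the main file fails during the window between steps 2 and 3, or until `recover` has run after a crash.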
Manoj Govindassamy
c8617d9390 [HUDI-2472] Enabling metadata table for TestHoodieMergeOnReadTable and TestHoodieCompactor (#4023) 2021-11-19 20:02:21 -05:00
Manoj Govindassamy
459b34240b [HUDI-2593] Virtual keys support for metadata table (#3968)
- Metadata table today has virtual keys disabled, thereby populating the metafields
  for each record written out and increasing the overall storage space used. Hereby
  adding virtual keys support for metadata table so that metafields are disabled
  for metadata table records.

- Adding a custom KeyGenerator for Metadata table so as to not rely on the
  default Base/SimpleKeyGenerators which currently look for record key
  and partition field set in the table config.

- AbstractHoodieLogRecordReader's version of processing next data block and
  createHoodieRecord() will be a generic version and making the derived class
  HoodieMetadataMergedLogRecordReader take care of the special creation of
  records from explicitly passed in partition names.
2021-11-19 18:11:29 -05:00
wenningd
24def0b30d [HUDI-2362] Add external config file support (#3416)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2021-11-18 01:59:26 -08:00
Manoj Govindassamy
f715cf607f [HUDI-2716] InLineFS support for S3FS logs (#3977) 2021-11-17 13:59:38 -05:00
Sivabalan Narayanan
53d2d6ae24 [HUDI-2744] Fix parsing of metadata table compaction timestamp when metrics are enabled (#3976) 2021-11-15 07:27:35 -05:00
Manoj Govindassamy
9720820975 [HUDI-2718] ExternalSpillableMap payload size re-estimation throws ArithmeticException (#3955)
- ExternalSpillableMap does the payload/value size estimation on the first put to
  determine when to spill over to the disk map. The payload size re-estimation also
  happens after a minimum threshold of puts. This re-estimation goes by the
  current in-memory map size to calculate the average payload size, and attempts
  a divide-by-zero operation when the map is empty. This change avoids the
  ArithmeticException during payload size re-estimation by checking the map size
  upfront.
2021-11-12 08:18:40 -05:00
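
The [HUDI-2718] entry above describes a size re-estimation that divides by the in-memory map size and fails when that map is empty. A minimal sketch of the guarded computation follows; the class, its fields, and the re-estimation interval are hypothetical and do not mirror the actual ExternalSpillableMap.

```python
import sys


class SizeEstimatingMap:
    """Tracks an average payload size, re-estimated every few puts."""

    def __init__(self, size_of=sys.getsizeof, reestimate_every=100):
        self.in_memory = {}
        self.size_of = size_of
        self.reestimate_every = reestimate_every
        self.puts = 0
        self.avg_payload_size = 0

    def put(self, key, value):
        if self.puts == 0:
            # First put: seed the estimate from this single payload.
            self.avg_payload_size = self.size_of(value)
        self.in_memory[key] = value
        self.puts += 1
        if self.puts % self.reestimate_every == 0:
            self.reestimate()

    def reestimate(self):
        # Guard: with an empty in-memory map the average is undefined,
        # so keep the previous estimate instead of dividing by zero.
        if not self.in_memory:
            return
        total = sum(self.size_of(v) for v in self.in_memory.values())
        self.avg_payload_size = total // len(self.in_memory)
```

Checking the map size upfront keeps the last known estimate, which is still a usable basis for the spill decision when every entry has already moved to disk.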
manasaks
e0285800fb HUDI-1827 : Add ORC support in Bootstrap Op (#3457)
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2021-11-06 12:23:20 -04:00
Prashant Wason
b7ee341e14 [HUDI-1794] Moved static COMMIT_FORMATTER to thread local variable as SimpleDateFormat is not thread safe. (#2819) 2021-11-05 09:31:42 -04:00
peanut-chenzhong
f2fb1aac26 [HUDI-2509] OverwriteNonDefaultsWithLatestAvroPayload doesn't work when upsert data with some null value column (#3761)
Co-authored-by: 502395931@qq.com <lzyadam315>
2021-11-03 08:23:40 -04:00
Yann Byron
6351e5f4d0 [HUDI-2538] persist some configs to hoodie.properties when the first write (#3823) 2021-11-03 10:04:23 +08:00