Commit Graph

2137 Commits

Author SHA1 Message Date
Sagar Sumit
b972aa5bf2 [HUDI-2800] Remove rdd.isEmpty() validation to prevent CreateHandle being called twice (#4121) 2021-11-25 23:46:36 +05:30
satishm
264e1ce63c [HUDI-1290] fixing mysql debezium source (#4119) 2021-11-25 11:26:59 -05:00
Danny Chan
a2eb2b0b0a [HUDI-2480] FileSlice after pending compaction-requested instant-time… (#3703)
* [HUDI-2480] FileSlice after pending compaction-requested instant-time is ignored by MOR snapshot reader

* include file slice after a pending compaction for spark reader

Co-authored-by: garyli1019 <yanjia.gary.li@gmail.com>
2021-11-25 22:30:09 +08:00
董可伦
88067f57a2 [HUDI-2855] Change the default value of 'PAYLOAD_CLASS_NAME' to 'DefaultHoodieRecordPayload' (#4115) 2021-11-25 19:17:38 +08:00
Sivabalan Narayanan
a9bd20804b [HUDI-2792] Configure metadata payload consistency check (#4035)
- Relax metadata payload consistency check to consider spark task failures with spurious deletes
2021-11-24 21:56:31 -05:00
rmahindra123
83f8ed2ae3 [HUDI-1290] Add Debezium Source for deltastreamer (#4063)
* add source for postgres debezium

* Add tests for debezium payload

* Fix test

* Fix test

* Add tests for debezium source

* Add tests for debezium source

* Fix schema for debezium

* Fix checkstyle issues

* Fix config issue for schema registry

* Add mysql source for debezium

* Fix checkstyle issues and tests

* Improve code for merging toasted values

* Improve code for merging toasted values

Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-11-24 17:57:02 -08:00
rmahindra123
abc0175cf7 [HUDI-1290] [RFC-39] Deltastreamer avro source for Debezium CDC (#4048)
* Add RFC entry for deltastreamer source for debezium

* Add RFC for debezium source

* Add RFC for debezium source

* Add RFC for debezium source

* fix hyperlink issue and rebase

* Update progress

Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-11-24 17:31:34 -08:00
Y Ethan Guo
bef373fa1d [MINOR] Fix build failure due to checkstyle issues (#4111) 2021-11-24 17:17:46 -08:00
Alexey Kudinkin
51297736ca [HUDI-2844][CLI] Fixing archived Timeline crashing if timeline contains REPLACE_COMMIT (#4091) 2021-11-24 16:53:29 -08:00
rmahindra123
7286b56d30 [HUDI-2853] Add JMX deps in hudi utilities and kafka connect bundles (#4108)
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-11-24 19:03:01 -05:00
Sivabalan Narayanan
435ea1543c [HUDI-2793] Fixing deltastreamer checkpoint fetch/copy over (#4034)
- Removed the copy over logic in transaction utils. Deltastreamer will go back to previous commits and get the checkpoint value.
2021-11-24 18:26:40 -05:00
Sagar Sumit
ff94d92980 [HUDI-2766] Cluster update strategy should not be fenced by write config (#4093)
Fix pending clustering rollback test
2021-11-24 10:15:40 -08:00
Alexey Kudinkin
60b23b9797 [HUDI-2788] Fixing issues w/ Z-order Layout Optimization (#4026)
* Simplifying, tidying up

* Fixed packaging for `TestOptimizeTable`

* Cleaned up `HoodieFileIndex` file filtering seq;
Removed optimization manually reading Parquet table circumventing Spark

* Refactored `DataSkippingUtils`:
  - Fixed checks to validate all statistics cols are present
  - Fixed some predicates being constructed incorrectly
  - Rewrote comments for easier comprehension, added more notes
  - Tidying up

* Tidying up tests

* `lint`

* Fixing compilation

* `TestOptimizeTable` > `TestTableLayoutOptimization`;
Added assertions to test data skipping paths

* Fixed tests to properly hit data-skipping path

* Fixed pruned files candidates lookup seq to conservatively include all non-indexed files

* Added java-doc

* Fixed compilation
2021-11-24 10:10:28 -08:00
Manoj Govindassamy
973f78f5ca [HUDI-2443] Hudi KVComparator for all HFile writer usages (#3889)
* [HUDI-2443] Hudi KVComparator for all HFile writer usages

- Hudi relies on custom class shading for Hbase's KeyValue.KVComparator to
  avoid versioning and class loading issues. There are a few places which are
  still using the Hbase's comparator class directly and version upgrades
  would make them obsolete. Refactoring the HoodieKVComparator and making
  all HFile writer creation using the same shaded class.

* [HUDI-2443] Hudi KVComparator for all HFile writer usages

- Moving HoodieKVComparator from common.bootstrap.index to common.util

* [HUDI-2443] Hudi KVComparator for all HFile writer usages

- Retaining the old HoodieKVComparator for bootstrap case. Adding the
  new comparator as HoodieKVComparatorV2 to differentiate from the old
  one.

* [HUDI-2443] Hudi KVComparator for all HFile writer usages

 - Renamed HoodieKVComparatorV2 to HoodieMetadataKVComparator and moved it
   under the package org.apache.hudi.metadata.

* Make comparator classname configurable

* Revert new config and address other review comments

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2021-11-24 10:05:36 -08:00
rmahindra123
90f2ea2f12 [HUDI-2671] Fix kafka offset handling in Kafka Connect protocol (#4021)
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-11-24 10:03:58 -08:00
Sagar Sumit
9af219b7c1 [HUDI-2688] Claim the next rfc 40 for Hudi connector for Trino (#4105) 2021-11-24 11:43:37 -05:00
Yann Byron
a234833f0a [HUDI-2759] extract HoodieCatalogTable to coordinate spark catalog table and hoodie table (#3998) 2021-11-24 02:12:38 -08:00
Danny Chan
0bb506fa00 [HUDI-2847] Flink metadata table supports virtual keys (#4096) 2021-11-24 17:34:42 +08:00
Danny Chan
323be33f18 Revert "[HUDI-2799] Fix the classloader of flink write task (#4042)" (#4069)
This reverts commit 8281cbf762.
2021-11-24 12:01:18 +08:00
Yann Byron
0cf2f103e0 [HUDI-2838] refresh table after drop partition (#4084) 2021-11-23 19:46:48 -08:00
Raymond Xu
5078d29eb4 [HUDI-2818] Fix 2to3 upgrade when set hoodie.table.keygenerator.class (#4077) 2021-11-23 19:43:34 -08:00
Alexey Kudinkin
18cf59507f [HUDI-2831] Securing usages of SimpleDateFormat to be thread-safe (#4073) 2021-11-23 20:25:11 -05:00
rmahindra123
fbff0799b9 [HUDI-2325] Add hive sync support to kafka connect (#3660)
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-11-23 15:48:06 -08:00
董可伦
969a5bf11e [MINOR] Fix typo,rename 'HooodieAvroDeserializer' to 'HoodieAvroDeserializer' (#4064) 2021-11-23 19:10:57 +08:00
Y Ethan Guo
ca9bfa2a40 [HUDI-2332] Add clustering and compaction in Kafka Connect Sink (#3857)
* [HUDI-2332] Add clustering and compaction in Kafka Connect Sink

* Disable validation check on instant time for compaction and adjust configs

* Add javadocs

* Add clustering and compaction config

* Fix transaction causing missing records in the target table

* Add debugging logs

* Fix kafka offset sync in participant

* Adjust how clustering and compaction are configured in kafka-connect

* Fix clustering strategy

* Remove irrelevant changes from other published PRs

* Update clustering logic and others

* Update README

* Fix test failures

* Fix indentation

* Fix clustering config

* Add JavaCustomColumnsSortPartitioner and make async compaction enabled by default

* Add test for JavaCustomColumnsSortPartitioner

* Add more changes after IDE sync

* Update README with clarification

* Fix clustering logic after rebasing

* Remove unrelated changes
2021-11-23 14:23:28 +05:30
zhangyue19921010
9ed28b1570 [HUDI-2409] Using HBase shaded jars in Hudi presto bundle (#3623)
* using hbase-shaded-jars-in-hudi-presto-bundle

* test

* add hudi-common-bundle

* code review

* code review

* code review

* code review

* test

* test

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-11-23 11:25:12 +05:30
xiarixiaoyao
9de9951348 [HUDI-2778] Optimize statistics collection related codes and add some docs for z-order and fix some bugs (#4013)
* [HUDI-2778] Optimize statistics collection related codes and add more docs for z-order.

* add test code for multi-thread parquet footer read
2021-11-22 21:46:02 -08:00
Sagar Sumit
c88c2af8bf [HUDI-2743] Assume path exists and defer fs.exists() in AbstractTableFileSystemView (#4002) 2021-11-22 22:13:10 -05:00
Y Ethan Guo
6aa710eae0 [MINOR] Add more configuration to Kafka setup script (#3992)
* [MINOR] Add more configuration to Kafka setup script

* Add option to reuse Kafka topic

* Minor fixes to README
2021-11-23 07:33:38 +05:30
Sagar Sumit
e22150fe15 [HUDI-1937] Rollback unfinished replace commit to allow updates (#3869)
* [HUDI-1937] Rollback unfinished replace commit to allow updates while clustering

* Revert and delete requested replacecommit too

* Rollback pending clustering instants transactionally

* No double locking and add a config to enable rollback

* Update config to be clear about rollback only on conflict
2021-11-23 07:29:03 +05:30
Jimmy.Zhou
0d1e7ecdab [MINOR] Fix typo,'multipe' corrected to 'multiple' (#4068) 2021-11-22 17:20:23 -08:00
Y Ethan Guo
772af935d5 [HUDI-2737] Use earliest instant by default for async compaction and clustering jobs (#3991)
Address review comments

Fix test failures

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2021-11-23 06:49:41 +05:30
Alexey Kudinkin
3bdab01a49 [HUDI-2550] Expand File-Group candidates list for appending for MOR tables (#3986) 2021-11-22 19:19:59 -05:00
Sagar Sumit
fe57e9beea [HUDI-2599] Make addFilesToview and fetchLatestBaseFiles public (#4066) 2021-11-22 12:23:50 -05:00
Sivabalan Narayanan
fc9ca6a07a [HUDI-2559] Converting commit timestamp format to millisecs (#4024)
- Adds support for generating commit timestamps with millisecs granularity. 
- Older commit timestamps (in secs granularity) will be suffixed with 999 and parsed with millisecs format.
2021-11-22 11:44:38 -05:00
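The timestamp change described in the commit above can be illustrated with a minimal sketch. It assumes commit timestamps are digit strings in `yyyyMMddHHmmss` form (seconds) or `yyyyMMddHHmmssSSS` form (milliseconds); the function names `normalize_instant` and `parse_instant` are hypothetical and are not Hudi's API.

```python
from datetime import datetime

# Assumed layouts (not taken from the Hudi source):
SECS_LENGTH = 14      # yyyyMMddHHmmss
MILLIS_LENGTH = 17    # yyyyMMddHHmmssSSS
DEFAULT_MILLIS_SUFFIX = "999"  # suffix named in the commit message


def normalize_instant(ts: str) -> str:
    """Return a millisecond-granularity timestamp string.

    Older, seconds-granularity timestamps get the 999 suffix;
    millisecond timestamps pass through unchanged.
    """
    if not ts.isdigit():
        raise ValueError(f"timestamp must be all digits: {ts!r}")
    if len(ts) == SECS_LENGTH:
        return ts + DEFAULT_MILLIS_SUFFIX
    if len(ts) == MILLIS_LENGTH:
        return ts
    raise ValueError(f"unexpected timestamp length: {ts!r}")


def parse_instant(ts: str) -> datetime:
    """Parse either layout using the millisecond format."""
    normalized = normalize_instant(ts)
    # %f reads microseconds, so pad the 3 millisecond digits to 6.
    return datetime.strptime(normalized + "000", "%Y%m%d%H%M%S%f")
```

Under these assumptions, `parse_instant("20211122114438")` and `parse_instant("20211122114438999")` yield the same instant, so an old timestamp reads as the last millisecond of its second.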
Sagar Sumit
89452063b4 [MINOR] Fix instant parsing in HoodieClusteringJob (#4071) 2021-11-22 08:57:44 -05:00
Manoj Govindassamy
7f3b89fad7 [HUDI-2472] Enabling metadata table for TestHoodieIndex test case (#4045)
- Enabling the metadata table for testSimpleGlobalIndexTagLocationWhenShouldUpdatePartitionPath.
   This is more of a test issue.
2021-11-22 07:21:24 -05:00
zhangyue19921010
a2c91a7a9b [HUDI-2533] New option for hoodieClusteringJob to check, rollback and re-execute the last failed clustering job (#3765)
* coding finished and need to do uts

* add uts

* code review

* code review

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-11-22 16:30:33 +05:30
Raymond Xu
02f7ca2b05 [HUDI-1870] Add more Spark CI build tasks (#4022)
* [HUDI-1870] Add more Spark CI build tasks

- build for spark3.0.x
- build for spark-shade-unbundle-avro
- fix build failures
  - delete unnecessary assertion for spark 3.0.x
  - use AvroConversionUtils#convertAvroSchemaToStructType instead of calling SchemaConverters#toSqlType directly to solve the compilation failures with spark-shade-unbundle-avro (#5)

Co-authored-by: Yann <biyan900116@gmail.com>
2021-11-22 02:16:45 -08:00
Danny Chan
8281cbf762 [HUDI-2799] Fix the classloader of flink write task (#4042) 2021-11-22 11:05:05 +08:00
董可伦
2533a9cc17 [MINOR] Fix typos (#4053) 2021-11-21 16:34:59 +08:00
Nate Radtke
887787e8b9 [HUDI-1932] Update Hive sync timestamp when change detected (#3053)
* Update Hive sync timestamp when change detected

Only update the last commit timestamp on the Hive table when the table schema
has changed or a partition is created/updated.

When using AWS Glue Data Catalog as the metastore for Hive this will ensure
that table versions are substantive (including schema and/or partition
changes). Prior to this change when a Hive sync is performed without schema
or partition changes the table in the Glue Data Catalog would have a new
version published with the only change being the timestamp property.

https://issues.apache.org/jira/browse/HUDI-1932

* add conditional sync flag

* fix testSyncWithoutDiffs

* fix HiveSyncConfig

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2021-11-21 12:11:05 +05:30
Danny Chan
520538b15d [HUDI-2392] Make flink parquet reader compatible with decimal BINARY encoding (#4057) 2021-11-21 13:27:18 +08:00
Danny Chan
0411f73c7d [HUDI-2804] Add option to skip compaction instants for streaming read (#4051) 2021-11-21 12:38:56 +08:00
leesf
74b59a44ec [HUDI-2813] Claim RFC number for RFC for spark datasource V2 Integration (#4059) 2021-11-20 18:59:12 -08:00
dufeng1010
305d160081 [MINOR] optimize in constructor of inputbatch class (#4040)
Co-authored-by: 闫杜峰 <yandufeng@sinochem.com>
2021-11-21 10:11:01 +08:00
rmahindra123
1a5484d2db [MINOR] Claim RFC number for RFC for debezium source for deltastreamer (#4047) 2021-11-21 09:28:48 +08:00
vinoth chandar
ae0c67d9fc [HUDI-2795] Add mechanism to safely update,delete and recover table properties (#4038)
* [HUDI-2795] Add mechanism to safely update,delete and recover table properties

  - Fail safe mechanism, that lets queries succeed off a backup file
  - Readers who are not upgraded to this version of code will just fail until recovery is done.
  - Added unit tests that exercises all these scenarios.
  - Adding CLI for recovery, updation to table command.
  - [Pending] Add some hash based verification to ensure any rare partial writes for HDFS

* Fixing upgrade/downgrade infrastructure to use new updation method
2021-11-20 08:07:40 -08:00
Harsha Teja Kanna
f4b974ac7b [HUDI-2742] Added S3 object filter to support multiple S3EventsHoodieIncrSources single S3 meta table (#4025) 2021-11-20 14:54:21 +05:30
Ron
6cc97cc0c9 Remove the aws packages from hudi flink bundle jar (#4050) 2021-11-20 11:55:12 +08:00