2175 Commits

Author SHA1 Message Date
zzzhy
61a03bc072 [MINOR] Fix the wrong usage of timestamp length variable bug (#4179)
Signed-off-by: zzzhy <candle_1667@163.com>
2021-12-02 22:47:31 +08:00
Alexey Kudinkin
772f5ca24e Fixed partitions produced by layout optimization in case order-by key is composed of a single column (#4183) 2021-12-01 20:56:04 -08:00
Shawy Geng
5284730175 [HUDI-2881] Compact the file group with larger log files to reduce write amplification (#4152) 2021-12-02 09:41:04 +08:00
Sivabalan Narayanan
f4c25ba3fd [HUDI-2880] Fixing loading of props from default dir (#4167)
* Fixing loading of props from default dir

* addressing comments
2021-12-01 16:02:30 +08:00
Y Ethan Guo
9b254b6fc5 Revert "[HUDI-2856] Bit cask disk map delete modified (#4116)" (#4171)
This reverts commit 257a6a7456.
2021-12-01 14:08:44 +08:00
Alexey Kudinkin
24380c2060 Revert "[HUDI-2855] Change the default value of 'PAYLOAD_CLASS_NAME' to 'DefaultHoodieRecordPayload' (#4115)" (#4169)
This reverts commit 88067f57a2.
2021-11-30 17:47:16 -08:00
Y Ethan Guo
ea009b55a3 [HUDI-2891] Fix write configs for Java engine in Kafka Connect Sink (#4161) 2021-11-30 09:45:50 -05:00
董可伦
a398aad1fc [HUDI-2642] Add support ignoring case in update sql operation (#3882) 2021-11-29 22:36:36 -08:00
董可伦
3433f00cb5 [MINOR] Fix typo,rename 'getUrlEncodePartitoning' to 'getUrlEncodePartitioning' (#4130) 2021-11-29 18:31:22 -08:00
Danny Chan
536af4b954 [MINOR] Fix syntax error in create_source_release.sh (#4150) 2021-11-29 14:17:24 +08:00
Sivabalan Narayanan
38e75ea806 Removing rfc from release package and fixing release validation script (#4147) 2021-11-29 13:18:35 +08:00
Sivabalan Narayanan
52aae36b53 [MINOR] Fixing integ test suite for hudi-aws and archival validation (#4142) 2021-11-28 20:11:50 -05:00
vortual
eca1693288 [MINOR] fix typo (#4140) 2021-11-28 17:13:50 +08:00
yuzhao.cyz
a1d0ff4209 Moving to 0.11.0-SNAPSHOT on master branch. 2021-11-27 17:22:10 +08:00
xiarixiaoyao
780a2ac5b2 [HUDI-2102] Support hilbert curve for hudi (#3952)
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
2021-11-26 23:20:19 -08:00
Manoj Govindassamy
2c7656c35f [HUDI-2475] [HUDI-2862] Metadata table creation and avoid bootstrapping race for write client & add locking for upgrade (#4114)
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2021-11-26 23:19:26 -08:00
Raymond Xu
3a8d64e584 [HUDI-2868] Fix skipped HoodieSparkSqlWriterSuite (#4125)
- Co-authored-by: Yann Byron <biyan900116@gmail.com>
2021-11-26 22:59:20 -05:00
Sivabalan Narayanan
9c059ef8e5 [MINOR] Follow ups from HUDI-2861 (re-use same rollback instant for failed rollback) (#4133) 2021-11-26 19:22:53 -05:00
xuzifu666
257a6a7456 [HUDI-2856] Bit cask disk map delete modified (#4116)
* modified BitCaskDiskMap_close_function

* change iterators location to finally

* Update BitCaskDiskMap.java
2021-11-26 15:11:01 -08:00
rmahindra123
9028e6e1e4 [HUDI-2864] Fix README and scripts with current limitations of hive sync (#4129)
* Fix README with current limitations of hive sync

* Fix README with current limitations of hive sync

* Fix dep issue

* Fix Copy on Write flow

Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-11-26 15:09:32 -08:00
huleilei
8402cac407 [HUDI-2848] Exclude guava from hudi-cli pom (#4100) 2021-11-26 16:56:03 -05:00
Manoj Govindassamy
445208a0d2 [HUDI-2845] Metadata CLI - files/partition file listing fix and new validate option (#4092)
- Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2021-11-26 16:44:16 -05:00
Y Ethan Guo
d1e83e4ba0 [HUDI-2767] Enabling timeline-server-based marker as default (#4112)
- Changes the default config of marker type (HoodieWriteConfig.MARKERS_TYPE or hoodie.write.markers.type) from DIRECT to TIMELINE_SERVER_BASED for Spark Engine.
- Adds engine-specific marker type configs: Spark -> TIMELINE_SERVER_BASED, Flink -> DIRECT, Java -> DIRECT.
- Uses DIRECT markers as well for Spark structured streaming because the timeline server is only available for the first mini-batch.
- Fixes the marker creation method for non-partitioned table in TimelineServerBasedWriteMarkers.
- Adds the fallback to direct markers even when TIMELINE_SERVER_BASED is configured, in WriteMarkersFactory: when HDFS is used, or embedded timeline server is disabled, the fallback to direct markers happens.
- Fixes the closing of timeline service.
- Fixes tests that depend on markers, mainly by starting the timeline service for each test.
2021-11-26 16:41:05 -05:00
Sivabalan Narayanan
f8e0176eb0 [HUDI-2861] Re-use same rollback instant time for failed rollbacks (#4123) 2021-11-26 16:36:42 -05:00
Sivabalan Narayanan
a88691fed3 [MINOR] Fixing test failure to fix CI build failure (#4132) 2021-11-26 13:50:10 -05:00
Alexey Kudinkin
5755ff25a4 [HUDI-2814] Addressing issues w/ Z-order Layout Optimization (#4060)
* `ZCurveOptimizeHelper` > `ZOrderingIndexHelper`;
Moved Z-index helper under `hudi.index.zorder` package

* Tidying up `ZOrderingIndexHelper`

* Fixing compilation

* Fixed index new/original table merging sequence to always prefer values from new index;
Cleaned up `HoodieSparkUtils`

* Added test for `mergeIndexSql`

* Abstracted Z-index name composition w/in `ZOrderingIndexHelper`;

* Fixed `DataSkippingUtils` to interrupt pruning in case data filter contains non-indexed column reference

* Properly handle exceptions originating during pruning in `HoodieFileIndex`

* Make sure no errors are logged upon encountering `AnalysisException`

* Cleaned up Z-index updating sequence;
Tidying up comments, java-docs;

* Fixed Z-index to properly handle changes of the list of clustered columns

* Tidying up

* `lint`

* Suppressing `JavaDocStyle` first sentence check

* Fixed compilation

* Fixing incorrect `DecimalType` conversion

* Refactored test `TestTableLayoutOptimization`
  - Added Z-index table composition test (against fixtures)
  - Separated out GC test;
Tidying up

* Fixed tests re-shuffling column order for Z-Index table `DataFrame` to align w/ the one loaded from JSON

* Scaffolded `DataTypeUtils` to do basic checks of Spark types;
Added proper compatibility checking b/w old/new index-tables

* Added test for Z-index tables merging

* Fixed import being shaded by creating internal `hudi.util` package

* Fixed packaging for `TestOptimizeTable`

* Revised `updateMetadataIndex` seq to provide Z-index updating process w/ source table schema

* Make sure existing Z-index table schema is sync'd to source table's one

* Fixed shaded refs

* Fixed tests

* Fixed type conversion of Parquet provided metadata values into Spark expected schemas

* Fixed `composeIndexSchema` utility to propose proper schema

* Added more tests for Z-index:
  - Checking that Z-index table is built correctly
  - Checking that Z-index tables are merged correctly (during update)

* Fixing source table

* Fixing tests to read from Parquet w/ proper schema

* Refactored `ParquetUtils` utility reading stats from Parquet footers

* Fixed incorrect handling of Decimals extracted from Parquet footers

* Worked around issues in javac failing to compile stream's collection

* Fixed handling of `Date` type

* Fixed handling of `DateType` to be parsed as `LocalDate`

* Updated fixture;
Make sure test loads Z-index fixture using proper schema

* Removed superfluous scheme adjusting when reading from Parquet, since Spark is actually able to perfectly restore schema (given Parquet was previously written by Spark as well)

* Fixing race-condition in Parquet's `DateStringifier` trying to share `SimpleDateFormat` object which is inherently not thread-safe

* Tidying up

* Make sure schema is used upon reading to validate input files are in the appropriate format;
Tidying up;

* Worked around javac (1.8) inability to infer expression type properly

* Updated fixtures;
Tidying up

* Fixing compilation after rebase

* Assert clustering have in Z-order layout optimization testing

* Tidying up exception messages

* XXX

* Added test validating Z-index lookup filter correctness

* Added more test-cases;
Tidying up

* Added tests for string expressions

* Fixed incorrect Z-index filter lookup translations

* Added more test-cases

* Added proper handling on complex negations of AND/OR expressions by pushing NOT operator down into inner expressions for appropriate handling

* Added `-target:jvm-1.8` for `hudi-spark` module

* Adding more tests

* Added tests for non-indexed columns

* Properly handle non-indexed columns by falling back to a re-write of containing expression as `TrueLiteral` instead

* Fixed tests

* Removing the parquet test files and disabling corresponding tests

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-11-26 10:02:15 -08:00
Manoj Govindassamy
3d75aca40d [HUDI-2850] Fixing Clustering CLI - schedule and run command fixes to avoid NumberFormatException (#4101) 2021-11-26 07:17:23 -05:00
Danny Chan
e9efbdb63c [HUDI-2863] Rename option 'hoodie.parquet.page.size' to 'write.parquet.page.size' (#4128) 2021-11-26 16:40:53 +08:00
mincwang
e554c7f468 [HUDI-2852] Table metadata returns empty for non-exist partition (#4117)
* [HUDI-2852] Table metadata returns empty for non-exist partition

* add unit test

* fix code checkstyle

Co-authored-by: wangminchao <wangminchao@asinking.com>
2021-11-26 16:24:03 +08:00
Danny Chan
f5da9b50fa [MINOR] Include hudi-aws in flink bundle jar (#4127)
HUDI-2801 makes this jar required.
2021-11-26 14:36:44 +08:00
Ron
38585e4e57 [HUDI-2851] Shade org.apache.hadoop.hive.ql.optimizer package for flink bundle jar (#4104) 2021-11-26 11:27:21 +08:00
Sivabalan Narayanan
8340ccb503 [HUDI-2005] Removing direct fs call in HoodieLogFileReader (#3865) 2021-11-25 18:51:38 -05:00
Alexey Kudinkin
6f5d8d04cd [HUDI-2840] Fixed DeltaStreamer to properly respect configuration passed t/h properties file (#4090)
* Rebased `DFSPropertiesConfiguration` to access Hadoop config in lieu of FS to avoid confusion

* Fixed `readConfig` to take Hadoop's `Configuration` instead of FS;
Fixing usages

* Added test for local FS access

* Rebase to use `FSUtils.getFs`

* Combine properties provided as a file along w/ overrides provided from the CLI

* Added helper utilities to `HoodieClusteringConfig`;
Make sure corresponding config methods fallback to defaults;

* Fixed DeltaStreamer usage to respect properly combined configuration;
Abstracted `HoodieClusteringConfig.from` convenience utility to init Clustering config from `Properties`

* Tidying up

* `lint`

* Reverting changes to `HoodieWriteConfig`

* Tidying up

* Fixed incorrect merge of the props

* Converted `HoodieConfig` to wrap around `Properties` into `TypedProperties`

* Fixed compilation

* Fixed compilation
2021-11-25 14:48:22 -08:00
Udit Mehrotra
e0125a7911 [HUDI-2801] Add Amazon CloudWatch metrics reporter (#4081) 2021-11-25 13:33:16 -08:00
Sivabalan Narayanan
8e1379384a [HUDI-2841] Fixing lazy rollback for MOR with list based strategy (#4110) 2021-11-25 16:06:04 -05:00
Sivabalan Narayanan
6a0f079866 [HUDI-2858] Fixing handling of cluster update reject exception in deltastreamer (#4120) 2021-11-26 01:04:07 +05:30
Sivabalan Narayanan
f692078d32 [HUDI-2671] Making error -> warn logs from timeline server with concurrent writers for inconsistent state (#4088)
* Making error -> warn logs from timeline server with concurrent writers for inconsistent state

* Fixing bad request response exception for timeline out of sync

* Addressing feedback. Removed write concurrency mode dependency
2021-11-25 11:21:32 -08:00
Sivabalan Narayanan
7bb90e8caf [HUDI-2794] Guarding table service commits within a single lock to commit to both data table and metadata table (#4037)
* Fixing a single lock to commit table services across metadata table and data table

* Addressing comments

* rebasing with master
2021-11-25 11:19:30 -08:00
Sagar Sumit
b972aa5bf2 [HUDI-2800] Remove rdd.isEmpty() validation to prevent CreateHandle being called twice (#4121) 2021-11-25 23:46:36 +05:30
satishm
264e1ce63c [HUDI-1290] fixing mysql debezium source (#4119) 2021-11-25 11:26:59 -05:00
Danny Chan
a2eb2b0b0a [HUDI-2480] FileSlice after pending compaction-requested instant-time… (#3703)
* [HUDI-2480] FileSlice after pending compaction-requested instant-time is ignored by MOR snapshot reader

* include file slice after a pending compaction for spark reader

Co-authored-by: garyli1019 <yanjia.gary.li@gmail.com>
2021-11-25 22:30:09 +08:00
董可伦
88067f57a2 [HUDI-2855] Change the default value of 'PAYLOAD_CLASS_NAME' to 'DefaultHoodieRecordPayload' (#4115) 2021-11-25 19:17:38 +08:00
Sivabalan Narayanan
a9bd20804b [HUDI-2792] Configure metadata payload consistency check (#4035)
- Relax metadata payload consistency check to consider spark task failures with spurious deletes
2021-11-24 21:56:31 -05:00
rmahindra123
83f8ed2ae3 [HUDI-1290] Add Debezium Source for deltastreamer (#4063)
* add source for postgres debezium

* Add tests for debezium payload

* Fix test

* Fix test

* Add tests for debezium source

* Add tests for debezium source

* Fix schema for debezium

* Fix checkstyle issues

* Fix config issue for schema registry

* Add mysql source for debezium

* Fix checkstyle issues and tests

* Improve code for merging toasted values

* Improve code for merging toasted values

Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-11-24 17:57:02 -08:00
rmahindra123
abc0175cf7 [HUDI-1290] [RFC-39] Deltastreamer avro source for Debezium CDC (#4048)
* Add RFC entry for deltastreamer source for debezium

* Add RFC for debezium source

* Add RFC for debezium source

* Add RFC for debezium source

* fix hyperlink issue and rebase

* Update progress

Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-11-24 17:31:34 -08:00
Y Ethan Guo
bef373fa1d [MINOR] Fix build failure due to checkstyle issues (#4111) 2021-11-24 17:17:46 -08:00
Alexey Kudinkin
51297736ca [HUDI-2844][CLI] Fixing archived Timeline crashing if timeline contains REPLACE_COMMIT (#4091) 2021-11-24 16:53:29 -08:00
rmahindra123
7286b56d30 [HUDI-2853] Add JMX deps in hudi utilities and kafka connect bundles (#4108)
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-11-24 19:03:01 -05:00
Sivabalan Narayanan
435ea1543c [HUDI-2793] Fixing deltastreamer checkpoint fetch/copy over (#4034)
- Removed the copy over logic in transaction utils. Deltastreamer will go back to previous commits and get the checkpoint value.
2021-11-24 18:26:40 -05:00
Sagar Sumit
ff94d92980 [HUDI-2766] Cluster update strategy should not be fenced by write config (#4093)
Fix pending clustering rollback test
2021-11-24 10:15:40 -08:00