Commit Graph

2712 Commits

Author SHA1 Message Date
Danny Chan
b9fbada2f2 [minor] Follow 3178, fix the flink metadata table compaction (#5175) 2022-03-30 20:45:29 +08:00
harshal
7fa363923c [HUDI-3745] Support for spark datasource options in S3EventsHoodieIncrSource (#5170) 2022-03-30 01:34:49 -04:00
Sivabalan Narayanan
4fed8dd319 [HUDI-3485] Adding scheduler pool configs for async clustering (#5043) 2022-03-29 21:27:45 -04:00
Danny Chan
5c1b482a1b [HUDI-3741] Fix flink bucket index bulk insert generates too many small files (#5164) 2022-03-30 08:18:36 +08:00
ForwardXu
941c254c33 [HUDI-2520] Fix CTAS statment issue when sync to hive (#5145) 2022-03-29 12:25:31 -07:00
Alexey Kudinkin
e5a2baeed0 [HUDI-3549] Removing dependency on "spark-avro" (#4955)
Hudi will be taking on a promise for its bundles to stay compatible with Spark minor versions (e.g. 2.4, 3.1, 3.2), meaning that a single build of Hudi (e.g. "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc.).

To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root-cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR).

Instead of bundling "spark-avro" as a dependency, we will be copying over some of the classes Hudi depends on and maintaining them alongside the Hudi code-base to make sure we're able to provide the aforementioned guarantee. To work around arising compatibility issues, we will be applying local patches to guarantee compatibility of Hudi bundles w/in the Spark minor version branches.

The following mapping of Hudi modules to Spark minor branches is currently maintained:

"hudi-spark3" -> 3.2.x
"hudi-spark3.1.x" -> 3.1.x
"hudi-spark2" -> 2.4.x
The following class hierarchies (borrowed from "spark-avro") are maintained w/in these Spark-specific modules to guarantee compatibility with the respective minor version branches:

AvroSerializer
AvroDeserializer
AvroUtils
Each of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules.

The SchemaConverters class, in turn, is shared across all those modules given its relative stability (there are only cosmetic changes from 2.4.4 to 3.2.1).
All of the aforementioned classes have their scope of visibility limited to the corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure the broader code-base does not become dependent on them and instead relies on facades abstracting them.

Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in the aforementioned minor version branches, additional build steps were added to validate that Hudi can be properly compiled against those versions. Testing, however, is performed only against the most recent patch versions of Spark with the help of Azure CI.

Brief change log:
- Removing spark-avro bundling from Hudi by default
- Scaffolded Spark 3.2.x hierarchy
- Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy
- Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy
- Moved ExpressionCodeGen,ExpressionPayload into hudi-spark module
- Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0
- Modified bot.yml to build full matrix of supported Spark versions
- Removed "spark-avro" dependency from all modules
- Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.
2022-03-29 14:44:47 -04:00
ForwardXu
0802510ca9 [HUDI-2520] Fix drop partition issue when sync to hive (#5147) 2022-03-29 11:28:19 -07:00
Alexey Kudinkin
fcb003ec76 [HUDI-3731] Fixing Column Stats Index record Merging sequence missing columnName (#5159)
* Added `DataSkippingFailureMode` to control how DS handles failures in the flow (either "strict", when exception would be thrown, or "fallback" when it will just fallback to the full-scan)

* Make sure tests execute in `DataSkippingFailureMode.Strict`

* Fixed Column Stats Index record merging sequence missing `columnName`
2022-03-29 21:09:56 +05:30
Raymond Xu
1b2fb71afc [MINOR] Move Experiemental to javadoc (#5161) 2022-03-28 21:07:59 -07:00
Nicolas Paris
7c7ecb11d5 [HUDI-3736] Fix default dynamodblock url default value (#4967) 2022-03-28 20:31:46 -07:00
leesf
8f8a8158e2 [HUDI-2520] Fix drop table issue when sync to Hive (#5143) 2022-03-28 19:34:12 -07:00
Danny Chan
3bf9c5ffe8 [HUDI-3728] Set the sort operator parallelism for flink bucket bulk insert (#5154) 2022-03-29 09:52:35 +08:00
ForwardXu
72e0b52b18 [HUDI-3722] Fix truncate hudi table's error (#5140) 2022-03-29 09:44:18 +08:00
Sivabalan Narayanan
d074089c62 [HUDI-2566] Adding multi-writer test support to integ test (#5065) 2022-03-28 17:05:00 -04:00
Raymond Xu
6ccbae4d2a [HUDI-2757] Implement Hudi AWS Glue sync (#5076) 2022-03-28 14:54:59 -04:00
Y Ethan Guo
4ed84b216d [HUDI-3720] Fix the logic of reattempting pending rollback (#5148) 2022-03-28 14:54:31 -04:00
Shawy Geng
2e2d08cb72 [HUDI-3539] Flink bucket index bucketID bootstrap optimization. (#5093)
* [HUDI-3539] Flink bucket index bucketID bootstrap optimization.

Co-authored-by: gengxiaoyu <gengxiaoyu@bytedance.com>
2022-03-28 19:50:36 +08:00
huberylee
1d0f4ccfe0 [HUDI-3538] Support Compaction Command Based on Call Procedure Command for Spark SQL (#4945)
* Support Compaction Command Based on Call Procedure Command for Spark SQL

* Addressed review comments
2022-03-28 14:11:35 +08:00
ForwardXu
d31cde284c [MINOR] Fix call command parser use spark3.2 (#5144) 2022-03-28 11:13:44 +08:00
Sivabalan Narayanan
f2a93ead3b [HUDI-3724] Fixing closure of ParquetReader (#5141) 2022-03-28 09:36:15 +08:00
xiarixiaoyao
9da2dd416e [HUDI-3719] High performance costs of AvroSerizlizer in DataSource wr… (#5137)
* [HUDI-3719] High performance costs of AvroSerizlizer in DataSource writing

* add benchmark framework which modify from spark
add avroSerDerBenchmark
2022-03-27 11:01:43 -07:00
Sivabalan Narayanan
85c4a6cfc1 [MINOR] Relaxing cleaner and archival configs (#5142) 2022-03-27 12:26:24 -04:00
Y Ethan Guo
484b3407e0 [HUDI-3604] Adjust the order of timeline changes in rollbacks (#5114) 2022-03-26 22:37:44 -07:00
Danny Chan
4d940bbf8a [HUDI-3716] OOM occurred when use bulk_insert cow table with flink BUCKET index (#5135) 2022-03-27 09:13:58 +08:00
Alexey Kudinkin
189d5297b8 [HUDI-3709] Fixing ParquetWriter impls not respecting Parquet Max File Size limit (#5129) 2022-03-26 17:51:36 -04:00
RexAn
57b4f39c31 [HUDI-3612] Clustering strategy should create new TypedProperties when modifying it (#5027) 2022-03-26 16:16:03 +05:30
Danny Chan
0c09a973fb [HUDI-3435] Do not throw exception when instant to rollback does not exist in metadata table active timeline (#4821) 2022-03-26 11:42:54 +08:00
Alexey Kudinkin
51034fecf1 [HUDI-3396] Refactoring MergeOnReadRDD to avoid duplication, fetch only projected columns (#4888) 2022-03-25 09:32:03 -07:00
ForwardXu
12cc8e715b [MINOR] fix QuickstartUtils move (#5133) 2022-03-25 07:34:35 -07:00
ForwardXu
e5c3f9089b [HUDI-3563] Make quickstart examples covered by CI tests (#5082) 2022-03-25 01:37:17 -07:00
wangxianghu
f20c9867d7 [HUDI-3711] Fix typo in MaxwellJsonKafkaSourcePostProcessor.Config#PRECOMBINE_FIELD_TYPE_PROP (#5096) 2022-03-25 00:02:54 -07:00
Alexey Kudinkin
8b38ddedc2 [HUDI-3594] Supporting Composite Expressions over Data Table Columns in Data Skipping flow (#4996) 2022-03-24 22:27:15 -07:00
Danny Chan
8896864d7b [HUDI-3678] Fix record rewrite of create handle when 'preserveMetadata' is true (#5088) 2022-03-25 11:48:50 +08:00
Surya Prasanna
2fd9a4de5c [HUDI-3580] Claim RFC number 48 for LogCompaction action RFC (#5128) 2022-03-24 20:26:04 -07:00
Zhaojing Yu
483ee843e6 [HUDI-3703] Reset taskID in restoreWriteMetadata (#5122) 2022-03-25 10:18:28 +08:00
Y Ethan Guo
eaa4c4f2e2 [HUDI-1180] Upgrade HBase to 2.4.9 (#5004)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-03-24 19:04:53 -07:00
Danny Chan
5e86cdd1e9 [HUDI-3701] Flink bulk_insert support bucket hash index (#5118) 2022-03-25 09:01:42 +08:00
Y Ethan Guo
608d4bf32d [HUDI-3638] Make ZookeeperBasedLockProvider serializable (#5112) 2022-03-24 17:59:47 -07:00
Y Ethan Guo
9b3dd2e0b7 [HUDI-3624] Check all instants before starting a commit in metadata table (#5098) 2022-03-24 17:13:58 -07:00
Y Ethan Guo
4ddd094ba2 [HUDI-3689] Disable flaky tests in TestHoodieDeltaStreamer (#5127) 2022-03-24 16:42:44 -07:00
Raymond Xu
ff136658a0 [HUDI-3689] Fix delta streamer tests (#5124) 2022-03-24 14:19:53 -07:00
Y Ethan Guo
44ab3b73ed [HUDI-3706] Downgrade maven surefire and failsafe version (#5123) 2022-03-24 09:31:46 -07:00
Raymond Xu
686da41696 [HUDI-3689] Fix UT failures in TestHoodieDeltaStreamer (#5120) 2022-03-24 09:10:33 -07:00
Raymond Xu
b14706502b [HUDI-3689] Remove Azure CI cache (#5121) 2022-03-24 05:39:11 -07:00
Alexey Kudinkin
ccc3728002 [HUDI-3684] Fixing NPE in ParquetUtils (#5102)
* Make sure nulls are properly handled in `HoodieColumnRangeMetadata`
2022-03-24 17:37:38 +05:30
Sagar Sumit
fe2c3989e3 [HUDI-3689] Fix glob path and hive sync in deltastreamer tests (#5117)
* Remove glob pattern basePath from the deltastreamer tests.

* [HUDI-3689] Fix file scheme config

for CI failure in TestHoodieRealTimeRecordReader

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-03-24 15:48:35 +05:30
Danny Chan
a1c42fcc07 [minor] Checks the data block type for archived timeline (#5106) 2022-03-24 14:10:43 +08:00
Sivabalan Narayanan
52f0498330 Fixing non partitioned all files record in MDT (#5108) 2022-03-23 19:26:39 -07:00
Sagar Sumit
f96ba7abf0 [HUDI-3642] Handle NPE due to empty requested replacecommit metadata (#5090) 2022-03-23 12:13:02 -07:00
Rajesh Mahindra
5f570ea151 [HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs (#4175)
- Refactor hive sync tool / config to use reflection and standardize configs

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-03-21 22:56:31 -04:00