277 Commits

Author SHA1 Message Date
Sagar Sumit
dfcd6d9a86 [HUDI-4011] Add hudi-aws-bundle (#5674)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-06-01 05:30:29 -07:00
Sagar Sumit
31e13db1f0 [HUDI-4023] Decouple hudi-spark from hudi-utilities-slim-bundle (#5641) 2022-05-26 11:28:49 +05:30
uday08bce
32a5d268f5 [HUDI-3890] fix rat plugin issue with sql files (#5644) 2022-05-21 12:22:55 -04:00
Raymond Xu
b5adba3e55 [MINOR] remove unused gson test dependency (#5652) 2022-05-21 05:34:08 -07:00
cxzl25
199f64255e [HUDI-4111] Bump ANTLR runtime version in Spark 3.x (#5606) 2022-05-18 19:18:52 +08:00
Raymond Xu
2e6e302efe [HUDI-3859] Fix spark profiles and utilities-slim dep (#5297) 2022-04-12 15:33:08 -07:00
Raymond Xu
5e65aefc61 [HUDI-3837] Fix license and rat check settings (#5273)
- add missing licenses
- fix CI setting to run rat plugin
- fix deploy script to include integ test modules
2022-04-09 11:01:18 -07:00
Danny Chan
e33149be9a [HUDI-3808] Flink bulk_insert timestamp(3) can not be read by Spark (#5236) 2022-04-07 15:17:39 +08:00
Raymond Xu
e96f08f355 Moving to 0.12.0-SNAPSHOT on master branch. 2022-04-06 15:24:10 +08:00
Vinoth Govindarajan
20964df770 [HUDI-3357] MVP implementation of BigQuerySyncTool (#5125)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-02 13:18:06 -07:00
xiarixiaoyao
444ff496a4 [RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark (#4910)
* [HUDI-2560] introduce id_based schema to support full schema evolution.

* add test for FileBasedInternalSchemaStorageManger and rebase code

* add support for change column type and fix some test case

* fix some bugs encountered in the production env and delete useless code

* fix test error

* rebase code

* fixed some nested schema change bugs

* [HUDI-2429][Stacked On HUDI-2560]Support full schema evolution for spark

* [use dummyInternalSchema instead of null]

* add support for spark3.1.x

* remove support for spark3.1.x, since it fails to compile

* support spark3.1.x

* rebase and prepare solve all comments

* address all comments

* rebase code

* fixed the count(*) bug

* try to get internalSchema by parsing the commit file/history file directly instead of using metaclient, which is time-consuming
address some comments

* fixed all comments

* fix new comments

* rebase code,fix UT failed

* fixed mistake

* rebase code ,fixed new comments

* rebase code , and prepare for address new comments

* address commits

* address new comments

* fix new issues

* control fallback to original write logic
2022-04-01 13:20:24 -07:00
Y Ethan Guo
4fb1a590b1 [HUDI-3700] Add hudi-utilities-slim-bundle excluding hudi-spark-datasource modules (#5176) 2022-03-30 18:08:35 -07:00
Y Ethan Guo
9830005e9b [HUDI-3681] Provision additional hudi-spark-bundle with different versions (#5171) 2022-03-30 17:35:56 -07:00
Raymond Xu
31d4a16deb [HUDI-3536] Add hudi-datahub-sync implementation (#5155) 2022-03-30 14:38:02 -07:00
Alexey Kudinkin
e5a2baeed0 [HUDI-3549] Removing dependency on "spark-avro" (#4955)
Hudi will commit to keeping its bundles compatible with Spark minor versions (e.g. 2.4, 3.1, 3.2), meaning that a single build of Hudi (e.g. "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in this case 3.2.1, 3.2.0, etc.)

To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root-cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR).

Instead of bundling "spark-avro" as a dependency, we will copy over some of the classes Hudi depends on and maintain them alongside the Hudi code-base, so that we can provide the aforementioned guarantee. To work around compatibility issues as they arise, we will apply local patches that keep Hudi bundles compatible within each Spark minor version branch.

The following mapping of Hudi modules to Spark minor branches is currently maintained:

"hudi-spark3" -> 3.2.x
"hudi-spark3.1.x" -> 3.1.x
"hudi-spark2" -> 2.4.x
The following class hierarchies (borrowed from "spark-avro") are maintained within these Spark-specific modules to guarantee compatibility with the respective minor version branches:

AvroSerializer
AvroDeserializer
AvroUtils
Each of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules.

The SchemaConverters class, in turn, is shared across all those modules given its relative stability (there are only cosmetic changes from 2.4.4 to 3.2.1).
All of the aforementioned classes have their corresponding scope of visibility limited to corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure broader code-base does not become dependent on them and instead relies on facades abstracting them.

Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in aforementioned minor versions branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI.
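
The module-to-branch mapping listed above can be illustrated with a small lookup. This is only a sketch for readers of this log, not code from the Hudi build: the dictionary contents come from the commit message, while the names `SPARK_MINOR_TO_HUDI_MODULE` and `hudi_module_for` are hypothetical.

```python
# Mapping copied from the commit message: Spark minor branch -> Hudi module.
# The names below are hypothetical and do not exist in the Hudi code-base.
SPARK_MINOR_TO_HUDI_MODULE = {
    "3.2": "hudi-spark3",
    "3.1": "hudi-spark3.1.x",
    "2.4": "hudi-spark2",
}


def hudi_module_for(spark_version: str) -> str:
    """Return the Hudi module listed for the minor branch of a Spark version."""
    parts = spark_version.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected a version like '3.2.1', got {spark_version!r}")
    # Only major.minor matters: every patch version in a branch shares one module.
    minor = ".".join(parts[:2])
    if minor not in SPARK_MINOR_TO_HUDI_MODULE:
        raise ValueError(f"no Hudi module listed for Spark {minor}.x")
    return SPARK_MINOR_TO_HUDI_MODULE[minor]
```

Under this sketch, `hudi_module_for("3.2.0")` and `hudi_module_for("3.2.1")` resolve to the same module, which mirrors the stated goal of one build per minor branch.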

Brief change log:
- Removing spark-avro bundling from Hudi by default
- Scaffolded Spark 3.2.x hierarchy
- Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy
- Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy
- Moved ExpressionCodeGen,ExpressionPayload into hudi-spark module
- Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0
- Modified bot.yml to build full matrix of supported Spark versions
- Removed "spark-avro" dependency from all modules
- Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.
2022-03-29 14:44:47 -04:00
Y Ethan Guo
eaa4c4f2e2 [HUDI-1180] Upgrade HBase to 2.4.9 (#5004)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-03-24 19:04:53 -07:00
Y Ethan Guo
44ab3b73ed [HUDI-3706] Downgrade maven surefire and failsafe version (#5123) 2022-03-24 09:31:46 -07:00
Raymond Xu
686da41696 [HUDI-3689] Fix UT failures in TestHoodieDeltaStreamer (#5120) 2022-03-24 09:10:33 -07:00
Danny Chan
799c78e688 [HUDI-3665] Support flink multiple versions (#5072) 2022-03-21 10:34:50 +08:00
Sivabalan Narayanan
d40adfa2d7 [HUDI-3620] Adding spark3.2.0 profile (#5038) 2022-03-14 19:14:00 -04:00
Raymond Xu
ed26c5265c [HUDI-3584] Skip integ test modules by default (#4986) 2022-03-08 06:32:04 -08:00
ForwardXu
25385805aa [HUDI-3574] Improve maven module configs for different spark profiles (#4970) 2022-03-08 01:01:05 -08:00
Alexey Kudinkin
85e8a5c4de [HUDI-1296] Support Metadata Table in Spark Datasource (#4789)
* Bootstrapping initial support for Metadata Table in Spark Datasource

- Consolidated Avro/Row conversion utilities to center around Spark's AvroDeserializer ; removed duplication
- Bootstrapped HoodieBaseRelation
- Updated HoodieMergeOnReadRDD to be able to handle Metadata Table
- Modified MOR relations to be able to read different Base File formats (Parquet, HFile)
2022-02-24 16:23:13 -05:00
Yann Byron
0c950181aa [HUDI-3423] upgrade spark to 3.2.1 (#4815) 2022-02-21 16:52:21 -08:00
Sagar Sumit
ed106f671e [HUDI-2809] Introduce a checksum mechanism for validating hoodie.properties (#4712)
Fix dependency conflict

Fix repairs command

Implement putIfAbsent for DDB lock provider

Add upgrade step and validate while fetching configs

Validate checksum for latest table version only while fetching config

Move generateChecksum to BinaryUtil

Rebase and resolve conflict

Fix table version check
2022-02-18 10:17:06 +05:30
Yuqi Gu
e639d99387 [HUDI-1657] Fix the build on aarch64, Fedora 33 (#4617) 2022-02-14 15:10:18 -08:00
Yann Byron
d971974063 [HUDI-3333] fix that getNestedFieldVal breaks with Spark 3.2 (#4783) 2022-02-10 06:12:16 -08:00
Danny Chan
b3b44236fe [HUDI-3389] Bump flink version to 1.14.3 (#4776) 2022-02-10 11:32:01 +08:00
Sivabalan Narayanan
16138db4f2 [HUDI-3368] Revert "[HUDI-3306] Upgrade rocksdb version (#4663)" (#4733)
This reverts commit 6f10107998.
2022-02-01 14:18:38 -05:00
Satyam Raj
6f10107998 [HUDI-3306] Upgrade rocksdb version (#4663)
Co-authored-by: Satyam Raj <satyam.raj@olacabs.com>
2022-01-24 14:53:20 -05:00
leesf
5ce45c440b [HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation (#4514)
* Introduce hudi-spark3-common and hudi-spark2-common modules to place classes that would be reused in different spark versions, also introduce hudi-spark3.1.x to support spark 3.1.x.
* Introduce hudi format under hudi-spark2, hudi-spark3, hudi-spark3.1.x modules and change the hudi format in original hudi-spark module to hudi_v1 format.
* Manually tested on Spark 3.1.2 and Spark 3.2.0 SQL.
* Added a README.md file under hudi-spark-datasource module.
2022-01-14 13:42:35 +08:00
Sagar Sumit
12e95771ee [HUDI-3235] Fix ClassNotFoundException due to log4j-core dependency (#4574)
- Move log4j-core to top level pom
2022-01-12 11:53:43 -05:00
Raymond Xu
f74cd57320 [HUDI-3195] Fix spark 3 pom (#4554)
- drop 3.0.x profile
- update readme
- update build CI bot.yml
- fix spark 3 bundle name
2022-01-10 19:11:22 -08:00
Yann Byron
03a83ffeb5 [HUDI-3195] optimize spark3 pom and modify build command (#4538) 2022-01-07 23:21:39 -08:00
leesf
188d0338c4 [HUDI-3134] Fix insert error after adding columns on Spark 3.2.0 (#4488) 2022-01-01 17:38:14 -08:00
Udit Mehrotra
9412281cb1 [HUDI-2983] Remove Log4j2 transitive dependencies (#4281) 2021-12-28 07:15:05 -08:00
Yann Byron
05942e018c [HUDI-2811] Support Spark 3.2 (#4270) 2021-12-28 00:12:44 -08:00
zhangyue19921010
f3f6112b75 [HUDI-3070] Add rerunFailingTestsCount for flaky tests (#4398)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-12-20 19:59:50 -08:00
wenningd
15444c951f [HUDI-2946] Upgrade maven plugins to be compatible with higher Java versions (#4232)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2021-12-11 20:18:39 -08:00
Y Ethan Guo
72901a33a1 [HUDI-2784] Add a hudi-trino-bundle for Trino (#4279) 2021-12-10 14:27:22 -08:00
ForwardXu
63b15607ff [HUDI-2937] Introduce a pulsar implementation of hoodie write commit … (#4217)
* [HUDI-2937] Introduce a pulsar implementation of hoodie write commit callback
2021-12-05 11:51:06 +04:00
yuzhao.cyz
a1d0ff4209 Moving to 0.11.0-SNAPSHOT on master branch. 2021-11-27 17:22:10 +08:00
wenningd
1ee12cfa6f [HUDI-2314] Add support for DynamoDb based lock provider (#3486)
- Co-authored-by: Wenning Ding <wenningd@amazon.com>
- Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2021-11-17 12:09:31 -05:00
Alexey Kudinkin
cbcbec4d38 [MINOR] Fixed checkstyle config to be based off Maven root-dir (requires Maven >=3.3.1 to work properly); (#4009)
Updated README
2021-11-16 21:30:16 -05:00
Yann Byron
1f17467f73 [HUDI-1869] Upgrading Spark3 To 3.1 (#3844)
Co-authored-by: pengzhiwei <pengzhiwei2015@icloud.com>
2021-11-02 18:25:12 -07:00
Sivabalan Narayanan
f9bc3e03e5 [MINOR] Adding a deprecated constructor to AbstractSyncHoodieClient (#3902) 2021-11-02 12:16:38 -04:00
Sagar Sumit
5302b9a4ef [HUDI-2662] Downloads from Nexus Pentaho repo taking too long (#3901)
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2021-11-01 19:14:48 -04:00
vinoyang
b1c4acf0ae [HUDI-2614] Remove duplicated hadoop-hdfs with tests classifier exists in bundles (#3864) 2021-10-26 22:36:10 +08:00
rmahindra123
3686c25fae [HUDI-2469] [Kafka Connect] Replace json based payload with protobuf for Transaction protocol. (#3694)
* Substitute Control Event with protobuf

* Fix tests

* Fix unit tests

* Add javadocs

* Address reviewer comments

Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
2021-10-19 14:29:48 -07:00
rmahindra123
e528dd798a [HUDI-2394] Implement Kafka Sink Protocol for Hudi for Ingesting Immutable Data (#3592)
- Fixing packaging, naming of classes
 - Use of log4j over slf4j for uniformity
- More follow-on fixes
 - Added a version to control/coordinator events.
 - Eliminated the config added to write config
 - Fixed fetching of checkpoints based on table type
 - Clean up of naming, code placement

Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-09-10 18:20:26 -07:00