- Documented principles applied for redesign at packaging/README.md
- No longer depends on incl commons-codec, commons-io, commons-pool, commons-dbcp, commons-lang, commons-logging, avro-mapred
- Introduce new FileIOUtils & added checkstyle rule for illegal import of above
- Parquet, Avro dependencies moved to provided scope to enable being picked up from Hive/Spark/Presto instead
- Pickup jackson jars for Hive sync tool from HIVE_HOME & unbundling jackson everywhere
- Remove hive-jdbc standalone jar from being bundled in Spark/Hive/Utilities bundles
- 6.5x reduced number of classes across bundles
- Redo all classes based on org.parquet only
- remove unuused dependencies like parquet-hadoop, common-configuration2
- timeline-service does not build a fat jar anymore
- Fix utilities and hadoop-mr bundles based on above
- [HUDI-172] Cleanup Maven POM/Classpath
- Fix ordering of dependencies in poms, to enable better resolution
- Idea is to place more specific ones at the top
- And place dependencies which use them below them
- [HUDI-68] : Automate demo steps on docker setup
- Move hive queries from hive cli to beeline
- Standardize on taking query input from text command files
- Deltastreamer ingest, also does hive sync in a single step
- Spark Incremental Query materialized as a derived Hive table using datasource
- Fix flakiness in HDFS spin up and output comparison
- Code cleanup around streamlining and loc reduction
- Also fixed pom to not shade some hive classs in spark, to enable hive sync
1. Go through dependencies list one round to ensure compliance. Generated current NOTICE list in all submodules (other apache projects like flink does this).
To be on conservative side regarding licensing, NOTICE.txt lists all dependencies including transitive. Pending Compliance questions reported in https://issues.apache.org/jira/browse/LEGAL-461
2. Automate generating NOTICE.txt files to allow future package compliance issues be identified early as part of code-review process.
3. Added NOTICE.txt and LICENSE.txt to all HUDI jars
jira link https://issues.apache.org/jira/browse/HUDI-101
issue link https://github.com/apache/incubator-hudi/issues/516#issue-386048519
when using spark-shell with hoodie save data like :
```
./spark-shell --master yarn --jars /home/hdfs/software/spark/hoodie/hoodie-spark-bundle-0.4.8-SNAPSHOT.jar --conf spark.sql.hive.convertMetastoreParquet=false --packages com.databricks:spark-avro_2.11:4.0.0
```
and
```
inputDF.write.format("com.uber.hoodie")
.option("hoodie.insert.shuffle.parallelism", "1") // any hoodie client config can be passed like this
.option("hoodie.upsert.shuffle.parallelism", "1") // full list in HoodieWriteConfig & its package
.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, HoodieTableType.COPY_ON_WRITE.name())
.option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) // insert
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "extend_deal_date")
.option(HoodieWriteConfig.TABLE_NAME, "c_upload_code")
.mode(SaveMode.Overwrite)
.save("/tmp/test/hoodie")
```
It also report error `Invalid signature file digest for Manifest main attributes`. Need to scan all infected dependency.
- hoodie-utilities can now be pulled in as compile time dependency
- Lets users test their DeltaStreamer transformers for e.g
- Tested the docker demo works & takes in the bundle
- Doc changes to follow, to move DeltaStreamer commands to bundle jar
(1) Apply transformation when using delta-streamer to ingest data.
(2) Add Hudi Incremental Source for Delta Streamer
(3) Allow delta-streamer config-property to be passed as command-line
(4) Add Hive Integration to Delta-Streamer and address Review comments
(5) Ensure MultiPartKeysValueExtractor handle hive style partition description
(6) Reuse same spark session on both source and transformer
(7) Support extracting partition fields from _hoodie_partition_path for HoodieIncrSource
(8) Reuse Binary Avro coders
(9) Add push down filter for Incremental source
(10) Add Hoodie DeltaStreamer metrics to track total time taken
- Tests redone in the process
- Main changes are to RealtimeRecordReader and how it treats maps/arrays
- Make hive sync work with Hive 1/2 and CDH environments
- Fixes to make corner cases for Hive queries
- Spark Hive integration - Working version across Apache and CDH versions
- Known Issue - https://github.com/uber/hudi/issues/439