jira link https://issues.apache.org/jira/browse/HUDI-101
issue link https://github.com/apache/incubator-hudi/issues/516#issue-386048519
when using spark-shell with hoodie save data like :
```
./spark-shell --master yarn --jars /home/hdfs/software/spark/hoodie/hoodie-spark-bundle-0.4.8-SNAPSHOT.jar --conf spark.sql.hive.convertMetastoreParquet=false --packages com.databricks:spark-avro_2.11:4.0.0
```
and
```
inputDF.write.format("com.uber.hoodie")
.option("hoodie.insert.shuffle.parallelism", "1") // any hoodie client config can be passed like this
.option("hoodie.upsert.shuffle.parallelism", "1") // full list in HoodieWriteConfig & its package
.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, HoodieTableType.COPY_ON_WRITE.name())
.option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) // insert
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "extend_deal_date")
.option(HoodieWriteConfig.TABLE_NAME, "c_upload_code")
.mode(SaveMode.Overwrite)
.save("/tmp/test/hoodie")
```
It also report error `Invalid signature file digest for Manifest main attributes`. Need to scan all infected dependency.
- hoodie-utilities can now be pulled in as compile time dependency
- Lets users test their DeltaStreamer transformers for e.g
- Tested the docker demo works & takes in the bundle
- Doc changes to follow, to move DeltaStreamer commands to bundle jar
(1) Apply transformation when using delta-streamer to ingest data.
(2) Add Hudi Incremental Source for Delta Streamer
(3) Allow delta-streamer config-property to be passed as command-line
(4) Add Hive Integration to Delta-Streamer and address Review comments
(5) Ensure MultiPartKeysValueExtractor handle hive style partition description
(6) Reuse same spark session on both source and transformer
(7) Support extracting partition fields from _hoodie_partition_path for HoodieIncrSource
(8) Reuse Binary Avro coders
(9) Add push down filter for Incremental source
(10) Add Hoodie DeltaStreamer metrics to track total time taken
- Tests redone in the process
- Main changes are to RealtimeRecordReader and how it treats maps/arrays
- Make hive sync work with Hive 1/2 and CDH environments
- Fixes to make corner cases for Hive queries
- Spark Hive integration - Working version across Apache and CDH versions
- Known Issue - https://github.com/uber/hudi/issues/439