- Eliminating in-memory deltaRecordsMap
- Use writerSchema to generate generic record needed by custom payloads
- changes to make tests work with hive 2.x
- Generalized BloomIndex to work with file ids instead of paths
- Abstracted away Bloom filter checking into HoodieLookupHandle
- Abstracted away range information retrieval into HoodieRangeInfoHandle
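The lookup-handle refactor above can be illustrated with a minimal sketch. All names here (`LookupHandleSketch`, `candidateKeys`) are hypothetical, and a plain `HashSet` stands in for the real Bloom filter; the point is only that the handle is keyed by file id rather than full file path:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a per-file Bloom lookup handle, keyed by file id
// rather than full path (a plain HashSet stands in for the real Bloom filter).
class LookupHandleSketch {
    private final String fileId;
    private final Set<String> bloom;

    LookupHandleSketch(String fileId, Collection<String> keysInFile) {
        this.fileId = fileId;
        this.bloom = new HashSet<>(keysInFile);
    }

    // Returns the subset of record keys that may be present in this file;
    // a real Bloom filter would also admit false positives here.
    List<String> candidateKeys(List<String> recordKeys) {
        List<String> candidates = new ArrayList<>();
        for (String key : recordKeys) {
            if (bloom.contains(key)) {
                candidates.add(key);
            }
        }
        return candidates;
    }

    String getFileId() { return fileId; }

    public static void main(String[] args) {
        LookupHandleSketch handle =
            new LookupHandleSketch("file-001", Arrays.asList("k1", "k2"));
        System.out.println(handle.candidateKeys(Arrays.asList("k1", "k3"))); // prints [k1]
    }
}
```

Keying by file id rather than path means the index lookup is stable even when the physical file for a file group changes across versions.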
1. Went through the dependency list one round to ensure compliance. Generated the current NOTICE list in all submodules (other Apache projects, such as Flink, do this).
To be on the conservative side regarding licensing, NOTICE.txt lists all dependencies, including transitive ones. Pending compliance questions are reported in https://issues.apache.org/jira/browse/LEGAL-461
2. Automate generating the NOTICE.txt files so that future package compliance issues can be identified early as part of the code-review process.
3. Added NOTICE.txt and LICENSE.txt to all HUDI jars
JIRA: https://issues.apache.org/jira/browse/HUDI-101
Issue: https://github.com/apache/incubator-hudi/issues/516#issue-386048519
When using spark-shell with Hoodie to save data like:
```
./spark-shell --master yarn --jars /home/hdfs/software/spark/hoodie/hoodie-spark-bundle-0.4.8-SNAPSHOT.jar --conf spark.sql.hive.convertMetastoreParquet=false --packages com.databricks:spark-avro_2.11:4.0.0
```
and
```
// Imports assuming the 0.4.x (com.uber.hoodie) package layout
import com.uber.hoodie.DataSourceWriteOptions
import com.uber.hoodie.common.model.HoodieTableType
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

inputDF.write.format("com.uber.hoodie")
  .option("hoodie.insert.shuffle.parallelism", "1") // any hoodie client config can be passed like this
  .option("hoodie.upsert.shuffle.parallelism", "1") // full list in HoodieWriteConfig & its package
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, HoodieTableType.COPY_ON_WRITE.name())
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) // upsert
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "extend_deal_date")
  .option(HoodieWriteConfig.TABLE_NAME, "c_upload_code")
  .mode(SaveMode.Overwrite)
  .save("/tmp/test/hoodie")
```
It also reports the error `Invalid signature file digest for Manifest main attributes`. We need to scan all affected dependencies.
- hoodie-utilities can now be pulled in as a compile-time dependency
- Lets users test their DeltaStreamer transformers, for example
- Tested that the docker demo works & takes in the bundle
- Doc changes to follow, to move the DeltaStreamer commands to the bundle jar
- Join based on HoodieKey and not RecordKey during tagging
- Unit tests changed to run with duplicate keys
- Special-casing GlobalBloom to still join by record key
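The tagging change above (joining on the full HoodieKey rather than the record key alone) can be sketched as follows. The class names `KeySketch` and `TaggingSketch` are illustrative, not Hudi's actual tagging code:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Illustrative composite key (record key + partition path), mirroring HoodieKey:
// joining on both fields keeps duplicate record keys in different partitions distinct.
class KeySketch {
    final String recordKey;
    final String partitionPath;

    KeySketch(String recordKey, String partitionPath) {
        this.recordKey = recordKey;
        this.partitionPath = partitionPath;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof KeySketch)) return false;
        KeySketch k = (KeySketch) o;
        return recordKey.equals(k.recordKey) && partitionPath.equals(k.partitionPath);
    }

    @Override public int hashCode() {
        return Objects.hash(recordKey, partitionPath);
    }
}

class TaggingSketch {
    // Tags each incoming key with the file id of an existing record that has the
    // same composite key; joining on record key alone would wrongly match "k1"
    // in partition "2019/02" against "k1" in partition "2019/01".
    static Map<KeySketch, String> tag(Map<KeySketch, String> existing, List<KeySketch> incoming) {
        Map<KeySketch, String> tagged = new HashMap<>();
        for (KeySketch key : incoming) {
            if (existing.containsKey(key)) {
                tagged.put(key, existing.get(key));
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        Map<KeySketch, String> existing = new HashMap<>();
        existing.put(new KeySketch("k1", "2019/01"), "file-A");

        List<KeySketch> incoming = Arrays.asList(
            new KeySketch("k1", "2019/01"),  // same HoodieKey -> tagged as update
            new KeySketch("k1", "2019/02")); // duplicate record key, other partition -> untagged insert

        System.out.println(tag(existing, incoming).size()); // prints 1
    }
}
```

A global index like GlobalBloom intentionally keeps the record-key-only join, since it must find a record regardless of which partition it currently lives in.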