- When writing to a new hoodie table, if the checkpoint dir is under the target path, Spark creates the base path first, so initializing .hoodie is skipped, which results in an error
- Apply the .hoodie existence check for all save modes
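
A minimal sketch of the idea behind this check, using the local filesystem only. The class and method names (`HoodieMetaCheck`, `isTableInitialized`) are hypothetical and are not taken from the actual change; the point is that the presence of the base path says nothing about whether `.hoodie` has been initialized.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class HoodieMetaCheck {
    // Metadata folder name, as referenced in the notes above.
    static final String META_FOLDER = ".hoodie";

    // Hypothetical helper: treat the table as initialized only if
    // <basePath>/.hoodie exists, regardless of whether basePath itself exists.
    static boolean isTableInitialized(Path basePath) {
        return Files.isDirectory(basePath.resolve(META_FOLDER));
    }

    public static void main(String[] args) throws IOException {
        Path basePath = Files.createTempDirectory("hoodie-demo");
        // Simulate the base path being created early for a checkpoint dir.
        Files.createDirectories(basePath.resolve("checkpoint"));

        System.out.println("base path exists: " + Files.exists(basePath));
        System.out.println("table initialized: " + isTableInitialized(basePath));
    }
}
```

Checking for the metadata folder rather than the base path gives the same answer in every save mode, even when something else has already created the base path.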
When running the docker demo, a NoSuchElementException was thrown because spark.scheduler.mode was not set.
Also, we want to do this check before initializing the Spark Context, to avoid polluting the SparkConf
with unused config.
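
A rough illustration of the failure mode and the guard, assuming a plain `Map` stands in for the Spark configuration. The names `SchedulerModeCheck`, `getStrict` and `isFairMode` are made up for this sketch, and what the caller does with the `FAIR` value is also an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

final class SchedulerModeCheck {
    static final String SCHEDULER_MODE = "spark.scheduler.mode";

    // Strict lookup: throws when the key is absent, mirroring the reported failure.
    static String getStrict(Map<String, String> conf, String key) {
        String value = conf.get(key);
        if (value == null) {
            throw new NoSuchElementException(key);
        }
        return value;
    }

    // Guarded lookup: checks for the key first, so an unset mode is simply "not FAIR".
    static boolean isFairMode(Map<String, String> conf) {
        return conf.containsKey(SCHEDULER_MODE)
            && "FAIR".equalsIgnoreCase(conf.get(SCHEDULER_MODE));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println("fair mode (unset): " + isFairMode(conf));
        conf.put(SCHEDULER_MODE, "FAIR");
        System.out.println("fair mode (set): " + isFairMode(conf));
    }
}
```

Running the guard on the raw settings before the context is built means no extra scheduler-related entries need to be added to the configuration just to make the lookup succeed.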
- Resolved the ambiguity in recordKey by creating a JSON object and parsing it as a string.
- Added unit test for ComplexKeyGenerator
- minor changes
- Eliminating in-memory deltaRecordsMap
- Use writerSchema to generate generic record needed by custom payloads
- Changes to make tests work with Hive 2.x
- Generalized BloomIndex to work with file ids instead of paths
- Abstracted away Bloom filter checking into HoodieLookupHandle
- Abstracted away range information retrieval into HoodieRangeInfoHandle
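
For the recordKey ambiguity mentioned earlier in this list, here is one way such ambiguity can arise and how a JSON-style string avoids it. This is only a sketch: the class `RecordKeyDemo`, the field names `id` and `region`, and the `_` separator are all assumptions for illustration, and this is not the ComplexKeyGenerator implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class RecordKeyDemo {
    // Naive composite key: field values joined by a separator.
    static String joinValues(Map<String, String> fields, String separator) {
        return String.join(separator, fields.values());
    }

    // Minimal escaping of backslash and double quote (control characters are not handled).
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // JSON-style composite key: field names and quoted values keep boundaries explicit.
    static String toJsonKey(Map<String, String> fields) {
        StringJoiner joiner = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            joiner.add("\"" + escape(e.getKey()) + "\":\"" + escape(e.getValue()) + "\"");
        }
        return joiner.toString();
    }

    // Builds a two-field record with a stable field order.
    static Map<String, String> record(String id, String region) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", id);
        fields.put("region", region);
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> first = record("a_b", "c");
        Map<String, String> second = record("a", "b_c");
        System.out.println(joinValues(first, "_") + " vs " + joinValues(second, "_"));
        System.out.println(toJsonKey(first) + " vs " + toJsonKey(second));
    }
}
```

The two sample records collide under plain joining because the separator also appears inside a value, while the JSON-style keys stay distinct.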
1. Went through the dependency list once to ensure compliance. Generated the current NOTICE list in all submodules (other Apache projects such as Flink do this).
To be on the conservative side regarding licensing, NOTICE.txt lists all dependencies, including transitive ones. Pending compliance questions are reported in https://issues.apache.org/jira/browse/LEGAL-461
2. Automated generation of NOTICE.txt files so that future package compliance issues can be identified early, as part of the code-review process.
3. Added NOTICE.txt and LICENSE.txt to all HUDI jars
JIRA link: https://issues.apache.org/jira/browse/HUDI-101
Issue link: https://github.com/apache/incubator-hudi/issues/516#issue-386048519
When using spark-shell with hoodie to save data, like:
```
./spark-shell --master yarn --jars /home/hdfs/software/spark/hoodie/hoodie-spark-bundle-0.4.8-SNAPSHOT.jar --conf spark.sql.hive.convertMetastoreParquet=false --packages com.databricks:spark-avro_2.11:4.0.0
```
and
```
// Imports needed by the write call below
import com.uber.hoodie.DataSourceWriteOptions
import com.uber.hoodie.common.model.HoodieTableType
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

inputDF.write.format("com.uber.hoodie")
  .option("hoodie.insert.shuffle.parallelism", "1") // any hoodie client config can be passed like this
  .option("hoodie.upsert.shuffle.parallelism", "1") // full list in HoodieWriteConfig & its package
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, HoodieTableType.COPY_ON_WRITE.name())
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) // insert
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "extend_deal_date")
  .option(HoodieWriteConfig.TABLE_NAME, "c_upload_code")
  .mode(SaveMode.Overwrite)
  .save("/tmp/test/hoodie")
```
It also reports the error `Invalid signature file digest for Manifest main attributes`. Need to scan all affected dependencies.
- hoodie-utilities can now be pulled in as compile time dependency
- Lets users test their DeltaStreamer transformers, for example
- Tested that the docker demo works and picks up the bundle
- Doc changes to follow, to move DeltaStreamer commands to bundle jar