When running the docker demo, a NoSuchElementException was thrown because spark.scheduler.mode was not set.
We also want to perform this check before initializing the Spark Context, to avoid polluting the SparkConf
with unused config.
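
The guard described above can be sketched roughly as follows. This is an illustration in Python, not the actual patch: a plain dict stands in for SparkConf, the helper name `apply_scheduler_config` is hypothetical, and treating `spark.scheduler.allocation.file` as the "unused config" is an assumption.

```python
def apply_scheduler_config(conf, allocation_file):
    """Add fair-scheduler settings only when fair scheduling was requested.

    `conf` is a plain dict standing in for SparkConf (assumption for this sketch).
    """
    # Reading with a default avoids the missing-key failure for an unset property.
    mode = conf.get("spark.scheduler.mode", "FIFO")
    if mode.upper() == "FAIR":
        # Only now is the allocation file relevant, so only now is it added.
        conf["spark.scheduler.allocation.file"] = allocation_file
    return conf
```

With an empty config the function returns it unchanged, so nothing extra leaks into the configuration when fair scheduling is not in use.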
- Eliminating in-memory deltaRecordsMap
- Use writerSchema to generate generic record needed by custom payloads
- changes to make tests work with hive 2.x
1. Go through the dependencies list once to ensure compliance. Generated the current NOTICE list in all submodules (other Apache projects such as Flink do this).
To be on the conservative side regarding licensing, NOTICE.txt lists all dependencies, including transitive ones. Pending compliance questions are reported in https://issues.apache.org/jira/browse/LEGAL-461
2. Automate generating NOTICE.txt files so that future package compliance issues can be identified early as part of the code-review process.
3. Added NOTICE.txt and LICENSE.txt to all HUDI jars
- hoodie-utilities can now be pulled in as compile time dependency
- Lets users test their DeltaStreamer transformers, for example
- Tested that the docker demo works and picks up the bundle
- Doc changes to follow, to move DeltaStreamer commands to bundle jar
1. Parquet Avro schema mismatch errors during ingestion were sometimes silently ignored due to a race condition in BoundedInMemoryExecutor. This was reproducible when running a long-running delta-streamer with the wrong schema, and it caused data loss
2. Fix behavior of Delta-Streamer to error out by default if there are any error records
3. Fix a bug in tracking write errors in WriteStats. Previously, the write error count reflected only the sampled errors rather than the total errors.
4. Delta Streamer did not commit the changes made by inline compaction because auto-commit was forcibly disabled. Fix this behavior to always auto-commit inline compaction, since it would otherwise never be committed.
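
The difference between sampled and total error counts, and the error-out-by-default behavior, can be sketched as below. This is a hedged illustration in Python rather than the Hudi code: `WriteErrorTracker`, `should_commit`, the sampling rate, and the `commit_on_errors` flag are all hypothetical names chosen for this example.

```python
import random


class WriteErrorTracker:
    """Counts every write error, while keeping only a sample of error details."""

    def __init__(self, sample_rate=0.1, seed=0):
        self.total_errors = 0      # incremented for every failed record
        self.sampled_errors = []   # bounded detail kept for debugging only
        self._rng = random.Random(seed)
        self._sample_rate = sample_rate

    def record(self, key, error):
        # The total is updated unconditionally; sampling affects only the details.
        self.total_errors += 1
        if self._rng.random() < self._sample_rate:
            self.sampled_errors.append((key, error))


def should_commit(tracker, commit_on_errors=False):
    """Refuse to commit by default when any record failed."""
    return tracker.total_errors == 0 or commit_on_errors
```

Basing the commit decision on `total_errors` rather than on the length of `sampled_errors` is the point of the sketch: a batch whose failures all happened to fall outside the sample would otherwise look clean.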
- Addresses an issue where insert would combine and remove duplicates within a batch
- Setting default insert combining to false (write client default)
- Set to true if filtering duplicates on insert/bulk_insert
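
A minimal sketch of what the combining flag changes, written in Python for illustration only. The function name `combine_on_insert`, the `(key, ordering_value, payload)` record shape, and the "largest ordering value wins" rule are assumptions made for this example, not a description of the actual write client.

```python
def combine_on_insert(records, combine=False):
    """Optionally collapse records that share a key within one batch.

    `records` is a list of (key, ordering_value, payload) tuples (assumed shape).
    """
    if not combine:
        # Default: plain inserts keep every record, duplicates included.
        return list(records)
    latest = {}
    for key, ordering, payload in records:
        # Keep the record with the largest ordering value per key (assumed rule).
        if key not in latest or ordering >= latest[key][1]:
            latest[key] = (key, ordering, payload)
    return list(latest.values())
```

For a batch `[("a", 1, "x"), ("a", 2, "y"), ("b", 1, "z")]`, the default keeps all three records, while `combine=True` keeps `("a", 2, "y")` and `("b", 1, "z")`.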
(1) Apply transformation when using delta-streamer to ingest data.
(2) Add Hudi Incremental Source for Delta Streamer
(3) Allow delta-streamer config-property to be passed as command-line
(4) Add Hive Integration to Delta-Streamer and address Review comments
(5) Ensure MultiPartKeysValueExtractor handles Hive-style partition descriptions
(6) Reuse same spark session on both source and transformer
(7) Support extracting partition fields from _hoodie_partition_path for HoodieIncrSource
(8) Reuse Binary Avro coders
(9) Add push down filter for Incremental source
(10) Add Hoodie DeltaStreamer metrics to track total time taken
- Check to ensure written files are listable on storage
- Docs updated to capture how this helps with S3 storage
- Unit tests added, corrections to existing tests
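
The listability check can be pictured as a bounded retry loop over a directory listing. The sketch below is illustrative Python under stated assumptions: the local filesystem stands in for the storage layer, and the function name `wait_until_listable`, the retry count, and the exponential backoff are hypothetical choices, not the actual implementation.

```python
import os
import time


def wait_until_listable(directory, expected_names, max_checks=5, delay_s=0.1):
    """Return True once every expected file shows up in a listing of `directory`."""
    expected = set(expected_names)
    for attempt in range(max_checks):
        missing = expected - set(os.listdir(directory))
        if not missing:
            return True
        # Back off before listing again; eventually consistent stores may lag.
        time.sleep(delay_s * (2 ** attempt))
    return False
```

A caller would run this after the write completes and before treating the write as successful, failing the operation if the function returns False.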
- Fix DeltaStreamer to manage archived commits in a separate folder
- Tests redone in the process
- Main changes are to RealtimeRecordReader and how it treats maps/arrays
- Make hive sync work with Hive 1/2 and CDH environments
- Fixes for corner cases in Hive queries
- Spark Hive integration - Working version across Apache and CDH versions
- Known Issue - https://github.com/uber/hudi/issues/439
- Standardize version of jackson
- DFSPropertiesConfiguration replaces usage of commons PropertiesConfiguration
- Remove dependency on ConstructorUtils
- Throw error if ordering value is not present, during key generation
- Switch to shade plugin for hoodie-utilities
- Added support for consuming data written with Confluent Avro Kafka serdes
- Support for Confluent schema registry
- KafkaSource now handles skew across partitions by allocating the source limit round-robin across them
- Added support for BULK_INSERT operations as well
- Pass in the payload class config properly into HoodieWriteClient
- Fix documentation based on new usage
- Adding tests on deltastreamer, sources and all new util classes.
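
The round-robin allocation of the source limit mentioned for KafkaSource can be sketched as follows. This is an illustrative Python version under assumptions, not the KafkaSource code: the function name `allocate_source_limit` and the input shape (a dict from partition id to the number of unread messages) are hypothetical.

```python
def allocate_source_limit(available, source_limit):
    """Split `source_limit` across partitions, redistributing unused shares.

    `available` maps partition id -> unread message count (assumed input shape).
    """
    allocated = {p: 0 for p in available}
    remaining = source_limit
    active = [p for p in sorted(available) if available[p] > 0]
    while remaining > 0 and active:
        # Give each partition that still has data an equal share of what is left.
        share = max(remaining // len(active), 1)
        next_active = []
        for p in active:
            if remaining == 0:
                break
            take = min(share, available[p] - allocated[p], remaining)
            allocated[p] += take
            remaining -= take
            if allocated[p] < available[p]:
                next_active.append(p)  # still has unread messages
        active = next_active
    return allocated
```

With `available = {0: 100, 1: 10, 2: 1000}` and a limit of 300, the small partitions are drained (100 and 10 messages) and the leftover budget goes to the large one (190), so a skewed partition neither starves the others nor leaves the limit unused.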