1. Use the DAG Node's label from the yaml as its name instead of UUID names which are not descriptive when debugging issues from logs.
2. Fix CleanNode constructor which is not correctly implemented
3. When generating upsets, allows more granualar control over the number of inserts and upserts - zero or more inserts and upserts can be specified instead of always requiring both inserts and upserts.
4. Fixed generation of records of specific size
- The current code was using a class variable "shouldAddMore" which was reset to false after the first record generation causing subsequent records to be of minimum size.
- In this change, we pre-calculate the extra size of the complex fields. When generating records, for complex fields we read the field size from this map.
5. Refresh the timeline of the DeltaSync service before calling readFromSource. This ensures that only the newest generated data is read and data generated in the older Dag Nodes is ignored (as their AVRO files will have an older timestamp).
6. Making --workload-generator-classname an optional parameter as most probably the default will be used
For Delete API, "hoodie.delete.shuffle.parallelism" isn't used as opposed to "hoodie.upsert.shuffle.parallelism" is used for upsert, this creates the performance difference between delete by upsert API with "EmptyHoodieRecordPayload" and delete API for certain cases.
This patch makes the following fixes in this regard.
- Let deduplicateKeys method use "hoodie.delete.shuffle.parallelism"
- Repartition inputRDD as "hoodie.delete.shuffle.parallelism" in case "hoodie.combine.before.delete=false"
- Generalize the hive-sync module for syncing to multiple metastores
- Added new options for datasource
- Added new command line for delta streamer
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
- [HUDI-418] Bootstrap Index Implementation using HFile with unit-test
- [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests
- [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices
- [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
- [HUDI-421] Bootstrap Write Client with tests
- [HUDI-425] Added HoodieDeltaStreamer support
- [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap
- [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly
- [HUDI-424] Simplify Record reader implementation
- [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
- [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables
Co-authored-by: Mehrotra <uditme@amazon.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
- Storage Type replaced with Table Type (remaining instances)
- View types replaced with query types;
- ReadOptimized view referred as Snapshot Query
- TableFileSystemView sub interfaces renamed to BaseFileOnly and Slice Views
- HoodieDataFile renamed to HoodieBaseFile
- Hive Sync tool will register RO tables for MOR with a `_ro` suffix
- Datasource/Deltastreamer options renamed accordingly
- Support fallback to old config values as well, so migration is painless
- Config for controlling _ro suffix addition
- Renaming DataFile to BaseFile across DTOs, HoodieFileSlice and AbstractTableFileSystemView
- Add spotless format fixing to project
- One time reformatting for conformity
- Build fails for formatting changes and mvn spotless:apply autofixes them
1. Remove LICENSE and NOTICE files in hoodie child modules.
2. Remove developers and contributor section from pom
3. Also ensure any failures in validation script is reported appropriately
4. Make hoodie parent pom consistent with that of its parent apache-21 (https://github.com/apache/maven-apache-parent/blob/apache-21/pom.xml)
1. Remove dnl utils jar from git
2. Add LICENSE Headers in missing files
3. Fix NOTICE and LICENSE in all HUDI packages and in top-level
4. Fix License wording in certain HUDI source files
5. Include non java/scala code in RAT licensing check
6. Use whitelist to include dependencies as part of timeline-server bundling
- [HUDI-172] Cleanup Maven POM/Classpath
- Fix ordering of dependencies in poms, to enable better resolution
- Idea is to place more specific ones at the top
- And place dependencies which use them below them
- [HUDI-68] : Automate demo steps on docker setup
- Move hive queries from hive cli to beeline
- Standardize on taking query input from text command files
- Deltastreamer ingest, also does hive sync in a single step
- Spark Incremental Query materialized as a derived Hive table using datasource
- Fix flakiness in HDFS spin up and output comparison
- Code cleanup around streamlining and loc reduction
- Also fixed pom to not shade some hive classs in spark, to enable hive sync
1. Go through dependencies list one round to ensure compliance. Generated current NOTICE list in all submodules (other apache projects like flink does this).
To be on conservative side regarding licensing, NOTICE.txt lists all dependencies including transitive. Pending Compliance questions reported in https://issues.apache.org/jira/browse/LEGAL-461
2. Automate generating NOTICE.txt files to allow future package compliance issues be identified early as part of code-review process.
3. Added NOTICE.txt and LICENSE.txt to all HUDI jars