* [HUDI-1181] Fix decimal type display issue for record key field
* Remove getNestedFieldVal method from DataSourceUtils
* resolve comments
Co-authored-by: Wenning Ding <wenningd@amazon.com>
For Delete API, "hoodie.delete.shuffle.parallelism" isn't used as opposed to "hoodie.upsert.shuffle.parallelism" is used for upsert, this creates the performance difference between delete by upsert API with "EmptyHoodieRecordPayload" and delete API for certain cases.
This patch makes the following fixes in this regard.
- Let deduplicateKeys method use "hoodie.delete.shuffle.parallelism"
- Repartition inputRDD as "hoodie.delete.shuffle.parallelism" in case "hoodie.combine.before.delete=false"
- Adding ability to use native spark row writing for bulk_insert
- Controlled by `ENABLE_ROW_WRITER_OPT_KEY` datasource write option
- Introduced KeyGeneratorInterface in hudi-client, moved KeyGenerator back to hudi-spark
- Simplified the new API additions to just two new methods : getRecordKey(row), getPartitionPath(row)
- Fixed all built-in key generators with new APIs
- Made the field position map lazily created upon the first call to row based apis
- Implemented native row based key generators for CustomKeyGenerator
- Fixed all the tests, with these new APIs
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
- This PR implements Spark Datasource for MOR table in the RDD approach.
- Implemented SnapshotRelation
- Implemented HudiMergeOnReadRDD
- Implemented separate Iterator to handle merge and unmerge record reader.
- Added TestMORDataSource to verify this feature.
- Clean up test file name, add tests for mixed query type tests
- We can now revert the change made in DefaultSource
Co-authored-by: Vinoth Chandar <vchandar@confluent.io>
- [HUDI-418] Bootstrap Index Implementation using HFile with unit-test
- [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests
- [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices
- [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
- [HUDI-421] Bootstrap Write Client with tests
- [HUDI-425] Added HoodieDeltaStreamer support
- [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap
- [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly
- [HUDI-424] Simplify Record reader implementation
- [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
- [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables
Co-authored-by: Mehrotra <uditme@amazon.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
- Consolidate transform functions for tests in Transformations.java
- Consolidate assertion functions for tests in Assertions.java
- Make use of SchemaTestUtil for loading schema from resource
* [HUDI-472] Introduce the configuration and new modes of record sorting for bulk_insert(#1149). Three sorting modes are implemented: global sort ("global_sort"), local sort inside each RDD partition ("partition_sort") and no sort ("none")
* adding check for table name for Append Save mode
* adding existing table validation for delete and upsert operation
Co-authored-by: Aakash Pradeep <apradeep@twilio.com>
This is to ensure that tests will execute all code paths, even the ones
written under DEBUG log levels. This will improve coverage as well as
ensure there are no surprised when DEBUG log level is enabled in
production.
- Brings more order and cohesion to the classes in hudi-common
- Utils classes related to a particular concept (avro, timeline,...) are placed near to the package
- common.fs package now contains all the filesystem level classes including wrapper filesystem
- bloom.filter package renamed to just bloom
- config package contains classes that help store properties
- common.fs.inline package contains all the inline filesystem classes/impl
- common.table.timeline now consolidates all timeline related classes
- common.table.view consolidates all the classes related to filesystem view metadata
- common.table.timeline.versioning contains all classes related to versioning of timeline
- Fix few unit tests as a result
- Moved the test packages around to match the source file move
- Rename AvroUtils to TimelineMetadataUtils & minor fixes/typos
- Storage Type replaced with Table Type (remaining instances)
- View types replaced with query types;
- ReadOptimized view referred as Snapshot Query
- TableFileSystemView sub interfaces renamed to BaseFileOnly and Slice Views
- HoodieDataFile renamed to HoodieBaseFile
- Hive Sync tool will register RO tables for MOR with a `_ro` suffix
- Datasource/Deltastreamer options renamed accordingly
- Support fallback to old config values as well, so migration is painless
- Config for controlling _ro suffix addition
- Renaming DataFile to BaseFile across DTOs, HoodieFileSlice and AbstractTableFileSystemView
- Upgrade Spark to 2.4.4, Parquet to 1.10.1, Avro to 1.8.2
- Remove spark-avro from hudi-spark-bundle. Users need to provide --packages org.apache.spark:spark-avro:2.4.4 when running spark-shell or spark-submit
- Replace com.databricks:spark-avro with org.apache.spark:spark-avro
- Shade avro in hudi-hadoop-mr-bundle to make sure it does not conflict with hive's avro version.