* Fix flaky MOR unit test
* Update Spark APIs to make it be compatible with both spark2 & spark3
* Refactor bulk insert v2 part to make Hudi be able to compile with Spark3
* Add spark3 profile to handle fasterxml & spark version
* Create hudi-spark-common module & refactor hudi-spark related modules
Co-authored-by: Wenning Ding <wenningd@amazon.com>
* [HUDI-1326] Added an API to force publish metrics and flush them.
Using the added API, publish metrics after each level of the DAG completed in hudi-test-suite.
* Code cleanups
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Remove APIs in `HoodieTestUtils`
- `createCommitFiles`
- `createDataFile`
- `createNewLogFile`
- `createCompactionRequest`
Migrated usages in `TestCleaner#testPendingCompactions`.
Also improved some API names in `HoodieTestTable`.
When the hbase index is used, when the record partition is changed to another partition, the path does not change according to the value of the partition column
Co-authored-by: huangjing <huangjing@clinbrain.com>
This helps in removing reporter once the test has complete. Prevents log pollution from un-necessary metric logs.
- Added an API to shutdown the metrics reporter after tests.
- This change breaks `hudi-client` into `hudi-client-common` and `hudi-spark-client` modules
- Simple usages of Spark using jsc.parallelize() has been redone using EngineContext#map, EngineContext#flatMap etc
- Code changes in the PR, break classes into `BaseXYZ` parent classes with no spark dependencies living in `hudi-client-common`
- Classes on `hudi-spark-client` are named `SparkXYZ` extending the parent classes with all the Spark dependencies
- To simplify/cleanup, HoodieIndex#fetchRecordLocation has been removed and its usages in tests replaced with alternatives
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Migrate deprecated APIs in HoodieTestUtils to HoodieTestTable for test classes
- TestClientRollback
- TestCopyOnWriteRollbackActionExecutor
Use FileCreateUtils APIs in CompactionTestUtils.
Then remove unused deprecated APIs after migration.
* [HUDI-995] Use HoodieTestTable in more classes
Migrate test data prep logic in
- TestStatsCommand
- TestHoodieROTablePathFilter
Re-implement methods for create new commit times in HoodieTestUtils and HoodieClientTestHarness
- Move relevant APIs to HoodieTestTable
- Migrate usages
After changing to HoodieTestTable APIs, removed unused deprecated APIs in HoodieTestUtils
- Introduce HoodieWriteableTestTable for writing records into files
- Migrate writeParquetFiles() in HoodieClientTestUtils to HoodieWriteableTestTable
- Adopt HoodieWrittableTestTable for test cases in
- ITTestRepairsCommand.java
- TestHoodieIndex.java
- TestHoodieKeyLocationFetchHandle.java
- TestHoodieGlobalBloomIndex.java
- TestHoodieBloomIndex.java
- Renamed HoodieTestTable and FileCreateUtils APIs
- dataFile changed to baseFile
For Delete API, "hoodie.delete.shuffle.parallelism" isn't used as opposed to "hoodie.upsert.shuffle.parallelism" is used for upsert, this creates the performance difference between delete by upsert API with "EmptyHoodieRecordPayload" and delete API for certain cases.
This patch makes the following fixes in this regard.
- Let deduplicateKeys method use "hoodie.delete.shuffle.parallelism"
- Repartition inputRDD as "hoodie.delete.shuffle.parallelism" in case "hoodie.combine.before.delete=false"
* [HUDI-960] Implementation of the HFile base and log file format.
1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat for both snapshot and realtime input format for Hive
3. Unit test for new code
4. IT for using HFile format and querying using Hive (Presto and SparkSQL are not supported)
Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)
HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to lookup records by key or search in a range of keys without having to read the entire data/log format.
Limitations:
HFile storage format has certain limitations when used as a general purpose data storage format.
1. Does not have a implemented reader for Presto and SparkSQL
2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on query side due to lack of column pruning
Other changes:
- Remove databricks/avro from pom
- Fix HoodieClientTestUtils from not using scala imports/reflection based conversion etc
- Breaking up limitFileSize(), per parquet and hfile base files
- Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
- Throw UnsupportedException in HFileReader.getRecordKeys()
- Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)
* Fixing checkstyle
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
- To determine insert bucket location for a given key, hudi walks through all insert buckets with O(N) cost, while this patch adds an optimization to make it O(logN).