1. Use the DAG Node's label from the yaml as its name instead of UUID names which are not descriptive when debugging issues from logs.
2. Fix CleanNode constructor which is not correctly implemented
3. When generating upsets, allows more granualar control over the number of inserts and upserts - zero or more inserts and upserts can be specified instead of always requiring both inserts and upserts.
4. Fixed generation of records of specific size
- The current code was using a class variable "shouldAddMore" which was reset to false after the first record generation causing subsequent records to be of minimum size.
- In this change, we pre-calculate the extra size of the complex fields. When generating records, for complex fields we read the field size from this map.
5. Refresh the timeline of the DeltaSync service before calling readFromSource. This ensures that only the newest generated data is read and data generated in the older Dag Nodes is ignored (as their AVRO files will have an older timestamp).
6. Making --workload-generator-classname an optional parameter as most probably the default will be used
This helps in removing reporter once the test has complete. Prevents log pollution from un-necessary metric logs.
- Added an API to shutdown the metrics reporter after tests.
- This change breaks `hudi-client` into `hudi-client-common` and `hudi-spark-client` modules
- Simple usages of Spark using jsc.parallelize() has been redone using EngineContext#map, EngineContext#flatMap etc
- Code changes in the PR, break classes into `BaseXYZ` parent classes with no spark dependencies living in `hudi-client-common`
- Classes on `hudi-spark-client` are named `SparkXYZ` extending the parent classes with all the Spark dependencies
- To simplify/cleanup, HoodieIndex#fetchRecordLocation has been removed and its usages in tests replaced with alternatives
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Migrate deprecated APIs in HoodieTestUtils to HoodieTestTable for test classes
- TestClientRollback
- TestCopyOnWriteRollbackActionExecutor
Use FileCreateUtils APIs in CompactionTestUtils.
Then remove unused deprecated APIs after migration.
* [HUDI-995] Use HoodieTestTable in more classes
Migrate test data prep logic in
- TestStatsCommand
- TestHoodieROTablePathFilter
Re-implement methods for create new commit times in HoodieTestUtils and HoodieClientTestHarness
- Move relevant APIs to HoodieTestTable
- Migrate usages
After changing to HoodieTestTable APIs, removed unused deprecated APIs in HoodieTestUtils
Add new Payload(OverwriteNonDefaultsWithLatestAvroPayload) for updating specified fields in storage
## Brief change log
update current value for several fields that you want to change.
The default payload OverwriteWithLatestAvroPayload overwrite the whole record when
compared to `orderingVal`.This doesn't meet our need when we just want to change specified fields.
For example: (suppose Default value is null)
```
current Value
Field: name age gender
Value: karl 20 male
```
```
insert Value
Field: name age gender
Value: null 30 null
```
```
After insert:
Field: name age gender
Value: karl 30 male
```
## Verify this pull request
Added TestOverwriteNonDefaultsWithLatestAvroPayload to verify the change.
* [HUDI-1181] Fix decimal type display issue for record key field
* Remove getNestedFieldVal method from DataSourceUtils
* resolve comments
Co-authored-by: Wenning Ding <wenningd@amazon.com>
When unit tests are run on shared machines (e.g. jenkins cluster), the unit tests sometimes fail due to BindException in starting HDFS Cluster. This is because the port chosen may have been bound by another process using the same machine. The fix is to retry the port selection a few times.
- Introduce HoodieWriteableTestTable for writing records into files
- Migrate writeParquetFiles() in HoodieClientTestUtils to HoodieWriteableTestTable
- Adopt HoodieWrittableTestTable for test cases in
- ITTestRepairsCommand.java
- TestHoodieIndex.java
- TestHoodieKeyLocationFetchHandle.java
- TestHoodieGlobalBloomIndex.java
- TestHoodieBloomIndex.java
- Renamed HoodieTestTable and FileCreateUtils APIs
- dataFile changed to baseFile
For Delete API, "hoodie.delete.shuffle.parallelism" isn't used as opposed to "hoodie.upsert.shuffle.parallelism" is used for upsert, this creates the performance difference between delete by upsert API with "EmptyHoodieRecordPayload" and delete API for certain cases.
This patch makes the following fixes in this regard.
- Let deduplicateKeys method use "hoodie.delete.shuffle.parallelism"
- Repartition inputRDD as "hoodie.delete.shuffle.parallelism" in case "hoodie.combine.before.delete=false"
* [HUDI-960] Implementation of the HFile base and log file format.
1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat for both snapshot and realtime input format for Hive
3. Unit test for new code
4. IT for using HFile format and querying using Hive (Presto and SparkSQL are not supported)
Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)
HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to lookup records by key or search in a range of keys without having to read the entire data/log format.
Limitations:
HFile storage format has certain limitations when used as a general purpose data storage format.
1. Does not have a implemented reader for Presto and SparkSQL
2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on query side due to lack of column pruning
Other changes:
- Remove databricks/avro from pom
- Fix HoodieClientTestUtils from not using scala imports/reflection based conversion etc
- Breaking up limitFileSize(), per parquet and hfile base files
- Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
- Throw UnsupportedException in HFileReader.getRecordKeys()
- Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)
* Fixing checkstyle
Co-authored-by: Vinoth Chandar <vinoth@apache.org>