When unit tests run on shared machines (e.g. a Jenkins cluster), they sometimes fail with a BindException while starting the HDFS cluster. This happens because the chosen port may already be bound by another process on the same machine. The fix is to retry the port selection a few times.
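A minimal sketch of the retry idea (the helper names are illustrative, not the actual patch):

```java
import java.io.IOException;
import java.net.BindException;
import java.net.ServerSocket;

// Sketch: if the chosen port gets grabbed by another process on the
// shared machine before the cluster binds it, pick a new port and retry.
public class RetryOnBindException {
  private static final int MAX_ATTEMPTS = 3;

  // Picks a free ephemeral port; it can still be taken by another process
  // between this call and the actual bind, hence the retry loop below.
  static int pickFreePort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    }
  }

  interface PortBoundStarter<T> {
    T start(int port) throws IOException; // stands in for MiniDFSCluster startup
  }

  static <T> T withRetries(PortBoundStarter<T> starter) throws IOException {
    for (int attempt = 1; ; attempt++) {
      try {
        return starter.start(pickFreePort());
      } catch (BindException e) {
        if (attempt >= MAX_ATTEMPTS) {
          throw e; // still colliding after a few tries; give up
        }
        // otherwise loop around and select a fresh port
      }
    }
  }
}
```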
- Introduce HoodieWriteableTestTable for writing records into files (a usage sketch follows this list)
- Migrate writeParquetFiles() in HoodieClientTestUtils to HoodieWriteableTestTable
- Adopt HoodieWriteableTestTable for test cases in:
- ITTestRepairsCommand.java
- TestHoodieIndex.java
- TestHoodieKeyLocationFetchHandle.java
- TestHoodieGlobalBloomIndex.java
- TestHoodieBloomIndex.java
- Renamed HoodieTestTable and FileCreateUtils APIs
- dataFile changed to baseFile
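A hypothetical usage sketch of the new test table; the fluent method names here are assumptions based on the bullets above, not the verbatim API:

```java
// Hypothetical sketch: set up base files for a test via the new test table
// instead of HoodieClientTestUtils.writeParquetFiles().
// 'metaClient', 'SCHEMA', and 'records' are assumed to exist in the test.
HoodieWriteableTestTable testTable = HoodieWriteableTestTable.of(metaClient, SCHEMA);
testTable.addCommit("001")
         .withInserts("2016/03/15", "file-id-1", records);
```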
The Delete API does not use "hoodie.delete.shuffle.parallelism", whereas upsert uses "hoodie.upsert.shuffle.parallelism"; in certain cases this creates a performance gap between deleting via the upsert API with "EmptyHoodieRecordPayload" and the Delete API.
This patch makes the following fixes in this regard.
- Let the deduplicateKeys() method use "hoodie.delete.shuffle.parallelism"
- Repartition the input RDD to "hoodie.delete.shuffle.parallelism" partitions when "hoodie.combine.before.delete=false" (both fixes are sketched below)
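A sketch of the combined intent, with the parallelism passed explicitly so the source of the value is visible (not the exact patch):

```java
import org.apache.hudi.common.model.HoodieKey;
import org.apache.spark.api.java.JavaRDD;

public class DeleteParallelismSketch {
  // Sketch: make the delete path honor "hoodie.delete.shuffle.parallelism"
  // the same way upsert honors "hoodie.upsert.shuffle.parallelism".
  static JavaRDD<HoodieKey> prepareKeysForDelete(JavaRDD<HoodieKey> keys,
                                                 int deleteParallelism,
                                                 boolean combineBeforeDelete) {
    return combineBeforeDelete
        // deduplicateKeys path: dedupe at the configured delete parallelism
        ? keys.distinct(deleteParallelism)
        // no combine step, so repartition explicitly; otherwise the delete
        // inherits whatever partitioning the input RDD happens to have
        : keys.repartition(deleteParallelism);
  }
}
```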
* [HUDI-960] Implementation of the HFile base and log file format.
1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat, covering both the snapshot and realtime input formats for Hive
3. Unit tests for the new code
4. Integration test writing the HFile format and querying through Hive (Presto and SparkSQL are not supported)
Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)
HFile allows efficient lookup of a record by key or by a range of keys. Hence, this base file format is well suited to applications like RFC-15 and RFC-08, which benefit from the ability to look up records by key, or to search within a range of keys, without reading the entire data/log file.
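A minimal sketch of the layout and a point lookup, assuming the HBase 1.x HFile API (Hudi's actual HoodieHFileWriter/HFileReader wiring differs):

```java
import java.util.Map;
import java.util.SortedMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.KeyValueUtil;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileContext;
import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;

public class HFileBaseFileSketch {

  // Write recordKey -> Avro-encoded-record pairs. HFile requires keys to
  // arrive in sorted order, which is why merges must emit sorted records.
  static void write(Configuration conf, FileSystem fs, Path path,
                    SortedMap<String, byte[]> records) throws Exception {
    HFileContext ctx = new HFileContextBuilder().withBlockSize(64 * 1024).build();
    try (HFile.Writer writer = HFile.getWriterFactory(conf, new CacheConfig(conf))
        .withPath(fs, path).withFileContext(ctx).create()) {
      for (Map.Entry<String, byte[]> e : records.entrySet()) {
        writer.append(new KeyValue(e.getKey().getBytes(), new byte[0], new byte[0],
            HConstants.LATEST_TIMESTAMP, e.getValue()));
      }
    }
  }

  // Point lookup: seek straight to a record key instead of scanning the file.
  static byte[] lookup(Configuration conf, FileSystem fs, Path path, String key)
      throws Exception {
    HFile.Reader reader = HFile.createReader(fs, path, new CacheConfig(conf), conf);
    reader.loadFileInfo();
    HFileScanner scanner = reader.getScanner(false, true); // no block cache, pread
    if (scanner.seekTo(KeyValueUtil.createFirstOnRow(key.getBytes())) == 0) {
      Cell cell = scanner.getKeyValue();
      return CellUtil.cloneValue(cell); // the Avro-encoded GenericRecord bytes
    }
    return null; // key not present in this file
  }
}
```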
Limitations:
HFile storage format has certain limitations when used as a general purpose data storage format.
1. Does not have an implemented reader for Presto and SparkSQL
2. Is not a columnar file format, and hence may yield lower compression and greater I/O on the query side due to the lack of column pruning
Other changes:
- Remove databricks/avro from pom
- Fix HoodieClientTestUtils to stop relying on Scala imports/reflection-based conversion, etc.
- Break up limitFileSize() per base file type (Parquet and HFile)
- Add three new configs to HoodieHFileConfig: prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
- Throw UnsupportedOperationException in HFileReader.getRecordKeys()
- Update HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile, HoodieMergeHandle otherwise)
* Fixing checkstyle
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
- To determine the insert bucket for a given key, Hudi walks through all insert buckets at O(N) cost; this patch optimizes the lookup to O(log N).
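A sketch of the idea (illustrative, not the patch's exact code): precompute cumulative bucket weights once, then replace the linear scan with a single TreeMap lookup.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: map a record's hash fraction in [0, 1) to an insert bucket in
// O(log N) via cumulative weights, instead of walking all buckets.
public class InsertBucketIndexSketch {

  private final TreeMap<Double, Integer> cumulativeWeightToBucket = new TreeMap<>();

  // weights.get(i) is bucket i's share of the inserts; they sum to ~1.0.
  public InsertBucketIndexSketch(List<Double> weights) {
    double cumulative = 0.0;
    for (int bucket = 0; bucket < weights.size(); bucket++) {
      cumulative += weights.get(bucket);
      // bucket i owns the range (previous cumulative, cumulative]
      cumulativeWeightToBucket.put(cumulative, bucket);
    }
  }

  public int bucketFor(double hashFraction) {
    Map.Entry<Double, Integer> entry = cumulativeWeightToBucket.ceilingEntry(hashFraction);
    // guard against floating-point rounding at the very top of the range
    return entry != null ? entry.getValue()
                         : cumulativeWeightToBucket.lastEntry().getValue();
  }
}
```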
- Added the ability to use native Spark row writing for bulk_insert
- Controlled by `ENABLE_ROW_WRITER_OPT_KEY` datasource write option
- Introduced KeyGeneratorInterface in hudi-client, moved KeyGenerator back to hudi-spark
- Simplified the new API additions to just two new methods: getRecordKey(row) and getPartitionPath(row) (see the sketch after this list)
- Fixed all built-in key generators to implement the new APIs
- Made the field position map lazily created upon the first call to the row-based APIs
- Implemented native row-based key generators for CustomKeyGenerator
- Fixed all the tests with these new APIs
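A sketch of the new surface area. The interface shape follows the bullets above (modifiers and generics in the actual hudi-client code may differ), and the option key shown is what ENABLE_ROW_WRITER_OPT_KEY is assumed to resolve to:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// The engine-facing interface reduced to the two new row-based methods
// (the existing Avro-record variants are elided here).
interface KeyGeneratorInterface {
  String getRecordKey(Row row);
  String getPartitionPath(Row row);
}

class RowWriterUsage {
  // Opting into native row writing for bulk_insert via the datasource option.
  static void bulkInsert(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.datasource.write.operation", "bulk_insert")
        .option("hoodie.datasource.write.row.writer.enable", "true") // assumed key
        .mode("append")
        .save(basePath);
  }
}
```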
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
The purpose of this pull request is to implement the changes required on the Hudi side to integrate bootstrapped tables with Presto. Testing was done against Presto 0.232, and the following changes were identified to make it work:
The UseRecordReaderFromInputFormat annotation is required on HoodieParquetInputFormat as well, because reads of bootstrapped tables need to go through the record reader to perform the merge. On the Presto side, this annotation is already handled.
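A sketch of the change; the annotation name comes from the description above, and its package is an assumption:

```java
import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
import org.apache.hudi.hadoop.UseRecordReaderFromInputFormat; // assumed package

// Sketch: annotating the input format signals Presto to route reads
// through its record reader, which is where the bootstrap merge happens.
@UseRecordReaderFromInputFormat
public class HoodieParquetInputFormat extends MapredParquetInputFormat {
  // existing implementation unchanged
}
```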
We need to maintain VIRTUAL_COLUMN_NAMES internally because Presto's bundled Hive version (hive-apache-1.2.2) defines VirtualColumn as a class, whereas the version Hudi depends on defines it as an enum.
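A sketch of the internally maintained set; the entries mirror Hive's standard virtual columns, though the exact set Hudi keeps is an assumption:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: keep the virtual column names as plain strings so the code works
// whether VirtualColumn is an enum (Hudi's Hive dependency) or a class
// (Presto's hive-apache-1.2.2).
public class VirtualColumnNames {
  public static final Set<String> VIRTUAL_COLUMN_NAMES = new HashSet<>(Arrays.asList(
      "INPUT__FILE__NAME",
      "BLOCK__OFFSET__INSIDE__FILE",
      "ROW__OFFSET__INSIDE__BLOCK",
      "RAW__DATA__SIZE",
      "GROUPING__ID",
      "ROW__ID"));
}
```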
Dependency changes in hudi-presto-bundle to avoid runtime exceptions.