NOTE: This change is first part of the series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them, implementations are consistent and performant
This PR is making sure that BaseFileOnlyViewRelation only reads projected columns as well as avoiding unnecessary serde from Row to InternalRow
Brief change log
- Introduced HoodieBaseRDD as a base for all custom RDD impls
- Extracted common fields/methods to HoodieBaseRelation
- Cleaned up and streamlined HoodieBaseFileViewOnlyRelation
- Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions
Rework of #4761
This diff introduces following changes:
- Write stats are converted to metadata index records during the commit. Making them use the HoodieData type so that the record generation scales up with needs.
- Metadata index init support for bloom filter and column stats partitions.
- When building the BloomFilter from the index records, using the type param stored in the payload instead of hardcoded type.
- Delta writes can change column ranges and the column stats index need to be properly updated with new ranges to be consistent with the table dataset. This fix add column stats index update support for the delta writes.
Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com>
* [HUDI-2763] Metadata table records - support for key deduplication and virtual keys
- The backing log format for the metadata table is HFile, a KeyValue type.
Since the key field in the metadata record payload is a duplicate of the
Key in the Cell, the redundant key field in the record can be emptied
to save on the cost.
- HoodieHFileWriter and HoodieHFileDataBlock will now serialize records
with the key field emptied by default. HFile writer tries to find if
the record has metadata payload schema field 'key' and if so it does
the key trimming from the record payload.
- HoodieHFileReader when reading the serialized records back from disk,
it materializes the missing keyFields if any. HFile reader tries to
find if the record has metadata payload schema fiels 'key' and if so
it does the key materialization in the record payload.
- Tests have been added to verify the default virtual keys and key
deduplication support for the metadata table records.
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
* Introduce hudi-spark3-common and hudi-spark2-common modules to place classes that would be reused in different spark versions, also introduce hudi-spark3.1.x to support spark 3.1.x.
* Introduce hudi format under hudi-spark2, hudi-spark3, hudi-spark3.1.x modules and change the hudi format in original hudi-spark module to hudi_v1 format.
* Manually tested on Spark 3.1.2 and Spark 3.2.0 SQL.
* Added a README.md file under hudi-spark-datasource module.
* [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator
Timestampbased key generator was returning diff values for row writer and non row writer path. this patch fixes it and is guarded by a config flag (`hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled`)