Refactoring Spark DataSource Relations to avoid code duplication.
The following Relations were in scope:
- BaseFileOnlyViewRelation
- MergeOnReadSnapshotRelation
- MergeOnReadIncrementalRelation
- Adopt HoodieData in Spark action commit executors
- Make DeleteHelper, WriteHelper, and MergeHelper in hudi-client-common independent of Spark
- Make HoodieTable in WriteClient APIs use a raw type, to decouple it from the Client's generic types
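The HoodieData abstraction adopted above can be illustrated with a minimal sketch. The `HoodieData`/`HoodieListData` names follow Hudi's naming, but the bodies here are simplified assumptions, not the actual implementation: an engine-agnostic collection type with a local in-memory implementation, so the same commit-executor code can run against a Spark RDD or a plain Java list.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Engine-agnostic collection abstraction (simplified sketch of Hudi's HoodieData).
interface HoodieData<T> {
    <R> HoodieData<R> map(Function<T, R> fn);
    List<T> collectAsList();
}

// Local in-memory implementation backed by a plain Java list; a Spark-backed
// implementation would wrap a JavaRDD instead, with the same interface.
final class HoodieListData<T> implements HoodieData<T> {
    private final List<T> data;

    HoodieListData(List<T> data) {
        this.data = data;
    }

    @Override
    public <R> HoodieData<R> map(Function<T, R> fn) {
        return new HoodieListData<>(data.stream().map(fn).collect(Collectors.toList()));
    }

    @Override
    public List<T> collectAsList() {
        return data;
    }
}

public class HoodieDataSketch {
    public static void main(String[] args) {
        HoodieData<Integer> writeStats = new HoodieListData<>(List.of(1, 2, 3));
        // Executor code written against HoodieData is oblivious to the engine.
        List<Integer> doubled = writeStats.map(x -> x * 2).collectAsList();
        System.out.println(doubled); // [2, 4, 6]
    }
}
```

The point of the interface is that commit executors in hudi-client-common can be written once against `HoodieData`, with the engine module choosing the concrete implementation.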
* [HUDI-3607] Support backend switch in HoodieFlinkStreamer
1. checkstyle fix
2. change the message
* Update CompactionHoodiePathCommand.scala
Fix NPE when running compaction scheduling via spark-sql if the number of commits is less than hoodie.compact.inline.max.delta.commits.
Fix IndexOutOfBoundsException when there's no compaction scheduled.
Fix CI issue.
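A guard of the shape used by such fixes can be sketched as follows. The method and parameter names here are illustrative, not the actual CompactionHoodiePathCommand code: return an empty result instead of dereferencing a missing schedule when fewer delta commits have accumulated than the configured threshold.

```java
import java.util.List;
import java.util.Optional;

public class CompactionScheduleGuard {
    // Hypothetical helper: only schedule compaction when enough delta commits
    // have accumulated; otherwise return empty instead of risking an NPE or
    // IndexOutOfBoundsException downstream when the schedule list is empty.
    static Optional<String> scheduleCompaction(List<String> deltaCommits, int maxDeltaCommits) {
        if (deltaCommits == null || deltaCommits.size() < maxDeltaCommits) {
            return Optional.empty(); // nothing to schedule yet
        }
        // Use the latest commit time as the (illustrative) compaction instant.
        return Optional.of(deltaCommits.get(deltaCommits.size() - 1));
    }

    public static void main(String[] args) {
        System.out.println(scheduleCompaction(List.of("001", "002"), 5));        // Optional.empty
        System.out.println(scheduleCompaction(List.of("001", "002", "003"), 3)); // Optional[003]
    }
}
```

Callers then branch on the `Optional` rather than indexing into a possibly-empty schedule list.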
Create new TypedProperties while performing clustering
Add OrderedProperties and minor refactoring
Add javadoc and remove getters from OrderedProperties
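An OrderedProperties of the kind described can be sketched as a `java.util.Properties` subclass that preserves insertion order. This is a common pattern and a sketch only; the actual Hudi class may differ in detail.

```java
import java.util.Collections;
import java.util.Enumeration;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

// Properties subclass that remembers the insertion order of keys, so that
// iteration (and serialization) is deterministic rather than hash-ordered.
class OrderedProperties extends Properties {
    private final Set<Object> orderedKeys = new LinkedHashSet<>();

    @Override
    public synchronized Object put(Object key, Object value) {
        orderedKeys.add(key);
        return super.put(key, value);
    }

    @Override
    public synchronized Object remove(Object key) {
        orderedKeys.remove(key);
        return super.remove(key);
    }

    @Override
    public synchronized Enumeration<Object> keys() {
        return Collections.enumeration(orderedKeys);
    }
}

public class OrderedPropertiesDemo {
    public static void main(String[] args) {
        OrderedProperties props = new OrderedProperties();
        props.put("hoodie.table.name", "trips");
        props.put("hoodie.base.path", "/tmp/trips");
        props.put("hoodie.compact.inline", "true");
        // Keys come back in the order they were inserted.
        Enumeration<Object> keys = props.keys();
        while (keys.hasMoreElements()) {
            System.out.println(keys.nextElement());
        }
    }
}
```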
NOTE: This change is the first part of a series to clean up Hudi's Spark DataSource related implementations, making sure there is minimal code duplication among them and that the implementations are consistent and performant.
This PR makes sure that BaseFileOnlyViewRelation reads only the projected columns, and avoids unnecessary serde from Row to InternalRow.
Brief change log
- Introduced HoodieBaseRDD as a base for all custom RDD implementations
- Extracted common fields/methods into HoodieBaseRelation
- Cleaned up and streamlined BaseFileOnlyViewRelation
- Fixed all of the Relations to avoid superfluous Row <-> InternalRow conversions
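The column-projection idea behind the change can be sketched engine-agnostically. The schema and record types below are plain Java stand-ins, not Spark's: resolve the required columns to ordinals once, then read only those fields per record instead of materializing the full row.

```java
import java.util.Arrays;
import java.util.List;

public class ColumnProjectionSketch {
    // Resolve required column names to ordinals in the full schema once,
    // so per-record reads touch only the projected fields.
    static int[] projectedOrdinals(List<String> fullSchema, List<String> requiredColumns) {
        return requiredColumns.stream().mapToInt(fullSchema::indexOf).toArray();
    }

    // Copy only the projected fields out of a full record.
    static Object[] project(Object[] fullRow, int[] ordinals) {
        return Arrays.stream(ordinals).mapToObj(i -> fullRow[i]).toArray();
    }

    public static void main(String[] args) {
        List<String> schema = List.of("key", "partition", "rider", "fare");
        int[] ordinals = projectedOrdinals(schema, List.of("key", "fare"));
        Object[] row = {"id-1", "2022/03/01", "rider-42", 19.5};
        System.out.println(Arrays.toString(project(row, ordinals))); // [id-1, 19.5]
    }
}
```

In the Spark case the same principle applies one layer down: the file reader is handed the pruned schema, so unprojected columns are never decoded at all.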
Desc: Add a Hive sync config (hoodie.datasource.hive_sync.sync_comment), which defaults to false.
When syncing a data source to Hudi, column comments are added to the source Avro schema; when sync_comment is true, those column comments are synced to the Hive table.
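The gating logic described can be sketched as follows. The flag mirrors the config above, but the method and types are illustrative stand-ins, not Hudi's HiveSyncTool API: comments travel with the schema, and are only propagated to the Hive table when the flag is on.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SyncCommentSketch {
    // Illustrative: propagate column comments from the source schema to the
    // Hive table metadata only when sync_comment is enabled.
    static Map<String, String> commentsToSync(Map<String, String> schemaComments, boolean syncComment) {
        if (!syncComment) {
            return Map.of(); // default (false): Hive columns get no comments
        }
        return new LinkedHashMap<>(schemaComments);
    }

    public static void main(String[] args) {
        Map<String, String> comments = Map.of("fare", "trip fare in USD");
        System.out.println(commentsToSync(comments, false)); // {}
        System.out.println(commentsToSync(comments, true));
    }
}
```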
Rework of #4761
This diff introduces the following changes:
- Write stats are converted to metadata index records during the commit; they now use the HoodieData type so that record generation scales with the workload.
- Metadata index initialization support for the bloom filter and column stats partitions.
- When building the BloomFilter from the index records, use the type parameter stored in the payload instead of a hardcoded type.
- Delta writes can change column ranges, and the column stats index needs to be updated with the new ranges to stay consistent with the table dataset. This fix adds column stats index update support for delta writes.
Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com>
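The third bullet above — reading the bloom filter type from the payload instead of hardcoding it — can be sketched like this. The enum values echo Hudi's bloom filter type codes, but the payload layout and method are illustrative assumptions:

```java
import java.util.Map;

public class BloomFilterTypeSketch {
    // Illustrative stand-ins for Hudi's bloom filter variants.
    enum BloomFilterTypeCode { SIMPLE, DYNAMIC_V0 }

    // Instead of hardcoding SIMPLE, read the type code persisted alongside
    // the serialized filter in the metadata index payload, falling back to
    // SIMPLE only when no type was stored.
    static BloomFilterTypeCode filterTypeFromPayload(Map<String, String> payload) {
        return BloomFilterTypeCode.valueOf(
            payload.getOrDefault("type", BloomFilterTypeCode.SIMPLE.name()));
    }

    public static void main(String[] args) {
        Map<String, String> payload = Map.of("type", "DYNAMIC_V0", "filter", "...serialized bits...");
        System.out.println(filterTypeFromPayload(payload)); // DYNAMIC_V0
    }
}
```

The deserializer then dispatches on the returned code to rebuild the correct filter variant, so tables written with a dynamic filter are no longer read back as if they used the simple one.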