- Added pure immutable test yamls to integ test framework. Added SparkBulkInsertNode as part of it.
- Added delete_partition support to integ test framework using spark-datasource.
- Added a single yaml to test all non core write operations (insert overwrite, insert overwrite table and delete partitions)
- Added tests for 4 concurrent spark datasource writers (multi-writer tests).
- Fixed readme w/ sample commands for multi-writer.
- Avoid using udf for key generator for SimpleKeyGen and NonPartitionedKeyGen.
- Fixed NonPartitioned Key generator to directly fetch record key from row rather than involving GenericRecord.
- Other minor fixes around using static values instead of looking up hashmap.
- Due to the fact that Spark 3.2.1 is non-BWC w/ 3.2.0, we have to handle all these incompatibilities in Spark32HoodieParquetFileFormat. This PR is addressing that.
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
This PR fixes the projection logic around a nested field which is used as the pre-combined key field. The fix is to only check and append the root level field for projection, i.e., "a", for a nested field "a.b.c" in the mandatory columns.
- Changes the logic to check and append the root level field for a required nested field in the mandatory columns in HoodieBaseRelation.appendMandatoryColumns
- when columns names are renamed (schema evolution enabled), while copying records from old data file with HoodieMergeHande, renamed columns wasn't handled well.
- Scaffolded `Spark24HoodieParquetFileFormat` extending `ParquetFileFormat` and overriding the behavior of adding partition columns to every row
- Amended `SparkAdapter`s `createHoodieParquetFileFormat` API to be able to configure whether to append partition values or not
- Fallback to append partition values in cases when the source columns are not persisted in data-file
- Fixing HoodieBaseRelation incorrectly handling mandatory columns
Make truncate partition and drop partition behave as drop partition with purge, which delete all records via Hudi DELETE_PARTITION; partition removed from metastore
Currently, Data Skipping is not handling correctly the case when column-stats are not aligned and, for ex, some of the (column, file) combinations are missing from the CSI.
This could occur in different scenarios (schema evolution, CSI config changes), and has to be handled properly when we're composing CSI projection for Data Skipping. This PR addresses that.
- Added appropriate aligning for the transposed CSI projection
Addressing the problem of Data Skipping not respecting Metadata Table configs which might differ b/w write/read paths. More details could be found in HUDI-3812.
- Fixing Data Skipping configuration to respect MT configs (on the Read path)
- Tightening up DS handling of cases when no top-level columns are in the target query
- Enhancing tests to cover all possible case