- Avoid using a UDF for key generation in SimpleKeyGen and NonPartitionedKeyGen.
- Fixed the NonPartitioned key generator to fetch the record key directly from the Row rather than going through GenericRecord (see the sketch after this list).
- Other minor fixes, such as using static values instead of repeated hashmap lookups.
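A minimal sketch of the Row-based key fetch from the second bullet; the class, method, and field position here are illustrative, not Hudi's actual key generator code:

```java
import org.apache.spark.sql.Row;

// Illustrative only: read the record key straight off the Row instead of
// first converting the Row into an Avro GenericRecord on the write hot path.
public class RowKeyFetcher {
  private final int recordKeyFieldPos; // resolved once from the schema, not per record

  public RowKeyFetcher(int recordKeyFieldPos) {
    this.recordKeyFieldPos = recordKeyFieldPos;
  }

  public String getRecordKey(Row row) {
    // positional access avoids the Row -> GenericRecord conversion entirely
    return row.getString(recordKeyFieldPos);
  }
}
```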
This PR fixes the projection logic around a nested field that is used as the precombine key field. The fix is to check and append only the root-level field to the mandatory columns for projection, i.e., "a" for a nested field "a.b.c".
- Changes HoodieBaseRelation.appendMandatoryColumns to check and append only the root-level field of a required nested field to the mandatory columns
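A minimal sketch of the root-field extraction, assuming the nested field name arrives as a dot-separated path:

```java
// Illustrative only: "a.b.c" -> "a"; appending just the root struct "a" to
// the mandatory columns lets the reader project the whole nested struct.
class RootFieldSketch {
  static String rootLevelField(String nestedFieldPath) {
    int firstDot = nestedFieldPath.indexOf('.');
    // a field without dots is already a root-level field
    return firstDot < 0 ? nestedFieldPath : nestedFieldPath.substring(0, firstDot);
  }
}
```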
- When column names are renamed (schema evolution enabled), renamed columns weren't handled correctly while copying records from the old data file with HoodieMergeHandle.
- Scaffolded `Spark24HoodieParquetFileFormat`, extending `ParquetFileFormat` and overriding the behavior that adds partition columns to every row
- Amended `SparkAdapter`'s `createHoodieParquetFileFormat` API so that it can be configured whether or not to append partition values
- Fall back to appending partition values in cases where the source columns are not persisted in the data file
- Fixing HoodieBaseRelation incorrectly handling mandatory columns
* Fixing incorrect selection of MT partitions to be updated
* Ensure that metadata partitions table config is inherited correctly
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Fixing performance hits in reading Column Stats Index:
[HUDI-3834] There's substantial performance degradation in Avro 1.10's default generated Builder classes: by default they rely on SpecificData.getForSchema, which loads the corresponding model's class using reflection and takes a hit when executed on the hot path (this was bringing the overall runtime to read a full Column Stats Index of 800k records to 60s, whereas now it takes a mere 3s).
Addressing memory churn from overuse of Hadoop's Path creation: the Path constructor is not a lightweight sequence and produces quite a bit of memory churn, adding pressure on the GC. Cleaned up such avoidable allocations to make sure there's no unnecessary added GC pressure (see the sketch below).
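A minimal sketch of the kind of allocation pattern being cleaned up (the names here are illustrative):

```java
import java.util.List;
import org.apache.hadoop.fs.Path;

// Path's constructor parses and normalizes its input, so rebuilding full
// paths from strings per file on a hot path churns memory and pressures GC.
class PathReuseSketch {
  static void resolveFiles(String basePath, String partition, List<String> fileNames) {
    Path partitionPath = new Path(basePath, partition); // construct the parent once
    for (String fileName : fileNames) {
      // reusing the already-parsed parent avoids re-parsing the base path per file
      Path filePath = new Path(partitionPath, fileName);
      System.out.println(filePath);
    }
  }
}
```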
* Filter out the empty string (for a non-partitioned table) being added to the "__all_partitions__" record
* Instead of filtering, transform the empty partition-id to `NON_PARTITIONED_NAME` (see the sketch after this list)
* Cleaned up `HoodieBackedTableMetadataWriter`
* Make sure REPLACE_COMMITS are handled as well
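A hedged sketch of the transform; the sentinel value is passed in here because its exact definition lives in Hudi's metadata constants:

```java
class PartitionIdSketch {
  static String toPartitionId(String relativePartitionPath, String nonPartitionedName) {
    // a non-partitioned table yields an empty relative path; map it to the
    // sentinel instead of dropping it from the "__all_partitions__" record
    return relativePartitionPath.isEmpty() ? nonPartitionedName : relativePartitionPath;
  }
}
```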
* Depend on FSUtils#getRelativePartitionPath(basePath, logFilePath.getParent) to get the partition (see the sketch below).
* If the list of log file paths in the split is empty, fall back to the usual behaviour.
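A minimal sketch of the derivation, assuming the `FSUtils` import path from Hudi's common module:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hudi.common.fs.FSUtils;

// The partition is the log file's parent directory, taken relative to the
// table base path.
class LogFilePartitionResolver {
  static String partitionOf(Path basePath, Path logFilePath) {
    return FSUtils.getRelativePartitionPath(basePath, logFilePath.getParent());
  }
}
```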
- Adding the capability to fetch Metadata Records by key prefix, so that Data Skipping can fetch only the Column Stats Index records pertaining to the columns being queried, instead of reading out the whole Index (see the sketch after the change log below).
- Fixed usages of HFileScanner in HFileReader: a few code paths use the cached scanner if available, while other code paths use their own HFileScanner with positional reads.
Brief change log
- Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading through the Spark DataSource
- Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader
- Wiring key-prefix lookups through LogRecordScanner impls
- Cleaning up the HoodieHFileReader impl
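A hedged sketch of the key-prefix lookup semantics only; the actual change wires this through HFileScanner, but a sorted map stands in for the HFile here since both keep keys in sorted order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PrefixLookupSketch {
  static List<String> lookupByKeyPrefix(TreeMap<String, String> sortedRecords, String prefix) {
    List<String> hits = new ArrayList<>();
    // seek to the first key >= prefix, then scan forward while keys match
    for (Map.Entry<String, String> e : sortedRecords.tailMap(prefix, true).entrySet()) {
      if (!e.getKey().startsWith(prefix)) {
        break; // keys are sorted, so the first non-matching key ends the scan
      }
      hits.add(e.getValue());
    }
    return hits;
  }
}
```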
Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
* [HUDI-3290] Different file formats for the partition metadata file.
Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non-Parquet/ORC files are found in a partition.
Added a new table config 'hoodie.partition.metafile.use.data.format' which, when enabled (default false for backward compatibility), ensures that partition metafiles are saved in the same format as the base files of a dataset (illustrated below).
For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter, --partition-metafile-use-data-format, which creates a table with this setting.
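For reference, a hedged illustration of what enabling the new table config would look like among the table's properties (the default is false):

```properties
hoodie.partition.metafile.use.data.format=true
```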
* Code review comments
- Adding a new command to migrate the metafile from the text format to the base file format.
- Reimplementing readFromFS() to first try the text format, then the base format (see the sketch after this list)
- Avoid extra exists() checks in readFromFS()
- Added unit tests, enabled the parquet format across hudi-hadoop-mr
- Code cleanup, restructuring, naming consistency.
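A minimal sketch of the resulting read order inside readFromFS(), with the two format-specific readers stubbed out under hypothetical names:

```java
import java.io.IOException;

abstract class PartitionMetafileReaderSketch {
  // each reader returns true on success instead of requiring a prior
  // exists() probe, which is how the extra filesystem round-trip is avoided
  abstract boolean readTextFormatMetaFile();
  abstract boolean readBaseFormatMetaFile();

  void readFromFS() throws IOException {
    // legacy text (.properties) metafile first, then the base-file format
    if (!readTextFormatMetaFile() && !readBaseFormatMetaFile()) {
      throw new IOException("No partition metafile found in either format");
    }
  }
}
```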
* Wiring in all the other Spark code paths to respect this config
- Turned on parquet meta format for COW data source tests
- Removed the deltastreamer command-line parameter to keep it shorter
* populate HoodiePartitionMetadata#format after readFromFS()
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
* Add checks for metadata table init to avoid possible out-of-sync states
* Revise the logic to reuse existing table config
* Revise docs and naming
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
* [HUDI-2560] Introduce id_based schema to support full schema evolution.
* add tests for FileBasedInternalSchemaStorageManager and rebase code
* add support for changing column types and fix some test cases
* fix some bugs encountered in the production env and delete useless code
* fix test error
* rebase code
* fixed some nested schema change bugs
* [HUDI-2429][Stacked on HUDI-2560] Support full schema evolution for Spark
* use dummyInternalSchema instead of null
* add support for spark3.1.x
* remove support for spark3.1.x, since some compilations fail
* support spark3.1.x
* rebase and prepare to address all comments
* address all comments
* rebase code
* fixed the count(*) bug
* try to get the internalSchema by parsing the commit file/history file directly, instead of using the metaclient, which is time-costly
address some comments
* fixed all comments
* fix new comments
* rebase code, fix failed UTs
* fixed a mistake
* rebase code, fixed new comments
* rebase code, and prepare to address new comments
* address comments
* address new comments
* fix new issues
* control fallback to the original write logic
To drop any metadata partition (index), we can reuse the DELETE_PARTITION operation in the metadata table. Subsequently, we can support dropping an index (with a table config update) for the async metadata indexer.
- Add a new API in HoodieTableMetadataWriter
- Currently only supported for the Spark metadata writer
- Add a new action called INDEX, whose state transition is described in the RFC.
- Changes in timeline to support the new action.
- Add an index planner in ScheduleIndexActionExecutor.
- Add index plan executor in RunIndexActionExecutor.
- Add 3 APIs in HoodieTableMetadataWriter (see the interface sketch after this list): a) scheduleIndex: generates an index plan based on the latest completed instant, initializes file groups, and adds a requested INDEX instant; b) index: executes the index plan and also takes care of writes that happened after indexing was requested; c) dropIndex: drops an index by removing the given metadata partition.
- Add 2 new table configs to serve as the source of truth for inflight and completed indexes.
- Support upgrade/downgrade taking care of the newly added configs.
- Add tool to trigger indexing in HoodieIndexer.
- Handle corner cases related to partial failures.
- Abort gracefully after deleting partition and instant.
- Handle other actions in the timeline that need to be considered before catching up
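A hedged sketch of the shape of the three APIs; signatures are simplified (the real methods carry engine contexts, plans, and configs):

```java
import java.util.List;
import org.apache.hudi.common.util.Option;

interface MetadataIndexingSketch {
  // Generate an index plan off the latest completed instant, initialize the
  // metadata file groups, and add a requested INDEX instant to the timeline.
  Option<String> scheduleIndex(List<String> partitionsToIndex);

  // Execute the index plan, then catch up on writes that happened after
  // indexing was requested.
  void index(String indexInstantTime);

  // Drop an index by deleting the given metadata partition(s), reusing the
  // DELETE_PARTITION operation, and update the table configs accordingly.
  void dropIndex(List<String> partitionsToDrop);
}
```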
As of now, delete-partitions ensures all file groups are deleted, but the partition as such is not deleted. So, fetching all partitions might return the deleted partitions as well, although no data will be served since all file groups are deleted. This patch fixes that by letting the cleaner take care of deleting a partition once all file groups pertaining to it are deleted.
- Fixed CleanPlanActionExecutor to return meta info about the list of partitions to be deleted; if there are no valid file groups for a partition, the clean planner includes that partition for deletion (see the sketch after this list)
- Fixed the HoodieCleanPlan Avro schema to include the list of partitions to be deleted
- Fixed CleanActionExecutor to delete partitions, if any, as per the clean plan
- The same info is added to HoodieCleanMetadata
- When applying clean metadata, the metadata table checks for partitions to be deleted and updates the "all_partitions" record for the deleted partitions.
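A hedged sketch of the planner-side selection; `fileGroupCount` is a hypothetical stand-in for the file-system view lookup:

```java
import java.util.List;
import java.util.function.ToLongFunction;
import java.util.stream.Collectors;

class CleanPlanPartitionSelectionSketch {
  // any partition left with no valid file groups is added to the clean
  // plan's partitions-to-delete list
  static List<String> partitionsToDelete(List<String> partitions,
                                         ToLongFunction<String> fileGroupCount) {
    return partitions.stream()
        .filter(p -> fileGroupCount.applyAsLong(p) == 0L)
        .collect(Collectors.toList());
  }
}
```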
Co-authored-by: sivabalan <n.siva.b@gmail.com>