* Filter out the empty string (for non-partitioned tables) being added to the "__all_partitions__" record
* Instead of filtering, transform the empty partition-id to `NON_PARTITIONED_NAME`
* Cleaned up `HoodieBackedTableMetadataWriter`
* Make sure REPLACE_COMMITS are handled as well
* Depend on FSUtils#getRelativePartitionPath(basePath, logFilePath.getParent) to get the partition.
* If the list of log file paths in the split is empty, then fall back to the usual behaviour.
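The derivation above can be pictured with plain `java.nio.file` paths. This is a minimal, self-contained sketch of the idea behind FSUtils#getRelativePartitionPath (which operates on Hadoop Paths, not java.nio); the class name and sample paths here are made up for illustration:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PartitionPathSketch {
    // Hypothetical stand-in for FSUtils#getRelativePartitionPath: the partition
    // path is the log file's parent directory, relativized against the table
    // base path.
    static String relativePartitionPath(Path basePath, Path logFilePath) {
        return basePath.relativize(logFilePath.getParent()).toString();
    }

    public static void main(String[] args) {
        Path base = Paths.get("/data/hudi/trips");
        Path log = Paths.get("/data/hudi/trips/2022/01/15/.abc123.log.1_0-1-0");
        // Prints the partition relative to the table base path: 2022/01/15
        System.out.println(relativePartitionPath(base, log));
    }
}
```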
- Adding the capability to fetch Metadata Records by key prefix, so that Data Skipping can fetch only the Column Stats Index records pertaining to the columns being queried, instead of reading out the whole Index.
- Fixed usages of HFileScanner in HFileReader: a few code paths use the cached scanner if available; other code paths use their own HFileScanner with positional reads.
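Since HFile keys are sorted, a key-prefix lookup is effectively a contiguous range scan. The following is a conceptual sketch only, using a TreeMap to stand in for the sorted key space (it is not the HFileReader API; the key layout shown is illustrative):

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class PrefixLookupSketch {
    // Conceptual sketch: all records whose key starts with `prefix` form the
    // contiguous range [prefix, prefix + '\uffff') in sorted-key order, so a
    // prefix lookup never has to touch records outside that range.
    static SortedMap<String, String> byKeyPrefix(TreeMap<String, String> records, String prefix) {
        return records.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        TreeMap<String, String> colStats = new TreeMap<>();
        colStats.put("col_a.file1", "stats-a1");
        colStats.put("col_a.file2", "stats-a2");
        colStats.put("col_b.file1", "stats-b1");
        // Fetch only the records for col_a, skipping everything else.
        System.out.println(byKeyPrefix(colStats, "col_a").keySet());
    }
}
```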
Brief change log
- Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata instead of reading through the Spark DataSource
- Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader
- Wiring key-prefix lookups through the LogRecordScanner impls
- Cleaning up the HoodieHFileReader impl
Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
* [HUDI-3290] Different file formats for the partition metadata file.
Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non-Parquet/ORC files are found in a partition.
Added a new table config 'hoodie.partition.metafile.use.data.format' which, when enabled (default false for backward compatibility), ensures that partition metafiles are saved in the same format as the base files of the dataset.
For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter, --partition-metafile-use-data-format, which creates the table with this setting.
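For reference, opting in looks like the following (the config key and default come from this change; the comment is paraphrase):

```properties
# Save partition metafiles in the table's base file format (e.g. Parquet/ORC)
# instead of the default properties text format. Default: false.
hoodie.partition.metafile.use.data.format=true
```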
* Code review comments
- Adding a new command to migrate the partition metafile from text format to the base file format.
- Reimplementing readFromFS() to first read the text format, then the base format
- Avoid extra exists() checks in readFromFS()
- Added unit tests, enabled parquet format across hoodie-hadoop-mr
- Code cleanup, restructuring, naming consistency.
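The readFromFS() ordering above is a try-then-fallback read. A minimal sketch of that shape, assuming hypothetical reader callbacks (the Supplier arguments stand in for the real text-format and base-format I/O; this is not the Hudi API):

```java
import java.util.Optional;
import java.util.Properties;
import java.util.function.Supplier;

public class MetafileReadSketch {
    // Conceptual sketch: try the legacy text (properties) metafile first, and
    // only if it is absent consult the base-file-format reader.
    static Optional<Properties> readMetafile(Supplier<Optional<Properties>> textReader,
                                             Supplier<Optional<Properties>> baseFormatReader) {
        Optional<Properties> props = textReader.get();
        return props.isPresent() ? props : baseFormatReader.get();
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("commitTime", "000");
        // Text metafile missing -> the base-format reader is consulted.
        Optional<Properties> out = readMetafile(Optional::empty, () -> Optional.of(p));
        System.out.println(out.get().getProperty("commitTime")); // 000
    }
}
```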
* Wiring in all the other Spark code paths to respect this config
- Turned on the Parquet meta format for COW data source tests
- Removed the deltastreamer command-line option to keep it shorter
* Populate HoodiePartitionMetadata#format after readFromFS()
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
* Add checks for metadata table init to avoid possible out-of-sync state
* Revise the logic to reuse existing table config
* Revise docs and naming
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
* [HUDI-2560] Introduce ID-based schema to support full schema evolution.
* add tests for FileBasedInternalSchemaStorageManager and rebase code
* add support for changing column types and fix some test cases
* fix some bugs encountered in the production env and delete unused code
* fix test errors
* rebase code
* fixed some nested schema-change bugs
* [HUDI-2429][Stacked on HUDI-2560] Support full schema evolution for Spark
* use dummyInternalSchema instead of null
* add support for Spark 3.1.x
* remove support for Spark 3.1.x, since some compilation failures occurred
* support Spark 3.1.x
* rebase and prepare to address all comments
* address all comments
* rebase code
* fixed the count(*) bug
* try to get the internalSchema by parsing the commit file/history file directly, instead of using the metaclient, which is time-consuming
address some comments
* fixed all comments
* fix new comments
* rebase code, fix UT failures
* fixed a mistake
* rebase code, fix new comments
* rebase code, and prepare to address new comments
* address comments
* address new comments
* fix new issues
* control fallback to the original write logic
In order to drop any metadata partition (index), we can reuse the DELETE_PARTITION operation in the metadata table. Subsequently, we can support drop index (with a table config update) for the async metadata indexer.
- Add a new API in HoodieTableMetadataWriter
- Currently only supported for the Spark metadata writer
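The drop-index flow above can be modeled as two steps. This is an illustrative model only, not the Hudi API (the method and set names are made up): delete the index's partition in the metadata table, then remove it from the table config's list of completed metadata partitions.

```java
import java.util.HashSet;
import java.util.Set;

public class DropIndexSketch {
    // Conceptual model: dropping an index = DELETE_PARTITION on its partition
    // in the metadata table + removing it from the table config's completed
    // metadata partitions, so readers stop consulting it.
    static void dropMetadataPartition(Set<String> metadataTablePartitions,
                                      Set<String> completedInTableConfig,
                                      String partition) {
        metadataTablePartitions.remove(partition); // DELETE_PARTITION in metadata table
        completedInTableConfig.remove(partition);  // subsequent table config update
    }

    public static void main(String[] args) {
        Set<String> mdtPartitions = new HashSet<>(Set.of("files", "column_stats", "bloom_filters"));
        Set<String> tableConfig = new HashSet<>(mdtPartitions);
        dropMetadataPartition(mdtPartitions, tableConfig, "column_stats");
        System.out.println(mdtPartitions.contains("column_stats")); // false
    }
}
```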