On table initialization, the timeline refresh invokes the fs view #sync, which currently performs two actions:
1. reload the timeline of the fs view, so that the next fs view request is based on this timeline metadata
2. if this is a local fs view, clear all local state; if this is a remote fs view, send a request to sync the remote fs view
However, at construction time the meta client is freshly instantiated, so its timeline is already the latest,
and the table is also freshly constructed, so the fs view has no local state. The #sync is therefore entirely unnecessary.
In this patch, the lifecycles of the metadata and the data set fs view are kept in sync: when the fs view is refreshed, the underlying metadata
is refreshed synchronously as well. The freshness of the metadata follows the same rules as the data fs view:
1. if the fs view is local, visibility is based on the latest commit of the client's table meta client
2. if the fs view is remote, the timeline server #syncs the fs view and the metadata together, based on the (possibly lagging) server-local timeline
From the client's perspective, there is no longer any need to care about the refresh action, whether or not the metadata table is enabled.
That makes the client logic clearer and less error-prone.
Removing the timeline refresh has another benefit: it avoids unnecessary #refresh of the remote fs view. If all the clients send requests to #sync the
remote fs view, the server encounters conflicts and the clients receive response errors.
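
The reasoning above can be illustrated with a minimal sketch of the local fs view case. All class and method names here (FsViewSketch, TimelineSketch, LocalFsViewSketch) are hypothetical stand-ins, not Hudi's actual classes, and the remote case is omitted. The sketch only shows why #sync on a freshly constructed view changes nothing: the timeline snapshot is already current and there is no cached state to clear.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is not Hudi's actual API.
interface FsViewSketch {
    void sync();
}

// Stand-in for a timeline: an ordered list of commit instants.
final class TimelineSketch {
    private final List<String> instants = new ArrayList<>();

    void addInstant(String instant) {
        instants.add(instant);
    }

    List<String> snapshot() {
        return new ArrayList<>(instants);
    }
}

// Stand-in for a local fs view holding a timeline snapshot and cached state.
final class LocalFsViewSketch implements FsViewSketch {
    private final TimelineSketch source;
    private List<String> visibleInstants;
    private final Map<String, List<String>> cachedFileGroups = new HashMap<>();

    LocalFsViewSketch(TimelineSketch source) {
        this.source = source;
        // A freshly constructed view already sees the latest timeline.
        this.visibleInstants = source.snapshot();
    }

    void cachePartition(String partition, List<String> files) {
        cachedFileGroups.put(partition, files);
    }

    boolean hasLocalState() {
        return !cachedFileGroups.isEmpty();
    }

    List<String> visibleInstants() {
        return visibleInstants;
    }

    @Override
    public void sync() {
        visibleInstants = source.snapshot(); // action 1: reload the timeline
        cachedFileGroups.clear();            // action 2: clear local state
    }
}
```

In this sketch, calling sync() immediately after construction leaves both the visible instants and the (empty) cache unchanged, which is the situation described for table initialization.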
Rework of #4761
This diff introduces the following changes:
- Write stats are converted to metadata index records during the commit. They now use the HoodieData type so that record generation scales with need.
- Metadata index init support for bloom filter and column stats partitions.
- When building the BloomFilter from the index records, use the type param stored in the payload instead of a hardcoded type.
- Delta writes can change column ranges, and the column stats index needs to be updated with the new ranges to stay consistent with the table dataset. This fix adds column stats index update support for delta writes.
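
As an illustration of the last point, a delta write can be thought of as widening an existing column range. The sketch below is an assumption about the general idea only: ColumnRangeSketch and mergeWith are hypothetical names, the values are simplified to long, and the actual merge logic in this change may differ.

```java
// Hypothetical stand-in for a per-file column range; not Hudi's HoodieColumnRangeMetadata.
final class ColumnRangeSketch {
    final String columnName;
    final long minValue;
    final long maxValue;

    ColumnRangeSketch(String columnName, long minValue, long maxValue) {
        if (minValue > maxValue) {
            throw new IllegalArgumentException("min must not exceed max");
        }
        this.columnName = columnName;
        this.minValue = minValue;
        this.maxValue = maxValue;
    }

    // Combine the existing range with the range observed in a delta write.
    // The result covers both, so range-based pruning stays correct.
    ColumnRangeSketch mergeWith(ColumnRangeSketch delta) {
        if (!columnName.equals(delta.columnName)) {
            throw new IllegalArgumentException("ranges belong to different columns");
        }
        return new ColumnRangeSketch(
            columnName,
            Math.min(minValue, delta.minValue),
            Math.max(maxValue, delta.maxValue));
    }
}
```

For example, merging an indexed range of [10, 50] with a delta range of [5, 30] yields [5, 50]; without the update, a lookup for the value 7 would wrongly skip the file.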
Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com>
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Today, base files store their bloom filters in the file footers, and index
lookups have to load the base file to perform any bloom lookup. Though we
have interval tree based file pruning, we still end up reading a significant
number of base files just to get the bloom filters for the key lookups.
This index lookup operation can be made more performant by keeping
all the bloom filters in a new metadata partition and doing pointed
lookups based on keys.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Adding indexing support for clean, restore and rollback operations.
Each of these operations will now be converted to index records for
bloom filter and column stats additionally.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Making hoodie key consistent for both column stats and bloom index by
including fileId instead of fileName, in both read and write paths.
- Performance optimization for looking up records in the metadata table.
- Avoiding multi column sorting needed for HoodieBloomMetaIndexBatchCheckFunction
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- HoodieBloomMetaIndexBatchCheckFunction cleanup to remove unused classes
- Base file checking before reading the file footer for bloom or column stats
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Updating the bloom index and column stats index to have full file name
included in the key instead of just file id.
- Minor test fixes.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Fixed the Flink commit method to handle update records for all metadata table partitions
- TestBloomIndex fixes
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- SparkHoodieBloomIndexHelper code simplification for various config modes
- Signature change for getBloomFilters() and getColumnStats(). Callers can
just pass in the partition and file names of interest; the index key is then
constructed internally based on the passed in parameters.
- KeyLookupHandle and KeyLookupResults code refactoring
- Metadata schema changes - removed the reserved field
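
The signature change above can be sketched roughly as follows. Everything in this block is hypothetical: MetaIndexLookupSketch, the "partition|fileName" key layout, and the string payloads are illustrative assumptions, not the actual Hudi key encoding or the real getBloomFilters()/getColumnStats() signatures. The point is only that callers hand over partition and file names, while key construction (and sorting for lookup) stays internal.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; the key format and payload type are assumptions.
final class MetaIndexLookupSketch {
    // Stand-in for a metadata partition: index key -> serialized payload.
    private final Map<String, String> store = new TreeMap<>();

    // Internal key construction; callers never build keys themselves.
    static String indexKey(String partition, String fileName) {
        return partition + "|" + fileName;
    }

    void put(String partition, String fileName, String payload) {
        store.put(indexKey(partition, fileName), payload);
    }

    // Each request entry is a {partition, fileName} pair.
    // Keys are sorted before the lookup; missing entries are skipped.
    Map<String, String> lookup(List<String[]> partitionAndFileNames) {
        List<String> keys = new ArrayList<>();
        for (String[] pair : partitionAndFileNames) {
            keys.add(indexKey(pair[0], pair[1]));
        }
        Collections.sort(keys);

        Map<String, String> found = new LinkedHashMap<>();
        for (String key : keys) {
            String payload = store.get(key);
            if (payload != null) {
                found.put(key, payload);
            }
        }
        return found;
    }
}
```

Keeping key construction behind the lookup method means a later change to the key layout (for example, file id versus full file name) does not ripple out to callers.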
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Removing HoodieColumnStatsMetadata and using HoodieColumnRangeMetadata instead.
Fixed the users of the removed class.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Extending meta index test to cover deletes, compactions, clean
and restore table operations. Also, fixed the getBloomFilters()
and getColumnStats() to account for deleted entries.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Addressing review comments - Javadoc for new classes, keys sorting for
lookup, index methods renaming.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Consolidated the bloom filter checking for keys into one
HoodieMetadataBloomIndexCheckFunction instead of separate batch
and lazy modes. Removed all the configs around it.
- Made the metadata table partition file group count configurable.
- Fixed the HoodieKeyLookupHandle to have auto closable file reader
when checking bloom filter and range keys.
- Config property renames. Test fixes.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Enabling column stats indexing for all columns by default
- Handling column stat generation errors and test update
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Metadata table partition file group count taken from the slices when
the table is bootstrapped.
- Prep records for the commit refactored to the base class
- HoodieFileReader interface changes for filtering keys
- Multi column and data types support for column stats index
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- rebase to latest master and merge fixes for the build and test failures
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Extending the metadata column stats type payload schema to include
more statistics about the column ranges to help query integration.
* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
- Addressing review comments