- Generalized BloomIndex to work with file ids instead of paths
- Abstracted away Bloom filter checking into HoodieLookupHandle
- Abstracted away range information retrieval into HoodieRangeInfoHandle
1. Parquet Avro schema mismatch errors during ingestion were sometimes silently ignored due to a race condition in BoundedInMemoryExecutor. This was reproducible when running a long-running Delta Streamer with a wrong schema, and it caused data loss.
2. Fix behavior of Delta Streamer to error out by default if there are any error records.
3. Fix a bug in tracking write errors in WriteStats. Previously, the write-error count reflected only the sampled errors rather than the total number of errors.
4. Delta Streamer did not commit the changes made by inline compaction because auto-commit is force-disabled. Inline compaction now always auto-commits, since nothing else would commit it.
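The difference between sampled and total write errors (item 3), and erroring out when any error record exists (item 2), can be illustrated with a minimal sketch. `ErrorTrackingSketch` and all of its members are hypothetical names chosen for illustration; this is not the actual WriteStats or Delta Streamer code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a write-status object; not Hudi's real API.
final class ErrorTrackingSketch {
    private final int maxSampledErrors;                 // cap on retained error samples
    private final List<String> sampledErrors = new ArrayList<>();
    private long totalErrorRecords = 0;                 // counts every failure, sampled or not

    ErrorTrackingSketch(int maxSampledErrors) {
        this.maxSampledErrors = maxSampledErrors;
    }

    void markFailure(String recordKey) {
        totalErrorRecords++;                            // always count the failure
        if (sampledErrors.size() < maxSampledErrors) {
            sampledErrors.add(recordKey);               // keep only a bounded sample for debugging
        }
    }

    long getTotalErrorRecords() {
        return totalErrorRecords;
    }

    List<String> getSampledErrors() {
        return Collections.unmodifiableList(sampledErrors);
    }

    // Assumed default policy: refuse to commit if any record failed.
    boolean shouldCommit() {
        return totalErrorRecords == 0;
    }
}
```

Reporting `sampledErrors.size()` as the error count would understate failures whenever more records fail than the sample cap allows, which is the kind of bug item 3 describes.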
The code-style rules follow Google style with some changes:
1. Increase line length from 100 to 120
2. Disable JavaDoc-related Checkstyle checks, as enabling them needs more manual work.
Both source and test code are checked for code style.
- HoodieLogFormat V2 has support for LogFormat evolution through versioning
- LogVersion is associated with a LogBlock not a LogFile
- Based on the version of a LogBlock, the appropriate code path is executed
- Implemented LazyReading of Hoodie Log Blocks with Memory / IO tradeoff
- Implemented Reverse pointer to be able to traverse the log in reverse
- Introduce new MAGIC for backwards compatibility with logs without versions
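A per-block version together with a reverse pointer can be sketched as follows. The byte layout here (magic, version, content length, content, trailing block length) and the `MAGIC` value are assumptions made for illustration only; they are not the actual HoodieLogFormat V2 on-disk layout.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical log layout, for illustration only:
// [MAGIC:int][version:int][contentLength:int][content][blockLength:int]
final class VersionedLogSketch {
    static final int MAGIC = 0x4C4F4731;   // arbitrary marker, not Hudi's real magic
    static final int HEADER_BYTES = 12;    // magic + version + content length
    static final int FOOTER_BYTES = 4;     // trailing total block length (the reverse pointer)

    static void appendBlock(ByteArrayOutputStream out, int version, byte[] content) {
        int blockLength = HEADER_BYTES + content.length + FOOTER_BYTES;
        ByteBuffer buf = ByteBuffer.allocate(blockLength);
        buf.putInt(MAGIC).putInt(version).putInt(content.length).put(content).putInt(blockLength);
        out.write(buf.array(), 0, blockLength);
    }

    // Walks the log from the end to the start using the trailing block lengths.
    static List<String> readReverse(byte[] log) {
        ByteBuffer buf = ByteBuffer.wrap(log);
        List<String> blocks = new ArrayList<>();
        int end = log.length;
        while (end > 0) {
            int blockLength = buf.getInt(end - FOOTER_BYTES);
            int start = end - blockLength;
            if (start < 0 || buf.getInt(start) != MAGIC) {
                throw new IllegalStateException("Corrupt block ending at offset " + end);
            }
            int version = buf.getInt(start + 4);       // version belongs to the block, not the file
            int contentLength = buf.getInt(start + 8);
            String content = new String(log, start + HEADER_BYTES, contentLength, StandardCharsets.UTF_8);
            blocks.add("v" + version + ":" + content);
            end = start;                               // jump to the previous block
        }
        return blocks;
    }
}
```

Because the version is stored with each block, a reader could dispatch on `version` per block, and a lazy reader could stop after the header and skip `contentLength` bytes instead of materializing the content.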
- Fallback to old way of rollback by listing all partitions
- Added null check to ensure only partitions which are to be rolled back are considered
- Added location (commit time) to workload stat
- Added checks in CompactedScanner to guard against task retries
- Introduce new logic for rollback (bounded by instant_time and target_instant time)
- Reversed logfiles order
- Merged all filter* and get* methods
- New constructor takes FileStatus[]
- All existing tests pass
- FileGroup is all files that belong to a fileID within a partition
- FileSlice is a generation of data and log files, starting at a base commit
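The two definitions above can be modeled with a small data-structure sketch. The class and method names are hypothetical, and the sketch assumes commit times are strings that sort lexicographically in chronological order; it is not the actual FileGroup or FileSlice implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical model: one generation of files starting at a base commit.
final class FileSliceSketch {
    final String baseCommitTime;
    String dataFile;                                   // may stay null if only log files exist
    final List<String> logFiles = new ArrayList<>();

    FileSliceSketch(String baseCommitTime) {
        this.baseCommitTime = baseCommitTime;
    }
}

// Hypothetical model: all files that share a fileId within one partition.
final class FileGroupSketch {
    final String partitionPath;
    final String fileId;
    // Sorted by base commit time (assumed to be lexicographically ordered).
    private final TreeMap<String, FileSliceSketch> slices = new TreeMap<>();

    FileGroupSketch(String partitionPath, String fileId) {
        this.partitionPath = partitionPath;
        this.fileId = fileId;
    }

    void addDataFile(String baseCommitTime, String path) {
        slices.computeIfAbsent(baseCommitTime, FileSliceSketch::new).dataFile = path;
    }

    void addLogFile(String baseCommitTime, String path) {
        slices.computeIfAbsent(baseCommitTime, FileSliceSketch::new).logFiles.add(path);
    }

    int sliceCount() {
        return slices.size();
    }

    Optional<FileSliceSketch> latestSlice() {
        return slices.isEmpty() ? Optional.empty() : Optional.of(slices.lastEntry().getValue());
    }
}
```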
1. Create HoodieTable abstraction for commits and fileSystemView
2. HoodieMergeOnReadTable created
3. View is now always obtained from the table and the correct view based on the table type is returned
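A rough sketch of the idea in item 3, where callers obtain the view from the table and the table type decides which view is returned. Every name below is hypothetical and the views are reduced to a description string; the real HoodieTable and HoodieMergeOnReadTable classes are not reproduced here.

```java
// Hypothetical table types, for illustration only.
enum TableTypeSketch { COPY_ON_WRITE, MERGE_ON_READ }

// Hypothetical view; a real view would expose files, not a description.
interface FileSystemViewSketch {
    String describe();
}

abstract class TableSketch {
    // Callers always go through the table to obtain a view.
    abstract FileSystemViewSketch getFileSystemView();

    static TableSketch forType(TableTypeSketch type) {
        switch (type) {
            case COPY_ON_WRITE:
                return new CopyOnWriteTableSketch();
            case MERGE_ON_READ:
                return new MergeOnReadTableSketch();
            default:
                throw new IllegalArgumentException("Unsupported table type: " + type);
        }
    }
}

final class CopyOnWriteTableSketch extends TableSketch {
    @Override
    FileSystemViewSketch getFileSystemView() {
        return () -> "data files only";
    }
}

final class MergeOnReadTableSketch extends TableSketch {
    @Override
    FileSystemViewSketch getFileSystemView() {
        return () -> "data files plus log files";
    }
}
```

Keeping view selection inside the table means calling code does not need to branch on the table type itself.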