* [HUDI-839] Introducing rollback strategy using marker files
- Adds a new rollback mechanism based on the marker files generated during the write (see the sketch below)
- Consequently, marker file/dir deletion now happens post commit, instead of during finalize
- Marker files are also generated for AppendHandle, making it consistent throughout the write path
- Until the upgrade/downgrade mechanism can upgrade non-marker-based inflight writes to marker-based ones, this should only be turned on for new datasets.
- Added marker dir deletion after successful commit/rollback; individual files are no longer deleted during finalize
- Added a failsafe that deletes leftover marker directories during the timeline archival process
- Added a check to ensure completed instants are not rolled back using the marker-based strategy, since that would be incorrect
- Reworked tests to roll back inflight instants instead of completed instants wherever necessary
- Added a unit test for MarkerBasedRollbackStrategy
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
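The core idea: every file a write touches leaves a marker under the instant's marker directory, so rolling back an inflight instant only needs to list that directory rather than scan table partitions. A minimal sketch in plain Java, assuming an illustrative layout of `.hoodie/.temp/<instantTime>` with marker paths that mirror their data files plus a `.marker` suffix; this is not Hudi's actual MarkerBasedRollbackStrategy:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Illustrative marker-based rollback. Assumptions: markers live under
// <base>/.hoodie/.temp/<instantTime>, and each marker's path relative to
// that directory mirrors the data file it tracks, plus a ".marker" suffix.
public class MarkerRollbackSketch {

  public static void rollback(Path basePath, String instantTime) throws IOException {
    Path markerDir = basePath.resolve(".hoodie").resolve(".temp").resolve(instantTime);
    if (!Files.exists(markerDir)) {
      return; // the write never created any files for this instant
    }
    // Delete every data file that has a marker -- no partition scan needed.
    try (Stream<Path> markers = Files.walk(markerDir)) {
      markers.filter(Files::isRegularFile).forEach(marker -> {
        String rel = markerDir.relativize(marker).toString();
        Path dataFile = basePath.resolve(stripSuffix(rel));
        try {
          Files.deleteIfExists(dataFile);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
    // Only once the data files are gone, remove the marker dir itself
    // (children first, hence the reverse ordering).
    try (Stream<Path> leftovers = Files.walk(markerDir)) {
      leftovers.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
  }

  private static String stripSuffix(String markerName) {
    int idx = markerName.lastIndexOf(".marker");
    return idx < 0 ? markerName : markerName.substring(0, idx);
  }
}
```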
Notable changes:
1. HoodieFileWriter and HoodieFileReader abstractions for writer/reader side of a base file format
2. HoodieDataBlock abstraction for creating data blocks specific to base file formats (e.g. Parquet has HoodieAvroDataBlock)
3. All hardcoded references to Parquet / Parquet-based classes have been abstracted into methods that accept a base file format (see the sketch after this list)
4. HiveSyncTool accepts the base file format as a CLI parameter
5. HoodieDeltaStreamer accepts the base file format as a CLI parameter
6. HoodieSparkSqlWriter accepts the base file format as a parameter
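Heavily simplified shapes of these abstractions (the real interfaces carry schemas, bloom filters, and engine-specific record generics); the point is that call sites go through a factory keyed on the table's configured base file format instead of instantiating Parquet classes directly:

```java
import java.io.IOException;
import java.util.Iterator;

// Simplified sketch only; actual Hudi interfaces are richer than this.
interface HoodieFileWriter<R> {
  void write(String recordKey, R record) throws IOException;
  void close() throws IOException;
}

interface HoodieFileReader<R> {
  Iterator<R> getRecordIterator() throws IOException;
  void close() throws IOException;
}

enum HoodieFileFormat { PARQUET /* , ORC, ... */ }

// Call sites ask a factory for a writer keyed on the table's configured
// base file format instead of constructing Parquet writers directly.
final class HoodieFileWriterFactory {
  static <R> HoodieFileWriter<R> getFileWriter(HoodieFileFormat format, String path) {
    if (format == HoodieFileFormat.PARQUET) {
      return newParquetFileWriter(path);
    }
    throw new UnsupportedOperationException("Unsupported base file format: " + format);
  }

  private static <R> HoodieFileWriter<R> newParquetFileWriter(String path) {
    // A real implementation would wrap a ParquetWriter here; stubbed out
    // to keep the sketch self-contained.
    throw new UnsupportedOperationException("illustration only");
  }
}
```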
add HoodieDeltaStreamer examples: a mock source, a DFS source, and a Kafka source
Signed-off-by: dengziming <dengziming1993@gmail.com>
add defaultSparkConf utils method (see the sketch below)
change version of hudi-examples to 0.5.2-SNAPSHOT
change the artifactId of hudi-spark and hudi-utilities
adapt some code to Kafka 2.0
Update scripts
Add license
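For reference, a minimal sketch of what a defaultSparkConf-style helper for the examples might look like; the class name and method body here are assumptions, not the actual hudi-examples code:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class HoodieExampleSparkUtils {

  // Shared defaults so each example doesn't repeat Spark boilerplate.
  public static SparkConf defaultSparkConf(String appName) {
    return new SparkConf()
        .setAppName(appName)
        // Hudi's Spark writers require Kryo serialization.
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // Local master so the examples run out of the box.
        .setMaster("local[2]");
  }

  public static void main(String[] args) {
    try (JavaSparkContext jsc = new JavaSparkContext(defaultSparkConf("hudi-example"))) {
      // ... each example would perform its writes/reads against jsc here ...
    }
  }
}
```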
* Adding proper defaults to Hudi metadata fields and proper handling in the rewrite routine (sketched below)
* Handle fields declared with a null default
Co-authored-by: Alex Filipchik <alex.filipchik@csscompany.com>
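A sketch of the rewrite idea, assuming Avro 1.9+ where `Field#defaultVal()` returns the `JsonProperties.NULL_VALUE` sentinel for an explicit `"default": null`; this is illustrative, not Hudi's actual rewrite routine:

```java
import org.apache.avro.JsonProperties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class RewriteSketch {

  // Copy fields that exist in the old record; anything newly added falls
  // back to the new schema's declared default.
  static GenericRecord rewrite(GenericRecord oldRecord, Schema newSchema) {
    GenericRecord out = new GenericData.Record(newSchema);
    for (Schema.Field field : newSchema.getFields()) {
      if (oldRecord.getSchema().getField(field.name()) != null) {
        out.put(field.name(), oldRecord.get(field.name()));
      } else {
        Object dflt = field.defaultVal();
        // An explicit "default": null comes back as a sentinel object,
        // not Java null -- the case this change handles.
        out.put(field.name(), dflt == JsonProperties.NULL_VALUE ? null : dflt);
      }
    }
    return out;
  }
}
```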
- Savepoint and compaction classes moved to table.action.* packages
- HoodieWriteClient#savepoint(...) returns void
- Renamed HoodieCommitArchiveLog -> HoodieTimelineArchiveLog
- Fixed tests to account for the additional validation
- Moved helper code into CompactHelpers and SavepointHelpers
Hudi-specific validation of schema evolution should ensure that a newer schema can be used for the dataset by checking that data written using the old schema can be read using the new schema (a sketch of the core check follows below).
Code changes:
1. Added a new config in HoodieWriteConfig to enable schema validation check (disabled by default)
2. Moved code that reads schema from base/log files into hudi-common from hudi-hive-sync
3. Added writerSchema to the extraMetadata of compaction commits in MOR table. This is same as that for commits on COW table.
Testing changes:
4. Extended TestHoodieClientBase to add an insertBatch API that allows inserting a new batch of unique records into a Hudi table
5. Added a unit test to verify schema evolution for both COW and MOR tables.
6. Added unit tests for schema compatibility checks.
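The core of such a check can be expressed with Avro's built-in compatibility checker; how Hudi wires it into the write path behind the new config is simplified away here:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class SchemaValidationSketch {

  // The new schema acts as the reader; it must be able to read data
  // written with the old (writer) schema.
  static boolean canRead(Schema oldWriterSchema, Schema newReaderSchema) {
    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(newReaderSchema, oldWriterSchema);
    return result.getType() == SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE;
  }
}
```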
- rollback() and restore() table level APIs introduced
- Restore is implemented by wrapping calls to the rollback executor (sketched below)
- Existing tests transparently cover this, since it's just a refactor
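A sketch of the restore-as-repeated-rollback idea; the names here are illustrative, and Hudi's actual table-level APIs operate on HoodieInstant objects rather than plain strings:

```java
import java.util.Arrays;
import java.util.List;

public class RestoreSketch {

  interface RollbackExecutor {
    void rollback(String instantTime);
  }

  // Restore to a point in time = roll back every later instant, newest first.
  // instantsAfterTarget must be sorted oldest-to-newest.
  static void restore(List<String> instantsAfterTarget, RollbackExecutor executor) {
    for (int i = instantsAfterTarget.size() - 1; i >= 0; i--) {
      executor.rollback(instantsAfterTarget.get(i));
    }
  }

  public static void main(String[] args) {
    restore(Arrays.asList("001", "002", "003"), t -> System.out.println("rolled back " + t));
  }
}
```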