- Changes the default config of marker type (HoodieWriteConfig.MARKERS_TYPE or hoodie.write.markers.type) from DIRECT to TIMELINE_SERVER_BASED for Spark Engine.
- Adds engine-specific marker type configs: Spark -> TIMELINE_SERVER_BASED, Flink -> DIRECT, Java -> DIRECT.
- Uses DIRECT markers as well for Spark structured streaming due to timeline server only available for the first mini-batch.
- Fixes the marker creation method for non-partitioned table in TimelineServerBasedWriteMarkers.
- Adds the fallback to direct markers even when TIMELINE_SERVER_BASED is configured, in WriteMarkersFactory: when HDFS is used, or embedded timeline server is disabled, the fallback to direct markers happens.
- Fixes the closing of timeline service.
- Fixes tests that depend on markers, mainly by starting the timeline service for each test.
* Making error -> warn logs from timeline server with concurrent writers for inconsistent state
* Fixing bad request response exception for timeline out of sync
* Addressing feedback. removed write concurrency mode depedency
- Added upgrade and downgrade step to and from 0.9.0. Upgrade adds few table properties. Downgrade recreates timeline server based marker files if any.
- Rollback infers the directory structure and does rollback based on the strategy used while markers were written. "write markers type" in write config is used to determine marker strategy only for new writes.
* Reduces log level for too verbose messages from info to debug level.
* Sort config output.
* Code Review : Small restructuring + rebasing to master
- Fixing flaky multi delta streamer test
- Using isDebugEnabled() checks
- Some changes to shorten log message without moving to DEBUG
Co-authored-by: volodymyr.burenin <volodymyr.burenin@cloudkitchens.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Addresses leaks, perf degradation observed during testing. These were regressions from the original rfc-15 PoC implementation.
* Pass a single instance of HoodieTableMetadata everywhere
* Fix tests and add config for enabling metrics
- Removed special casing of assumeDatePartitioning inside FSUtils#getAllPartitionPaths()
- Consequently, IOException is never thrown and many files had to be adjusted
- More diligent handling of open file handles in metadata table
- Added config for controlling reuse of connections
- Added config for turning off fallback to listing, so we can see tests fail
- Changed all ipf listing code to cache/amortize the open/close for better performance
- Timelineserver also reuses connections, for better performance
- Without timelineserver, when metadata table is opened from executors, reuse is not allowed
- HoodieMetadataConfig passed into HoodieTableMetadata#create as argument.
- Fix TestHoodieBackedTableMetadata#testSync
* [HUDI-1276] [HUDI-1459] Make Clustering/ReplaceCommit and Metadata table be compatible
* Use filesystemview and json format from metadata. Add tests
Co-authored-by: Satish Kotha <satishkotha@uber.com>
- use codecov flags for each module to report coverage
- parallelize CI jobs for shorter time
- add a testcase for MetricsReporterFactory (to trigger codecov comment)
This is to ensure that tests will execute all code paths, even the ones
written under DEBUG log levels. This will improve coverage as well as
ensure there are no surprised when DEBUG log level is enabled in
production.
- Brings more order and cohesion to the classes in hudi-common
- Utils classes related to a particular concept (avro, timeline,...) are placed near to the package
- common.fs package now contains all the filesystem level classes including wrapper filesystem
- bloom.filter package renamed to just bloom
- config package contains classes that help store properties
- common.fs.inline package contains all the inline filesystem classes/impl
- common.table.timeline now consolidates all timeline related classes
- common.table.view consolidates all the classes related to filesystem view metadata
- common.table.timeline.versioning contains all classes related to versioning of timeline
- Fix few unit tests as a result
- Moved the test packages around to match the source file move
- Rename AvroUtils to TimelineMetadataUtils & minor fixes/typos
- Storage Type replaced with Table Type (remaining instances)
- View types replaced with query types;
- ReadOptimized view referred as Snapshot Query
- TableFileSystemView sub interfaces renamed to BaseFileOnly and Slice Views
- HoodieDataFile renamed to HoodieBaseFile
- Hive Sync tool will register RO tables for MOR with a `_ro` suffix
- Datasource/Deltastreamer options renamed accordingly
- Support fallback to old config values as well, so migration is painless
- Config for controlling _ro suffix addition
- Renaming DataFile to BaseFile across DTOs, HoodieFileSlice and AbstractTableFileSystemView
- Docs were talking about storage types before, cWiki moved to "Table"
- Most of code already has HoodieTable, HoodieTableMetaClient - correct naming
- Replacing renaming use of dataset across code/comments
- Few usages in comments and use of Spark SQL DataSet remain unscathed
Before this change, Cleaner performs cleaning of old file versions and then stores the deleted files in .clean files.
With this setup, we will not be able to track file deletions if a cleaner fails after deleting files but before writing .clean metadata.
This is fine for regular file-system view generation but Incremental timeline syncing relies on clean/commit/compaction metadata to keep a consistent file-system view.
Cleaner state transitions is now similar to that of compaction.
1. Requested : HoodieWriteClient.scheduleClean() selects the list of files that needs to be deleted and stores them in metadata
2. Inflight : HoodieWriteClient marks the state to be inflight before it starts deleting
3. Completed : HoodieWriteClient marks the state after completing the deletion according to the cleaner plan