1
0
Commit Graph

18 Commits

Author SHA1 Message Date
Sivabalan Narayanan
f9bc3e03e5 [MINOR] Adding a deprecated constructor to AbstractSyncHoodieClient (#3902) 2021-11-02 12:16:38 -04:00
董可伦
10e3a9a3fb [MINOR] Fix typo,'properites' corrected to 'properties' (#3738) 2021-10-06 20:37:01 -04:00
Sivabalan Narayanan
5f32162a2f [HUDI-2285][HUDI-2476] Metadata table synchronous design. Rebased and Squashed from pull/3426 (#3590)
* [HUDI-2285] Adding Synchronous updates to metadata before completion of commits in data timelime.

- This patch adds synchronous updates to metadata table. In other words, every write is first committed to metadata table followed by data table. While reading metadata table, we ignore any delta commits that are present only in metadata table and not in data table timeline.
- Compaction of metadata table is fenced by the condition that we trigger compaction only when there are no inflight requests in datatable. This ensures that all base files in metadata table is always in sync with data table(w/o any holes) and only there could be some extra invalid commits among delta log files in metadata table.
- Due to this, archival of data table also fences itself up until compacted instant in metadata table.
All writes to metadata table happens within the datatable lock. So, metadata table works in one writer mode only. This might be tough to loosen since all writers write to same FILES partition and so, will result in a conflict anyways.
- As part of this, have added acquiring locks in data table for those operations which were not before while committing (rollback, clean, compaction, cluster). To note, we were not doing any conflict resolution. All we are doing here is to commit by taking a lock. So that all writes to metadata table is always a single writer. 
- Also added building block to add buckets for partitions, which will be leveraged by other indexes like record level index, etc. For now, FILES partition has only one bucket. In general, any number of buckets per partition is allowed and each partition has a fixed fileId prefix with incremental suffix for each bucket within each partition.
Have fixed [HUDI-2476]. This fix is about retrying a failed compaction if it succeeded in metadata for first time, but failed w/ data table.
- Enabling metadata table by default.
- Adding more tests for metadata table

Co-authored-by: Prashant Wason <pwason@uber.com>
2021-10-06 00:17:52 -04:00
Udit Mehrotra
3e301196bf Moving to 0.10.0-SNAPSHOT on master branch. 2021-08-14 18:51:09 -07:00
swuferhong
21db6d7a84 [HUDI-1771] Propagate CDC format for hoodie (#3285) 2021-08-10 20:23:23 +08:00
pengzhiwei
ffa934182a [HUDI-2045] Support Read Hoodie As DataSource Table For Flink And DeltaStreamer 2021-07-12 13:03:14 +08:00
Roc Marshal
e4fd195d9f [MINOR] Refactor method up to parent-class (#2822) 2021-04-27 21:32:32 +08:00
pengzhiwei
aacb8be521 [HUDI-1415] Read Hoodie Table As Spark DataSource Table (#2283) 2021-04-20 14:21:38 -07:00
garyli1019
6e803e08b1 Moving to 0.9.0-SNAPSHOT on master branch. 2021-03-24 21:37:14 +08:00
Sivabalan Narayanan
c9fcf964b2 [HUDI-1315] Adding builder for HoodieTableMetaClient initialization (#2534) 2021-02-20 09:54:26 +08:00
Vinoth Chandar
3719e7b388 Moving to 0.8.0-SNAPSHOT on master branch. 2021-01-20 11:31:22 -08:00
vinoth chandar
5ca0625b27 [HUDI 1308] Harden RFC-15 Implementation based on production testing (#2441)
Addresses leaks, perf degradation observed during testing. These were regressions from the original rfc-15 PoC implementation.

* Pass a single instance of HoodieTableMetadata everywhere
* Fix tests and add config for enabling metrics
 - Removed special casing of assumeDatePartitioning inside FSUtils#getAllPartitionPaths()
 - Consequently, IOException is never thrown and many files had to be adjusted
- More diligent handling of open file handles in metadata table
 - Added config for controlling reuse of connections
 - Added config for turning off fallback to listing, so we can see tests fail
 - Changed all ipf listing code to cache/amortize the open/close for better performance
 - Timelineserver also reuses connections, for better performance
 - Without timelineserver, when metadata table is opened from executors, reuse is not allowed
 - HoodieMetadataConfig passed into HoodieTableMetadata#create as argument.
 -  Fix TestHoodieBackedTableMetadata#testSync
2021-01-19 21:20:28 -08:00
Sivabalan Narayanan
a43e191d6c [MINOR] Bumping snapshot version to 0.7.0 (#2435) 2021-01-16 09:56:28 -05:00
Udit Mehrotra
7ce3ac778e [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths (#2417)
* [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partition paths

* Adding testClass for FileSystemBackedTableMetadata

Co-authored-by: Nishith Agarwal <nagarwal@uber.com>
2021-01-10 21:19:52 -08:00
Udit Mehrotra
4e64226844 [HUDI-1450] Use metadata table for listing in HoodieROTablePathFilter (apache#2326)
[HUDI-1394] [RFC-15] Use metadata table (if present) to get all partition paths (apache#2351)
2021-01-04 07:59:47 -08:00
Satish Kotha
492ddcbb06 [HUDI-1191] Add incremental meta client API to query partitions modified in a time window 2020-08-25 12:40:10 -07:00
Bhavani Sudha Saktheeswaran
4226d75144 Moving to 0.6.1-SNAPSHOT on master branch. 2020-08-14 12:54:15 -07:00
lw0090
51ea27d665 [HUDI-875] Abstract hudi-sync-common, and support hudi-hive-sync, hudi-dla-sync (#1810)
- Generalize the hive-sync module for syncing to multiple metastores
- Added new options for datasource
- Added new command line for delta streamer 

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-05 21:34:55 -07:00