Commit Graph

33 Commits

Author SHA1 Message Date
Udit Mehrotra
3e301196bf Moving to 0.10.0-SNAPSHOT on master branch. 2021-08-14 18:51:09 -07:00
Y Ethan Guo
9056c68744 [HUDI-2305] Add MARKERS.type and fix marker-based rollback (#3472)
- Rollback infers the directory structure and does rollback based on the strategy used while markers were written. "write markers type" in write config is used to determine marker strategy only for new writes.
2021-08-14 08:18:49 -04:00
Prashant Wason
76bc686a77 [HUDI-1292] Created a config to enable/disable syncing of metadata table. (#3427)
* [HUDI-1292] Created a config to enable/disable syncing of metadata table.

- Metadata Table should only be synced from a single pipeline to prevent conflicts.
- Skip syncing metadata table for clustering and compaction
- Renamed useFileListingMetadata

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-08-12 15:45:57 -07:00
Y Ethan Guo
4783176554 [HUDI-1138] Add timeline-server-based marker file strategy for improving marker-related latency (#3233)
- Can be enabled for cloud stores like S3. Not supported for HDFS yet, due to partial write failures.
2021-08-11 11:48:13 -04:00
Sagar Sumit
70b6bd485f [HUDI-1468] Support custom clustering strategies and preserve commit metadata as part of clustering (#3419)
Co-authored-by: Satish Kotha <satishkotha@uber.com>
2021-08-06 22:53:08 -04:00
Sivabalan Narayanan
61148c1c43 [HUDI-2176, 2178, 2179] Adding virtual key support to COW table (#3306) 2021-07-26 17:21:04 -04:00
pengzhiwei
f760ec543e [HUDI-1659] Basic Implement Of Spark Sql Support For Hoodie (#2645)
Main functions:
Support create table for hoodie.
Support CTAS.
Support Insert for hoodie. Including dynamic partition and static partition insert.
Support MergeInto for hoodie.
Support DELETE
Support UPDATE
Supports both spark2 & spark3, based on DataSourceV1.

Main changes:
Add a SQL parser for spark2.
Add HoodieAnalysis for SQL resolution and logical plan rewriting.
Add command implementations for CREATE TABLE, INSERT, MERGE INTO & CTAS.
In order to push down the update & insert logic to the HoodieRecordPayload for MergeInto, I made some changes to the
HoodieWriteHandler and other related classes.
1. Add the inputSchema for parsing the incoming record. This is because the inputSchema for MergeInto is different from the writeSchema, as there are some transforms in the update & insert expressions.
2. Add WRITE_SCHEMA to HoodieWriteConfig to pass the write schema for MergeInto.
3. Pass properties to HoodieRecordPayload#getInsertValue to pass the insert expression and table schema.


Verify this pull request
Add TestCreateTable to test creating hoodie tables and CTAS.
Add TestInsertTable to test inserting into hoodie tables.
Add TestMergeIntoTable to test merging into hoodie tables.
Add TestUpdateTable to test updating hoodie tables.
Add TestDeleteTable to test deleting from hoodie tables.
Add TestSqlStatement to test the currently supported DDL/DML.
2021-06-07 23:24:32 -07:00
Roc Marshal
fcedbfcb58 [MINOR][hudi-client] Code-cleanup,remove redundant variable declarations (#2956) 2021-05-17 13:34:42 +08:00
TeRS-K
be9db2c4f5 [HUDI-1055] Remove hardcoded parquet in tests (#2740)
* Remove hardcoded parquet in tests
* Use DataFileUtils.getInstance
* Renaming DataFileUtils to BaseFileUtils

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-05-11 10:01:45 -07:00
leo-Iamok
8bc65b9318 [HUDI-1731] Rename UpsertPartitioner in hudi-java-client (#2734)
Co-authored-by: lei.zhu <lei.zhu@envisioncn.com>
2021-03-31 11:06:04 +08:00
Shen Hong
ecbd389a3f [HUDI-1478] Introduce HoodieBloomIndex to hudi-java-client (#2608) 2021-03-28 20:28:40 +08:00
garyli1019
6e803e08b1 Moving to 0.9.0-SNAPSHOT on master branch. 2021-03-24 21:37:14 +08:00
n3nash
74241947c1 [HUDI-845] Added locking capability to allow multiple writers (#2374)
* [HUDI-845] Added locking capability to allow multiple writers
1. Added LockProvider API for pluggable lock methodologies
2. Added Resolution Strategy API to allow for pluggable conflict resolution
3. Added TableService client API to schedule table services
4. Added Transaction Manager for wrapping actions within transactions
2021-03-16 16:43:53 -07:00
Shen Hong
2efd0760ac [HUDI-1477] Support copyOnWriteTable in java client (#2382) 2021-02-23 20:50:55 +08:00
n3nash
ffcfb58bac [HUDI-1486] Remove inline inflight rollback in hoodie writer (#2359)
1. Refactor rollback and move cleaning failed commits logic into cleaner
2. Introduce hoodie heartbeat to ascertain failed commits
3. Fix test cases
2021-02-19 20:12:22 -08:00
Sivabalan Narayanan
c9fcf964b2 [HUDI-1315] Adding builder for HoodieTableMetaClient initialization (#2534) 2021-02-20 09:54:26 +08:00
Danny Chan
4c5b6923cc [HUDI-1557] Make Flink write pipeline write task scalable (#2506)
This is step 2 of RFC-24:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+24%3A+Hoodie+Flink+Writer+Proposal

This PR introduces a BucketAssigner that assigns a bucket ID (partition
path & fileID) to each stream record.

There is no longer any need to look up the index and partition the records in
the following pipeline for these records;
we decide the write target location before the write, and each
record computes its location when the BucketAssigner receives it, so
the indexing is done in a streaming style.

Computing locations for a batch of records all at once is resource
intensive and puts pressure on the engine;
we should avoid that in a streaming system.
2021-02-06 22:03:52 +08:00
Shen Hong
c4afd179c1 [HUDI-1476] Introduce unit test infra for java client (#2478) 2021-01-24 11:17:19 -08:00
Vinoth Chandar
3719e7b388 Moving to 0.8.0-SNAPSHOT on master branch. 2021-01-20 11:31:22 -08:00
teeyog
c931dc5406 [MINOR] Remove redundant judgments (#2466) 2021-01-20 20:41:09 +08:00
Sivabalan Narayanan
b9c2856d16 [HUDI-1535] Fix 0.7.0 snapshot (#2456)
* Revert "[MINOR] Bumping snapshot version to 0.7.0 (#2435)"

This reverts commit a43e191d6c.

* Fixing 0.7.0 snapshot bump
2021-01-19 12:20:43 -08:00
Sivabalan Narayanan
a43e191d6c [MINOR] Bumping snapshot version to 0.7.0 (#2435) 2021-01-16 09:56:28 -05:00
Udit Mehrotra
17df517b81 [HUDI-1510] Move HoodieEngineContext and its dependencies to hudi-common (#2410) 2021-01-07 11:34:06 -08:00
vinoth chandar
5ff8e88d58 [HUDI-1513] Introduce WriteClient#preWrite() and relocate metadata table syncing (#2413)
- Syncing to metadata table, setting operation type, starting async cleaner done in preWrite()
 - Fixes an issue where delete() was not starting async cleaner correctly
 - Fixed tests and enabled metadata table for TestAsyncCompaction
2021-01-06 23:08:30 -08:00
Prashant Wason
298808baaf [HUDI-842] Implementation of HUDI RFC-15.
- Introduced an internal metadata table, that stores file listings.
 - metadata table is kept upto date with
 - Fixed handling of CleanerPlan.
 - [HUDI-842] Reduce parallelism to speed up the test.
 - [HUDI-842] Implementation of CLI commands for metadata operations and lookups.
 - [HUDI-842] Extend rollback metadata to include the files which have been appended to.
 - [HUDI-842] Support for rollbacks in MOR Table.
 - MarkerBasedRollbackStrategy needs to correctly provide the list of files for which rollback blocks were appended.
 - [HUDI-842] Added unit test for rollback of partial commits (inflight but not completed yet).
 - [HUDI-842] Handled the error case where metadata update succeeds but dataset commit fails.
 - [HUDI-842] Schema evolution strategy for Metadata Table. Each type of metadata saved (FilesystemMetadata, ColumnIndexMetadata, etc.) will be a separate field with default null. The type of the record will identify the valid field. This way, we can grow the schema when a new type of information is saved, while still keeping it backward compatible.
 - [HUDI-842] Fix non-partitioned case and speed up initial creation of metadata table. Choose only 1 partition for jsc as the number of records is low (hundreds to thousands). Creating a large number of partitions for JavaRDD adds overhead and slows down operations like WorkloadProfile.
For the non-partitioned case, use "." as the name of the partition to prevent empty keys in HFile.
 - [HUDI-842] Reworked metrics publishing.
 - Code has been split into reader and writer side. HoodieMetadata code to be accessed by using HoodieTable.metadata() to get an instance of the metadata for the table.
Code is serializable to allow executors to use the functionality.
 - [RFC-15] Add metrics to track the time for each file system call.
 - [RFC-15] Added a distributed metrics registry for spark which can be used to collect metrics from executors. This helps create a stats dashboard which shows the metadata table improvements in real-time for production tables.
 - [HUDI-1321] Created HoodieMetadataConfig to specify configuration for the metadata table. This is safer than full-fledged properties for the metadata table (like HoodieWriteConfig), as it makes it burdensome to tune the metadata. With limited configuration, we can control the performance of the metadata table closely.

[HUDI-1319][RFC-15] Adding interfaces for HoodieMetadata, HoodieMetadataWriter (apache#2266)
 - moved MetadataReader to HoodieBackedTableMetadata, under the HoodieTableMetadata interface
 - moved MetadataWriter to HoodieBackedTableMetadataWriter, under the HoodieTableMetadataWriter
 - Pulled all the metrics into HoodieMetadataMetrics
 - Writer now wraps the metadata, instead of extending it
 - New enum for MetadataPartitionType
 - Streamlined code flow inside HoodieBackedTableMetadataWriter w.r.t initializing metadata state
 - [HUDI-1319] Make async operations work with metadata table (apache#2332)
 - Changes the syncing model to only move over completed instants on data timeline
 - Syncing happens postCommit and on writeClient initialization
 - Latest delta commit on the metadata table is sufficient as the watermark for data timeline archival
 - Cleaning/Compaction use a suffix to the last instant written to metadata table, such that we keep the 1-1 mapping between data and metadata timelines.
 - Got rid of a lot of the complexity around checking for valid commits during open of base/log files
 - Tests now use local FS, to simulate more failure scenarios
 - Some failure scenarios exposed HUDI-1434, which is needed for MOR to work correctly

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-01-04 07:59:47 -08:00
Shen Hong
ff8313caf1 [HUDI-1423] Support delete in hudi-java-client (#2353) 2021-01-03 20:38:45 +08:00
steven zhang
a23aa41a1a [MINOR] Sync UpsertPartitioner modify of HUDI-1398 to flink/java (#2390)
Co-authored-by: zhang wen <wen.zhang@dmall.com>
2020-12-31 11:06:41 +08:00
Gary Li
605b617cfa [HUDI-1434] fix incorrect log file path in HoodieWriteStat (#2300)
* [HUDI-1434] fix incorrect log file path in HoodieWriteStat

* HoodieWriteHandle#close() returns a list of WriteStatus objs

* Handle rolled-over log files and return a WriteStatus per log file written

 - Combined data and delete block logging into a single call
 - Lazily initialize and manage write status based on returned AppendResult
 - Use FSUtils.getFileSize() to set final file size, consistent with other handles
 - Added tests around returned values in AppendResult
 - Added validation of the file sizes returned in write stat

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-12-30 14:22:15 -08:00
lw0090
e177466fd2 [HUDI-1350] Support Partition level delete API in HUDI (#2254)
* [HUDI-1350] Support Partition level delete API in HUDI

* [HUDI-1350] Support Partition level delete API in HUDI base InsertOverwriteCommitAction
2020-12-28 15:01:06 -08:00
Satish Kotha
6dc03b65bf [HUDI-1075] Implement simple clustering strategies to create ClusteringPlan and to run the plan 2020-12-21 17:34:15 -08:00
Shen Hong
e4e2fbc3bb [HUDI-1419] Add base implementation for hudi java client (#2286) 2020-12-19 19:25:27 -08:00
Shen Hong
236d1b0dec [HUDI-1439] Remove scala dependency from hudi-client-common (#2306) 2020-12-11 00:36:37 -08:00
Shen Hong
d9411c38db [HUDI-1364] Add HoodieJavaEngineContext to hudi-java-client (#2222) 2020-11-23 10:06:28 -08:00