
78 Commits

Author SHA1 Message Date
ForwardXu
dbec6c512b [HUDI-3022] Fix NPE for isDropPartition method (#4319)
* [HUDI-3022] Fix NPE for isDropPartition method
2021-12-15 19:38:02 +08:00
ForwardXu
dd96129191 [HUDI-2990] Sync to HMS when deleting partitions (#4291) 2021-12-13 20:40:06 +08:00
fengli
568181a3e7 [HUDI-2934] Optimize RequestHandler code style
close apache/hudi#4215
2021-12-04 15:30:52 +08:00
yuzhao.cyz
a1d0ff4209 Moving to 0.11.0-SNAPSHOT on master branch. 2021-11-27 17:22:10 +08:00
Nate Radtke
887787e8b9 [HUDI-1932] Update Hive sync timestamp when change detected (#3053)
* Update Hive sync timestamp when change detected

Only update the last commit timestamp on the Hive table when the table schema
has changed or a partition is created/updated.

When using AWS Glue Data Catalog as the metastore for Hive this will ensure
that table versions are substantive (including schema and/or partition
changes). Prior to this change when a Hive sync is performed without schema
or partition changes the table in the Glue Data Catalog would have a new
version published with the only change being the timestamp property.

https://issues.apache.org/jira/browse/HUDI-1932

* add conditional sync flag

* fix testSyncWithoutDiffs

* fix HiveSyncConfig

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2021-11-21 12:11:05 +05:30
xiarixiaoyao
acc40625f5 [HUDI-2676] Hudi should synchronize owner information to hudi _rt/_ro table. (#3911) 2021-11-03 20:36:01 +08:00
Yann Byron
1f17467f73 [HUDI-1869] Upgrading Spark3 To 3.1 (#3844)
Co-authored-by: pengzhiwei <pengzhiwei2015@icloud.com>
2021-11-02 18:25:12 -07:00
Sivabalan Narayanan
f9bc3e03e5 [MINOR] Adding a deprecated constructor to AbstractSyncHoodieClient (#3902) 2021-11-02 12:16:38 -04:00
vinoyang
b1c4acf0ae [HUDI-2614] Remove duplicated hadoop-hdfs with tests classifier exists in bundles (#3864) 2021-10-26 22:36:10 +08:00
vinoyang
220bf6a7e6 [HUDI-2600] Remove duplicated hadoop-common with tests classifier exists in bundles (#3847) 2021-10-25 13:45:28 +08:00
董可伦
48a3906ccc [MINOR] Fix typo,'paritition' corrected to 'partition' (#3764) 2021-10-11 14:07:34 -04:00
董可伦
10e3a9a3fb [MINOR] Fix typo,'properites' corrected to 'properties' (#3738) 2021-10-06 20:37:01 -04:00
Sivabalan Narayanan
5f32162a2f [HUDI-2285][HUDI-2476] Metadata table synchronous design. Rebased and Squashed from pull/3426 (#3590)
* [HUDI-2285] Adding synchronous updates to metadata before completion of commits in data timeline.

- This patch adds synchronous updates to the metadata table. In other words, every write is first committed to the metadata table and then to the data table. While reading the metadata table, we ignore any delta commits that are present only in the metadata table and not in the data table timeline.
- Compaction of the metadata table is fenced by the condition that we trigger compaction only when there are no inflight requests in the data table. This ensures that all base files in the metadata table are always in sync with the data table (without any holes); the only discrepancy can be some extra invalid commits among the delta log files in the metadata table.
- Due to this, archival of the data table also fences itself up to the compacted instant in the metadata table.
- All writes to the metadata table happen within the data table lock, so the metadata table works in single-writer mode only. This might be tough to loosen, since all writers write to the same FILES partition and will therefore conflict anyway.
- As part of this, added lock acquisition in the data table for operations that previously did not take a lock while committing (rollback, clean, compaction, cluster). To note, we were not doing any conflict resolution; all we are doing here is committing while holding a lock, so that writes to the metadata table always come from a single writer.
- Also added a building block to add buckets for partitions, which will be leveraged by other indexes such as the record level index. For now, the FILES partition has only one bucket. In general, any number of buckets per partition is allowed, and each partition has a fixed fileId prefix with an incremental suffix for each bucket within each partition.
- Fixed [HUDI-2476]. This fix retries a failed compaction if it succeeded in the metadata table the first time but failed in the data table.
- Enabling metadata table by default.
- Adding more tests for metadata table

Co-authored-by: Prashant Wason <pwason@uber.com>
2021-10-06 00:17:52 -04:00
Vinay Patil
73e8ba7620 [HUDI-2499] Making jdbc-url, user and pass as non-required field for other sync modes (#3732) 2021-09-30 11:41:15 -04:00
qianchutao
7e887b54d7 [MINOR] fix typo,'SPAKR' corrected to 'SPARK' (#3721) 2021-09-26 21:52:35 +08:00
jsbali
f52cb32f5f [HUDI-2248] Fixing the closing of hms client (#3364)
* [HUDI-2248] Fixing the closing of hms client

* [HUDI-2248] Using Hive.closeCurrent() over client.close()
2021-09-23 13:45:24 -07:00
董可伦
3a150ee181 [HUDI-2447] Extract common business logic & Fix typo (#3683) 2021-09-17 19:45:22 +08:00
董可伦
8a652171cf [MINOR] Fix typo,'compatiblity' corrected to 'compatibility' (#3675) 2021-09-17 09:43:23 +08:00
Wei
4abcb4f659 [MINOR] Remove unused variables (#3631) 2021-09-09 23:21:16 +08:00
Udit Mehrotra
c350d05dd3 Restore 0.8.0 config keys with deprecated annotation (#3506)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-08-19 13:36:40 -07:00
Udit Mehrotra
3e301196bf Moving to 0.10.0-SNAPSHOT on master branch. 2021-08-14 18:51:09 -07:00
Raymond Xu
8255a86cb4 [HUDI-1939] remove joda time in hivesync module (#3430) 2021-08-10 20:25:41 -07:00
swuferhong
21db6d7a84 [HUDI-1771] Propagate CDC format for hoodie (#3285) 2021-08-10 20:23:23 +08:00
pengzhiwei
0dcd6a8fca [HUDI-2233] Use HMS To Sync Hive Meta For Spark Sql (#3387) 2021-08-05 09:57:22 -04:00
swuferhong
eedfadeb46 [HUDI-2244] Fix database alreadyExists exception while hive sync (#3361) 2021-07-28 19:40:16 +08:00
Sivabalan Narayanan
61148c1c43 [HUDI-2176, 2178, 2179] Adding virtual key support to COW table (#3306) 2021-07-26 17:21:04 -04:00
jsbali
66207ed91a [HUDI-1848] Adding support for HMS for running DDL queries in hive-sy… (#2879)
* [HUDI-1848] Adding support for HMS for running DDL queries in hive-sync-tool

* [HUDI-1848] Fixing test cases

* [HUDI-1848] CR changes

* [HUDI-1848] Fix checkstyle violations

* [HUDI-1848] Fixed a bug when metastore api fails for complex schemas with multiple levels.

* [HUDI-1848] Adding the complex schema and resolving merge conflicts

* [HUDI-1848] Adding some more javadocs

* [HUDI-1848] Added javadocs for DDLExecutor impls

* [HUDI-1848] Fixed style issue
2021-07-23 09:03:15 -07:00
vinoyang
a62a6cff32 [MINOR] Refactor hive sync tool to reduce duplicate code (#3276)
* [MINOR] Refactor hive sync tool to reduce duplicate code
2021-07-15 23:54:38 +08:00
pengzhiwei
93967404a7 [HUDI-2180] Fix Compile Error For Spark3 (#3274) 2021-07-14 09:02:28 -07:00
pengzhiwei
ffa934182a [HUDI-2045] Support Read Hoodie As DataSource Table For Flink And DeltaStreamer 2021-07-12 13:03:14 +08:00
vinoth chandar
c50c24908a [MINOR] Fix build broken from #3186 (#3245) 2021-07-08 14:23:52 -07:00
xiarixiaoyao
de07e61382 [HUDI-2099] Hive lock whose state is WAITING should be released, otherwise this Hive lock will be locked forever (#3186) 2021-07-08 10:30:48 -04:00
xiarixiaoyao
6a71412f78 [HUDI-2116] Support batch synchronization of partition data to hive metastore to avoid OOM problem (#3209) 2021-07-04 22:30:36 +08:00
pengzhiwei
4f215e2938 [HUDI-2057] CTAS Generate An External Table When Create Managed Table (#3146) 2021-07-03 15:55:36 +08:00
wenningd
d412fb2fe6 [HUDI-89] Add configOption & refactor all configs based on that (#2833)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2021-06-30 14:26:30 -07:00
Raymond Xu
0749cc826a [HUDI-2081] Move schema util tests out from TestHiveSyncTool (#3166) 2021-06-29 11:23:46 +08:00
n3nash
23dbc09a0d [MINOR] Removing un-used files and references (#3150) 2021-06-24 22:17:40 -07:00
s-sanjay
0fb8556b0d Add ability to provide multi-region (global) data consistency across HMS in different regions (#2542)
[global-hive-sync-tool] Add a global hive sync tool to sync hudi tables across clusters. Add a way to roll back the replicated timestamp if we fail to sync or if we only partly sync.

Co-authored-by: Jagmeet Bali <jsbali@uber.com>
2021-06-24 20:26:26 -07:00
pengzhiwei
ad53cf450e [HUDI-1879] Fix RO Tables Returning Snapshot Result (#2925) 2021-06-17 04:18:21 -07:00
Wei
75d663f65d [HUDI-1980] Optimize the code to prevent other exceptions from causing resources not to be closed (#3038)
Co-authored-by: wei.zhang2 <wei.zhang2@dmall.com>
2021-06-08 21:58:34 +08:00
pengzhiwei
f760ec543e [HUDI-1659] Basic Implement Of Spark Sql Support For Hoodie (#2645)
Main functions:
Support create table for hoodie.
Support CTAS.
Support Insert for hoodie. Including dynamic partition and static partition insert.
Support MergeInto for hoodie.
Support DELETE
Support UPDATE
Supports both spark2 & spark3, based on DataSourceV1.

Main changes:
Add sql parser for spark2.
Add HoodieAnalysis for sql resolve and logical plan rewrite.
Add command implementations for CREATE TABLE, INSERT, MERGE INTO & CTAS.
In order to push down the update & insert logic to the HoodieRecordPayload for MergeInto, I made some changes to the HoodieWriteHandler and other related classes:
1. Add the inputSchema for parsing the incoming record. This is because the inputSchema for MergeInto is different from the writeSchema, as there are some transforms in the update & insert expressions.
2. Add WRITE_SCHEMA to HoodieWriteConfig to pass the write schema for merge into.
3. Pass properties to HoodieRecordPayload#getInsertValue to pass the insert expression and table schema.


Verify this pull request
Add TestCreateTable to test creating hoodie tables and CTAS.
Add TestInsertTable to test inserting into hoodie tables.
Add TestMergeIntoTable to test merging into hoodie tables.
Add TestUpdateTable to test updating hoodie tables.
Add TestDeleteTable to test deleting from hoodie tables.
Add TestSqlStatement to test the currently supported DDL/DML.
2021-06-07 23:24:32 -07:00
Raymond Xu
441076b2cc [HUDI-1950] Move TestHiveMetastoreBasedLockProvider to functional (#3043)
HiveTestUtil static setup mini servers caused connection refused issue in Azure CI environment, as TestHiveSyncTool and TestHiveMetastoreBasedLockProvider share the same test facilities. Moving TestHiveMetastoreBasedLockProvider (the easier one) to functional test with a separate and improved mini server setup resolved the issue.

Also cleaned up dfs cluster from HiveTestUtil.

The next step is to move TestHiveSyncTool to functional as well.
2021-06-07 15:38:59 -07:00
Wei
dab13f7473 [HUDI-1979] Optimize logic to improve code readability (#3037)
Co-authored-by: wei.zhang2 <wei.zhang2@dmall.com>
2021-06-05 19:40:45 +08:00
vinoth chandar
d02c0e5387 [MINOR] Resolve build issue arising from inaccessible pentaho jar (#3034)
- Fixes #160 #2479
2021-06-04 15:28:44 -04:00
Volodymyr Burenin
8a48d16e41 [HUDI-1707] Reduces log level for too verbose messages from info to debug level. (#2714)
* Reduces log level for too verbose messages from info to debug level.
* Sort config output.
* Code Review : Small restructuring + rebasing to master
 - Fixing flaky multi delta streamer test
 - Using isDebugEnabled() checks
 - Some changes to shorten log message without moving to DEBUG

Co-authored-by: volodymyr.burenin <volodymyr.burenin@cloudkitchens.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-05-10 07:16:02 -07:00
li36909
2c5a661a64 [HUDI-1759] Save one connection retry to hive metastore when hiveSyncTool run with useJdbc=false (#2759)
* [HUDI-1759] Save one connection retry to hive metastore when hiveSyncTool run with useJdbc=false

* Fix review comment
2021-05-07 15:30:26 -07:00
pengzhiwei
c9bcb5e33f [HUDI-1845] Exception Throws When Sync Non-Partitioned Table To Hive With MultiPartKeysValueExtractor (#2876) 2021-04-28 19:11:46 -07:00
Roc Marshal
e4fd195d9f [MINOR] Refactor method up to parent-class (#2822) 2021-04-27 21:32:32 +08:00
pengzhiwei
aacb8be521 [HUDI-1415] Read Hoodie Table As Spark DataSource Table (#2283) 2021-04-20 14:21:38 -07:00
Roc Marshal
f7b6b68063 [MINOR][hudi-sync] Fix typos (#2844) 2021-04-19 16:27:13 +08:00