2929 Commits

Author SHA1 Message Date
YueZhang
4f5cad8029 [MINOR][RFC-53] Fix typos (#5764)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-06-07 08:28:28 +08:00
Raymond Xu
e5710a8e7c [MINOR] Mark AWSGlueCatalogSyncClient experimental (#5775) 2022-06-07 08:25:59 +08:00
Sivabalan Narayanan
7da97c8096 [HUDI-4171] Fixing Non partitioned with virtual keys in read path (#5747)
- When Non partitioned key gen is used with virtual keys, read path could break since partition path may not exist.
2022-06-06 15:48:21 -04:00
Sivabalan Narayanan
21b903fddb [HUDI-4197] Fix Async indexer to support building FILES partition (#5766)
- When the async indexer is invoked with only the "FILES" partition, it fails; this patch fixes that. Also, if the metadata table itself is not initialized and someone wants to build indexes via AsyncIndexer, they are expected to index the "FILES" partition first, followed by other partitions. In general, AsyncIndexer can build only one index at a time, so guards have been added to ensure these conditions are met.
2022-06-06 15:47:11 -04:00
Sivabalan Narayanan
4f6fc726d0 [HUDI-4140] Fixing hive style partitioning and default partition with bulk insert row writer with SimpleKeyGen and virtual keys (#5664)
The bulk insert row writer code path had a gap with respect to hive style partitioning and the default partition when virtual keys are enabled with SimpleKeyGen. This patch fixes the issue.
2022-06-06 10:21:00 -07:00
Alexey Kudinkin
4f7ea8c79a [HUDI-4176] Fixing TableSchemaResolver to avoid repeated HoodieCommitMetadata parsing (#5733)
As outlined in HUDI-4176, we hit a roadblock while testing Hudi on a large dataset (~1Tb) with fairly fat commits, where Hudi's commit metadata could reach into 100s of Mbs.
Given the size of some of our commit metadata instances, Spark's parsing and resolving phase (when spark.sql(...) is involved, but before the returned Dataset is dereferenced) starts to dominate some of our queries' execution time.

- Rebased onto new APIs to avoid excessive Hadoop's Path allocations
- Eliminated hasOperationField completely to avoid repetitive computations
- Cleaning up duplication in HoodieActiveTimeline
- Added caching for common instances of HoodieCommitMetadata
- Made tableStructSchema lazy
2022-06-06 13:14:26 -04:00
HunterXHunter
132c0aa8c7 [HUDI-4101] When BucketIndexPartitioner take partition path for dispersion may cause the fileID of the task to not be loaded correctly (#5763)
Co-authored-by: john.wick <john.wick@vipshop.com>
2022-06-06 21:53:55 +08:00
Sagar Sumit
21ab0ff8be [HUDI-4195] Bulk insert should use right keygen for non-partitioned table (#5759) 2022-06-06 07:19:03 -04:00
Danny Chan
22c45a7704 [HUDI-4188] Fix flaky ITTestDataSTreamWrite.testWriteCopyOnWrite (#5749) 2022-06-06 12:12:48 +08:00
marchpure
73b0be3c96 [HUDI-4192] HoodieHFileReader scan top cells after bottom cells throw NullPointerException (#5755)
SeekTo top cells avoid NullPointerException
2022-06-06 12:07:26 +08:00
Y Ethan Guo
5d18b80343 [HUDI-4190] Include hbase-protocol for shading in the bundles (#5750) 2022-06-05 17:42:16 -07:00
Saisai Shao
bd26d633d7 [HUDI-4168] Add Call Procedure for marker deletion (#5738)
* Add Call Procedure for marker deletion
2022-06-05 11:05:38 +08:00
Nicolas Paris
80783c27f5 [HUDI-4187] Fix partition order in aws glue sync (#5731) 2022-06-04 02:16:52 -07:00
leesf
3759a38b99 [HUDI-4183] Fix using HoodieCatalog to create non-hudi tables (#5743) 2022-06-03 17:16:48 +08:00
KnightChess
51602a34f7 [HUDI-4179] Cluster with sort columns invalid (#5739) 2022-06-02 20:28:21 +08:00
Danny Chan
7f8630cc57 [HUDI-4167] Remove the timeline refresh with initializing hoodie table (#5716)
The timeline refresh on table initialization invokes the fs view #sync, which has two actions now:

1. reload the timeline of the fs view, so that the next fs view request is based on this timeline metadata
2. if this is a local fs view, clear all the local states; if this is a remote fs view, send request to sync the remote fs view

However, looking at the construction: the meta client is freshly instantiated, so the timeline is already the latest, and
the table is also freshly constructed, so the fs view has no local state. That means the #sync is entirely unnecessary.

In this patch, the metadata lifecycle and the data set fs view are kept in sync: when the fs view is refreshed, the underlying metadata
is also refreshed synchronously. The freshness of the metadata follows the same rules as the data fs view:

1. if the fs view is local, the visibility is based on the client table metadata client's latest commit
2. if the fs view is remote, the timeline server would #sync the fs view and metadata together based on the lagging server local timeline

From the client's perspective, there is no need to care about the refresh action anymore, whether or not the metadata table is enabled.
That makes the client logic clearer and less error-prone.

Removing the timeline refresh has another benefit: it avoids unnecessary #refresh of the remote fs view. If all the clients send requests to #sync the
remote fs view, the server would encounter conflicts and the clients would encounter response errors.
2022-06-02 09:48:48 +08:00
Qi Ji
7276d0eaa6 [HUDI-3670] free temp views in sql transformers (#5080) 2022-06-01 07:35:40 -07:00
Sagar Sumit
dfcd6d9a86 [HUDI-4011] Add hudi-aws-bundle (#5674)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-06-01 05:30:29 -07:00
Danny Chan
0d069b5e57 [HUDI-4174] Add hive conf dir option for flink sink (#5725) 2022-06-01 16:17:36 +08:00
Kumud Kumar Srivatsava Tirupati
795a99ba73 [HUDI-4107] Added --sync-tool-classes config option in HoodieMultiTableDeltaStreamer (#5597)
* added --sync-tool-classes config option in multitable delta streamer

* added a testcase to assert if syncClientToolClassNames is getting picked to the deltastreamer execution context
2022-05-31 20:27:50 +05:30
Jin Xing
918c4f4e0b [HUDI-4149] Drop-Table fails when underlying table directory is broken (#5672) 2022-05-30 19:09:26 +08:00
Danny Chan
329da34ee0 [HUDI-4163] Catch general exception instead of IOException while fetching rollback plan during rollback (#5703)
If the avro file is corrupted, an InvalidAvroMagicException is thrown.
2022-05-30 13:08:02 +08:00
苏承祥
7e86884604 [HUDI-4086] Use CustomizedThreadFactory in async compaction and clustering (#5563)
Co-authored-by: 苏承祥 <sucx@tuya.com>
2022-05-28 22:35:47 -07:00
Raymond Xu
0a72458291 [HUDI-3551] Fix testStorageSchemes for oci storage (#5711) 2022-05-28 12:13:37 -07:00
Carter Shanklin
62d792368b [HUDI-3551] Add the Oracle Cloud Infrastructure (oci) Object Storage URI scheme (#4952) 2022-05-28 08:26:14 -07:00
uday08bce
48062a5708 [HUDI-4166] Added SimpleClient plugin for integ test (#5710) 2022-05-28 08:20:52 -07:00
ForwardXu
8fa8f26031 [MINOR] Fix Hive and meta sync config for sql statement (#5316) 2022-05-28 07:56:39 -07:00
wangxianghu
58014c147a [HUDI-4160] Make database regex of MaxwellJsonKafkaSourcePostProcessor optional (#5697) 2022-05-28 11:13:24 +04:00
Bo Cui
93fe5a497e [HUDI-4151] flink split_reader supports rocksdb (#5675)
* [HUDI-4151] flink split_reader supports rocksdb
2022-05-28 08:37:34 +08:00
RexAn
554caa3421 [MINOR] Fix the issue when handling conf hoodie.datasource.write.operation=bulk_insert in sql mode (#5679)
Co-authored-by: Rex An <bonean131@gmail.com>
2022-05-27 04:45:09 -07:00
Alexey Kudinkin
1767ff5e7c [HUDI-4161] Make sure partition values are taken from partition path (#5699) 2022-05-27 02:36:30 -07:00
watermelon12138
57dbe57bed [HUDI-4162] Fixed some constant mapping issues. (#5700)
Co-authored-by: y00617041 <yangxuan42@huawei.com>
2022-05-27 14:08:54 +08:00
YueZhang
85962ee55d [HUDI-3963][RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency (#5567)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-05-26 23:03:09 +08:00
komao
8d2f009048 [HUDI-4124] Add valid check in Spark Datasource configs (#5637)
Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>
2022-05-26 05:21:28 -07:00
Sagar Sumit
31e13db1f0 [HUDI-4023] Decouple hudi-spark from hudi-utilities-slim-bundle (#5641) 2022-05-26 11:28:49 +05:30
RexAn
98c5c6c654 [HUDI-4040] Bulk insert Support CustomColumnsSortPartitioner with Row (#5502)
* Along the lines of RDDCustomColumnsSortPartitioner but for Row
2022-05-26 10:39:04 +05:30
Danny Chan
4e42ed5eae [HUDI-4145] Archives the metadata file in HoodieInstant.State sequence (part2) (#5676) 2022-05-26 11:21:39 +08:00
Sagar Sumit
cf837b4900 [HUDI-3193] Decouple hudi-aws from hudi-client-common (#5666)
Move HoodieMetricsCloudWatchConfig to hudi-client-common
2022-05-25 19:38:56 +05:30
冯健
a6bc9e8e81 [HUDI-4146] Claim RFC-55 for Improve Hive/Meta sync class design and hierarchies (#5682) 2022-05-25 05:31:39 -07:00
luoyajun
f30b3aef3e [MINOR] Fix a potential NPE and some finer points of hudi cli (#5656) 2022-05-24 11:13:18 -07:00
Zhaojing Yu
18635b533e Merge pull request #3599 from yuzhaojing/HUDI-2207
[HUDI-2207] Support independent flink hudi clustering function
2022-05-25 00:47:28 +08:00
Sivabalan Narayanan
10363c1412 [HUDI-4132] Fixing determining target table schema for delta sync with empty batch (#5648) 2022-05-24 08:17:15 -04:00
喻兆靖
c20db99a7b [HUDI-2207] Support independent flink hudi clustering function 2022-05-24 20:16:48 +08:00
liujinhui
0caa55ecb4 [HUDI-4135] remove netty and netty-all (#5663) 2022-05-24 03:56:28 -07:00
Danny Chan
eb219010d2 [HUDI-4145] Archives the metadata file in HoodieInstant.State sequence (#5669) 2022-05-24 17:33:30 +08:00
Sivabalan Narayanan
c05ebf2417 [HUDI-2473] Fixing compaction write operation in commit metadata (#5203) 2022-05-24 13:03:21 +05:30
Danny Chan
676d5cefe0 [HUDI-4138] Fix the concurrency modification of hoodie table config for flink (#5660)
* Remove the metadata cleaning strategy for flink; this means the multi-modal index may be affected
* Improve the HoodieTable#clearMetadataTablePartitionsConfig to only update table config when necessary
* Remove the modification of read code path in HoodieTableConfig
2022-05-24 13:07:55 +08:00
Sivabalan Narayanan
af1128acf9 [HUDI-4084] Add support to test async table services with integ test suite framework (#5557)
* Add support to test async table services with integ test suite framework

* Make await time for validation configurable
2022-05-24 08:35:56 +05:30
Heap
47b764ec33 [HUDI-4134] Fix Method naming consistency issues in FSUtils (#5655) 2022-05-23 15:28:48 -07:00
felixYyu
716e995a38 [MINOR] Removing redundant semicolons and line breaks (#5662) 2022-05-23 15:26:36 -07:00