Commit Graph

3000 Commits

Author SHA1 Message Date
Danny Chan
0811bb38fb [HUDI-4255] Make the flink merge and replace handle intermediate file visible (#5866) 2022-06-15 14:23:23 +08:00
Danny Chan
25bbff64cf [minor] Following HUDI-4207, remote the new wrapper #init method (#5865) 2022-06-15 08:48:13 +08:00
felixYyu
f16b1e8982 [MINOR] Fix typo of DisruptorExecutor in RFC 53 (#5860) 2022-06-13 23:30:17 -07:00
HunterXHunter
264b15df87 [HUDI-4207] HoodieFlinkWriteClient.getOrCreateWriteHandle throws an e… (#5788)
Adding more logs to assist in debugging with HoodieFlinkWriteClient.getOrCreateWriteHandle throwing exception
2022-06-13 10:36:06 -04:00
Qi Ji
4774c4248f [HUDI-4006] failOnDataLoss on delta-streamer kafka sources (#5718)
add new config key hoodie.deltastreamer.source.kafka.enable.failOnDataLoss
when failOnDataLoss=false (current behaviour, the default), log a warning instead of seeking to earliest silently
when failOnDataLoss is set, fail explicitly
2022-06-13 10:31:57 -04:00
luoyajun
0d859fe58b [HUDI-3863] Add UT for drop partition column in deltastreamer testsuite (#5727) 2022-06-13 10:29:32 -04:00
xi chaomin
e89f5627e4 [HUDI-3682] testReaderFilterRowKeys fails in TestHoodieOrcReaderWriter (#5790)
TestReaderFilterRowKeys needs to get the key from RECORD_KEY_METADATA_FIELD, but the writer in the current UT does not populate the meta field and the schema does not contain meta fields.

This fix writes data with a schema that contains meta fields and calls writeAvroWithMetadata for writing.

Co-authored-by: xicm <xicm@asiainfo.com>
2022-06-13 10:22:12 -04:00
superche
14d8735a1c Strip extra spaces when creating new configuration (#5849)
Co-authored-by: superche <superche@tencent.com>
2022-06-13 19:10:38 +08:00
sandyfog
c82e3462e3 [MINOR] fix AvroSchemaConverter duplicate branch in 'switch' (#5813) 2022-06-13 10:55:24 +08:00
Shiyan Xu
5aaac21d1d [HUDI-4224] Fix CI issues (#5842)
- Upgrade junit to 5.7.2
- Downgrade surefire and failsafe to 2.22.2
- Fix test failures that were previously not reported
- Improve azure pipeline configs

Co-authored-by: liujinhui1994 <965147871@qq.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
2022-06-12 11:44:18 -07:00
Y Ethan Guo
fd8f7c5f6c [HUDI-4205] Fix NullPointerException in HFile reader creation (#5841)
Replace SerializableConfiguration with SerializableWritable for broadcasting the hadoop configuration before initializing HFile readers
2022-06-11 14:46:43 -07:00
Y Ethan Guo
97ccf5dd18 [HUDI-4223] Fix NullPointerException from getLogRecordScanner when reading metadata table (#5840)
When explicitly specifying the metadata table path for reading in spark, the "hoodie.metadata.enable" is overwritten to true for proper read behavior.
2022-06-11 13:19:24 -07:00
Sivabalan Narayanan
08fe281091 [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata (#5829) 2022-06-11 13:17:42 -07:00
xi chaomin
2b3a85528a [HUDI-3889] Do not validate table config if save mode is set to Overwrite (#5619)
Co-authored-by: xicm <xicm@asiainfo.com>
2022-06-09 19:23:51 -04:00
yanenze
ba47904fa2 [HUDI-4139]improvement for flink write operator name to identify tables easily (#5744)
Co-authored-by: yanenze <yanenze@keytop.com.cn>
2022-06-09 17:48:20 -04:00
Danny Chan
c608dbd6c2 [HUDI-4213] Infer keygen clazz for Spark SQL (#5815) 2022-06-09 20:37:58 +08:00
sandyfog
8ff17b0470 [MINOR] FlinkStateBackendConverter add more exception message (#5809)
2022-06-09 15:13:27 +08:00
liuzhuang2017
f5ab921300 [MINOR][DOCS] Update the README.md file in hudi-examples (#5803) 2022-06-08 17:45:00 -07:00
Alexey Kudinkin
35afdb4316 [HUDI-4178] Addressing performance regressions in Spark DataSourceV2 Integration (#5737)
There are multiple issues with our current DataSource V2 integration: because we advertise Hudi tables as V2, Spark expects them to implement certain APIs that are not implemented at the moment; instead, we use a custom resolution rule (in HoodieSpark3Analysis) to manually fall back to the V1 APIs. This commit fixes the issue by reverting the DSv2 APIs and making Spark use V1, except for the schema evaluation logic.
2022-06-07 16:30:46 -07:00
Raymond Xu
1349b596a1 [HUDI-4198] Fix hive config for AWSGlueClientFactory (#5768)
* HiveConf needs to load fs conf to allow instantiation via AWSGlueClientFactory

* Resolve metastore uri config before loading fs conf

* Skip hiveql due to CI issue

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-06-07 20:21:31 +05:30
Sivabalan Narayanan
f85cd9b16d [HUDI-4200] Fixing sorting of keys fetched from metadata table (#5773)
- Keys fetched from the metadata table, especially through the base file reader, are not sorted, which may result in an NPE (key prefix search) or unnecessary seeks to the start of the HFile (full key lookups). This patch fixes that. This is not an issue with log blocks, since sorting is taken care of within HoodieHfileDataBlock.
- Commit where the sorting was mistakenly reverted: [HUDI-3760] Adding capability to fetch Metadata Records by prefix #5208
2022-06-07 08:19:52 -04:00
YueZhang
4f5cad8029 [MINOR][RFC-53] Fix typos (#5764)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-06-07 08:28:28 +08:00
Raymond Xu
e5710a8e7c [MINOR] Mark AWSGlueCatalogSyncClient experimental (#5775) 2022-06-07 08:25:59 +08:00
Sivabalan Narayanan
7da97c8096 [HUDI-4171] Fixing Non partitioned with virtual keys in read path (#5747)
- When the non-partitioned key generator is used with virtual keys, the read path could break since the partition path may not exist.
2022-06-06 15:48:21 -04:00
Sivabalan Narayanan
21b903fddb [HUDI-4197] Fix Async indexer to support building FILES partition (#5766)
- When the async indexer is invoked with only the "FILES" partition, it fails. This patch fixes that. Also, if the metadata table itself is not initialized and someone wants to build indexes via AsyncIndexer, they are expected to index the "FILES" partition first, followed by the other partitions. In general, AsyncIndexer is limited to building only one index at a time, hence guards have been added to ensure these conditions are met.
2022-06-06 15:47:11 -04:00
Sivabalan Narayanan
4f6fc726d0 [HUDI-4140] Fixing hive style partitioning and default partition with bulk insert row writer with SimpleKeyGen and virtual keys (#5664)
The bulk insert row writer code path had a gap with respect to hive-style partitioning and the default partition when virtual keys are enabled with SimpleKeyGen. This patch fixes the issue.
2022-06-06 10:21:00 -07:00
Alexey Kudinkin
4f7ea8c79a [HUDI-4176] Fixing TableSchemaResolver to avoid repeated HoodieCommitMetadata parsing (#5733)
As outlined in HUDI-4176, we've hit a roadblock while testing Hudi on a large dataset (~1 TB) with pretty fat commits, where Hudi's commit metadata could reach into 100s of MB.
Given the size of some of our commit metadata instances, Spark's parsing and resolving phase (when spark.sql(...) is involved, but before the returned Dataset is dereferenced) starts to dominate some of our queries' execution time.

- Rebased onto new APIs to avoid excessive Hadoop's Path allocations
- Eliminated hasOperationField completely to avoid repetitive computations
- Cleaning up duplication in HoodieActiveTimeline
- Added caching for common instances of HoodieCommitMetadata
- Made tableStructSchema lazy
2022-06-06 13:14:26 -04:00
HunterXHunter
132c0aa8c7 [HUDI-4101] When BucketIndexPartitioner take partition path for dispersion may cause the fileID of the task to not be loaded correctly (#5763)
Co-authored-by: john.wick <john.wick@vipshop.com>
2022-06-06 21:53:55 +08:00
Sagar Sumit
21ab0ff8be [HUDI-4195] Bulk insert should use right keygen for non-partitioned table (#5759) 2022-06-06 07:19:03 -04:00
Danny Chan
22c45a7704 [HUDI-4188] Fix flaky ITTestDataSTreamWrite.testWriteCopyOnWrite (#5749) 2022-06-06 12:12:48 +08:00
marchpure
73b0be3c96 [HUDI-4192] HoodieHFileReader scan top cells after bottom cells throw NullPointerException (#5755)
SeekTo top cells avoid NullPointerException
2022-06-06 12:07:26 +08:00
Y Ethan Guo
5d18b80343 [HUDI-4190] Include hbase-protocol for shading in the bundles (#5750) 2022-06-05 17:42:16 -07:00
Saisai Shao
bd26d633d7 [HUDI-4168] Add Call Procedure for marker deletion (#5738)
2022-06-05 11:05:38 +08:00
Nicolas Paris
80783c27f5 [HUDI-4187] Fix partition order in aws glue sync (#5731) 2022-06-04 02:16:52 -07:00
leesf
3759a38b99 [HUDI-4183] Fix using HoodieCatalog to create non-hudi tables (#5743) 2022-06-03 17:16:48 +08:00
KnightChess
51602a34f7 [HUDI-4179] Cluster with sort cloumns invalid (#5739) 2022-06-02 20:28:21 +08:00
Danny Chan
7f8630cc57 [HUDI-4167] Remove the timeline refresh with initializing hoodie table (#5716)
The timeline refresh on table initialization invokes the fs view #sync, which has two actions now:

1. reload the timeline of the fs view, so that the next fs view request is based on this timeline metadata
2. if this is a local fs view, clear all the local states; if this is a remote fs view, send request to sync the remote fs view

But look at the construction: the meta client is freshly instantiated, so the timeline is already the latest;
the table is also freshly constructed, so the fs view has no local state. That means the #sync is entirely unnecessary.

In this patch, the metadata lifecycle and the data set fs view are kept in sync: when the fs view is refreshed, the underlying metadata
is also refreshed synchronously. The freshness of the metadata follows the same rules as the data fs view:

1. if the fs view is local, the visibility is based on the client table metadata client's latest commit
2. if the fs view is remote, the timeline server would #sync the fs view and metadata together based on the lagging server local timeline

From the client's perspective, there is no longer any need to care about the refresh action, whether or not the metadata table is enabled.
That makes the client logic clearer and less error-prone.

Removing the timeline refresh has another benefit: it avoids unnecessary #refresh of the remote fs view. If all the clients send requests to #sync the
remote fs view, the server would encounter conflicts and the clients would encounter a response error.
2022-06-02 09:48:48 +08:00
Qi Ji
7276d0eaa6 [HUDI-3670] free temp views in sql transformers (#5080) 2022-06-01 07:35:40 -07:00
Sagar Sumit
dfcd6d9a86 [HUDI-4011] Add hudi-aws-bundle (#5674)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-06-01 05:30:29 -07:00
Danny Chan
0d069b5e57 [HUDI-4174] Add hive conf dir option for flink sink (#5725) 2022-06-01 16:17:36 +08:00
Kumud Kumar Srivatsava Tirupati
795a99ba73 [HUDI-4107] Added --sync-tool-classes config option in HoodieMultiTableDeltaStreamer (#5597)
* added --sync-tool-classes config option in multitable delta streamer

* added a testcase to assert if syncClientToolClassNames is getting picked to the deltastreamer execution context
2022-05-31 20:27:50 +05:30
Jin Xing
918c4f4e0b [HUDI-4149] Drop-Table fails when underlying table directory is broken (#5672) 2022-05-30 19:09:26 +08:00
Danny Chan
329da34ee0 [HUDI-4163] Catch general exception instead of IOException while fetching rollback plan during rollback (#5703)
If the avro file is corrupted, an InvalidAvroMagicException is thrown.
2022-05-30 13:08:02 +08:00
苏承祥
7e86884604 [HUDI-4086] Use CustomizedThreadFactory in async compaction and clustering (#5563)
Co-authored-by: 苏承祥 <sucx@tuya.com>
2022-05-28 22:35:47 -07:00
Raymond Xu
0a72458291 [HUDI-3551] Fix testStorageSchemes for oci storage (#5711) 2022-05-28 12:13:37 -07:00
Carter Shanklin
62d792368b [HUDI-3551] Add the Oracle Cloud Infrastructure (oci) Object Storage URI scheme (#4952) 2022-05-28 08:26:14 -07:00
uday08bce
48062a5708 [HUDI-4166] Added SimpleClient plugin for integ test (#5710) 2022-05-28 08:20:52 -07:00
ForwardXu
8fa8f26031 [MINOR] Fix Hive and meta sync config for sql statement (#5316) 2022-05-28 07:56:39 -07:00
wangxianghu
58014c147a [HUDI-4160] Make database regex of MaxwellJsonKafkaSourcePostProcessor optional (#5697) 2022-05-28 11:13:24 +04:00
Bo Cui
93fe5a497e [HUDI-4151] flink split_reader supports rocksdb (#5675)
2022-05-28 08:37:34 +08:00