Commit Graph

2530 Commits

Author SHA1 Message Date
Danny Chan
1d57bd17c2 [minor] Cosmetic changes following HUDI-3315 (#4934) 2022-03-02 17:44:52 +08:00
Gary Li
10d866f083 [HUDI-3315] RFC-35 Part-1 Support bucket index in Flink writer (#4679)
* Support bucket index in Flink writer
* Use record key as default index key
2022-03-02 15:14:44 +08:00
Alexey Kudinkin
85f47b53df [HUDI-3469] Refactor HoodieTestDataGenerator to provide for reproducible Builds (#4866) 2022-03-01 22:15:26 -08:00
yuzhaojing
3b2da9f138 [HUDI-2631] In CompactFunction, set up the write schema each time with the latest schema (#4000)
Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>
2022-03-02 11:18:17 +08:00
stayrascal
3cfb52c413 [MINOR] fix get builtin function issue from Hudi catalog (#4917) 2022-03-02 11:16:19 +08:00
Bo Cui
3fdc9332e5 [HUDI-3516] Implement record iterator for HoodieDataBlock (#4909)
* Use iterator to avoid eager materialization to be memory friendly
2022-03-02 10:19:36 +08:00
ForwardXu
a81a6326d5 [HUDI-3441] Add support for "marker delete" in hudi-cli (#4922) 2022-03-01 16:03:53 +08:00
Sivabalan Narayanan
f7088a957c [HUDI-3497] Adding Datatable validator tool (#4902) 2022-02-28 22:46:32 -05:00
Y Ethan Guo
257052a94d [HUDI-3465] Add validation of column stats and bloom filters in HoodieMetadataTableValidator (#4878) 2022-02-28 18:49:30 -08:00
yuzhaojing
44b8ab6048 [HUDI-3418] Save timeout option for remote RemoteFileSystemView (#4809)
Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>
2022-02-28 15:16:40 -05:00
wenningd
18dc89cf79 [HUDI-3450] Avoid passing empty string spark master to hudi cli (#4844)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2022-02-28 11:37:24 -05:00
Y Ethan Guo
05e395ae5f [HUDI-3341] Fix log file reader for S3 with hadoop-aws 2.7.x (#4897) 2022-02-28 11:14:35 -05:00
stayrascal
8f1e4f5b3e [HUDI-3528] Fix String convert issue and overwrite putAll method in TypedProperties.java (#4920) 2022-02-28 10:45:47 -05:00
Sivabalan Narayanan
4a59876c8b [HUDI-2917] rollback insert data appended to log file when using Hbase Index (#4840)
Co-authored-by: guanziyue <guanziyue@gmail.com>
2022-02-28 08:13:17 -05:00
Bo Cui
193215201c [MINOR] Change MINI_BATCH_SIZE to 2048 (#4862)
ParquetColumnarRowSplitReader#batchSize is 2048, so Changing MINI_BATCH_SIZE to 2048 will reduce memory cache.
2022-02-28 10:45:28 +08:00
Sivabalan Narayanan
d5444ff7ff [HUDI-3018] Adding validation to dataframe scheme to ensure reserved field does not have diff data type (#4852) 2022-02-27 11:59:23 -05:00
Sivabalan Narayanan
2f99e8458a [HUDI-3521] Fixing Kafka key and value serializer value type from class to string (#4919) 2022-02-27 11:13:13 -05:00
Raymond Xu
c77b2591d0 [HUDI-2439] Remove SparkBoundedInMemoryExecutor (#4860) 2022-02-26 08:02:12 -05:00
Sivabalan Narayanan
1379300b5b [HUDI-3483] Adding insert override nodes to integ test suite and few clean ups (#4895) 2022-02-26 08:00:15 -05:00
Sagar Sumit
6a5cfb45b9 [MINOR] Fix table type in input format test (#4912) 2022-02-25 13:51:53 -05:00
苏承祥
92cdc5987a [HUDI-3515] Making rdd unpersist optional at the end of writes (#4898)
Co-authored-by: 苏承祥 <sucx@tuya.com>
2022-02-25 11:30:10 -05:00
Raymond Xu
b50f4b491c [HUDI-3042] Refactor clustering executors (#4847) 2022-02-25 05:39:43 -08:00
YueZhang
742810070b [HUDI-3421]Pending clustering may break AbstractTableFileSystemView#getxxBaseFile() (#4810) 2022-02-25 16:46:27 +05:30
Danny Chan
a4ee7463ae [HUDI-3474] Add more document to Pipelines for the usage of this tool to build a write pipeline (#4906) 2022-02-25 19:08:51 +08:00
todd5167
45d1216e91 [HUDI-3401] fix NPE caused by incorrect beforeKeyGenClassName validation (#4774) 2022-02-24 23:31:29 -05:00
YueZhang
3694485609 [HUDI-3429] Support clustering scheduleAndExecute for hudi-cli and add clustering-cli Tests (#4817)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-02-24 23:28:38 -05:00
ForwardXu
aa1810d737 [HUDI-3493] Not table to get execution plan (#4894) 2022-02-24 17:04:44 -08:00
Alexey Kudinkin
85e8a5c4de [HUDI-1296] Support Metadata Table in Spark Datasource (#4789)
* Bootstrapping initial support for Metadata Table in Spark Datasource

- Consolidated Avro/Row conversion utilities to center around Spark's AvroDeserializer ; removed duplication
- Bootstrapped HoodieBaseRelation
- Updated HoodieMergeOnReadRDD to be able to handle Metadata Table
- Modified MOR relations to be able to read different Base File formats (Parquet, HFile)
2022-02-24 16:23:13 -05:00
ForwardXu
521338b4d9 [HUDI-3161] Add Call Produce Command for Spark SQL (#4535) 2022-02-24 07:45:37 -08:00
yanenze
943b99775b [HUDI-3488] The flink small file list should exclude file slices with pending compaction (#4893)
# this happens when the async-compaction has been configured

Co-authored-by: yanenze <yanenze@keytop.com.cn>
2022-02-24 14:45:03 +08:00
Sivabalan Narayanan
62605be413 [HUDI-3480][HUDI-3481] Enhancements to integ test suite (#4884) 2022-02-23 15:56:35 -05:00
leesf
2a93b8efb2 [HUDI-3489] Unify config to avoid duplicate code (#4883) 2022-02-23 08:14:30 -05:00
Y Ethan Guo
4e8accc179 [HUDI-3486] Fix wrong field order for constructing HoodieMetadataColumnStats (#4875) 2022-02-23 10:27:02 +05:30
yuzhaojing
dabae80423 [HUDI-3420] Remove duplicates type in HoodieClusteringGroup.avsc (#4808)
Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>
2022-02-23 10:49:47 +08:00
从大数据到人工智能
01cbddef78 Add hive-standalone-metastore dependency to hudi-flink-bundle module (#4870) 2022-02-23 09:16:21 +08:00
Sivabalan Narayanan
9678c3fbcf [MINOR] Fixing checkpoint management in S3IncrSource (#4871) 2022-02-22 09:15:16 -05:00
Danny Chan
b87e95d621 [HUDI-3476] Remove the shade pattern for parquet for flink bundle jar (#4869) 2022-02-22 19:21:57 +08:00
Danny Chan
4affdd0c8f [HUDI-3461] The archived timeline for flink streaming reader should not be reused (#4861)
* Before the patch, the flink streaming reader caches the meta client thus the archived timeline,
  when fetching the instant details from the reused timeline, the exception throws
* Add a method in HoodieTableMetaClient to return a fresh new archived timeline each time
2022-02-22 15:54:29 +08:00
wangxianghu
4d1f74ebea [HUDI-3464] Fix wrong exception thrown from HiveSchemaProvider (#4865) 2022-02-22 10:20:20 +04:00
Sivabalan Narayanan
14dbbdf4c7 [HUDI-2189] Adding delete partitions support to DeltaStreamer (#4787) 2022-02-22 00:01:30 -05:00
Y Ethan Guo
7e1ea06eb9 [MINOR] Fix typos and improve docs in HoodieMetadataConfig (#4867) 2022-02-21 19:36:20 -08:00
Prashant Wason
0dee8edc97 [HUDI-2925] Fix duplicate cleaning of same files when unfinished clean operations are present using a config. (#4212)
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-02-21 21:53:03 -05:00
Yann Byron
0c950181aa [HUDI-3423] upgrade spark to 3.2.1 (#4815) 2022-02-21 16:52:21 -08:00
RexAn
801fdab55c [HUDI-3042] Abstract Spark update Strategy to make code more clean and remove duplicates (#4845)
Co-authored-by: Hui An <hui.an@shopee.com>
2022-02-21 06:53:09 -08:00
Pratyaksh Sharma
bf16bc122a [HUDI-349]: Added new cleaning policy based on number of hours (#3646) 2022-02-21 09:04:42 -05:00
Sivabalan Narayanan
d36fe24c9e [HUDI-3455] Fixing checkpoint management in hoodie incr source (#4850) 2022-02-21 08:19:57 -05:00
Sivabalan Narayanan
17cb5cb433 [HUDI-3432] Fixing restore with metadata enabled (#4849)
* Fixing restore with metadata enabled

* Fixing test failures
2022-02-21 18:25:30 +05:30
leesf
76b6ad6491 [HUDI-2732][RFC-38] Spark Datasource V2 Integration (#3964) 2022-02-21 20:14:07 +08:00
YueZhang
359fbfde79 [HUDI-2648] Retry FileSystem action instead of failed directly. (#3887)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-02-20 15:31:31 -05:00
Raymond Xu
0938f55a2b [HUDI-3458] Fix BulkInsertPartitioner generic type (#4854) 2022-02-20 13:51:58 -05:00