1
0
Commit Graph

398 Commits

Author SHA1 Message Date
Alexey Kudinkin
189d5297b8 [HUDI-3709] Fixing ParquetWriter impls not respecting Parquet Max File Size limit (#5129) 2022-03-26 17:51:36 -04:00
RexAn
57b4f39c31 [HUDI-3612] Clustering strategy should create new TypedProperties when modifying it (#5027) 2022-03-26 16:16:03 +05:30
Danny Chan
0c09a973fb [HUDI-3435] Do not throw exception when instant to rollback does not exist in metadata table active timeline (#4821) 2022-03-26 11:42:54 +08:00
Danny Chan
8896864d7b [HUDI-3678] Fix record rewrite of create handle when 'preserveMetadata' is true (#5088) 2022-03-25 11:48:50 +08:00
Y Ethan Guo
eaa4c4f2e2 [HUDI-1180] Upgrade HBase to 2.4.9 (#5004)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-03-24 19:04:53 -07:00
Danny Chan
5e86cdd1e9 [HUDI-3701] Flink bulk_insert support bucket hash index (#5118) 2022-03-25 09:01:42 +08:00
Y Ethan Guo
608d4bf32d [HUDI-3638] Make ZookeeperBasedLockProvider serializable (#5112) 2022-03-24 17:59:47 -07:00
Sagar Sumit
f96ba7abf0 [HUDI-3642] Handle NPE due to empty requested replacecommit metadata (#5090) 2022-03-23 12:13:02 -07:00
Rajesh Mahindra
5f570ea151 [HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs (#4175)
- Refactor hive sync tool / config to use reflection and standardize configs

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-03-21 22:56:31 -04:00
Y Ethan Guo
9b6e138af2 [HUDI-3640] Set SimpleKeyGenerator as default in 2to3 table upgrade for Spark engine (#5075) 2022-03-21 20:35:06 -04:00
Pratyaksh Sharma
ca0931d332 [HUDI-1436]: Provide an option to trigger clean every nth commit (#4385)
- Provided option to trigger clean every nth commit with default number of commits as 1 so that existing users are not affected.
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-03-21 20:06:30 -04:00
Raymond Xu
7446ff95a7 [HUDI-2439] Replace RDD with HoodieData in HoodieSparkTable and commit executors (#4856)
- Adopt HoodieData in Spark action commit executors
- Make Spark independent DeleteHelper, WriteHelper, MergeHelper in hudi-client-common
- Make HoodieTable in WriteClient APIs have raw type to decouple with Client's generic types
2022-03-17 04:17:56 -07:00
Y Ethan Guo
5ba2d9ab2f [HUDI-3494] Consider triggering condition of MOR compaction during archival (#4974) 2022-03-17 01:28:11 -04:00
Y Ethan Guo
95e6e53810 [HUDI-3404] Automatically adjust write configs based on metadata table and write concurrency mode (#4975) 2022-03-17 01:25:04 -04:00
liujinhui
e60acc1258 [HUDI-3583] Fix MarkerBasedRollbackStrategy NoSuchElementException (#4984)
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
2022-03-12 23:00:50 -08:00
Sivabalan Narayanan
e7bb0413af [HUDI-3556] Re-use rollback instant for rolling back of clustering and compaction if rollback failed mid-way (#4971) 2022-03-11 18:40:13 -05:00
Danny Chan
ec24407191 [HUDI-3581] Reorganize some clazz for hudi flink (#4983) 2022-03-10 15:55:15 +08:00
Alexey Kudinkin
034addaef5 [HUDI-3396] Make sure BaseFileOnlyViewRelation only reads projected columns (#4818)
NOTE: This change is first part of the series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them, implementations are consistent and performant

This PR is making sure that BaseFileOnlyViewRelation only reads projected columns as well as avoiding unnecessary serde from Row to InternalRow

Brief change log
- Introduced HoodieBaseRDD as a base for all custom RDD impls
- Extracted common fields/methods to HoodieBaseRelation
- Cleaned up and streamlined HoodieBaseFileViewOnlyRelation
- Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions
2022-03-09 21:45:25 -05:00
Sivabalan Narayanan
4324e874ae [HUDI-3587] Making SupportsUpgradeDowngrade serializable (#4991) 2022-03-09 00:04:42 -05:00
Sagar Sumit
575bc63468 [HUDI-3356][HUDI-3203] HoodieData for metadata index records; BloomFilter construction from index based on the type param (#4848)
Rework of #4761 
This diff introduces following changes:

- Write stats are converted to metadata index records during the commit. Making them use the HoodieData type so that the record generation scales up with needs. 
- Metadata index init support for bloom filter and column stats partitions.
- When building the BloomFilter from the index records, using the type param stored in the payload instead of hardcoded type.
- Delta writes can change column ranges and the column stats index need to be properly updated with new ranges to be consistent with the table dataset. This fix add column stats index update support for the delta writes.

Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com>
2022-03-08 10:39:04 -05:00
Sivabalan Narayanan
29040762fa [HUDI-3576] Configuring timeline refreshes based on latest commit (#4973) 2022-03-07 17:01:49 -05:00
Alexey Kudinkin
a66fd40692 [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS (#4739)
- This change makes sure MT records are updated appropriately on HDFS: previously after Log File append operations MT records were updated w/ just the size of the deltas being appended to the original files, which have been found to be the cause of issues in case of Rollbacks that were instead updating MT with records bearing the full file-size.

- To make sure that we hedge against similar issues going f/w, this PR alleviates this discrepancy and streamlines the flow of MT table always ingesting records bearing full file-sizes.
2022-03-07 15:38:27 -05:00
Alexey Kudinkin
f0bcee3c01 [HUDI-3561] Avoid including whole MultipleSparkJobExecutionStrategy object into the closure for Spark to serialize (#4954)
- Avoid including whole MultipleSparkJobExecutionStrategy object into the closure for Spark to serialize
2022-03-07 13:42:03 -05:00
Sivabalan Narayanan
3539578ccb [HUDI-3213] Making commit preserve metadata to true for compaction (#4811)
* Making commit preserve metadata to true

* Fixing integ tests

* Fixing preserve commit metadata for metadata table

* fixed bootstrap tests

* temp diff

* Fixing merge handle

* renaming fallback record

* fixing build issue

* Fixing test failures
2022-03-07 18:02:05 +05:30
shibei
62f534d002 [HUDI-3445] Support Clustering Command Based on Call Procedure Command for Spark SQL (#4901)
* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

Co-authored-by: shibei <huberylee.li@alibaba-inc.com>
2022-03-04 09:33:16 +08:00
Sivabalan Narayanan
876a891979 [HUDI-3544] Fixing "populate meta fields" update to metadata table (#4941)
* Fixing populateMeta fields update to metadata table

* Fix checkstyle violations

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-03-03 17:02:25 +05:30
Danny Chan
1d57bd17c2 [minor] Cosmetic changes following HUDI-3315 (#4934) 2022-03-02 17:44:52 +08:00
Gary Li
10d866f083 [HUDI-3315] RFC-35 Part-1 Support bucket index in Flink writer (#4679)
* Support bucket index in Flink writer
* Use record key as default index key
2022-03-02 15:14:44 +08:00
Bo Cui
3fdc9332e5 [HUDI-3516] Implement record iterator for HoodieDataBlock (#4909)
*  Use iterator to void eager materialization to be memory friendly
2022-03-02 10:19:36 +08:00
yuzhaojing
44b8ab6048 [HUDI-3418] Save timeout option for remote RemoteFileSystemView (#4809)
Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>
2022-02-28 15:16:40 -05:00
Sivabalan Narayanan
4a59876c8b [HUDI-2917] rollback insert data appended to log file when using Hbase Index (#4840)
Co-authored-by: guanziyue <guanziyue@gmail.com>
2022-02-28 08:13:17 -05:00
Raymond Xu
c77b2591d0 [HUDI-2439] Remove SparkBoundedInMemoryExecutor (#4860) 2022-02-26 08:02:12 -05:00
苏承祥
92cdc5987a [HUDI-3515] Making rdd unpersist optional at the end of writes (#4898)
Co-authored-by: 苏承祥 <sucx@tuya.com>
2022-02-25 11:30:10 -05:00
Raymond Xu
b50f4b491c [HUDI-3042] Refactor clustering executors (#4847) 2022-02-25 05:39:43 -08:00
Alexey Kudinkin
85e8a5c4de [HUDI-1296] Support Metadata Table in Spark Datasource (#4789)
* Bootstrapping initial support for Metadata Table in Spark Datasource

- Consolidated Avro/Row conversion utilities to center around Spark's AvroDeserializer ; removed duplication
- Bootstrapped HoodieBaseRelation
- Updated HoodieMergeOnReadRDD to be able to handle Metadata Table
- Modified MOR relations to be able to read different Base File formats (Parquet, HFile)
2022-02-24 16:23:13 -05:00
Prashant Wason
0dee8edc97 [HUDI-2925] Fix duplicate cleaning of same files when unfinished clean operations are present using a config. (#4212)
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-02-21 21:53:03 -05:00
Pratyaksh Sharma
bf16bc122a [HUDI-349]: Added new cleaning policy based on number of hours (#3646) 2022-02-21 09:04:42 -05:00
YueZhang
359fbfde79 [HUDI-2648] Retry FileSystem action instead of failed directly. (#3887)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-02-20 15:31:31 -05:00
Raymond Xu
0938f55a2b [HUDI-3458] Fix BulkInsertPartitioner generic type (#4854) 2022-02-20 13:51:58 -05:00
Bo Cui
83279971a1 [HUDI-3446] Supports batch reader in BootstrapOperator#loadRecords (#4837)
* [HUDI-3446] Supports batch Reader in BootstrapOperator#loadRecords
2022-02-19 21:21:48 +08:00
RexAn
5009138d04 [HUDI-3438] Avoid getSmallFiles if hoodie.parquet.small.file.limit is 0 (#4823)
Co-authored-by: Hui An <hui.an@shopee.com>
2022-02-18 08:57:04 -05:00
Y Ethan Guo
fba5822ee3 [HUDI-3430] Fix Deltastreamer to properly shut down the services upon failure (#4824) 2022-02-18 08:44:56 -05:00
luokey
de8161ae96 HoodieSortedMergeHandle#close write data disorder (#4841)
Co-authored-by: 854194341@qq.com <loukey_7821>
2022-02-18 13:31:38 +04:00
Sagar Sumit
ed106f671e [HUDI-2809] Introduce a checksum mechanism for validating hoodie.properties (#4712)
Fix dependency conflict

Fix repairs command

Implement putIfAbsent for DDB lock provider

Add upgrade step and validate while fetching configs

Validate checksum for latest table version only while fetching config

Move generateChecksum to BinaryUtil

Rebase and resolve conflict

Fix table version check
2022-02-18 10:17:06 +05:30
Sagar Sumit
ba0afe1426 [HUDI-3426] Sync datasource clustering config (#4828) 2022-02-16 19:02:49 -05:00
YueZhang
3363c66468 [HUDI-3394] Check isWriteLockedByCurrentThread before unlock for InProcessLockProvider (#4819)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
2022-02-15 22:41:25 -08:00
Y Ethan Guo
9a05940a74 [HUDI-3366] Remove hardcoded logic of disabling metadata table in tests (#4792) 2022-02-15 16:41:47 -05:00
Raymond Xu
538ec44fa8 [HUDI-2931] Add config to disable table services (#4777) 2022-02-15 09:49:53 -05:00
Yann Byron
cb6ca7f0d1 [HUDI-3204] fix problem that spark on TimestampKeyGenerator has no re… (#4714) 2022-02-14 23:38:38 -05:00
Raymond Xu
27bd7b538e [HUDI-1576] Make archiving an async service (#4795) 2022-02-14 21:15:06 -05:00