lanyuanxiaoyao/hudi - hudi - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
ForwardXu	08fd80c913	[HUDI-3221] Support querying a table as of a savepoint (#4720 )	2022-03-08 10:02:34 -08:00
Sagar Sumit	575bc63468	[HUDI-3356][HUDI-3203] HoodieData for metadata index records; BloomFilter construction from index based on the type param (#4848) Rework of #4761. This diff introduces the following changes: - Write stats are converted to metadata index records during the commit, using the HoodieData type so that record generation scales with need. - Metadata index init support for bloom filter and column stats partitions. - When building the BloomFilter from the index records, the type param stored in the payload is used instead of a hardcoded type. - Delta writes can change column ranges, and the column stats index needs to be updated with the new ranges to stay consistent with the table dataset. This fix adds column stats index update support for delta writes. Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com>	2022-03-08 10:39:04 -05:00
Sivabalan Narayanan	29040762fa	[HUDI-3576] Configuring timeline refreshes based on latest commit (#4973 )	2022-03-07 17:01:49 -05:00
Alexey Kudinkin	a66fd40692	[HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS (#4739) - This change makes sure MT records are updated appropriately on HDFS: previously, after Log File append operations, MT records were updated with just the size of the deltas being appended to the original files, which was found to cause issues with Rollbacks, which instead updated the MT with records bearing the full file size. - To hedge against similar issues going forward, this PR removes this discrepancy and streamlines the flow so that the MT always ingests records bearing full file sizes.	2022-03-07 15:38:27 -05:00
Alexey Kudinkin	f0bcee3c01	[HUDI-3561] Avoid including whole `MultipleSparkJobExecutionStrategy` object into the closure for Spark to serialize (#4954)	2022-03-07 13:42:03 -05:00
Sivabalan Narayanan	3539578ccb	[HUDI-3213] Making commit preserve metadata to true for compaction (#4811 ) * Making commit preserve metadata to true * Fixing integ tests * Fixing preserve commit metadata for metadata table * fixed bootstrap tests * temp diff * Fixing merge handle * renaming fallback record * fixing build issue * Fixing test failures	2022-03-07 18:02:05 +05:30
Aditya Tiwari	051ad0b033	[HUDI-3130] Fixing Hive getSchema for RT tables addressing different partitions having different schemas (#4468 ) * Fixing Hive getSchema for RT tables * Addressing feedback * temp diff * fixing tests after spark datasource read support for metadata table is merged to master * Adding multi-partition schema evolution tests to HoodieRealTimeRecordReader Co-authored-by: Aditya Tiwari <aditya.tiwari@flipkart.com> Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-03-06 07:51:35 +05:30
Sivabalan Narayanan	6a46130037	[HUDI-2761] Fixing timeline server for repeated refreshes (#4812)	2022-03-05 10:04:16 +08:00
shibei	62f534d002	[HUDI-3445] Support Clustering Command Based on Call Procedure Command for Spark SQL (#4901) Co-authored-by: shibei <huberylee.li@alibaba-inc.com>	2022-03-04 09:33:16 +08:00
Sivabalan Narayanan	876a891979	[HUDI-3544] Fixing "populate meta fields" update to metadata table (#4941 ) * Fixing populateMeta fields update to metadata table * Fix checkstyle violations Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-03-03 17:02:25 +05:30
Danny Chan	1d57bd17c2	[minor] Cosmetic changes following HUDI-3315 (#4934 )	2022-03-02 17:44:52 +08:00
Gary Li	10d866f083	[HUDI-3315] RFC-35 Part-1 Support bucket index in Flink writer (#4679 ) * Support bucket index in Flink writer * Use record key as default index key	2022-03-02 15:14:44 +08:00
Alexey Kudinkin	85f47b53df	[HUDI-3469] Refactor `HoodieTestDataGenerator` to provide for reproducible Builds (#4866 )	2022-03-01 22:15:26 -08:00
Bo Cui	3fdc9332e5	[HUDI-3516] Implement record iterator for HoodieDataBlock (#4909) * Use iterator to avoid eager materialization to be memory friendly	2022-03-02 10:19:36 +08:00
yuzhaojing	44b8ab6048	[HUDI-3418] Save timeout option for remote RemoteFileSystemView (#4809 ) Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>	2022-02-28 15:16:40 -05:00
Sivabalan Narayanan	4a59876c8b	[HUDI-2917] rollback insert data appended to log file when using Hbase Index (#4840 ) Co-authored-by: guanziyue <guanziyue@gmail.com>	2022-02-28 08:13:17 -05:00
Sivabalan Narayanan	d5444ff7ff	[HUDI-3018] Adding validation to dataframe schema to ensure reserved field does not have a different data type (#4852)	2022-02-27 11:59:23 -05:00
Raymond Xu	c77b2591d0	[HUDI-2439] Remove SparkBoundedInMemoryExecutor (#4860 )	2022-02-26 08:02:12 -05:00
苏承祥	92cdc5987a	[HUDI-3515] Making rdd unpersist optional at the end of writes (#4898 ) Co-authored-by: 苏承祥 <sucx@tuya.com>	2022-02-25 11:30:10 -05:00
Raymond Xu	b50f4b491c	[HUDI-3042] Refactor clustering executors (#4847 )	2022-02-25 05:39:43 -08:00
YueZhang	742810070b	[HUDI-3421] Pending clustering may break AbstractTableFileSystemView#getxxBaseFile() (#4810)	2022-02-25 16:46:27 +05:30
Alexey Kudinkin	85e8a5c4de	[HUDI-1296] Support Metadata Table in Spark Datasource (#4789 ) * Bootstrapping initial support for Metadata Table in Spark Datasource - Consolidated Avro/Row conversion utilities to center around Spark's AvroDeserializer ; removed duplication - Bootstrapped HoodieBaseRelation - Updated HoodieMergeOnReadRDD to be able to handle Metadata Table - Modified MOR relations to be able to read different Base File formats (Parquet, HFile)	2022-02-24 16:23:13 -05:00
Prashant Wason	0dee8edc97	[HUDI-2925] Fix duplicate cleaning of same files when unfinished clean operations are present using a config. (#4212 ) Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-02-21 21:53:03 -05:00
Yann Byron	0c950181aa	[HUDI-3423] upgrade spark to 3.2.1 (#4815 )	2022-02-21 16:52:21 -08:00
RexAn	801fdab55c	[HUDI-3042] Abstract Spark update Strategy to make code more clean and remove duplicates (#4845 ) Co-authored-by: Hui An <hui.an@shopee.com>	2022-02-21 06:53:09 -08:00
Pratyaksh Sharma	bf16bc122a	[HUDI-349]: Added new cleaning policy based on number of hours (#3646 )	2022-02-21 09:04:42 -05:00
Sivabalan Narayanan	17cb5cb433	[HUDI-3432] Fixing restore with metadata enabled (#4849 ) * Fixing restore with metadata enabled * Fixing test failures	2022-02-21 18:25:30 +05:30
YueZhang	359fbfde79	[HUDI-2648] Retry FileSystem action instead of failing directly. (#3887) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-02-20 15:31:31 -05:00
Raymond Xu	0938f55a2b	[HUDI-3458] Fix BulkInsertPartitioner generic type (#4854 )	2022-02-20 13:51:58 -05:00
Bo Cui	83279971a1	[HUDI-3446] Supports batch reader in BootstrapOperator#loadRecords (#4837)	2022-02-19 21:21:48 +08:00
RexAn	5009138d04	[HUDI-3438] Avoid getSmallFiles if hoodie.parquet.small.file.limit is 0 (#4823 ) Co-authored-by: Hui An <hui.an@shopee.com>	2022-02-18 08:57:04 -05:00
Y Ethan Guo	fba5822ee3	[HUDI-3430] Fix Deltastreamer to properly shut down the services upon failure (#4824 )	2022-02-18 08:44:56 -05:00
luokey	de8161ae96	HoodieSortedMergeHandle#close write data disorder (#4841 ) Co-authored-by: 854194341@qq.com <loukey_7821>	2022-02-18 13:31:38 +04:00
Sagar Sumit	ed106f671e	[HUDI-2809] Introduce a checksum mechanism for validating hoodie.properties (#4712) * Fix dependency conflict * Fix repairs command * Implement putIfAbsent for DDB lock provider * Add upgrade step and validate while fetching configs * Validate checksum for latest table version only while fetching config * Move generateChecksum to BinaryUtil * Rebase and resolve conflict * Fix table version check	2022-02-18 10:17:06 +05:30
Sagar Sumit	ba0afe1426	[HUDI-3426] Sync datasource clustering config (#4828 )	2022-02-16 19:02:49 -05:00
YueZhang	3363c66468	[HUDI-3394] Check isWriteLockedByCurrentThread before unlock for InProcessLockProvider (#4819 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2022-02-15 22:41:25 -08:00
Y Ethan Guo	9a05940a74	[HUDI-3366] Remove hardcoded logic of disabling metadata table in tests (#4792 )	2022-02-15 16:41:47 -05:00
Raymond Xu	538ec44fa8	[HUDI-2931] Add config to disable table services (#4777 )	2022-02-15 09:49:53 -05:00
Yann Byron	cb6ca7f0d1	[HUDI-3204] fix problem that spark on TimestampKeyGenerator has no re… (#4714 )	2022-02-14 23:38:38 -05:00
Raymond Xu	27bd7b538e	[HUDI-1576] Make archiving an async service (#4795 )	2022-02-14 21:15:06 -05:00
Raymond Xu	bcfd8efe66	[MINOR] Prevent async service from starting twice (#4801 )	2022-02-14 11:06:31 -08:00
leesf	0db1e978c6	[HUDI-3254] Introduce HoodieCatalog to manage tables for Spark Datasource V2 (#4611 )	2022-02-14 06:26:58 -08:00
YueZhang	76e2faa28d	[HUDI-3370] The files recorded in the commit may not match the actual ones for MOR Compaction (#4753 ) * use HoodieCommitMetadata to replace writeStatuses computation Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-02-14 11:12:52 +08:00
Y Ethan Guo	6aba00e84f	[MINOR] Fix typos in Spark client related classes (#4781 )	2022-02-13 06:41:58 -08:00
satishkotha	89ed6f062e	[HUDI-3362] Fix restore to rollback pending clustering operations followed by rolling back other commits (#4772)	2022-02-11 14:12:45 -05:00
Yann Byron	b431246710	[HUDI-3338] Custom relation instead of HadoopFsRelation (#4709) Currently, HadoopFsRelation uses the value of the real partition path as the value of the partition field. However, unlike a normal table, Hudi persists the partition value in the parquet file, and in some cases the value of the real partition path differs from the value of the partition field. So here we implement BaseFileOnlyViewRelation, which lets Hudi manage its own relation.	2022-02-11 10:48:44 -08:00
Yann Byron	10474e0962	[HUDI-3402] Set TIMESTAMP_MICROS as the default value for hoodie.parquet.outputtimestamptype (#4749 )	2022-02-11 12:23:55 -05:00
Sivabalan Narayanan	ba4e732ba7	[HUDI-2987] Update all deprecated calls to new apis in HoodieRecordPayload (#4681 )	2022-02-10 19:19:33 -05:00
Sivabalan Narayanan	e7ec3a82dc	[HUDI-2432] Adding restore.requested instant and restore plan for restore action (#4605 ) - This adds a restore plan and serializes it to restore.requested meta file in timeline. This also means that we are introducing schedule and execution phases for restore which was not present before.	2022-02-10 08:06:23 -05:00
Sivabalan Narayanan	0ababcfaa7	[HUDI-1847] Adding inline scheduling support for spark datasource path for compaction and clustering (#4420 ) - This adds support in spark-datasource to just schedule table services inline so that users can leverage async execution w/o the need for lock service providers.	2022-02-10 08:04:55 -05:00


801 Commits