lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Sivabalan Narayanan	7dfb168003	[HUDI-3763] Fixing hadoop conf class loading for inline reading (#5194 )	2022-04-01 11:27:40 -04:00
Danny Chan	6df14f15a3	[HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC (#4880 )	2022-04-01 20:46:51 +08:00
ForwardXu	98b4e9796e	[HUDI-3406] Rollback incorrectly relying on FS listing instead of Com… (#4957 ) * [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata * fix comments	2022-04-01 10:01:41 +08:00
Sagar Sumit	a048e940fd	[HUDI-3743] Support DELETE_PARTITION for metadata table (#5169 ) In order to drop any metadata partition (index), we can reuse the DELETE_PARTITION operation in the metadata table. Subsequent to this, we can support drop index (with table config update) for the async metadata indexer. - Add a new API in HoodieTableMetadataWriter - Currently only supported for the Spark metadata writer	2022-03-31 21:29:17 -04:00
Sagar Sumit	28dafa774e	[HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693 ) - Add a new action called INDEX, whose state transition is described in the RFC. - Changes in timeline to support the new action. - Add an index planner in ScheduleIndexActionExecutor. - Add index plan executor in RunIndexActionExecutor. - Add 3 APIs in HoodieTableMetadataWriter; a) scheduleIndex: will generate an index plan based on latest completed instant, initialize file groups and add a requested INDEX instant, b) index: executes the index plan and also takes care of writes that happened after indexing was requested, c) dropIndex: will drop index by removing the given metadata partition. - Add 2 new table configs to serve as the source of truth for inflight and completed indexes. - Support upgrade/downgrade taking care of the newly added configs. - Add tool to trigger indexing in HoodieIndexer. - Handle corner cases related to partial failures. - Abort gracefully after deleting partition and instant. - Handle other actions in timeline to consider before catching up	2022-04-01 01:33:12 +05:30
Sivabalan Narayanan	73a21092f8	[HUDI-3732] Fixing rollback validation (#5157 ) * Fixing rollback validation * Adding tests	2022-03-31 17:25:24 +05:30
ForwardXu	80011df995	[HUDI-3135] Make delete partitions lazy to be executed by the cleaner (#4489 ) As of now, delete partitions ensures that all file groups are deleted, but the partition itself is not deleted. So, get all partitions might return the deleted partitions as well, but no data will be served since all file groups are deleted. With this patch, we are fixing it. We are letting the cleaner take care of deleting a partition when all file groups pertaining to that partition are deleted. - Fixed the CleanPlanActionExecutor to return meta info about the list of partitions to be deleted. If there are no valid file groups for a partition, the clean planner will include the partition to be deleted. - Fixed HoodieCleanPlan avro schema to include the list of partitions to be deleted - CleanActionExecutor is fixed to delete partitions if any (as per clean plan) - Same info is added to HoodieCleanMetadata - The metadata table, when applying clean metadata, will check for partitions to be deleted and will update the "all_partitions" record for the deleted partitions. Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-03-31 03:35:39 -04:00
Yuwei XIAO	ce45f7f129	[HUDI-3692] MetadataFileSystemView includes compaction in timeline (#5110 )	2022-03-31 02:24:59 -04:00
xiarixiaoyao	2d73c8ae86	[HUDI-3355] Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy (#4962 )	2022-03-30 15:54:25 -07:00
YueZhang	2b60641d17	[HUDI-3635] Fix HoodieMetadataTableValidator around comparison of partition path listing (#5100 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-03-30 17:23:37 -04:00
Sivabalan Narayanan	4fed8dd319	[HUDI-3485] Adding scheduler pool configs for async clustering (#5043 )	2022-03-29 21:27:45 -04:00
Alexey Kudinkin	fcb003ec76	[HUDI-3731] Fixing Column Stats Index record Merging sequence missing `columnName` (#5159 ) * Added `DataSkippingFailureMode` to control how DS handles failures in the flow (either "strict", when an exception would be thrown, or "fallback", when it will just fall back to the full scan) * Make sure tests execute in `DataSkippingFailureMode.Strict` * Fixed Column Stats Index record merging sequence missing `columnName`	2022-03-29 21:09:56 +05:30
Raymond Xu	6ccbae4d2a	[HUDI-2757] Implement Hudi AWS Glue sync (#5076 )	2022-03-28 14:54:59 -04:00
Sivabalan Narayanan	f2a93ead3b	[HUDI-3724] Fixing closure of ParquetReader (#5141 )	2022-03-28 09:36:15 +08:00
Danny Chan	0c09a973fb	[HUDI-3435] Do not throw exception when instant to rollback does not exist in metadata table active timeline (#4821 )	2022-03-26 11:42:54 +08:00
Danny Chan	8896864d7b	[HUDI-3678] Fix record rewrite of create handle when 'preserveMetadata' is true (#5088 )	2022-03-25 11:48:50 +08:00
Y Ethan Guo	eaa4c4f2e2	[HUDI-1180] Upgrade HBase to 2.4.9 (#5004 ) Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-03-24 19:04:53 -07:00
Alexey Kudinkin	ccc3728002	[HUDI-3684] Fixing NPE in `ParquetUtils` (#5102 ) * Make sure nulls are properly handled in `HoodieColumnRangeMetadata`	2022-03-24 17:37:38 +05:30
Danny Chan	a1c42fcc07	[minor] Checks the data block type for archived timeline (#5106 )	2022-03-24 14:10:43 +08:00
Sivabalan Narayanan	52f0498330	Fixing non partitioned all files record in MDT (#5108 )	2022-03-23 19:26:39 -07:00
Sagar Sumit	f96ba7abf0	[HUDI-3642] Handle NPE due to empty requested replacecommit metadata (#5090 )	2022-03-23 12:13:02 -07:00
Rajesh Mahindra	5f570ea151	[HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs (#4175 ) - Refactor hive sync tool / config to use reflection and standardize configs Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-03-21 22:56:31 -04:00
Pratyaksh Sharma	ca0931d332	[HUDI-1436]: Provide an option to trigger clean every nth commit (#4385 ) - Provided option to trigger clean every nth commit with default number of commits as 1 so that existing users are not affected. Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-03-21 20:06:30 -04:00
Y Ethan Guo	15d1c18625	[MINOR] Remove flaky assert in TestInLineFileSystem (#5069 )	2022-03-20 18:58:30 -04:00
Alexey Kudinkin	1b6e201160	[HUDI-3663] Fixing Column Stats index to properly handle first Data Table commit (#5070 ) * Fixed metadata conversion util to extract schema from `HoodieCommitMetadata` * Fixed failure to fetch columns to index in empty table * Abort indexing seq in case there are no columns to index * Fallback to index at least primary key columns, in case no writer schema could be obtained to index all columns * Fixed `getRecordFields` incorrectly ignoring default value * Make sure Hudi metadata fields are also indexed	2022-03-20 10:24:13 +05:30
Alexey Kudinkin	099c2c099a	[HUDI-3457] Refactored Spark DataSource Relations to avoid code duplication (#4877 ) Refactoring Spark DataSource Relations to avoid code duplication. The following Relations were in scope: - BaseFileOnlyViewRelation - MergeOnReadSnapshotRelation - MergeOnReadIncrementalRelation	2022-03-18 22:32:16 -07:00
Raymond Xu	7446ff95a7	[HUDI-2439] Replace RDD with HoodieData in HoodieSparkTable and commit executors (#4856 ) - Adopt HoodieData in Spark action commit executors - Make Spark independent DeleteHelper, WriteHelper, MergeHelper in hudi-client-common - Make HoodieTable in WriteClient APIs have raw type to decouple with Client's generic types	2022-03-17 04:17:56 -07:00
Y Ethan Guo	5ba2d9ab2f	[HUDI-3494] Consider triggering condition of MOR compaction during archival (#4974 )	2022-03-17 01:28:11 -04:00
Sagar Sumit	d514570e90	[HUDI-3633] Allow non-string values to be set in TypedProperties (#5045 ) * [HUDI-3633] Allow non-string values to be set in TypedProperties * Override getProperty to ignore instanceof string check	2022-03-15 22:33:22 +04:00
Alexey Kudinkin	5e8ff8d793	[HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index (#4948 )	2022-03-15 10:38:36 -07:00
l-shen	9bdda2a312	[HUDI-3619] Fix HoodieOperation fromValue using wrong constant value (#5033 ) Co-authored-by: root <l-shen@localhost.localdomain>	2022-03-15 16:34:31 +04:00
wangxianghu	3b59b76952	[HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string (#4987 ) * [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string * add ut * Address comment	2022-03-15 15:06:30 +04:00
Sagar Sumit	eee96e9af3	[HUDI-3593] Restore TypedProperties and flush checksum in table config (#5013 ) * Create new TypedProperties while performing clustering * Add OrderedProperties and minor refactoring * Add javadoc and remove getters from OrderedProperties	2022-03-13 07:58:55 +05:30
RexAn	93277b2bcd	[HUDI-3592] Fix NPE of DefaultHoodieRecordPayload if Property is empty (#4999 ) Co-authored-by: Rex An <bonean131@gmail.com>	2022-03-11 17:45:40 -05:00
苏承祥	faed6996ee	[HUDI-3566] Add thread factory in BoundedInMemoryExecutor (#4926 ) Co-authored-by: 苏承祥 <sucx@tuya.com>	2022-03-11 18:58:49 +08:00
Sivabalan Narayanan	9dc6df5dca	[HUDI-3595] Fixing NULL schema provider for empty batch (#5002 )	2022-03-10 22:52:55 -05:00
Alexey Kudinkin	034addaef5	[HUDI-3396] Make sure `BaseFileOnlyViewRelation` only reads projected columns (#4818 ) NOTE: This change is the first part of a series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them and that implementations are consistent and performant. This PR makes sure that BaseFileOnlyViewRelation only reads projected columns and avoids unnecessary serde from Row to InternalRow. Brief change log: - Introduced HoodieBaseRDD as a base for all custom RDD impls - Extracted common fields/methods to HoodieBaseRelation - Cleaned up and streamlined HoodieBaseFileViewOnlyRelation - Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions	2022-03-09 21:45:25 -05:00
MrSleeping123	8859b48b2a	[HUDI-3383] Sync column comments while syncing a hive table (#4960 ) Desc: Add a hive sync config (hoodie.datasource.hive_sync.sync_comment). This config defaults to false. While syncing a data source to Hudi, if column comments have been added to the source avro schema and sync_comment is true, the column comments are synced to the hive table.	2022-03-10 09:44:39 +08:00
Sagar Sumit	575bc63468	[HUDI-3356][HUDI-3203] HoodieData for metadata index records; BloomFilter construction from index based on the type param (#4848 ) Rework of #4761 This diff introduces the following changes: - Write stats are converted to metadata index records during the commit, using the HoodieData type so that record generation scales up with needs. - Metadata index init support for bloom filter and column stats partitions. - When building the BloomFilter from the index records, use the type param stored in the payload instead of a hardcoded type. - Delta writes can change column ranges, and the column stats index needs to be properly updated with the new ranges to be consistent with the table dataset. This fix adds column stats index update support for delta writes. Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com>	2022-03-08 10:39:04 -05:00
Alexey Kudinkin	a66fd40692	[HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS (#4739 ) - This change makes sure MT records are updated appropriately on HDFS: previously, after Log File append operations, MT records were updated with just the size of the deltas being appended to the original files, which has been found to be the cause of issues in case of Rollbacks, which were instead updating MT with records bearing the full file size. - To hedge against similar issues going forward, this PR removes this discrepancy and streamlines the flow so that the MT always ingests records bearing full file sizes.	2022-03-07 15:38:27 -05:00
Sivabalan Narayanan	3539578ccb	[HUDI-3213] Making commit preserve metadata to true for compaction (#4811 ) * Making commit preserve metadata to true * Fixing integ tests * Fixing preserve commit metadata for metadata table * fixed bootstrap tests * temp diff * Fixing merge handle * renaming fallback record * fixing build issue * Fixing test failures	2022-03-07 18:02:05 +05:30
苏承祥	6f57bbfac4	[HUDI-3069] Improve HoodieMergedLogRecordScanner to avoid putting unnecessary hoodie records (#4932 ) * log scanner optimization * payload equals switches to `=` Co-authored-by: 苏承祥 <sucx@tuya.com>	2022-03-07 14:35:55 +08:00
Aditya Tiwari	051ad0b033	[HUDI-3130] Fixing Hive getSchema for RT tables addressing different partitions having different schemas (#4468 ) * Fixing Hive getSchema for RT tables * Addressing feedback * temp diff * fixing tests after spark datasource read support for metadata table is merged to master * Adding multi-partition schema evolution tests to HoodieRealTimeRecordReader Co-authored-by: Aditya Tiwari <aditya.tiwari@flipkart.com> Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-03-06 07:51:35 +05:30
Yuwei XIAO	f449807630	[MINOR] fix UTC timezone config (#4950 )	2022-03-04 11:09:39 +04:00
shibei	62f534d002	[HUDI-3445] Support Clustering Command Based on Call Procedure Command for Spark SQL (#4901 ) * [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL Co-authored-by: shibei <huberylee.li@alibaba-inc.com>	2022-03-04 09:33:16 +08:00
Danny Chan	a4ba0fff07	[HUDI-3552] Strengthen the NetworkUtils#getHostname by checking network interfaces first (#4942 ) * In some complex network environments, the current code returns the wildcard address 0.0.0.0, which is not desired.	2022-03-03 21:11:08 +08:00
Alexey Kudinkin	85f47b53df	[HUDI-3469] Refactor `HoodieTestDataGenerator` to provide for reproducible Builds (#4866 )	2022-03-01 22:15:26 -08:00
Bo Cui	3fdc9332e5	[HUDI-3516] Implement record iterator for HoodieDataBlock (#4909 ) * Use iterator to avoid eager materialization to be memory friendly	2022-03-02 10:19:36 +08:00
Y Ethan Guo	257052a94d	[HUDI-3465] Add validation of column stats and bloom filters in HoodieMetadataTableValidator (#4878 )	2022-02-28 18:49:30 -08:00
yuzhaojing	44b8ab6048	[HUDI-3418] Save timeout option for remote RemoteFileSystemView (#4809 ) Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>	2022-02-28 15:16:40 -05:00

621 Commits