lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Sagar Sumit	a048e940fd	[HUDI-3743] Support DELETE_PARTITION for metadata table (#5169 ) In order to drop any metadata partition (index), we can reuse the DELETE_PARTITION operation in metadata table. Subsequent to this, we can support drop index (with table config update) for async metadata indexer. - Add a new API in HoodieTableMetadataWriter - Currently only supported for Spark metadata writer	2022-03-31 21:29:17 -04:00
Sagar Sumit	28dafa774e	[HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693 ) - Add a new action called INDEX, whose state transition is described in the RFC. - Changes in timeline to support the new action. - Add an index planner in ScheduleIndexActionExecutor. - Add index plan executor in RunIndexActionExecutor. - Add 3 APIs in HoodieTableMetadataWriter; a) scheduleIndex: will generate an index plan based on latest completed instant, initialize file groups and add a requested INDEX instant, b) index: executes the index plan and also takes care of writes that happened after indexing was requested, c) dropIndex: will drop index by removing the given metadata partition. - Add 2 new table configs to serve as the source of truth for inflight and completed indexes. - Support upgrade/downgrade taking care of the newly added configs. - Add tool to trigger indexing in HoodieIndexer. - Handle corner cases related to partial failures. - Abort gracefully after deleting partition and instant. - Handle other actions in timeline to consider before catching up	2022-04-01 01:33:12 +05:30
liuhe0702	1da196c1e8	[HUDI-2777] Improve HoodieSparkSqlWriter write performance (#5187 )	2022-03-31 12:48:47 -07:00
codejoyan	51a701cef1	[HUDI-3020] Utility to create manifest file (#5153 ) Co-authored-by: joyan <joyan.sil@walmart.com>	2022-03-31 07:22:03 -07:00
xiarixiaoyao	7889c7852f	[HUDI-3729][SPARK] fixed the perf regression by enabling vectorizeReader for parquet file (#5168 ) * [MINOR][SPARK] fixed the perf regression by enabling vectorizeReader for parquet file * address comments * add perf result	2022-03-31 20:09:26 +08:00
Sivabalan Narayanan	73a21092f8	[HUDI-3732] Fixing rollback validation (#5157 ) * Fixing rollback validation * Adding tests	2022-03-31 17:25:24 +05:30
ForwardXu	80011df995	[HUDI-3135] Make delete partitions lazy to be executed by the cleaner (#4489 ) As of now, delete partitions will ensure all file groups are deleted, but the partition as such is not deleted. So, get all partitions might return the deleted partitions as well, but no data will be served since all file groups are deleted. With this patch, we are fixing it. We are letting cleaner take care of deleting the partitions when all file groups pertaining to a partition are deleted. - Fixed the CleanPlanActionExecutor to return meta info about list of partitions to be deleted. If there are no valid file groups for a partition, clean planner will include the partition to be deleted. - Fixed HoodieCleanPlan avro schema to include the list of partitions to be deleted - CleanActionExecutor is fixed to delete partitions if any (as per clean plan) - Same info is added to HoodieCleanMetadata - Metadata table when applying clean metadata, will check for partitions to be deleted and will update the "all_partitions" record for the deleted partitions. Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-03-31 03:35:39 -04:00
Sivabalan Narayanan	3cdb590e15	[HUDI-3733] Adding HoodieFailedWritesCleaningPolicy for restore with hudi-cli (#5158 ) * Adding HoodieFailedWritesCleaningPolicy for restore with hudi-cli Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-03-31 13:00:49 +05:30
Yuwei XIAO	ce45f7f129	[HUDI-3692] MetadataFileSystemView includes compaction in timeline (#5110 )	2022-03-31 02:24:59 -04:00
Sivabalan Narayanan	4569734d60	[HUDI-3713] Guarding archival for multi-writer (#5138 )	2022-03-31 01:44:31 -04:00
Y Ethan Guo	f6ff95f97c	[MINOR][DOCS] Update hudi-utilities-slim-bundle docs (#5184 )	2022-03-30 21:48:54 -07:00
YueZhang	2dbb273d26	[HUDI-3721] Delete MDT if necessary when trigger rollback to savepoint (#5173 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-03-30 20:26:37 -07:00
KnightChess	2c4554fada	[HUDI-3750] Fix NPE when build HoodieFileIndex (#5134 ) Co-authored-by: wulingqi <wulingqi@baijiahulian.com>	2022-03-30 19:19:05 -07:00
Sivabalan Narayanan	d80c80699f	[MINOR] Fixing flakiness in TestHoodieSparkMergeOnReadTableRollback.testRollbackWithDeltaAndCompactionCommit (#5183 )	2022-03-30 22:07:22 -04:00
Y Ethan Guo	4fb1a590b1	[HUDI-3700] Add hudi-utilities-slim-bundle excluding hudi-spark-datasource modules (#5176 )	2022-03-30 18:08:35 -07:00
Y Ethan Guo	9830005e9b	[HUDI-3681] Provision additional hudi-spark-bundle with different versions (#5171 )	2022-03-30 17:35:56 -07:00
xiarixiaoyao	2d73c8ae86	[HUDI-3355] Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy (#4962 )	2022-03-30 15:54:25 -07:00
Nicolas Paris	9ff6a48f60	[HUDI-3736] Fix null pointer when key not specified (#5167 )	2022-03-30 18:11:26 -04:00
Raymond Xu	31d4a16deb	[HUDI-3536] Add hudi-datahub-sync implementation (#5155 )	2022-03-30 14:38:02 -07:00
Bo Cui	17d11f4839	[MINOR] Repeated execution of update status (#5089 )	2022-03-30 17:30:06 -04:00
YueZhang	2b60641d17	[HUDI-3635] Fix HoodieMetadataTableValidator around comparison of partition path listing (#5100 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-03-30 17:23:37 -04:00
YueZhang	eae8488536	[HUDI-3647] HoodieMetadataTableValidator: check MDT was initialized at first (#5152 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-03-30 17:18:08 -04:00
Alexey Kudinkin	8b796e9686	[HUDI-3653] Cleaning up bespoke Column Stats Index implementation (#5062 )	2022-03-30 10:01:43 -07:00
Sagar Sumit	04478a45d9	[MINOR] Fix dates as per UTC in TestDataSkippingUtils (#5166 ) * Fix timezone in test	2022-03-30 20:03:14 +05:30
Danny Chan	b9fbada2f2	[minor] Follow 3178, fix the flink metadata table compaction (#5175 )	2022-03-30 20:45:29 +08:00
harshal	7fa363923c	[HUDI-3745] Support for spark datasource options in S3EventsHoodieIncrSource (#5170 )	2022-03-30 01:34:49 -04:00
Sivabalan Narayanan	4fed8dd319	[HUDI-3485] Adding scheduler pool configs for async clustering (#5043 )	2022-03-29 21:27:45 -04:00
Danny Chan	5c1b482a1b	[HUDI-3741] Fix flink bucket index bulk insert generates too many small files (#5164 )	2022-03-30 08:18:36 +08:00
ForwardXu	941c254c33	[HUDI-2520] Fix CTAS statement issue when sync to hive (#5145 )	2022-03-29 12:25:31 -07:00
Alexey Kudinkin	e5a2baeed0	[HUDI-3549] Removing dependency on "spark-avro" (#4955 ) Hudi will be taking on promise for its bundles to stay compatible with Spark minor versions (for ex 2.4, 3.1, 3.2): meaning that single build of Hudi (for ex "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc) To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root-cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR). Instead of bundling "spark-avro" as dependency, we will be copying over some of the classes Hudi depends on and maintain them along the Hudi code-base to make sure we're able to provide for the aforementioned guarantee. To work around arising compatibility issues we will be applying local patches to guarantee compatibility of Hudi bundles w/in the Spark minor version branches. The following mapping of Hudi modules to Spark minor branches is currently maintained: "hudi-spark3" -> 3.2.x "hudi-spark3.1.x" -> 3.1.x "hudi-spark2" -> 2.4.x The following class hierarchies (borrowed from "spark-avro") are maintained w/in these Spark-specific modules to guarantee compatibility with respective minor version branches: AvroSerializer AvroDeserializer AvroUtils Each of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules. SchemaConverters class in turn is shared across all those modules given its relative stability (there're only cosmetic changes from 2.4.4 to 3.2.1). All of the aforementioned classes have their corresponding scope of visibility limited to corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure broader code-base does not become dependent on them and instead relies on facades abstracting them. Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in aforementioned minor versions branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI. Brief change log: - Removing spark-avro bundling from Hudi by default - Scaffolded Spark 3.2.x hierarchy - Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy - Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy - Moved ExpressionCodeGen,ExpressionPayload into hudi-spark module - Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0 - Modified bot.yml to build full matrix of supported Spark versions - Removed "spark-avro" dependency from all modules - Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.	2022-03-29 14:44:47 -04:00
ForwardXu	0802510ca9	[HUDI-2520] Fix drop partition issue when sync to hive (#5147 )	2022-03-29 11:28:19 -07:00
Alexey Kudinkin	fcb003ec76	[HUDI-3731] Fixing Column Stats Index record Merging sequence missing `columnName` (#5159 ) * Added `DataSkippingFailureMode` to control how DS handles failures in the flow (either "strict", when exception would be thrown, or "fallback" when it will just fallback to the full-scan) * Make sure tests execute in `DataSkippingFailureMode.Strict` * Fixed Column Stats Index record merging sequence missing `columnName`	2022-03-29 21:09:56 +05:30
Raymond Xu	1b2fb71afc	[MINOR] Move Experimental to javadoc (#5161 )	2022-03-28 21:07:59 -07:00
Nicolas Paris	7c7ecb11d5	[HUDI-3736] Fix default dynamodblock url default value (#4967 )	2022-03-28 20:31:46 -07:00
leesf	8f8a8158e2	[HUDI-2520] Fix drop table issue when sync to Hive (#5143 )	2022-03-28 19:34:12 -07:00
Danny Chan	3bf9c5ffe8	[HUDI-3728] Set the sort operator parallelism for flink bucket bulk insert (#5154 )	2022-03-29 09:52:35 +08:00
ForwardXu	72e0b52b18	[HUDI-3722] Fix truncate hudi table's error (#5140 )	2022-03-29 09:44:18 +08:00
Sivabalan Narayanan	d074089c62	[HUDI-2566] Adding multi-writer test support to integ test (#5065 )	2022-03-28 17:05:00 -04:00
Raymond Xu	6ccbae4d2a	[HUDI-2757] Implement Hudi AWS Glue sync (#5076 )	2022-03-28 14:54:59 -04:00
Y Ethan Guo	4ed84b216d	[HUDI-3720] Fix the logic of reattempting pending rollback (#5148 )	2022-03-28 14:54:31 -04:00
Shawy Geng	2e2d08cb72	[HUDI-3539] Flink bucket index bucketID bootstrap optimization. (#5093 ) * [HUDI-3539] Flink bucket index bucketID bootstrap optimization. Co-authored-by: gengxiaoyu <gengxiaoyu@bytedance.com>	2022-03-28 19:50:36 +08:00
huberylee	1d0f4ccfe0	[HUDI-3538] Support Compaction Command Based on Call Procedure Command for Spark SQL (#4945 ) * Support Compaction Command Based on Call Procedure Command for Spark SQL * Addressed review comments	2022-03-28 14:11:35 +08:00
ForwardXu	d31cde284c	[MINOR] Fix call command parser use spark3.2 (#5144 )	2022-03-28 11:13:44 +08:00
Sivabalan Narayanan	f2a93ead3b	[HUDI-3724] Fixing closure of ParquetReader (#5141 )	2022-03-28 09:36:15 +08:00
xiarixiaoyao	9da2dd416e	[HUDI-3719] High performance costs of AvroSerializer in DataSource wr… (#5137 ) * [HUDI-3719] High performance costs of AvroSerializer in DataSource writing * add benchmark framework, modified from Spark; add avroSerDerBenchmark	2022-03-27 11:01:43 -07:00
Sivabalan Narayanan	85c4a6cfc1	[MINOR] Relaxing cleaner and archival configs (#5142 )	2022-03-27 12:26:24 -04:00
Y Ethan Guo	484b3407e0	[HUDI-3604] Adjust the order of timeline changes in rollbacks (#5114 )	2022-03-26 22:37:44 -07:00
Danny Chan	4d940bbf8a	[HUDI-3716] OOM occurred when use bulk_insert cow table with flink BUCKET index (#5135 )	2022-03-27 09:13:58 +08:00
Alexey Kudinkin	189d5297b8	[HUDI-3709] Fixing `ParquetWriter` impls not respecting Parquet Max File Size limit (#5129 )	2022-03-26 17:51:36 -04:00
RexAn	57b4f39c31	[HUDI-3612] Clustering strategy should create new TypedProperties when modifying it (#5027 )	2022-03-26 16:16:03 +05:30

2686 Commits