lanyuanxiaoyao/hudi

Author	SHA1	Message	Date
Alexey Kudinkin	e5a2baeed0	[HUDI-3549] Removing dependency on "spark-avro" (#4955 ) Hudi will be taking on the promise that its bundles stay compatible with Spark minor versions (e.g. 2.4, 3.1, 3.2), meaning that a single build of Hudi (e.g. "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc.). To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root cause of incompatibility between consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR). Instead of bundling "spark-avro" as a dependency, we will be copying over some of the classes Hudi depends on and maintaining them alongside the Hudi code-base to make sure we're able to provide the aforementioned guarantee. To work around arising compatibility issues we will be applying local patches to guarantee compatibility of Hudi bundles within the Spark minor version branches. The following mapping of Hudi modules to Spark minor branches is currently maintained: "hudi-spark3" -> 3.2.x; "hudi-spark3.1.x" -> 3.1.x; "hudi-spark2" -> 2.4.x. The following class hierarchies (borrowed from "spark-avro") are maintained within these Spark-specific modules to guarantee compatibility with the respective minor version branches: AvroSerializer, AvroDeserializer, AvroUtils. Each of these classes has been copied from Spark 3.2.1 (for the 3.2.x branch), 3.1.2 (for the 3.1.x branch), and 2.4.4 (for the 2.4.x branch) into its respective module. The SchemaConverters class in turn is shared across all those modules given its relative stability (there are only cosmetic changes from 2.4.4 to 3.2.1). All of the aforementioned classes have their scope of visibility limited to the corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure the broader code-base does not become dependent on them and instead relies on facades abstracting them. 
Additionally, given that Hudi plans on supporting all the patch versions of Spark within the aforementioned minor version branches, additional build steps were added to validate that Hudi can be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI. Brief change log: - Removing spark-avro bundling from Hudi by default - Scaffolded Spark 3.2.x hierarchy - Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy - Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy - Moved ExpressionCodeGen, ExpressionPayload into hudi-spark module - Fixed AvroDeserializer to stay compatible with both Spark 3.2.1 and 3.2.0 - Modified bot.yml to build the full matrix of supported Spark versions - Removed "spark-avro" dependency from all modules - Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.	2022-03-29 14:44:47 -04:00
leesf	8f8a8158e2	[HUDI-2520] Fix drop table issue when sync to Hive (#5143 )	2022-03-28 19:34:12 -07:00
huberylee	1d0f4ccfe0	[HUDI-3538] Support Compaction Command Based on Call Procedure Command for Spark SQL (#4945 ) * Support Compaction Command Based on Call Procedure Command for Spark SQL * Addressed review comments	2022-03-28 14:11:35 +08:00
ForwardXu	12cc8e715b	[MINOR] fix QuickstartUtils move (#5133 )	2022-03-25 07:34:35 -07:00
ForwardXu	e5c3f9089b	[HUDI-3563] Make quickstart examples covered by CI tests (#5082 )	2022-03-25 01:37:17 -07:00
Alexey Kudinkin	8b38ddedc2	[HUDI-3594] Supporting Composite Expressions over Data Table Columns in Data Skipping flow (#4996 )	2022-03-24 22:27:15 -07:00
Rajesh Mahindra	5f570ea151	[HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs (#4175 ) - Refactor hive sync tool / config to use reflection and standardize configs Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-03-21 22:56:31 -04:00
peanut-chenzhong	4b75cb6f23	fix NPE when running schedule using spark-sql if the commits time < hoodie.compact.inline.max.delta.commits (#4976 ) * Update CompactionHoodiePathCommand.scala fix NPE when running schedule using spark-sql if the commits time < hoodie.compact.inline.max.delta.commits * Update CompactionHoodiePathCommand.scala fix IndexOutOfBoundsException when there's no schedule for compaction * Update CompactionHoodiePathCommand.scala fix CI issue	2022-03-14 16:40:38 +08:00
ForwardXu	6c8224cae6	[HUDI-3501] Support savepoints command based on Call Produce Command (#5025 )	2022-03-13 16:58:21 +04:00
huberylee	56cb49485d	[HUDI-3567] Refactor HoodieCommonUtils to make code more reasonable (#4982 )	2022-03-11 13:23:19 -08:00
ForwardXu	08fd80c913	[HUDI-3221] Support querying a table as of a savepoint (#4720 )	2022-03-08 10:02:34 -08:00
shibei	62f534d002	[HUDI-3445] Support Clustering Command Based on Call Procedure Command for Spark SQL (#4901 ) * [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL * [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL * [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL Co-authored-by: shibei <huberylee.li@alibaba-inc.com>	2022-03-04 09:33:16 +08:00
ForwardXu	aa1810d737	[HUDI-3493] Not table to get execution plan (#4894 )	2022-02-24 17:04:44 -08:00
Alexey Kudinkin	85e8a5c4de	[HUDI-1296] Support Metadata Table in Spark Datasource (#4789 ) * Bootstrapping initial support for Metadata Table in Spark Datasource - Consolidated Avro/Row conversion utilities to center around Spark's AvroDeserializer ; removed duplication - Bootstrapped HoodieBaseRelation - Updated HoodieMergeOnReadRDD to be able to handle Metadata Table - Modified MOR relations to be able to read different Base File formats (Parquet, HFile)	2022-02-24 16:23:13 -05:00
ForwardXu	521338b4d9	[HUDI-3161] Add Call Produce Command for Spark SQL (#4535 )	2022-02-24 07:45:37 -08:00
leesf	2a93b8efb2	[HUDI-3489] Unify config to avoid duplicate code (#4883 )	2022-02-23 08:14:30 -05:00
Yann Byron	0c950181aa	[HUDI-3423] upgrade spark to 3.2.1 (#4815 )	2022-02-21 16:52:21 -08:00
leesf	0db1e978c6	[HUDI-3254] Introduce HoodieCatalog to manage tables for Spark Datasource V2 (#4611 )	2022-02-14 06:26:58 -08:00
YueZhang	76e2faa28d	[HUDI-3370] The files recorded in the commit may not match the actual ones for MOR Compaction (#4753 ) * use HoodieCommitMetadata to replace writeStatuses computation Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-02-14 11:12:52 +08:00
Sivabalan Narayanan	ba4e732ba7	[HUDI-2987] Update all deprecated calls to new apis in HoodieRecordPayload (#4681 )	2022-02-10 19:19:33 -05:00
Y Ethan Guo	b8601a9f58	[HUDI-2656] Generalize HoodieIndex for flexible record data type (#3893 ) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-02-03 20:24:04 -08:00
xuzifu666	bf409e8423	[MINOR] Standardize HoodieSqlCommon.g4 file (#4582 )	2022-01-25 10:09:08 +08:00
董可伦	cfde45b548	[HUDI-3282] Fix delete exception for Spark SQL when sync Hive (#4644 )	2022-01-23 11:32:57 -08:00
董可伦	56cd8ffae0	[HUDI-2837] Add support for using database name in incremental query (#4083 )	2022-01-22 22:11:27 -08:00
Yann Byron	5e0171a5ee	[HUDI-3198] Improve Spark SQL create table from existing hudi table (#4584 ) To modify SQL statement for creating hudi table based on an existing hudi path. From: ```sql create table hudi_tbl using hudi tblproperties (primaryKey='id', preCombineField='ts', type='cow') partitioned by (pt) location '/path/to/hudi' ``` To: ```sql create table hudi_tbl using hudi location '/path/to/hudi' ```	2022-01-14 10:15:29 -08:00
leesf	5ce45c440b	[HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation (#4514 ) * Introduce hudi-spark3-common and hudi-spark2-common modules to place classes that would be reused in different spark versions, also introduce hudi-spark3.1.x to support spark 3.1.x. * Introduce hudi format under hudi-spark2, hudi-spark3, hudi-spark3.1.x modules and change the hudi format in original hudi-spark module to hudi_v1 format. * Manually tested on Spark 3.1.2 and Spark 3.2.0 SQL. * Added a README.md file under hudi-spark-datasource module.	2022-01-14 13:42:35 +08:00
董可伦	017ddbbfac	[MINOR] Fix typos (#4567 )	2022-01-11 23:17:10 -08:00
Sagar Sumit	827549949c	[HUDI-2909] Handle logical type in TimestampBasedKeyGenerator (#4203 ) * [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator TimestampBasedKeyGenerator was returning different values for the row-writer and non-row-writer paths. This patch fixes it and is guarded by a config flag (`hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled`)	2022-01-08 10:22:44 -05:00
Vinish Reddy	eee715b3ff	[HUDI-3168] Fixing null schema with empty commit in incremental relation (#4513 )	2022-01-05 11:43:10 -05:00
Yann Byron	fe9406dd33	[HUDI-3131] fix ctas error in spark3.1.1 (#4476 )	2022-01-02 03:06:55 -08:00
Yann Byron	1622b52c9c	[HUDI-3136] Fix merge/insert/show partitions error on Spark3.2 (#4490 )	2022-01-02 02:42:10 -08:00
ForwardXu	504747ecf4	[HUDI-3108] Fix Purge Drop MOR Table Cause error (#4455 )	2021-12-29 20:23:23 +08:00
Yann Byron	05942e018c	[HUDI-2811] Support Spark 3.2 (#4270 )	2021-12-28 00:12:44 -08:00
Yann Byron	1f7afba5e4	[HUDI-3093] fix spark-sql query table that write with TimestampBasedKeyGenerator (#4416 )	2021-12-27 21:39:52 -08:00
ForwardXu	282aa68552	[HUDI-3099] Purge drop partition for spark sql (#4436 )	2021-12-28 09:38:26 +08:00
xuzifu666	032b883bd1	[HUDI-3014] Add table option to set utc timezone (#4306 )	2021-12-23 16:27:45 +08:00
ForwardXu	5d93edc539	[HUDI-3060] drop table for spark sql (#4364 )	2021-12-22 19:17:43 +08:00
harshal patil	7d046f914a	[HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields	2021-12-21 11:54:52 +05:30
xuzifu666	3ca92108b2	remove unused import (#4349 )	2021-12-20 16:32:41 +08:00
ForwardXu	dd96129191	[HUDI-2990] Sync to HMS when deleting partitions (#4291 )	2021-12-13 20:40:06 +08:00
Alexey Kudinkin	2d864f7524	[HUDI-2814] Make Z-index more generic Column-Stats Index (#4106 )	2021-12-10 14:56:09 -08:00
xiarixiaoyao	68f8597b12	[HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished. (#4265 ) * [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.	2021-12-09 19:51:49 +08:00
Yann Byron	2f96f4300b	Revert "[HUDI-2495] Resolve inconsistent key generation for timestamp types by GenericRecord and Row (#3944 )" (#4201 )	2021-12-03 11:13:38 -05:00
Alexey Kudinkin	bed7f9897a	[HUDI-2911] Removing default value for `PARTITIONPATH_FIELD_NAME` resulting in incorrect `KeyGenerator` configuration (#4195 )	2021-12-03 07:33:38 -05:00
Yann Byron	ca427240c0	[MINOR] use catalog schema if can not find table schema (#4182 )	2021-12-03 00:37:13 -08:00
zzzhy	61a03bc072	[MINOR] Fix the wrong usage of timestamp length variable bug (#4179 ) Signed-off-by: zzzhy <candle_1667@163.com>	2021-12-02 22:47:31 +08:00
董可伦	a398aad1fc	[HUDI-2642] Add support ignoring case in update sql operation (#3882 )	2021-11-29 22:36:36 -08:00
董可伦	3433f00cb5	[MINOR] Fix typo,rename 'getUrlEncodePartitoning' to 'getUrlEncodePartitioning' (#4130 )	2021-11-29 18:31:22 -08:00
Sivabalan Narayanan	38e75ea806	Removing rfc from release package and fixing release validation script (#4147 )	2021-11-29 13:18:35 +08:00
Y Ethan Guo	d1e83e4ba0	[HUDI-2767] Enabling timeline-server-based marker as default (#4112 ) - Changes the default config of marker type (HoodieWriteConfig.MARKERS_TYPE or hoodie.write.markers.type) from DIRECT to TIMELINE_SERVER_BASED for Spark Engine. - Adds engine-specific marker type configs: Spark -> TIMELINE_SERVER_BASED, Flink -> DIRECT, Java -> DIRECT. - Uses DIRECT markers as well for Spark structured streaming due to timeline server only available for the first mini-batch. - Fixes the marker creation method for non-partitioned table in TimelineServerBasedWriteMarkers. - Adds the fallback to direct markers even when TIMELINE_SERVER_BASED is configured, in WriteMarkersFactory: when HDFS is used, or embedded timeline server is disabled, the fallback to direct markers happens. - Fixes the closing of timeline service. - Fixes tests that depend on markers, mainly by starting the timeline service for each test.	2021-11-26 16:41:05 -05:00
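
The marker-type rules listed in the [HUDI-2767] entry above (per-engine defaults plus the fallback to direct markers) can be sketched roughly as follows. This is only an illustration of the behaviour the commit message describes, not Hudi's actual `WriteMarkersFactory` code: the function `resolve_marker_type`, its parameters, and the scheme-based HDFS check are all assumptions made for this example, and the structured-streaming case is simplified to a boolean flag.

```python
from urllib.parse import urlparse

DIRECT = "DIRECT"
TIMELINE_SERVER_BASED = "TIMELINE_SERVER_BASED"

# Engine-specific defaults, as listed in the commit message.
ENGINE_DEFAULTS = {
    "spark": TIMELINE_SERVER_BASED,
    "flink": DIRECT,
    "java": DIRECT,
}


def resolve_marker_type(engine, base_path, configured=None,
                        embedded_timeline_server=True,
                        structured_streaming=False):
    """Hypothetical helper: return the marker type that would effectively apply.

    `configured` stands in for an explicit hoodie.write.markers.type value;
    when it is None the engine default is used.
    """
    marker_type = (configured or ENGINE_DEFAULTS[engine]).upper()
    if marker_type != TIMELINE_SERVER_BASED:
        return marker_type

    # Assumption: "HDFS is used" is approximated by the path's URI scheme.
    on_hdfs = urlparse(base_path).scheme == "hdfs"

    # Fall back to direct markers when the timeline server cannot be relied on.
    if on_hdfs or not embedded_timeline_server or structured_streaming:
        return DIRECT
    return marker_type
```

For instance, `resolve_marker_type("spark", "hdfs://namenode:8020/tables/t1")` falls back to `DIRECT`, while the same call with a non-HDFS base path keeps the Spark default.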

179 Commits