lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Alexey Kudinkin	e5a2baeed0	[HUDI-3549] Removing dependency on "spark-avro" (#4955) Hudi will be taking on promise for it bundles to stay compatible with Spark minor versions (for ex 2.4, 3.1, 3.2): meaning that single build of Hudi (for ex "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc) To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root-cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR). Instead of bundling "spark-avro" as dependency, we will be copying over some of the classes Hudi depends on and maintain them along the Hudi code-base to make sure we're able to provide for the aforementioned guarantee. To workaround arising compatibility issues we will be applying local patches to guarantee compatibility of Hudi bundles w/in the Spark minor version branches. Following Hudi modules to Spark minor branches is currently maintained: "hudi-spark3" -> 3.2.x "hudi-spark3.1.x" -> 3.1.x "hudi-spark2" -> 2.4.x Following classes hierarchies (borrowed from "spark-avro") are maintained w/in these Spark-specific modules to guarantee compatibility with respective minor version branches: AvroSerializer AvroDeserializer AvroUtils Each of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules. SchemaConverters class in turn is shared across all those modules given its relative stability (there're only cosmetical changes from 2.4.4 to 3.2.1). All of the aforementioned classes have their corresponding scope of visibility limited to corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure broader code-base does not become dependent on them and instead relies on facades abstracting them. Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in aforementioned minor versions branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI. Brief change log: - Removing spark-avro bundling from Hudi by default - Scaffolded Spark 3.2.x hierarchy - Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy - Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy - Moved ExpressionCodeGen,ExpressionPayload into hudi-spark module - Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0 - Modified bot.yml to build full matrix of support Spark versions - Removed "spark-avro" dependency from all modules - Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.	2022-03-29 14:44:47 -04:00
Raymond Xu	6ccbae4d2a	[HUDI-2757] Implement Hudi AWS Glue sync (#5076)	2022-03-28 14:54:59 -04:00
wangxianghu	f20c9867d7	[HUDI-3711] Fix typo in MaxwellJsonKafkaSourcePostProcessor.Config#PRECOMBINE_FIELD_TYPE_PROP (#5096)	2022-03-25 00:02:54 -07:00
Y Ethan Guo	4ddd094ba2	[HUDI-3689] Disable flaky tests in TestHoodieDeltaStreamer (#5127)	2022-03-24 16:42:44 -07:00
Raymond Xu	ff136658a0	[HUDI-3689] Fix delta streamer tests (#5124)	2022-03-24 14:19:53 -07:00
Raymond Xu	686da41696	[HUDI-3689] Fix UT failures in TestHoodieDeltaStreamer (#5120)	2022-03-24 09:10:33 -07:00
Alexey Kudinkin	ccc3728002	[HUDI-3684] Fixing NPE in `ParquetUtils` (#5102) * Make sure nulls are properly handled in `HoodieColumnRangeMetadata`	2022-03-24 17:37:38 +05:30
Sagar Sumit	fe2c3989e3	[HUDI-3689] Fix glob path and hive sync in deltastreamer tests (#5117) * Remove glob pattern basePath from the deltastreamer tests. * [HUDI-3689] Fix file scheme config for CI failure in TestHoodieRealTimeRecordReader Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-03-24 15:48:35 +05:30
Rajesh Mahindra	5f570ea151	[HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs (#4175) - Refactor hive sync tool / config to use reflection and standardize configs Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-03-21 22:56:31 -04:00
YueZhang	8ca9a54db0	[Hudi-3376] Add an option to skip under deletion files for HoodieMetadataTableValidator (#4994) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-03-16 18:31:00 -07:00
wangxianghu	3b59b76952	[HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string (#4987) * [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string * add ut * Address comment	2022-03-15 15:06:30 +04:00
Sivabalan Narayanan	22c3ce73db	[HUDI-3621] Fixing NullPointerException in DeltaStreamer (#5039)	2022-03-14 18:34:17 -04:00
wangxianghu	e8918b6c2c	[HUDI-3569] Introduce ChainedJsonKafkaSourePostProcessor to support setting multi processors at once (#4969)	2022-03-11 17:49:30 -05:00
wangxianghu	b00180342e	[HUDI-3575] Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema in TestSchemaPostProcessor (#5019)	2022-03-11 15:03:42 +04:00
wangxianghu	83cff3afee	[HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop columns from schema (#4972) * [HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop columns from schema * Fix case sensitivity	2022-03-11 09:30:37 +04:00
Sivabalan Narayanan	9dc6df5dca	[HUDI-3595] Fixing NULL schema provider for empty batch (#5002)	2022-03-10 22:52:55 -05:00
Alexey Kudinkin	034addaef5	[HUDI-3396] Make sure `BaseFileOnlyViewRelation` only reads projected columns (#4818) NOTE: This change is first part of the series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them, implementations are consistent and performant This PR is making sure that BaseFileOnlyViewRelation only reads projected columns as well as avoiding unnecessary serde from Row to InternalRow Brief change log - Introduced HoodieBaseRDD as a base for all custom RDD impls - Extracted common fields/methods to HoodieBaseRelation - Cleaned up and streamlined HoodieBaseFileViewOnlyRelation - Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions	2022-03-09 21:45:25 -05:00
wangxianghu	548000b0d6	[HUDI-3568] Introduce ChainedSchemaPostProcessor to support setting multi processors at once (#4968)	2022-03-09 11:16:22 +04:00
Sagar Sumit	575bc63468	[HUDI-3356][HUDI-3203] HoodieData for metadata index records; BloomFilter construction from index based on the type param (#4848) Rework of #4761 This diff introduces following changes: - Write stats are converted to metadata index records during the commit. Making them use the HoodieData type so that the record generation scales up with needs. - Metadata index init support for bloom filter and column stats partitions. - When building the BloomFilter from the index records, using the type param stored in the payload instead of hardcoded type. - Delta writes can change column ranges and the column stats index need to be properly updated with new ranges to be consistent with the table dataset. This fix add column stats index update support for the delta writes. Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com>	2022-03-08 10:39:04 -05:00
Sivabalan Narayanan	3539578ccb	[HUDI-3213] Making commit preserve metadata to true for compaction (#4811) * Making commit preserve metadata to true * Fixing integ tests * Fixing preserve commit metadata for metadata table * fixed bootstrap tests * temp diff * Fixing merge handle * renaming fallback record * fixing build issue * Fixing test failures	2022-03-07 18:02:05 +05:30
wangxianghu	c9ffdc493e	[HUDI-3525] Introduce JsonkafkaSourceProcessor to support data preprocess before it is transformed to DataSet (#4930)	2022-03-06 15:41:01 -05:00
wangxianghu	4b471772aa	[HUDI-3520] Introduce DeleteSupportSchemaPostProcessor to support adding _hoodie_is_deleted column to schema (#4921)	2022-03-06 15:37:09 -05:00
Pratyaksh Sharma	907e60c252	[HUDI-3264]: made schema registry urls configurable with MTDS (#4779)	2022-03-02 15:30:41 -05:00
Alexey Kudinkin	85f47b53df	[HUDI-3469] Refactor `HoodieTestDataGenerator` to provide for reproducible Builds (#4866)	2022-03-01 22:15:26 -08:00
Sivabalan Narayanan	f7088a957c	[HUDI-3497] Adding Datatable validator tool (#4902)	2022-02-28 22:46:32 -05:00
Y Ethan Guo	257052a94d	[HUDI-3465] Add validation of column stats and bloom filters in HoodieMetadataTableValidator (#4878)	2022-02-28 18:49:30 -08:00
Sivabalan Narayanan	2f99e8458a	[HUDI-3521] Fixing kakfa key and value serializer value type from class to string (#4919)	2022-02-27 11:13:13 -05:00
Sivabalan Narayanan	9678c3fbcf	[MINOR] Fixing checkpoint management in S3IncrSource (#4871)	2022-02-22 09:15:16 -05:00
wangxianghu	4d1f74ebea	[HUDI-3464] Fix wrong exception thrown from HiveSchemaProvider (#4865)	2022-02-22 10:20:20 +04:00
Sivabalan Narayanan	14dbbdf4c7	[HUDI-2189] Adding delete partitions support to DeltaStreamer (#4787)	2022-02-22 00:01:30 -05:00
Sivabalan Narayanan	d36fe24c9e	[HUDI-3455] Fixing checkpoint management in hoodie incr source (#4850)	2022-02-21 08:19:57 -05:00
Sivabalan Narayanan	66ac1446dd	[MINOR] Moving spark scheduling configs out of DataSourceOptions (#4843)	2022-02-20 13:49:18 -05:00
Y Ethan Guo	fba5822ee3	[HUDI-3430] Fix Deltastreamer to properly shut down the services upon failure (#4824)	2022-02-18 08:44:56 -05:00
Y Ethan Guo	9a05940a74	[HUDI-3366] Remove hardcoded logic of disabling metadata table in tests (#4792)	2022-02-15 16:41:47 -05:00
YueZhang	76e2faa28d	[HUDI-3370] The files recorded in the commit may not match the actual ones for MOR Compaction (#4753) * use HoodieCommitMetadata to replace writeStatuses computation Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-02-14 11:12:52 +08:00
冯健	55777fec05	[HUDI-2413] fix Sql source's checkpoint issue (#3648) * [HUDI-2413] fix Sql source's checkpoint * Fixing sql source checkpoint handling * Fixing docs Co-authored-by: jian.feng <fengjian428@gmial.com> Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-02-14 08:07:48 +05:30
Y Ethan Guo	6aba00e84f	[MINOR] Fix typos in Spark client related classes (#4781)	2022-02-13 06:41:58 -08:00
wangxianghu	ce9762d588	[MINOR] unused import (#4799)	2022-02-12 13:11:37 +04:00
zhangxiang17	9518f78610	[HUDI-3413]fix jackson parse error when empty message from JsonKafkaSource Using HoodieDeltaStreamer (#4794)	2022-02-12 11:37:29 +04:00
Sivabalan Narayanan	60831d6906	[HUDI-3361] Fixing missing begin checkpoint in HoodieIncremental pull (#4755)	2022-02-08 12:03:07 -05:00
Sivabalan Narayanan	ab73047958	Adding support for custom scheduler configs with streaming sink (#4762)	2022-02-08 14:44:10 +05:30
YueZhang	1636876e8a	[HUDI-3320] Hoodie metadata table validator (#4721) Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2022-02-08 00:29:44 -08:00
Vinish Reddy	8ab6f17149	[HUDI-3373] Add zero value metrics for empty data source and PROMETHEUS_PUSHGATEWAY reporter (#4760)	2022-02-07 15:17:46 -05:00
ForwardXu	773b317983	[HUDI-2941] Show _hoodie_operation in spark sql results (#4649)	2022-02-07 06:28:13 -08:00
Sivabalan Narayanan	24f738fe68	[HUDI-3360] Adding retries to deltastreamer for source errors (#4744)	2022-02-07 08:10:06 -05:00
ehui	538db185ca	[HUDI-2491] Expose HMS mode metastore uri config option for spark writer (#3962)	2022-02-07 18:13:51 +05:30
YueZhang	de206acbae	[HUDI-3369] New ScheduleAndExecute mode for HoodieCompactor and hudi-cli (#4750) Schedule and execute compaction plan in one single mode.	2022-02-07 15:01:34 +05:30
Y Ethan Guo	b8601a9f58	[HUDI-2656] Generalize HoodieIndex for flexible record data type (#3893) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-02-03 20:24:04 -08:00
Raymond Xu	caef3d5c58	[HUDI-3330] Remove fixture test tables for multi writer tests (#4704)	2022-02-02 17:50:10 +05:30
jsbali	7ce0f4522b	[HUDI-2711] Fallback to fulltable scan for IncrementalRelation if underlying files have been cleared or moved by cleaner (#3946) Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-01-31 23:03:18 -05:00

407 Commits