lanyuanxiaoyao/hudi - hudi - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
wqwl611	52e63b39d6	[HUDI-4097] add table info to jobStatus (#5529 ) Co-authored-by: wqwl611 <wqwl611@gmail.com>	2022-05-13 21:01:15 -04:00
Sivabalan Narayanan	5c4813f101	[HUDI-4072] Fix NULL schema for empty batches in deltastreamer (#5543 )	2022-05-13 17:56:47 +05:30
Sivabalan Narayanan	b10ca7e69f	[HUDI-4085] Fixing flakiness with parquet empty batch tests in TestHoodieDeltaStreamer (#5559 )	2022-05-11 16:02:54 -04:00
Sivabalan Narayanan	569a76a9a5	[MINOR] fixing flaky tests in deltastreamer tests (#5521 )	2022-05-07 15:37:20 -04:00
Sivabalan Narayanan	52fe1c9fae	[HUDI-3675] Adding post write termination strategy to deltastreamer continuous mode (#5073 ) - Added a postWriteTerminationStrategy to deltastreamer continuous mode. One can enable by setting the appropriate termination strategy using DeltastreamerConfig.postWriteTerminationStrategyClass. If not, continuous mode is expected to run forever. - Added one concrete impl for termination strategy as NoNewDataTerminationStrategy which shuts down deltastreamer if there is no new data to consume from source for N consecutive rounds.	2022-05-06 09:27:29 -04:00
qianchutao	d794f4fbf9	[MINOR] Optimize code logic (#5499 )	2022-05-05 09:33:06 -07:00
Y Ethan Guo	a1d82b4dc5	[MINOR] Fix CI by ignoring SparkContext error (#5468 ) Sets spark.driver.allowMultipleContexts = true when constructing Spark conf in UtilHelpers	2022-04-29 11:19:07 -07:00
watermelon12138	cacbd98687	[HUDI-3945] After the async compaction operation is complete, the task should exit. (#5391 ) Co-authored-by: y00617041 <yangxuan42@huawei.com>	2022-04-27 21:16:09 +08:00
Alexey Kudinkin	4b296f79cc	[HUDI-3935] Adding config to fallback to enabled Partition Values extraction from Partition path (#5377 )	2022-04-21 01:36:19 -07:00
Y Ethan Guo	28fdddfee0	[HUDI-3920] Fix partition path construction in metadata table validator (#5365 )	2022-04-19 19:40:09 -04:00
Sagar Sumit	4f44e6aeb5	[HUDI-3899] Drop index to delete pending index instants from timeline if applicable (#5342 ) Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-04-18 22:28:46 -04:00
Sagar Sumit	1718bcab84	[HUDI-3707] Fix target schema handling in HoodieSparkUtils while creating RDD (#5347 )	2022-04-18 13:34:04 -04:00
Sivabalan Narayanan	05dfc39c29	Fixing async clustering job test in TestHoodieDeltaStreamer (#5317 )	2022-04-18 17:38:33 +05:30
董可伦	b8e465fdfc	[MINOR] Fix typos in log4j-surefire.properties (#5212 )	2022-04-15 13:33:37 -07:00
Raymond Xu	9e8664f4d2	[HOTFIX] add missing license (#5322 ) (#5324 )	2022-04-14 12:35:20 -07:00
Vinoth Govindarajan	2d46d5287e	[HUDI-3838] Moved the getPartitionColumns logic to driver. (#5303 )	2022-04-12 18:03:00 -04:00
Vinoth Govindarajan	d16740976e	[HUDI-3838] Implemented drop partition column feature for delta streamer code path (#5294 ) * [HUDI-3838] Implemented drop partition column feature for delta streamer code path * Ensure drop partition table config is updated in hoodie.props Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-12 18:10:30 +05:30
Sagar Sumit	3d8fc78c66	[HUDI-3844] Update props in indexer based on table config (#5293 )	2022-04-11 18:16:06 -04:00
Y Ethan Guo	63a099c5b7	[HUDI-3847] Fix NPE due to null schema in HoodieMetadataTableValidator (#5284 )	2022-04-10 17:59:29 -07:00
Alexey Kudinkin	81b25c543a	[HUDI-3825] Fixing Column Stats Index updating sequence (#5267 )	2022-04-08 23:14:08 -07:00
Y Ethan Guo	cd2c346df6	[HUDI-3637] Exclude uncommitted log files from metadata table validation (#5234 )	2022-04-07 13:03:03 -07:00
Raymond Xu	e96f08f355	Moving to 0.12.0-SNAPSHOT on master branch.	2022-04-06 15:24:10 +08:00
Prashant Wason	b28f0d6ceb	[HUDI-3290] Different file formats for the partition metadata file. (#5179 ) * [HUDI-3290] Different file formats for the partition metadata file. Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition. Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset. For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting. * Code review comments - Adding a new command to migrate from text to base file formats for meta file. - Reimplementing readFromFS() to first read the text format, then base format - Avoid extra exists() checks in readFromFS() - Added unit tests, enabled parquet format across hoodie-hadoop-mr - Code cleanup, restructuring, naming consistency. * Wiring in all the other Spark code paths to respect this config - Turned on parquet meta format for COW data source tests - Removed the deltastreamer command line to keep it shorter * populate HoodiePartitionMetadata#format after readFromFS() Co-authored-by: Vinoth Chandar <vinoth@apache.org> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-04 08:08:20 -07:00
Sivabalan Narayanan	84064a9b08	[HUDI-3772] Fixing auto adjustment of lock configs for deltastreamer (#5207 )	2022-04-02 23:44:10 -07:00
Alexey Kudinkin	cc3737be50	[HUDI-3664] Fixing Column Stats Index composition (#5181 ) Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-02 17:15:52 -07:00
Y Ethan Guo	c19f505b5a	[HUDI-3784] Improve docs and logs of HoodieMetadataTableValidator (#5216 )	2022-04-02 13:16:17 -07:00
Sagar Sumit	28dafa774e	[HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693 ) - Add a new action called INDEX, whose state transition is described in the RFC. - Changes in timeline to support the new action. - Add an index planner in ScheduleIndexActionExecutor. - Add index plan executor in RunIndexActionExecutor. - Add 3 APIs in HoodieTableMetadataWriter; a) scheduleIndex: will generate an index plan based on latest completed instant, initialize file groups and add a requested INDEX instant, b) index: executes the index plan and also takes care of writes that happened after indexing was requested, c) dropIndex: will drop index by removing the given metadata partition. - Add 2 new table configs to serve as the source of truth for inflight and completed indexes. - Support upgrade/downgrade taking care of the newly added configs. - Add tool to trigger indexing in HoodieIndexer. - Handle corner cases related to partial failures. - Abort gracefully after deleting partition and instant. - Handle other actions in timeline to consider before catching up	2022-04-01 01:33:12 +05:30
YueZhang	2b60641d17	[HUDI-3635] Fix HoodieMetadataTableValidator around comparison of partition path listing (#5100 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-03-30 17:23:37 -04:00
YueZhang	eae8488536	[HUDI-3647] HoodieMetadataTableValidator: check MDT was initialized at first (#5152 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-03-30 17:18:08 -04:00
harshal	7fa363923c	[HUDI-3745] Support for spark datasource options in S3EventsHoodieIncrSource (#5170 )	2022-03-30 01:34:49 -04:00
Sivabalan Narayanan	4fed8dd319	[HUDI-3485] Adding scheduler pool configs for async clustering (#5043 )	2022-03-29 21:27:45 -04:00
Alexey Kudinkin	e5a2baeed0	[HUDI-3549] Removing dependency on "spark-avro" (#4955 ) Hudi will be taking on a promise for its bundles to stay compatible with Spark minor versions (for ex 2.4, 3.1, 3.2): meaning that a single build of Hudi (for ex "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc) To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root-cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR). Instead of bundling "spark-avro" as a dependency, we will be copying over some of the classes Hudi depends on and maintaining them along the Hudi code-base to make sure we're able to provide for the aforementioned guarantee. To work around arising compatibility issues we will be applying local patches to guarantee compatibility of Hudi bundles w/in the Spark minor version branches. The following mapping of Hudi modules to Spark minor branches is currently maintained: "hudi-spark3" -> 3.2.x "hudi-spark3.1.x" -> 3.1.x "hudi-spark2" -> 2.4.x The following class hierarchies (borrowed from "spark-avro") are maintained w/in these Spark-specific modules to guarantee compatibility with respective minor version branches: AvroSerializer AvroDeserializer AvroUtils Each of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules. SchemaConverters class in turn is shared across all those modules given its relative stability (there are only cosmetic changes from 2.4.4 to 3.2.1). All of the aforementioned classes have their corresponding scope of visibility limited to corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure the broader code-base does not become dependent on them and instead relies on facades abstracting them. Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in the aforementioned minor version branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI. Brief change log: - Removing spark-avro bundling from Hudi by default - Scaffolded Spark 3.2.x hierarchy - Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy - Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy - Moved ExpressionCodeGen,ExpressionPayload into hudi-spark module - Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0 - Modified bot.yml to build full matrix of supported Spark versions - Removed "spark-avro" dependency from all modules - Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.	2022-03-29 14:44:47 -04:00
Raymond Xu	6ccbae4d2a	[HUDI-2757] Implement Hudi AWS Glue sync (#5076 )	2022-03-28 14:54:59 -04:00
wangxianghu	f20c9867d7	[HUDI-3711] Fix typo in MaxwellJsonKafkaSourcePostProcessor.Config#PRECOMBINE_FIELD_TYPE_PROP (#5096 )	2022-03-25 00:02:54 -07:00
Y Ethan Guo	4ddd094ba2	[HUDI-3689] Disable flaky tests in TestHoodieDeltaStreamer (#5127 )	2022-03-24 16:42:44 -07:00
Raymond Xu	ff136658a0	[HUDI-3689] Fix delta streamer tests (#5124 )	2022-03-24 14:19:53 -07:00
Raymond Xu	686da41696	[HUDI-3689] Fix UT failures in TestHoodieDeltaStreamer (#5120 )	2022-03-24 09:10:33 -07:00
Alexey Kudinkin	ccc3728002	[HUDI-3684] Fixing NPE in `ParquetUtils` (#5102 ) * Make sure nulls are properly handled in `HoodieColumnRangeMetadata`	2022-03-24 17:37:38 +05:30
Sagar Sumit	fe2c3989e3	[HUDI-3689] Fix glob path and hive sync in deltastreamer tests (#5117 ) * Remove glob pattern basePath from the deltastreamer tests. * [HUDI-3689] Fix file scheme config for CI failure in TestHoodieRealTimeRecordReader Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-03-24 15:48:35 +05:30
Rajesh Mahindra	5f570ea151	[HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs (#4175 ) - Refactor hive sync tool / config to use reflection and standardize configs Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-03-21 22:56:31 -04:00
YueZhang	8ca9a54db0	[Hudi-3376] Add an option to skip under deletion files for HoodieMetadataTableValidator (#4994 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-03-16 18:31:00 -07:00
wangxianghu	3b59b76952	[HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string (#4987 ) * [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string * add ut * Address comment	2022-03-15 15:06:30 +04:00
Sivabalan Narayanan	22c3ce73db	[HUDI-3621] Fixing NullPointerException in DeltaStreamer (#5039 )	2022-03-14 18:34:17 -04:00
wangxianghu	e8918b6c2c	[HUDI-3569] Introduce ChainedJsonKafkaSourePostProcessor to support setting multi processors at once (#4969 )	2022-03-11 17:49:30 -05:00
wangxianghu	b00180342e	[HUDI-3575] Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema in TestSchemaPostProcessor (#5019 )	2022-03-11 15:03:42 +04:00
wangxianghu	83cff3afee	[HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop columns from schema (#4972 ) * [HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop columns from schema * Fix case sensitivity	2022-03-11 09:30:37 +04:00
Sivabalan Narayanan	9dc6df5dca	[HUDI-3595] Fixing NULL schema provider for empty batch (#5002 )	2022-03-10 22:52:55 -05:00
Alexey Kudinkin	034addaef5	[HUDI-3396] Make sure `BaseFileOnlyViewRelation` only reads projected columns (#4818 ) NOTE: This change is first part of the series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them, implementations are consistent and performant This PR is making sure that BaseFileOnlyViewRelation only reads projected columns as well as avoiding unnecessary serde from Row to InternalRow Brief change log - Introduced HoodieBaseRDD as a base for all custom RDD impls - Extracted common fields/methods to HoodieBaseRelation - Cleaned up and streamlined HoodieBaseFileViewOnlyRelation - Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions	2022-03-09 21:45:25 -05:00
wangxianghu	548000b0d6	[HUDI-3568] Introduce ChainedSchemaPostProcessor to support setting multi processors at once (#4968 )	2022-03-09 11:16:22 +04:00
Sagar Sumit	575bc63468	[HUDI-3356][HUDI-3203] HoodieData for metadata index records; BloomFilter construction from index based on the type param (#4848 ) Rework of #4761 This diff introduces following changes: - Write stats are converted to metadata index records during the commit. Making them use the HoodieData type so that the record generation scales up with needs. - Metadata index init support for bloom filter and column stats partitions. - When building the BloomFilter from the index records, using the type param stored in the payload instead of hardcoded type. - Delta writes can change column ranges and the column stats index need to be properly updated with new ranges to be consistent with the table dataset. This fix add column stats index update support for the delta writes. Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com>	2022-03-08 10:39:04 -05:00
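
Commit 52fe1c9fae ([HUDI-3675]) in the list above describes a termination strategy for deltastreamer continuous mode that shuts the job down after N consecutive rounds with no new data. The counting logic can be sketched roughly as follows. This is a minimal illustration, not Hudi's code: the class `NoNewDataTermination`, its method `should_shutdown`, and the helper `run_continuous` are hypothetical names, and the real interface behind `postWriteTerminationStrategyClass` may look quite different.

```python
from typing import Iterable, Sequence


class NoNewDataTermination:
    """Hypothetical stand-in for a 'no new data' termination strategy.

    Assumption: the strategy is consulted once after every write round and
    is told whether that round consumed any data from the source.
    """

    def __init__(self, max_idle_rounds: int) -> None:
        if max_idle_rounds < 1:
            raise ValueError("max_idle_rounds must be at least 1")
        self.max_idle_rounds = max_idle_rounds
        self.idle_rounds = 0  # consecutive rounds seen without data

    def should_shutdown(self, batch_had_data: bool) -> bool:
        # Any round with data resets the streak; an empty round extends it.
        if batch_had_data:
            self.idle_rounds = 0
        else:
            self.idle_rounds += 1
        return self.idle_rounds >= self.max_idle_rounds


def run_continuous(batches: Iterable[Sequence], strategy: NoNewDataTermination) -> int:
    """Simulate continuous mode over pre-made batches; return rounds executed."""
    rounds = 0
    for batch in batches:
        rounds += 1
        # A real job would write the batch here before asking the strategy.
        if strategy.should_shutdown(len(batch) > 0):
            break
    return rounds
```

With a limit of two idle rounds, a single empty batch between two non-empty ones does not stop the loop, because the counter resets as soon as data arrives again. Without any strategy configured, the commit message states that continuous mode is expected to run forever.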

438 Commits