lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Y Ethan Guo	fbda4ad5bd	[HUDI-4360] Fix HoodieDropPartitionsTool based on refactored meta sync (#6043 )	2022-07-04 23:37:21 -07:00
YueZhang	45fdcf68a1	[HUDI-3116]Add a new HoodieDropPartitionsTool to let users drop table partitions through a standalone job. (#4459 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-07-04 19:24:18 -07:00
Shiyan Xu	c0e1587966	[HUDI-3730] Improve meta sync class design and hierarchies (#5854 ) * [HUDI-3730] Improve meta sync class design and hierarchies (#5754) * Implements class design proposed in RFC-55 Co-authored-by: jian.feng <fengjian428@gmial.com> Co-authored-by: jian.feng <jian.feng@shopee.com>	2022-07-03 14:47:25 +05:30
bschell	fd7d25ab63	[HUDI-1176] Upgrade hudi to log4j2 (#5366 ) * Move to log4j2 cr: https://code.amazon.com/reviews/CR-71010705 * Upgrade unit tests to log4j2 * update exclusion Co-authored-by: Brandon Scheller <bschelle@amazon.com>	2022-06-28 12:54:23 -07:00
董可伦	7689e62cd9	[HUDI-4265] Deprecate useless targetTableName parameter in HoodieMultiTableDeltaStreamer (#5883 )	2022-06-17 16:57:14 +08:00
董可伦	c291b05699	[HUDI-4218] [HUDI-4218] Expose the real exception information when an exception occurs in the tableExists method (#5827 )	2022-06-15 18:10:35 +08:00
Qi Ji	4774c4248f	[HUDI-4006] failOnDataLoss on delta-streamer kafka sources (#5718 ) add new config key hoodie.deltastreamer.source.kafka.enable.failOnDataLoss when failOnDataLoss=false (current behaviour, the default), log a warning instead of seeking to earliest silently when failOnDataLoss is set, fail explicitly	2022-06-13 10:31:57 -04:00
luoyajun	0d859fe58b	[HUDI-3863] Add UT for drop partition column in deltastreamer testsuite (#5727 )	2022-06-13 10:29:32 -04:00
Shiyan Xu	5aaac21d1d	[HUDI-4224] Fix CI issues (#5842 ) - Upgrade junit to 5.7.2 - Downgrade surefire and failsafe to 2.22.2 - Fix test failures that were previously not reported - Improve azure pipeline configs Co-authored-by: liujinhui1994 <965147871@qq.com> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2022-06-12 11:44:18 -07:00
Sivabalan Narayanan	21b903fddb	[HUDI-4197] Fix Async indexer to support building FILES partition (#5766 ) - When async indexer is invoked only with "FILES" partition, it fails. Fixing it to work with Async indexer. Also, if metadata table itself is not initialized, and if someone is looking to build indexes via AsyncIndexer, first they are expected to index "FILES" partition followed by other partitions. In general, we have a limitation of building only one index at a time w/ AsyncIndexer and hence. Have added guards to ensure these conditions are met.	2022-06-06 15:47:11 -04:00
Qi Ji	7276d0eaa6	[HUDI-3670] free temp views in sql transformers (#5080 )	2022-06-01 07:35:40 -07:00
Kumud Kumar Srivatsava Tirupati	795a99ba73	[HUDI-4107] Added --sync-tool-classes config option in HoodieMultiTableDeltaStreamer (#5597 ) * added --sync-tool-classes config option in multitable delta streamer * added a testcase to assert if syncClientToolClassNames is getting picked to the deltastreamer execution context	2022-05-31 20:27:50 +05:30
wangxianghu	58014c147a	[HUDI-4160] Make database regex of MaxwellJsonKafkaSourcePostProcessor optional (#5697 )	2022-05-28 11:13:24 +04:00
Sagar Sumit	31e13db1f0	[HUDI-4023] Decouple hudi-spark from hudi-utilities-slim-bundle (#5641 )	2022-05-26 11:28:49 +05:30
Sivabalan Narayanan	10363c1412	[HUDI-4132] Fixing determining target table schema for delta sync with empty batch (#5648 )	2022-05-24 08:17:15 -04:00
Heap	47b764ec33	[HUDI-4134] Fix Method naming consistency issues in FSUtils (#5655 )	2022-05-23 15:28:48 -07:00
wangxianghu	2af98303d3	[HUDI-4122] Fix NPE caused by adding kafka nodes (#5632 )	2022-05-21 11:12:53 +08:00
Sivabalan Narayanan	7d02b1fd3c	[MINOR] Minor fixes to exception log and removing unwanted metrics flush in integ test (#5646 )	2022-05-21 07:27:35 +08:00
wqwl611	52e63b39d6	[HUDI-4097] add table info to jobStatus (#5529 ) Co-authored-by: wqwl611 <wqwl611@gmail.com>	2022-05-13 21:01:15 -04:00
Sivabalan Narayanan	5c4813f101	[HUDI-4072] Fix NULL schema for empty batches in deltastreamer (#5543 )	2022-05-13 17:56:47 +05:30
Sivabalan Narayanan	b10ca7e69f	[HUDI-4085] Fixing flakiness with parquet empty batch tests in TestHoodieDeltaStreamer (#5559 )	2022-05-11 16:02:54 -04:00
Sivabalan Narayanan	569a76a9a5	[MINOR] fixing flaky tests in deltastreamer tests (#5521 )	2022-05-07 15:37:20 -04:00
Sivabalan Narayanan	52fe1c9fae	[HUDI-3675] Adding post write termination strategy to deltastreamer continuous mode (#5073 ) - Added a postWriteTerminationStrategy to deltastreamer continuous mode. One can enable by setting the appropriate termination strategy using DeltastreamerConfig.postWriteTerminationStrategyClass. If not, continuous mode is expected to run forever. - Added one concrete impl for termination strategy as NoNewDataTerminationStrategy which shuts down deltastreamer if there is no new data to consume from source for N consecutive rounds.	2022-05-06 09:27:29 -04:00
qianchutao	d794f4fbf9	[MINOR] Optimize code logic (#5499 )	2022-05-05 09:33:06 -07:00
Y Ethan Guo	a1d82b4dc5	[MINOR] Fix CI by ignoring SparkContext error (#5468 ) Sets spark.driver.allowMultipleContexts = true when constructing Spark conf in UtilHelpers	2022-04-29 11:19:07 -07:00
watermelon12138	cacbd98687	[HUDI-3945] After the async compaction operation is complete, the task should exit. (#5391 ) Co-authored-by: y00617041 <yangxuan42@huawei.com>	2022-04-27 21:16:09 +08:00
Alexey Kudinkin	4b296f79cc	[HUDI-3935] Adding config to fallback to enabled Partition Values extraction from Partition path (#5377 )	2022-04-21 01:36:19 -07:00
Y Ethan Guo	28fdddfee0	[HUDI-3920] Fix partition path construction in metadata table validator (#5365 )	2022-04-19 19:40:09 -04:00
Sagar Sumit	4f44e6aeb5	[HUDI-3899] Drop index to delete pending index instants from timeline if applicable (#5342 ) Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-04-18 22:28:46 -04:00
Sagar Sumit	1718bcab84	[HUDI-3707] Fix target schema handling in HoodieSparkUtils while creating RDD (#5347 )	2022-04-18 13:34:04 -04:00
Sivabalan Narayanan	05dfc39c29	Fixing async clustering job test in TestHoodieDeltaStreamer (#5317 )	2022-04-18 17:38:33 +05:30
董可伦	b8e465fdfc	[MINOR] Fix typos in log4j-surefire.properties (#5212 )	2022-04-15 13:33:37 -07:00
Raymond Xu	9e8664f4d2	[HOTFIX] add missing license (#5322 ) (#5324 )	2022-04-14 12:35:20 -07:00
Vinoth Govindarajan	2d46d5287e	[HUDI-3838] Moved the getPartitionColumns logic to driver. (#5303 )	2022-04-12 18:03:00 -04:00
Vinoth Govindarajan	d16740976e	[HUDI-3838] Implemented drop partition column feature for delta streamer code path (#5294 ) * [HUDI-3838] Implemented drop partition column feature for delta streamer code path * Ensure drop partition table config is updated in hoodie.props Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-12 18:10:30 +05:30
Sagar Sumit	3d8fc78c66	[HUDI-3844] Update props in indexer based on table config (#5293 )	2022-04-11 18:16:06 -04:00
Y Ethan Guo	63a099c5b7	[HUDI-3847] Fix NPE due to null schema in HoodieMetadataTableValidator (#5284 )	2022-04-10 17:59:29 -07:00
Alexey Kudinkin	81b25c543a	[HUDI-3825] Fixing Column Stats Index updating sequence (#5267 )	2022-04-08 23:14:08 -07:00
Y Ethan Guo	cd2c346df6	[HUDI-3637] Exclude uncommitted log files from metadata table validation (#5234 )	2022-04-07 13:03:03 -07:00
Raymond Xu	e96f08f355	Moving to 0.12.0-SNAPSHOT on master branch.	2022-04-06 15:24:10 +08:00
Prashant Wason	b28f0d6ceb	[HUDI-3290] Different file formats for the partition metadata file. (#5179 ) * [HUDI-3290] Different file formats for the partition metadata file. Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition. Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset. For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting. * Code review comments - Adding a new command to migrate from text to base file formats for meta file. - Reimplementing readFromFS() to first read the text format, then base format - Avoid extra exists() checks in readFromFS() - Added unit tests, enabled parquet format across hoodie-hadoop-mr - Code cleanup, restructuring, naming consistency. * Wiring in all the other Spark code paths to respect this config - Turned on parquet meta format for COW data source tests - Removed the deltastreamer command line to keep it shorter * populate HoodiePartitionMetadata#format after readFromFS() Co-authored-by: Vinoth Chandar <vinoth@apache.org> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-04 08:08:20 -07:00
Sivabalan Narayanan	84064a9b08	[HUDI-3772] Fixing auto adjustment of lock configs for deltastreamer (#5207 )	2022-04-02 23:44:10 -07:00
Alexey Kudinkin	cc3737be50	[HUDI-3664] Fixing Column Stats Index composition (#5181 ) Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-02 17:15:52 -07:00
Y Ethan Guo	c19f505b5a	[HUDI-3784] Improve docs and logs of HoodieMetadataTableValidator (#5216 )	2022-04-02 13:16:17 -07:00
Sagar Sumit	28dafa774e	[HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693 ) - Add a new action called INDEX, whose state transition is described in the RFC. - Changes in timeline to support the new action. - Add an index planner in ScheduleIndexActionExecutor. - Add index plan executor in RunIndexActionExecutor. - Add 3 APIs in HoodieTableMetadataWriter; a) scheduleIndex: will generate an index plan based on latest completed instant, initialize file groups and add a requested INDEX instant, b) index: executes the index plan and also takes care of writes that happened after indexing was requested, c) dropIndex: will drop index by removing the given metadata partition. - Add 2 new table configs to serve as the source of truth for inflight and completed indexes. - Support upgrade/downgrade taking care of the newly added configs. - Add tool to trigger indexing in HoodieIndexer. - Handle corner cases related to partial failures. - Abort gracefully after deleting partition and instant. - Handle other actions in timeline to consider before catching up	2022-04-01 01:33:12 +05:30
YueZhang	2b60641d17	[HUDI-3635] Fix HoodieMetadataTableValidator around comparison of partition path listing (#5100 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-03-30 17:23:37 -04:00
YueZhang	eae8488536	[HUDI-3647] HoodieMetadataTableValidator: check MDT was initialized at first (#5152 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-03-30 17:18:08 -04:00
harshal	7fa363923c	[HUDI-3745] Support for spark datasource options in S3EventsHoodieIncrSource (#5170 )	2022-03-30 01:34:49 -04:00
Sivabalan Narayanan	4fed8dd319	[HUDI-3485] Adding scheduler pool configs for async clustering (#5043 )	2022-03-29 21:27:45 -04:00
Alexey Kudinkin	e5a2baeed0	[HUDI-3549] Removing dependency on "spark-avro" (#4955 ) Hudi will be taking on promise for it bundles to stay compatible with Spark minor versions (for ex 2.4, 3.1, 3.2): meaning that single build of Hudi (for ex "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc) To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root-cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR). Instead of bundling "spark-avro" as dependency, we will be copying over some of the classes Hudi depends on and maintain them along the Hudi code-base to make sure we're able to provide for the aforementioned guarantee. To workaround arising compatibility issues we will be applying local patches to guarantee compatibility of Hudi bundles w/in the Spark minor version branches. Following Hudi modules to Spark minor branches is currently maintained: "hudi-spark3" -> 3.2.x "hudi-spark3.1.x" -> 3.1.x "hudi-spark2" -> 2.4.x Following classes hierarchies (borrowed from "spark-avro") are maintained w/in these Spark-specific modules to guarantee compatibility with respective minor version branches: AvroSerializer AvroDeserializer AvroUtils Each of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules. SchemaConverters class in turn is shared across all those modules given its relative stability (there're only cosmetical changes from 2.4.4 to 3.2.1). All of the aforementioned classes have their corresponding scope of visibility limited to corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure broader code-base does not become dependent on them and instead relies on facades abstracting them. Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in aforementioned minor versions branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI. Brief change log: - Removing spark-avro bundling from Hudi by default - Scaffolded Spark 3.2.x hierarchy - Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy - Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy - Moved ExpressionCodeGen,ExpressionPayload into hudi-spark module - Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0 - Modified bot.yml to build full matrix of support Spark versions - Removed "spark-avro" dependency from all modules - Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.	2022-03-29 14:44:47 -04:00

456 Commits