lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
YueZhang	020786a5f9	[HUDI-3451] Delete metadata table when the write client disables MDT (#5186 ) * Add checks for metadata table init to avoid possible out-of-sync * Revise the logic to reuse existing table config * Revise docs and naming Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2022-04-02 16:31:06 +05:30
ForwardXu	98b4e9796e	[HUDI-3406] Rollback incorrectly relying on FS listing instead of Com… (#4957 ) * [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata * fix comments	2022-04-01 10:01:41 +08:00
Sagar Sumit	a048e940fd	[HUDI-3743] Support DELETE_PARTITION for metadata table (#5169 ) In order to drop any metadata partition (index), we can reuse the DELETE_PARTITION operation in metadata table. Subsequent to this, we can support drop index (with table config update) for async metadata indexer. - Add a new API in HoodieTableMetadataWriter - Currently only supported for Spark metadata writer	2022-03-31 21:29:17 -04:00
Sivabalan Narayanan	73a21092f8	[HUDI-3732] Fixing rollback validation (#5157 ) * Fixing rollback validation * Adding tests	2022-03-31 17:25:24 +05:30
ForwardXu	80011df995	[HUDI-3135] Make delete partitions lazy to be executed by the cleaner (#4489 ) As of now, delete partitions ensures that all file groups are deleted, but the partition itself is not deleted. So, get all partitions might return the deleted partitions as well, but no data will be served since all file groups are deleted. This patch fixes that by letting the cleaner take care of deleting a partition once all file groups pertaining to that partition are deleted. - Fixed the CleanPlanActionExecutor to return meta info about the list of partitions to be deleted. If there are no valid file groups for a partition, the clean planner will include the partition to be deleted. - Fixed HoodieCleanPlan avro schema to include the list of partitions to be deleted - CleanActionExecutor is fixed to delete partitions if any (as per clean plan) - Same info is added to HoodieCleanMetadata - Metadata table, when applying clean metadata, will check for partitions to be deleted and will update the "all_partitions" record for the deleted partitions. Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-03-31 03:35:39 -04:00
Yuwei XIAO	ce45f7f129	[HUDI-3692] MetadataFileSystemView includes compaction in timeline (#5110 )	2022-03-31 02:24:59 -04:00
Sivabalan Narayanan	4569734d60	[HUDI-3713] Guarding archival for multi-writer (#5138 )	2022-03-31 01:44:31 -04:00
YueZhang	2dbb273d26	[HUDI-3721] Delete MDT if necessary when trigger rollback to savepoint (#5173 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-03-30 20:26:37 -07:00
Sivabalan Narayanan	d80c80699f	[MINOR] Fixing flakiness in TestHoodieSparkMergeOnReadTableRollback.testRollbackWithDeltaAndCompactionCommit (#5183 )	2022-03-30 22:07:22 -04:00
Alexey Kudinkin	8b796e9686	[HUDI-3653] Cleaning up bespoke Column Stats Index implementation (#5062 )	2022-03-30 10:01:43 -07:00
Alexey Kudinkin	e5a2baeed0	[HUDI-3549] Removing dependency on "spark-avro" (#4955 ) Hudi will be taking on a promise for its bundles to stay compatible with Spark minor versions (for ex 2.4, 3.1, 3.2): meaning that a single build of Hudi (for ex "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc) To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root-cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR). Instead of bundling "spark-avro" as a dependency, we will be copying over some of the classes Hudi depends on and maintaining them along the Hudi code-base to make sure we're able to provide the aforementioned guarantee. To work around arising compatibility issues we will be applying local patches to guarantee compatibility of Hudi bundles w/in the Spark minor version branches. The following mapping of Hudi modules to Spark minor branches is currently maintained: "hudi-spark3" -> 3.2.x "hudi-spark3.1.x" -> 3.1.x "hudi-spark2" -> 2.4.x The following class hierarchies (borrowed from "spark-avro") are maintained w/in these Spark-specific modules to guarantee compatibility with the respective minor version branches: AvroSerializer AvroDeserializer AvroUtils Each of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules. The SchemaConverters class in turn is shared across all those modules given its relative stability (there are only cosmetic changes from 2.4.4 to 3.2.1). All of the aforementioned classes have their scope of visibility limited to the corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure the broader code-base does not become dependent on them and instead relies on facades abstracting them. Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in the aforementioned minor version branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI. Brief change log: - Removing spark-avro bundling from Hudi by default - Scaffolded Spark 3.2.x hierarchy - Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy - Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy - Moved ExpressionCodeGen,ExpressionPayload into hudi-spark module - Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0 - Modified bot.yml to build full matrix of supported Spark versions - Removed "spark-avro" dependency from all modules - Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.	2022-03-29 14:44:47 -04:00
Y Ethan Guo	4ed84b216d	[HUDI-3720] Fix the logic of reattempting pending rollback (#5148 )	2022-03-28 14:54:31 -04:00
Y Ethan Guo	484b3407e0	[HUDI-3604] Adjust the order of timeline changes in rollbacks (#5114 )	2022-03-26 22:37:44 -07:00
Alexey Kudinkin	189d5297b8	[HUDI-3709] Fixing `ParquetWriter` impls not respecting Parquet Max File Size limit (#5129 )	2022-03-26 17:51:36 -04:00
Danny Chan	0c09a973fb	[HUDI-3435] Do not throw exception when instant to rollback does not exist in metadata table active timeline (#4821 )	2022-03-26 11:42:54 +08:00
Danny Chan	8896864d7b	[HUDI-3678] Fix record rewrite of create handle when 'preserveMetadata' is true (#5088 )	2022-03-25 11:48:50 +08:00
Y Ethan Guo	eaa4c4f2e2	[HUDI-1180] Upgrade HBase to 2.4.9 (#5004 ) Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-03-24 19:04:53 -07:00
Y Ethan Guo	9b3dd2e0b7	[HUDI-3624] Check all instants before starting a commit in metadata table (#5098 )	2022-03-24 17:13:58 -07:00
Y Ethan Guo	9b6e138af2	[HUDI-3640] Set SimpleKeyGenerator as default in 2to3 table upgrade for Spark engine (#5075 )	2022-03-21 20:35:06 -04:00
Pratyaksh Sharma	ca0931d332	[HUDI-1436]: Provide an option to trigger clean every nth commit (#4385 ) - Provided option to trigger clean every nth commit with default number of commits as 1 so that existing users are not affected. Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-03-21 20:06:30 -04:00
Alexey Kudinkin	1b6e201160	[HUDI-3663] Fixing Column Stats index to properly handle first Data Table commit (#5070 ) * Fixed metadata conversion util to extract schema from `HoodieCommitMetadata` * Fixed failure to fetch columns to index in empty table * Abort indexing seq in case there are no columns to index * Fallback to index at least primary key columns, in case no writer schema could be obtained to index all columns * Fixed `getRecordFields` incorrectly ignoring default value * Make sure Hudi metadata fields are also indexed	2022-03-20 10:24:13 +05:30
Raymond Xu	7446ff95a7	[HUDI-2439] Replace RDD with HoodieData in HoodieSparkTable and commit executors (#4856 ) - Adopt HoodieData in Spark action commit executors - Make Spark independent DeleteHelper, WriteHelper, MergeHelper in hudi-client-common - Make HoodieTable in WriteClient APIs have raw type to decouple with Client's generic types	2022-03-17 04:17:56 -07:00
Y Ethan Guo	5ba2d9ab2f	[HUDI-3494] Consider triggering condition of MOR compaction during archival (#4974 )	2022-03-17 01:28:11 -04:00
Alexey Kudinkin	5e8ff8d793	[HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index (#4948 )	2022-03-15 10:38:36 -07:00
liujinhui	e60acc1258	[HUDI-3583] Fix MarkerBasedRollbackStrategy NoSuchElementException (#4984 ) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2022-03-12 23:00:50 -08:00
Sivabalan Narayanan	e7bb0413af	[HUDI-3556] Re-use rollback instant for rolling back of clustering and compaction if rollback failed mid-way (#4971 )	2022-03-11 18:40:13 -05:00
Alexey Kudinkin	034addaef5	[HUDI-3396] Make sure `BaseFileOnlyViewRelation` only reads projected columns (#4818 ) NOTE: This change is the first part of a series to clean up Hudi's Spark DataSource related implementations, making sure that there's minimal code duplication among them and that the implementations are consistent and performant. This PR makes sure that BaseFileOnlyViewRelation only reads projected columns and avoids unnecessary serde from Row to InternalRow. Brief change log: - Introduced HoodieBaseRDD as a base for all custom RDD impls - Extracted common fields/methods to HoodieBaseRelation - Cleaned up and streamlined HoodieBaseFileViewOnlyRelation - Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions	2022-03-09 21:45:25 -05:00
Sagar Sumit	575bc63468	[HUDI-3356][HUDI-3203] HoodieData for metadata index records; BloomFilter construction from index based on the type param (#4848 ) Rework of #4761 This diff introduces the following changes: - Write stats are converted to metadata index records during the commit. They now use the HoodieData type so that the record generation scales up with needs. - Metadata index init support for bloom filter and column stats partitions. - When building the BloomFilter from the index records, the type param stored in the payload is used instead of a hardcoded type. - Delta writes can change column ranges, and the column stats index needs to be properly updated with new ranges to be consistent with the table dataset. This fix adds column stats index update support for the delta writes. Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com>	2022-03-08 10:39:04 -05:00
Alexey Kudinkin	a66fd40692	[HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS (#4739 ) - This change makes sure MT records are updated appropriately on HDFS: previously, after Log File append operations, MT records were updated w/ just the size of the deltas being appended to the original files, which was found to be the cause of issues in case of Rollbacks, which were instead updating MT with records bearing the full file-size. - To hedge against similar issues going forward, this PR removes this discrepancy and streamlines the flow so that the MT always ingests records bearing full file-sizes.	2022-03-07 15:38:27 -05:00
Aditya Tiwari	051ad0b033	[HUDI-3130] Fixing Hive getSchema for RT tables addressing different partitions having different schemas (#4468 ) * Fixing Hive getSchema for RT tables * Addressing feedback * temp diff * fixing tests after spark datasource read support for metadata table is merged to master * Adding multi-partition schema evolution tests to HoodieRealTimeRecordReader Co-authored-by: Aditya Tiwari <aditya.tiwari@flipkart.com> Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-03-06 07:51:35 +05:30
Sivabalan Narayanan	6a46130037	[HUDI-2761] Fixing timeline server for repeated refreshes (#4812 ) * Fixing timeline server for repeated refreshes	2022-03-05 10:04:16 +08:00
Sivabalan Narayanan	876a891979	[HUDI-3544] Fixing "populate meta fields" update to metadata table (#4941 ) * Fixing populateMeta fields update to metadata table * Fix checkstyle violations Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-03-03 17:02:25 +05:30
Alexey Kudinkin	85f47b53df	[HUDI-3469] Refactor `HoodieTestDataGenerator` to provide for reproducible Builds (#4866 )	2022-03-01 22:15:26 -08:00
Bo Cui	3fdc9332e5	[HUDI-3516] Implement record iterator for HoodieDataBlock (#4909 ) * Use iterator to avoid eager materialization to be memory friendly	2022-03-02 10:19:36 +08:00
Sivabalan Narayanan	4a59876c8b	[HUDI-2917] rollback insert data appended to log file when using Hbase Index (#4840 ) Co-authored-by: guanziyue <guanziyue@gmail.com>	2022-02-28 08:13:17 -05:00
Raymond Xu	c77b2591d0	[HUDI-2439] Remove SparkBoundedInMemoryExecutor (#4860 )	2022-02-26 08:02:12 -05:00
苏承祥	92cdc5987a	[HUDI-3515] Making rdd unpersist optional at the end of writes (#4898 ) Co-authored-by: 苏承祥 <sucx@tuya.com>	2022-02-25 11:30:10 -05:00
YueZhang	742810070b	[HUDI-3421] Pending clustering may break AbstractTableFileSystemView#getxxBaseFile() (#4810 )	2022-02-25 16:46:27 +05:30
Alexey Kudinkin	85e8a5c4de	[HUDI-1296] Support Metadata Table in Spark Datasource (#4789 ) * Bootstrapping initial support for Metadata Table in Spark Datasource - Consolidated Avro/Row conversion utilities to center around Spark's AvroDeserializer ; removed duplication - Bootstrapped HoodieBaseRelation - Updated HoodieMergeOnReadRDD to be able to handle Metadata Table - Modified MOR relations to be able to read different Base File formats (Parquet, HFile)	2022-02-24 16:23:13 -05:00
Prashant Wason	0dee8edc97	[HUDI-2925] Fix duplicate cleaning of same files when unfinished clean operations are present using a config. (#4212 ) Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-02-21 21:53:03 -05:00
Pratyaksh Sharma	bf16bc122a	[HUDI-349]: Added new cleaning policy based on number of hours (#3646 )	2022-02-21 09:04:42 -05:00
Sivabalan Narayanan	17cb5cb433	[HUDI-3432] Fixing restore with metadata enabled (#4849 ) * Fixing restore with metadata enabled * Fixing test failures	2022-02-21 18:25:30 +05:30
Sagar Sumit	ed106f671e	[HUDI-2809] Introduce a checksum mechanism for validating hoodie.properties (#4712 ) * Fix dependency conflict * Fix repairs command * Implement putIfAbsent for DDB lock provider * Add upgrade step and validate while fetching configs * Validate checksum for latest table version only while fetching config * Move generateChecksum to BinaryUtil * Rebase and resolve conflict * Fix table version check	2022-02-18 10:17:06 +05:30
Y Ethan Guo	9a05940a74	[HUDI-3366] Remove hardcoded logic of disabling metadata table in tests (#4792 )	2022-02-15 16:41:47 -05:00
Yann Byron	cb6ca7f0d1	[HUDI-3204] fix problem that spark on TimestampKeyGenerator has no re… (#4714 )	2022-02-14 23:38:38 -05:00
Raymond Xu	27bd7b538e	[HUDI-1576] Make archiving an async service (#4795 )	2022-02-14 21:15:06 -05:00
YueZhang	76e2faa28d	[HUDI-3370] The files recorded in the commit may not match the actual ones for MOR Compaction (#4753 ) * use HoodieCommitMetadata to replace writeStatuses computation Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2022-02-14 11:12:52 +08:00
Y Ethan Guo	6aba00e84f	[MINOR] Fix typos in Spark client related classes (#4781 )	2022-02-13 06:41:58 -08:00
Sivabalan Narayanan	e7ec3a82dc	[HUDI-2432] Adding restore.requested instant and restore plan for restore action (#4605 ) - This adds a restore plan and serializes it to restore.requested meta file in timeline. This also means that we are introducing schedule and execution phases for restore which was not present before.	2022-02-10 08:06:23 -05:00
Sivabalan Narayanan	0ababcfaa7	[HUDI-1847] Adding inline scheduling support for spark datasource path for compaction and clustering (#4420 ) - This adds support in spark-datasource to just schedule table services inline so that users can leverage async execution w/o the need for lock service providers.	2022-02-10 08:04:55 -05:00

245 Commits