lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
董可伦	75f847691f	[HUDI-4001] Filter the properties should not be used when create table for Spark SQL (#5495 )	2022-05-16 09:50:29 +08:00
Sivabalan Narayanan	0cec955fa2	[HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests (#5501 ) - Added pure immutable test yamls to integ test framework. Added SparkBulkInsertNode as part of it. - Added delete_partition support to integ test framework using spark-datasource. - Added a single yaml to test all non core write operations (insert overwrite, insert overwrite table and delete partitions) - Added tests for 4 concurrent spark datasource writers (multi-writer tests). - Fixed readme w/ sample commands for multi-writer.	2022-05-12 21:01:55 -04:00
Sivabalan Narayanan	6285a239a3	[HUDI-3995] Making perf optimizations for bulk insert row writer path (#5462 ) - Avoid using udf for key generator for SimpleKeyGen and NonPartitionedKeyGen. - Fixed NonPartitioned Key generator to directly fetch record key from row rather than involving GenericRecord. - Other minor fixes around using static values instead of looking up hashmap.	2022-05-09 12:40:22 -04:00
KnightChess	6ec039ba42	[MINOR] Update alter rename command class type for pattern matching (#5381 )	2022-04-26 19:39:51 -07:00
Sivabalan Narayanan	762623a15c	[HUDI-3972] Fixing hoodie.properties/tableConfig for no preCombine field with writes (#5424 ) Fixed instantiation of new table to set the null for preCombine if not explicitly set by the user.	2022-04-25 23:03:10 -04:00
Sivabalan Narayanan	7523542c1d	[HUDI-3947] Fixing Hive conf usage in HoodieSparkSqlWriter (#5401 )	2022-04-22 22:20:05 -04:00
Y Ethan Guo	c4bc2deea0	[HUDI-3936] Fix projection for a nested field as pre-combined key (#5379 ) This PR fixes the projection logic around a nested field which is used as the pre-combined key field. The fix is to only check and append the root level field for projection, i.e., "a", for a nested field "a.b.c" in the mandatory columns. - Changes the logic to check and append the root level field for a required nested field in the mandatory columns in HoodieBaseRelation.appendMandatoryColumns	2022-04-21 20:17:57 -04:00
Alexey Kudinkin	4b296f79cc	[HUDI-3935] Adding config to fallback to enabled Partition Values extraction from Partition path (#5377 )	2022-04-21 01:36:19 -07:00
Alexey Kudinkin	f7544e23ac	[HUDI-3204] Fixing partition-values being derived from partition-path instead of source columns (#5364 ) - Scaffolded `Spark24HoodieParquetFileFormat` extending `ParquetFileFormat` and overriding the behavior of adding partition columns to every row - Amended `SparkAdapter`s `createHoodieParquetFileFormat` API to be able to configure whether to append partition values or not - Fallback to append partition values in cases when the source columns are not persisted in data-file - Fixing HoodieBaseRelation incorrectly handling mandatory columns	2022-04-20 19:30:27 +08:00
Alexey Kudinkin	81bf771e56	[HUDI-3902] Fallback to `HadoopFsRelation` in cases non-involving Schema Evolution (#5352 ) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-19 10:40:20 -07:00
Alexey Kudinkin	7ecb47cd21	[HUDI-3895] Fixing file-partitioning seq for base-file only views to make sure we bucket the files efficiently (#5337 )	2022-04-18 16:06:52 -04:00
Sivabalan Narayanan	e8ab915aff	[MINOR] Removing invalid code to close parquet reader iterator (#5182 )	2022-04-15 14:50:07 -04:00
ForwardXu	6621f3cdbb	[HUDI-3845] Fix delete mor table's partition with urlencode's error (#5282 )	2022-04-14 01:49:00 -07:00
ForwardXu	44b3630b5d	[HUDI-3826] Make truncate partition use delete_partition operation (#5272 ) Make truncate partition and drop partition behave as drop partition with purge, which delete all records via Hudi DELETE_PARTITION; partition removed from metastore	2022-04-14 00:53:05 -07:00
Alexey Kudinkin	434e782b7d	[HUDI-3867] Disable Data Skipping by default (#5306 )	2022-04-13 11:21:12 +05:30
Alexey Kudinkin	458fdd5611	[HUDI-3841] Fixing Column Stats in the presence of Schema Evolution (#5275 ) Currently, Data Skipping is not handling correctly the case when column-stats are not aligned and, for ex, some of the (column, file) combinations are missing from the CSI. This could occur in different scenarios (schema evolution, CSI config changes), and has to be handled properly when we're composing CSI projection for Data Skipping. This PR addresses that. - Added appropriate aligning for the transposed CSI projection	2022-04-11 15:45:53 -04:00
Alexey Kudinkin	976840e8eb	[HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs (#5244 ) Addressing the problem of Data Skipping not respecting Metadata Table configs which might differ b/w write/read paths. More details could be found in HUDI-3812. - Fixing Data Skipping configuration to respect MT configs (on the Read path) - Tightening up DS handling of cases when no top-level columns are in the target query - Enhancing tests to cover all possible case	2022-04-10 13:43:47 -04:00
KnightChess	7a6272fba1	[HUDI-3781] fix spark delete sql can not delete record (#5215 )	2022-04-08 14:26:40 +08:00
Alexey Kudinkin	d43b4cd95e	[HUDI-3739] Fix handling of the `isNotNull` predicate in Data Skipping (#5224 ) - Fix handling of the isNotNull predicate in Data Skipping	2022-04-06 15:17:36 -04:00
rkkalluri	939b3d1b07	[HUDI-3726] Switching from non-partitioned to partitioned key gen does not throw any exception (#5205 )	2022-04-06 13:35:32 -04:00
Alexey Kudinkin	9e87d164b3	[HUDI-3760] Adding capability to fetch Metadata Records by prefix (#5208 ) - Adding capability to fetch Metadata Records by key prefix so that Data Skipping could fetch only Column Stats - Index records pertaining to the columns being queried by, instead of reading out whole Index. - Fixed usages of HFileScanner in HFileReader. A few code paths use the cached scanner if available. Other code paths use their own HFileScanner w/ positional read. Brief change log - Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading t/h Spark DS - Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader - Wiring key-prefix lookup t/h LogRecordScanner impls - Cleaning up HoodieHFileReader impl Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-06 12:11:08 -04:00
Raymond Xu	e96f08f355	Moving to 0.12.0-SNAPSHOT on master branch.	2022-04-06 15:24:10 +08:00
Alexey Kudinkin	8baeb816d5	[HUDI-3723] Fixed stack overflows in Record Iterators (#5235 )	2022-04-05 23:12:13 -04:00
Yann Byron	3195f51562	[HUDI-3748] write and select hudi table when enable hoodie.datasource.write.drop.partition.columns (#5201 )	2022-04-05 01:31:41 -07:00
ForwardXu	3449e86989	[HUDI-3780] improve drop partitions (#5178 )	2022-04-05 11:52:33 +08:00
Prashant Wason	b28f0d6ceb	[HUDI-3290] Different file formats for the partition metadata file. (#5179 ) * [HUDI-3290] Different file formats for the partition metadata file. Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition. Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset. For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting. * Code review comments - Adding a new command to migrate from text to base file formats for meta file. - Reimplementing readFromFS() to first read the text format, then base format - Avoid extra exists() checks in readFromFS() - Added unit tests, enabled parquet format across hoodie-hadoop-mr - Code cleanup, restructuring, naming consistency. * Wiring in all the other Spark code paths to respect this config - Turned on parquet meta format for COW data source tests - Removed the deltastreamer command line to keep it shorter * populate HoodiePartitionMetadata#format after readFromFS() Co-authored-by: Vinoth Chandar <vinoth@apache.org> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-04 08:08:20 -07:00
Sivabalan Narayanan	84064a9b08	[HUDI-3772] Fixing auto adjustment of lock configs for deltastreamer (#5207 )	2022-04-02 23:44:10 -07:00
Alexey Kudinkin	cc3737be50	[HUDI-3664] Fixing Column Stats Index composition (#5181 ) Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-02 17:15:52 -07:00
xiarixiaoyao	444ff496a4	[RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark (#4910 ) * [HUDI-2560] introduce id_based schema to support full schema evolution. * add test for FileBasedInternalSchemaStorageManager and rebase code * add support for change column type and fix some test case * fix some bugs encountered in the production env and delete useless code * fix test error * rebase code * fixed some nested schema change bugs * [HUDI-2429][Stacked On HUDI-2560]Support full schema evolution for spark * [use dummyInternalSchema instead of null] * add support for spark3.1.x * remove support for spark3.1.x, since some compile fail * support spark3.1.x * rebase and prepare solve all comments * address all comments * rebase code * fixed the count() bug try to get internalSchema by parser commit file/history file directly, not use metaclient which is time cost address some comments * fixed all comments * fix new comments * rebase code,fix UT failed * fixed mistake * rebase code ,fixed new comments * rebase code , and prepare for address new comments * address commits * address new comments * fix new issues * control fallback original write logical	2022-04-01 13:20:24 -07:00
Sagar Sumit	a048e940fd	[HUDI-3743] Support DELETE_PARTITION for metadata table (#5169 ) In order to drop any metadata partition (index), we can reuse the DELETE_PARTITION operation in metadata table. Subsequent to this, we can support drop index (with table config update) for async metadata indexer. - Add a new API in HoodieTableMetadataWriter - Current only supported for Spark metadata writer	2022-03-31 21:29:17 -04:00
liuhe0702	1da196c1e8	[HUDI-2777] Improve HoodieSparkSqlWriter write performance (#5187 )	2022-03-31 12:48:47 -07:00
xiarixiaoyao	7889c7852f	[HUDI-3729][SPARK] fixed the perf regression by enabling vectorizeReader for parquet file (#5168 ) * [MINOR][SPARK] fixed the perf regression by enabling vectorizeReader for parquet file * address comments * add perf result	2022-03-31 20:09:26 +08:00
ForwardXu	80011df995	[HUDI-3135] Make delete partitions lazy to be executed by the cleaner (#4489 ) As of now, delete partitions will ensure all file groups are deleted, but the partition as such is not deleted. So, get all partitions might be returning the deleted partitions as well. but no data will be served since all file groups are deleted. With this patch, we are fixing it. We are letting cleaner take care of deleting the partitions when all file groups pertaining to a partitions are deleted. - Fixed the CleanPlanActionExecutor to return meta info about list of partitions to be deleted. If there are no valid file groups for a partition, clean planner will include the partition to be deleted. - Fixed HoodieCleanPlan avro schema to include the list of partitions to be deleted - CleanActionExecutor is fixed to delete partitions if any (as per clean plan) - Same info is added to HoodieCleanMetadata - Metadata table when applying clean metadata, will check for partitions to be deleted and will update the "all_partitions" record for the deleted partitions. Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-03-31 03:35:39 -04:00
KnightChess	2c4554fada	[HUDI-3750] Fix NPE when build HoodieFileIndex (#5134 ) Co-authored-by: wulingqi <wulingqi@baijiahulian.com>	2022-03-30 19:19:05 -07:00
Sivabalan Narayanan	4fed8dd319	[HUDI-3485] Adding scheduler pool configs for async clustering (#5043 )	2022-03-29 21:27:45 -04:00
ForwardXu	941c254c33	[HUDI-2520] Fix CTAS statement issue when sync to hive (#5145 )	2022-03-29 12:25:31 -07:00
Alexey Kudinkin	e5a2baeed0	[HUDI-3549] Removing dependency on "spark-avro" (#4955 ) Hudi will be taking on promise for it bundles to stay compatible with Spark minor versions (for ex 2.4, 3.1, 3.2): meaning that single build of Hudi (for ex "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc) To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root-cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR). Instead of bundling "spark-avro" as dependency, we will be copying over some of the classes Hudi depends on and maintain them along the Hudi code-base to make sure we're able to provide for the aforementioned guarantee. To workaround arising compatibility issues we will be applying local patches to guarantee compatibility of Hudi bundles w/in the Spark minor version branches. Following Hudi modules to Spark minor branches is currently maintained: "hudi-spark3" -> 3.2.x "hudi-spark3.1.x" -> 3.1.x "hudi-spark2" -> 2.4.x Following classes hierarchies (borrowed from "spark-avro") are maintained w/in these Spark-specific modules to guarantee compatibility with respective minor version branches: AvroSerializer AvroDeserializer AvroUtils Each of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules. SchemaConverters class in turn is shared across all those modules given its relative stability (there're only cosmetical changes from 2.4.4 to 3.2.1). All of the aforementioned classes have their corresponding scope of visibility limited to corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure broader code-base does not become dependent on them and instead relies on facades abstracting them. 
Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in aforementioned minor versions branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI. Brief change log: - Removing spark-avro bundling from Hudi by default - Scaffolded Spark 3.2.x hierarchy - Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy - Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy - Moved ExpressionCodeGen,ExpressionPayload into hudi-spark module - Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0 - Modified bot.yml to build full matrix of support Spark versions - Removed "spark-avro" dependency from all modules - Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.	2022-03-29 14:44:47 -04:00
Alexey Kudinkin	fcb003ec76	[HUDI-3731] Fixing Column Stats Index record Merging sequence missing `columnName` (#5159 ) * Added `DataSkippingFailureMode` to control how DS handles failures in the flow (either "strict", when exception would be thrown, or "fallback" when it will just fallback to the full-scan) * Make sure tests execute in `DataSkippingFailureMode.Strict` * Fixed Column Stats Index record merging sequence missing `columnName`	2022-03-29 21:09:56 +05:30
leesf	8f8a8158e2	[HUDI-2520] Fix drop table issue when sync to Hive (#5143 )	2022-03-28 19:34:12 -07:00
ForwardXu	72e0b52b18	[HUDI-3722] Fix truncate hudi table's error (#5140 )	2022-03-29 09:44:18 +08:00
Raymond Xu	6ccbae4d2a	[HUDI-2757] Implement Hudi AWS Glue sync (#5076 )	2022-03-28 14:54:59 -04:00
Sivabalan Narayanan	f2a93ead3b	[HUDI-3724] Fixing closure of ParquetReader (#5141 )	2022-03-28 09:36:15 +08:00
Alexey Kudinkin	51034fecf1	[HUDI-3396] Refactoring `MergeOnReadRDD` to avoid duplication, fetch only projected columns (#4888 )	2022-03-25 09:32:03 -07:00
Alexey Kudinkin	8b38ddedc2	[HUDI-3594] Supporting Composite Expressions over Data Table Columns in Data Skipping flow (#4996 )	2022-03-24 22:27:15 -07:00
Rajesh Mahindra	5f570ea151	[HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs (#4175 ) - Refactor hive sync tool / config to use reflection and standardize configs Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-03-21 22:56:31 -04:00
Alexey Kudinkin	099c2c099a	[HUDI-3457] Refactored Spark DataSource Relations to avoid code duplication (#4877 ) Refactoring Spark DataSource Relations to avoid code duplication. Following Relations were in scope: - BaseFileOnlyViewRelation - MergeOnReadSnapshotRelation - MergeOnReadIncrementalRelation	2022-03-18 22:32:16 -07:00
RexAn	9ece77561a	[MINOR] HoodieFileScanRDD could print null path (#5056 ) Co-authored-by: Rex An <bonean131@gmail.com>	2022-03-17 12:53:45 -07:00
Alexey Kudinkin	5e8ff8d793	[HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index (#4948 )	2022-03-15 10:38:36 -07:00
huberylee	56cb49485d	[HUDI-3567] Refactor HoodieCommonUtils to make code more reasonable (#4982 )	2022-03-11 13:23:19 -08:00
Alexey Kudinkin	034addaef5	[HUDI-3396] Make sure `BaseFileOnlyViewRelation` only reads projected columns (#4818 ) NOTE: This change is first part of the series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them, implementations are consistent and performant This PR is making sure that BaseFileOnlyViewRelation only reads projected columns as well as avoiding unnecessary serde from Row to InternalRow Brief change log - Introduced HoodieBaseRDD as a base for all custom RDD impls - Extracted common fields/methods to HoodieBaseRelation - Cleaned up and streamlined HoodieBaseFileViewOnlyRelation - Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions	2022-03-09 21:45:25 -05:00


156 Commits