Currently, when running Hudi queries with Spark, the external configurations
are not loaded. For example, if customers enabled metadata listing in their
global config file, queries would still run without the metadata feature
enabled. This PR fixes the issue by loading global configs during the Hudi
reading phase (see the sketch below).
Co-authored-by: Wenning Ding <wenningd@amazon.com>
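For illustration, a minimal sketch of the behavior after the fix, assuming
the usual global config location resolved via the HUDI_CONF_DIR environment
variable (the table path is a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hudi-read").getOrCreate()

    // Global defaults, e.g. in $HUDI_CONF_DIR/hudi-defaults.conf:
    //   hoodie.metadata.enable=true
    // With this fix the read path picks them up, so metadata-based listing
    // is used without re-specifying the option on every query:
    val df = spark.read.format("hudi").load("s3://bucket/path/to/table")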
Currently, all Hudi relations suffer a performance gap relative to Spark's
HadoopFsRelation. The reason for that is the SchemaPruning optimization rule
(pruning nested schemas), which is unfortunately predicated on the use of
HadoopFsRelation, meaning it is not applied when any other relation is used.
This change ports the rule to Hudi's relations (MOR, Incremental, etc.) by
leveraging the HoodieSparkSessionExtensions mechanism to inject a modified
version of the original SchemaPruning rule, adapted to work with Hudi's
custom relations (see the sketch after the list below).
- Added customOptimizerRules to HoodieAnalysis
- Added the NestedSchemaPruning Spark Optimizer rule
- Handle Spark's Optimizer pruned data schema (to effectively prune nested schemas)
- Enable HoodieClientTestHarness to inject HoodieSparkSessionExtensions
- Injecting Spark Session extensions for TestMORDataSource, TestCOWDataSource
- Disabled fallback to HadoopFsRelation
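For context, a minimal sketch of how the injected rules get activated;
nested-schema pruning then lets Parquet read only the referenced leaf
instead of the whole struct (table name and query are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hudi-nested-pruning")
      // Injects Hudi's analyzer/optimizer rules, incl. NestedSchemaPruning
      .config("spark.sql.extensions", "org.apache.hudi.HoodieSparkSessionExtensions")
      .getOrCreate()

    // With the rule applied, only the leaf a.b is read from Parquet
    // instead of materializing the full struct `a`:
    spark.sql("SELECT a.b FROM hudi_tbl").show()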
- Move clean-related configuration to HoodieCleanConfig
- Move archival-related configuration to HoodieArchivalConfig
- Move hoodie.compaction.payload.class to HoodiePayloadConfig (see the sketch below)
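A hedged sketch of where these settings now surface on the write path; the
option keys are the long-standing Hudi config names, while the input path,
table name, and retention values are placeholders:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("hudi-write").getOrCreate()
    val df = spark.read.format("parquet").load("s3://bucket/input") // placeholder input

    df.write.format("hudi")
      .option("hoodie.table.name", "tbl")
      .option("hoodie.cleaner.commits.retained", "10") // grouped under HoodieCleanConfig
      .option("hoodie.keep.min.commits", "20")         // grouped under HoodieArchivalConfig
      .option("hoodie.keep.max.commits", "30")         // grouped under HoodieArchivalConfig
      .option("hoodie.compaction.payload.class",       // grouped under HoodiePayloadConfig
        "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload")
      .mode(SaveMode.Append)
      .save("s3://bucket/path/to/table")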
* [HUDI-4276] Reconcile schema: inject null values for missing fields and add new fields.
* Address review comments
Co-authored-by: public (bdcee5037027) <mengtao0326@qq.com>
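A hedged sketch of the behavior this enables, using the existing
reconcile-schema write option (paths are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hudi-reconcile").getOrCreate()
    val batchDf = spark.read.format("parquet").load("s3://bucket/incoming") // placeholder

    // Fields missing from the batch get null injected; genuinely new fields
    // are added to the table schema instead of failing the write.
    batchDf.write.format("hudi")
      .option("hoodie.datasource.write.reconcile.schema", "true")
      .mode("append")
      .save("s3://bucket/path/to/table")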
* [HUDI-3730] Improve meta sync class design and hierarchies (#5754)
* Implements class design proposed in RFC-55
Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
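As a rough illustration of the RFC-55 hierarchy, a sketch of a custom sync
tool; the base-class constructor and method shapes here are assumptions,
not verified signatures:

    import java.util.Properties
    import org.apache.hudi.sync.common.HoodieSyncTool

    // Catalog integrations extend the common HoodieSyncTool base class.
    class MyCatalogSyncTool(props: Properties) extends HoodieSyncTool(props) {
      override def syncHoodieTable(): Unit = {
        // Push the latest table schema and partitions to the external catalog.
      }
    }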
Fixed the sequence determining whether Parquet's legacy-format writing
property should be overridden, so that the override only kicks in when the
property has not been explicitly specified by the caller (see the sketch below).
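A minimal sketch of the fixed precedence; the helper is hypothetical, but
the point is that an explicit caller setting always wins and the
schema-driven default applies only otherwise:

    // Effective value for hoodie.parquet.writelegacyformat.enabled.
    def resolveWriteLegacyFormat(explicitSetting: Option[String],
                                 schemaNeedsLegacyFormat: Boolean): Boolean =
      explicitSetting.map(_.toBoolean).getOrElse(schemaNeedsLegacyFormat)

    resolveWriteLegacyFormat(Some("false"), schemaNeedsLegacyFormat = true) // false: caller wins
    resolveWriteLegacyFormat(None, schemaNeedsLegacyFormat = true)          // true: from schema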
- Upgrade junit to 5.7.2
- Downgrade surefire and failsafe to 2.22.2
- Fix test failures that were previously not reported
- Improve Azure pipeline configs
Co-authored-by: liujinhui1994 <965147871@qq.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
There are multiple issues with our current DataSource V2 integration:
because we advertise Hudi tables as V2, Spark expects them to implement
certain APIs which are not implemented at the moment; instead, we use a
custom resolution rule (in HoodieSpark3Analysis) to manually fall back to
the V1 APIs. This commit fixes the issue by reverting the DSv2 APIs and
making Spark use V1, except for the schema evolution logic.
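For intuition, a hedged sketch of the V1-fallback pattern against
Spark 3.2-era internals; the rule and helper names are hypothetical, and
real code would also preserve the original relation's output attributes:

    import scala.jdk.CollectionConverters._

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.connector.catalog.Table
    import org.apache.spark.sql.execution.datasources.{DataSource, LogicalRelation}
    import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

    case class FallbackHudiToV1(spark: SparkSession) extends Rule[LogicalPlan] {
      // Assumed predicate: identify Hudi tables by their provider property.
      private def isHoodieTable(table: Table): Boolean =
        table.properties().asScala.get("provider").exists(_.equalsIgnoreCase("hudi"))

      override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
        case DataSourceV2Relation(table, _, _, _, options) if isHoodieTable(table) =>
          // Re-resolve through the V1 DataSource API so Spark takes V1 paths.
          val v1 = DataSource(spark, className = "hudi",
            options = options.asScala.toMap).resolveRelation()
          LogicalRelation(v1)
      }
    }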
The bulk insert row writer code path had a gap with respect to Hive-style
partitioning and the default partition when virtual keys are enabled with
SimpleKeyGen. This patch fixes the issue.
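A small sketch of the expected partition-path semantics the fix restores;
the default-partition literal and the helper are illustrative assumptions:

    // Hive-style partitioning prefixes the field name; null/empty partition
    // values fall back to the default partition.
    def partitionPath(field: String, value: Option[String], hiveStyle: Boolean): String = {
      val v = value.map(_.trim).filter(_.nonEmpty).getOrElse("default")
      if (hiveStyle) s"$field=$v" else v
    }

    partitionPath("dt", Some("2022-06-01"), hiveStyle = true) // "dt=2022-06-01"
    partitionPath("dt", None, hiveStyle = true)               // "dt=default"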
As outlined in HUDI-4176, we hit a roadblock while testing Hudi on a large
dataset (~1 TB) with pretty fat commits, where Hudi's commit metadata could
reach into the hundreds of MBs.
Given the size of some of our commit metadata instances, Spark's parsing
and resolution phase (when spark.sql(...) is invoked, but before the
returned Dataset is dereferenced) starts to dominate some of our queries'
execution time.
- Rebased onto new APIs to avoid excessive allocations of Hadoop's Path
- Eliminated hasOperationField completely to avoid repetitive computations
- Cleaned up duplication in HoodieActiveTimeline
- Added caching for common instances of HoodieCommitMetadata (see the sketch below)
- Made tableStructSchema lazy
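For the caching bullet above, a minimal sketch of the memoization idea
(names are hypothetical): parse a given instant's commit metadata once and
reuse it across queries, simplified here to raw bytes to stay self-contained:

    import java.util.concurrent.ConcurrentHashMap

    object CommitMetadataCache {
      // Keyed by instant timestamp; values are the loaded metadata payloads.
      private val cache = new ConcurrentHashMap[String, Array[Byte]]()

      def getOrLoad(instant: String, load: String => Array[Byte]): Array[Byte] =
        cache.computeIfAbsent(instant, (i: String) => load(i))
    }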
- Added pure immutable test YAMLs to the integ test framework; added SparkBulkInsertNode as part of it.
- Added delete_partition support to the integ test framework using spark-datasource.
- Added a single YAML to test all non-core write operations (insert overwrite, insert overwrite table, and delete partitions).
- Added tests for 4 concurrent Spark datasource writers (multi-writer tests).
- Fixed the README with sample commands for multi-writer.
- Avoid using a UDF in the key generator for SimpleKeyGen and NonPartitionedKeyGen.
- Fixed the NonPartitioned key generator to fetch the record key directly from the Row rather than going through a GenericRecord (see the sketch below).
- Other minor fixes around using static values instead of looking up a HashMap.
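A minimal sketch of the Row-based key extraction idea; the field index and
error text are illustrative, not Hudi's actual SimpleKeyGen code:

    import org.apache.spark.sql.Row

    // Read the record key straight from the Row by field index, skipping
    // the Row -> GenericRecord conversion entirely.
    def recordKey(row: Row, keyFieldIndex: Int): String = {
      require(!row.isNullAt(keyFieldIndex), "Record key value cannot be null")
      row.get(keyFieldIndex).toString
    }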
This PR fixes the projection logic around a nested field that is used as
the precombine key field. The fix is to check and append only the root-level
field to the mandatory columns, i.e., "a" for a nested field "a.b.c".
- Changed the logic to check and append the root-level field for a required
  nested field to the mandatory columns in HoodieBaseRelation.appendMandatoryColumns,
  as sketched below.
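A small sketch of the root-level extraction (the helper is illustrative;
the real logic lives in HoodieBaseRelation.appendMandatoryColumns):

    // For a nested reference like "a.b.c", only the top-level column "a"
    // needs to be appended to the projection's mandatory columns.
    def rootLevelField(field: String): String = field.takeWhile(_ != '.')

    rootLevelField("a.b.c") // "a"
    rootLevelField("ts")    // "ts"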