lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Shiyan Xu	1ea1e659c2	[HUDI-4474] Infer metasync configs (#6217) Infer repeated sync configs from the original configs: `META_SYNC_BASE_FILE_FORMAT` from `org.apache.hudi.common.table.HoodieTableConfig.BASE_FILE_FORMAT`; `META_SYNC_ASSUME_DATE_PARTITION` from `org.apache.hudi.common.config.HoodieMetadataConfig.ASSUME_DATE_PARTITIONING`; `META_SYNC_DECODE_PARTITION` from `org.apache.hudi.common.table.HoodieTableConfig.URL_ENCODE_PARTITIONING`; `META_SYNC_USE_FILE_LISTING_FROM_METADATA` from `org.apache.hudi.common.config.HoodieMetadataConfig.ENABLE`. As proposed in https://github.com/apache/hudi/blob/master/rfc/rfc-55/rfc-55.md#compatible-changes	2022-07-26 15:28:31 +05:30
Alexey Kudinkin	e7c8df7e8b	[HUDI-4250][HUDI-4202] Optimize performance of Column Stats Index reading in Data Skipping (#5746) We provide an alternative way of fetching Column Stats Index within the reading process to avoid the penalty of a more heavy-weight execution scheduled through a Spark engine.	2022-07-25 15:36:12 -07:00
KnightChess	a54c963543	[HUDI-4348] fix merge into sql data quality in concurrent scene (#6020)	2022-07-24 06:29:47 -07:00
Alexey Kudinkin	2d745057ea	[HUDI-4420] Fixing table schema delineation on partition/data schema for Spark relations (#5708)	2022-07-23 16:59:16 -05:00
Rahil C	f1f0109ab8	[HUDI-4440] Treat bootstrapped table as non-partitioned in HudiFileIndex if partition column is missing from schema (#6163) Co-authored-by: Ryan Pifer <rmpifer@umich.edu>	2022-07-23 11:44:40 -07:00
Rahil C	a5348cc685	[HUDI-4436] Invalidate cached table in Spark after write (#6159) Co-authored-by: Ryan Pifer <rmpifer@umich.edu>	2022-07-22 22:47:47 -07:00
Shiyan Xu	d5c7c79d87	Revert "[HUDI-4324] Remove use_jdbc config from hudi sync (#6072)" (#6160) This reverts commit `046044c83d`.	2022-07-22 17:18:45 -07:00
Alexey Kudinkin	39f2a06c85	[HUDI-3979] Optimize out mandatory columns when no merging is performed (#5430) For MOR, when no merging is performed there is no point in reading either primary-key or pre-combine-key values (unless the query references them). Skipping them can save substantial resources that would otherwise be wasted on reading them.	2022-07-22 15:32:44 -07:00
Alexey Kudinkin	eea4a692c0	[HUDI-4039] Make sure all builtin `KeyGenerator`s properly implement Spark specific APIs (#5523) This set of changes makes sure that all builtin KeyGenerators properly implement Spark-specific APIs in a performant way (minimizing key-generators overhead)	2022-07-22 08:35:07 -07:00
Sivabalan Narayanan	2e0dd29714	[HUDI-4204] Fixing NPE with row writer path and with OCC (#5850)	2022-07-21 15:57:34 -07:00
wenningd	f52b93fd10	Merge pull request #6154 from rahil-c/rahil-c/disable-emrSpark-properties [HUDI-4434] Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature	2022-07-21 11:35:52 -07:00
Alexey Kudinkin	a33bdd32e3	[HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation (#5470)	2022-07-21 06:20:47 -07:00
wenningd	c7fe3fd01d	[HUDI-3764] Allow loading external configs while querying Hudi tables with Spark (#4915) Currently, Hudi queries run with Spark do not load external configurations. For example, if customers enable metadata listing in their global config file, their queries still run without the metadata feature enabled. This PR fixes the issue by loading global configs during the Hudi reading phase. Co-authored-by: Wenning Ding <wenningd@amazon.com>	2022-07-21 15:12:17 +05:30
Alexey Kudinkin	de37774e12	[HUDI-3896] Porting Nested Schema Pruning optimization for Hudi's custom Relations (#5428) Currently, all Hudi Relations have a performance gap relative to Spark's HadoopFsRelation. The reason is that the SchemaPruning optimization rule (pruning nested schemas) is predicated on usage of HadoopFsRelation, so it is not applied when any other relation is used. This change ports the rule to Hudi relations (MOR, Incremental, etc.) by using the HoodieSparkSessionExtensions mechanism to inject a modified version of the original SchemaPruning rule, adapted to work with Hudi's custom relations. Changes: added customOptimizerRules to HoodieAnalysis; added the NestedSchemaPrunning Spark Optimizer rule; handle Spark's Optimizer pruned data schema (to effectively prune nested schemas); enable HoodieClientTestHarness to inject HoodieSparkSessionExtensions; inject Spark Session extensions for TestMORDataSource and TestCOWDataSource; disabled fallback to HadoopFsRelation.	2022-07-21 15:06:06 +05:30
Rahil Chertara	473be87aa5	Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature	2022-07-20 17:04:00 -07:00
RexAn	ded197800a	[HUDI-4170] Make user can use hoodie.datasource.read.paths to read necessary files (#5722) Rebase code; move listFileSlices to HoodieBaseRelation; fix review; fix style; fix bug	2022-07-17 16:11:45 +08:00
Shiyan Xu	51244eba82	[HUDI-4323] Make database table names optional in sync tool (#6073) Infer these properties from the table config	2022-07-11 10:03:31 +05:30
Shiyan Xu	046044c83d	[HUDI-4324] Remove use_jdbc config from hudi sync (#6072) Users should use HIVE_SYNC_MODE instead	2022-07-10 11:16:09 +05:30
liujinhui	126b88b48d	[HUDI-2150] Rename/Restructure configs for better modularity (#6061) Move clean-related configuration to HoodieCleanConfig; move archival-related configuration to HoodieArchivalConfig; move hoodie.compaction.payload.class to HoodiePayloadConfig	2022-07-09 20:00:48 +05:30
xiarixiaoyao	b686c07407	[HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields (#6017) Fix comments. Co-authored-by: public (bdcee5037027) <mengtao0326@qq.com>	2022-07-09 03:08:38 +08:00
xi chaomin	23c9c5c296	[HUDI-3836] Improve the way of fetching metadata partitions from table (#5286) Co-authored-by: xicm <xicm@asiainfo.com>	2022-07-05 07:50:17 -07:00
Shiyan Xu	c0e1587966	[HUDI-3730] Improve meta sync class design and hierarchies (#5854) [HUDI-3730] Improve meta sync class design and hierarchies (#5754); implements class design proposed in RFC-55. Co-authored-by: jian.feng <fengjian428@gmial.com> Co-authored-by: jian.feng <jian.feng@shopee.com>	2022-07-03 14:47:25 +05:30
RexAn	cdaaa3c4c7	[HUDI-4346] Fix params not update BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED (#5999)	2022-06-29 19:26:00 -07:00
Alexey Kudinkin	ed823f1c6f	[HUDI-4320] Make sure `HoodieStorageConfig.PARQUET_WRITE_LEGACY_FORMAT_ENABLED` could be specified by the writer (#5970) Fixed sequence determining whether Parquet's legacy-format writing property should be overridden to only kick in when it has not been explicitly specified by the caller	2022-06-28 12:27:32 -07:00
ForwardXu	08eba914ed	[HUDI-4333] fix HoodieFileIndex's listFiles method log print skipping percent NaN (#5990)	2022-06-28 15:08:48 +08:00
cxzl25	72fa19bcc9	[HUDI-4316] Support for spillable diskmap configuration when constructing HoodieMergedLogRecordScanner (#5959)	2022-06-27 11:09:30 +08:00
xiarixiaoyao	142adf4ccb	[HUDI-4296] Fix the bug that TestHoodieSparkSqlWriter.testSchemaEvolutionForTableType is flaky (#5973)	2022-06-25 21:03:19 +08:00
xiarixiaoyao	360df576a9	Revert "[TEST][DO_NOT_MERGE]fix random failed for ci (#5948)" (#5971) This reverts commit `e8fbd4daf4`.	2022-06-25 11:23:17 +08:00
xiarixiaoyao	e8fbd4daf4	[TEST][DO_NOT_MERGE]fix random failed for ci (#5948)	2022-06-25 10:15:08 +08:00
RexAn	17ac5a4573	[HUDI-4173] Fix wrong results if the user read no base files hudi table by glob paths (#5723)	2022-06-20 23:02:34 +05:30
KnightChess	2bf0a1906d	[HUDI-4217] improve repeat init object in ExpressionPayload (#5825)	2022-06-15 20:21:28 +08:00
Shiyan Xu	5aaac21d1d	[HUDI-4224] Fix CI issues (#5842) Upgrade junit to 5.7.2; downgrade surefire and failsafe to 2.22.2; fix test failures that were previously not reported; improve azure pipeline configs. Co-authored-by: liujinhui1994 <965147871@qq.com> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2022-06-12 11:44:18 -07:00
Y Ethan Guo	fd8f7c5f6c	[HUDI-4205] Fix NullPointerException in HFile reader creation (#5841) Replace SerializableConfiguration with SerializableWritable for broadcasting the hadoop configuration before initializing HFile readers	2022-06-11 14:46:43 -07:00
Y Ethan Guo	97ccf5dd18	[HUDI-4223] Fix NullPointerException from getLogRecordScanner when reading metadata table (#5840) When explicitly specifying the metadata table path for reading in spark, the "hoodie.metadata.enable" is overwritten to true for proper read behavior.	2022-06-11 13:19:24 -07:00
xi chaomin	2b3a85528a	[HUDI-3889] Do not validate table config if save mode is set to Overwrite (#5619) Co-authored-by: xicm <xicm@asiainfo.com>	2022-06-09 19:23:51 -04:00
Danny Chan	c608dbd6c2	[HUDI-4213] Infer keygen clazz for Spark SQL (#5815)	2022-06-09 20:37:58 +08:00
Alexey Kudinkin	35afdb4316	[HUDI-4178] Addressing performance regressions in Spark DataSourceV2 Integration (#5737) There are multiple issues with our current DataSource V2 integration: because we advertise Hudi tables as V2, Spark expects them to implement certain APIs that are not implemented at the moment; instead, we use a custom Resolution rule (in HoodieSpark3Analysis) to manually fall back to V1 APIs. This commit fixes the issue by reverting DSv2 APIs and making Spark use V1, except for schema evaluation logic.	2022-06-07 16:30:46 -07:00
Sivabalan Narayanan	4f6fc726d0	[HUDI-4140] Fixing hive style partitioning and default partition with bulk insert row writer with SimpleKeyGen and virtual keys (#5664) Bulk insert row writer code path had a gap wrt hive style partitioning and default partition when virtual keys are enabled with SimpleKeyGen. This patch fixes the issue.	2022-06-06 10:21:00 -07:00
Alexey Kudinkin	4f7ea8c79a	[HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing (#5733) As outlined in HUDI-4176, we hit a roadblock while testing Hudi on a large dataset (~1 TB) with fairly large commits, where Hudi's commit metadata could reach 100s of MB. Given the size of some of our commit metadata instances, Spark's parsing and resolving phase (when spark.sql(...) is involved, but before the returned Dataset is dereferenced) starts to dominate some of our queries' execution time. Changes: rebased onto new APIs to avoid excessive Hadoop Path allocations; eliminated hasOperationField completely to avoid repetitive computations; cleaned up duplication in HoodieActiveTimeline; added caching for common instances of HoodieCommitMetadata; made tableStructSchema lazy.	2022-06-06 13:14:26 -04:00
Sagar Sumit	21ab0ff8be	[HUDI-4195] Bulk insert should use right keygen for non-partitioned table (#5759)	2022-06-06 07:19:03 -04:00
Jin Xing	918c4f4e0b	[HUDI-4149] Drop-Table fails when underlying table directory is broken (#5672)	2022-05-30 19:09:26 +08:00
ForwardXu	8fa8f26031	[MINOR] Fix Hive and meta sync config for sql statement (#5316)	2022-05-28 07:56:39 -07:00
RexAn	554caa3421	[MINOR] Fix the issue when handling conf hoodie.datasource.write.operation=bulk_insert in sql mode (#5679) Co-authored-by: Rex An <bonean131@gmail.com>	2022-05-27 04:45:09 -07:00
Alexey Kudinkin	1767ff5e7c	[HUDI-4161] Make sure partition values are taken from partition path (#5699)	2022-05-27 02:36:30 -07:00
watermelon12138	57dbe57bed	[HUDI-4162] Fixed some constant mapping issues. (#5700) Co-authored-by: y00617041 <yangxuan42@huawei.com>	2022-05-27 14:08:54 +08:00
komao	8d2f009048	[HUDI-4124] Add valid check in Spark Datasource configs (#5637) Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>	2022-05-26 05:21:28 -07:00
felixYyu	716e995a38	[MINOR] Removing redundant semicolons and line breaks (#5662)	2022-05-23 15:26:36 -07:00
Raymond Xu	271d1a79c0	[HUDI-4051] Allow nested field as primary key and preCombineField in spark sql (#5517) Allow nested field as preCombineField in spark sql; relax validation for primary key	2022-05-22 00:47:51 -07:00
huberylee	85b146d3d5	[HUDI-3985] Refactor DLASyncTool to support read hoodie table as spark datasource table (#5532)	2022-05-20 22:25:32 +08:00
huberylee	6573469e73	[HUDI-4116] Unify clustering/compaction related procedures' output type (#5620) Address review comments	2022-05-19 09:48:03 +08:00

207 Commits