1
0
Commit Graph

198 Commits

Author SHA1 Message Date
Sivabalan Narayanan
2e0dd29714 [HUDI-4204] Fixing NPE with row writer path and with OCC (#5850) 2022-07-21 15:57:34 -07:00
wenningd
f52b93fd10 Merge pull request #6154 from rahil-c/rahil-c/disable-emrSpark-properties
[HUDI-4434] Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature
2022-07-21 11:35:52 -07:00
Alexey Kudinkin
a33bdd32e3 [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation (#5470) 2022-07-21 06:20:47 -07:00
wenningd
c7fe3fd01d [HUDI-3764] Allow loading external configs while querying Hudi tables with Spark (#4915)
Currently when doing Hudi queries w/ Spark, it won't 
load the external configurations. Say if customers enabled 
metadata listing in their global config file, then this would 
let them actually query w/o metadata feature enabled. 
This PR fixes this issue and allows loading global 
configs during the Hudi reading phase.

Co-authored-by: Wenning Ding <wenningd@amazon.com>
2022-07-21 15:12:17 +05:30
Alexey Kudinkin
de37774e12 [HUDI-3896] Porting Nested Schema Pruning optimization for Hudi's custom Relations (#5428)
Currently, all Hudi Relations bear performance gap relative to Spark's HadoopFsRelation 
and the reason to that is SchemaPruning optimization rule (pruning nested schemas) 
that is unfortunately predicated on usage of HadoopFsRelation, meaning that it's 
not applied in cases when any other relation is used.

This change is porting this rule to Hudi relations (MOR, Incremental, etc) 
by the virtue of leveraging HoodieSparkSessionExtensions mechanism 
injecting modified version of the original SchemaPruning rule 
that is adopted to work w/ Hudi's custom relations.

- Added customOptimizerRules to HoodieAnalysis
- Added NestedSchemaPrunning Spark's Optimizer rule
- Handle Spark's Optimizer pruned data schema (to effectively prune nested schemas)
- Enable HoodieClientTestHarness to inject HoodieSparkSessionExtensions
- Injecting Spark Session extensions for TestMORDataSource, TestCOWDataSource
- Disabled fallback to HadoopFsRelation
2022-07-21 15:06:06 +05:30
Rahil Chertara
473be87aa5 Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature 2022-07-20 17:04:00 -07:00
RexAn
ded197800a [HUDI-4170] Make user can use hoodie.datasource.read.paths to read necessary files (#5722)
* Rebase codes

* Move listFileSlices to HoodieBaseRelation

* Fix review

* Fix style

* Fix bug
2022-07-17 16:11:45 +08:00
Shiyan Xu
51244eba82 [HUDI-4323] Make database table names optional in sync tool (#6073)
* [HUDI-4323] Make database table names optional in sync tool
* Infer from these properties from the table config
2022-07-11 10:03:31 +05:30
Shiyan Xu
046044c83d [HUDI-4324] Remove use_jdbc config from hudi sync (#6072)
* [HUDI-4324] Remove use_jdbc config from hudi sync
* Users should use HIVE_SYNC_MODE instead
2022-07-10 11:16:09 +05:30
liujinhui
126b88b48d [HUDI-2150] Rename/Restructure configs for better modularity (#6061)
- Move clean related configuration to HoodieCleanConfig
- Move Archival related configuration to HoodieArchivalConfig
- hoodie.compaction.payload.class move this to HoodiePayloadConfig
2022-07-09 20:00:48 +05:30
xiarixiaoyao
b686c07407 [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields (#6017)
* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields.

* fix comments

Co-authored-by: public (bdcee5037027) <mengtao0326@qq.com>
2022-07-09 03:08:38 +08:00
xi chaomin
23c9c5c296 [HUDI-3836] Improve the way of fetching metadata partitions from table (#5286)
Co-authored-by: xicm <xicm@asiainfo.com>
2022-07-05 07:50:17 -07:00
Shiyan Xu
c0e1587966 [HUDI-3730] Improve meta sync class design and hierarchies (#5854)
* [HUDI-3730] Improve meta sync class design and hierarchies (#5754)
* Implements class design proposed in RFC-55

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
2022-07-03 14:47:25 +05:30
RexAn
cdaaa3c4c7 [HUDI-4346] Fix params not update BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED (#5999) 2022-06-29 19:26:00 -07:00
Alexey Kudinkin
ed823f1c6f [HUDI-4320] Make sure HoodieStorageConfig.PARQUET_WRITE_LEGACY_FORMAT_ENABLED could be specified by the writer (#5970)
Fixed sequence determining whether Parquet's legacy-format writing property should be overridden to only kick in when it has not been explicitly specified by the caller
2022-06-28 12:27:32 -07:00
ForwardXu
08eba914ed [HUDI-4333] fix HoodieFileIndex's listFiles method log print skipping percent NaN (#5990) 2022-06-28 15:08:48 +08:00
cxzl25
72fa19bcc9 [HUDI-4316] Support for spillable diskmap configuration when constructing HoodieMergedLogRecordScanner (#5959) 2022-06-27 11:09:30 +08:00
xiarixiaoyao
142adf4ccb [HUDI-4296] Fix the bug that TestHoodieSparkSqlWriter.testSchemaEvolutionForTableType is flaky (#5973) 2022-06-25 21:03:19 +08:00
xiarixiaoyao
360df576a9 Revert "[TEST][DO_NOT_MERGE]fix random failed for ci (#5948)" (#5971)
This reverts commit e8fbd4daf4.
2022-06-25 11:23:17 +08:00
xiarixiaoyao
e8fbd4daf4 [TEST][DO_NOT_MERGE]fix random failed for ci (#5948) 2022-06-25 10:15:08 +08:00
RexAn
17ac5a4573 [HUDI-4173] Fix wrong results if the user read no base files hudi table by glob paths (#5723) 2022-06-20 23:02:34 +05:30
KnightChess
2bf0a1906d [HUDI-4217] improve repeat init object in ExpressionPayload (#5825) 2022-06-15 20:21:28 +08:00
Shiyan Xu
5aaac21d1d [HUDI-4224] Fix CI issues (#5842)
- Upgrade junit to 5.7.2
- Downgrade surefire and failsafe to 2.22.2
- Fix test failures that were previously not reported
- Improve azure pipeline configs

Co-authored-by: liujinhui1994 <965147871@qq.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
2022-06-12 11:44:18 -07:00
Y Ethan Guo
fd8f7c5f6c [HUDI-4205] Fix NullPointerException in HFile reader creation (#5841)
Replace SerializableConfiguration with SerializableWritable for broadcasting the hadoop configuration before initializing HFile readers
2022-06-11 14:46:43 -07:00
Y Ethan Guo
97ccf5dd18 [HUDI-4223] Fix NullPointerException from getLogRecordScanner when reading metadata table (#5840)
When explicitly specifying the metadata table path for reading in spark, the "hoodie.metadata.enable" is overwritten to true for proper read behavior.
2022-06-11 13:19:24 -07:00
xi chaomin
2b3a85528a [HUDI-3889] Do not validate table config if save mode is set to Overwrite (#5619)
Co-authored-by: xicm <xicm@asiainfo.com>
2022-06-09 19:23:51 -04:00
Danny Chan
c608dbd6c2 [HUDI-4213] Infer keygen clazz for Spark SQL (#5815) 2022-06-09 20:37:58 +08:00
Alexey Kudinkin
35afdb4316 [HUDI-4178] Addressing performance regressions in Spark DataSourceV2 Integration (#5737)
There are multiple issues with our current DataSource V2 integrations: b/c we advertise Hudi tables as V2, Spark expects it to implement certain APIs which are not implemented at the moment, instead we're using custom Resolution rule (in HoodieSpark3Analysis) to instead manually fallback to V1 APIs.  This commit fixes the issue by reverting DSv2 APIs and making Spark use V1, except for schema evaluation logic.
2022-06-07 16:30:46 -07:00
Sivabalan Narayanan
4f6fc726d0 [HUDI-4140] Fixing hive style partitioning and default partition with bulk insert row writer with SimpleKeyGen and virtual keys (#5664)
Bulk insert row writer code path had a gap wrt hive style partitioning and default partition when virtual keys are enabled with SimpleKeyGen.  This patch fixes the issue.
2022-06-06 10:21:00 -07:00
Alexey Kudinkin
4f7ea8c79a [HUDI-4176] Fixing TableSchemaResolver to avoid repeated HoodieCommitMetadata parsing (#5733)
As has been outlined in HUDI-4176, we've hit a roadblock while testing Hudi on a large dataset (~1Tb) having pretty fat commits where Hudi's commit metadata could reach into 100s of Mbs.
Given the size some of ours commit metadata instances Spark's parsing and resolving phase (when spark.sql(...) is involved, but before returned Dataset is dereferenced) starts to dominate some of our queries' execution time.

- Rebased onto new APIs to avoid excessive Hadoop's Path allocations
- Eliminated hasOperationField completely to avoid repeatitive computations
- Cleaning up duplication in HoodieActiveTimeline
- Added caching for common instances of HoodieCommitMetadata
- Made tableStructSchema lazy;
2022-06-06 13:14:26 -04:00
Sagar Sumit
21ab0ff8be [HUDI-4195] Bulk insert should use right keygen for non-partitioned table (#5759) 2022-06-06 07:19:03 -04:00
Jin Xing
918c4f4e0b [HUDI-4149] Drop-Table fails when underlying table directory is broken (#5672) 2022-05-30 19:09:26 +08:00
ForwardXu
8fa8f26031 [MINOR] Fix Hive and meta sync config for sql statement (#5316) 2022-05-28 07:56:39 -07:00
RexAn
554caa3421 [MINOR] Fix the issue when handling conf hoodie.datasource.write.operation=bulk_insert in sql mode (#5679)
Co-authored-by: Rex An <bonean131@gmail.com>
2022-05-27 04:45:09 -07:00
Alexey Kudinkin
1767ff5e7c [HUDI-4161] Make sure partition values are taken from partition path (#5699) 2022-05-27 02:36:30 -07:00
watermelon12138
57dbe57bed [HUDI-4162] Fixed some constant mapping issues. (#5700)
Co-authored-by: y00617041 <yangxuan42@huawei.com>
2022-05-27 14:08:54 +08:00
komao
8d2f009048 [HUDI-4124] Add valid check in Spark Datasource configs (#5637)
Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>
2022-05-26 05:21:28 -07:00
felixYyu
716e995a38 [MINOR] Removing redundant semicolons and line breaks (#5662) 2022-05-23 15:26:36 -07:00
Raymond Xu
271d1a79c0 [HUDI-4051] Allow nested field as primary key and preCombineField in spark sql (#5517)
* [HUDI-4051] Allow nested field as preCombineField in spark sql

* relax validation for primary key
2022-05-22 00:47:51 -07:00
huberylee
85b146d3d5 [HUDI-3985] Refactor DLASyncTool to support read hoodie table as spark datasource table (#5532) 2022-05-20 22:25:32 +08:00
huberylee
6573469e73 [HUDI-4116] Unify clustering/compaction related procedures' output type (#5620)
* Unify clustering/compaction related procedures' output type

* Address review comments
2022-05-19 09:48:03 +08:00
Jin Xing
d422f69a0d [HUDI-4087] Support dropping RO and RT table in DropHoodieTableCommand (#5564)
* [HUDI-4087] Support dropping RO and RT table in DropHoodieTableCommand

* Set hoodie.query.as.ro.table in serde properties
2022-05-17 14:12:50 +08:00
董可伦
75f847691f [HUDI-4001] Filter the properties should not be used when create table for Spark SQL (#5495) 2022-05-16 09:50:29 +08:00
Sivabalan Narayanan
0cec955fa2 [HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests (#5501)
- Added pure immutable test yamls to integ test framework. Added SparkBulkInsertNode as part of it.
- Added delete_partition support to integ test framework using spark-datasource.
- Added a single yaml to test all non core write operations (insert overwrite, insert overwrite table and delete partitions)
- Added tests for 4 concurrent spark datasource writers (multi-writer tests).
- Fixed readme w/ sample commands for multi-writer.
2022-05-12 21:01:55 -04:00
Sivabalan Narayanan
6285a239a3 [HUDI-3995] Making perf optimizations for bulk insert row writer path (#5462)
- Avoid using udf for key generator for SimpleKeyGen and NonPartitionedKeyGen.
- Fixed NonPartitioned Key generator to directly fetch record key from row rather than involving GenericRecord.
- Other minor fixes around using static values instead of looking up hashmap.
2022-05-09 12:40:22 -04:00
KnightChess
6ec039ba42 [MINOR] Update alter rename command class type for pattern matching (#5381) 2022-04-26 19:39:51 -07:00
Sivabalan Narayanan
762623a15c [HUDI-3972] Fixing hoodie.properties/tableConfig for no preCombine field with writes (#5424)
Fixed instantiation of new table to set the null for preCombine if not explicitly set by the user.
2022-04-25 23:03:10 -04:00
Sivabalan Narayanan
7523542c1d [HUDI-3947] Fixing Hive conf usage in HoodieSparkSqlWriter (#5401) 2022-04-22 22:20:05 -04:00
Y Ethan Guo
c4bc2deea0 [HUDI-3936] Fix projection for a nested field as pre-combined key (#5379)
This PR fixes the projection logic around a nested field which is used as the pre-combined key field. The fix is to only check and append the root level field for projection, i.e., "a", for a nested field "a.b.c" in the mandatory columns.

- Changes the logic to check and append the root level field for a required nested field in the mandatory columns in HoodieBaseRelation.appendMandatoryColumns
2022-04-21 20:17:57 -04:00
Alexey Kudinkin
4b296f79cc [HUDI-3935] Adding config to fallback to enabled Partition Values extraction from Partition path (#5377) 2022-04-21 01:36:19 -07:00