Commit Graph

215 Commits

Author SHA1 Message Date
ForwardXu
51b5783161 [HUDI-4404] Fix insert into dynamic partition write misalignment (#6124) 2022-07-22 09:40:52 +08:00
superche
8e0b47e360 [MINOR] Fix result missing information issue in commits_compare Procedure (#6165)
Co-authored-by: superche <superche@tencent.com>
2022-07-21 16:25:22 -07:00
Alexey Kudinkin
de37774e12 [HUDI-3896] Porting Nested Schema Pruning optimization for Hudi's custom Relations (#5428)
Currently, all Hudi Relations have a performance gap relative to Spark's HadoopFsRelation.
The reason is that the SchemaPruning optimization rule (pruning nested schemas)
is predicated on usage of HadoopFsRelation, meaning that it's
not applied when any other relation is used.

This change ports the rule to Hudi relations (MOR, Incremental, etc.)
by leveraging the HoodieSparkSessionExtensions mechanism
to inject a modified version of the original SchemaPruning rule
that is adapted to work w/ Hudi's custom relations.

- Added customOptimizerRules to HoodieAnalysis
- Added NestedSchemaPrunning Spark's Optimizer rule
- Handle Spark's Optimizer pruned data schema (to effectively prune nested schemas)
- Enable HoodieClientTestHarness to inject HoodieSparkSessionExtensions
- Injecting Spark Session extensions for TestMORDataSource, TestCOWDataSource
- Disabled fallback to HadoopFsRelation
2022-07-21 15:06:06 +05:30
simonsssu
80368a049d [HUDI-3503] Add call procedure for CleanCommand (#6065)
* [HUDI-3503] Add call procedure for CleanCommand
Co-authored-by: simonssu <simonssu@tencent.com>
2022-07-16 22:33:26 +08:00
liujinhui
126b88b48d [HUDI-2150] Rename/Restructure configs for better modularity (#6061)
- Move clean related configuration to HoodieCleanConfig
- Move Archival related configuration to HoodieArchivalConfig
- hoodie.compaction.payload.class move this to HoodiePayloadConfig
2022-07-09 20:00:48 +05:30
superche
6566fc6625 [HUDI-3500] Add call procedure for RepairsCommand (#6053) 2022-07-09 09:29:14 +08:00
苏承祥
f20acb8dc3 [HUDI-4367] Support copyToTable on call (#6054) 2022-07-08 15:08:11 +08:00
shenjiayu17
b18c32379f [HUDI-4219] Merge Into when update expression "col=s.col+2" on precombine cause exception (#5828) 2022-07-06 09:10:35 +08:00
董可伦
3670e82af5 [HUDI-4356] Fix the error when sync hive in CTAS (#6029) 2022-07-06 00:08:23 +08:00
ForwardXu
8570c3aab4 [HUDI-4359] Support show_fs_path_detail command on Call Produce Command (#6042) 2022-07-05 23:56:32 +08:00
superche
e0954040a9 [HUDI-3511] Add call procedure for MetadataCommand (#6018) 2022-07-03 21:44:56 +08:00
Shiyan Xu
c0e1587966 [HUDI-3730] Improve meta sync class design and hierarchies (#5854)
* [HUDI-3730] Improve meta sync class design and hierarchies (#5754)
* Implements class design proposed in RFC-55

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
2022-07-03 14:47:25 +05:30
superche
c00ea84985 [HUDI-3505] Add call procedure for UpgradeOrDowngradeCommand (#6012)
Co-authored-by: superche <superche@tencent.com>
2022-07-03 08:47:48 +08:00
Teng
e3eb14ad2d [HUDI-4334] close SparkRDDWriteClient after usage in Create/Delete/RollbackSavepointsProcedure (#5994) 2022-06-29 06:13:29 +08:00
KnightChess
09dc001430 [HUDI-4325] fix spark sql procedure cause ParseException with semicolon (#5982)
* [HUDI-4325] fix spark sql procedure cause ParseException with semicolon
2022-06-28 09:44:41 +08:00
superche
b14ed47f21 [HUDI-3506] Add call procedure for CommitsCommand (#5974)
* [HUDI-3506] Add call procedure for CommitsCommand

Co-authored-by: superche <superche@tencent.com>
2022-06-28 09:43:36 +08:00
ForwardXu
26c967bac6 [HUDI-3504] Support bootstrap command based on Call Produce Command (#5977) 2022-06-27 13:06:50 +08:00
ForwardXu
1c43c590ac [HUDI-3502] Support hdfs parquet import command based on Call Produce Command (#5956) 2022-06-26 11:27:14 +08:00
jiz
eeafaeacd2 [HUDI-3512] Add call procedure for StatsCommand (#5955)
Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>
2022-06-25 09:43:23 +08:00
jiz
af9f09047d [HUDI-3509] Add call procedure for HoodieLogFileCommand (#5949)
Co-authored-by: zhanshaoxiong <jiimmyzhan@tencent.com>
2022-06-24 10:16:54 +08:00
jiz
1bb017d396 [HUDI-3508] Add call procedure for FileSystemViewCommand (#5929)
* [HUDI-3508] Add call procedure for FileSystemView

* minor

Co-authored-by: jiimmyzhan <jiimmyzhan@tencent.com>
2022-06-22 17:50:20 +08:00
ForwardXu
c5c4cfec91 [HUDI-3507] Support export command based on Call Produce Command (#5901) 2022-06-19 18:48:22 +08:00
huberylee
fec49dc12b [HUDI-4165] Support Create/Drop/Show/Refresh Index Syntax for Spark SQL (#5761)
* Support Create/Drop/Show/Refresh Index Syntax for Spark SQL
2022-06-17 18:33:58 +08:00
KnightChess
0ff34b6974 [HUDI-4214] improve repeat init write schema in ExpressionPayload (#5820)
* [HUDI-4214] improve repeat init write schema in ExpressionPayload
2022-06-16 17:58:37 +08:00
KnightChess
2bf0a1906d [HUDI-4217] improve repeat init object in ExpressionPayload (#5825) 2022-06-15 20:21:28 +08:00
superche
7b946cf351 [HUDI-3499] Add Call Procedure for show rollbacks (#5848)
* Add Call Procedure for show rollbacks

* fix

* add ut for show_rollback_detail and exception handle

Co-authored-by: superche <superche@tencent.com>
2022-06-15 16:50:15 +08:00
Alexey Kudinkin
35afdb4316 [HUDI-4178] Addressing performance regressions in Spark DataSourceV2 Integration (#5737)
There are multiple issues with our current DataSource V2 integrations: because we advertise Hudi tables as V2, Spark expects them to implement certain APIs that are not implemented at the moment, and we instead use a custom Resolution rule (in HoodieSpark3Analysis) to manually fall back to V1 APIs. This commit fixes the issue by reverting the DSv2 APIs and making Spark use V1, except for schema evaluation logic.
2022-06-07 16:30:46 -07:00
Saisai Shao
bd26d633d7 [HUDI-4168] Add Call Procedure for marker deletion (#5738)
* Add Call Procedure for marker deletion
2022-06-05 11:05:38 +08:00
ForwardXu
8fa8f26031 [MINOR] Fix Hive and meta sync config for sql statement (#5316) 2022-05-28 07:56:39 -07:00
huberylee
85b146d3d5 [HUDI-3985] Refactor DLASyncTool to support read hoodie table as spark datasource table (#5532) 2022-05-20 22:25:32 +08:00
huberylee
6573469e73 [HUDI-4116] Unify clustering/compaction related procedures' output type (#5620)
* Unify clustering/compaction related procedures' output type

* Address review comments
2022-05-19 09:48:03 +08:00
董可伦
75f847691f [HUDI-4001] Filter the properties should not be used when create table for Spark SQL (#5495) 2022-05-16 09:50:29 +08:00
KnightChess
6ec039ba42 [MINOR] Update alter rename command class type for pattern matching (#5381) 2022-04-26 19:39:51 -07:00
ForwardXu
44b3630b5d [HUDI-3826] Make truncate partition use delete_partition operation (#5272)
Make truncate partition and drop partition behave as drop partition with purge: all records are deleted via Hudi DELETE_PARTITION, and the partition is removed from the metastore.
2022-04-14 00:53:05 -07:00
ForwardXu
3449e86989 [HUDI-3780] improve drop partitions (#5178) 2022-04-05 11:52:33 +08:00
xiarixiaoyao
444ff496a4 [RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark (#4910)
* [HUDI-2560] introduce id_based schema to support full schema evolution.

* add test for FileBasedInternalSchemaStorageManger and rebase code

* add support for change column type and fix some test case

* fix some bugs encountered in the production env and delete useless code

* fix test error

* rebase code

* fixed some nested schema change bugs

* [HUDI-2429][Stacked On HUDI-2560]Support full schema evolution for spark

* [use dummyInternalSchema instead of null]

* add support for spark3.1.x

* remove support for spark3.1.x, since compilation fails

* support spark3.1.x

* rebase and prepare solve all comments

* address all comments

* rebase code

* fixed the count(*) bug

* try to get internalSchema by parsing the commit file/history file directly, instead of using metaclient, which is time-consuming
address some comments

* fixed all comments

* fix new comments

* rebase code,fix UT failed

* fixed mistake

* rebase code ,fixed new comments

* rebase code , and prepare for address new comments

* address commits

* address new comments

* fix new issues

* control fallback original write logical
2022-04-01 13:20:24 -07:00
Alexey Kudinkin
e5a2baeed0 [HUDI-3549] Removing dependency on "spark-avro" (#4955)
Hudi will be taking on a promise for its bundles to stay compatible with Spark minor versions (e.g. 2.4, 3.1, 3.2), meaning that a single build of Hudi (e.g. "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc.).

To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root-cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR).

Instead of bundling "spark-avro" as a dependency, we will be copying over some of the classes Hudi depends on and maintaining them alongside the Hudi code-base to make sure we're able to provide the aforementioned guarantee. To work around arising compatibility issues we will be applying local patches to guarantee compatibility of Hudi bundles w/in the Spark minor version branches.

The following mapping of Hudi modules to Spark minor branches is currently maintained:

- "hudi-spark3" -> 3.2.x
- "hudi-spark3.1.x" -> 3.1.x
- "hudi-spark2" -> 2.4.x

The following class hierarchies (borrowed from "spark-avro") are maintained w/in these Spark-specific modules to guarantee compatibility with the respective minor version branches:

- AvroSerializer
- AvroDeserializer
- AvroUtils

Each of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules.

SchemaConverters class in turn is shared across all those modules given its relative stability (there're only cosmetic changes from 2.4.4 to 3.2.1).
All of the aforementioned classes have their corresponding scope of visibility limited to corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure broader code-base does not become dependent on them and instead relies on facades abstracting them.

Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in aforementioned minor versions branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI.

Brief change log:
- Removing spark-avro bundling from Hudi by default
- Scaffolded Spark 3.2.x hierarchy
- Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy
- Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy
- Moved ExpressionCodeGen,ExpressionPayload into hudi-spark module
- Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0
- Modified bot.yml to build full matrix of supported Spark versions
- Removed "spark-avro" dependency from all modules
- Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.
2022-03-29 14:44:47 -04:00
leesf
8f8a8158e2 [HUDI-2520] Fix drop table issue when sync to Hive (#5143) 2022-03-28 19:34:12 -07:00
huberylee
1d0f4ccfe0 [HUDI-3538] Support Compaction Command Based on Call Procedure Command for Spark SQL (#4945)
* Support Compaction Command Based on Call Procedure Command for Spark SQL

* Addressed review comments
2022-03-28 14:11:35 +08:00
ForwardXu
12cc8e715b [MINOR] fix QuickstartUtils move (#5133) 2022-03-25 07:34:35 -07:00
ForwardXu
e5c3f9089b [HUDI-3563] Make quickstart examples covered by CI tests (#5082) 2022-03-25 01:37:17 -07:00
Alexey Kudinkin
8b38ddedc2 [HUDI-3594] Supporting Composite Expressions over Data Table Columns in Data Skipping flow (#4996) 2022-03-24 22:27:15 -07:00
Rajesh Mahindra
5f570ea151 [HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs (#4175)
- Refactor hive sync tool / config to use reflection and standardize configs

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-03-21 22:56:31 -04:00
peanut-chenzhong
4b75cb6f23 fix NPE when run schdule using spark-sql if the commits time < hoodie.compact.inline.max.delta.commits (#4976)
* Update CompactionHoodiePathCommand.scala

fix NPE when running schedule using spark-sql if the commits time < hoodie.compact.inline.max.delta.commits

* Update CompactionHoodiePathCommand.scala

fix IndexOutOfBoundsException when there's no schedule for compaction

* Update CompactionHoodiePathCommand.scala

fix CI issue
2022-03-14 16:40:38 +08:00
ForwardXu
6c8224cae6 [HUDI-3501] Support savepoints command based on Call Produce Command (#5025) 2022-03-13 16:58:21 +04:00
huberylee
56cb49485d [HUDI-3567] Refactor HoodieCommonUtils to make code more reasonable (#4982) 2022-03-11 13:23:19 -08:00
ForwardXu
08fd80c913 [HUDI-3221] Support querying a table as of a savepoint (#4720) 2022-03-08 10:02:34 -08:00
shibei
62f534d002 [HUDI-3445] Support Clustering Command Based on Call Procedure Command for Spark SQL (#4901)
* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

Co-authored-by: shibei <huberylee.li@alibaba-inc.com>
2022-03-04 09:33:16 +08:00
ForwardXu
aa1810d737 [HUDI-3493] Not table to get execution plan (#4894) 2022-02-24 17:04:44 -08:00
Alexey Kudinkin
85e8a5c4de [HUDI-1296] Support Metadata Table in Spark Datasource (#4789)
* Bootstrapping initial support for Metadata Table in Spark Datasource

- Consolidated Avro/Row conversion utilities to center around Spark's AvroDeserializer ; removed duplication
- Bootstrapped HoodieBaseRelation
- Updated HoodieMergeOnReadRDD to be able to handle Metadata Table
- Modified MOR relations to be able to read different Base File formats (Parquet, HFile)
2022-02-24 16:23:13 -05:00