Commit Graph

3116 Commits

Author SHA1 Message Date
wenningd
f52b93fd10 Merge pull request #6154 from rahil-c/rahil-c/disable-emrSpark-properties
[HUDI-4434] Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature
2022-07-21 11:35:52 -07:00
Rahil C
2bf7920bd9 [MINOR] Add logger for HoodieCopyOnWriteTableInputFormat (#6161)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2022-07-21 22:27:18 +05:30
Alexey Kudinkin
a33bdd32e3 [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation (#5470) 2022-07-21 06:20:47 -07:00
wenningd
c7fe3fd01d [HUDI-3764] Allow loading external configs while querying Hudi tables with Spark (#4915)
Currently, Hudi queries run through Spark do not load
external configurations. For example, if customers enable
metadata listing in their global config file, their queries
still run w/o the metadata feature enabled.
This PR fixes this issue and allows loading global
configs during the Hudi reading phase.

Co-authored-by: Wenning Ding <wenningd@amazon.com>
2022-07-21 15:12:17 +05:30
Alexey Kudinkin
de37774e12 [HUDI-3896] Porting Nested Schema Pruning optimization for Hudi's custom Relations (#5428)
Currently, all Hudi Relations bear a performance gap relative to Spark's HadoopFsRelation.
The reason is that the SchemaPruning optimization rule (pruning nested schemas)
is unfortunately predicated on usage of HadoopFsRelation, meaning that it is
not applied when any other relation is used.

This change ports this rule to Hudi relations (MOR, Incremental, etc.)
by leveraging the HoodieSparkSessionExtensions mechanism
to inject a modified version of the original SchemaPruning rule
that is adapted to work w/ Hudi's custom relations.

- Added customOptimizerRules to HoodieAnalysis
- Added NestedSchemaPrunning Spark's Optimizer rule
- Handle Spark's Optimizer pruned data schema (to effectively prune nested schemas)
- Enable HoodieClientTestHarness to inject HoodieSparkSessionExtensions
- Injecting Spark Session extensions for TestMORDataSource, TestCOWDataSource
- Disabled fallback to HadoopFsRelation
2022-07-21 15:06:06 +05:30
Shiyan Xu
2394c62973 [HUDI-4146][RFC-55] Update config changes proposal (#6162) 2022-07-21 12:55:02 +05:30
Danny Chan
348519f3cd [HUDI-4427] Add a computed column IT test (#6150) 2022-07-21 09:38:26 +08:00
Rahil Chertara
473be87aa5 Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature 2022-07-20 17:04:00 -07:00
Y Ethan Guo
2b828ccb98 [HUDI-4401] Skip HBase version check (#6114) 2022-07-20 14:09:45 -07:00
Danny Chan
e3675fe9b0 [HUDI-4372] Enable metadata table by default for flink (#6066) 2022-07-20 16:10:19 +08:00
Danny Chan
6c3578069e [HUDI-4416] Default database path for hoodie hive catalog (#6136) 2022-07-19 15:38:47 +08:00
冯健
382d19e85b [HUDI-4065] Add FileBasedLockProvider (#6071) 2022-07-19 07:52:47 +08:00
liujinhui
1959b843b7 [HUDI-4409] Improve LockManager wait logic when catching an exception (#6122) 2022-07-18 22:45:52 +08:00
Bo Cui
9282611bae [HUDI-4098] Support HMS for flink HudiCatalog (#6082)
* [HUDI-4098]Support HMS for flink HudiCatalog
2022-07-18 11:46:23 +08:00
Sivabalan Narayanan
3964c476e0 Fix file group count issue with metadata partitions (#5892) 2022-07-18 07:19:29 +05:30
RexAn
ded197800a [HUDI-4170] Allow users to use hoodie.datasource.read.paths to read only the necessary files (#5722)
* Rebase codes

* Move listFileSlices to HoodieBaseRelation

* Fix review

* Fix style

* Fix bug
2022-07-17 16:11:45 +08:00
Alexey Kudinkin
4bda6afe0b [HUDI-4249] Fixing in-memory HoodieData implementation to operate lazily (#5855) 2022-07-16 18:26:48 -05:00
simonsssu
80368a049d [HUDI-3503] Add call procedure for CleanCommand (#6065)
* [HUDI-3503] Add call procedure for CleanCommand
Co-authored-by: simonssu <simonssu@tencent.com>
2022-07-16 22:33:26 +08:00
Danny Chan
6aec9d754f [HUDI-4408] Reuse old rollover file as base file for flink merge handle (#6120) 2022-07-16 20:46:23 +08:00
Danny Chan
0faa562b6f [HUDI-4403] Fix the end input metadata for bounded source (#6116) 2022-07-16 12:02:17 +08:00
Shiyan Xu
726e8e3590 [MINOR] Disable TestHiveSyncGlobalCommitTool (#6119) 2022-07-15 10:23:21 -07:00
JerryYue-M
b781b31045 [HUDI-4397] Flink Inline Cluster and Compact plan distribute strategy changed from rebalance to hash to avoid potential multiple threads accessing the same file (#6106)
Co-authored-by: jerryyue <jerryyue@didiglobal.com>
2022-07-15 12:21:50 +08:00
Tim Brown
4898ea52f7 [HUDI-4399][RFC-57] Claim RFC 57 for DeltaStreamer proto support (#6112) 2022-07-14 18:11:45 -07:00
Danny Chan
05606708fa [HUDI-4393] Add marker file for target file when flink merge handle rolls over (#6103) 2022-07-14 16:00:08 +08:00
Yann Byron
aaccc63ad5 [RFC-51] [HUDI-3478] Hudi to support Change-Data-Capture (#5436)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-07-14 00:36:26 -07:00
Danny Chan
e70a427956 [HUDI-4391] Incremental read from archived commits for flink (#6096) 2022-07-14 15:19:26 +08:00
Luning (Lucas) Wang
ee956b8951 [HUDI-4379] Bump Flink versions to 1.14.5 and 1.15.1 (#6080) 2022-07-12 15:03:24 +08:00
HunterXHunter
994c561488 [HUDI-4298] When reading the mor table with QUERY_TYPE_SNAPSHOT,Unabl… (#5937)
* [HUDI-4298] Add test case for reading mor table

Signed-off-by: LinMingQiang <1356469429@qq.com>
2022-07-12 14:49:44 +08:00
Sagar Sumit
a270eeeef9 [MINOR] Update RFCs status (#6078) 2022-07-11 13:04:25 +05:30
Shiyan Xu
51244eba82 [HUDI-4323] Make database table names optional in sync tool (#6073)
* [HUDI-4323] Make database table names optional in sync tool
* Infer these properties from the table config
2022-07-11 10:03:31 +05:30
冯健
63f95ab801 [HUDI-3730][RFC-55] Improve hudi-sync classes design and simplify configs (#5695)
* [HUDI-4146] RFC for Improve Hive/Meta sync class design and hierarchies

Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-07-10 11:42:34 +05:30
Shiyan Xu
046044c83d [HUDI-4324] Remove use_jdbc config from hudi sync (#6072)
* [HUDI-4324] Remove use_jdbc config from hudi sync
* Users should use HIVE_SYNC_MODE instead
2022-07-10 11:16:09 +05:30
dependabot[bot]
10aec07fd2 [MINOR] Bump xalan from 2.7.1 to 2.7.2 (#6062)
Bumps xalan from 2.7.1 to 2.7.2.

---
updated-dependencies:
- dependency-name: xalan:xalan
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-07-09 20:02:36 +05:30
liujinhui
126b88b48d [HUDI-2150] Rename/Restructure configs for better modularity (#6061)
- Move clean related configuration to HoodieCleanConfig
- Move Archival related configuration to HoodieArchivalConfig
- Move hoodie.compaction.payload.class to HoodiePayloadConfig
2022-07-09 20:00:48 +05:30
superche
6566fc6625 [HUDI-3500] Add call procedure for RepairsCommand (#6053) 2022-07-09 09:29:14 +08:00
xiarixiaoyao
b686c07407 [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields (#6017)
* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields.

* fix comments

Co-authored-by: public (bdcee5037027) <mengtao0326@qq.com>
2022-07-09 03:08:38 +08:00
Kumud Kumar Srivatsava Tirupati
fc8d96246a [HUDI-4335] Bug fixes in AWSGlueCatalogSyncClient post schema evolution. (#5995)
* Fix updateTableParameters, which was not excluding partition columns, and the updateTableProperties boolean check

* Fix serde parameters getting overridden on table property update

* Remove stale syncConfig
2022-07-08 09:47:49 -05:00
苏承祥
f20acb8dc3 [HUDI-4367] Support copyToTable on call (#6054) 2022-07-08 15:08:11 +08:00
Danny Chan
a998586396 [minor] following 4152, refactor the clazz about plan selection strategy (#6060) 2022-07-08 09:56:10 +08:00
Danny Chan
c744848c59 [HUDI-4366] Synchronous cleaning for flink bounded source (#6051) 2022-07-08 09:55:07 +08:00
KnightChess
5673819736 [HUDI-4309] fix spark32 repartition error (#6033) 2022-07-08 09:38:09 +08:00
e74ad324c3 [HUDI-4152] Flink offline compaction support compacting multi compaction plan at once (#5677)
* [HUDI-4152] Flink offline compaction allow compact multi compaction plan at once

* [HUDI-4152] Fix exception for duplicated uid when multi compaction plan are compacted

* [HUDI-4152] Provide UT & IT for compact multi compaction plan

* [HUDI-4152] Put multi compaction plans into one compaction plan source

* [HUDI-4152] InstantCompactionPlanSelectStrategy allow multi instant by using comma

* [HUDI-4152] Add IT for InstantCompactionPlanSelectStrategy
2022-07-07 14:11:26 +08:00
Danny Chan
7eeaff9ee0 [HUDI-4357] Support flink 1.15.x (#6050) 2022-07-06 13:42:58 +08:00
shenjiayu17
b18c32379f [HUDI-4219] Merge Into when update expression "col=s.col+2" on precombine cause exception (#5828) 2022-07-06 09:10:35 +08:00
董可伦
3670e82af5 [HUDI-4356] Fix the error when sync hive in CTAS (#6029) 2022-07-06 00:08:23 +08:00
ForwardXu
8570c3aab4 [HUDI-4359] Support show_fs_path_detail command on Call Procedure Command (#6042) 2022-07-05 23:56:32 +08:00
xi chaomin
23c9c5c296 [HUDI-3836] Improve the way of fetching metadata partitions from table (#5286)
Co-authored-by: xicm <xicm@asiainfo.com>
2022-07-05 07:50:17 -07:00
Y Ethan Guo
fbda4ad5bd [HUDI-4360] Fix HoodieDropPartitionsTool based on refactored meta sync (#6043) 2022-07-04 23:37:21 -07:00
YueZhang
45fdcf68a1 [HUDI-3116] Add a new HoodieDropPartitionsTool to let users drop table partitions through a standalone job. (#4459)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-07-04 19:24:18 -07:00
Shiyan Xu
6187622178 [MINOR] Improve variable names (#6039) 2022-07-04 18:03:50 -07:00