3101 Commits

Author SHA1 Message Date
Alexey Kudinkin
e7c8df7e8b [HUDI-4250][HUDI-4202] Optimize performance of Column Stats Index reading in Data Skipping (#5746)
We provide an alternative way of fetching the Column Stats Index within the reading process, avoiding the penalty of the heavier-weight execution scheduled through the Spark engine.
2022-07-25 15:36:12 -07:00
Sagar Sumit
6e7ac45735 [HUDI-3884] Support archival beyond savepoint commits (#5837)
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-07-25 13:42:29 -05:00
Shiyan Xu
eee6a02f77 [HUDI-4456] Clean up test resources (#6203) 2022-07-25 10:13:06 -05:00
Shiyan Xu
71c2c3102b [HUDI-4455] Improve test classes for TestHiveSyncTool (#6202)
Improve HiveTestService, HiveTestUtil, and related classes.
2022-07-25 19:05:34 +05:30
superche
1fda9ee9bb [HUDI-4071] Match ROLLBACK_USING_MARKERS_ENABLE in sql as datasource (#6206)
Co-authored-by: superche <superche@tencent.com>
2022-07-25 18:40:23 +08:00
Danny Chan
b513232449 [HUDI-4458] Add a converter cache for flink ColumnStatsIndices (#6205) 2022-07-25 17:49:01 +08:00
Y Ethan Guo
f6e7227ed5 [MINOR] Only log stdout output for non-zero exit from commands in IT (#6199) 2022-07-24 22:08:33 -07:00
Tim Brown
76a28daeb0 [HUDI-4456] Close FileSystem in SparkClientFunctionalTestHarness (#6201) 2022-07-24 21:42:15 -07:00
Vander
2a08a65f71 [MINOR] Fix typos in Spark client related classes (#6204) 2022-07-24 21:41:42 -07:00
simonsssu
1a910fd473 [HUDI-3510] Add sync validate procedure (#6200)
Co-authored-by: simonssu <simonssu@tencent.com>
2022-07-25 09:28:46 +08:00
KnightChess
a54c963543 [HUDI-4348] fix merge into sql data quality in concurrent scene (#6020) 2022-07-24 06:29:47 -07:00
Rahil C
1a5a9f7f03 [HUDI-4439] Fix Amazon CloudWatch reporter for metadata enabled tables (#6164)
Co-authored-by: Udit Mehrotra <uditme@amazon.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
2022-07-23 21:08:21 -07:00
Danny Chan
ba11082282 [HUDI-4450] Revert the checkpoint abort notification (#6181) 2022-07-24 08:44:22 +08:00
Danny Chan
a0ffd05b77 [HUDI-4448] Remove the latest commit refresh for timeline server (#6179) 2022-07-23 16:10:53 -07:00
Alexey Kudinkin
2d745057ea [HUDI-4420] Fixing table schema delineation on partition/data schema for Spark relations (#5708) 2022-07-23 16:59:16 -05:00
Sagar Sumit
da28e38fe3 [HUDI-4071] Make NONE sort mode as default for bulk insert (#6195) 2022-07-23 14:37:04 -05:00
Rahil C
f1f0109ab8 [HUDI-4440] Treat bootstrapped table as non-partitioned in HudiFileIndex if partition column is missing from schema (#6163)
Co-authored-by: Ryan Pifer <rmpifer@umich.edu>
2022-07-23 11:44:40 -07:00
Shiyan Xu
f0e843249c [MINOR] Bump CI timeout to 150m (#6198) 2022-07-23 10:07:51 -05:00
superche
859157ec01 [MINOR] Fix Call Procedure code style (#6186)
Co-authored-by: superche <superche@tencent.com>
2022-07-23 17:18:38 +08:00
Rahil C
a5348cc685 [HUDI-4436] Invalidate cached table in Spark after write (#6159)
Co-authored-by: Ryan Pifer <rmpifer@umich.edu>
2022-07-22 22:47:47 -07:00
冯健
340c3dbbe1 [HUDI-4437] Fix test conflicts by clearing file system cache (#6123)
Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-07-22 17:58:04 -07:00
Rahil C
af10a97e7a [HUDI-4435] Fix Avro field not found issue introduced by Avro 1.10 (#6155)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2022-07-22 17:26:16 -07:00
Shiyan Xu
d5c7c79d87 Revert "[HUDI-4324] Remove use_jdbc config from hudi sync (#6072)" (#6160)
This reverts commit 046044c83d.
2022-07-22 17:18:45 -07:00
Sagar Sumit
a36762a862 [HUDI-4303] Use Hive sentinel value as partition default to avoid type cast issues (#5954) 2022-07-22 17:14:36 -07:00
Alexey Kudinkin
39f2a06c85 [HUDI-3979] Optimize out mandatory columns when no merging is performed (#5430)
For MOR, when no merging is performed there is no point in reading either primary-key or pre-combine-key values (unless the query references them). Skipping these columns can save substantial resources otherwise wasted on reading them.
2022-07-22 15:32:44 -07:00
Shiyan Xu
6b84384022 Revert "[MINOR] Fix CI issue with TestHiveSyncTool (#6110)" (#6192)
This reverts commit d5c904e10e.
2022-07-22 12:20:39 -07:00
Sagar Sumit
716dd3512b [MINOR] Disable Flink compactor IT test (#6189) 2022-07-22 10:16:55 -07:00
Alexey Kudinkin
eea4a692c0 [HUDI-4039] Make sure all builtin KeyGenerators properly implement Spark specific APIs (#5523)
This set of changes makes sure that all builtin KeyGenerators properly implement Spark-specific APIs in a performant way (minimizing key-generator overhead).
2022-07-22 08:35:07 -07:00
Shiyan Xu
d5c904e10e [MINOR] Fix CI issue with TestHiveSyncTool (#6110) 2022-07-22 10:30:00 -05:00
Alexey Kudinkin
41653fc708 [MINOR] Fallback to default for hive-style partitioning, url-encoding configs (#6175)
- Fixes broken ITTestHoodieDemo#testParquetDemo
2022-07-22 18:55:58 +05:30
ForwardXu
51b5783161 [HUDI-4404] Fix insert into dynamic partition write misalignment (#6124) 2022-07-22 09:40:52 +08:00
superche
8e0b47e360 [MINOR] Fix result missing information issue in commits_compare Procedure (#6165)
Co-authored-by: superche <superche@tencent.com>
2022-07-21 16:25:22 -07:00
Sivabalan Narayanan
36e656aa77 [HUDI-4247] Upgrading protocol buffers version for presto bundle (#5852) 2022-07-21 15:58:40 -07:00
Sivabalan Narayanan
2e0dd29714 [HUDI-4204] Fixing NPE with row writer path and with OCC (#5850) 2022-07-21 15:57:34 -07:00
Y Ethan Guo
50cdb867c7 [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table (#6113)
Fixes bloom filters missing from the metadata table for non-partitioned tables. The record keys were generated incorrectly because wrong file names were used when building the metadata payload for the bloom filter.
2022-07-21 11:38:25 -07:00
wenningd
f52b93fd10 Merge pull request #6154 from rahil-c/rahil-c/disable-emrSpark-properties
[HUDI-4434] Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature
2022-07-21 11:35:52 -07:00
Rahil C
2bf7920bd9 [MINOR] Add logger for HoodieCopyOnWriteTableInputFormat (#6161)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2022-07-21 22:27:18 +05:30
Alexey Kudinkin
a33bdd32e3 [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation (#5470) 2022-07-21 06:20:47 -07:00
wenningd
c7fe3fd01d [HUDI-3764] Allow loading external configs while querying Hudi tables with Spark (#4915)
Currently, Hudi queries run through Spark do not load external configurations.
For example, if customers enable metadata listing in their global config file,
their queries still run without the metadata feature enabled.
This PR fixes the issue by loading global configs during the Hudi reading phase.

Co-authored-by: Wenning Ding <wenningd@amazon.com>
2022-07-21 15:12:17 +05:30
Alexey Kudinkin
de37774e12 [HUDI-3896] Porting Nested Schema Pruning optimization for Hudi's custom Relations (#5428)
Currently, all Hudi Relations show a performance gap relative to Spark's HadoopFsRelation.
The reason is that the SchemaPruning optimization rule (which prunes nested schemas)
is predicated on the use of HadoopFsRelation, so it is not applied
when any other relation is used.

This change ports the rule to Hudi relations (MOR, Incremental, etc.)
by using the HoodieSparkSessionExtensions mechanism
to inject a modified version of the original SchemaPruning rule,
adapted to work with Hudi's custom relations.

- Added customOptimizerRules to HoodieAnalysis
- Added NestedSchemaPrunning Spark's Optimizer rule
- Handle Spark's Optimizer pruned data schema (to effectively prune nested schemas)
- Enable HoodieClientTestHarness to inject HoodieSparkSessionExtensions
- Injecting Spark Session extensions for TestMORDataSource, TestCOWDataSource
- Disabled fallback to HadoopFsRelation
2022-07-21 15:06:06 +05:30
Shiyan Xu
2394c62973 [HUDI-4146][RFC-55] Update config changes proposal (#6162) 2022-07-21 12:55:02 +05:30
Danny Chan
348519f3cd [HUDI-4427] Add a computed column IT test (#6150) 2022-07-21 09:38:26 +08:00
Rahil Chertara
473be87aa5 Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature 2022-07-20 17:04:00 -07:00
Y Ethan Guo
2b828ccb98 [HUDI-4401] Skip HBase version check (#6114) 2022-07-20 14:09:45 -07:00
Danny Chan
e3675fe9b0 [HUDI-4372] Enable metadata table by default for flink (#6066) 2022-07-20 16:10:19 +08:00
Danny Chan
6c3578069e [HUDI-4416] Default database path for hoodie hive catalog (#6136) 2022-07-19 15:38:47 +08:00
冯健
382d19e85b [HUDI-4065] Add FileBasedLockProvider (#6071) 2022-07-19 07:52:47 +08:00
liujinhui
1959b843b7 [HUDI-4409] Improve LockManager wait logic when catch exception (#6122) 2022-07-18 22:45:52 +08:00
Bo Cui
9282611bae [HUDI-4098] Support HMS for flink HudiCatalog (#6082)
2022-07-18 11:46:23 +08:00
Sivabalan Narayanan
3964c476e0 Fix file group count issue with metadata partitions (#5892) 2022-07-18 07:19:29 +05:30