Commit Graph

725 Commits

Author SHA1 Message Date
jcxiaozf
eb4b741c38 If multiple files share the same partition path and file ID, sort them by modification time to avoid reading files left over from earlier failed writes. 2024-05-24 15:17:37 +08:00
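As a rough illustration of the idea in this commit (not the actual patch), the sketch below groups files by partition path and file ID, sorts each group by modification time, and keeps the newest entry. `FileEntry` and `latest_file_per_group` are hypothetical names, and treating the most recently modified file as the valid one is an assumption drawn from the commit message.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class FileEntry(NamedTuple):
    """Hypothetical stand-in for a file status record."""
    partition_path: str
    file_id: str
    path: str
    modification_time: int  # epoch milliseconds


def latest_file_per_group(files: Iterable[FileEntry]) -> Dict[Tuple[str, str], FileEntry]:
    """Return the most recently modified file for each (partition path, file ID)."""
    groups = defaultdict(list)
    for entry in files:
        groups[(entry.partition_path, entry.file_id)].append(entry)

    latest = {}
    for key, members in groups.items():
        # Sort ascending by modification time; the last element is the newest.
        members.sort(key=lambda e: e.modification_time)
        latest[key] = members[-1]
    return latest
```

Older files in a group are assumed here to be leftovers of failed writes, so only the last element after sorting is read.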
v-zhangjc9
32f7e323dc Change version to private 2024-05-24 15:16:38 +08:00
v-zhangjc9
8462d79ead Change hadoop version to 3.1.2 2022-08-02 16:06:39 +08:00
Sivabalan Narayanan
765dd2eae6 [HUDI-4221] Optimizing getAllPartitionPaths (#6234)
- Leveraging Spark parallelism for directory processing
2022-07-29 03:49:56 -04:00
Udit Mehrotra
c39e88dcf0 [HUDI-4495] Fix handling of S3 paths incompatible with java URI standards (#6237) 2022-07-28 20:04:14 -07:00
Alexey Kudinkin
cfd0c1ee34 [HUDI-4081][HUDI-4472] Addressing Spark SQL vs Spark DS performance gap (#6213) 2022-07-28 15:36:03 -07:00
Rahil C
0a5ce000bf [HUDI-4490] Make AWSDmsAvroPayload class backwards compatible (#6229)
Co-authored-by: Rahil Chertara <rchertar@amazon.com>
2022-07-27 21:55:06 -05:00
Y Ethan Guo
924c30c7ea [HUDI-4469] Flip reuse flag to true in HoodieBackedTableMetadata to improve file listing (#6214) 2022-07-27 14:04:59 -07:00
Shiyan Xu
1ea1e659c2 [HUDI-4474] Infer metasync configs (#6217)
- infer repeated sync configs from original configs
  - `META_SYNC_BASE_FILE_FORMAT`
    - infer from `org.apache.hudi.common.table.HoodieTableConfig.BASE_FILE_FORMAT`
  - `META_SYNC_ASSUME_DATE_PARTITION`
    - infer from `org.apache.hudi.common.config.HoodieMetadataConfig.ASSUME_DATE_PARTITIONING`
  - `META_SYNC_DECODE_PARTITION`
    - infer from `org.apache.hudi.common.table.HoodieTableConfig.URL_ENCODE_PARTITIONING`
  - `META_SYNC_USE_FILE_LISTING_FROM_METADATA`
    - infer from `org.apache.hudi.common.config.HoodieMetadataConfig.ENABLE`

As proposed in https://github.com/apache/hudi/blob/master/rfc/rfc-55/rfc-55.md#compatible-changes
2022-07-26 15:28:31 +05:30
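The inference described in this commit can be pictured as a fallback lookup: when a sync config is not set, take the value of the config it is derived from. The sketch below is only an illustration under that assumption. The dictionary keys are the constant names quoted in the commit message, not real property keys, and `infer_sync_configs` is a hypothetical helper. Letting an explicitly set sync config win over the inferred value is also an assumption.

```python
# Mapping of sync config name -> source config name, as listed in the commit message.
# These strings are labels for illustration, not actual Hudi property keys.
INFERRED_FROM = {
    "META_SYNC_BASE_FILE_FORMAT": "HoodieTableConfig.BASE_FILE_FORMAT",
    "META_SYNC_ASSUME_DATE_PARTITION": "HoodieMetadataConfig.ASSUME_DATE_PARTITIONING",
    "META_SYNC_DECODE_PARTITION": "HoodieTableConfig.URL_ENCODE_PARTITIONING",
    "META_SYNC_USE_FILE_LISTING_FROM_METADATA": "HoodieMetadataConfig.ENABLE",
}


def infer_sync_configs(props: dict) -> dict:
    """Fill in unset sync configs from the configs they are inferred from."""
    resolved = dict(props)  # do not mutate the caller's properties
    for target, source in INFERRED_FROM.items():
        # Assumption: an explicitly set sync config takes precedence.
        if target not in resolved and source in resolved:
            resolved[target] = resolved[source]
    return resolved
```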
Dongwook Kwon
74d7b4d751 [HUDI-4471] Relocate AWSDmsAvroPayload class to hudi-common 2022-07-25 17:51:27 -07:00
Alexey Kudinkin
e7c8df7e8b [HUDI-4250][HUDI-4202] Optimize performance of Column Stats Index reading in Data Skipping (#5746)
We provide an alternative way of fetching Column Stats Index within the reading process to avoid the penalty of a more heavy-weight execution scheduled through a Spark engine.
2022-07-25 15:36:12 -07:00
Sagar Sumit
6e7ac45735 [HUDI-3884] Support archival beyond savepoint commits (#5837)
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-07-25 13:42:29 -05:00
Shiyan Xu
eee6a02f77 [HUDI-4456] Clean up test resources (#6203) 2022-07-25 10:13:06 -05:00
Vander
2a08a65f71 [MINOR] Fix typos in Spark client related classes (#6204) 2022-07-24 21:41:42 -07:00
Danny Chan
a0ffd05b77 [HUDI-4448] Remove the latest commit refresh for timeline server (#6179) 2022-07-23 16:10:53 -07:00
Shiyan Xu
d5c7c79d87 Revert "[HUDI-4324] Remove use_jdbc config from hudi sync (#6072)" (#6160)
This reverts commit 046044c83d.
2022-07-22 17:18:45 -07:00
Sagar Sumit
a36762a862 [HUDI-4303] Use Hive sentinel value as partition default to avoid type cast issues (#5954) 2022-07-22 17:14:36 -07:00
Alexey Kudinkin
eea4a692c0 [HUDI-4039] Make sure all builtin KeyGenerators properly implement Spark specific APIs (#5523)
This set of changes makes sure that all builtin KeyGenerators properly implement Spark-specific APIs in a performant way (minimizing key-generators overhead)
2022-07-22 08:35:07 -07:00
Alexey Kudinkin
41653fc708 [MINOR] Fallback to default for hive-style partitioning, url-encoding configs (#6175)
- Fixes broken ITTestHoodieDemo#testParquetDemo
2022-07-22 18:55:58 +05:30
Y Ethan Guo
50cdb867c7 [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table (#6113)
Fixes the missing bloom filters in metadata table in the non-partitioned table due to incorrect record key generation, because of wrong file names when generating the metadata payload for the bloom filter.
2022-07-21 11:38:25 -07:00
Alexey Kudinkin
a33bdd32e3 [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation (#5470) 2022-07-21 06:20:47 -07:00
wenningd
c7fe3fd01d [HUDI-3764] Allow loading external configs while querying Hudi tables with Spark (#4915)
Currently, Hudi queries run with Spark do not load the external configurations. For example, if customers enabled metadata listing in their global config file, their queries would still run without the metadata feature enabled. This PR fixes the issue and allows loading global configs during the Hudi reading phase.

Co-authored-by: Wenning Ding <wenningd@amazon.com>
2022-07-21 15:12:17 +05:30
Y Ethan Guo
2b828ccb98 [HUDI-4401] Skip HBase version check (#6114) 2022-07-20 14:09:45 -07:00
Danny Chan
e3675fe9b0 [HUDI-4372] Enable metadata table by default for flink (#6066) 2022-07-20 16:10:19 +08:00
冯健
382d19e85b [HUDI-4065] Add FileBasedLockProvider (#6071) 2022-07-19 07:52:47 +08:00
Bo Cui
9282611bae [HUDI-4098] Support HMS for flink HudiCatalog (#6082)
* [HUDI-4098]Support HMS for flink HudiCatalog
2022-07-18 11:46:23 +08:00
Sivabalan Narayanan
3964c476e0 Fix file group count issue with metadata partitions (#5892) 2022-07-18 07:19:29 +05:30
Alexey Kudinkin
4bda6afe0b [HUDI-4249] Fixing in-memory HoodieData implementation to operate lazily (#5855) 2022-07-16 18:26:48 -05:00
Danny Chan
e70a427956 [HUDI-4391] Incremental read from archived commits for flink (#6096) 2022-07-14 15:19:26 +08:00
HunterXHunter
994c561488 [HUDI-4298] When reading the mor table with QUERY_TYPE_SNAPSHOT,Unabl… (#5937)
* [HUDI-4298] Add test case for reading mor table

Signed-off-by: LinMingQiang <1356469429@qq.com>
2022-07-12 14:49:44 +08:00
Shiyan Xu
046044c83d [HUDI-4324] Remove use_jdbc config from hudi sync (#6072)
* [HUDI-4324] Remove use_jdbc config from hudi sync
* Users should use HIVE_SYNC_MODE instead
2022-07-10 11:16:09 +05:30
xiarixiaoyao
b686c07407 [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields (#6017)
* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields.

* fix comments

Co-authored-by: public (bdcee5037027) <mengtao0326@qq.com>
2022-07-09 03:08:38 +08:00
xi chaomin
23c9c5c296 [HUDI-3836] Improve the way of fetching metadata partitions from table (#5286)
Co-authored-by: xicm <xicm@asiainfo.com>
2022-07-05 07:50:17 -07:00
Shiyan Xu
6187622178 [MINOR] Improve variable names (#6039) 2022-07-04 18:03:50 -07:00
Shiyan Xu
c0e1587966 [HUDI-3730] Improve meta sync class design and hierarchies (#5854)
* [HUDI-3730] Improve meta sync class design and hierarchies (#5754)
* Implements class design proposed in RFC-55

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
2022-07-03 14:47:25 +05:30
RexAn
62a0c962ac [HUDI-3634] Could read empty or partial HoodieCommitMetaData in downstream if using HDFS (#5048)
Add HDFS-specific logic for creating an immutable file: first create file.tmp, then rename it to the final file
2022-06-30 11:07:40 -07:00
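The write-then-rename pattern named in this commit can be sketched on a local filesystem as follows. This is an analogy only: it uses Python's `os.replace` rather than any HDFS API, and `write_immutable` is a hypothetical helper name. The point is that readers never observe the final path until the content is fully written.

```python
import os


def write_immutable(path: str, data: bytes) -> None:
    """Write data to path + '.tmp', then rename it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach disk before the rename
    # The rename publishes the complete file under its final name in one step.
    os.replace(tmp_path, path)
```

A downstream reader that only looks at the final path therefore sees either no file or the complete file, which is the failure mode the commit title describes.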
cxzl25
6a01f7029c [MINOR] Following #2070, Fix BindException when running tests on shared machines. (#5951) 2022-06-29 19:20:59 -07:00
luoyajun
3948b8935a [HUDI-4336] Fix records overwritten bug with binary primary key (#5996) 2022-06-30 09:12:00 +08:00
wenningd
03a94d9ff5 [HUDI-4331] Allow loading external config file from class loader (#5987)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2022-06-29 17:04:34 -07:00
Alexey Kudinkin
c86edfc28e [HUDI-4319] Fixed Parquet's PLAIN_DICTIONARY encoding not being applied when bulk-inserting (#5966)
* Fixed Dictionary encoding config not being properly propagated to Parquet writer (making it unable to apply it, substantially bloating the storage footprint)
2022-06-24 23:52:28 -04:00
Sagar Sumit
eeb78f23e6 [HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups (#5941)
* [HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups

* Separate out incremental sync fsview test with clustering
2022-06-23 19:40:08 +05:30
Danny Chan
1dbd9d407a [minor] following 4270, add unit tests for the keys lost case (#5918) 2022-06-22 16:56:06 +08:00
LinMingQiang
c9590790f8 [HUDI-4279] Strengthen the remote fs view lagging check when latest commit refresh is enabled (#5917)
Signed-off-by: LinMingQiang <1356469429@qq.com>
2022-06-22 10:32:21 +08:00
RexAn
17ac5a4573 [HUDI-4173] Fix wrong results when reading a Hudi table with no base files via glob paths (#5723) 2022-06-20 23:02:34 +05:30
felixYyu
d7facb8cb8 Remove redundant variable (#5806) 2022-06-20 15:21:49 +08:00
huberylee
d4f0326b4b [HUDI-4275] Refactor rollback inflight instant for clustering/compaction to reuse some code (#5894) 2022-06-20 14:29:21 +08:00
huberylee
fec49dc12b [HUDI-4165] Support Create/Drop/Show/Refresh Index Syntax for Spark SQL (#5761)
* Support Create/Drop/Show/Refresh Index Syntax for Spark SQL
2022-06-17 18:33:58 +08:00
Sivabalan Narayanan
08fe281091 [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata (#5829) 2022-06-11 13:17:42 -07:00
Alexey Kudinkin
35afdb4316 [HUDI-4178] Addressing performance regressions in Spark DataSourceV2 Integration (#5737)
There are multiple issues with our current DataSource V2 integration: because we advertise Hudi tables as V2, Spark expects them to implement certain APIs that are not implemented at the moment. Instead, we rely on a custom resolution rule (in HoodieSpark3Analysis) to manually fall back to the V1 APIs. This commit fixes the issue by reverting the DSv2 APIs and making Spark use V1, except for the schema evaluation logic.
2022-06-07 16:30:46 -07:00
Sivabalan Narayanan
f85cd9b16d [HUDI-4200] Fixing sorting of keys fetched from metadata table (#5773)
- Keys fetched from the metadata table, especially via the base file reader, are not sorted, which may result in an NPE (key prefix search) or unnecessary seeks back to the start of the HFile (full key lookups). This patch fixes that. It is not an issue with log blocks, since sorting is taken care of within HoodieHfileDataBlock.
- Commit where the sorting was mistakenly reverted: [HUDI-3760] Adding capability to fetch Metadata Records by prefix #5208
2022-06-07 08:19:52 -04:00
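To see why key order matters for a forward-scanning reader, the sketch below looks up keys in a sorted record list while only ever moving the search position forward. It is a simplified model, not Hudi's HFile reader: `lookup_sorted` is a hypothetical helper, and the sorted list stands in for a sorted file.

```python
from bisect import bisect_left
from typing import Dict, Iterable, List, Tuple


def lookup_sorted(sorted_records: List[Tuple[str, int]], keys: Iterable[str]) -> Dict[str, int]:
    """Look up keys in records sorted by key, scanning forward only."""
    record_keys = [k for k, _ in sorted_records]
    results = {}
    pos = 0
    # Sorting the requested keys first means the position never has to
    # move backwards, which models avoiding seeks to the start of the file.
    for key in sorted(keys):
        pos = bisect_left(record_keys, key, pos)
        if pos < len(record_keys) and record_keys[pos] == key:
            results[key] = sorted_records[pos][1]
    return results
```

If the requested keys were left unsorted, a forward-only search like this would miss keys that sort before the current position, or would have to restart from the beginning for each of them.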