lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Dongwook Kwon	74d7b4d751	[HUDI-4471] Relocate AWSDmsAvroPayload class to hudi-common	2022-07-25 17:51:27 -07:00
Alexey Kudinkin	e7c8df7e8b	[HUDI-4250][HUDI-4202] Optimize performance of Column Stats Index reading in Data Skipping (#5746 ) We provide an alternative way of fetching Column Stats Index within the reading process to avoid the penalty of a more heavy-weight execution scheduled through a Spark engine.	2022-07-25 15:36:12 -07:00
Sagar Sumit	6e7ac45735	[HUDI-3884] Support archival beyond savepoint commits (#5837 ) Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-07-25 13:42:29 -05:00
Shiyan Xu	eee6a02f77	[HUDI-4456] Clean up test resources (#6203 )	2022-07-25 10:13:06 -05:00
Vander	2a08a65f71	[MINOR] Fix typos in Spark client related classes (#6204 )	2022-07-24 21:41:42 -07:00
Danny Chan	a0ffd05b77	[HUDI-4448] Remove the latest commit refresh for timeline server (#6179 )	2022-07-23 16:10:53 -07:00
Shiyan Xu	d5c7c79d87	Revert "[HUDI-4324] Remove use_jdbc config from hudi sync (#6072 )" (#6160 ) This reverts commit `046044c83d`.	2022-07-22 17:18:45 -07:00
Sagar Sumit	a36762a862	[HUDI-4303] Use Hive sentinel value as partition default to avoid type cast issues (#5954 )	2022-07-22 17:14:36 -07:00
Alexey Kudinkin	eea4a692c0	[HUDI-4039] Make sure all builtin `KeyGenerator`s properly implement Spark specific APIs (#5523 ) This set of changes makes sure that all builtin KeyGenerators properly implement Spark-specific APIs in a performant way (minimizing key-generators overhead)	2022-07-22 08:35:07 -07:00
Alexey Kudinkin	41653fc708	[MINOR] Fallback to default for hive-style partitioning, url-encoding configs (#6175 ) - Fixes broken ITTestHoodieDemo#testParquetDemo	2022-07-22 18:55:58 +05:30
Y Ethan Guo	50cdb867c7	[HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table (#6113 ) Fixes the missing bloom filters in metadata table in the non-partitioned table due to incorrect record key generation, because of wrong file names when generating the metadata payload for the bloom filter.	2022-07-21 11:38:25 -07:00
Alexey Kudinkin	a33bdd32e3	[HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation (#5470 )	2022-07-21 06:20:47 -07:00
wenningd	c7fe3fd01d	[HUDI-3764] Allow loading external configs while querying Hudi tables with Spark (#4915 ) Currently when doing Hudi queries w/ Spark, it won't load the external configurations. Say if customers enabled metadata listing in their global config file, then this would let them actually query w/o metadata feature enabled. This PR fixes this issue and allows loading global configs during the Hudi reading phase. Co-authored-by: Wenning Ding <wenningd@amazon.com>	2022-07-21 15:12:17 +05:30
Y Ethan Guo	2b828ccb98	[HUDI-4401] Skip HBase version check (#6114 )	2022-07-20 14:09:45 -07:00
Danny Chan	e3675fe9b0	[HUDI-4372] Enable metadata table by default for flink (#6066 )	2022-07-20 16:10:19 +08:00
冯健	382d19e85b	[HUDI-4065] Add FileBasedLockProvider (#6071 )	2022-07-19 07:52:47 +08:00
Bo Cui	9282611bae	[HUDI-4098] Support HMS for flink HudiCatalog (#6082 ) * [HUDI-4098]Support HMS for flink HudiCatalog	2022-07-18 11:46:23 +08:00
Sivabalan Narayanan	3964c476e0	Fix file group count issue with metadata partitions (#5892 )	2022-07-18 07:19:29 +05:30
Alexey Kudinkin	4bda6afe0b	[HUDI-4249] Fixing in-memory `HoodieData` implementation to operate lazily (#5855 )	2022-07-16 18:26:48 -05:00
Danny Chan	e70a427956	[HUDI-4391] Incremental read from archived commits for flink (#6096 )	2022-07-14 15:19:26 +08:00
HunterXHunter	994c561488	[HUDI-4298] When reading the mor table with QUERY_TYPE_SNAPSHOT,Unabl… (#5937 ) * [HUDI-4298] Add test case for reading mor table Signed-off-by: LinMingQiang <1356469429@qq.com>	2022-07-12 14:49:44 +08:00
Shiyan Xu	046044c83d	[HUDI-4324] Remove use_jdbc config from hudi sync (#6072 ) * [HUDI-4324] Remove use_jdbc config from hudi sync * Users should use HIVE_SYNC_MODE instead	2022-07-10 11:16:09 +05:30
xiarixiaoyao	b686c07407	[HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields (#6017 ) * [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields. * fix comments Co-authored-by: public (bdcee5037027) <mengtao0326@qq.com>	2022-07-09 03:08:38 +08:00
xi chaomin	23c9c5c296	[HUDI-3836] Improve the way of fetching metadata partitions from table (#5286 ) Co-authored-by: xicm <xicm@asiainfo.com>	2022-07-05 07:50:17 -07:00
Shiyan Xu	6187622178	[MINOR] Improve variable names (#6039 )	2022-07-04 18:03:50 -07:00
Shiyan Xu	c0e1587966	[HUDI-3730] Improve meta sync class design and hierarchies (#5854 ) * [HUDI-3730] Improve meta sync class design and hierarchies (#5754) * Implements class design proposed in RFC-55 Co-authored-by: jian.feng <fengjian428@gmial.com> Co-authored-by: jian.feng <jian.feng@shopee.com>	2022-07-03 14:47:25 +05:30
RexAn	62a0c962ac	[HUDI-3634] Could read empty or partial HoodieCommitMetaData in downstream if using HDFS (#5048 ) Add the differentiated logic of creating immutable file in HDFS by first creating the file.tmp and then renaming the file	2022-06-30 11:07:40 -07:00
cxzl25	6a01f7029c	[MINOR] Following #2070 , Fix BindException when running tests on shared machines. (#5951 )	2022-06-29 19:20:59 -07:00
luoyajun	3948b8935a	[HUDI-4336] Fix records overwritten bug with binary primary key (#5996 )	2022-06-30 09:12:00 +08:00
wenningd	03a94d9ff5	[HUDI-4331] Allow loading external config file from class loader (#5987 ) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2022-06-29 17:04:34 -07:00
Alexey Kudinkin	c86edfc28e	[HUDI-4319] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting (#5966 ) * Fixed Dictionary encoding config not being properly propagated to Parquet writer (making it unable to apply it, substantially bloating the storage footprint)	2022-06-24 23:52:28 -04:00
Sagar Sumit	eeb78f23e6	[HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups (#5941 ) * [HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups * Separate out incremental sync fsview test with clustering	2022-06-23 19:40:08 +05:30
Danny Chan	1dbd9d407a	[minor] following 4270, add unit tests for the keys lost case (#5918 )	2022-06-22 16:56:06 +08:00
LinMingQiang	c9590790f8	[HUDI-4279] Strengthen the remote fs view lagging check when latest commit refresh is enabled (#5917 ) Signed-off-by: LinMingQiang <1356469429@qq.com>	2022-06-22 10:32:21 +08:00
RexAn	17ac5a4573	[HUDI-4173] Fix wrong results when the user reads a hudi table with no base files by glob paths (#5723 )	2022-06-20 23:02:34 +05:30
felixYyu	d7facb8cb8	Remove redundant variable (#5806 )	2022-06-20 15:21:49 +08:00
huberylee	d4f0326b4b	[HUDI-4275] Refactor rollback inflight instant for clustering/compaction to reuse some code (#5894 )	2022-06-20 14:29:21 +08:00
huberylee	fec49dc12b	[HUDI-4165] Support Create/Drop/Show/Refresh Index Syntax for Spark SQL (#5761 ) * Support Create/Drop/Show/Refresh Index Syntax for Spark SQL	2022-06-17 18:33:58 +08:00
Sivabalan Narayanan	08fe281091	[HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata (#5829 )	2022-06-11 13:17:42 -07:00
Alexey Kudinkin	35afdb4316	[HUDI-4178] Addressing performance regressions in Spark DataSourceV2 Integration (#5737 ) There are multiple issues with our current DataSource V2 integrations: b/c we advertise Hudi tables as V2, Spark expects it to implement certain APIs which are not implemented at the moment; instead we're using a custom Resolution rule (in HoodieSpark3Analysis) to manually fall back to V1 APIs. This commit fixes the issue by reverting DSv2 APIs and making Spark use V1, except for schema evaluation logic.	2022-06-07 16:30:46 -07:00
Sivabalan Narayanan	f85cd9b16d	[HUDI-4200] Fixing sorting of keys fetched from metadata table (#5773 ) - Keys fetched from the metadata table, especially from the base file reader, are not sorted, and hence may result in throwing NPE (key prefix search) or unnecessary seeks to the start of the HFile (full key lookups). Fixing the same in this patch. This is not an issue with log blocks, since sorting is taken care of within HoodieHfileDataBlock. - Commit where the sorting was mistakenly reverted: [HUDI-3760] Adding capability to fetch Metadata Records by prefix #5208	2022-06-07 08:19:52 -04:00
Sivabalan Narayanan	7da97c8096	[HUDI-4171] Fixing Non partitioned with virtual keys in read path (#5747 ) - When Non partitioned key gen is used with virtual keys, read path could break since partition path may not exist.	2022-06-06 15:48:21 -04:00
Alexey Kudinkin	4f7ea8c79a	[HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing (#5733 ) As has been outlined in HUDI-4176, we've hit a roadblock while testing Hudi on a large dataset (~1Tb) having pretty fat commits where Hudi's commit metadata could reach into 100s of Mbs. Given the size of some of our commit metadata instances, Spark's parsing and resolving phase (when spark.sql(...) is involved, but before the returned Dataset is dereferenced) starts to dominate some of our queries' execution time. - Rebased onto new APIs to avoid excessive Hadoop's Path allocations - Eliminated hasOperationField completely to avoid repetitive computations - Cleaning up duplication in HoodieActiveTimeline - Added caching for common instances of HoodieCommitMetadata - Made tableStructSchema lazy	2022-06-06 13:14:26 -04:00
marchpure	73b0be3c96	[HUDI-4192] HoodieHFileReader scan top cells after bottom cells throw NullPointerException (#5755 ) SeekTo top cells avoid NullPointerException	2022-06-06 12:07:26 +08:00
Danny Chan	7f8630cc57	[HUDI-4167] Remove the timeline refresh with initializing hoodie table (#5716 ) The timeline refresh on table initialization invokes the fs view #sync, which has two actions now: 1. reload the timeline of the fs view, so that the next fs view request is based on this timeline metadata 2. if this is a local fs view, clear all the local states; if this is a remote fs view, send request to sync the remote fs view But looking at the construction: the meta client is freshly instantiated, so the timeline is already the latest; the table is also freshly constructed, so the fs view has no local states. That means the #sync is entirely unnecessary. In this patch, the metadata lifecycle and data set fs view are kept in sync: when the fs view is refreshed, the underlying metadata is also refreshed synchronously. The freshness of the metadata follows the same rules as the data fs view: 1. if the fs view is local, the visibility is based on the client table metadata client's latest commit 2. if the fs view is remote, the timeline server would #sync the fs view and metadata together based on the lagging server local timeline From the perspective of the client, there is no need to care about the refresh action anymore, regardless of whether the metadata table is enabled. That makes the client logic clearer and less error-prone. Removing the timeline refresh has another benefit: it avoids unnecessary #refresh of the remote fs view; if all the clients send requests to #sync the remote fs view, the server would encounter conflicts and the clients would encounter response errors.	2022-06-02 09:48:48 +08:00
Raymond Xu	0a72458291	[HUDI-3551] Fix testStorageSchemes for oci storage (#5711 )	2022-05-28 12:13:37 -07:00
Carter Shanklin	62d792368b	[HUDI-3551] Add the Oracle Cloud Infrastructure (oci) Object Storage URI scheme (#4952 )	2022-05-28 08:26:14 -07:00
komao	8d2f009048	[HUDI-4124] Add valid check in Spark Datasource configs (#5637 ) Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>	2022-05-26 05:21:28 -07:00
Danny Chan	676d5cefe0	[HUDI-4138] Fix the concurrency modification of hoodie table config for flink (#5660 ) * Remove the metadata cleaning strategy for flink, that means the multi-modal index may be affected * Improve the HoodieTable#clearMetadataTablePartitionsConfig to only update table config when necessary * Remove the modification of read code path in HoodieTableConfig	2022-05-24 13:07:55 +08:00
Heap	47b764ec33	[HUDI-4134] Fix Method naming consistency issues in FSUtils (#5655 )	2022-05-23 15:28:48 -07:00

716 Commits