lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Alexey Kudinkin	c86edfc28e	[HUDI-4319] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting (#5966 ) * Fixed Dictionary encoding config not being properly propagated to Parquet writer (making it unable to apply it, substantially bloating the storage footprint)	2022-06-24 23:52:28 -04:00
Sagar Sumit	eeb78f23e6	[HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups (#5941 ) * [HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups * Separate out incremental sync fsview test with clustering	2022-06-23 19:40:08 +05:30
Danny Chan	1dbd9d407a	[minor] following 4270, add unit tests for the keys lost case (#5918 )	2022-06-22 16:56:06 +08:00
LinMingQiang	c9590790f8	[HUDI-4279] Strengthen the remote fs view lagging check when latest commit refresh is enabled (#5917 ) Signed-off-by: LinMingQiang <1356469429@qq.com>	2022-06-22 10:32:21 +08:00
RexAn	17ac5a4573	[HUDI-4173] Fix wrong results if the user read no base files hudi table by glob paths (#5723 )	2022-06-20 23:02:34 +05:30
felixYyu	d7facb8cb8	fix remove redundant Variable (#5806 )	2022-06-20 15:21:49 +08:00
huberylee	d4f0326b4b	[HUDI-4275] Refactor rollback inflight instant for clustering/compaction to reuse some code (#5894 )	2022-06-20 14:29:21 +08:00
huberylee	fec49dc12b	[HUDI-4165] Support Create/Drop/Show/Refresh Index Syntax for Spark SQL (#5761 ) * Support Create/Drop/Show/Refresh Index Syntax for Spark SQL	2022-06-17 18:33:58 +08:00
Sivabalan Narayanan	08fe281091	[HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata (#5829 )	2022-06-11 13:17:42 -07:00
Alexey Kudinkin	35afdb4316	[HUDI-4178] Addressing performance regressions in Spark DataSourceV2 Integration (#5737 ) There are multiple issues with our current DataSource V2 integrations: b/c we advertise Hudi tables as V2, Spark expects it to implement certain APIs which are not implemented at the moment, instead we're using custom Resolution rule (in HoodieSpark3Analysis) to instead manually fallback to V1 APIs. This commit fixes the issue by reverting DSv2 APIs and making Spark use V1, except for schema evaluation logic.	2022-06-07 16:30:46 -07:00
Sivabalan Narayanan	f85cd9b16d	[HUDI-4200] Fixing sorting of keys fetched from metadata table (#5773 ) - Keys fetched from the metadata table, especially from the base file reader, are not sorted, and hence may result in throwing NPE (key prefix search) or unnecessary seeks to the start of the HFile (full key lookups). Fixing the same in this patch. This is not an issue with log blocks, since sorting is taken care of within HoodieHfileDataBlock. - Commit where the sorting was mistakenly reverted [HUDI-3760] Adding capability to fetch Metadata Records by prefix #5208	2022-06-07 08:19:52 -04:00
Sivabalan Narayanan	7da97c8096	[HUDI-4171] Fixing Non partitioned with virtual keys in read path (#5747 ) - When Non partitioned key gen is used with virtual keys, read path could break since partition path may not exist.	2022-06-06 15:48:21 -04:00
Alexey Kudinkin	4f7ea8c79a	[HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing (#5733 ) As has been outlined in HUDI-4176, we've hit a roadblock while testing Hudi on a large dataset (~1Tb) having pretty fat commits where Hudi's commit metadata could reach into 100s of Mbs. Given the size of some of our commit metadata instances, Spark's parsing and resolving phase (when spark.sql(...) is involved, but before returned Dataset is dereferenced) starts to dominate some of our queries' execution time. - Rebased onto new APIs to avoid excessive Hadoop's Path allocations - Eliminated hasOperationField completely to avoid repetitive computations - Cleaning up duplication in HoodieActiveTimeline - Added caching for common instances of HoodieCommitMetadata - Made tableStructSchema lazy;	2022-06-06 13:14:26 -04:00
marchpure	73b0be3c96	[HUDI-4192] HoodieHFileReader scan top cells after bottom cells throw NullPointerException (#5755 ) SeekTo top cells avoid NullPointerException	2022-06-06 12:07:26 +08:00
Danny Chan	7f8630cc57	[HUDI-4167] Remove the timeline refresh with initializing hoodie table (#5716 ) The timeline refresh on table initialization invokes the fs view #sync, which has two actions now: 1. reload the timeline of the fs view, so that the next fs view request is based on this timeline metadata 2. if this is a local fs view, clear all the local states; if this is a remote fs view, send request to sync the remote fs view But, let's see the construction, the meta client is instantiated freshly so the timeline is already the latest, the table is also constructed freshly, so the fs view has no local states, that means, the #sync is totally unnecessary. In this patch, the metadata lifecycle and data set fs view are kept in sync, when the fs view is refreshed, the underneath metadata is also refreshed synchronously. The freshness of the metadata follows the same rules as data fs view: 1. if the fs view is local, the visibility is based on the client table metadata client's latest commit 2. if the fs view is remote, the timeline server would #sync the fs view and metadata together based on the lagging server local timeline From the perspective of client, no need to care about the refresh action anymore no matter whether the metadata table is enabled or not. That makes the client logic clearer and less error-prone. Removing the timeline refresh has another benefit: it avoids unnecessary #refresh of the remote fs view, if all the clients send request to #sync the remote fs view, the server would encounter conflicts and the client encounters a response error.	2022-06-02 09:48:48 +08:00
Raymond Xu	0a72458291	[HUDI-3551] Fix testStorageSchemes for oci storage (#5711 )	2022-05-28 12:13:37 -07:00
Carter Shanklin	62d792368b	[HUDI-3551] Add the Oracle Cloud Infrastructure (oci) Object Storage URI scheme (#4952 )	2022-05-28 08:26:14 -07:00
komao	8d2f009048	[HUDI-4124] Add valid check in Spark Datasource configs (#5637 ) Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>	2022-05-26 05:21:28 -07:00
Danny Chan	676d5cefe0	[HUDI-4138] Fix the concurrency modification of hoodie table config for flink (#5660 ) * Remove the metadata cleaning strategy for flink, that means the multi-modal index may be affected * Improve the HoodieTable#clearMetadataTablePartitionsConfig to only update table config when necessary * Remove the modification of read code path in HoodieTableConfig	2022-05-24 13:07:55 +08:00
Heap	47b764ec33	[HUDI-4134] Fix Method naming consistency issues in FSUtils (#5655 )	2022-05-23 15:28:48 -07:00
felixYyu	716e995a38	[MINOR] Removing redundant semicolons and line breaks (#5662 )	2022-05-23 15:26:36 -07:00
Danny Chan	ebbe56e862	[minor] Some code refactoring for LogFileComparator and Instant instantiation (#5600 )	2022-05-18 09:30:09 +08:00
Shawy Geng	ad773b3d96	[HUDI-3654] Preparations for hudi metastore. (#5572 ) * [HUDI-3654] Preparations for hudi metastore. Co-authored-by: gengxiaoyu <gengxiaoyu@bytedance.com>	2022-05-17 09:47:10 +08:00
Yuwei XIAO	61030d8e7a	[HUDI-3123] consistent hashing index: basic write path (upsert/insert) (#4480 ) 1. basic write path(insert/upsert) implementation 2. adapt simple bucket index	2022-05-16 11:07:01 +08:00
陈浩	1fded18dff	fix hive sync no partition table error (#5585 )	2022-05-16 09:51:24 +08:00
Bo Cui	7fb436d3cf	[HUDI-4078][HUDI-FLINK]BootstrapOperator contains the pending compact… (#5545 ) * [HUDI-4078][HUDI-FLINK]BootstrapOperator contains the pending compaction files	2022-05-13 14:32:48 +08:00
Lanyuanxiaoyao	4258a71517	[HUDI-4003] Try to read all the log file to parse schema (#5473 )	2022-05-10 18:45:53 -04:00
Sivabalan Narayanan	6285a239a3	[HUDI-3995] Making perf optimizations for bulk insert row writer path (#5462 ) - Avoid using udf for key generator for SimpleKeyGen and NonPartitionedKeyGen. - Fixed NonPartitioned Key generator to directly fetch record key from row rather than involving GenericRecord. - Other minor fixes around using static values instead of looking up hashmap.	2022-05-09 12:40:22 -04:00
guanziyue	75eaa0bffe	[HUDI-4055]refactor ratelimiter to avoid stack overflow (#5530 )	2022-05-09 10:27:37 +08:00
guanziyue	abb4893b25	[HUDI-2875] Make HoodieParquetWriter Thread safe and memory executor exit gracefully (#4264 )	2022-05-05 13:49:34 -07:00
LiChuang	4e928a6fe1	[HUDI-3943] Some description fixes for 0.10.1 docs (#5447 )	2022-04-28 15:18:56 -07:00
Y Ethan Guo	c4bc2deea0	[HUDI-3936] Fix projection for a nested field as pre-combined key (#5379 ) This PR fixes the projection logic around a nested field which is used as the pre-combined key field. The fix is to only check and append the root level field for projection, i.e., "a", for a nested field "a.b.c" in the mandatory columns. - Changes the logic to check and append the root level field for a required nested field in the mandatory columns in HoodieBaseRelation.appendMandatoryColumns	2022-04-21 20:17:57 -04:00
xiarixiaoyao	037f89ee7c	[HUDI-3921] Fixed schema evolution cannot work with HUDI-3855 (#5376 ) - when column names are renamed (schema evolution enabled), while copying records from old data file with HoodieMergeHandle, renamed columns weren't handled well.	2022-04-21 18:27:54 -04:00
Alexey Kudinkin	4b296f79cc	[HUDI-3935] Adding config to fallback to enabled Partition Values extraction from Partition path (#5377 )	2022-04-21 01:36:19 -07:00
Alexey Kudinkin	f7544e23ac	[HUDI-3204] Fixing partition-values being derived from partition-path instead of source columns (#5364 ) - Scaffolded `Spark24HoodieParquetFileFormat` extending `ParquetFileFormat` and overriding the behavior of adding partition columns to every row - Amended `SparkAdapter`s `createHoodieParquetFileFormat` API to be able to configure whether to append partition values or not - Fallback to append partition values in cases when the source columns are not persisted in data-file - Fixing HoodieBaseRelation incorrectly handling mandatory columns	2022-04-20 19:30:27 +08:00
Sivabalan Narayanan	b00d03fd62	[HUDI-3886] Adding default null for some of the fields in col stats in MDT schema (#5329 )	2022-04-18 10:37:03 -04:00
董可伦	b8e465fdfc	[MINOR] Fix typos in log4j-surefire.properties (#5212 )	2022-04-15 13:33:37 -07:00
Raymond Xu	d6a64f765e	Revert "[HUDI-3652] Make ObjectSizeCalculator threadlocal to reduce memory footprint (#5060 )" (#5323 ) This reverts commit `f0ab4a6e9e`.	2022-04-14 12:28:27 -07:00
sekaiga	f0ab4a6e9e	[HUDI-3652] Make ObjectSizeCalculator threadlocal to reduce memory footprint (#5060 ) Co-authored-by: zhouhuidong <zhouhuidong@bilibili.co>	2022-04-14 03:08:14 -07:00
Y Ethan Guo	571cbe4c11	[MINOR] Code cleanup in test utils (#5312 )	2022-04-13 17:37:07 -04:00
Y Ethan Guo	c7f41f9018	[HUDI-3869] Improve error handling of loading Hudi conf (#5311 )	2022-04-13 17:25:31 -04:00
Danny Chan	0281725c6b	[MINOR] Inline the partition path logic into the builder (#5310 )	2022-04-13 16:54:39 +05:30
Vinoth Govindarajan	d16740976e	[HUDI-3838] Implemented drop partition column feature for delta streamer code path (#5294 ) * [HUDI-3838] Implemented drop partition column feature for delta streamer code path * Ensure drop partition table config is updated in hoodie.props Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-12 18:10:30 +05:30
Alexey Kudinkin	101b82a679	[HUDI-3839] Fixing incorrect selection of MT partitions to be updated (#5274 ) * Fixing incorrect selection of MT partitions to be updated * Ensure that metadata partitions table config is inherited correctly Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-12 13:37:52 +05:30
Sivabalan Narayanan	f91e9e63e1	[HUDI-3799] Fixing not deleting empty instants w/o archiving (#5261 )	2022-04-11 21:02:43 -07:00
Alexey Kudinkin	7a9d48d126	[HUDI-3834] Fixing performance hits in reading Column Stats Index (#5266 ) Fixing performance hits in reading Column Stats Index: [HUDI-3834] There's substantial performance degradation in Avro 1.10 default generated Builder classes: they by default rely on SpecificData.getForSchema that load corresponding model's class using reflection, which takes a hit when executed on the hot-path (this was bringing overall runtime to read full Column Stats Index of 800k records to 60s, whereas now it's taking mere 3s) Addressing memory churn by over-used Hadoop's Path creation: Path ctor is not a lightweight sequence and produces quite a bit of memory churn adding pressure on GC. Cleaning such avoidable allocations up to make sure there's no unnecessarily added pressure on GC.	2022-04-10 13:42:06 -04:00
Alexey Kudinkin	81b25c543a	[HUDI-3825] Fixing Column Stats Index updating sequence (#5267 )	2022-04-08 23:14:08 -07:00
Alexey Kudinkin	d7cc767dbc	[HUDI-3825] Fixing non-partitioned table Partition Records persistence in MT (#5259 ) * Filter out empty string (for non-partitioned table) being added to "__all_partitions__" record * Instead of filtering, transform empty partition-id to `NON_PARTITIONED_NAME` * Cleaned up `HoodieBackedTableMetadataWriter` * Make sure REPLACE_COMMITS are handled as well	2022-04-08 15:58:31 +05:30
Danny Chan	67215abaf0	[HUDI-3827] Promote the inetAddress picking strategy for NetworkUtils#getHostname (#5260 )	2022-04-08 14:33:56 +08:00
Sagar Sumit	df87095ef0	[HUDI-3454] Fix partition name in all code paths for LogRecordScanner (#5252 ) * Depend on FSUtils#getRelativePartitionPath(basePath, logFilePath.getParent) to get the partition. * If the list of log file paths in the split is empty, then fallback to usual behaviour.	2022-04-08 09:59:36 +05:30

686 Commits