lanyuanxiaoyao/hudi

Author	SHA1	Message	Date
Danny Chan	676d5cefe0	[HUDI-4138] Fix the concurrency modification of hoodie table config for flink (#5660 ) * Remove the metadata cleaning strategy for flink, that means the multi-modal index may be affected * Improve the HoodieTable#clearMetadataTablePartitionsConfig to only update table config when necessary * Remove the modification of read code path in HoodieTableConfig	2022-05-24 13:07:55 +08:00
Heap	47b764ec33	[HUDI-4134] Fix Method naming consistency issues in FSUtils (#5655 )	2022-05-23 15:28:48 -07:00
felixYyu	716e995a38	[MINOR] Removing redundant semicolons and line breaks (#5662 )	2022-05-23 15:26:36 -07:00
Danny Chan	ebbe56e862	[minor] Some code refactoring for LogFileComparator and Instant instantiation (#5600 )	2022-05-18 09:30:09 +08:00
Shawy Geng	ad773b3d96	[HUDI-3654] Preparations for hudi metastore. (#5572 ) * [HUDI-3654] Preparations for hudi metastore. Co-authored-by: gengxiaoyu <gengxiaoyu@bytedance.com>	2022-05-17 09:47:10 +08:00
Yuwei XIAO	61030d8e7a	[HUDI-3123] consistent hashing index: basic write path (upsert/insert) (#4480 ) 1. basic write path(insert/upsert) implementation 2. adapt simple bucket index	2022-05-16 11:07:01 +08:00
陈浩	1fded18dff	fix hive sync no partition table error (#5585 )	2022-05-16 09:51:24 +08:00
Bo Cui	7fb436d3cf	[HUDI-4078][HUDI-FLINK]BootstrapOperator contains the pending compact… (#5545 ) * [HUDI-4078][HUDI-FLINK]BootstrapOperator contains the pending compaction files	2022-05-13 14:32:48 +08:00
Lanyuanxiaoyao	4258a71517	[HUDI-4003] Try to read all the log file to parse schema (#5473 )	2022-05-10 18:45:53 -04:00
Sivabalan Narayanan	6285a239a3	[HUDI-3995] Making perf optimizations for bulk insert row writer path (#5462 ) - Avoid using udf for key generator for SimpleKeyGen and NonPartitionedKeyGen. - Fixed NonPartitioned Key generator to directly fetch record key from row rather than involving GenericRecord. - Other minor fixes around using static values instead of looking up hashmap.	2022-05-09 12:40:22 -04:00
guanziyue	75eaa0bffe	[HUDI-4055]refactor ratelimiter to avoid stack overflow (#5530 )	2022-05-09 10:27:37 +08:00
guanziyue	abb4893b25	[HUDI-2875] Make HoodieParquetWriter Thread safe and memory executor exit gracefully (#4264 )	2022-05-05 13:49:34 -07:00
LiChuang	4e928a6fe1	[HUDI-3943] Some description fixes for 0.10.1 docs (#5447 )	2022-04-28 15:18:56 -07:00
Y Ethan Guo	c4bc2deea0	[HUDI-3936] Fix projection for a nested field as pre-combined key (#5379 ) This PR fixes the projection logic around a nested field which is used as the pre-combined key field. The fix is to only check and append the root level field for projection, i.e., "a", for a nested field "a.b.c" in the mandatory columns. - Changes the logic to check and append the root level field for a required nested field in the mandatory columns in HoodieBaseRelation.appendMandatoryColumns	2022-04-21 20:17:57 -04:00
xiarixiaoyao	037f89ee7c	[HUDI-3921] Fixed schema evolution cannot work with HUDI-3855 (#5376 ) - When column names are renamed (schema evolution enabled), renamed columns weren't handled well while copying records from the old data file with HoodieMergeHandle.	2022-04-21 18:27:54 -04:00
Alexey Kudinkin	4b296f79cc	[HUDI-3935] Adding config to fallback to enabled Partition Values extraction from Partition path (#5377 )	2022-04-21 01:36:19 -07:00
Alexey Kudinkin	f7544e23ac	[HUDI-3204] Fixing partition-values being derived from partition-path instead of source columns (#5364 ) - Scaffolded `Spark24HoodieParquetFileFormat` extending `ParquetFileFormat` and overriding the behavior of adding partition columns to every row - Amended `SparkAdapter`s `createHoodieParquetFileFormat` API to be able to configure whether to append partition values or not - Fallback to append partition values in cases when the source columns are not persisted in data-file - Fixing HoodieBaseRelation incorrectly handling mandatory columns	2022-04-20 19:30:27 +08:00
Sivabalan Narayanan	b00d03fd62	[HUDI-3886] Adding default null for some of the fields in col stats in MDT schema (#5329 )	2022-04-18 10:37:03 -04:00
董可伦	b8e465fdfc	[MINOR] Fix typos in log4j-surefire.properties (#5212 )	2022-04-15 13:33:37 -07:00
Raymond Xu	d6a64f765e	Revert "[HUDI-3652] Make ObjectSizeCalculator threadlocal to reduce memory footprint (#5060 )" (#5323 ) This reverts commit `f0ab4a6e9e`.	2022-04-14 12:28:27 -07:00
sekaiga	f0ab4a6e9e	[HUDI-3652] Make ObjectSizeCalculator threadlocal to reduce memory footprint (#5060 ) Co-authored-by: zhouhuidong <zhouhuidong@bilibili.co>	2022-04-14 03:08:14 -07:00
Y Ethan Guo	571cbe4c11	[MINOR] Code cleanup in test utils (#5312 )	2022-04-13 17:37:07 -04:00
Y Ethan Guo	c7f41f9018	[HUDI-3869] Improve error handling of loading Hudi conf (#5311 )	2022-04-13 17:25:31 -04:00
Danny Chan	0281725c6b	[MINOR] Inline the partition path logic into the builder (#5310 )	2022-04-13 16:54:39 +05:30
Vinoth Govindarajan	d16740976e	[HUDI-3838] Implemented drop partition column feature for delta streamer code path (#5294 ) * [HUDI-3838] Implemented drop partition column feature for delta streamer code path * Ensure drop partition table config is updated in hoodie.props Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-12 18:10:30 +05:30
Alexey Kudinkin	101b82a679	[HUDI-3839] Fixing incorrect selection of MT partitions to be updated (#5274 ) * Fixing incorrect selection of MT partitions to be updated * Ensure that metadata partitions table config is inherited correctly Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-12 13:37:52 +05:30
Sivabalan Narayanan	f91e9e63e1	[HUDI-3799] Fixing not deleting empty instants w/o archiving (#5261 )	2022-04-11 21:02:43 -07:00
Alexey Kudinkin	7a9d48d126	[HUDI-3834] Fixing performance hits in reading Column Stats Index (#5266 ) Fixing performance hits in reading Column Stats Index: [HUDI-3834] There's substantial performance degradation in Avro 1.10 default generated Builder classes: by default they rely on SpecificData.getForSchema, which loads the corresponding model's class using reflection and takes a hit when executed on the hot path (this was bringing the overall runtime to read the full Column Stats Index of 800k records to 60s, whereas now it takes a mere 3s). Addressing memory churn from overused Hadoop Path creation: the Path constructor is not lightweight and produces quite a bit of memory churn, adding pressure on GC. Cleaning up such avoidable allocations to make sure there's no unnecessary added pressure on GC.	2022-04-10 13:42:06 -04:00
Alexey Kudinkin	81b25c543a	[HUDI-3825] Fixing Column Stats Index updating sequence (#5267 )	2022-04-08 23:14:08 -07:00
Alexey Kudinkin	d7cc767dbc	[HUDI-3825] Fixing non-partitioned table Partition Records persistence in MT (#5259 ) * Filter out empty string (for non-partitioned table) being added to "__all_partitions__" record * Instead of filtering, transform empty partition-id to `NON_PARTITIONED_NAME` * Cleaned up `HoodieBackedTableMetadataWriter` * Make sure REPLACE_COMMITS are handled as well	2022-04-08 15:58:31 +05:30
Danny Chan	67215abaf0	[HUDI-3827] Promote the inetAddress picking strategy for NetworkUtils#getHostname (#5260 )	2022-04-08 14:33:56 +08:00
Sagar Sumit	df87095ef0	[HUDI-3454] Fix partition name in all code paths for LogRecordScanner (#5252 ) * Depend on FSUtils#getRelativePartitionPath(basePath, logFilePath.getParent) to get the partition. * If the list of log file paths in the split is empty, then fallback to usual behaviour.	2022-04-08 09:59:36 +05:30
Sivabalan Narayanan	ef06e4a526	[HUDI-3810] Fixing lazy read for metadata log record readers (#5241 )	2022-04-07 18:40:51 -04:00
Y Ethan Guo	9d744bb35c	[HUDI-3805] Delete existing corrupted requested rollback plan during rollback (#5245 )	2022-04-07 15:32:34 +05:30
Alexey Kudinkin	9e87d164b3	[HUDI-3760] Adding capability to fetch Metadata Records by prefix (#5208 ) - Adding capability to fetch Metadata Records by key prefix so that Data Skipping could fetch only Column Stats Index records pertaining to the columns being queried by, instead of reading out the whole Index. - Fixed usages of HFileScanner in HFileReader. A few code paths use the cached scanner if available. Other code paths use their own HFileScanner with positional read. Brief change log - Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading through Spark DS - Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader - Wiring key-prefix lookup through LogRecordScanner impls - Cleaning up HoodieHFileReader impl Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-06 12:11:08 -04:00
Raymond Xu	e96f08f355	Moving to 0.12.0-SNAPSHOT on master branch.	2022-04-06 15:24:10 +08:00
Sagar Sumit	898be6174a	[HUDI-3782] Fixing table config when any of the index is disabled (#5222 )	2022-04-05 23:06:52 -04:00
Yann Byron	3195f51562	[HUDI-3748] write and select hudi table when enable hoodie.datasource.write.drop.partition.columns (#5201 )	2022-04-05 01:31:41 -07:00
ForwardXu	3449e86989	[HUDI-3780] improve drop partitions (#5178 )	2022-04-05 11:52:33 +08:00
Prashant Wason	b28f0d6ceb	[HUDI-3290] Different file formats for the partition metadata file. (#5179 ) * [HUDI-3290] Different file formats for the partition metadata file. Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition. Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset. For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting. * Code review comments - Adding a new command to migrate from text to base file formats for meta file. - Reimplementing readFromFS() to first read the text format, then base format - Avoid extra exists() checks in readFromFS() - Added unit tests, enabled parquet format across hoodie-hadoop-mr - Code cleanup, restructuring, naming consistency. * Wiring in all the other Spark code paths to respect this config - Turned on parquet meta format for COW data source tests - Removed the deltastreamer command line to keep it shorter * populate HoodiePartitionMetadata#format after readFromFS() Co-authored-by: Vinoth Chandar <vinoth@apache.org> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-04 08:08:20 -07:00
Y Ethan Guo	c34eb07598	[MINOR] Reuse deleteMetadataTable for disabling metadata table (#5217 )	2022-04-03 16:12:14 +05:30
Alexey Kudinkin	cc3737be50	[HUDI-3664] Fixing Column Stats Index composition (#5181 ) Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-02 17:15:52 -07:00
Sagar Sumit	74eb09be9b	[HUDI-3776] Fix BloomIndex incorrectly using ColStats to lookup records locations (#5213 )	2022-04-02 18:22:57 -04:00
YueZhang	020786a5f9	[HUDI-3451] Delete metadata table when the write client disables MDT (#5186 ) * Add checks for metadata table init to avoid possible out-of-sync * Revise the logic to reuse existing table config * Revise docs and naming Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2022-04-02 16:31:06 +05:30
Y Ethan Guo	b1e7e1f14e	[HUDI-3708] Fix failure with HoodieMetadataRecord due to schema compatibility check (#5204 )	2022-04-01 20:17:02 -07:00
Y Ethan Guo	fb45fc9cb9	[HUDI-3773] Fix parallelism used for metadata table bloom filter index (#5209 )	2022-04-01 20:14:07 -07:00
xiarixiaoyao	444ff496a4	[RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark (#4910 ) * [HUDI-2560] introduce id_based schema to support full schema evolution. * add test for FileBasedInternalSchemaStorageManager and rebase code * add support for change column type and fix some test cases * fix some bugs encountered in the production env and delete useless code * fix test error * rebase code * fixed some nested schema change bugs * [HUDI-2429][Stacked On HUDI-2560]Support full schema evolution for spark * [use dummyInternalSchema instead of null] * add support for spark3.1.x * remove support for spark3.1.x, since some compilation fails * support spark3.1.x * rebase and prepare to solve all comments * address all comments * rebase code * fixed the count() bug: try to get internalSchema by parsing the commit file/history file directly, not using metaclient, which is time-consuming; address some comments * fixed all comments * fix new comments * rebase code, fix UT failure * fixed mistake * rebase code, fixed new comments * rebase code, and prepare to address new comments * address commits * address new comments * fix new issues * control fallback to original write logic	2022-04-01 13:20:24 -07:00
Sivabalan Narayanan	7dfb168003	[HUDI-3763] Fixing hadoop conf class loading for inline reading (#5194 )	2022-04-01 11:27:40 -04:00
Danny Chan	6df14f15a3	[HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC (#4880 )	2022-04-01 20:46:51 +08:00
ForwardXu	98b4e9796e	[HUDI-3406] Rollback incorrectly relying on FS listing instead of Com… (#4957 ) * [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata * fix comments	2022-04-01 10:01:41 +08:00


668 Commits