lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Alexey Kudinkin	f7544e23ac	[HUDI-3204] Fixing partition-values being derived from partition-path instead of source columns (#5364 ) - Scaffolded `Spark24HoodieParquetFileFormat` extending `ParquetFileFormat` and overriding the behavior of adding partition columns to every row - Amended `SparkAdapter`s `createHoodieParquetFileFormat` API to be able to configure whether to append partition values or not - Fallback to append partition values in cases when the source columns are not persisted in data-file - Fixing HoodieBaseRelation incorrectly handling mandatory columns	2022-04-20 19:30:27 +08:00
董可伦	b8e465fdfc	[MINOR] Fix typos in log4j-surefire.properties (#5212 )	2022-04-15 13:33:37 -07:00
Danny Chan	0281725c6b	[MINOR] Inline the partition path logic into the builder (#5310 )	2022-04-13 16:54:39 +05:30
Sagar Sumit	df87095ef0	[HUDI-3454] Fix partition name in all code paths for LogRecordScanner (#5252 ) * Depend on FSUtils#getRelativePartitionPath(basePath, logFilePath.getParent) to get the partition. * If the list of log file paths in the split is empty, then fallback to usual behaviour.	2022-04-08 09:59:36 +05:30
董可伦	6a8396420c	[HUDI-3643] Fix hive count exception when the table is empty and the path depth is less than 3 (#5051 )	2022-04-07 04:21:03 -07:00
Prashant Wason	b28f0d6ceb	[HUDI-3290] Different file formats for the partition metadata file. (#5179 ) * [HUDI-3290] Different file formats for the partition metadata file. Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition. Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset. For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting. * Code review comments - Adding a new command to migrate from text to base file formats for meta file. - Reimplementing readFromFS() to first read the text format, then base format - Avoid extra exists() checks in readFromFS() - Added unit tests, enabled parquet format across hoodie-hadoop-mr - Code cleanup, restructuring, naming consistency. * Wiring in all the other Spark code paths to respect this config - Turned on parquet meta format for COW data source tests - Removed the deltastreamer command line to keep it shorter * populate HoodiePartitionMetadata#format after readFromFS() Co-authored-by: Vinoth Chandar <vinoth@apache.org> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-04 08:08:20 -07:00
Y Ethan Guo	eaa4c4f2e2	[HUDI-1180] Upgrade HBase to 2.4.9 (#5004 ) Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-03-24 19:04:53 -07:00
Sagar Sumit	fe2c3989e3	[HUDI-3689] Fix glob path and hive sync in deltastreamer tests (#5117 ) * Remove glob pattern basePath from the deltastreamer tests. * [HUDI-3689] Fix file scheme config for CI failure in TestHoodieRealTimeRecordReader Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-03-24 15:48:35 +05:30
Alexey Kudinkin	099c2c099a	[HUDI-3457] Refactored Spark DataSource Relations to avoid code duplication (#4877 ) Refactoring Spark DataSource Relations to avoid code duplication. Following Relations were in scope: - BaseFileOnlyViewRelation - MergeOnReadSnapshotRelaation - MergeOnReadIncrementalRelation	2022-03-18 22:32:16 -07:00
冯健	bf191f8d46	[HUDI-3645] Fix NPE caused by multiple threads accessing non-thread-safe HashMap (#5028 ) - Change HashMap in HoodieROTablePathFilter to ConcurrentHashMap	2022-03-17 14:20:28 +05:30
Alexey Kudinkin	5e8ff8d793	[HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index (#4948 )	2022-03-15 10:38:36 -07:00
Aditya Tiwari	051ad0b033	[HUDI-3130] Fixing Hive getSchema for RT tables addressing different partitions having different schemas (#4468 ) * Fixing Hive getSchema for RT tables * Addressing feedback * temp diff * fixing tests after spark datasource read support for metadata table is merged to master * Adding multi-partition schema evolution tests to HoodieRealTimeRecordReader Co-authored-by: Aditya Tiwari <aditya.tiwari@flipkart.com> Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-03-06 07:51:35 +05:30
Raymond Xu	b4362fac45	[HUDI-3348] Add UT to verify HoodieRealtimeFileSplit serde (#4951 )	2022-03-04 11:19:16 +04:00
Raymond Xu	c77b2591d0	[HUDI-2439] Remove SparkBoundedInMemoryExecutor (#4860 )	2022-02-26 08:02:12 -05:00
Sagar Sumit	6a5cfb45b9	[MINOR] Fix table type in input format test (#4912 )	2022-02-25 13:51:53 -05:00
Alexey Kudinkin	85e8a5c4de	[HUDI-1296] Support Metadata Table in Spark Datasource (#4789 ) * Bootstrapping initial support for Metadata Table in Spark Datasource - Consolidated Avro/Row conversion utilities to center around Spark's AvroDeserializer ; removed duplication - Bootstrapped HoodieBaseRelation - Updated HoodieMergeOnReadRDD to be able to handle Metadata Table - Modified MOR relations to be able to read different Base File formats (Parquet, HFile)	2022-02-24 16:23:13 -05:00
Alexey Kudinkin	aaddaf524a	[HUDI-3280] Cleaning up Hive-related hierarchies after refactoring (#4743 )	2022-02-16 15:36:37 -08:00
Sivabalan Narayanan	ba4e732ba7	[HUDI-2987] Update all deprecated calls to new apis in HoodieRecordPayload (#4681 )	2022-02-10 19:19:33 -05:00
Alexey Kudinkin	464027ec37	[HUDI-3239] Convert `BaseHoodieTableFileIndex` to Java (#4669 ) Converting BaseHoodieTableFileIndex to Java, removing Scala as a dependency from "hudi-common"	2022-02-09 18:42:08 -05:00
Alexey Kudinkin	973087f385	[HUDI-3276] Rebased Parquet-based `FileInputFormat` impls to inherit from `MapredParquetInputFormat` (#4667 ) Rebased Parquet-based FileInputFormat impls to inherit from MapredParquetInputFormat, to make sure that Hive is appropriately recognizing those impls and applying corresponding optimizations. - Converted HoodieRealtimeFileInputFormatBase and HoodieFileInputFormatBase into standalone implementations that could be instantiated as standalone objects (which could be used for delegation) - Renamed HoodieFileInputFormatBase > HoodieCopyOnWriteTableInputFormat, HoodieRealtimeFileInputFormatBase > HoodieMergeOnReadTableInputFormat - Scaffolded HoodieParquetFileInputFormatBase for all Parquet impls to inherit from - Rebased Parquet impls onto HoodieParquetFileInputFormatBase	2022-02-08 15:21:45 -05:00
Alexey Kudinkin	3f263b82ce	[HUDI-3206] Unify Hive's MOR implementations to avoid duplication (#4559 ) Unify Hive's MOR implementations to avoid duplication to avoid duplication across implementations for different file-formats (Parquet, HFile, etc) - Extracted HoodieRealtimeFileInputFormatBase (extending COW HoodieFileInputFormatBase base) - Rebased Parquet, HFile implementations onto HoodieRealtimeFileInputFormatBase - Tidying up	2022-02-07 14:06:28 -05:00
Y Ethan Guo	b8601a9f58	[HUDI-2656] Generalize HoodieIndex for flexible record data type (#3893 ) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-02-03 20:24:04 -08:00
Alexey Kudinkin	69dfcda116	[HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s (#4556 )	2022-02-03 14:01:41 -08:00
Alexey Kudinkin	a68e1dc2db	[HUDI-431] Adding support for Parquet in MOR `LogBlock`s (#4333 ) - Adding support for Parquet in MOR tables Log blocks Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>	2022-02-02 14:35:05 -05:00
Manoj Govindassamy	f87c47352a	[HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field (#4449 ) * [HUDI-2763] Metadata table records - support for key deduplication and virtual keys - The backing log format for the metadata table is HFile, a KeyValue type. Since the key field in the metadata record payload is a duplicate of the Key in the Cell, the redundant key field in the record can be emptied to save on the cost. - HoodieHFileWriter and HoodieHFileDataBlock will now serialize records with the key field emptied by default. HFile writer tries to find if the record has metadata payload schema field 'key' and if so it does the key trimming from the record payload. - HoodieHFileReader when reading the serialized records back from disk, it materializes the missing keyFields if any. HFile reader tries to find if the record has metadata payload schema fiels 'key' and if so it does the key materialization in the record payload. - Tests have been added to verify the default virtual keys and key deduplication support for the metadata table records. Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2022-01-26 13:34:04 -05:00
董可伦	56cd8ffae0	[HUDI-2837] Add support for using database name in incremental query (#4083 )	2022-01-22 22:11:27 -08:00
Alexey Kudinkin	4bea758738	[HUDI-3191] Rebasing Hive's FileInputFormat onto `AbstractHoodieTableFileIndex` (#4531 )	2022-01-18 14:54:51 -08:00
Yuwei XIAO	d36533735f	[HUDI-3194] fix MOR snapshot query during compaction (#4540 )	2022-01-17 17:24:24 -05:00
Alexey Kudinkin	75caa7d3d8	[HUDI-3179] Extracted common `AbstractHoodieTableFileIndex` to be shared across engines (#4520 )	2022-01-16 22:46:20 -08:00
Alexey Kudinkin	6cdcd89afa	[HUDI-3094] Unify Hive's InputFormat implementations to avoid duplication (#4417 )	2022-01-11 15:02:13 -08:00
xuzifu666	f0c2912d35	[MINOR] Remove unused methods in HoodieColumnProjectionUtils (#4408 )	2022-01-06 15:36:13 -08:00
Sivabalan Narayanan	a66212d204	[HUDI-2966] Closing LogRecordScanner in compactor (#4478 ) * Closing LogRecordScanner in compactor * Addressing comments	2022-01-05 10:57:18 +08:00
RexAn	f612a20815	[HUDI-2779] Cache BaseDir if HudiTableNotFound Exception thrown (#4014 )	2021-12-09 16:04:11 +05:30
xuzifu666	c9e18d1e7d	[HUDI-2942] add error message log in HoodieCombineHiveInputFormat (#4224 )	2021-12-07 22:05:39 -08:00
xiarixiaoyao	57c4bf8152	[HUDI-2876] for hive/presto hudi should remove the temp file which created by HoodieMergedLogRecordSanner when the query finished. (#4139 )	2021-12-06 21:33:10 +08:00
zhangyue19921010	5616830ae1	Revert "[HUDI-2489]Tuning HoodieROTablePathFilter by caching hoodieTableFileSystemView, aiming to reduce unnecessary list/get requests" Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2021-12-04 08:26:53 +05:30
Sivabalan Narayanan	8340ccb503	[HUDI-2005] Removing direct fs call in HoodieLogFileReader (#3865 )	2021-11-25 18:51:38 -05:00
Danny Chan	a2eb2b0b0a	[HUDI-2480] FileSlice after pending compaction-requested instant-time… (#3703 ) * [HUDI-2480] FileSlice after pending compaction-requested instant-time is ignored by MOR snapshot reader * include file slice after a pending compaction for spark reader Co-authored-by: garyli1019 <yanjia.gary.li@gmail.com>	2021-11-25 22:30:09 +08:00
Jimmy.Zhou	0d1e7ecdab	[MINOR] Fix typo,'multipe' corrected to 'multiple' (#4068 )	2021-11-22 17:20:23 -08:00
xiarixiaoyao	a0dae41409	[HUDI-2758] remove redundant code in the hoodieRealtimeInputFormatUitls.getRealtimeSplits (#3994 )	2021-11-15 11:29:40 +08:00
xiarixiaoyao	a40ac62e0c	[HUDI-2086]redo the logical of mor_incremental_view for hive (#3203 )	2021-11-10 15:41:07 +08:00
Genmao Yu	f41539a9cb	[HUDI-313] bugfix: NPE when select count start from a realtime table with Tez(#3630 ) Co-authored-by: dylonyu <dylonyu@tencent.com>	2021-11-06 12:16:13 -04:00
xiarixiaoyao	5517d292f9	[HUDI-2674] hudi hive reader should not print read values. (#3910 )	2021-11-02 23:10:18 -04:00
Sivabalan Narayanan	69ee790a47	[HUDI-1294] Adding inline read and seek based read(batch get) for hfile log blocks in metadata table (#3762 )	2021-10-29 12:12:44 -04:00
Sivabalan Narayanan	e3fc74668f	[HUDI-2625] Revert "[HUDI-2005] Avoiding direct fs calls in HoodieLogFileReader (#3757 )" (#3863 ) This reverts commit `1bb0532563`.	2021-10-25 21:43:15 -04:00
Sivabalan Narayanan	1bb0532563	[HUDI-2005] Avoiding direct fs calls in HoodieLogFileReader (#3757 )	2021-10-25 01:21:08 -04:00
zhangyue19921010	1e285dc399	[HUDI-2489]Tuning HoodieROTablePathFilter by caching hoodieTableFileSystemView, aiming to reduce unnecessary list/get requests (#3719 ) Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2021-10-22 12:03:58 -04:00
Danny Chan	abf3e3fe71	[HUDI-2548] Flink streaming reader misses the rolling over file handles (#3787 )	2021-10-14 10:36:18 +08:00
Sivabalan Narayanan	5f32162a2f	[HUDI-2285][HUDI-2476] Metadata table synchronous design. Rebased and Squashed from pull/3426 (#3590 ) * [HUDI-2285] Adding Synchronous updates to metadata before completion of commits in data timelime. - This patch adds synchronous updates to metadata table. In other words, every write is first committed to metadata table followed by data table. While reading metadata table, we ignore any delta commits that are present only in metadata table and not in data table timeline. - Compaction of metadata table is fenced by the condition that we trigger compaction only when there are no inflight requests in datatable. This ensures that all base files in metadata table is always in sync with data table(w/o any holes) and only there could be some extra invalid commits among delta log files in metadata table. - Due to this, archival of data table also fences itself up until compacted instant in metadata table. All writes to metadata table happens within the datatable lock. So, metadata table works in one writer mode only. This might be tough to loosen since all writers write to same FILES partition and so, will result in a conflict anyways. - As part of this, have added acquiring locks in data table for those operations which were not before while committing (rollback, clean, compaction, cluster). To note, we were not doing any conflict resolution. All we are doing here is to commit by taking a lock. So that all writes to metadata table is always a single writer. - Also added building block to add buckets for partitions, which will be leveraged by other indexes like record level index, etc. For now, FILES partition has only one bucket. In general, any number of buckets per partition is allowed and each partition has a fixed fileId prefix with incremental suffix for each bucket within each partition. Have fixed [HUDI-2476]. This fix is about retrying a failed compaction if it succeeded in metadata for first time, but failed w/ data table. - Enabling metadata table by default. - Adding more tests for metadata table Co-authored-by: Prashant Wason <pwason@uber.com>	2021-10-06 00:17:52 -04:00
Jimmy.Zhou	55df8f61e1	[MINOR] Fix typo."funcitons" corrected to "functions" (#3681 )	2021-09-21 20:30:13 -04:00

163 Commits