* [HUDI-3290] Different file formats for the partition metadata file.
Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non-Parquet/ORC files are found in a partition.
Added a new table config 'hoodie.partition.metafile.use.data.format' which, when enabled (default false for backward compatibility), ensures that partition metafiles are saved in the same format as the base files of a dataset.
For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting.
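A minimal sketch of wiring the new config at table-creation time; only the config key and the Deltastreamer flag come from this change, the surrounding property plumbing is illustrative:

```java
// Illustrative only: enabling data-format partition metafiles for a new table.
import java.util.Properties;

public class PartitionMetafileFormatExample {
  public static void main(String[] args) {
    Properties tableProps = new Properties();
    // Defaults to false so existing tables keep the text (.properties) metafiles.
    tableProps.setProperty("hoodie.partition.metafile.use.data.format", "true");
    // These properties would be supplied when the table is created, e.g. through
    // hudi-cli or the Deltastreamer flag mentioned above.
    System.out.println(tableProps);
  }
}
```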
* Code review comments
- Added a new command to migrate the partition metafile from the text format to the base file format.
- Reimplemented readFromFS() to first read the text format, then fall back to the base format (see the sketch after this list).
- Avoid extra exists() checks in readFromFS()
- Added unit tests, enabled parquet format across hoodie-hadoop-mr
- Code cleanup, restructuring, naming consistency.
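A rough sketch of that fallback read order, assuming the standard `.hoodie_partition_metadata` file name; the method shape and the base-format branch are simplified placeholders:

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Optional;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartitionMetadataReadSketch {
  // Try the legacy text (properties) metafile first, then fall back to the
  // base-file-format copy, without issuing a separate exists() call.
  static Optional<Properties> readFromFS(FileSystem fs, Path partitionPath) throws IOException {
    Path textMetafile = new Path(partitionPath, ".hoodie_partition_metadata");
    try (InputStream in = fs.open(textMetafile)) {
      Properties props = new Properties();
      props.load(in);
      return Optional.of(props);
    } catch (FileNotFoundException fnf) {
      // Base-format branch: in the real code this would go through the reader for
      // the table's base file format (Parquet/ORC) instead of returning empty.
      return Optional.empty();
    }
  }
}
```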
* Wiring in all the other Spark code paths to respect this config
- Turned on parquet meta format for COW data source tests
- Removed the deltastreamer command-line option to keep the interface shorter
* populate HoodiePartitionMetadata#format after readFromFS()
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
* Remove glob pattern basePath from the deltastreamer tests.
* [HUDI-3689] Fix file scheme config
for CI failure in TestHoodieRealTimeRecordReader
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
Unify Hive's MOR implementations to avoid duplication across implementations for different file formats (Parquet, HFile, etc.)
- Extracted HoodieRealtimeFileInputFormatBase, extending the COW HoodieFileInputFormatBase (see the hierarchy sketch below)
- Rebased Parquet, HFile implementations onto HoodieRealtimeFileInputFormatBase
- Tidying up
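A rough sketch of the resulting hierarchy; the class names follow the description above, but the bodies are placeholders, not the actual implementations:

```java
// Shared scanning/filtering logic for COW input formats.
abstract class HoodieFileInputFormatBaseSketch { }

// New shared base for realtime (MOR) input formats, extending the COW base.
abstract class HoodieRealtimeFileInputFormatBaseSketch extends HoodieFileInputFormatBaseSketch { }

// Format-specific realtime implementations rebased onto the shared realtime base.
class ParquetRealtimeInputFormatSketch extends HoodieRealtimeFileInputFormatBaseSketch { }
class HFileRealtimeInputFormatSketch extends HoodieRealtimeFileInputFormatBaseSketch { }
```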
* [HUDI-2763] Metadata table records - support for key deduplication and virtual keys
- The backing log format for the metadata table is HFile, a key-value format.
Since the key field in the metadata record payload duplicates the key in the
HFile Cell, the redundant key field in the record can be emptied to save on
storage cost.
- HoodieHFileWriter and HoodieHFileDataBlock now serialize records with the key
field emptied by default. The HFile writer checks whether the record has the
metadata payload schema field 'key' and, if so, trims the key from the record
payload (see the sketch below).
- HoodieHFileReader, when reading the serialized records back from disk,
materializes the missing key fields if any. The HFile reader checks whether
the record has the metadata payload schema field 'key' and, if so,
materializes the key in the record payload.
- Tests have been added to verify the default virtual keys and key
deduplication support for the metadata table records.
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
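A hedged illustration of the key trimming/materialization above; the schema, field name, and method names are simplified stand-ins for the real writer/reader code:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class MetadataKeyDedupSketch {
  private static final String KEY_FIELD = "key"; // metadata payload key field

  // Writer side: drop the redundant key field from the payload before serialization,
  // since the HFile cell key already stores it.
  static GenericRecord trimKey(GenericRecord record) {
    if (record.getSchema().getField(KEY_FIELD) != null) {
      record.put(KEY_FIELD, "");
    }
    return record;
  }

  // Reader side: materialize the key back into the payload from the HFile cell key.
  static GenericRecord materializeKey(GenericRecord record, String cellKey) {
    if (record.getSchema().getField(KEY_FIELD) != null) {
      record.put(KEY_FIELD, cellKey);
    }
    return record;
  }

  public static void main(String[] args) {
    Schema schema = SchemaBuilder.record("MetadataPayload").fields()
        .requiredString(KEY_FIELD)
        .requiredString("type")
        .endRecord();
    GenericRecord rec = new GenericData.Record(schema);
    rec.put(KEY_FIELD, "files");
    rec.put("type", "1");
    trimKey(rec);                 // stored value omits the key
    materializeKey(rec, "files"); // read path restores it from the cell key
    System.out.println(rec);
  }
}
```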
[global-hive-sync-tool] Add a global hive sync tool to sync a Hudi table across clusters. Add a way to roll back the replicated timestamp if we fail to sync or only partially sync.
Co-authored-by: Jagmeet Bali <jsbali@uber.com>
* [HUDI-1789] HoodieParquetInputFormat currently defaults to the latest version of base files.
This PR attempts to add a new jobConf
`hoodie.%s.consume.snapshot.time`
This new config will allow us to read older snapshots.
- Reusing hoodie.%s.consume.commit for point-in-time snapshot queries as well.
- Adding javadocs and some more tests
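A small example of setting the new jobConf for a point-in-time read; the property templates come from this change, while the table name and instant time are placeholders:

```java
import org.apache.hadoop.mapred.JobConf;

public class SnapshotTimeJobConfExample {
  public static void main(String[] args) {
    String tableName = "my_table"; // assumed table name substituted into %s
    JobConf jobConf = new JobConf();
    // Read the table as of an older snapshot instead of the latest base files.
    jobConf.set(String.format("hoodie.%s.consume.snapshot.time", tableName),
        "20220314091234"); // example Hudi instant time
    // hoodie.<table>.consume.commit is reused for point-in-time snapshot queries too.
    System.out.println(jobConf.get("hoodie.my_table.consume.snapshot.time"));
  }
}
```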
* [HUDI-1434] fix incorrect log file path in HoodieWriteStat
* HoodieWriteHandle#close() returns a list of WriteStatus objects
* Handle rolled-over log files and return a WriteStatus per log file written
- Combined data and delete block logging into a single call
- Lazily initialize and manage write status based on the returned AppendResult (see the sketch below)
- Use FSUtils.getFileSize() to set final file size, consistent with other handles
- Added tests around returned values in AppendResult
- Added validation of the file sizes returned in write stat
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
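A simplified sketch of the per-log-file bookkeeping described above; AppendResult and WriteStatus here are minimal stand-ins, not Hudi's actual classes:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AppendHandleStatusSketch {
  static class AppendResult { String logFilePath; long bytesWritten; }
  static class WriteStatus { String filePath; long totalBytes; }

  // One status per distinct log file; a rollover to a new log file lazily creates
  // a new WriteStatus instead of folding everything into a single status.
  private final Map<String, WriteStatus> statuses = new LinkedHashMap<>();

  void onAppend(AppendResult result) {
    WriteStatus status = statuses.computeIfAbsent(result.logFilePath, path -> {
      WriteStatus s = new WriteStatus();
      s.filePath = path;
      return s;
    });
    status.totalBytes += result.bytesWritten;
  }

  // close() now returns all accumulated statuses, not just one.
  // In the real handle the final file size is set via FSUtils.getFileSize().
  List<WriteStatus> close() {
    return new ArrayList<>(statuses.values());
  }
}
```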
* Incremental Query should work even when there are partitions that have no incremental changes
Co-authored-by: Sivabalan Narayanan <sivabala@uber.com>
Remove APIs in `HoodieTestUtils`
- `createCommitFiles`
- `createDataFile`
- `createNewLogFile`
- `createCompactionRequest`
Migrated usages in `TestCleaner#testPendingCompactions`.
Also improved some API names in `HoodieTestTable`.
* [HUDI-995] Use HoodieTestTable in more classes
Migrate test data prep logic in
- TestStatsCommand
- TestHoodieROTablePathFilter
Re-implement methods for creating new commit times in HoodieTestUtils and HoodieClientTestHarness
- Move relevant APIs to HoodieTestTable
- Migrate usages
After changing to HoodieTestTable APIs, removed unused deprecated APIs in HoodieTestUtils
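For flavor, a tiny mock of the fluent test-fixture style these migrations move toward; the class and method names here are illustrative, not the exact `HoodieTestTable` API:

```java
import java.util.ArrayList;
import java.util.List;

public class TestTableFixtureSketch {
  private final List<String> events = new ArrayList<>();

  static TestTableFixtureSketch of(String basePath) {
    return new TestTableFixtureSketch();
  }

  TestTableFixtureSketch addCommit(String instantTime) {
    events.add("commit:" + instantTime);
    return this;
  }

  TestTableFixtureSketch withBaseFilesInPartition(String partition, String... fileIds) {
    for (String fileId : fileIds) {
      events.add("baseFile:" + partition + "/" + fileId);
    }
    return this;
  }

  public static void main(String[] args) {
    // Tests declare fixtures fluently instead of hand-rolling files via HoodieTestUtils.
    TestTableFixtureSketch.of("/tmp/table")
        .addCommit("001")
        .withBaseFilesInPartition("2016/03/15", "f1", "f2");
  }
}
```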
* [HUDI-960] Implementation of the HFile base and log file format.
1. Includes HFileWriter and HFileReader
2. Includes HFile-based input formats for both snapshot and realtime queries in Hive
3. Unit tests for the new code
4. Integration test for using the HFile format and querying via Hive (Presto and SparkSQL are not supported)
Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)
HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to lookup records by key or search in a range of keys without having to read the entire data/log format.
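A self-contained sketch of producing the two sides of such a key-value pair, the record key as bytes and an Avro-encoded GenericRecord as the value; the schema and field names are illustrative, and actual HFile writing is omitted:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class HFileKeyValueSketch {
  public static void main(String[] args) throws IOException {
    Schema schema = SchemaBuilder.record("rec").fields()
        .requiredString("_hoodie_record_key")
        .requiredString("value")
        .endRecord();
    GenericRecord record = new GenericData.Record(schema);
    record.put("_hoodie_record_key", "key-001");
    record.put("value", "payload");

    // Key side: the Hoodie record key as bytes.
    byte[] hfileKey = "key-001".getBytes(StandardCharsets.UTF_8);

    // Value side: the Avro-encoded GenericRecord as bytes.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
    encoder.flush();
    byte[] hfileValue = out.toByteArray();

    System.out.println(hfileKey.length + " key bytes, " + hfileValue.length + " value bytes");
  }
}
```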
Limitations:
The HFile storage format has certain limitations when used as a general-purpose data storage format.
1. Does not have an implemented reader for Presto or SparkSQL
2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on the query side due to the lack of column pruning
Other changes:
- Remove databricks/avro from pom
- Fix HoodieClientTestUtils to stop using Scala imports, reflection-based conversion, etc.
- Breaking up limitFileSize() per Parquet and HFile base files
- Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
- Throw UnsupportedException in HFileReader.getRecordKeys()
- Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)
* Fixing checkstyle
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
The purpose of this pull request is to implement the changes required on the Hudi side to get bootstrapped tables integrated with Presto. Testing was done against Presto 0.232, and the following changes were identified to make it work:
The UseRecordReaderFromInputFormat annotation is required on HoodieParquetInputFormat as well, because reading bootstrapped tables needs to happen through the record reader to be able to perform the merge. On the Presto side, this annotation is already handled.
We need to internally maintain VIRTUAL_COLUMN_NAMES because Presto's internal hive version hive-apache-1.2.2 has VirtualColumn as a class, versus the one we depend on in Hudi, which is an enum.
Dependency changes in hudi-presto-bundle to avoid runtime exceptions.
- Consolidate transform functions for tests in Transformations.java
- Consolidate assertion functions for tests in Assertions.java
- Make use of SchemaTestUtil for loading schema from resource
Notable changes:
1. HoodieFileWriter and HoodieFileReader abstractions for the writer/reader side of a base file format (see the sketch after this list)
2. HoodieDataBlock abstraction for creating format-specific data blocks for base file formats (e.g. Parquet has HoodieAvroDataBlock)
3. All hardcoded references to Parquet / Parquet-based classes have been abstracted to call methods which accept a base file format
4. HiveSyncTool accepts the base file format as a CLI parameter
5. HoodieDeltaStreamer accepts the base file format as a CLI parameter
6. HoodieSparkSqlWriter accepts the base file format as a parameter
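A loose sketch of the writer/reader abstraction from items 1-2; the interface shapes are illustrative, not Hudi's exact signatures:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.avro.generic.GenericRecord;

// Writer side of a base file format: each format (Parquet, HFile, ...) supplies its own implementation.
interface HoodieFileWriterSketch<R extends GenericRecord> extends AutoCloseable {
  void writeAvro(String recordKey, R record) throws IOException;
}

// Reader side of a base file format, so no code path hardcodes Parquet classes.
interface HoodieFileReaderSketch<R extends GenericRecord> extends AutoCloseable {
  Iterator<R> getRecordIterator() throws IOException;
  long getTotalRecords();
}
```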