1
0
Commit Graph

84 Commits

Author SHA1 Message Date
董可伦
6a8396420c [HUDI-3643] Fix hive count exception when the table is empty and the path depth is less than 3 (#5051) 2022-04-07 04:21:03 -07:00
Prashant Wason
b28f0d6ceb [HUDI-3290] Different file formats for the partition metadata file. (#5179)
* [HUDI-3290] Different file formats for the partition metadata file.

Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition.

Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset.

For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting.

* Code review comments

- Adding a new command to migrate from text to base file formats for meta file.
- Reimplementing readFromFS() to first read the text format, then base format
- Avoid extra exists() checks in readFromFS()
- Added unit tests, enabled parquet format across hoodie-hadoop-mr
- Code cleanup, restructuring, naming consistency.

* Wiring in all the other Spark code paths to respect this config

 - Turned on parquet meta format for COW data source tests
 - Removed the deltastreamer command line to keep it shorter

* populate HoodiePartitionMetadata#format after readFromFS()

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-04 08:08:20 -07:00
Y Ethan Guo
eaa4c4f2e2 [HUDI-1180] Upgrade HBase to 2.4.9 (#5004)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-03-24 19:04:53 -07:00
Sagar Sumit
fe2c3989e3 [HUDI-3689] Fix glob path and hive sync in deltastreamer tests (#5117)
* Remove glob pattern basePath from the deltastreamer tests.

* [HUDI-3689] Fix file scheme config

for CI failure in TestHoodieRealTimeRecordReader

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-03-24 15:48:35 +05:30
Aditya Tiwari
051ad0b033 [HUDI-3130] Fixing Hive getSchema for RT tables addressing different partitions having different schemas (#4468)
* Fixing Hive getSchema for RT tables

* Addressing feedback

* temp diff

* fixing tests after spark datasource read support for metadata table is merged to master

* Adding multi-partition schema evolution tests to HoodieRealTimeRecordReader

Co-authored-by: Aditya Tiwari <aditya.tiwari@flipkart.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-03-06 07:51:35 +05:30
Raymond Xu
b4362fac45 [HUDI-3348] Add UT to verify HoodieRealtimeFileSplit serde (#4951) 2022-03-04 11:19:16 +04:00
Sagar Sumit
6a5cfb45b9 [MINOR] Fix table type in input format test (#4912) 2022-02-25 13:51:53 -05:00
Alexey Kudinkin
aaddaf524a [HUDI-3280] Cleaning up Hive-related hierarchies after refactoring (#4743) 2022-02-16 15:36:37 -08:00
Alexey Kudinkin
3f263b82ce [HUDI-3206] Unify Hive's MOR implementations to avoid duplication (#4559)
Unify Hive's MOR implementations to avoid duplication to avoid duplication across implementations for different file-formats (Parquet, HFile, etc)

- Extracted HoodieRealtimeFileInputFormatBase (extending COW HoodieFileInputFormatBase base)
- Rebased Parquet, HFile implementations onto HoodieRealtimeFileInputFormatBase
- Tidying up
2022-02-07 14:06:28 -05:00
Alexey Kudinkin
69dfcda116 [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR FileInputFormats (#4556) 2022-02-03 14:01:41 -08:00
Alexey Kudinkin
a68e1dc2db [HUDI-431] Adding support for Parquet in MOR LogBlocks (#4333)
- Adding support for Parquet in MOR tables Log blocks

Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2022-02-02 14:35:05 -05:00
Manoj Govindassamy
f87c47352a [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field (#4449)
* [HUDI-2763] Metadata table records - support for key deduplication and virtual keys
- The backing log format for the metadata table is HFile, a KeyValue type.
Since the key field in the metadata record payload is a duplicate of the
Key in the Cell, the redundant key field in the record can be emptied
to save on the cost.

- HoodieHFileWriter and HoodieHFileDataBlock will now serialize records
with the key field emptied by default. HFile writer tries to find if
the record has metadata payload schema field 'key' and if so it does
the key trimming from the record payload.

- HoodieHFileReader when reading the serialized records back from disk,
it materializes the missing keyFields if any. HFile reader tries to
find if the record has metadata payload schema fiels 'key' and if so
it does the key materialization in the record payload.

- Tests have been added to verify the default virtual keys and key
   deduplication support for the metadata table records.

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2022-01-26 13:34:04 -05:00
董可伦
56cd8ffae0 [HUDI-2837] Add support for using database name in incremental query (#4083) 2022-01-22 22:11:27 -08:00
Alexey Kudinkin
4bea758738 [HUDI-3191] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex (#4531) 2022-01-18 14:54:51 -08:00
RexAn
f612a20815 [HUDI-2779] Cache BaseDir if HudiTableNotFound Exception thrown (#4014) 2021-12-09 16:04:11 +05:30
xiarixiaoyao
57c4bf8152 [HUDI-2876] for hive/presto hudi should remove the temp file which created by HoodieMergedLogRecordSanner when the query finished. (#4139) 2021-12-06 21:33:10 +08:00
Sivabalan Narayanan
8340ccb503 [HUDI-2005] Removing direct fs call in HoodieLogFileReader (#3865) 2021-11-25 18:51:38 -05:00
xiarixiaoyao
a40ac62e0c [HUDI-2086]redo the logical of mor_incremental_view for hive (#3203) 2021-11-10 15:41:07 +08:00
Sivabalan Narayanan
69ee790a47 [HUDI-1294] Adding inline read and seek based read(batch get) for hfile log blocks in metadata table (#3762) 2021-10-29 12:12:44 -04:00
Sivabalan Narayanan
e3fc74668f [HUDI-2625] Revert "[HUDI-2005] Avoiding direct fs calls in HoodieLogFileReader (#3757)" (#3863)
This reverts commit 1bb0532563.
2021-10-25 21:43:15 -04:00
Sivabalan Narayanan
1bb0532563 [HUDI-2005] Avoiding direct fs calls in HoodieLogFileReader (#3757) 2021-10-25 01:21:08 -04:00
Udit Mehrotra
c350d05dd3 Restore 0.8.0 config keys with deprecated annotation (#3506)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-08-19 13:36:40 -07:00
Sivabalan Narayanan
fe508376fa [HUDI-2177][HUDI-2200] Adding virtual keys support for MOR table (#3315) 2021-08-02 09:45:09 -04:00
rmahindra123
8fef50e237 [HUDI-2044] Integrate consumers with rocksDB and compression within External Spillable Map (#3318) 2021-07-28 01:31:03 -04:00
Danny Chan
ac75bda929 [HUDI-1969] Support reading logs for MOR Hive rt table (#3033) 2021-07-13 23:43:30 -07:00
wenningd
d412fb2fe6 [HUDI-89] Add configOption & refactor all configs based on that (#2833)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2021-06-30 14:26:30 -07:00
s-sanjay
0fb8556b0d Add ability to provide multi-region (global) data consistency across HMS in different regions (#2542)
[global-hive-sync-tool] Add a global hive sync tool to sync hudi table across clusters. Add a way to rollback the replicated time stamp if we fail to sync or if we partly sync

Co-authored-by: Jagmeet Bali <jsbali@uber.com>
2021-06-24 20:26:26 -07:00
xiarixiaoyao
081061e14b [HUDI-1719] hive on spark/mr,Incremental query of the mor table, the partition field is incorrect (#2720) 2021-05-20 11:00:08 -04:00
xiarixiaoyao
6f7ff7e8ca [HUDI-1722]Fix hive beeline/spark-sql query specified field on mor table occur NPE (#2722) 2021-05-12 20:52:37 +08:00
TeRS-K
be9db2c4f5 [HUDI-1055] Remove hardcoded parquet in tests (#2740)
* Remove hardcoded parquet in tests
* Use DataFileUtils.getInstance
* Renaming DataFileUtils to BaseFileUtils

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-05-11 10:01:45 -07:00
jsbali
aa398f77f1 [HUDI-1789] Support reading older snapshots (#2809)
* [HUDI-1789] In HoodieParquetInoutFormat we currently default to the latest version of base files.
This PR attempts to add a new jobConf
 `hoodie.%s.consume.snapshot.time`

This new config will allow us to read older snapshots.

-  Reusing hoodie.%s.consume.commit for point in time snapshot queries as well.
-  Adding javadocs and some more tests
2021-05-10 15:26:49 -07:00
xiarixiaoyao
1db904a12e [HUDI-1718] When query incr view of mor table which has Multi level partitions, the query failed (#2716) 2021-05-05 00:34:20 -04:00
Raymond Xu
faf3785a2d [HUDI-1811] Fix TestHoodieRealtimeRecordReader (#2873)
Pass basePath with scheme 'file://' to HoodieRealtimeFileSplit
2021-04-30 11:16:55 -07:00
xiarixiaoyao
929eca43fe [HUDI-1817] Fix getting incorrect partition path while using incr query by spark-sql (#2858) 2021-04-30 14:57:52 +08:00
satishkotha
7cc75e0be2 [HUDI-1646] Provide mechanism to read uncommitted data through InputFormat (#2611) 2021-03-04 17:43:31 -08:00
n3nash
ffcfb58bac [HUDI-1486] Remove inline inflight rollback in hoodie writer (#2359)
1. Refactor rollback and move cleaning failed commits logic into cleaner
2. Introduce hoodie heartbeat to ascertain failed commits
3. Fix test cases
2021-02-19 20:12:22 -08:00
Gary Li
605b617cfa [HUDI-1434] fix incorrect log file path in HoodieWriteStat (#2300)
* [HUDI-1434] fix incorrect log file path in HoodieWriteStat

* HoodieWriteHandle#close() returns a list of WriteStatus objs

* Handle rolled-over log files and return a WriteStatus per log file written

 - Combined data and delete block logging into a single call
 - Lazily initialize and manage write status based on returned AppendResult
 - Use FSUtils.getFileSize() to set final file size, consistent with other handles
 - Added tests around returned values in AppendResult
 - Added validation of the file sizes returned in write stat

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-12-30 14:22:15 -08:00
Balaji Varadarajan
3ec9270e8e [HUDI-1490] Incremental Query should work even when there are partitions that have no incremental changes (#2371)
* Incremental Query should work even when there are  partitions that have no incremental changes

Co-authored-by: Sivabalan Narayanan <sivabala@uber.com>
2020-12-26 12:17:49 -05:00
Raymond Xu
c5e10d668f [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2167)
Remove APIs in `HoodieTestUtils`
- `createCommitFiles`
- `createDataFile`
- `createNewLogFile`
- `createCompactionRequest`

Migrated usages in `TestCleaner#testPendingCompactions`.

Also improved some API names in `HoodieTestTable`.
2020-10-12 14:39:10 +08:00
Raymond Xu
3201665295 [HUDI-995] Use HoodieTestTable in more classes (#2079)
* [HUDI-995] Use HoodieTestTable in more classes

Migrate test data prep logic in
- TestStatsCommand
- TestHoodieROTablePathFilter

Re-implement methods for create new commit times in HoodieTestUtils and HoodieClientTestHarness
- Move relevant APIs to HoodieTestTable
- Migrate usages

After changing to HoodieTestTable APIs, removed unused deprecated APIs in HoodieTestUtils
2020-09-17 09:29:07 -07:00
Prashant Wason
6461927eac [HUDI-960] Implementation of the HFile base and log file format. (#1804)
* [HUDI-960] Implementation of the HFile base and log file format.

1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat for both snapshot and realtime input format for Hive
3. Unit test for new code
4. IT for using HFile format and querying using Hive (Presto and SparkSQL are not supported)

Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)

HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to lookup records by key or search in a range of keys without having to read the entire data/log format.

Limitations:
HFile storage format has certain limitations when used as a general purpose data storage format.
1. Does not have a implemented reader for Presto and SparkSQL
2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on query side due to lack of column pruning


Other changes: 
 - Remove databricks/avro from pom
 - Fix HoodieClientTestUtils from not using scala imports/reflection based conversion etc
 - Breaking up limitFileSize(), per parquet and hfile base files
 - Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
 - Throw UnsupportedException in HFileReader.getRecordKeys()
 - Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)

* Fixing checkstyle

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-31 08:05:59 -07:00
Udit Mehrotra
8d04268264 [HUDI-1174] Changes for bootstrapped tables to work with presto (#1944)
The purpose of this pull request is to implement changes required on Hudi side to get Bootstrapped tables integrated with Presto. The testing was done against presto 0.232 and following changes were identified to make it work:

Annotation UseRecordReaderFromInputFormat is required on HoodieParquetInputFormat as well, because the reading for bootstrapped tables needs to happen through record reader to be able to perform the merge. On presto side, this annotation is already handled.

We need to internally maintain VIRTUAL_COLUMN_NAMES because presto's internal hive version hive-apache-1.2.2 has VirutalColumn as a class, versus the one we depend on in hudi which is an enum. 

Dependency changes in hudi-presto-bundle to avoid runtime exceptions.
2020-08-12 17:51:31 -07:00
Balaji Varadarajan
9bcd3221fd [HUDI-1144] Speedup spark read queries by caching metaclient in HoodieROPathFilter (#1919) 2020-08-05 09:19:10 -07:00
Raymond Xu
10e4268792 [HUDI-995] Use Transformations, Assertions and SchemaTestUtil (#1884)
- Consolidate transform functions for tests in Transformations.java
- Consolidate assertion functions for tests in Assertions.java
- Make use of SchemaTestUtil for loading schema from resource
2020-08-01 20:57:18 +08:00
Prashant Wason
2603cfb33e [HUDI-684] Introduced abstraction for writing and reading different types of base file formats. (#1687)
Notable changes:
    1. HoodieFileWriter and HoodieFileReader abstractions for writer/reader side of a base file format
    2. HoodieDataBlock abstraction for creation specific data blocks for base file formats. (e.g. Parquet has HoodieAvroDataBlock)
    3. All hardocded references to Parquet / Parquet based classes have been abstracted to call methods which accept a base file format
    4. HiveSyncTool accepts the base file format as a CLI parameter
    5. HoodieDeltaStreamer accepts the base file format as a CLI parameter
    6. HoodieSparkSqlWriter accepts the base file format as a parameter
2020-06-25 23:46:55 -07:00
Satish Kotha
a7fd331624 Add unit test for snapshot reads in hadoop-mr 2020-06-13 10:23:05 -07:00
Gary Li
37838cea60 [HUDI-822] decouple Hudi related logics from HoodieInputFormat (#1592)
- Refactoring business logic out of InputFormat into Utils helpers.
2020-06-09 06:10:16 -07:00
Shen Hong
2901f5423a [HUDI-1002] Ignore case when setting incremental mode in hive query (#1715) 2020-06-08 19:38:32 +08:00
Raymond Xu
03f136361a [HUDI-811] Restructure test packages in hudi-common (#1644)
* [HUDI-811] Restructure test packages in hudi-common
2020-05-27 16:28:17 +08:00
Raymond Xu
0d4848b68b [HUDI-811] Restructure test packages (#1607)
* restructure hudi-spark tests
* restructure hudi-timeline-service tests
* restructure hudi-hadoop-mr hudi-utilities tests
* restructure hudi-hive-sync tests
2020-05-13 15:37:03 -07:00