Files

Prashant Wason 6461927eac [HUDI-960] Implementation of the HFile base and log file format. (#1804)

* [HUDI-960] Implementation of the HFile base and log file format.

1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat for both snapshot and realtime input format for Hive
3. Unit test for new code
4. Integration test (IT) for using HFile format and querying using Hive (Presto and SparkSQL are not supported)

Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)

HFile allows efficient lookup of a record by key or by range of keys. Hence, this base file format is well suited to applications like RFC-15 and RFC-08, which will benefit from the ability to look up records by key or search in a range of keys without having to read the entire data/log file.
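
The lookup pattern described above can be illustrated with a minimal sketch. It is not Hudi's HFileReader or HFileWriter: the class `SortedKeyValueSketch` and its methods are hypothetical, and an in-memory `TreeMap` stands in for the sorted on-disk layout. Keys are plain strings here for readability, whereas the commit stores the record key as bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a sorted key-value file.
// It only illustrates point lookups and range scans over sorted keys.
final class SortedKeyValueSketch {
    // Keys are kept in sorted order, which is what makes both access paths cheap.
    private final TreeMap<String, byte[]> entries = new TreeMap<>();

    // Store an encoded record (e.g. Avro bytes) under its record key.
    void put(String recordKey, byte[] encodedRecord) {
        entries.put(recordKey, encodedRecord);
    }

    // Point lookup: returns the payload for the key, or null if the key is absent.
    byte[] get(String recordKey) {
        return entries.get(recordKey);
    }

    // Range scan: returns all keys in [startKey, endKey) in sorted order.
    List<String> keysInRange(String startKey, String endKey) {
        return new ArrayList<>(entries.subMap(startKey, true, endKey, false).keySet());
    }

    public static void main(String[] args) {
        SortedKeyValueSketch store = new SortedKeyValueSketch();
        store.put("key-003", "c".getBytes(StandardCharsets.UTF_8));
        store.put("key-001", "a".getBytes(StandardCharsets.UTF_8));
        store.put("key-002", "b".getBytes(StandardCharsets.UTF_8));

        // Neither call needs to visit every entry in the store.
        System.out.println(new String(store.get("key-002"), StandardCharsets.UTF_8));
        System.out.println(store.keysInRange("key-001", "key-003"));
    }
}
```

A columnar format would typically need to scan or filter row groups to answer the same queries, which is the trade-off the limitations below describe.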

Limitations:
The HFile storage format has certain limitations when used as a general-purpose data storage format.
1. No reader is implemented for Presto and SparkSQL
2. It is not a columnar file format and hence may lead to lower compression levels and greater IO on the query side due to lack of column pruning

Other changes:
- Remove databricks/avro from pom
- Fix HoodieClientTestUtils to not use scala imports/reflection based conversion etc
- Break up limitFileSize() per parquet and hfile base files
- Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
- Throw UnsupportedException in HFileReader.getRecordKeys()
- Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)

* Fixing checkstyle

Co-authored-by: Vinoth Chandar <vinoth@apache.org>

2020-08-31 08:05:59 -07:00

hudi-hadoop-mr-bundle

Moving to 0.6.1-SNAPSHOT on master branch.

2020-08-14 12:54:15 -07:00

hudi-hive-sync-bundle

Moving to 0.6.1-SNAPSHOT on master branch.

2020-08-14 12:54:15 -07:00

hudi-integ-test-bundle

Moving to 0.6.1-SNAPSHOT on master branch.

2020-08-14 12:54:15 -07:00

hudi-presto-bundle

Moving to 0.6.1-SNAPSHOT on master branch.

2020-08-14 12:54:15 -07:00

hudi-spark-bundle

[HUDI-960] Implementation of the HFile base and log file format. (#1804)

2020-08-31 08:05:59 -07:00

hudi-timeline-server-bundle

Moving to 0.6.1-SNAPSHOT on master branch.

2020-08-14 12:54:15 -07:00

hudi-utilities-bundle

[HUDI-960] Implementation of the HFile base and log file format. (#1804)

2020-08-31 08:05:59 -07:00

README.md

HUDI-121 : Address comments during RC2 voting

2019-09-30 15:42:15 -07:00

README.md

Overview

This folder contains several modules that build bundles (i.e., fat/uber jars) that enable Hudi integration into various systems.

Here are the key principles applied in designing these bundles:

- As much as possible, try to make the bundle work with the target system's jars and classes (e.g., it is better to make Hudi work with Hive's parquet version than to bundle parquet with Hudi). This lets us evolve Hudi as a lighter-weight component and also provides flexibility for changing these jar versions in target systems.
- A bundle's pom only needs to depend on the required Hudi modules and any other modules that are declared "provided" in parent poms (e.g., parquet-avro).
- Such other modules should be declared as "compile" dependencies in the bundle pom so that the shade plugin actually pulls them into the bundle. By default, provided-scope dependencies are not included.
- Any other runtime dependencies needed by the bundle should be specified in the <include> whitelist. New bundles should also follow the same style of explicitly whitelisting modules and shading as needed.
- Leave abundant comments on why something is being included, shaded, or even left out.

Please follow these when adding new ones or making changes.
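
As a rough illustration of the principles above, a bundle pom fragment might look like the sketch below. It is not copied from any actual Hudi bundle: the artifact choices and the `com.example.dep` relocation pattern are placeholders, and only standard maven-shade-plugin elements are used.

```xml
<!-- Hypothetical bundle pom fragment; artifact names and patterns are placeholders. -->
<dependencies>
  <!-- Declared "provided" in the parent pom, so re-declared as "compile" here
       to let the shade plugin pull it into the bundle. -->
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <scope>compile</scope>
  </dependency>
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <artifactSet>
              <includes>
                <!-- Explicit whitelist: only the artifacts listed here end up in the bundle. -->
                <include>org.apache.hudi:hudi-common</include>
                <include>org.apache.parquet:parquet-avro</include>
              </includes>
            </artifactSet>
            <relocations>
              <!-- Shaded to avoid clashing with the target system's own copy (placeholder pattern). -->
              <relocation>
                <pattern>com.example.dep.</pattern>
                <shadedPattern>org.apache.hudi.com.example.dep.</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

The comments next to each include and relocation are the kind of explanation the last principle asks for.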

Resources

Classes needed for Hive2 JDBC documented here