Adding documentation around Hoodie concepts (#71)
@@ -35,6 +35,10 @@ entries:
  url: /concepts.html
  output: web

- title: Implementation
  url: /implementation.html
  output: web

- title: SQL Queries
  url: /sql_queries.html
  output: web
@@ -47,10 +51,6 @@ entries:
  url: /admin_guide.html
  output: web

- title: Performance
  url: /performance.html
  output: web

- title: Comparison
  url: /comparison.html
  output: web
@@ -7,4 +7,9 @@ permalink: comparison.html

Work In Progress

## Kudu

## Hive Transactions

## HBase
115  docs/concepts.md
@@ -3,12 +3,119 @@ title: Concepts
keywords: concepts
sidebar: mydoc_sidebar
permalink: concepts.html
toc: false
summary: "Here we introduce some basic concepts & give a broad technical overview of Hoodie"
---

Hoodie provides the following primitives over datasets on HDFS

* Upsert (how do I change the dataset?)
* Incremental consumption (how do I fetch data that changed?)
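
To make these primitives concrete, here is a minimal sketch of them as an interface. The names below (`HoodieDatasetSketch`, `upsert`, `incrementalPull`, `Record`) are purely illustrative, not Hoodie's actual API:

```java
import java.util.List;

// Illustrative sketch only: it simply names the two primitives, namely
// atomically upserting a batch and pulling records changed since a commit.
interface HoodieDatasetSketch<K, V> {

    // Apply a batch of inserts/updates atomically; returns the commit timestamp.
    String upsert(List<Record<K, V>> batch);

    // Fetch only records written by commits newer than the given commit timestamp.
    List<Record<K, V>> incrementalPull(String sinceCommitTime);

    // Minimal record carrier: a unique record key plus an opaque payload.
    record Record<K, V>(K key, V value) {}
}
```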
In order to achieve this, Hoodie maintains a `timeline` of all activity performed on the dataset, which helps provide `instantaneous` views of the dataset,
while also efficiently supporting retrieval of data in the order of arrival into the dataset.
Such key activities include:

* `COMMITS` - A single commit captures information about an **atomic write** of a batch of records into a dataset.
Commits are identified by a monotonically increasing timestamp, denoting the start of the write operation.
* `CLEANS` - Background activity that gets rid of older versions of files in the dataset that are no longer needed.
* `COMPACTIONS` - Background activity to reconcile differential data structures within Hoodie, e.g: moving updates from row-based log files to columnar formats.
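
As a rough sketch (the types below are hypothetical, not Hoodie internals), the timeline can be modeled as an ordered log of such instants, which queries filter down to completed commits:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative timeline model: each instant records an action type, a
// monotonically increasing timestamp, and whether the action has completed.
record InstantSketch(String action, String timestamp, boolean completed) {}

class TimelineSketch {
    // A consistent view only considers COMMIT instants that have completed;
    // inflight or failed writes never become visible to queries.
    static List<InstantSketch> completedCommits(List<InstantSketch> timeline) {
        return timeline.stream()
                .filter(i -> i.action().equals("COMMIT") && i.completed())
                .collect(Collectors.toList());
    }
}
```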
{% include image.html file="hoodie_timeline.png" alt="hoodie_timeline.png" %}

The example above shows upserts happening between 10:00 and 10:20 on a Hoodie dataset, roughly every 5 mins, leaving commit metadata on the Hoodie timeline, along
with other background cleaning/compactions. One key observation to make is that the commit time indicates the `arrival time` of the data (10:20AM), while the actual data
organization reflects the actual time or `event time` the data was intended for (hourly buckets from 07:00). These are two key concepts when reasoning about tradeoffs between latency and completeness of data.

When there is late-arriving data (data intended for 9:00 arriving >1 hr late at 10:20), we can see the upsert producing new data into even older time buckets/folders.
With the help of the timeline, an incremental query attempting to get all new data that was committed successfully since 10:00 hours is able to very efficiently consume
only the changed files, without, say, scanning all the time buckets > 07:00.
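
A minimal sketch of that incremental query, assuming hypothetical file metadata types and fixed-width commit timestamps that sort lexicographically:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative file metadata: which partition the file lives in, its file id,
// and the commit that produced it.
record FileSliceSketch(String partitionPath, String fileId, String commitTime) {}

class IncrementalPullSketch {
    // Assuming fixed-width commit timestamps, lexicographic comparison matches
    // chronological order, so only files written by commits after the checkpoint
    // (10:00 in the example) are selected, even if they landed in much older
    // event-time partitions.
    static List<FileSliceSketch> changedSince(List<FileSliceSketch> files, String checkpoint) {
        return files.stream()
                .filter(f -> f.commitTime().compareTo(checkpoint) > 0)
                .collect(Collectors.toList());
    }
}
```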

## Storage Types

Hoodie storage types capture how data is indexed & laid out on the filesystem, and how the above primitives and timeline activities are implemented on top of
such organization (i.e. how data is written). This is not to be confused with the notion of Read Optimized & Near Real-time tables, which merely capture how the underlying data is exposed
to the queries (i.e. how data is read).

Hoodie supports (or will soon support) the following storage types.

| Storage Type  | Supported Tables |
|---------------|------------------|
| Copy On Write | Read Optimized   |
| Merge On Read | Read Optimized + Near Real-time |

- Copy On Write : A heavily read-optimized storage type that simply creates new versions of files corresponding to the records that changed.
- Merge On Read : Additionally provides near real-time datasets (in the order of 5 mins) by shifting some of the write cost to the reads, merging incoming and on-disk data on-the-fly.

{% include callout.html content="Hoodie is a young project. merge-on-read is currently underway. Get involved [here](https://github.com/uber/hoodie/projects/1)" type="info" %}

Regardless of the storage type, Hoodie organizes a dataset into a directory structure under a `basepath`,
very similar to Hive tables. The dataset is broken up into partitions, which are folders containing files for that partition.
Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath.

Within each partition, records are distributed into multiple files. Each file is identified by a unique `file id` and the `commit` that
produced the file. In the case of updates, multiple files can share the same file id, written at different commits.

Each record is uniquely identified by a `record key` and mapped to a file id forever. This mapping between record key
and file id never changes once the first version of a record has been written to a file. In short, the
`file id` identifies a group of files that contain all versions of a group of records.
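
Putting this together, a dataset on disk could look roughly as follows (the file naming scheme shown is illustrative, not Hoodie's exact convention):

```
basepath/                          <-- dataset root
├── 2017/01/01/                    <-- partitionpath (e.g. date folders)
│   ├── fileId1_commit1.parquet    <-- first version of file group fileId1
│   ├── fileId1_commit3.parquet    <-- newer version, from a later commit
│   └── fileId2_commit1.parquet
└── 2017/01/02/
    └── fileId3_commit2.parquet
```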

## Copy On Write

As mentioned above, each commit on Copy On Write storage produces new versions of files. In other words, we implicitly compact on every
commit, such that only columnar data exists. As a result, the write amplification (number of bytes written for 1 byte of incoming data)
is much higher, while read amplification is close to zero: for example, updating a single record in a file rewrites that entire file as a new version.
This is a much desired property for a system like Hadoop, which is predominantly read-heavy.

The following illustrates how this works conceptually, with data being written into copy-on-write storage and two queries running on top of it.

{% include image.html file="hoodie_cow.png" alt="hoodie_cow.png" %}

As data gets written, updates to existing file ids produce a new version for that file id, stamped with the commit, while
inserts allocate a new file id and write its first version. These file versions and their commits are color coded above.
Normal SQL queries running against such a dataset (e.g: a select count(*) counting the total records in that partition) first check the timeline for the latest commit
and filter out all but the latest version of each file id. As you can see, an old query does not see the current inflight commit's files colored in pink,
but a new query starting after the commit picks up the new data. Thus queries are immune to any write failures/partial writes and only run on committed data.
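
A sketch of that read path, again with hypothetical types: group file versions by file id and keep only the newest version belonging to a completed commit.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative copy-on-write read path: an inflight commit's files (pink above)
// stay invisible until the commit completes on the timeline.
record FileVersionSketch(String fileId, String commitTime) {}

class SnapshotViewSketch {
    static List<FileVersionSketch> latestCommitted(List<FileVersionSketch> files,
                                                   Set<String> completedCommits) {
        // Drop versions from uncommitted writes, then keep the newest per file id.
        Map<String, Optional<FileVersionSketch>> latest = files.stream()
                .filter(f -> completedCommits.contains(f.commitTime()))
                .collect(Collectors.groupingBy(FileVersionSketch::fileId,
                        Collectors.maxBy(Comparator.comparing(FileVersionSketch::commitTime))));
        return latest.values().stream()
                .flatMap(Optional::stream)
                .collect(Collectors.toList());
    }
}
```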

The intention of copy on write storage is to fundamentally improve how datasets are managed today on Hadoop, through

- First class support for atomically updating data at the file level, instead of rewriting whole tables/partitions
- Ability to incrementally consume changes, as opposed to wasteful scans or fumbling with heuristic approaches
- Tight control of file sizes to keep query performance excellent (small files hurt query performance considerably)

## Merge On Read

Merge on read storage is a superset of copy on write, in the sense that it still provides a read optimized view of the dataset via the Read Optimized table.
Additionally, it stores incoming upserts for each file id onto a `row based append log`, which enables serving near real-time data to queries
by applying the append log onto the latest version of each file id on-the-fly at query time. Thus, this storage type attempts to balance read and write amplification intelligently, to provide near real-time queries.
The most significant change here is to the compactor, which now carefully chooses which append logs need to be compacted onto
their columnar base data, to keep query performance in check (larger append logs incur longer merge times on the query side).
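
A minimal sketch of that query-time merge, with hypothetical types standing in for the base file and append log:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative merge-on-read query path: start from the records in the latest
// base columnar file, then replay the row-based append log on top, so later
// updates to a record key win.
class MergeOnReadSketch {
    record LogEntry(String recordKey, String payload) {}

    static Map<String, String> nearRealTimeView(Map<String, String> baseFileRecords,
                                                List<LogEntry> appendLog) {
        Map<String, String> merged = new HashMap<>(baseFileRecords);
        for (LogEntry entry : appendLog) {
            merged.put(entry.recordKey(), entry.payload()); // log replay: last write wins
        }
        return merged;
    }
}
```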

The following illustrates how the storage works, and shows queries on both the near real-time table and the read optimized table.

{% include image.html file="hoodie_mor.png" alt="hoodie_mor.png" max-width="1000" %}

There are a lot of interesting things happening in this example, which bring out the subtleties in the approach.

- We now have commits every 1 minute or so, something we could not do with the other storage type.
- Within each file id group there is now an append log, which holds incoming updates to records in the base columnar files. In the example, the append logs hold
all the data from 10:05 to 10:10. The base columnar files are still versioned with the commit, as before.
Thus, if one were to simply look at the base files alone, then the storage layout looks exactly like a copy on write table.
- A periodic compaction process reconciles these changes from the append log and produces a new version of the base file, just like what happened at 10:05 in the example.
- There are two ways of querying the same underlying storage: the Read Optimized (RO) table and the Near-Realtime (RT) table, depending on whether we choose query performance or freshness of data.
- The semantics around when data from a commit is available to a query change in a subtle way for the RO table. Note that such a query
running at 10:10 won't see data after 10:05 above, while a query on the RT table always sees the freshest data.
- When we trigger compaction & what it decides to compact hold the key to solving these hard problems. By implementing a compaction
strategy where we aggressively compact the latest partitions compared to older partitions (as sketched below), we could ensure the RO table sees data
published within X minutes in a consistent fashion.
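
One such compaction strategy could look like the sketch below; this is an assumption for illustration, not Hoodie's shipped compactor logic:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative compaction strategy: compact append logs in the newest
// partitions first, so the RO table stays within X minutes of freshness
// where most queries concentrate.
class CompactionStrategySketch {
    record PendingLog(String partitionPath, String fileId, long logSizeBytes) {}

    static List<PendingLog> choose(List<PendingLog> pending, int budget) {
        return pending.stream()
                // assumes partition paths (e.g. yyyy/mm/dd) sort chronologically
                .sorted(Comparator.comparing(PendingLog::partitionPath).reversed())
                .limit(budget)
                .collect(Collectors.toList());
    }
}
```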

{% include callout.html content="Hoodie is a young project. merge-on-read is currently underway. Get involved [here](https://github.com/uber/hoodie/projects/1)" type="info" %}

The intention of merge on read storage is to enable near real-time processing directly on top of Hadoop, as opposed to copying
data out to specialized systems, which may not be able to handle the data volume.
BIN  docs/images/hoodie_cow.png  (new binary file, 30 KiB; not shown)
BIN  docs/images/hoodie_mor.png  (new binary file, 55 KiB; not shown)
BIN  docs/images/hoodie_timeline.png  (new binary file, 23 KiB; not shown)
@@ -2,9 +2,13 @@
title: Performance
keywords: performance
sidebar: mydoc_sidebar
permalink: implementation.html
---

Work In Progress

## Index

## Data Storage

## Performance