Adding documentation around Hoodie concepts (#71)
@@ -35,6 +35,10 @@ entries:
  url: /concepts.html
  output: web

- title: Implementation
  url: /implementation.html
  output: web

- title: SQL Queries
  url: /sql_queries.html
  output: web
@@ -47,10 +51,6 @@ entries:
  url: /admin_guide.html
  output: web

- title: Performance
  url: /performance.html
  output: web

- title: Comparison
  url: /comparison.html
  output: web
@@ -7,4 +7,9 @@ permalink: comparison.html

Work In Progress

## Kudu

## Hive Transactions

## HBase
115  docs/concepts.md
@@ -3,12 +3,119 @@ title: Concepts
keywords: concepts
sidebar: mydoc_sidebar
permalink: concepts.html
toc: false
summary: "Here we introduce some basic concepts & give a broad technical overview of Hoodie"
---

Hoodie provides the following primitives over datasets on HDFS

* Upsert (how do I change the dataset?)
* Incremental consumption (how do I fetch data that changed?)
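
To make these primitives concrete, here is a minimal sketch of them as an interface. The names below (`HoodieDatasetSketch`, `upsert`, `incrementalPull`, `Record`) are purely illustrative, not Hoodie's actual API:

```java
import java.util.List;

// Illustrative sketch only: it simply names the two primitives, namely
// atomically upserting a batch and pulling records changed since a commit.
interface HoodieDatasetSketch<K, V> {

    // Apply a batch of inserts/updates atomically; returns the commit timestamp.
    String upsert(List<Record<K, V>> batch);

    // Fetch only records written by commits newer than the given commit timestamp.
    List<Record<K, V>> incrementalPull(String sinceCommitTime);

    // Minimal record carrier: a unique record key plus an opaque payload.
    record Record<K, V>(K key, V value) {}
}
```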
In order to achieve this, Hoodie maintains a `timeline` of all activity performed on the dataset, which helps provide `instantaneous` views of the dataset,
while also efficiently supporting retrieval of data in the order of arrival into the dataset.
Such key activities include:

* `COMMITS` - A single commit captures information about an **atomic write** of a batch of records into a dataset.
Commits are identified by a monotonically increasing timestamp, denoting the start of the write operation.
* `CLEANS` - Background activity that gets rid of older versions of files in the dataset that are no longer needed.
* `COMPACTIONS` - Background activity to reconcile differential data structures within Hoodie, e.g: moving updates from row-based log files to columnar formats.
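
As a rough sketch (the types below are hypothetical, not Hoodie internals), the timeline can be modeled as an ordered log of such instants, which queries filter down to completed commits:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative timeline model: each instant records an action type, a
// monotonically increasing timestamp, and whether the action has completed.
record InstantSketch(String action, String timestamp, boolean completed) {}

class TimelineSketch {
    // A consistent view only considers COMMIT instants that have completed;
    // inflight or failed writes never become visible to queries.
    static List<InstantSketch> completedCommits(List<InstantSketch> timeline) {
        return timeline.stream()
                .filter(i -> i.action().equals("COMMIT") && i.completed())
                .collect(Collectors.toList());
    }
}
```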
{% include image.html file="hoodie_timeline.png" alt="hoodie_timeline.png" %}

The example above shows upserts happening between 10:00 and 10:20 on a Hoodie dataset, roughly every 5 mins, leaving commit metadata on the Hoodie timeline, along
with other background cleaning/compactions. One key observation to make is that the commit time indicates the `arrival time` of the data (10:20AM), while the actual data
organization reflects the actual time or `event time` the data was intended for (hourly buckets from 07:00). These are two key concepts when reasoning about tradeoffs between latency and completeness of data.

When there is late-arriving data (data intended for 9:00 arriving >1 hr late at 10:20), we can see the upsert producing new data into even older time buckets/folders.
With the help of the timeline, an incremental query attempting to get all new data that was committed successfully since 10:00 hours is able to very efficiently consume
only the changed files, without, say, scanning all the time buckets > 07:00.
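
A minimal sketch of that incremental query, assuming hypothetical file metadata types and fixed-width commit timestamps that sort lexicographically:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative file metadata: which partition the file lives in, its file id,
// and the commit that produced it.
record FileSliceSketch(String partitionPath, String fileId, String commitTime) {}

class IncrementalPullSketch {
    // Assuming fixed-width commit timestamps, lexicographic comparison matches
    // chronological order, so only files written by commits after the checkpoint
    // (10:00 in the example) are selected, even if they landed in much older
    // event-time partitions.
    static List<FileSliceSketch> changedSince(List<FileSliceSketch> files, String checkpoint) {
        return files.stream()
                .filter(f -> f.commitTime().compareTo(checkpoint) > 0)
                .collect(Collectors.toList());
    }
}
```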

## Storage Types

Hoodie storage types capture how data is indexed & laid out on the filesystem, and how the above primitives and timeline activities are implemented on top of
such organization (i.e. how data is written). This is not to be confused with the notion of Read Optimized & Near Real-time tables, which merely capture how the underlying data is exposed
to the queries (i.e. how data is read).

Hoodie supports (or will soon support) the following storage types.

| Storage Type  | Supported Tables |
|---------------|------------------|
| Copy On Write | Read Optimized   |
| Merge On Read | Read Optimized + Near Real-time |

- Copy On Write : A heavily read-optimized storage type that simply creates new versions of files corresponding to the records that changed.
- Merge On Read : Additionally provides near real-time datasets (in the order of 5 mins) by shifting some of the write cost to the reads, merging incoming and on-disk data on-the-fly.

{% include callout.html content="Hoodie is a young project. merge-on-read is currently underway. Get involved [here](https://github.com/uber/hoodie/projects/1)" type="info" %}

Regardless of the storage type, Hoodie organizes a dataset into a directory structure under a `basepath`,
very similar to Hive tables. The dataset is broken up into partitions, which are folders containing files for that partition.
Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath.

Within each partition, records are distributed into multiple files. Each file is identified by a unique `file id` and the `commit` that
produced the file. In the case of updates, multiple files can share the same file id, written at different commits.

Each record is uniquely identified by a `record key` and mapped to a file id forever. This mapping between record key
and file id never changes once the first version of a record has been written to a file. In short, the
`file id` identifies a group of files that contain all versions of a group of records.
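
Putting this together, a dataset on disk could look roughly as follows (the file naming scheme shown is illustrative, not Hoodie's exact convention):

```
basepath/                          <-- dataset root
├── 2017/01/01/                    <-- partitionpath (e.g. date folders)
│   ├── fileId1_commit1.parquet    <-- first version of file group fileId1
│   ├── fileId1_commit3.parquet    <-- newer version, from a later commit
│   └── fileId2_commit1.parquet
└── 2017/01/02/
    └── fileId3_commit2.parquet
```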

## Copy On Write

As mentioned above, each commit on Copy On Write storage produces new versions of files. In other words, we implicitly compact on every
commit, such that only columnar data exists. As a result, the write amplification (number of bytes written for 1 byte of incoming data)
is much higher, while read amplification is close to zero: for example, updating a single record in a file rewrites that entire file as a new version.
This is a much desired property for a system like Hadoop, which is predominantly read-heavy.

The following illustrates how this works conceptually, with data being written into copy-on-write storage and two queries running on top of it.

{% include image.html file="hoodie_cow.png" alt="hoodie_cow.png" %}

As data gets written, updates to existing file ids produce a new version for that file id, stamped with the commit, while
inserts allocate a new file id and write its first version. These file versions and their commits are color coded above.
Normal SQL queries running against such a dataset (e.g: a select count(*) counting the total records in that partition) first check the timeline for the latest commit
and filter out all but the latest version of each file id. As you can see, an old query does not see the current inflight commit's files colored in pink,
but a new query starting after the commit picks up the new data. Thus queries are immune to any write failures/partial writes and only run on committed data.
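
A sketch of that read path, again with hypothetical types: group file versions by file id and keep only the newest version belonging to a completed commit.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative copy-on-write read path: an inflight commit's files (pink above)
// stay invisible until the commit completes on the timeline.
record FileVersionSketch(String fileId, String commitTime) {}

class SnapshotViewSketch {
    static List<FileVersionSketch> latestCommitted(List<FileVersionSketch> files,
                                                   Set<String> completedCommits) {
        // Drop versions from uncommitted writes, then keep the newest per file id.
        Map<String, Optional<FileVersionSketch>> latest = files.stream()
                .filter(f -> completedCommits.contains(f.commitTime()))
                .collect(Collectors.groupingBy(FileVersionSketch::fileId,
                        Collectors.maxBy(Comparator.comparing(FileVersionSketch::commitTime))));
        return latest.values().stream()
                .flatMap(Optional::stream)
                .collect(Collectors.toList());
    }
}
```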

The intention of copy on write storage is to fundamentally improve how datasets are managed today on Hadoop, through

- First class support for atomically updating data at the file level, instead of rewriting whole tables/partitions
- Ability to incrementally consume changes, as opposed to wasteful scans or fumbling with heuristic approaches
- Tight control of file sizes to keep query performance excellent (small files hurt query performance considerably)

## Merge On Read

Merge on read storage is a superset of copy on write, in the sense that it still provides a read optimized view of the dataset via the Read Optimized table.
Additionally, it stores incoming upserts for each file id onto a `row based append log`, which enables serving near real-time data to queries
by applying the append log onto the latest version of each file id on-the-fly at query time. Thus, this storage type attempts to balance read and write amplification intelligently, to provide near real-time queries.
The most significant change here is to the compactor, which now carefully chooses which append logs need to be compacted onto
their columnar base data, to keep query performance in check (larger append logs incur longer merge times on the query side).
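
A minimal sketch of that query-time merge, with hypothetical types standing in for the base file and append log:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative merge-on-read query path: start from the records in the latest
// base columnar file, then replay the row-based append log on top, so later
// updates to a record key win.
class MergeOnReadSketch {
    record LogEntry(String recordKey, String payload) {}

    static Map<String, String> nearRealTimeView(Map<String, String> baseFileRecords,
                                                List<LogEntry> appendLog) {
        Map<String, String> merged = new HashMap<>(baseFileRecords);
        for (LogEntry entry : appendLog) {
            merged.put(entry.recordKey(), entry.payload()); // log replay: last write wins
        }
        return merged;
    }
}
```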

The following illustrates how the storage works, and shows queries on both the near real-time table and the read optimized table.

{% include image.html file="hoodie_mor.png" alt="hoodie_mor.png" max-width="1000" %}

There are a lot of interesting things happening in this example, which bring out the subtleties in the approach.

- We now have commits every 1 minute or so, something we could not do with the other storage type.
- Within each file id group there is now an append log, which holds incoming updates to records in the base columnar files. In the example, the append logs hold
all the data from 10:05 to 10:10. The base columnar files are still versioned with the commit, as before.
Thus, if one were to simply look at the base files alone, then the storage layout looks exactly like a copy on write table.
- A periodic compaction process reconciles these changes from the append log and produces a new version of the base file, just like what happened at 10:05 in the example.
- There are two ways of querying the same underlying storage: the Read Optimized (RO) table and the Near-Realtime (RT) table, depending on whether we choose query performance or freshness of data.
- The semantics around when data from a commit is available to a query change in a subtle way for the RO table. Note that such a query
running at 10:10 won't see data after 10:05 above, while a query on the RT table always sees the freshest data.
- When we trigger compaction & what it decides to compact hold the key to solving these hard problems. By implementing a compaction
strategy where we aggressively compact the latest partitions compared to older partitions (as sketched below), we could ensure the RO table sees data
published within X minutes in a consistent fashion.
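
One such compaction strategy could look like the sketch below; this is an assumption for illustration, not Hoodie's shipped compactor logic:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative compaction strategy: compact append logs in the newest
// partitions first, so the RO table stays within X minutes of freshness
// where most queries concentrate.
class CompactionStrategySketch {
    record PendingLog(String partitionPath, String fileId, long logSizeBytes) {}

    static List<PendingLog> choose(List<PendingLog> pending, int budget) {
        return pending.stream()
                // assumes partition paths (e.g. yyyy/mm/dd) sort chronologically
                .sorted(Comparator.comparing(PendingLog::partitionPath).reversed())
                .limit(budget)
                .collect(Collectors.toList());
    }
}
```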

{% include callout.html content="Hoodie is a young project. merge-on-read is currently underway. Get involved [here](https://github.com/uber/hoodie/projects/1)" type="info" %}

The intention of merge on read storage is to enable near real-time processing directly on top of Hadoop, as opposed to copying
data out to specialized systems, which may not be able to handle the data volume.
BIN  docs/images/hoodie_cow.png  (new binary file, 30 KiB; not shown)
BIN  docs/images/hoodie_mor.png  (new binary file, 55 KiB; not shown)
BIN  docs/images/hoodie_timeline.png  (new binary file, 23 KiB; not shown)
@@ -2,9 +2,13 @@
title: Performance
keywords: performance
sidebar: mydoc_sidebar
permalink: implementation.html
---

Work In Progress

## Index

## Data Storage

## Performance