diff --git a/docs/_data/sidebars/mydoc_sidebar.yml b/docs/_data/sidebars/mydoc_sidebar.yml
index b5c590d62..843dc09b0 100644
--- a/docs/_data/sidebars/mydoc_sidebar.yml
+++ b/docs/_data/sidebars/mydoc_sidebar.yml
@@ -35,6 +35,10 @@ entries:
       url: /concepts.html
       output: web
 
+    - title: Implementation
+      url: /implementation.html
+      output: web
+
     - title: SQL Queries
       url: /sql_queries.html
       output: web
@@ -47,10 +51,6 @@ entries:
       url: /admin_guide.html
      output: web
 
-    - title: Performance
-      url: /performance.html
-      output: web
-
     - title: Comparison
       url: /comparison.html
       output: web
diff --git a/docs/comparison.md b/docs/comparison.md
index 467b60984..0898417d0 100644
--- a/docs/comparison.md
+++ b/docs/comparison.md
@@ -7,4 +7,9 @@ permalink: comparison.html
 
 Work In Progress
 
+## Kudu
+
+## Hive Transactions
+
+## HBase
diff --git a/docs/concepts.md b/docs/concepts.md
index 9a2faac3c..8de1915a5 100644
--- a/docs/concepts.md
+++ b/docs/concepts.md
@@ -3,12 +3,119 @@ title: Concepts
 keywords: concepts
 sidebar: mydoc_sidebar
 permalink: concepts.html
+toc: false
+summary: "Here we introduce some basic concepts & give a broad technical overview of Hoodie"
 ---
 
-Hoodie provides the following primitives to build & access datasets on HDFS
+Hoodie provides the following primitives over datasets on HDFS:
 
- * Upsert (how do I change the table efficiently?)
- * Incremental consumption (how do I obtain records that changed?)
+ * Upsert (how do I change the dataset?)
+ * Incremental consumption (how do I fetch data that changed?)
 
-To reason about consistency of the above primitives, Hoodie introduces a notion of `COMMIT`
+To achieve this, Hoodie maintains a `timeline` of all activity performed on the dataset, which helps provide `instantaneous` views of the dataset,
+while also efficiently supporting retrieval of data in the order of its arrival into the dataset.
+Key activities on this timeline include:
+
+ * `COMMITS` - A single commit captures information about an **atomic write** of a batch of records into a dataset.
+   Commits are identified by a monotonically increasing timestamp, denoting the start of the write operation.
+ * `CLEANS` - Background activity that gets rid of older versions of files in the dataset that are no longer needed.
+ * `COMPACTIONS` - Background activity that reconciles differential data structures within Hoodie, e.g. moving updates from row-based log files into columnar formats.
+
+{% include image.html file="hoodie_timeline.png" alt="hoodie_timeline.png" %}
+
+The example above shows upserts happening between 10:00 and 10:20 on a Hoodie dataset, roughly every 5 minutes, leaving commit metadata on the Hoodie timeline, along
+with other background cleaning/compaction activity. One key observation is that the commit time indicates the `arrival time` of the data (10:20 AM), while the actual data
+organization reflects the actual time, or `event time`, the data was intended for (hourly buckets from 07:00). These are two key concepts when reasoning about tradeoffs between latency and completeness of data.
+
+When there is late-arriving data (data intended for 09:00 arriving more than an hour late, at 10:20), we can see the upsert producing new data in even older time buckets/folders.
+With the help of the timeline, an incremental query attempting to get all data committed since 10:00 can very efficiently consume
+only the changed files, without, say, scanning all the time buckets newer than 07:00.
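+
+To make incremental consumption concrete, here is a minimal sketch (plain Java; names like `FileVersion` and `changedSince` are illustrative, not Hoodie's actual API) of how a consumer could use the commit timestamps stamped on files to pick up only what changed since its last checkpoint:
+
+```java
+// Illustrative sketch only: these types are hypothetical, not Hoodie's API.
+import java.util.List;
+import java.util.stream.Collectors;
+
+public class IncrementalPullSketch {
+
+  // One file on storage, stamped with the commit that produced it.
+  record FileVersion(String path, String commitTime) {}
+
+  // Returns only files written by commits strictly after the consumer's last
+  // checkpoint; commit timestamps are monotonically increasing, so string
+  // comparison matches commit order.
+  static List<FileVersion> changedSince(List<FileVersion> allFiles, String checkpoint) {
+    return allFiles.stream()
+        .filter(f -> f.commitTime().compareTo(checkpoint) > 0)
+        .collect(Collectors.toList());
+  }
+
+  public static void main(String[] args) {
+    List<FileVersion> files = List.of(
+        new FileVersion("2017/03/15/07/file1.parquet", "09:55"),
+        new FileVersion("2017/03/15/09/file2.parquet", "10:20"),  // late data for 09:00
+        new FileVersion("2017/03/15/10/file3.parquet", "10:20"));
+
+    // A consumer checkpointed at 10:00 touches only the two changed files,
+    // instead of scanning every time bucket newer than 07:00.
+    System.out.println(changedSince(files, "10:00"));
+  }
+}
+```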
+
+## Storage Types
+
+Hoodie storage types capture how data is indexed & laid out on the filesystem, and how the above primitives and timeline activities are implemented on top of
+such an organization (i.e. how data is written). This is not to be confused with the notion of Read Optimized & Near-Real-time tables, which merely capture how the underlying data is exposed
+to queries (i.e. how data is read).
+
+Hoodie supports (or will support) the following storage types:
+
+| Storage Type  | Supported Tables |
+|---------------|------------------|
+| Copy On Write | Read Optimized   |
+| Merge On Read | Read Optimized + Near Real-time |
+
+ - Copy On Write : A heavily read-optimized storage type that simply creates new versions of the files containing the records that changed.
+ - Merge On Read : Additionally provides near-real-time datasets, on the order of 5 minutes, by shifting some of the write cost to the reads and merging incoming and on-disk data on the fly.
+
+{% include callout.html content="Hoodie is a young project. merge-on-read is currently underway. Get involved [here](https://github.com/uber/hoodie/projects/1)" type="info" %}
+
+Regardless of the storage type, Hoodie organizes a dataset into a directory structure under a `basepath`,
+very similar to Hive tables. The dataset is broken up into partitions, which are folders containing the files for that partition.
+Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath.
+
+Within each partition, records are distributed into multiple files. Each file is identified by a unique `file id` and the `commit` that
+produced the file. In the case of updates, multiple files can share the same file id, written at different commits.
+
+Each record is uniquely identified by a `record key` and mapped to a file id forever. This mapping between record key
+and file id never changes once the first version of a record has been written to a file. In short, a
+`file id` identifies a group of files that contain all versions of a group of records.
+
+## Copy On Write
+
+As mentioned above, each commit on Copy On Write storage produces new versions of files. In other words, we implicitly compact on every
+commit, such that only columnar data exists. As a result, the write amplification (number of bytes written for 1 byte of incoming data)
+is much higher, while the read amplification is close to zero. This is a much desired property for a system like Hadoop, which is predominantly read-heavy.
+
+The following illustrates how this works conceptually, with data being written into copy-on-write storage and two queries running on top of it.
+
+{% include image.html file="hoodie_cow.png" alt="hoodie_cow.png" %}
+
+As data gets written, updates to existing file ids produce a new version of the file, stamped with the commit, while
+inserts allocate a new file id and write the first version for that file id. These file versions and their commits are color-coded above.
+A normal SQL query running against such a dataset (e.g. `select count(*)` counting the total records in a partition) first checks the timeline for the latest commit
+and filters out all but the latest version of each file id. As you can see, an old query does not see the current inflight commit's files (colored pink),
+but a new query starting after the commit picks up the new data. Thus, queries are immune to any write failures/partial writes and only run on committed data.
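+
+The latest-version filtering described above can be sketched as follows (plain Java; `FileVersion` and `latestView` are illustrative names, not Hoodie's actual API):
+
+```java
+// Illustrative sketch only: these types are hypothetical, not Hoodie's API.
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class LatestViewSketch {
+
+  // One version of a file: the file id group it belongs to, and the commit
+  // that produced this particular version.
+  record FileVersion(String fileId, String commitTime, String path) {}
+
+  // Builds the view a query would see: for each file id, the newest version
+  // whose commit has completed on the timeline.
+  static Map<String, FileVersion> latestView(List<FileVersion> partitionFiles,
+                                             String lastCompletedCommit) {
+    Map<String, FileVersion> view = new HashMap<>();
+    for (FileVersion f : partitionFiles) {
+      // Versions written by inflight/failed commits stay invisible.
+      if (f.commitTime().compareTo(lastCompletedCommit) > 0) {
+        continue;
+      }
+      view.merge(f.fileId(), f,
+          (a, b) -> a.commitTime().compareTo(b.commitTime()) >= 0 ? a : b);
+    }
+    return view;
+  }
+
+  public static void main(String[] args) {
+    List<FileVersion> files = List.of(
+        new FileVersion("f1", "10:05", "f1_10:05.parquet"),
+        new FileVersion("f1", "10:10", "f1_10:10.parquet"),  // update to f1
+        new FileVersion("f2", "10:10", "f2_10:10.parquet"),
+        new FileVersion("f3", "10:15", "f3_10:15.parquet")); // inflight commit
+
+    // With 10:10 as the last completed commit, the query sees the 10:10
+    // version of f1, the first version of f2, and nothing from 10:15.
+    System.out.println(latestView(files, "10:10"));
+  }
+}
+```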
+
+The intention of copy on write storage is to fundamentally improve how datasets are managed on Hadoop today, through
+
+ - First-class support for atomically updating data at the file level, instead of rewriting whole tables/partitions
+ - The ability to incrementally consume changes, as opposed to wasteful scans or fumbling with heuristic approaches
+ - Tight control over file sizes to keep query performance excellent (small files hurt query performance considerably)
+
+## Merge On Read
+
+Merge on read storage is a superset of copy on write, in the sense that it still provides a read optimized view of the dataset via the Read Optimized table.
+Additionally, it stores incoming upserts for each file id in a `row based append log`, which enables serving near real-time data to queries
+by applying the append log onto the latest version of each file id, on the fly, at query time. Thus, this storage type attempts to balance read and write amplification intelligently, to provide near real-time queries.
+The most significant change here is to the compactor, which now carefully chooses which append logs need to be compacted onto
+their columnar base data, to keep query performance in check (larger append logs incur longer merge times on the query side).
+
+The following illustrates how the storage works, and shows queries on both the near-real-time table and the read optimized table.
+
+{% include image.html file="hoodie_mor.png" alt="hoodie_mor.png" max-width="1000" %}
+
+There are a lot of interesting things happening in this example, which bring out the subtleties in the approach.
+
+ - We now have commits every minute or so, something we could not do with the other storage type.
+ - Within each file id group, there is now an append log, which holds incoming updates to records in the base columnar files. In the example, the append logs hold
+   all the data from 10:05 to 10:10. The base columnar files are still versioned with the commit, as before.
+   Thus, if one were to simply look at the base files alone, the storage layout looks exactly like a copy on write table.
+ - A periodic compaction process reconciles these changes from the append log and produces a new version of the base file, just like what happened at 10:05 in the example.
+ - There are two ways of querying the same underlying storage: the Read Optimized (RO) table and the Near-Real-time (RT) table, depending on whether we choose query performance or freshness of data.
+ - The semantics around when data from a commit is available to a query change in a subtle way for the RO table. Note that such a query
+   running at 10:10 won't see data after 10:05 above, while a query on the RT table always sees the freshest data.
+ - When we trigger compaction, and what we choose to compact, holds the key to solving these hard problems. By implementing a compaction
+   strategy that aggressively compacts the latest partitions compared to older ones, we can ensure the RO table sees data
+   published within X minutes, in a consistent fashion.
+
+{% include callout.html content="Hoodie is a young project. merge-on-read is currently underway. Get involved [here](https://github.com/uber/hoodie/projects/1)" type="info" %}
+
+The intention of merge on read storage is to enable near real-time processing directly on top of Hadoop, as opposed to copying
+data out to specialized systems, which may not be able to handle the data volume.
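+
+The on-the-fly merge performed by an RT table query can be sketched as follows (plain Java; `Row` and `realtimeView` are illustrative names, not Hoodie's actual implementation):
+
+```java
+// Illustrative sketch only: these types are hypothetical, not Hoodie's API.
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+
+public class RealtimeViewSketch {
+
+  // A record, identified by its record key.
+  record Row(String recordKey, String value) {}
+
+  // Near-real-time view of one file id group: start from the latest compacted
+  // base file, then replay the append log in commit order, letting the later
+  // upsert for a record key win.
+  static Map<String, Row> realtimeView(List<Row> baseFile, List<Row> appendLog) {
+    Map<String, Row> merged = new LinkedHashMap<>();
+    for (Row r : baseFile) {
+      merged.put(r.recordKey(), r);
+    }
+    for (Row r : appendLog) {
+      merged.put(r.recordKey(), r);  // upsert: overwrite or insert
+    }
+    return merged;
+  }
+
+  public static void main(String[] args) {
+    // Base columnar file, compacted at 10:05.
+    List<Row> base = List.of(new Row("k1", "v1@10:05"), new Row("k2", "v2@10:05"));
+    // Row-based append log holding upserts between 10:05 and 10:10.
+    List<Row> log = List.of(new Row("k2", "v2@10:08"), new Row("k3", "v3@10:10"));
+
+    // An RT query sees k1 from the base file, the fresher k2, and the new k3;
+    // an RO query at the same instant would see only the base file.
+    System.out.println(realtimeView(base, log));
+  }
+}
+```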
diff --git a/docs/images/hoodie_cow.png b/docs/images/hoodie_cow.png
new file mode 100644
index 000000000..bad15a891
Binary files /dev/null and b/docs/images/hoodie_cow.png differ
diff --git a/docs/images/hoodie_mor.png b/docs/images/hoodie_mor.png
new file mode 100644
index 000000000..8d7d90254
Binary files /dev/null and b/docs/images/hoodie_mor.png differ
diff --git a/docs/images/hoodie_timeline.png b/docs/images/hoodie_timeline.png
new file mode 100644
index 000000000..b56aa0032
Binary files /dev/null and b/docs/images/hoodie_timeline.png differ
diff --git a/docs/performance.md b/docs/implementation.md
similarity index 56%
rename from docs/performance.md
rename to docs/implementation.md
index a3d1a1a77..b1222e78b 100644
--- a/docs/performance.md
+++ b/docs/implementation.md
@@ -2,9 +2,13 @@
-title: Performance
-keywords: performance
+title: Implementation
+keywords: implementation
 sidebar: mydoc_sidebar
-permalink: performance.html
+permalink: implementation.html
 ---
 
 Work In Progress
 
+## Index
+
+## Data Storage
+
+## Performance