Docs for impl & comparison (#79)

* Initial version of comparison, implementation * Finished doc for comparison to other systems
2017-02-17 08:25:17 -08:00
parent a725382549
commit c7a8e15c78
3 changed files with 129 additions and 6 deletions
--- a/docs/implementation.md
+++ b/docs/implementation.md
@@ -1,14 +1,90 @@
 ---
-title: Performance
-keywords: performance
+title: Implementation
+keywords: implementation
 sidebar: mydoc_sidebar
+toc: false
 permalink: implementation.html
 ---

-Work In Progress
+Hoodie is implemented as a Spark library, which makes it easy to integrate into existing data pipelines or ingestion
+libraries (which we will refer to as `hoodie clients`). Hoodie Clients prepare an `RDD[HoodieRecord]` that contains the data to be upserted and
+Hoodie upsert/insert is merely a Spark DAG, that can be broken into two big pieces.
+
+ - **Indexing** :  A big part of Hoodie's efficiency comes from indexing the mapping from record keys to the file ids, to which they belong to.
+ This index also helps the `HoodieWriteClient` separate upserted records into inserts and updates, so they can be treated differently.
+ `HoodieReadClient` supports operations such as `filterExists` (used for de-duplication of table) and an efficient batch `read(keys)` api, that
+ can read out the records corresponding to the keys using the index much quickly, than a typical scan via a query. The index is also atomically
+ updated each commit, and is also rolled back when commits are rolled back.
+
+ - **Storage** : The storage part of the DAG is responsible for taking an `RDD[HoodieRecord]`, that has been tagged as
+ an insert or update via index lookup, and writing it out efficiently onto storage.

 ## Index

-## Data Storage
+Hoodie currently provides two choices for indexes : `BloomIndex` and `HBaseIndex` to map a record key into the file id to which it belongs to. This enables
+us to speed up upserts significantly, without scanning over every record in the dataset.
+
+#### HBase Index
+
+Here, we just use HBase in a straightforward way to store the mapping above. The challenge with using HBase (or any external key-value store
+ for that matter) is performing rollback of a commit and handling partial index updates.
+ Since the HBase table is indexed by record key and not commit Time, we would have to scan all the entries which will be prohibitively expensive.
+ Insteead, we store the commit time with the value and discard its value if it does not belong to a valid commit.
+
+#### Bloom Index
+
+This index is built by adding bloom filters with a very high false positive tolerance (e.g: 1/10^9), to the parquet file footers.
+The advantage of this index over HBase is the obvious removal of a big external dependency, and also nicer handling of rollbacks & partial updates
+since the index is part of the data file itself.
+
+At runtime, checking the Bloom Index for a given set of record keys effectively ammonts to checking all the bloom filters within a given
+partition, against the incoming records, using a Spark join. Much of the engineering effort towards the Bloom index has gone into scaling this join
+by caching the incoming RDD[HoodieRecord] to be able and dynamically tuning join parallelism, to avoid hitting Spark limitations like 2GB maximum
+for partition size. As a result, Bloom Index implementation has been able to handle single upserts upto 5TB, in a reliable manner.
+
+
+## Storage
+
+The implementation specifics of the two storage types, introduced in [concepts](concepts.html) section, are detailed below.
+
+
+#### Copy On Write
+
+The Spark DAG for this storage, is relatively simpler. The key goal here is to group the tagged hoodie record RDD, into a series of
+updates and inserts, by using a partitioner. To achieve the goals of maintaining file sizes, we first sample the input to obtain a `workload profile`
+that understands the spread of inserts vs updates, their distribution among the partitions etc. With this information, we bin-pack the
+records such that
+
+ - For updates, the latest version of the that file id, is rewritten once, with new values for all records that have changed
+ - For inserts, the records are first packed onto the smallest file in each partition path, until it reaches the configured maximum size.
+   Any remaining records after that, are again packed into new file id groups, again meeting the size requirements.
+
+In this storage, index updation is a no-op, since the bloom filters are already written as a part of committing data.
+
+#### Merge On Read
+
+Work in Progress .. .. .. .. ..

 ## Performance
+
+In this section, we go over some real world performance numbers for Hoodie upserts, incremental pull and compare them against
+the conventional alternatives for achieving these tasks.
+
+#### Upsert vs Bulk Loading
+
+
+
+
+
+#### Incremental Scan vs Full Scan
+
+
+
+
+
+#### Scalability of Upserts
+
+
+#### Copy On Write Regular Query Performance
+
+