Adding documentation for migration guide and COW vs MOR tradeoffs, moving some docs around for more clarity
@@ -20,6 +20,8 @@ Such key activities include
* `COMMITS` - A single commit captures information about an **atomic write** of a batch of records into a dataset.
Commits are identified by a monotonically increasing timestamp, denoting the start of the write operation.
* `CLEANS` - Background activity that gets rid of older versions of files in the dataset that are no longer needed.
* `DELTA_COMMITS` - A single commit captures information about an **atomic write** of a batch of records into a
MergeOnRead storage type of dataset.
* `COMPACTIONS` - Background activity to reconcile differential data structures within Hoodie, e.g. moving updates from row-based log files to columnar formats.
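
To make the ordering guarantee concrete, here is a minimal Java sketch of a timeline of such instants. It is an illustrative model under assumed names (`Instant`, `Timeline`, `Action`), not Hudi's internal API:

```java
// Minimal sketch (not Hudi's actual API): the timeline as an ordered
// sequence of instants, each pairing a monotonically increasing
// timestamp with one of the actions listed above.
import java.util.ArrayList;
import java.util.List;

enum Action { COMMIT, CLEAN, DELTA_COMMIT, COMPACTION }

class Instant {
    final String timestamp; // e.g. "20180924064621", sortable lexically
    final Action action;
    Instant(String timestamp, Action action) {
        this.timestamp = timestamp;
        this.action = action;
    }
}

class Timeline {
    private final List<Instant> instants = new ArrayList<>();

    // New instants must carry a timestamp greater than all prior ones;
    // this is what lets readers reason about "all data as of commit t".
    void add(Instant instant) {
        if (!instants.isEmpty()
                && instant.timestamp.compareTo(latest().timestamp) <= 0) {
            throw new IllegalArgumentException("timestamps must increase");
        }
        instants.add(instant);
    }

    Instant latest() { return instants.get(instants.size() - 1); }
}
```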
@@ -33,6 +35,24 @@ When there is late arriving data (data intended for 9:00 arriving >1 hr late at
With the help of the timeline, an incremental query attempting to get all new data that was committed successfully since 10:00 hours is able to very efficiently consume
only the changed files without, say, scanning all the time buckets > 07:00.
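
That filtering step can be sketched as follows. The commit-metadata shape used here (a map from commit time to touched files) is a hypothetical simplification for illustration, not Hudi's actual metadata format:

```java
// Hedged sketch of the incremental-pull idea: given metadata recording
// which files each commit touched, a consumer checkpointed at instant
// `checkpoint` only needs files written by strictly later commits --
// no scan over older time buckets.
import java.util.*;

class IncrementalPull {
    // commitTime -> files changed by that commit (hypothetical shape)
    static Set<String> changedFilesSince(SortedMap<String, List<String>> commits,
                                         String checkpoint) {
        Set<String> files = new TreeSet<>();
        // tailMap(checkpoint + "\0") yields only commits strictly after it
        for (List<String> touched : commits.tailMap(checkpoint + "\0").values()) {
            files.addAll(touched);
        }
        return files;
    }

    public static void main(String[] args) {
        SortedMap<String, List<String>> commits = new TreeMap<>();
        commits.put("0700", List.of("2018/09/24/f1_v1.parquet"));
        commits.put("0900", List.of("2018/09/24/f2_v1.parquet"));
        commits.put("1030", List.of("2018/09/24/f1_v2.parquet")); // late 9:00 data
        // Consumer checkpointed at 10:00 reads only files from the 10:30 commit.
        System.out.println(changedFilesSince(commits, "1000"));
    }
}
```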
## Terminologies
* `Hudi Dataset`
A structured hive/spark dataset managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables.
* `Commit`
A commit marks a new batch of data applied to a dataset. Hudi maintains monotonically increasing timestamps to track commits and guarantees that a commit is atomically
published.
* `Commit Timeline`
The Commit Timeline refers to the sequence of commits applied in order to a dataset over its lifetime.
* `File Slice`
Hudi provides efficient handling of updates by maintaining a fixed mapping from a record key to a logical file id.
Hudi uses MVCC to provide atomicity and isolation of readers from a writer. This means that a logical file id will
have many physical versions of it. Each of these physical versions represents a complete view of the
file as of a commit and is called a File Slice.
* `File Group`
A file group is a file-slice timeline: the list of file slices in commit order, identified by its `file id`.
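
A small sketch of the file group / file slice relationship under MVCC follows; the class and method names are hypothetical, not Hudi classes:

```java
// Illustrative sketch: a file group keyed by fileId holds file slices in
// commit order; a reader pinned to commit t sees the latest slice whose
// commit time is <= t (MVCC snapshot isolation).
import java.util.TreeMap;

class FileGroup {
    final String fileId;
    // commitTime -> physical file for that version (a "file slice")
    private final TreeMap<String, String> slices = new TreeMap<>();

    FileGroup(String fileId) { this.fileId = fileId; }

    void addSlice(String commitTime, String physicalFile) {
        slices.put(commitTime, physicalFile);
    }

    // Snapshot isolation: the slice a reader as of `commitTime` should see.
    String sliceAsOf(String commitTime) {
        var entry = slices.floorEntry(commitTime);
        return entry == null ? null : entry.getValue();
    }
}
```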
## Storage Types
Hoodie storage types capture how data is indexed & laid out on the filesystem, and how the above primitives and timeline activities are implemented on top of
@@ -62,23 +82,6 @@ Each record is uniquely identified by a `record key` and mapped to a file id for
and file id never changes once the first version of a record has been written to a file. In short, the
`file id` identifies a group of files that contain all versions of a group of records.
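
The fixed key-to-fileId contract can be sketched as below; this is an assumption-laden illustration, not Hudi's index implementation:

```java
// Sketch of the contract described above: once a record key is assigned a
// fileId, every later update for that key routes to the same file group.
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

class RecordIndex {
    private final Map<String, String> keyToFileId = new HashMap<>();

    // First insert assigns a fileId once; updates reuse the existing mapping.
    String fileGroupFor(String recordKey) {
        return keyToFileId.computeIfAbsent(
                recordKey, k -> UUID.randomUUID().toString());
    }
}
```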
## Copy On Write
@@ -136,3 +139,21 @@ There are lot of interesting things happening in this example, which bring out t
The intention of merge on read storage is to enable near real-time processing directly on top of Hadoop, as opposed to copying
data out to specialized systems, which may not be able to handle the data volume.
## Trade-offs when choosing different storage types and views
### Storage Types
| Trade-off | CopyOnWrite | MergeOnRead |
|-------------- |------------------| ------------------|
| Data Latency | Higher | Lower |
| Update cost (I/O) | Higher (rewrites entire parquet file) | Lower (appends to delta file) |
| Parquet File Size | Smaller (higher update (I/O) cost) | Larger (lower update cost) |
| Write Amplification | Higher | Lower (depending on compaction strategy) |
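
A rough sketch of the write paths behind these numbers; the classes are hypothetical pseudo-structure, not Hudi code. Copy-on-write pays the rewrite cost up front, while merge-on-read defers it to compaction:

```java
// Contrast of the two upsert paths that the table summarizes.
interface WritePath {
    void upsert(byte[] records);
}

class CopyOnWritePath implements WritePath {
    public void upsert(byte[] records) {
        // Read old base file + merge + write a brand new parquet file:
        // higher write amplification, but queries read pure columnar data.
        rewriteBaseFile(records);
    }
    void rewriteBaseFile(byte[] records) { /* merge and rewrite */ }
}

class MergeOnReadPath implements WritePath {
    public void upsert(byte[] records) {
        // Cheap append to a row-based log: low update cost and data latency,
        // but readers (or compaction) must merge base + log later.
        appendToDeltaLog(records);
    }
    void appendToDeltaLog(byte[] records) { /* append only */ }
}
```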
### Hudi Views
| Trade-off | ReadOptimized | RealTime |
|-------------- |------------------| ------------------|
| Data Latency | Higher | Lower |
| Query Latency | Lower (raw columnar performance) | Higher (merge columnar + row-based delta) |
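
The two views can be sketched as follows; this is illustrative only (a real real-time reader would also de-duplicate records by key during the merge), not Hudi's reader API:

```java
// Read-optimized serves only the compacted base file; real-time merges the
// base file with pending delta log records, trading query latency for freshness.
import java.util.ArrayList;
import java.util.List;

class FileSliceData {
    List<String> baseRecords = new ArrayList<>();   // columnar base file
    List<String> deltaRecords = new ArrayList<>();  // row-based log entries
}

class Views {
    // Lower query latency, higher data latency: skip the delta log.
    static List<String> readOptimized(FileSliceData slice) {
        return slice.baseRecords;
    }

    // Lower data latency, higher query latency: merge on the fly.
    static List<String> realTime(FileSliceData slice) {
        List<String> merged = new ArrayList<>(slice.baseRecords);
        merged.addAll(slice.deltaRecords); // real merge would de-dup by key
        return merged;
    }
}
```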