Merge pull request #28 from vinothchandar/add-docs
Adding Documentation for Getting Started Section
@@ -12,7 +12,7 @@
<link rel="stylesheet" href="{{ "css/modern-business.css" }}">
<link rel="stylesheet" href="{{ "css/lavish-bootstrap.css" }}">
<link rel="stylesheet" href="{{ "css/customstyles.css" }}">
<link rel="stylesheet" href="{{ "css/theme-blue.css" }}">

<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery-cookie/1.4.1/jquery.cookie.min.js"></script>
@@ -5,6 +5,47 @@ sidebar: mydoc_sidebar
permalink: admin_guide.html
---

## Hoodie Admin CLI

### Launching Command Line

<todo - change this after packaging is done>

* mvn clean install in hoodie-cli
* ./hoodie-cli

If all goes well, you should get a command prompt similar to this one

```
prasanna@:~/hoodie/hoodie-cli$ ./hoodie-cli.sh
16/07/13 21:27:47 INFO xml.XmlBeanDefinitionReader: Loading XML bean definitions from URL [jar:file:/home/prasanna/hoodie/hoodie-cli/target/hoodie-cli-0.1-SNAPSHOT.jar!/META-INF/spring/spring-shell-plugin.xml]
16/07/13 21:27:47 INFO support.GenericApplicationContext: Refreshing org.springframework.context.support.GenericApplicationContext@372688e8: startup date [Wed Jul 13 21:27:47 UTC 2016]; root of context hierarchy
16/07/13 21:27:47 INFO annotation.AutowiredAnnotationBeanPostProcessor: JSR-330 'javax.inject.Inject' annotation found and supported for autowiring
============================================
*                                          *
*     _   _                 _ _            *
*    | | | |               | (_)           *
*    | |__| | ___   ___  __| |_  ___       *
*    |  __  |/ _ \ / _ \ / _` | |/ _ \     *
*    | | | | | (_) | (_) | (_| | | __/     *
*    |_| |_|\___/ \___/ \__,_|_|\___|      *
*                                          *
============================================

Welcome to Hoodie CLI. Please type help if you are looking for help.
hoodie->
```

### Commands

* connect --path [dataset_path] : Connect to a specific dataset by its path
* commits show : Show all details about the commits
* commits refresh : Refresh the commits from HDFS
* commit rollback --commit [commitTime] : Roll back a commit
* commit showfiles --commit [commitTime] : Show details of a commit (lists all the files modified, along with other metrics)
* commit showpartitions --commit [commitTime] : Show details of a commit (lists statistics aggregated at the partition level)
* commits compare --path [otherBasePath] : Compare the current dataset's commits with the dataset at the given path and report how many commits it is behind or ahead
* stats wa : Calculate commit-level and overall write amplification factor (total records written / total records upserted)
* help
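The write amplification factor reported by `stats wa` follows directly from its definition above (total records written / total records upserted). A minimal sketch of that arithmetic, with toy numbers that are not real CLI output:

```python
def write_amplification(records_written: int, records_upserted: int) -> float:
    """stats wa definition: total records written / total records upserted."""
    return records_written / records_upserted

# Toy example: upserting 100 records into a file holding 400 records forces
# the whole file to be rewritten, so 400 records are written for 100 upserted.
print(write_amplification(400, 100))  # 4.0
```

A factor close to 1.0 means most written records were actual upserts; larger factors indicate whole files being rewritten to apply relatively few changes.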
@@ -5,6 +5,13 @@ sidebar: mydoc_sidebar
permalink: code_structure.html
---

## Code & Project Structure

* hoodie-client : Spark client library to take a bunch of inserts + updates and apply them to a Hoodie table
* hoodie-common : Common code shared between different artifacts of Hoodie

We have embraced the [Google Java code style](https://google.github.io/styleguide/javaguide.html). Please set up your IDE accordingly, with style files from [here](https://github.com/google/styleguide).
@@ -5,6 +5,11 @@ sidebar: mydoc_sidebar
permalink: community.html
---

## Contributing

We :heart: contributions. If you find a bug in the library or would like to add new features, go ahead and open
issues or pull requests against this repo. Before you do so, please sign the
[Uber CLA](https://docs.google.com/a/uber.com/forms/d/1pAwS_-dA1KhPlfxzYLBqK6rsSWwRwH95OCCZrcsY5rk/viewform).
Also, be sure to write unit tests for your bug fix or feature to show that it works as expected.
@@ -5,6 +5,10 @@ sidebar: mydoc_sidebar
permalink: concepts.html
---

Hoodie provides the following primitives to build & access datasets on HDFS

* Upsert (how do I change the table efficiently?)
* Incremental consumption (how do I obtain records that changed?)
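The two primitives above can be sketched with a toy in-memory table (plain Python, purely illustrative; this is not the Hoodie API). Each record carries the commit time that last wrote it, which is what makes incremental consumption possible:

```python
def upsert(table, records, commit_time):
    """Insert new records or update existing ones by key, stamping each with the commit."""
    for key, value in records.items():
        table[key] = {"value": value, "_commit_time": commit_time}

def incremental_pull(table, last_commit_time):
    """Return only the records written after the given commit time."""
    return {k: v for k, v in table.items() if v["_commit_time"] > last_commit_time}

table = {}
upsert(table, {"r1": "a", "r2": "b"}, commit_time="001")   # first commit: two inserts
upsert(table, {"r2": "b2", "r3": "c"}, commit_time="002")  # second commit: one update, one insert

changed = incremental_pull(table, last_commit_time="001")
print(sorted(changed))  # ['r2', 'r3']
```

Note how the consumer only sees the records touched by commit `002`, without rescanning the whole table.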
BIN docs/images/hoodie_intro_1.png (new file; binary file not shown; 23 KiB)
217 docs/index.md
@@ -1,220 +1,29 @@
---
title: Hoodie Overview
keywords: homepage
tags: [getting_started]
sidebar: mydoc_sidebar
permalink: index.html
summary: "Hoodie lowers data latency across the board, while simultaneously achieving orders of magnitude better efficiency than traditional batch processing."
---

Hoodie manages storage of large analytical datasets on [HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) and serves them out via two types of tables

* **Read Optimized Table** - Provides excellent query performance via purely columnar storage (e.g. [Parquet](https://parquet.apache.org/))
* **Near-Real time Table** - Provides queries on real-time data, using a combination of columnar & row-based storage (e.g. Parquet + [Avro](http://avro.apache.org/docs/current/mr.html))

{% include image.html file="hoodie_intro_1.png" alt="hoodie_intro_1.png" %}

By carefully managing how data is laid out on storage & how it's exposed to queries, Hoodie is able to power a rich data ecosystem where external sources can be ingested into Hadoop in near-real time.
The ingested data is then available for interactive SQL engines like [Presto](https://prestodb.io) & [Spark](https://spark.apache.org/sql/),
while at the same time capable of being consumed incrementally from processing/ETL frameworks like [Hive](https://hive.apache.org/) & [Spark](https://spark.apache.org/docs/latest/) to build derived (hoodie) datasets.

Hoodie broadly consists of a self-contained Spark library to build datasets, and integrations with existing query engines for data access.

{% include callout.html content="Hoodie is a young project. Near-Real time Table implementation is currently underway. Get involved [here](https://github.com/uber/hoodie/projects/1)" type="info" %}
@@ -5,5 +5,7 @@ sidebar: mydoc_sidebar
permalink: powered_by.html
---

## Uber

Hoodie was originally developed at [Uber](https://uber.com) to achieve [low-latency database ingestion with high efficiency](http://www.slideshare.net/vinothchandar/hadoop-strata-talk-uber-your-hadoop-has-arrived/32).
It has been in production since Aug 2016, powering highly business-critical tables on Hadoop (7 of the 10 most-used tables, including trips, riders and partners, totalling 100s of TBs). It also powers several incremental Hive ETL pipelines, and is currently being integrated into Uber's data dispersal system.
@@ -6,5 +6,149 @@ sidebar: mydoc_sidebar
permalink: quickstart.html
---

## Download Hoodie

Check out the code and pull it into IntelliJ as a normal maven project.

Build the maven project from the command line

```
$ mvn clean install -DskipTests
```

{% include callout.html content="You might want to add your spark assembly jar to project dependencies under 'Module Settings', to be able to run Spark from the IDE" type="info" %}

{% include note.html content="Set up your local hadoop/hive test environment, so you can play with the entire ecosystem. See [this](http://www.bytearray.io/2016/05/setting-up-hadoopyarnsparkhive-on-mac.html) for reference" %}

## Generate a Hoodie Dataset

Create the output folder on your local HDFS

```
hdfs dfs -mkdir -p /tmp/hoodie/sample-table
```

You can run the __HoodieClientExample__ class to place two commits (commit 1 => 100 inserts, commit 2 => 100 updates to the previously inserted 100 records) onto your HDFS at /tmp/hoodie/sample-table

## Register Dataset to Hive Metastore

Add in the hoodie-hadoop-mr jar, so Hive can read the Hoodie dataset and answer queries.

```
hive> add jar file:///tmp/hoodie-hadoop-mr-0.2.7.jar;
Added [file:///tmp/hoodie-hadoop-mr-0.2.7.jar] to class path
Added resources: [file:///tmp/hoodie-hadoop-mr-0.2.7.jar]
```

Then, you need to create a ReadOptimized table as below (the only type supported as of now) and register the sample partitions

```
drop table hoodie_test;
CREATE EXTERNAL TABLE hoodie_test(`_row_key` string,
`_hoodie_commit_time` string,
`_hoodie_commit_seqno` string,
rider string,
driver string,
begin_lat double,
begin_lon double,
end_lat double,
end_lon double,
fare double)
PARTITIONED BY (`datestr` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'com.uber.hoodie.hadoop.HoodieInputFormat'
OUTPUTFORMAT
'com.uber.hoodie.hadoop.HoodieOutputFormat'
LOCATION
'hdfs:///tmp/hoodie/sample-table';

ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2016-03-15') LOCATION 'hdfs:///tmp/hoodie/sample-table/2016/03/15';
ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-16') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/16';
ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-17') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/17';
```

## Querying The Dataset

Now, we can proceed to query the dataset, as we would normally do, across all three supported query engines.

### HiveQL

Let's first perform a query on the latest committed snapshot of the table

```
hive> select count(*) from hoodie_test;
...
OK
100
Time taken: 18.05 seconds, Fetched: 1 row(s)
hive>
```

### SparkSQL

Spark is super easy, once you get Hive working as above. Just spin up a Spark shell as below

```
$ cd $SPARK_INSTALL
$ export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
$ spark-shell --jars /tmp/hoodie-hadoop-mr-0.2.7.jar --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false

scala> sqlContext.sql("show tables").show(10000)
scala> sqlContext.sql("describe hoodie_test").show(10000)
scala> sqlContext.sql("select count(*) from hoodie_test").show(10000)
```

### Presto

Check out the 'master' branch of OSS Presto, build it, and place your installation somewhere.

* Copy the hoodie-hadoop-mr-0.2.7 jar into $PRESTO_INSTALL/plugin/hive-hadoop2/
* Start up your server; you should then be able to query the same Hive table via Presto

```
show columns from hive.default.hoodie_test;
select count(*) from hive.default.hoodie_test;
```

## Incremental Queries

Let's now perform a query to obtain __ONLY__ the rows changed since a commit in the past.

```
hive> set hoodie.scan.mode=INCREMENTAL;
hive> set hoodie.last.commitTs=001;
hive> select `_hoodie_commit_time`, rider, driver from hoodie_test limit 10;
OK
All commits :[001, 002]
002	rider-001	driver-001
002	rider-001	driver-001
002	rider-002	driver-002
002	rider-001	driver-001
002	rider-001	driver-001
002	rider-002	driver-002
002	rider-001	driver-001
002	rider-002	driver-002
002	rider-002	driver-002
002	rider-001	driver-001
Time taken: 0.056 seconds, Fetched: 10 row(s)
hive>
```
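Conceptually, the incremental scan above returns only rows whose `_hoodie_commit_time` is later than the `hoodie.last.commitTs` the consumer supplies. A toy sketch of that filter (plain Python, illustrative only, not Hoodie's actual implementation):

```python
# Rows as they might look in the sample table, each stamped with the commit
# that wrote it (commit times compare correctly as zero-padded strings).
rows = [
    {"_hoodie_commit_time": "001", "rider": "rider-000", "driver": "driver-000"},
    {"_hoodie_commit_time": "002", "rider": "rider-001", "driver": "driver-001"},
    {"_hoodie_commit_time": "002", "rider": "rider-002", "driver": "driver-002"},
]

def incremental_scan(rows, last_commit_ts):
    """Keep only rows committed strictly after the consumer's last seen commit,
    mirroring hoodie.scan.mode=INCREMENTAL with hoodie.last.commitTs set."""
    return [r for r in rows if r["_hoodie_commit_time"] > last_commit_ts]

changed = incremental_scan(rows, "001")
print(len(changed))  # 2
```

This is why the sample query shows only `002` rows: the 100 records inserted by commit `001` were all rewritten by commit `002`, so every surviving row carries the later commit time.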
@@ -5,6 +5,10 @@ sidebar: mydoc_sidebar
permalink: roadmap.html
---

## Planned Features

* Support for Self Joins - As of now, you cannot incrementally consume the same table more than once, since the InputFormat does not understand the QueryPlan.
* Hoodie Spark Datasource - Allows reading and writing data using Apache Spark natively (without falling back to the InputFormat), which can be more performant
* Hoodie Presto Connector - Allows querying data managed by Hoodie using Presto natively, which can again boost [performance](https://prestodb.io/docs/current/release/release-0.138.html)
@@ -3,7 +3,77 @@ title: Use Cases
keywords: usecases
sidebar: mydoc_sidebar
permalink: use_cases.html
toc: false
---

Following are some sample use-cases for Hoodie.

## Near Real-Time Ingestion

Ingesting data from external sources (event logs, databases, etc.) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) is a well-known problem.
In most (if not all) Hadoop deployments, it is unfortunately solved in a piecemeal fashion, using a medley of ingestion tools,
even though this data is arguably the most valuable for the entire organization.

For RDBMS ingestion, Hoodie provides __faster loads via Upserts__, as opposed to costly & inefficient bulk loads. For e.g, you can read the MySQL BIN log or [Sqoop Incremental Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports) and apply the changes to an
equivalent Hoodie table on HDFS. This would be much faster/more efficient than a [bulk merge job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
or [complicated handcrafted merge workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)

For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / [Voldemort](http://www.project-voldemort.com/voldemort/) / [HBase](https://hbase.apache.org/), even moderately big installations store billions of rows.
It goes without saying that __full bulk loads are simply infeasible__, and more efficient approaches are needed if ingestion is to keep up with the typically high update volumes.

Even for immutable data sources like [Kafka](http://kafka.apache.org), Hoodie helps __enforce a minimum file size on HDFS__, which improves [NameNode health](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/).
This is all the more important in such a use-case, since event data is typically high volume (e.g. click streams) and, if not managed well, can cause serious damage to your Hadoop cluster.

Across all sources, Hoodie adds the much-needed ability to atomically publish new data to consumers via the notion of commits, shielding them from partial ingestion failures.

## Near Real-time Analytics

Typically, real-time [datamarts](https://en.wikipedia.org/wiki/Data_mart) are powered by specialized analytical stores such as [Druid](http://druid.io/), [MemSQL](http://www.memsql.com/) or [even OpenTSDB](http://opentsdb.net/).
This is absolutely perfect for lower-scale data ([relative to Hadoop installations like this](https://blog.twitter.com/2015/hadoop-filesystem-at-twitter))
that needs sub-second query responses, such as system monitoring or interactive real-time analysis.
But these systems typically end up getting abused for less interactive queries as well, since data on Hadoop is intolerably stale. This leads to under-utilization & wasteful hardware/license costs.

On the other hand, interactive SQL solutions on Hadoop such as Presto & SparkSQL excel in __queries that finish within a few seconds__.
By bringing __data freshness down to a few minutes__, Hoodie can provide a much more efficient alternative, as well as unlock real-time analytics on __several-magnitudes-larger datasets__ stored in HDFS.
Also, Hoodie has no external dependencies (like a dedicated HBase cluster used purely for real-time analytics), and thus enables faster analytics on much fresher data, without increasing the operational overhead.

## Incremental Processing Pipelines

One fundamental ability Hadoop provides is to build a chain of datasets derived from each other via DAGs expressed as workflows.
Workflows often depend on new data being output by multiple upstream workflows, and traditionally, availability of new data is indicated by a new HDFS folder/Hive partition.
Let's take a concrete example to illustrate this. An upstream workflow `U` can create a Hive partition for every hour, with data for that hour (event_time), at the end of each hour (processing_time), providing an effective freshness of 1 hour.
Then, a downstream workflow `D` kicks off immediately after `U` finishes, and does its own processing for the next hour, increasing the effective latency to 2 hours.

The above paradigm simply ignores late-arriving data, i.e. when `processing_time` and `event_time` drift apart.
Unfortunately, in today's post-mobile & pre-IoT world, __late data from intermittently connected mobile devices & sensors is the norm, not an anomaly__.
In such cases, the only remedy to guarantee correctness is to [reprocess the last few hours](https://falcon.apache.org/FalconDocumentation.html#Handling_late_input_data) worth of data,
over and over again each hour, which can significantly hurt efficiency across the entire ecosystem. For e.g; imagine reprocessing TBs worth of data every hour across hundreds of workflows.

Hoodie comes to the rescue again, by providing a way to consume new data (including late data) from an upstream Hoodie dataset `HU` at record granularity (not folders/partitions),
apply the processing logic, and efficiently update/reconcile late data with a downstream Hoodie dataset `HD`. Here, `HU` and `HD` can be scheduled continuously at a much more frequent interval,
such as 15 mins, providing an end-to-end latency of 30 mins at `HD`.
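The latency arithmetic in the example above can be made concrete. In a chain of batch workflows, each stage waits for a full interval of input before it runs, so the per-stage intervals simply add up (toy numbers taken from the text):

```python
def end_to_end_latency(stage_intervals_mins):
    """Effective latency of chained batch workflows, in minutes: each stage
    processes a full interval of input, so stage latencies accumulate."""
    return sum(stage_intervals_mins)

# Traditional hourly workflows U -> D: 1 hour + 1 hour = 2 hours
print(end_to_end_latency([60, 60]))  # 120

# Incremental Hoodie datasets HU -> HD, each scheduled every 15 mins
print(end_to_end_latency([15, 15]))  # 30
```

The win compounds with pipeline depth: every additional derived dataset adds one scheduling interval, so shrinking the interval from an hour to 15 minutes shrinks the whole chain's latency proportionally.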
{% include callout.html content="To achieve this, Hoodie borrows concepts from stream processing frameworks like [Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations), Pub/Sub systems like [Kafka](http://kafka.apache.org/documentation/#theconsumer)
or database replication technologies like [Oracle XStream](https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187).
For the more curious, a more detailed explanation of the benefits of Incremental Processing (compared to Stream Processing & Batch Processing) can be found [here](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)" type="info" %}

## Data Dispersal From Hadoop

A popular use-case for Hadoop is to crunch data and then disperse it back to an online serving store, to be used by an application.
For e.g, a Spark pipeline can [determine hard braking events on Hadoop](https://eng.uber.com/telematics/) and load them into a serving store like ElasticSearch,
to be used by the Uber application to increase safe driving. Typical architectures for this employ a `queue` between Hadoop and the serving store, to prevent overwhelming the target store.
A popular choice for this queue is Kafka, and this model often results in __redundant storage of the same data on HDFS (for offline analysis of computed results) and Kafka (for dispersal)__.

Once again, Hoodie can solve this problem efficiently. Using the same example, the Spark pipeline can keep upserting its output from
each run into a Hoodie dataset, which can then be incrementally tailed (just like a Kafka topic) for new data to be written into the serving store.