diff --git a/docs/_includes/head.html b/docs/_includes/head.html index a5700e6d6..25be6c718 100644 --- a/docs/_includes/head.html +++ b/docs/_includes/head.html @@ -12,7 +12,7 @@ - + diff --git a/docs/admin_guide.md b/docs/admin_guide.md index 9e3f18870..5143aa227 100644 --- a/docs/admin_guide.md +++ b/docs/admin_guide.md @@ -5,6 +5,47 @@ sidebar: mydoc_sidebar permalink: admin_guide.html --- -Work In Progress +## Hoodie Admin CLI +### Launching Command Line + + + +* Run `mvn clean install` in hoodie-cli +* Run `./hoodie-cli.sh` + +If all goes well, you should see a command prompt similar to this one +``` +prasanna@:~/hoodie/hoodie-cli$ ./hoodie-cli.sh +16/07/13 21:27:47 INFO xml.XmlBeanDefinitionReader: Loading XML bean definitions from URL [jar:file:/home/prasanna/hoodie/hoodie-cli/target/hoodie-cli-0.1-SNAPSHOT.jar!/META-INF/spring/spring-shell-plugin.xml] +16/07/13 21:27:47 INFO support.GenericApplicationContext: Refreshing org.springframework.context.support.GenericApplicationContext@372688e8: startup date [Wed Jul 13 21:27:47 UTC 2016]; root of context hierarchy +16/07/13 21:27:47 INFO annotation.AutowiredAnnotationBeanPostProcessor: JSR-330 'javax.inject.Inject' annotation found and supported for autowiring +============================================ +* * +* _ _ _ _ * +* | | | | | (_) * +* | |__| | ___ ___ __| |_ ___ * +* | __ |/ _ \ / _ \ / _` | |/ _ \ * +* | | | | (_) | (_) | (_| | | __/ * +* |_| |_|\___/ \___/ \__,_|_|\___| * +* * +============================================ + +Welcome to Hoodie CLI. Please type help if you are looking for help. 
+hoodie-> +``` + +### Commands + + * connect --path [dataset_path] : Connect to the specific dataset by its path + * commits show : Show all details about the commits + * commits refresh : Refresh the commits from HDFS + * commit rollback --commit [commitTime] : Roll back a commit + * commit showfiles --commit [commitTime] : Show details of a commit (lists all the files modified along with other metrics) + * commit showpartitions --commit [commitTime] : Show details of a commit (lists statistics aggregated at partition level) + + * commits compare --path [otherBasePath] : Compares the current dataset's commits against the path provided and tells you how many commits it is behind or ahead + * stats wa : Calculate commit level and overall write amplification factor (total records written / total records upserted) + * help + diff --git a/docs/code_structure.md b/docs/code_structure.md index b0ad81939..5f160fbe7 100644 --- a/docs/code_structure.md +++ b/docs/code_structure.md @@ -5,6 +5,13 @@ sidebar: mydoc_sidebar permalink: code_structure.html --- -Work In Progress +## Code & Project Structure + + * hoodie-client : Spark client library to take a bunch of inserts + updates and apply them to a Hoodie table + * hoodie-common : Common code shared between different artifacts of Hoodie + + + We have embraced the [Google Java code style](https://google.github.io/styleguide/javaguide.html). Please set up your IDE accordingly with style files from [here](https://github.com/google/styleguide) + diff --git a/docs/community.md b/docs/community.md index 533dade7c..83502964c 100644 --- a/docs/community.md +++ b/docs/community.md @@ -5,6 +5,11 @@ sidebar: mydoc_sidebar permalink: community.html --- -Work In Progress +## Contributing +We :heart: contributions. If you find a bug in the library or would like to add new features, go ahead and open +issues or pull requests against this repo. 
Before you do so, please sign the +[Uber CLA](https://docs.google.com/a/uber.com/forms/d/1pAwS_-dA1KhPlfxzYLBqK6rsSWwRwH95OCCZrcsY5rk/viewform). +Also, be sure to write unit tests for your bug fix or feature to show that it works as expected. + diff --git a/docs/concepts.md b/docs/concepts.md index 5d35dead2..389380b0e 100644 --- a/docs/concepts.md +++ b/docs/concepts.md @@ -5,6 +5,10 @@ sidebar: mydoc_sidebar permalink: concepts.html --- -Work In Progress +Hoodie provides the following primitives to build & access datasets on HDFS: + + * Upsert (how do I change the table efficiently?) + * Incremental consumption (how do I obtain records that changed?) + diff --git a/docs/images/hoodie_intro_1.png b/docs/images/hoodie_intro_1.png new file mode 100644 index 000000000..dd8977a19 Binary files /dev/null and b/docs/images/hoodie_intro_1.png differ diff --git a/docs/index.md b/docs/index.md index 32a03bd3e..f7fc045f3 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,220 +1,29 @@ --- -title: Hoodie - Upserts & Incrementals On Hadoop +title: Hoodie Overview keywords: homepage tags: [getting_started] sidebar: mydoc_sidebar permalink: index.html +summary: "Hoodie lowers data latency across the board, while simultaneously achieving orders of magnitude of efficiency over traditional batch processing." --- -Hoodie - Spark Library For Upserts & Incremental Consumption -============================================================= - -- - - - - -# Core Functionality # - -Hoodie provides the following abilities on a Hive table - - * Upsert (how do I change the table efficiently?) - * Incremental consumption (how do I obtain records that changed?) -Ultimately, make the built Hive table, queryable via Spark & Presto as well. 
+Hoodie manages storage of large analytical datasets on [HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) and serves them out via two types of tables + + * **Read Optimized Table** - Provides excellent query performance via purely columnar storage (e.g. [Parquet](https://parquet.apache.org/)) + * **Near-Real time Table** - Provides queries on real-time data, using a combination of columnar & row based storage (e.g. Parquet + [Avro](http://avro.apache.org/docs/current/mr.html)) -# Code & Project Structure # +{% include image.html file="hoodie_intro_1.png" alt="hoodie_intro_1.png" %} - * hoodie-client : Spark client library to take a bunch of inserts + updates and apply them to a Hoodie table - * hoodie-common : Common code shared between different artifacts of Hoodie +By carefully managing how data is laid out on storage & how it's exposed to queries, Hoodie is able to power a rich data ecosystem where external sources can be ingested into Hadoop in near-real time. +The ingested data is then available for interactive SQL Engines like [Presto](https://prestodb.io) & [Spark](https://spark.apache.org/sql/), +while at the same time capable of being consumed incrementally from processing/ETL frameworks like [Hive](https://hive.apache.org/) & [Spark](https://spark.apache.org/docs/latest/) to build derived (hoodie) datasets. + +Hoodie broadly consists of a self-contained Spark library to build datasets and integrations with existing query engines for data access. - We have embraced the [Google Java code style](https://google.github.io/styleguide/javaguide.html). Please setup your IDE accordingly with style files from [here] (https://github.com/google/styleguide) +{% include callout.html content="Hoodie is a young project. Near-Real time Table implementation is currently underway. 
Get involved [here](https://github.com/uber/hoodie/projects/1)" type="info" %} - -# Quickstart # - -Check out code and pull it into Intellij as a normal maven project. -> You might want to add your spark assembly jar to project dependencies under "Module Setttings", to be able to run Spark from IDE - -Setup your local hadoop/hive test environment. See [this](http://www.bytearray.io/2016/05/setting-up-hadoopyarnsparkhive-on-mac.html) for reference - -## Run the Hoodie Test Job ## - -Create the output folder on your local HDFS -``` -hdfs dfs -mkdir -p /tmp/hoodie/sample-table -``` - -You can run the __HoodieClientExample__ class, to place a set of inserts + updates onto your HDFS at /tmp/hoodie/sample-table - -## Access via Hive ## - -Add in the hoodie-mr jar so, Hive can pick up the right files to hit, to answer the query. - -``` -hive> add jar file:///tmp/hoodie-mr-0.1.jar; -Added [file:///tmp/hoodie-mr-0.1.jar] to class path -Added resources: [file:///tmp/hoodie-mr-0.1.jar] -``` - -Then, you need to create a table and register the sample partitions - - -``` -drop table hoodie_test; -CREATE EXTERNAL TABLE hoodie_test(`_row_key` string, -`_hoodie_commit_time` string, -`_hoodie_commit_seqno` string, - rider string, - driver string, - begin_lat double, - begin_lon double, - end_lat double, - end_lon double, - fare double) -PARTITIONED BY (`datestr` string) -ROW FORMAT SERDE - 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' -STORED AS INPUTFORMAT - 'com.uber.hoodie.hadoop.HoodieInputFormat' -OUTPUTFORMAT - 'com.uber.hoodie.hadoop.HoodieOutputFormat' -LOCATION - 'hdfs:///tmp/hoodie/sample-table'; - -ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2016-03-15') LOCATION 'hdfs:///tmp/hoodie/sample-table/2016/03/15'; -ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-16') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/16'; -ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-17') LOCATION 
'hdfs:///tmp/hoodie/sample-table/2015/03/17'; -``` - -Let's first perform a query on the latest committed snapshot of the table - -``` -hive> select count(*) from hoodie_test; -... -OK -100 -Time taken: 18.05 seconds, Fetched: 1 row(s) -hive> -``` - - -Let's now perform a query, to obtain the changed rows since a commit in the past - -``` -hive> set hoodie.scan.mode=INCREMENTAL; -hive> set hoodie.last.commitTs=001; -hive> select `_hoodie_commit_time`, rider, driver from hoodie_test limit 10; -OK -All commits :[001, 002] -002 rider-001 driver-001 -002 rider-001 driver-001 -002 rider-002 driver-002 -002 rider-001 driver-001 -002 rider-001 driver-001 -002 rider-002 driver-002 -002 rider-001 driver-001 -002 rider-002 driver-002 -002 rider-002 driver-002 -002 rider-001 driver-001 -Time taken: 0.056 seconds, Fetched: 10 row(s) -hive> -hive> -``` - - -## Access via Spark ## - -Spark is super easy, once you get Hive working as above. Just spin up a Spark Shell as below - -``` -$ cd $SPARK_INSTALL -$ export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop -$ spark-shell --jars /tmp/hoodie-mr-0.1.jar --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false - - -scala> sqlContext.sql("show tables").show(10000) -scala> sqlContext.sql("describe hoodie_test").show(10000) -scala> sqlContext.sql("select count(*) from hoodie_test").show(10000) -``` - - - -## Access via Presto ## - -Checkout the 'hoodie-integration' branch, build off it, and place your installation somewhere. 
- -* Copy the hoodie-mr jar into $PRESTO_INSTALL/plugin/hive-hadoop2/ - -* Change your catalog config, to make presto respect the __HoodieInputFormat__ - -``` -$ cat etc/catalog/hive.properties -connector.name=hive-hadoop2 -hive.metastore.uri=thrift://localhost:10000 -hive.respect-input-format-splits=true -``` - -startup your server and you should be able to query the same Hive table via Presto - -``` -show columns from hive.default.hoodie_test; -select count(*) from hive.default.hoodie_test -``` - -> NOTE: As of now, Presto has trouble accessing HDFS locally, hence create a new table as above, backed on local filesystem file:// as a workaround - -# Planned # -* Support for Self Joins - As of now, you cannot incrementally consume the same table more than once, since the InputFormat does not understand the QueryPlan. -* Hoodie Spark Datasource - Allows for reading and writing data back using Apache Spark natively (without falling back to InputFormat), which can be more performant -* Hoodie Presto Connector - Allows for querying data managed by Hoodie using Presto natively, which can again boost [performance](https://prestodb.io/docs/current/release/release-0.138.html) - - -# Hoodie Admin CLI -# Launching Command Line # - - - -* mvn clean install in hoodie-cli -* ./hoodie-cli - -If all is good you should get a command prompt similar to this one -``` -prasanna@:~/hoodie/hoodie-cli$ ./hoodie-cli.sh -16/07/13 21:27:47 INFO xml.XmlBeanDefinitionReader: Loading XML bean definitions from URL [jar:file:/home/prasanna/hoodie/hoodie-cli/target/hoodie-cli-0.1-SNAPSHOT.jar!/META-INF/spring/spring-shell-plugin.xml] -16/07/13 21:27:47 INFO support.GenericApplicationContext: Refreshing org.springframework.context.support.GenericApplicationContext@372688e8: startup date [Wed Jul 13 21:27:47 UTC 2016]; root of context hierarchy -16/07/13 21:27:47 INFO annotation.AutowiredAnnotationBeanPostProcessor: JSR-330 'javax.inject.Inject' annotation found and supported for autowiring 
-============================================ -* * -* _ _ _ _ * -* | | | | | (_) * -* | |__| | ___ ___ __| |_ ___ * -* | __ |/ _ \ / _ \ / _` | |/ _ \ * -* | | | | (_) | (_) | (_| | | __/ * -* |_| |_|\___/ \___/ \__,_|_|\___| * -* * -============================================ - -Welcome to Hoodie CLI. Please type help if you are looking for help. -hoodie-> -``` - -# Commands # - - * connect --path [dataset_path] : Connect to the specific dataset by its path - * commits show : Show all details about the commits - * commits refresh : Refresh the commits from HDFS - * commit rollback --commit [commitTime] : Rollback a commit - * commit showfiles --commit [commitTime] : Show details of a commit (lists all the files modified along with other metrics) - * commit showpartitions --commit [commitTime] : Show details of a commit (lists statistics aggregated at partition level) - - * commits compare --path [otherBasePath] : Compares the current dataset commits with the path provided and tells you how many commits behind or ahead - * stats wa : Calculate commit level and overall write amplification factor (total records written / total records upserted) - * help - -## Contributing -We :heart: contributions. If you find a bug in the library or would like to add new features, go ahead and open -issues or pull requests against this repo. Before you do so, please sign the -[Uber CLA](https://docs.google.com/a/uber.com/forms/d/1pAwS_-dA1KhPlfxzYLBqK6rsSWwRwH95OCCZrcsY5rk/viewform). -Also, be sure to write unit tests for your bug fix or feature to show that it works as expected. 
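The `stats wa` command in the admin CLI commands above reports write amplification as total records written divided by total records upserted. As a rough sketch of that arithmetic only, with hypothetical commit numbers (this is not Hoodie's implementation):

```python
# Hypothetical per-commit stats, purely for illustration.
# "written" counts every record physically rewritten to storage;
# "upserted" counts only the records the user actually changed.
commits = {
    "001": {"written": 100, "upserted": 100},  # initial insert: every written record is new
    "002": {"written": 100, "upserted": 10},   # 10 updates forced a rewrite of a 100-record file
}

def write_amplification(stats):
    # Write amplification factor = records written / records upserted.
    return stats["written"] / stats["upserted"]

# Commit-level factors, mirroring the per-commit breakdown
per_commit = {ts: write_amplification(s) for ts, s in commits.items()}

# Overall factor across all commits
total_written = sum(s["written"] for s in commits.values())
total_upserted = sum(s["upserted"] for s in commits.values())
overall = total_written / total_upserted  # 200 / 110, roughly 1.82
```

A factor close to 1 means most records written correspond to genuinely changed records; a large factor indicates small updates rewriting large files.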
diff --git a/docs/powered_by.md b/docs/powered_by.md index 9153034c8..f4e8e7d2d 100644 --- a/docs/powered_by.md +++ b/docs/powered_by.md @@ -5,5 +5,7 @@ sidebar: mydoc_sidebar permalink: powered_by.html --- -Work In Progress +## Uber +Hoodie was originally developed at [Uber](https://uber.com), to achieve [low latency database ingestion, with high efficiency](http://www.slideshare.net/vinothchandar/hadoop-strata-talk-uber-your-hadoop-has-arrived/32). +It has been in production since Aug 2016, powering highly business-critical tables on Hadoop (7/10 most used, including trips, riders & partners, totalling 100s of TBs). It also powers several incremental Hive ETL pipelines and is currently being integrated into Uber's data dispersal system. diff --git a/docs/quickstart.md b/docs/quickstart.md index 95c2ef25a..6642accd5 100644 --- a/docs/quickstart.md +++ b/docs/quickstart.md @@ -6,5 +6,149 @@ sidebar: mydoc_sidebar permalink: quickstart.html --- -Work In Progress + + + +## Download Hoodie + +Check out the code and pull it into Intellij as a normal maven project. + +Build the maven project normally, from the command line +``` +$ mvn clean install -DskipTests +``` + +{% include callout.html content="You might want to add your spark assembly jar to project dependencies under 'Module Settings', to be able to run Spark from the IDE" type="info" %} + +{% include note.html content="Set up your local hadoop/hive test environment, so you can play with the entire ecosystem. 
See [this](http://www.bytearray.io/2016/05/setting-up-hadoopyarnsparkhive-on-mac.html) for reference" %} + + + +## Generate a Hoodie Dataset + +Create the output folder on your local HDFS +``` +hdfs dfs -mkdir -p /tmp/hoodie/sample-table +``` + +You can run the __HoodieClientExample__ class, to place two commits (commit 1 => 100 inserts, commit 2 => 100 updates to the previously inserted 100 records) onto your HDFS at /tmp/hoodie/sample-table + + +## Register Dataset to Hive Metastore + +Add in the hoodie-hadoop-mr jar so Hive can read the Hoodie dataset and answer the query. + +``` +hive> add jar file:///tmp/hoodie-hadoop-mr-0.2.7.jar; +Added [file:///tmp/hoodie-hadoop-mr-0.2.7.jar] to class path +Added resources: [file:///tmp/hoodie-hadoop-mr-0.2.7.jar] ``` + +Then, you need to create a ReadOptimized table as below (the only type supported as of now) and register the sample partitions + + +``` +drop table hoodie_test; +CREATE EXTERNAL TABLE hoodie_test(`_row_key` string, +`_hoodie_commit_time` string, +`_hoodie_commit_seqno` string, + rider string, + driver string, + begin_lat double, + begin_lon double, + end_lat double, + end_lon double, + fare double) +PARTITIONED BY (`datestr` string) +ROW FORMAT SERDE + 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' +STORED AS INPUTFORMAT + 'com.uber.hoodie.hadoop.HoodieInputFormat' +OUTPUTFORMAT + 'com.uber.hoodie.hadoop.HoodieOutputFormat' +LOCATION + 'hdfs:///tmp/hoodie/sample-table'; + +ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2016-03-15') LOCATION 'hdfs:///tmp/hoodie/sample-table/2016/03/15'; +ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-16') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/16'; +ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-17') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/17'; +``` + +## Querying The Dataset + +Now, we can proceed to query the dataset, as we would normally do across all three query engines 
supported. + +### HiveQL + +Let's first perform a query on the latest committed snapshot of the table + +``` +hive> select count(*) from hoodie_test; +... +OK +100 +Time taken: 18.05 seconds, Fetched: 1 row(s) +hive> +``` + +### SparkSQL + +Spark is super easy, once you get Hive working as above. Just spin up a Spark Shell as below + +``` +$ cd $SPARK_INSTALL +$ export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop +$ spark-shell --jars /tmp/hoodie-hadoop-mr-0.2.7.jar --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false + + +scala> sqlContext.sql("show tables").show(10000) +scala> sqlContext.sql("describe hoodie_test").show(10000) +scala> sqlContext.sql("select count(*) from hoodie_test").show(10000) +``` + + +### Presto + +Check out the 'master' branch on OSS Presto, build it, and place your installation somewhere. + +* Copy the hoodie-hadoop-mr-0.2.7 jar into $PRESTO_INSTALL/plugin/hive-hadoop2/ +* Start up your server and you should be able to query the same Hive table via Presto + +``` +show columns from hive.default.hoodie_test; +select count(*) from hive.default.hoodie_test +``` + + + +## Incremental Queries + +Let's now perform a query, to obtain __only__ the changed rows since a commit in the past. 
+ +``` +hive> set hoodie.scan.mode=INCREMENTAL; +hive> set hoodie.last.commitTs=001; +hive> select `_hoodie_commit_time`, rider, driver from hoodie_test limit 10; +OK +All commits :[001, 002] +002 rider-001 driver-001 +002 rider-001 driver-001 +002 rider-002 driver-002 +002 rider-001 driver-001 +002 rider-001 driver-001 +002 rider-002 driver-002 +002 rider-001 driver-001 +002 rider-002 driver-002 +002 rider-002 driver-002 +002 rider-001 driver-001 +Time taken: 0.056 seconds, Fetched: 10 row(s) +hive> +hive> +``` + + + + + + diff --git a/docs/roadmap.md b/docs/roadmap.md index 8c376e544..692240c92 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -5,6 +5,10 @@ sidebar: mydoc_sidebar permalink: roadmap.html --- -Work In Progress +## Planned Features + +* Support for Self Joins - As of now, you cannot incrementally consume the same table more than once, since the InputFormat does not understand the QueryPlan. +* Hoodie Spark Datasource - Allows for reading and writing data back using Apache Spark natively (without falling back to InputFormat), which can be more performant +* Hoodie Presto Connector - Allows for querying data managed by Hoodie using Presto natively, which can again boost [performance](https://prestodb.io/docs/current/release/release-0.138.html) diff --git a/docs/use_cases.md b/docs/use_cases.md index 0cf54607b..b0a16ab9e 100644 --- a/docs/use_cases.md +++ b/docs/use_cases.md @@ -3,7 +3,77 @@ title: Use Cases keywords: usecases sidebar: mydoc_sidebar permalink: use_cases.html +toc: false --- -Work In Progress +Following are some sample use-cases for Hoodie. + + +## Near Real-Time Ingestion + +Ingesting data from external sources (event logs, databases, etc.) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) is a well-known problem. 
In most (if not all) Hadoop deployments, it is unfortunately solved in a piecemeal fashion, using a medley of ingestion tools, +even though this data is arguably the most valuable for the entire organization. + + +For RDBMS ingestion, Hoodie provides __faster loads via Upserts__, as opposed to costly & inefficient bulk loads. For e.g, you can read the MySQL BIN log or [Sqoop Incremental Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports) and apply them to an +equivalent Hoodie table on HDFS. This would be much faster & more efficient than a [bulk merge job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457) +or [complicated handcrafted merge workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/) + + +For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / [Voldemort](http://www.project-voldemort.com/voldemort/) / [HBase](https://hbase.apache.org/), even moderately big installations store billions of rows. +It goes without saying that __full bulk loads are simply infeasible__ and more efficient approaches are needed if ingestion is to keep up with the typically high update volumes. + + +Even for immutable data sources like [Kafka](http://kafka.apache.org), Hoodie helps __enforce a minimum file size on HDFS__, which improves [NameNode health](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/). +This is all the more important in such a use-case, since event data is typically high volume (e.g. click streams) and, if not managed well, can cause serious damage to your Hadoop cluster. 
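The upsert-based ingestion described above boils down to applying a batch of changed records to an existing dataset, keyed by a record key. A toy sketch of that semantics follows; the record shapes and the `upsert` helper are invented for illustration and are not Hoodie's API (which operates on Spark datasets over HDFS files):

```python
# Toy upsert: merge a changelog batch into an existing table keyed by "key".
def upsert(dataset, changes):
    merged = dict(dataset)
    for record in changes:
        merged[record["key"]] = record  # update if the key exists, insert otherwise
    return merged

table = {"r1": {"key": "r1", "fare": 10.0},
         "r2": {"key": "r2", "fare": 7.5}}
batch = [{"key": "r2", "fare": 9.0},   # update to an existing row
         {"key": "r3", "fare": 4.2}]   # brand-new row
table = upsert(table, batch)
```

Applied repeatedly, each incremental batch (e.g. read off a BIN log) touches only the changed keys, which is what makes this cheaper than a full bulk merge.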
Across all sources, Hoodie adds the much-needed ability to atomically publish new data to consumers via the notion of commits, shielding them from partial ingestion failures. + + +## Near Real-time Analytics + +Typically, real-time [datamarts](https://en.wikipedia.org/wiki/Data_mart) are powered by specialized analytical stores such as [Druid](http://druid.io/), [Memsql](http://www.memsql.com/) or [even OpenTSDB](http://opentsdb.net/). +This is absolutely perfect for lower scale ([relative to Hadoop installations like this](https://blog.twitter.com/2015/hadoop-filesystem-at-twitter)) data, +that needs sub-second query responses such as system monitoring or interactive real-time analysis. +But, typically these systems also end up getting abused for less interactive queries, since data on Hadoop is intolerably stale. This leads to under-utilization & wasteful hardware/license costs. + + +On the other hand, interactive SQL solutions on Hadoop such as Presto & SparkSQL excel in __queries that finish within a few seconds__. +By bringing __data freshness to a few minutes__, Hoodie can provide a much more efficient alternative, as well as unlock real-time analytics on __several magnitudes larger datasets__ stored in HDFS. +Also, Hoodie has no external dependencies (like a dedicated HBase cluster, purely used for real-time analytics) and thus enables faster analytics on much fresher data, without increasing the operational overhead. + + +## Incremental Processing Pipelines + +One fundamental ability Hadoop provides is to build a chain of datasets derived from each other via DAGs expressed as workflows. +Workflows often depend on new data being output by multiple upstream workflows and traditionally, availability of new data is indicated by a new HDFS Folder/Hive Partition. +Let's take a concrete example to illustrate this. 
An upstream workflow `U` can create a Hive partition for every hour, with data for that hour (event_time) at the end of each hour (processing_time), providing effective freshness of 1 hour. +Then, a downstream workflow `D` kicks off immediately after `U` finishes, and does its own processing for the next hour, increasing the effective latency to 2 hours. + +The above paradigm simply ignores late-arriving data, i.e. when `processing_time` and `event_time` drift apart. +Unfortunately, in today's post-mobile & pre-IoT world, __late data from intermittently connected mobile devices & sensors is the norm, not an anomaly__. +In such cases, the only remedy to guarantee correctness is to [reprocess the last few hours](https://falcon.apache.org/FalconDocumentation.html#Handling_late_input_data) worth of data, +over and over again each hour, which can significantly hurt the efficiency across the entire ecosystem. For e.g, imagine reprocessing TBs worth of data every hour across hundreds of workflows. + + +Hoodie comes to the rescue again, by providing a way to consume new data (including late data) from an upstream Hoodie dataset `HU` at a record granularity (not folders/partitions), +apply the processing logic, and efficiently update/reconcile late data with a downstream Hoodie dataset `HD`. Here, `HU` and `HD` can be scheduled continuously at a much more frequent cadence, +like 15 mins, providing an end-to-end latency of 30 mins at `HD`. + + +{% include callout.html content="To achieve this, Hoodie borrows concepts from stream processing frameworks like [Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations), Pub/Sub systems like [Kafka](http://kafka.apache.org/documentation/#theconsumer) +or database replication technologies like [Oracle XStream](https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187). 
For the more curious, a more detailed explanation of the benefits of Incremental Processing (compared to Stream Processing & Batch Processing) can be found [here](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)" type="info" %} + + +## Data Dispersal From Hadoop + +A popular use-case for Hadoop is to crunch data and then disperse it back to an online serving store, to be used by an application. +For e.g, a Spark Pipeline can [determine hard braking events on Hadoop](https://eng.uber.com/telematics/) and load them into a serving store like ElasticSearch, +to be used by the Uber application to increase safe driving. Typical architectures for this employ a `queue` between Hadoop and the serving store, to prevent overwhelming the target serving store. +A popular choice for this queue is Kafka, and this model often results in __redundant storage of the same data on HDFS (for offline analysis on computed results) and Kafka (for dispersal)__ + +Once again, Hoodie can solve this problem efficiently. Using the same example, the Spark Pipeline can keep upserting output from +each run into a Hoodie dataset, which can then be incrementally tailed (just like a Kafka topic) for new data to be written into the serving store.
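The incremental-tailing pattern above can be sketched as a pull that asks for records committed after a saved checkpoint, much like a Kafka consumer tracks offsets. A minimal illustration follows; the `_hoodie_commit_time` meta column comes from the quickstart's table schema, but the `incremental_pull` function and record shapes are hypothetical, not Hoodie's API:

```python
# Records as they might look in a Hoodie dataset, each tagged with its commit time.
records = [
    {"_hoodie_commit_time": "001", "key": "r1", "fare": 10.0},
    {"_hoodie_commit_time": "002", "key": "r2", "fare": 9.0},
    {"_hoodie_commit_time": "003", "key": "r3", "fare": 4.2},
]

def incremental_pull(rows, last_commit_ts):
    # Return only rows from commits strictly after the checkpoint, oldest first.
    fresh = [r for r in rows if r["_hoodie_commit_time"] > last_commit_ts]
    return sorted(fresh, key=lambda r: r["_hoodie_commit_time"])

# A dispersal job resumes from its checkpoint, writes the delta to the serving
# store, then advances the checkpoint, much like committing a Kafka offset.
new_rows = incremental_pull(records, "001")
checkpoint = new_rows[-1]["_hoodie_commit_time"] if new_rows else "001"
```

Because only the delta is pulled each run, the serving store sees the same stream of new results it would from a queue, without a second copy of the data living in Kafka.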