Merge pull request #28 from vinothchandar/add-docs
Adding Documentation for Getting Started Section
@@ -12,7 +12,7 @@
<link rel="stylesheet" href="{{ "css/modern-business.css" }}">
<link rel="stylesheet" href="{{ "css/lavish-bootstrap.css" }}">
<link rel="stylesheet" href="{{ "css/customstyles.css" }}">
<link rel="stylesheet" href="{{ "css/theme-blue.css" }}">

<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery-cookie/1.4.1/jquery.cookie.min.js"></script>
@@ -5,6 +5,47 @@ sidebar: mydoc_sidebar
permalink: admin_guide.html
---

## Hoodie Admin CLI

### Launching Command Line

<todo - change this after packaging is done>

* mvn clean install in hoodie-cli
* ./hoodie-cli

If all goes well, you should get a command prompt similar to this one

```
prasanna@:~/hoodie/hoodie-cli$ ./hoodie-cli.sh
16/07/13 21:27:47 INFO xml.XmlBeanDefinitionReader: Loading XML bean definitions from URL [jar:file:/home/prasanna/hoodie/hoodie-cli/target/hoodie-cli-0.1-SNAPSHOT.jar!/META-INF/spring/spring-shell-plugin.xml]
16/07/13 21:27:47 INFO support.GenericApplicationContext: Refreshing org.springframework.context.support.GenericApplicationContext@372688e8: startup date [Wed Jul 13 21:27:47 UTC 2016]; root of context hierarchy
16/07/13 21:27:47 INFO annotation.AutowiredAnnotationBeanPostProcessor: JSR-330 'javax.inject.Inject' annotation found and supported for autowiring
============================================
*                                          *
*     _   _                 _ _            *
*    | | | |               | (_)           *
*    | |__| | ___   ___  __| |_  ___       *
*    |  __  |/ _ \ / _ \ / _` | |/ _ \     *
*    | | | | | (_) | (_) | (_| | | __/     *
*    |_| |_|\___/ \___/ \__,_|_|\___|      *
*                                          *
============================================

Welcome to Hoodie CLI. Please type help if you are looking for help.
hoodie->
```

### Commands

* connect --path [dataset_path] : Connect to a specific dataset by its path
* commits show : Show all details about the commits
* commits refresh : Refresh the commits from HDFS
* commit rollback --commit [commitTime] : Roll back a commit
* commit showfiles --commit [commitTime] : Show details of a commit (lists all the files modified, along with other metrics)
* commit showpartitions --commit [commitTime] : Show details of a commit (lists statistics aggregated at the partition level)
* commits compare --path [otherBasePath] : Compare the current dataset's commits with the dataset at the given path and report how many commits it is behind or ahead
* stats wa : Calculate commit-level and overall write amplification factor (total records written / total records upserted)
* help
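The write amplification factor reported by `stats wa` follows directly from its definition above (total records written / total records upserted). A minimal sketch of that arithmetic, with toy numbers that are not real CLI output:

```python
def write_amplification(records_written: int, records_upserted: int) -> float:
    """stats wa definition: total records written / total records upserted."""
    return records_written / records_upserted

# Toy example: upserting 100 records into a file holding 400 records forces
# the whole file to be rewritten, so 400 records are written for 100 upserted.
print(write_amplification(400, 100))  # 4.0
```

A factor close to 1.0 means most written records were actual upserts; larger factors indicate whole files being rewritten to apply relatively few changes.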
@@ -5,6 +5,13 @@ sidebar: mydoc_sidebar
permalink: code_structure.html
---

## Code & Project Structure

* hoodie-client : Spark client library to take a bunch of inserts + updates and apply them to a Hoodie table
* hoodie-common : Common code shared between different artifacts of Hoodie

We have embraced the [Google Java code style](https://google.github.io/styleguide/javaguide.html). Please set up your IDE accordingly, with style files from [here](https://github.com/google/styleguide).
@@ -5,6 +5,11 @@ sidebar: mydoc_sidebar
permalink: community.html
---

## Contributing

We :heart: contributions. If you find a bug in the library or would like to add new features, go ahead and open
issues or pull requests against this repo. Before you do so, please sign the
[Uber CLA](https://docs.google.com/a/uber.com/forms/d/1pAwS_-dA1KhPlfxzYLBqK6rsSWwRwH95OCCZrcsY5rk/viewform).
Also, be sure to write unit tests for your bug fix or feature to show that it works as expected.
@@ -5,6 +5,10 @@ sidebar: mydoc_sidebar
permalink: concepts.html
---

Hoodie provides the following primitives to build & access datasets on HDFS

* Upsert (how do I change the table efficiently?)
* Incremental consumption (how do I obtain records that changed?)
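The two primitives above can be sketched with a toy in-memory table (plain Python, purely illustrative; this is not the Hoodie API). Each record carries the commit time that last wrote it, which is what makes incremental consumption possible:

```python
def upsert(table, records, commit_time):
    """Insert new records or update existing ones by key, stamping each with the commit."""
    for key, value in records.items():
        table[key] = {"value": value, "_commit_time": commit_time}

def incremental_pull(table, last_commit_time):
    """Return only the records written after the given commit time."""
    return {k: v for k, v in table.items() if v["_commit_time"] > last_commit_time}

table = {}
upsert(table, {"r1": "a", "r2": "b"}, commit_time="001")   # first commit: two inserts
upsert(table, {"r2": "b2", "r3": "c"}, commit_time="002")  # second commit: one update, one insert

changed = incremental_pull(table, last_commit_time="001")
print(sorted(changed))  # ['r2', 'r3']
```

Note how the consumer only sees the records touched by commit `002`, without rescanning the whole table.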
BIN docs/images/hoodie_intro_1.png (new file; binary file not shown; 23 KiB)
217 docs/index.md
@@ -1,220 +1,29 @@
---
title: Hoodie Overview
keywords: homepage
tags: [getting_started]
sidebar: mydoc_sidebar
permalink: index.html
summary: "Hoodie lowers data latency across the board, while simultaneously achieving orders of magnitude better efficiency than traditional batch processing."
---

Hoodie manages storage of large analytical datasets on [HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) and serves them out via two types of tables

* **Read Optimized Table** - Provides excellent query performance via purely columnar storage (e.g. [Parquet](https://parquet.apache.org/))
* **Near-Real time Table** - Provides queries on real-time data, using a combination of columnar & row-based storage (e.g. Parquet + [Avro](http://avro.apache.org/docs/current/mr.html))

{% include image.html file="hoodie_intro_1.png" alt="hoodie_intro_1.png" %}

By carefully managing how data is laid out on storage & how it's exposed to queries, Hoodie is able to power a rich data ecosystem where external sources can be ingested into Hadoop in near-real time.
The ingested data is then available for interactive SQL engines like [Presto](https://prestodb.io) & [Spark](https://spark.apache.org/sql/),
while at the same time capable of being consumed incrementally from processing/ETL frameworks like [Hive](https://hive.apache.org/) & [Spark](https://spark.apache.org/docs/latest/) to build derived (hoodie) datasets.

Hoodie broadly consists of a self-contained Spark library to build datasets, and integrations with existing query engines for data access.

{% include callout.html content="Hoodie is a young project. Near-Real time Table implementation is currently underway. Get involved [here](https://github.com/uber/hoodie/projects/1)" type="info" %}
@@ -5,5 +5,7 @@ sidebar: mydoc_sidebar
permalink: powered_by.html
---

## Uber

Hoodie was originally developed at [Uber](https://uber.com) to achieve [low-latency database ingestion with high efficiency](http://www.slideshare.net/vinothchandar/hadoop-strata-talk-uber-your-hadoop-has-arrived/32).
It has been in production since Aug 2016, powering highly business-critical tables on Hadoop (7 of the 10 most-used tables, including trips, riders and partners, totalling 100s of TBs). It also powers several incremental Hive ETL pipelines, and is currently being integrated into Uber's data dispersal system.
@@ -6,5 +6,149 @@ sidebar: mydoc_sidebar
permalink: quickstart.html
---

## Download Hoodie

Check out the code and pull it into IntelliJ as a normal maven project.

Build the maven project from the command line

```
$ mvn clean install -DskipTests
```

{% include callout.html content="You might want to add your spark assembly jar to project dependencies under 'Module Settings', to be able to run Spark from the IDE" type="info" %}

{% include note.html content="Set up your local hadoop/hive test environment, so you can play with the entire ecosystem. See [this](http://www.bytearray.io/2016/05/setting-up-hadoopyarnsparkhive-on-mac.html) for reference" %}

## Generate a Hoodie Dataset

Create the output folder on your local HDFS

```
hdfs dfs -mkdir -p /tmp/hoodie/sample-table
```

You can run the __HoodieClientExample__ class to place two commits (commit 1 => 100 inserts, commit 2 => 100 updates to the previously inserted 100 records) onto your HDFS at /tmp/hoodie/sample-table

## Register Dataset to Hive Metastore

Add in the hoodie-hadoop-mr jar, so Hive can read the Hoodie dataset and answer queries.

```
hive> add jar file:///tmp/hoodie-hadoop-mr-0.2.7.jar;
Added [file:///tmp/hoodie-hadoop-mr-0.2.7.jar] to class path
Added resources: [file:///tmp/hoodie-hadoop-mr-0.2.7.jar]
```

Then, you need to create a ReadOptimized table as below (the only type supported as of now) and register the sample partitions

```
drop table hoodie_test;
CREATE EXTERNAL TABLE hoodie_test(`_row_key` string,
`_hoodie_commit_time` string,
`_hoodie_commit_seqno` string,
rider string,
driver string,
begin_lat double,
begin_lon double,
end_lat double,
end_lon double,
fare double)
PARTITIONED BY (`datestr` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'com.uber.hoodie.hadoop.HoodieInputFormat'
OUTPUTFORMAT
'com.uber.hoodie.hadoop.HoodieOutputFormat'
LOCATION
'hdfs:///tmp/hoodie/sample-table';

ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2016-03-15') LOCATION 'hdfs:///tmp/hoodie/sample-table/2016/03/15';
ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-16') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/16';
ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-17') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/17';
```

## Querying The Dataset

Now, we can proceed to query the dataset, as we would normally do, across all three supported query engines.

### HiveQL

Let's first perform a query on the latest committed snapshot of the table

```
hive> select count(*) from hoodie_test;
...
OK
100
Time taken: 18.05 seconds, Fetched: 1 row(s)
hive>
```

### SparkSQL

Spark is super easy, once you get Hive working as above. Just spin up a Spark shell as below

```
$ cd $SPARK_INSTALL
$ export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
$ spark-shell --jars /tmp/hoodie-hadoop-mr-0.2.7.jar --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false

scala> sqlContext.sql("show tables").show(10000)
scala> sqlContext.sql("describe hoodie_test").show(10000)
scala> sqlContext.sql("select count(*) from hoodie_test").show(10000)
```

### Presto

Check out the 'master' branch of OSS Presto, build it, and place your installation somewhere.

* Copy the hoodie-hadoop-mr-0.2.7 jar into $PRESTO_INSTALL/plugin/hive-hadoop2/
* Start up your server; you should then be able to query the same Hive table via Presto

```
show columns from hive.default.hoodie_test;
select count(*) from hive.default.hoodie_test;
```

## Incremental Queries

Let's now perform a query to obtain __ONLY__ the rows changed since a commit in the past.

```
hive> set hoodie.scan.mode=INCREMENTAL;
hive> set hoodie.last.commitTs=001;
hive> select `_hoodie_commit_time`, rider, driver from hoodie_test limit 10;
OK
All commits :[001, 002]
002	rider-001	driver-001
002	rider-001	driver-001
002	rider-002	driver-002
002	rider-001	driver-001
002	rider-001	driver-001
002	rider-002	driver-002
002	rider-001	driver-001
002	rider-002	driver-002
002	rider-002	driver-002
002	rider-001	driver-001
Time taken: 0.056 seconds, Fetched: 10 row(s)
hive>
```
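Conceptually, the incremental scan above returns only rows whose `_hoodie_commit_time` is later than the `hoodie.last.commitTs` the consumer supplies. A toy sketch of that filter (plain Python, illustrative only, not Hoodie's actual implementation):

```python
# Rows as they might look in the sample table, each stamped with the commit
# that wrote it (commit times compare correctly as zero-padded strings).
rows = [
    {"_hoodie_commit_time": "001", "rider": "rider-000", "driver": "driver-000"},
    {"_hoodie_commit_time": "002", "rider": "rider-001", "driver": "driver-001"},
    {"_hoodie_commit_time": "002", "rider": "rider-002", "driver": "driver-002"},
]

def incremental_scan(rows, last_commit_ts):
    """Keep only rows committed strictly after the consumer's last seen commit,
    mirroring hoodie.scan.mode=INCREMENTAL with hoodie.last.commitTs set."""
    return [r for r in rows if r["_hoodie_commit_time"] > last_commit_ts]

changed = incremental_scan(rows, "001")
print(len(changed))  # 2
```

This is why the sample query shows only `002` rows: the 100 records inserted by commit `001` were all rewritten by commit `002`, so every surviving row carries the later commit time.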
@@ -5,6 +5,10 @@ sidebar: mydoc_sidebar
permalink: roadmap.html
---

## Planned Features

* Support for Self Joins - As of now, you cannot incrementally consume the same table more than once, since the InputFormat does not understand the QueryPlan.
* Hoodie Spark Datasource - Allows reading and writing data using Apache Spark natively (without falling back to the InputFormat), which can be more performant
* Hoodie Presto Connector - Allows querying data managed by Hoodie using Presto natively, which can again boost [performance](https://prestodb.io/docs/current/release/release-0.138.html)
@@ -3,7 +3,77 @@ title: Use Cases
keywords: usecases
sidebar: mydoc_sidebar
permalink: use_cases.html
toc: false
---

Following are some sample use-cases for Hoodie.

## Near Real-Time Ingestion

Ingesting data from external sources (event logs, databases, etc.) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) is a well-known problem.
In most (if not all) Hadoop deployments, it is unfortunately solved in a piecemeal fashion, using a medley of ingestion tools,
even though this data is arguably the most valuable for the entire organization.

For RDBMS ingestion, Hoodie provides __faster loads via Upserts__, as opposed to costly & inefficient bulk loads. For e.g, you can read the MySQL BIN log or [Sqoop Incremental Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports) and apply the changes to an
equivalent Hoodie table on HDFS. This would be much faster/more efficient than a [bulk merge job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
or [complicated handcrafted merge workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)

For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / [Voldemort](http://www.project-voldemort.com/voldemort/) / [HBase](https://hbase.apache.org/), even moderately big installations store billions of rows.
It goes without saying that __full bulk loads are simply infeasible__, and more efficient approaches are needed if ingestion is to keep up with the typically high update volumes.

Even for immutable data sources like [Kafka](http://kafka.apache.org), Hoodie helps __enforce a minimum file size on HDFS__, which improves [NameNode health](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/).
This is all the more important in such a use-case, since event data is typically high volume (e.g. click streams) and, if not managed well, can cause serious damage to your Hadoop cluster.

Across all sources, Hoodie adds the much-needed ability to atomically publish new data to consumers via the notion of commits, shielding them from partial ingestion failures.

## Near Real-time Analytics

Typically, real-time [datamarts](https://en.wikipedia.org/wiki/Data_mart) are powered by specialized analytical stores such as [Druid](http://druid.io/), [MemSQL](http://www.memsql.com/) or [even OpenTSDB](http://opentsdb.net/).
This is absolutely perfect for lower-scale data ([relative to Hadoop installations like this](https://blog.twitter.com/2015/hadoop-filesystem-at-twitter))
that needs sub-second query responses, such as system monitoring or interactive real-time analysis.
But these systems typically end up getting abused for less interactive queries as well, since data on Hadoop is intolerably stale. This leads to under-utilization & wasteful hardware/license costs.

On the other hand, interactive SQL solutions on Hadoop such as Presto & SparkSQL excel in __queries that finish within a few seconds__.
By bringing __data freshness down to a few minutes__, Hoodie can provide a much more efficient alternative, as well as unlock real-time analytics on __several-magnitudes-larger datasets__ stored in HDFS.
Also, Hoodie has no external dependencies (like a dedicated HBase cluster used purely for real-time analytics), and thus enables faster analytics on much fresher data, without increasing the operational overhead.

## Incremental Processing Pipelines

One fundamental ability Hadoop provides is to build a chain of datasets derived from each other via DAGs expressed as workflows.
Workflows often depend on new data being output by multiple upstream workflows, and traditionally, availability of new data is indicated by a new HDFS folder/Hive partition.
Let's take a concrete example to illustrate this. An upstream workflow `U` can create a Hive partition for every hour, with data for that hour (event_time), at the end of each hour (processing_time), providing an effective freshness of 1 hour.
Then, a downstream workflow `D` kicks off immediately after `U` finishes, and does its own processing for the next hour, increasing the effective latency to 2 hours.

The above paradigm simply ignores late-arriving data, i.e. when `processing_time` and `event_time` drift apart.
Unfortunately, in today's post-mobile & pre-IoT world, __late data from intermittently connected mobile devices & sensors is the norm, not an anomaly__.
In such cases, the only remedy to guarantee correctness is to [reprocess the last few hours](https://falcon.apache.org/FalconDocumentation.html#Handling_late_input_data) worth of data,
over and over again each hour, which can significantly hurt efficiency across the entire ecosystem. For e.g; imagine reprocessing TBs worth of data every hour across hundreds of workflows.

Hoodie comes to the rescue again, by providing a way to consume new data (including late data) from an upstream Hoodie dataset `HU` at record granularity (not folders/partitions),
apply the processing logic, and efficiently update/reconcile late data with a downstream Hoodie dataset `HD`. Here, `HU` and `HD` can be scheduled continuously at a much more frequent interval,
such as 15 mins, providing an end-to-end latency of 30 mins at `HD`.
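The latency arithmetic in the example above can be made concrete. In a chain of batch workflows, each stage waits for a full interval of input before it runs, so the per-stage intervals simply add up (toy numbers taken from the text):

```python
def end_to_end_latency(stage_intervals_mins):
    """Effective latency of chained batch workflows, in minutes: each stage
    processes a full interval of input, so stage latencies accumulate."""
    return sum(stage_intervals_mins)

# Traditional hourly workflows U -> D: 1 hour + 1 hour = 2 hours
print(end_to_end_latency([60, 60]))  # 120

# Incremental Hoodie datasets HU -> HD, each scheduled every 15 mins
print(end_to_end_latency([15, 15]))  # 30
```

The win compounds with pipeline depth: every additional derived dataset adds one scheduling interval, so shrinking the interval from an hour to 15 minutes shrinks the whole chain's latency proportionally.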
{% include callout.html content="To achieve this, Hoodie borrows concepts from stream processing frameworks like [Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations), Pub/Sub systems like [Kafka](http://kafka.apache.org/documentation/#theconsumer)
or database replication technologies like [Oracle XStream](https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187).
For the more curious, a more detailed explanation of the benefits of Incremental Processing (compared to Stream Processing & Batch Processing) can be found [here](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)" type="info" %}

## Data Dispersal From Hadoop

A popular use-case for Hadoop is to crunch data and then disperse it back to an online serving store, to be used by an application.
For e.g, a Spark pipeline can [determine hard braking events on Hadoop](https://eng.uber.com/telematics/) and load them into a serving store like ElasticSearch,
to be used by the Uber application to increase safe driving. Typical architectures for this employ a `queue` between Hadoop and the serving store, to prevent overwhelming the target store.
A popular choice for this queue is Kafka, and this model often results in __redundant storage of the same data on HDFS (for offline analysis of computed results) and Kafka (for dispersal)__.

Once again, Hoodie can solve this problem efficiently. Using the same example, the Spark pipeline can keep upserting its output from
each run into a Hoodie dataset, which can then be incrementally tailed (just like a Kafka topic) for new data to be written into the serving store.