1
0

Adding hoodie-spark to support Spark Datasource for Hoodie

- Write with COW/MOR paths work fully
 - Read with RO view works on both storages*
 - Incremental view supported on COW
 - Refactored out HoodieReadClient methods, to just contain key based access
 - HoodieDataSourceHelpers class can be now used to construct inputs to datasource
 - Tests in hoodie-client using new helpers and mechanisms
 - Basic tests around save modes & insert/upserts (more to follow)
 - Bumped up scala to 2.11, since 2.10 is deprecated & complains with scalatest
 - Updated documentation to describe usage
 - New sample app written using the DataSource API
This commit is contained in:
Vinoth Chandar
2017-08-28 01:28:08 -07:00
committed by vinoth chandar
parent c98ee057fc
commit 64e0573aca
44 changed files with 1830 additions and 331 deletions

View File

@@ -2,8 +2,8 @@
entries:
- title: sidebar
product: Hoodie Ver
version: 0.2
product: Documentation
version:
folders:
- title: Getting Started

View File

@@ -90,4 +90,49 @@ summary: "Here we list all possible configurations and what they mean"
- [GCSConfigs](gcs_hoodie.html) (Hoodie GCS Configs) <br/>
<span style="color:grey">Configurations required for GCS and Hoodie co-operability.</span>
* [Hoodie Datasource](#datasource) <br/>
<span style="color:grey">Configs for datasource</span>
- [write options](#writeoptions) (write.format.option(...)) <br/>
<span style="color:grey"> Options useful for writing datasets </span>
- [OPERATION_OPT_KEY](#OPERATION_OPT_KEY) (Default: upsert) <br/>
<span style="color:grey">whether to do upsert, insert or bulkinsert for the write operation</span>
- [STORAGE_TYPE_OPT_KEY](#STORAGE_TYPE_OPT_KEY) (Default: COPY_ON_WRITE) <br/>
<span style="color:grey">The storage type for the underlying data, for this write.</span>
- [TABLE_NAME_OPT_KEY](#TABLE_NAME_OPT_KEY) (Default: None (mandatory)) <br/>
<span style="color:grey">Hive table name, to register the dataset into.</span>
- [PRECOMBINE_FIELD_OPT_KEY](#PRECOMBINE_FIELD_OPT_KEY) (Default: ts) <br/>
<span style="color:grey">Field used in preCombining before actual write. When two records have the same key value,
we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)</span>
- [PAYLOAD_CLASS_OPT_KEY](#PAYLOAD_CLASS_OPT_KEY) (Default: com.uber.hoodie.OverwriteWithLatestAvroPayload) <br/>
<span style="color:grey">Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting.
This will render any value set for `PRECOMBINE_FIELD_OPT_VAL` in-effective</span>
- [RECORDKEY_FIELD_OPT_KEY](#RECORDKEY_FIELD_OPT_KEY) (Default: uuid) <br/>
<span style="color:grey">Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value
will be obtained by invoking .toString() on the field value. Nested fields can be specified using
the dot notation eg: `a.b.c`</span>
- [PARTITIONPATH_FIELD_OPT_KEY](#PARTITIONPATH_FIELD_OPT_KEY) (Default: partitionpath) <br/>
<span style="color:grey">Partition path field. Value to be used at the `partitionPath` component of `HoodieKey`.
Actual value ontained by invoking .toString()</span>
- [KEYGENERATOR_CLASS_OPT_KEY](#KEYGENERATOR_CLASS_OPT_KEY) (Default: com.uber.hoodie.SimpleKeyGenerator) <br/>
<span style="color:grey">Key generator class, that implements will extract the key out of incoming `Row` object</span>
- [COMMIT_METADATA_KEYPREFIX_OPT_KEY](#COMMIT_METADATA_KEYPREFIX_OPT_KEY) (Default: _) <br/>
<span style="color:grey">Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata.
This is useful to store checkpointing information, in a consistent way with the hoodie timeline</span>
- [read options](#readoptions) (read.format.option(...)) <br/>
<span style="color:grey">Options useful for reading datasets</span>
- [VIEW_TYPE_OPT_KEY](#VIEW_TYPE_OPT_KEY) (Default: = READ_OPTIMIZED) <br/>
<span style="color:grey">Whether data needs to be read, in incremental mode (new data since an instantTime)
(or) Read Optimized mode (obtain latest view, based on columnar data)
(or) Real time mode (obtain latest view, based on row & columnar data)</span>
- [BEGIN_INSTANTTIME_OPT_KEY](#BEGIN_INSTANTTIME_OPT_KEY) (Default: None (Mandatory in incremental mode)) <br/>
<span style="color:grey">Instant time to start incrementally pulling data from. The instanttime here need not
necessarily correspond to an instant on the timeline. New data written with an
`instant_time > BEGIN_INSTANTTIME` are fetched out. For e.g: '20170901080000' will get
all new data written after Sep 1, 2017 08:00AM.</span>
- [END_INSTANTTIME_OPT_KEY](#END_INSTANTTIME_OPT_KEY) (Default: latest instant (i.e fetches all new data since begin instant time)) <br/>
<span style="color:grey"> Instant time to limit incrementally fetched data to. New data written with an
`instant_time <= END_INSTANTTIME` are fetched out.</span>
{% include callout.html content="Hoodie is a young project. A lot of pluggable interfaces and configurations to support diverse workloads need to be created. Get involved [here](https://github.com/uber/hoodie)" type="info" %}

View File

@@ -19,8 +19,28 @@ discusses a few tools that can be used to achieve these on different contexts.
Upserts can be used to apply a delta or an incremental change to a Hoodie dataset. For e.g, the incremental changes could be from a Kafka topic or files uploaded to HDFS or
even changes pulled from another Hoodie dataset.
#### Via Spark Job
#### DeltaStreamer
The `hoodie-spark` module offers the DataSource API to write any data frame into a Hoodie dataset. Following is how we can upsert a dataframe, while specifying the field names that need to be used
for `recordKey => _row_key`, `partitionPath => partition` and `precombineKey => timestamp`
```
inputDF.write()
.format("com.uber.hoodie")
.options(clientOpts) // any of the hoodie client opts can be passed in as well
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.mode(SaveMode.Append)
.save(basePath);
```
Please refer to [configurations](configurations.html) section, to view all datasource options.
#### DeltaStreamer Tool
The `HoodieDeltaStreamer` utility provides the way to achieve all of these, by using the capabilities of `HoodieWriteClient`.
@@ -151,8 +171,34 @@ This, together with upserts, are particularly useful for building data pipelines
joined with other tables (datasets/dimensions), to produce deltas to a target hoodie dataset. Then, using the delta streamer tool these deltas can be upserted into the
target hoodie dataset to complete the pipeline.
#### Pulling through Hive
#### Via Spark Job
The `hoodie-spark` module offers the DataSource API, offers a more elegant way to pull data from Hoodie dataset (plus more) and process it via Spark.
This class can be used within existing Spark jobs and offers the following functionality.
A sample incremental pull, that will obtain all records written since `beginInstantTime`, looks like below.
```
Dataset<Row> hoodieIncViewDF = spark.read()
.format("com.uber.hoodie")
.option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY(),
DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL())
.option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(),
<beginInstantTime>)
.load(tablePath); // For incremental view, pass in the root/base path of dataset
```
Please refer to [configurations](configurations.html) section, to view all datasource options.
Additionally, `HoodieReadClient` offers the following functionality using Hoodie's implicit indexing.
| **API** | **Description** |
| read(keys) | Read out the data corresponding to the keys as a DataFrame, using Hoodie's own index for faster lookup |
| filterExists() | Filter out already existing records from the provided RDD[HoodieRecord]. Useful for de-duplication |
| checkExists(keys) | Check if the provided keys exist in a Hoodie dataset |
#### HiveIncrementalPuller Tool
`HiveIncrementalPuller` allows the above to be done via HiveQL, combining the benefits of Hive (reliably process complex SQL queries) and incremental primitives
(speed up query by pulling tables incrementally instead of scanning fully). The tool uses Hive JDBC to run the Hive query saving its results in a temp table.
that can later be upserted. Upsert utility (`HoodieDeltaStreamer`) has all the state it needs from the directory structure to know what should be the commit time on the target table.
@@ -183,24 +229,4 @@ it will automatically use the backfill configuration, since applying the last 24
is the lack of support for self-joining the same table in mixed mode (normal and incremental modes).
#### Pulling through Spark
`HoodieReadClient` (inside hoodie-client) offers a more elegant way to pull data from Hoodie dataset (plus more) and process it via Spark.
This class can be used within existing Spark jobs and offers the following functionality.
| **API** | **Description** |
| listCommitsSince(),latestCommit() | Obtain commit times to pull data from |
| readSince(commitTime),readCommit(commitTime) | Provide the data from the commit time as a DataFrame, to process further on |
| read(keys) | Read out the data corresponding to the keys as a DataFrame, using Hoodie's own index for faster lookup |
| read(paths) | Read out the data under specified path, with the functionality of HoodieInputFormat. An alternative way to do SparkSQL on Hoodie datasets |
| filterExists() | Filter out already existing records from the provided RDD[HoodieRecord]. Useful for de-duplication |
| checkExists(keys) | Check if the provided keys exist in a Hoodie dataset |
## SQL Streamer
work in progress, tool being refactored out into open source Hoodie
{% include callout.html content="Get involved in building this tool [here](https://github.com/uber/hoodie/issues/20)" type="info" %}

View File

@@ -3,6 +3,7 @@ title: Quickstart
keywords: quickstart
tags: [quickstart]
sidebar: mydoc_sidebar
toc: false
permalink: quickstart.html
---
@@ -16,7 +17,7 @@ Normally build the maven project, from command line
$ mvn clean install -DskipTests
```
{% include callout.html content="You might want to add your spark assembly jar to project dependencies under 'Module Setttings', to be able to run Spark from IDE" type="info" %}
{% include callout.html content="You might want to add your spark jars folder to project dependencies under 'Module Setttings', to be able to run Spark from IDE" type="info" %}
{% include note.html content="Setup your local hadoop/hive test environment, so you can play with entire ecosystem. See [this](http://www.bytearray.io/2016/05/setting-up-hadoopyarnsparkhive-on-mac.html) for reference" %}
@@ -25,13 +26,15 @@ $ mvn clean install -DskipTests
## Generate a Hoodie Dataset
You can run the __hoodie-client/src/test/java/HoodieClientExample.java__ class, to place a two commits (commit 1 => 100 inserts, commit 2 => 100 updates to previously inserted 100 records) onto your HDFS/local filesystem
### DataSource API
Run __hoodie-spark/src/test/java/HoodieJavaApp.java__ class, to place a two commits (commit 1 => 100 inserts, commit 2 => 100 updates to previously inserted 100 records) onto your HDFS/local filesystem
```
Usage: <main class> [options]
Options:
--help, -h
Default: false
--table-name, -n
table name for Hoodie sample table
@@ -46,7 +49,14 @@ Usage: <main class> [options]
```
The class lets you choose table names, output paths and one of the storage types.
The class lets you choose table names, output paths and one of the storage types. In your own applications, be sure to include the `hoodie-spark` module as dependency
and follow a similar pattern to write/read datasets via the datasource.
### RDD API
RDD level APIs give you more power and control over things, via the `hoodie-client` module .
Refer to __hoodie-client/src/test/java/HoodieClientExample.java__ class for an example.
## Register Dataset to Hive Metastore