Adding hoodie-spark to support Spark Datasource for Hoodie
- Write with COW/MOR paths works fully
- Read with RO view works on both storages*
- Incremental view supported on COW
- Refactored out HoodieReadClient methods, to just contain key based access
- HoodieDataSourceHelpers class can now be used to construct inputs to datasource
- Tests in hoodie-client using new helpers and mechanisms
- Basic tests around save modes & insert/upserts (more to follow)
- Bumped up scala to 2.11, since 2.10 is deprecated & complains with scalatest
- Updated documentation to describe usage
- New sample app written using the DataSource API
committed by vinoth chandar
parent c98ee057fc
commit 64e0573aca
@@ -2,8 +2,8 @@
entries:
- title: sidebar
  product: Hoodie Ver
  version: 0.2
  product: Documentation
  version:
  folders:

  - title: Getting Started
@@ -90,4 +90,49 @@ summary: "Here we list all possible configurations and what they mean"
- [GCSConfigs](gcs_hoodie.html) (Hoodie GCS Configs) <br/>
<span style="color:grey">Configurations required for GCS and Hoodie co-operability.</span>

* [Hoodie Datasource](#datasource) <br/>
<span style="color:grey">Configs for datasource</span>
  - [write options](#writeoptions) (write.format.option(...)) <br/>
  <span style="color:grey">Options useful for writing datasets</span>
    - [OPERATION_OPT_KEY](#OPERATION_OPT_KEY) (Default: upsert) <br/>
    <span style="color:grey">Whether to do upsert, insert or bulkinsert for the write operation</span>
    - [STORAGE_TYPE_OPT_KEY](#STORAGE_TYPE_OPT_KEY) (Default: COPY_ON_WRITE) <br/>
    <span style="color:grey">The storage type for the underlying data, for this write.</span>
    - [TABLE_NAME_OPT_KEY](#TABLE_NAME_OPT_KEY) (Default: None (mandatory)) <br/>
    <span style="color:grey">Hive table name to register the dataset into.</span>
    - [PRECOMBINE_FIELD_OPT_KEY](#PRECOMBINE_FIELD_OPT_KEY) (Default: ts) <br/>
    <span style="color:grey">Field used in preCombining before the actual write. When two records have the same key value,
    the one with the largest value for the precombine field is picked, as determined by Object.compareTo(..)</span>
    - [PAYLOAD_CLASS_OPT_KEY](#PAYLOAD_CLASS_OPT_KEY) (Default: com.uber.hoodie.OverwriteWithLatestAvroPayload) <br/>
    <span style="color:grey">Payload class used. Override this if you would like to roll your own merge logic when upserting/inserting.
    This will render any value set for `PRECOMBINE_FIELD_OPT_VAL` ineffective</span>
    - [RECORDKEY_FIELD_OPT_KEY](#RECORDKEY_FIELD_OPT_KEY) (Default: uuid) <br/>
    <span style="color:grey">Record key field. Value to be used as the `recordKey` component of `HoodieKey`. The actual value
    is obtained by invoking .toString() on the field value. Nested fields can be specified using
    the dot notation, e.g: `a.b.c`</span>
    - [PARTITIONPATH_FIELD_OPT_KEY](#PARTITIONPATH_FIELD_OPT_KEY) (Default: partitionpath) <br/>
    <span style="color:grey">Partition path field. Value to be used as the `partitionPath` component of `HoodieKey`.
    The actual value is obtained by invoking .toString()</span>
    - [KEYGENERATOR_CLASS_OPT_KEY](#KEYGENERATOR_CLASS_OPT_KEY) (Default: com.uber.hoodie.SimpleKeyGenerator) <br/>
    <span style="color:grey">Key generator class that will extract the key out of the incoming `Row` object</span>
    - [COMMIT_METADATA_KEYPREFIX_OPT_KEY](#COMMIT_METADATA_KEYPREFIX_OPT_KEY) (Default: _) <br/>
    <span style="color:grey">Option keys beginning with this prefix are automatically added to the commit/deltacommit metadata.
    This is useful for storing checkpointing information in a way consistent with the hoodie timeline</span>

  - [read options](#readoptions) (read.format.option(...)) <br/>
  <span style="color:grey">Options useful for reading datasets</span>
    - [VIEW_TYPE_OPT_KEY](#VIEW_TYPE_OPT_KEY) (Default: READ_OPTIMIZED) <br/>
    <span style="color:grey">Whether data needs to be read in incremental mode (new data since an instantTime),
    Read Optimized mode (obtain the latest view, based on columnar data),
    or Real Time mode (obtain the latest view, based on row & columnar data)</span>
    - [BEGIN_INSTANTTIME_OPT_KEY](#BEGIN_INSTANTTIME_OPT_KEY) (Default: None (Mandatory in incremental mode)) <br/>
    <span style="color:grey">Instant time to start incrementally pulling data from. The instant time here need not
    necessarily correspond to an instant on the timeline. New data written with an
    `instant_time > BEGIN_INSTANTTIME` is fetched. For e.g: '20170901080000' will get
    all new data written after Sep 1, 2017 08:00AM.</span>
    - [END_INSTANTTIME_OPT_KEY](#END_INSTANTTIME_OPT_KEY) (Default: latest instant (i.e fetches all new data since begin instant time)) <br/>
    <span style="color:grey">Instant time to limit incrementally fetched data to. New data written with an
    `instant_time <= END_INSTANTTIME` is fetched.</span>


{% include callout.html content="Hoodie is a young project. A lot of pluggable interfaces and configurations to support diverse workloads need to be created. Get involved [here](https://github.com/uber/hoodie)" type="info" %}
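To make the `PRECOMBINE_FIELD_OPT_KEY` behavior above concrete, here is a minimal self-contained sketch (the `Rec` class and `preCombine` helper are illustrative stand-ins, not Hoodie's API): when two records share the same key, the one with the larger precombine field wins, per `Comparable.compareTo(..)`.

```
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class PreCombineDemo {
    // Illustrative record: a key plus a precombine value (e.g. a "ts" field) and a payload
    static class Rec {
        final String key; final long ts; final String payload;
        Rec(String key, long ts, String payload) { this.key = key; this.ts = ts; this.payload = payload; }
    }

    // Keep the record with the largest precombine value per key,
    // mirroring the compareTo(..) based de-duplication described above
    static Map<String, Rec> preCombine(Iterable<Rec> input) {
        Map<String, Rec> out = new HashMap<>();
        for (Rec r : input) {
            out.merge(r.key, r, (a, b) -> Long.compare(a.ts, b.ts) >= 0 ? a : b);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Rec> deduped = preCombine(Arrays.asList(
                new Rec("key1", 100L, "old"),
                new Rec("key1", 200L, "new")));
        System.out.println(deduped.get("key1").payload); // prints "new"
    }
}
```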
@@ -19,8 +19,28 @@ discusses a few tools that can be used to achieve these on different contexts.
Upserts can be used to apply a delta or an incremental change to a Hoodie dataset. For e.g, the incremental changes could be from a Kafka topic or files uploaded to HDFS or
even changes pulled from another Hoodie dataset.

#### Via Spark Job

#### DeltaStreamer
The `hoodie-spark` module offers the DataSource API to write any data frame into a Hoodie dataset. Following is how we can upsert a dataframe, while specifying the field names that need to be used
for `recordKey => _row_key`, `partitionPath => partition` and `precombineKey => timestamp`

```
inputDF.write()
       .format("com.uber.hoodie")
       .options(clientOpts) // any of the hoodie client opts can be passed in as well
       .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
       .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
       .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
       .option(HoodieWriteConfig.TABLE_NAME, tableName)
       .mode(SaveMode.Append)
       .save(basePath);
```

Please refer to the [configurations](configurations.html) section to view all datasource options.


#### DeltaStreamer Tool

The `HoodieDeltaStreamer` utility provides a way to achieve all of these, by using the capabilities of `HoodieWriteClient`.

@@ -151,8 +171,34 @@ This, together with upserts, are particularly useful for building data pipelines
joined with other tables (datasets/dimensions), to produce deltas to a target hoodie dataset. Then, using the delta streamer tool these deltas can be upserted into the
target hoodie dataset to complete the pipeline.

#### Pulling through Hive
#### Via Spark Job
The `hoodie-spark` module's DataSource API offers a more elegant way to pull data from a Hoodie dataset (plus more) and process it via Spark.
This class can be used within existing Spark jobs and offers the following functionality.

A sample incremental pull, that will obtain all records written since `beginInstantTime`, looks like below.

```
Dataset<Row> hoodieIncViewDF = spark.read()
        .format("com.uber.hoodie")
        .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY(),
                DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL())
        .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(),
                <beginInstantTime>)
        .load(tablePath); // For incremental view, pass in the root/base path of dataset
```

Please refer to the [configurations](configurations.html) section to view all datasource options.
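A note on how the incremental window behaves: instant times are fixed-width timestamp strings (e.g. `20170901080000`), so chronological order coincides with lexicographic string order, and the pull effectively selects `BEGIN_INSTANTTIME < instant_time <= END_INSTANTTIME`. A small self-contained sketch of that filtering (illustrative only, not Hoodie's implementation):

```
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class IncrementalWindowDemo {
    // Select instants strictly after `begin` and at or before `end`;
    // fixed-width yyyyMMddHHmmss strings sort chronologically
    static List<String> window(List<String> instants, String begin, String end) {
        return instants.stream()
                .filter(t -> t.compareTo(begin) > 0 && t.compareTo(end) <= 0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> commits = Arrays.asList("20170831120000", "20170901090000", "20170902100000");
        // Everything written after Sep 1, 2017 08:00AM:
        System.out.println(window(commits, "20170901080000", "20170902100000"));
        // prints [20170901090000, 20170902100000]
    }
}
```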

Additionally, `HoodieReadClient` offers the following functionality using Hoodie's implicit indexing.

| **API** | **Description** |
|---------|-----------------|
| read(keys) | Read out the data corresponding to the keys as a DataFrame, using Hoodie's own index for faster lookup |
| filterExists() | Filter out already existing records from the provided RDD[HoodieRecord]. Useful for de-duplication |
| checkExists(keys) | Check if the provided keys exist in a Hoodie dataset |
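To picture what `filterExists()` does, here is a plain-Java sketch using a `HashSet` as a stand-in for Hoodie's index (illustrative only; the real API works on `RDD[HoodieRecord]`):

```
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class FilterExistsDemo {
    // Drop incoming keys that already exist in the dataset (index stand-in),
    // leaving only true inserts - useful for de-duplication
    static List<String> filterExists(Set<String> existingKeys, List<String> incomingKeys) {
        return incomingKeys.stream()
                .filter(k -> !existingKeys.contains(k))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> existing = new HashSet<>(Arrays.asList("key1", "key2"));
        System.out.println(filterExists(existing, Arrays.asList("key1", "key3")));
        // prints [key3]
    }
}
```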

#### HiveIncrementalPuller Tool
`HiveIncrementalPuller` allows the above to be done via HiveQL, combining the benefits of Hive (reliably process complex SQL queries) and incremental primitives
(speed up query by pulling tables incrementally instead of scanning fully). The tool uses Hive JDBC to run the Hive query, saving its results in a temp table
that can later be upserted. The upsert utility (`HoodieDeltaStreamer`) has all the state it needs from the directory structure to know what should be the commit time on the target table.
@@ -183,24 +229,4 @@ it will automatically use the backfill configuration, since applying the last 24
is the lack of support for self-joining the same table in mixed mode (normal and incremental modes).


#### Pulling through Spark

`HoodieReadClient` (inside hoodie-client) offers a more elegant way to pull data from a Hoodie dataset (plus more) and process it via Spark.
This class can be used within existing Spark jobs and offers the following functionality.

| **API** | **Description** |
|---------|-----------------|
| listCommitsSince(), latestCommit() | Obtain commit times to pull data from |
| readSince(commitTime), readCommit(commitTime) | Provide the data from the commit time as a DataFrame, to process further on |
| read(keys) | Read out the data corresponding to the keys as a DataFrame, using Hoodie's own index for faster lookup |
| read(paths) | Read out the data under the specified path, with the functionality of HoodieInputFormat. An alternative way to do SparkSQL on Hoodie datasets |
| filterExists() | Filter out already existing records from the provided RDD[HoodieRecord]. Useful for de-duplication |
| checkExists(keys) | Check if the provided keys exist in a Hoodie dataset |

## SQL Streamer

Work in progress; the tool is being refactored out into open source Hoodie.


{% include callout.html content="Get involved in building this tool [here](https://github.com/uber/hoodie/issues/20)" type="info" %}

@@ -3,6 +3,7 @@ title: Quickstart
keywords: quickstart
tags: [quickstart]
sidebar: mydoc_sidebar
toc: false
permalink: quickstart.html
---

@@ -16,7 +17,7 @@ Normally build the maven project, from command line
$ mvn clean install -DskipTests
```

{% include callout.html content="You might want to add your spark assembly jar to project dependencies under 'Module Settings', to be able to run Spark from IDE" type="info" %}
{% include callout.html content="You might want to add your spark jars folder to project dependencies under 'Module Settings', to be able to run Spark from IDE" type="info" %}

{% include note.html content="Setup your local hadoop/hive test environment, so you can play with entire ecosystem. See [this](http://www.bytearray.io/2016/05/setting-up-hadoopyarnsparkhive-on-mac.html) for reference" %}

@@ -25,13 +26,15 @@ $ mvn clean install -DskipTests
## Generate a Hoodie Dataset


You can run the __hoodie-client/src/test/java/HoodieClientExample.java__ class, to place two commits (commit 1 => 100 inserts, commit 2 => 100 updates to previously inserted 100 records) onto your HDFS/local filesystem
### DataSource API

Run the __hoodie-spark/src/test/java/HoodieJavaApp.java__ class, to place two commits (commit 1 => 100 inserts, commit 2 => 100 updates to previously inserted 100 records) onto your HDFS/local filesystem

```
Usage: <main class> [options]
  Options:
    --help, -h
      Default: false
    --table-name, -n
      table name for Hoodie sample table
@@ -46,7 +49,14 @@ Usage: <main class> [options]

```

The class lets you choose table names, output paths and one of the storage types.
The class lets you choose table names, output paths and one of the storage types. In your own applications, be sure to include the `hoodie-spark` module as a dependency
and follow a similar pattern to write/read datasets via the datasource.

### RDD API

RDD level APIs give you more power and control over things, via the `hoodie-client` module.
Refer to the __hoodie-client/src/test/java/HoodieClientExample.java__ class for an example.


## Register Dataset to Hive Metastore
