From 848814beceb6103ba36aee9e5b7b05ec2f52eafe Mon Sep 17 00:00:00 2001
From: Vinoth Chandar
Date: Mon, 3 Apr 2017 16:59:44 -0700
Subject: [PATCH] Adding docs for deltastreamer, hivesync tool usage

---
 docs/incremental_processing.md | 120 ++++++++++++++++++++++++++++++++-
 docs/quickstart.md             |  35 +++++++++-
 2 files changed, 151 insertions(+), 4 deletions(-)

diff --git a/docs/incremental_processing.md b/docs/incremental_processing.md
index 7226fff30..f81547553 100644
--- a/docs/incremental_processing.md
+++ b/docs/incremental_processing.md
@@ -17,9 +17,12 @@ discusses a few tools that can be used to achieve these on different contexts.
 ## Upserts
 
 Upserts can be used to apply a delta or an incremental change to a Hoodie dataset. For e.g, the incremental changes could be from a Kafka topic or files uploaded to HDFS or
-even changes pulled from another Hoodie dataset. The `HoodieDeltaStreamer` utility provides the way to achieve all of these, by using the capabilities of `HoodieWriteClient`.
+even changes pulled from another Hoodie dataset.
 
-{% include callout.html content="Get involved in rewriting this tool [here](https://github.com/uber/hoodie/issues/20)" type="info" %}
+
+#### DeltaStreamer
+
+The `HoodieDeltaStreamer` utility provides a way to achieve all of these, using the capabilities of `HoodieWriteClient`.
 
 The tool is a spark job (part of hoodie-utilities), that provides the following functionality
 
@@ -27,6 +30,119 @@ The tool is a spark job (part of hoodie-utilities), that provides the following
 - Support json, avro or a custom payload types for the incoming data
 - New data is written to a Hoodie dataset, with support for checkpointing & schemas and registered onto Hive
+To understand more, run the tool with the `--help` option:
+
+```
+
+[hoodie]$ spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer hoodie-utilities/target/hoodie-utilities-0.3.6-SNAPSHOT-bin.jar --help
+Usage:
[options]
+  Options:
+    --help, -h
+       Default: false
+    --hoodie-client-config
+       path to properties file on localfs or dfs, with hoodie client config.
+       Sane defaults are used, but it is recommended to provide basic things like metrics
+       endpoints, hive configs etc
+    --key-generator-class
+       Subclass of com.uber.hoodie.utilities.common.KeyExtractor to generate a
+       HoodieKey from the given avro record. Built in: SimpleKeyGenerator (Uses provided
+       field names as recordkey & partitionpath. Nested fields specified via dot
+       notation, e.g: a.b.c)
+       Default: com.uber.hoodie.utilities.keygen.SimpleKeyGenerator
+    --key-generator-config
+       Path to properties file on localfs or dfs, with KeyGenerator configs. For
+       list of acceptable properties, refer the KeyGenerator class
+    --max-input-bytes
+       Maximum number of bytes to read from source. Default: 1TB
+       Default: 1099511627776
+    --op
+       Takes one of these values : UPSERT (default), INSERT (use when input is
+       purely new data/inserts to gain speed)
+       Default: UPSERT
+       Possible Values: [UPSERT, INSERT]
+    --payload-class
+       subclass of HoodieRecordPayload, that works off a GenericRecord. Default:
+       SourceWrapperPayload. Implement your own, if you want to do something other than overwriting
+       existing value
+       Default: com.uber.hoodie.utilities.deltastreamer.DeltaStreamerAvroPayload
+    --schemaprovider-class
+       subclass of com.uber.hoodie.utilities.schema.SchemaProvider to attach
+       schemas to input & target table data, built in options: FilebasedSchemaProvider
+       Default: com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
+    --schemaprovider-config
+       path to properties file on localfs or dfs, with schema configs. For list
+       of acceptable properties, refer the schema provider class
+    --source-class
+       subclass of com.uber.hoodie.utilities.sources.Source to use to read data.
+       built-in options: com.uber.hoodie.utilities.common.{DFSSource (default),
+       KafkaSource, HiveIncrPullSource}
+       Default: com.uber.hoodie.utilities.sources.DFSSource
+    --source-config
+       path to properties file on localfs or dfs, with source configs. For list
+       of acceptable properties, refer the source class
+    --source-format
+       Format of data in source, JSON (default), Avro. All source data is
+       converted to Avro using the provided schema in any case
+       Default: JSON
+       Possible Values: [AVRO, JSON, ROW, CUSTOM]
+    --source-ordering-field
+       Field within source record to decide how to break ties between records
+       with same key in input data. Default: 'ts' holding unix timestamp of record
+       Default: ts
+    --target-base-path
+       base path for the target hoodie dataset
+    --target-table
+       name of the target table in Hive
+
+
+```
+
+For example, the following ingests data from Kafka (avro records, as in the client example):
+
+
+```
+[hoodie]$ spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer hoodie-utilities/target/hoodie-utilities-0.3.6-SNAPSHOT-bin.jar \
+  --hoodie-client-config hoodie-utilities/src/main/resources/delta-streamer-config/hoodie-client.properties \
+  --key-generator-config hoodie-utilities/src/main/resources/delta-streamer-config/key-generator.properties \
+  --schemaprovider-config hoodie-utilities/src/main/resources/delta-streamer-config/schema-provider.properties \
+  --source-class com.uber.hoodie.utilities.sources.KafkaSource \
+  --source-config hoodie-utilities/src/main/resources/delta-streamer-config/source.properties \
+  --source-ordering-field rider \
+  --target-base-path file:///tmp/hoodie-deltastreamer-op \
+  --target-table uber.trips
+```
+
+
+#### Syncing to Hive
+
+Once new data is written to a Hoodie dataset via tools like the above, we need the ability to sync it with Hive and reflect the table schema, such that queries can pick up new columns and partitions.
To do this, Hoodie provides a `HiveSyncTool`, which can be
+invoked as below, once you have built the hoodie-hive module.
+
+```
+ [hoodie-hive]$ java -cp target/hoodie-hive-0.3.6-SNAPSHOT-jar-with-dependencies.jar:target/jars/* com.uber.hoodie.hive.HiveSyncTool --help
+Usage:
[options]
+  Options:
+  * --base-path
+      Basepath of hoodie dataset to sync
+  * --database
+      name of the target database in Hive
+    --help, -h
+
+      Default: false
+  * --jdbc-url
+      Hive jdbc connect url
+  * --pass
+      Hive password
+  * --table
+      name of the target table in Hive
+  * --user
+      Hive username
+
+
+```
+
+{% include callout.html content="Note that for now, due to jar mismatches between Spark & Hive, it's recommended to run this as a separate Java task in your workflow manager/cron. This is getting fixed [here](https://github.com/uber/hoodie/issues/123)" type="info" %}
+
 ## Incremental Pull
 
diff --git a/docs/quickstart.md b/docs/quickstart.md
index 2adb7ad2b..220c4084d 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -7,8 +7,6 @@ permalink: quickstart.html
 ---
 
-
-
 ## Download Hoodie
 
 Check out code and pull it into Intellij as a normal maven project.
@@ -38,6 +36,39 @@ hdfs dfs -copyFromLocal /tmp/hoodie/sample-table/* /tmp/hoodie/sample-table
 
 ## Register Dataset to Hive Metastore
 
+Now, let's see how we can publish this data into Hive.
+
+#### Starting up Hive locally
+
+```
+hdfs namenode # start name node
+hdfs datanode # start data node
+
+bin/hive --service metastore -p 10000 # start metastore
+bin/hiveserver2 \
+  --hiveconf hive.server2.thrift.port=10010 \
+  --hiveconf hive.root.logger=INFO,console \
+  --hiveconf hive.aux.jars.path=hoodie/hoodie-hadoop-mr/target/hoodie-hadoop-mr-0.3.6-SNAPSHOT.jar
+
+```
+
+
+#### Hive Sync Tool
+
+Once Hive is up and running, the sync tool can be used to sync commits done above to a Hive table, as follows.
+
+```
+java -cp target/hoodie-hive-0.3.6-SNAPSHOT-jar-with-dependencies.jar:target/jars/* com.uber.hoodie.hive.HiveSyncTool \
+  --base-path file:///tmp/hoodie/sample-table/ \
+  --database default \
+  --table hoodie_test \
+  --user hive \
+  --pass hive \
+  --jdbc-url jdbc:hive2://localhost:10010/
+
+```
+
+#### Manually via Beeline
 
 Add in the hoodie-hadoop-mr jar so, Hive can read the Hoodie dataset and answer the query.
 
 ```
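# Hypothetical follow-up, as a sketch only (not from the original doc): once
# Hive can read the dataset, the synced table could be queried over JDBC with
# beeline. The JDBC url, user, password and table name below are assumed from
# the Hive Sync Tool example above.
beeline -u jdbc:hive2://localhost:10010/default -n hive -p hive \
  -e "select count(*) from hoodie_test"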