From dcc15d5d6f71f87d9f3055ef2e06ff3acc645bdc Mon Sep 17 00:00:00 2001
From: Vinoth Chandar
Date: Fri, 17 Feb 2017 20:19:18 -0800
Subject: [PATCH] Adding docs for running sql queries on hoodie datasets

---
 docs/sql_queries.md | 60 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/docs/sql_queries.md b/docs/sql_queries.md
index 2ac6fbe6e..43d4eef46 100644
--- a/docs/sql_queries.md
+++ b/docs/sql_queries.md
@@ -3,8 +3,66 @@ title: SQL Queries
 keywords: sql hive spark presto
 sidebar: mydoc_sidebar
 permalink: sql_queries.html
+toc: false
+summary: In this page, we go over how to enable SQL queries on tables built by Hoodie.
 ---
-Work In Progress
+Hoodie registers the dataset into the Hive metastore, backed by `HoodieInputFormat`. This makes the data automatically
+accessible to Hive, Spark & Presto. To be able to perform normal SQL queries on such a dataset, we need to get the individual query engines
+to call `HoodieInputFormat.getSplits()` during query planning.
+In the following sections, we cover the configs needed across the different query engines to achieve this.
+
+{% include callout.html content="Instructions are currently only for Copy-on-write storage" type="info" %}
+
+## Hive
+
+For HiveServer2 access, [install](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr)
+the hoodie-hadoop-mr-x.y.z-SNAPSHOT.jar into the aux jars path; Hive should then recognize Hoodie tables and query them correctly.
+
+For beeline access, the `hive.input.format` variable needs to be set to the fully qualified name of the input format,
+`com.uber.hoodie.hadoop.HoodieInputFormat`. For Tez, `hive.tez.input.format` additionally needs to be set to
+`org.apache.hadoop.hive.ql.io.HiveInputFormat`.
+
+## Spark
+
+There are two ways of running Spark SQL on Hoodie datasets.
+
+The first method involves setting `spark.sql.hive.convertMetastoreParquet=false`, forcing Spark to fall back
+to using the Hive SerDe to read the data (planning/execution is still done by Spark). This turns off Spark's
+optimizations for Parquet reading, which the second method below, based on path filters, retains.
+However, benchmarks have not revealed any real performance degradation with Hoodie & Spark SQL, compared to native support.
+
+{% include callout.html content="Get involved to improve this integration [here](https://github.com/uber/hoodie/issues/7) and [here](https://issues.apache.org/jira/browse/SPARK-19351)" type="info" %}
+
+A sample command to spin up the Spark shell is provided below:
+
+```
+$ spark-shell --jars hoodie-hadoop-mr-x.y.z-SNAPSHOT.jar --driver-class-path /etc/hive/conf --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 7g --executor-memory 2g --master yarn-client
+
+scala> sqlContext.sql("select count(*) from uber.trips where datestr = '2016-10-02'").show()
+```
+
+For scheduled Spark jobs, a dependency on the [hoodie-hadoop-mr](https://mvnrepository.com/artifact/com.uber.hoodie/hoodie-hadoop-mr) and [hoodie-client](https://mvnrepository.com/artifact/com.uber.hoodie/hoodie-client) modules needs to be added,
+and the same config needs to be set on `SparkConf`, or conveniently via `HoodieReadClient.addHoodieSupport(conf)`, as sketched below.
+
+{% include callout.html content="Don't instantiate a HoodieWriteClient against a table you don't own. Hoodie is a single-writer & multiple-reader system as of now, and you may accidentally cause incidents otherwise." type="warning" %}
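+
+For illustration, here is a minimal sketch of the two options for a scheduled job. This is not taken from the Hoodie codebase:
+the app name is made up, and we assume `HoodieReadClient.addHoodieSupport` lives in the `com.uber.hoodie` package of
+hoodie-client and takes & returns a `SparkConf`.
+
+```
+import org.apache.spark.SparkConf
+import com.uber.hoodie.HoodieReadClient
+
+// Option 1: set the flag directly on SparkConf ("hoodie-sql-job" is an illustrative app name)
+val conf = new SparkConf().setAppName("hoodie-sql-job")
+  .set("spark.sql.hive.convertMetastoreParquet", "false")
+
+// Option 2: let Hoodie apply the required read settings
+// (assumes addHoodieSupport accepts and returns a SparkConf)
+val hoodieConf = HoodieReadClient.addHoodieSupport(new SparkConf().setAppName("hoodie-sql-job"))
+```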
+" type="warning" %} + +The second method uses a new feature in Spark 2.x, which allows for the work of HoodieInputFormat to be done via a path filter as below. This method uses Spark built-in optimizations for +reading Parquet files, just like queries on non-hoodie tables. + +``` +spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter]); +``` + + +## Presto + +Presto requires a [patch](https://github.com/prestodb/presto/pull/7002) (until the PR is merged) and the hoodie-hadoop-mr jar to be placed +into `/plugin/hive-hadoop2/`. + +{% include callout.html content="Get involved to improve this integration [here](https://github.com/uber/hoodie/issues/81)" type="info" %}