1
0
Files
hudi/docs/sql_queries.md

69 lines
3.7 KiB
Markdown

---
title: SQL Queries
keywords: sql hive spark presto
sidebar: mydoc_sidebar
permalink: sql_queries.html
toc: false
summary: In this page, we go over how to enable SQL queries on Hoodie built tables.
---
Hoodie registers the dataset into the Hive metastore backed by `HoodieInputFormat`. This makes the data accessible to
Hive & Spark & Presto automatically. To be able to perform normal SQL queries on such a dataset, we need to get the individual query engines
to call `HoodieInputFormat.getSplits()`, during query planning such that the right versions of files are exposed to it.
In the following sections, we cover the configs needed across different query engines to achieve this.
{% include callout.html content="Instructions are currently only for Copy-on-write storage" type="info" %}
## Hive
For HiveServer2 access, [install](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr)
the hoodie-hadoop-mr-x.y.z-SNAPSHOT.jar into the aux jars path and we should be able to recognize the Hoodie tables and query them correctly.
For beeline access, the `hive.input.format` variable needs to be set to the fully qualified path name of the inputformat `com.uber.hoodie.hadoop.HoodieInputFormat`
For Tez, additionally the `hive.tez.input.format` needs to be set to `org.apache.hadoop.hive.ql.io.HiveInputFormat`
## Spark
There are two ways of running Spark SQL on Hoodie datasets.
First method involves, setting `spark.sql.hive.convertMetastoreParquet=false`, forcing Spark to fallback
to using the Hive Serde to read the data (planning/executions is still Spark). This turns off optimizations in Spark
towards Parquet reading, which we will address in the next method based on path filters.
However benchmarks have not revealed any real performance degradation with Hoodie & SparkSQL, compared to native support.
{% include callout.html content="Get involved to improve this integration [here](https://github.com/uber/hoodie/issues/7) and [here](https://issues.apache.org/jira/browse/SPARK-19351) " type="info" %}
Sample command is provided below to spin up Spark Shell
```
$ spark-shell --jars hoodie-hadoop-mr-x.y.z-SNAPSHOT.jar --driver-class-path /etc/hive/conf --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 7g --executor-memory 2g --master yarn-client
scala> sqlContext.sql("select count(*) from uber.trips where datestr = '2016-10-02'").show()
```
For scheduled Spark jobs, a dependency to [hoodie-hadoop-mr](https://mvnrepository.com/artifact/com.uber.hoodie/hoodie-hadoop-mr) and [hoodie-client](https://mvnrepository.com/artifact/com.uber.hoodie/hoodie-client) modules needs to be added
and the same config needs to be set on `SparkConf` or conveniently via `HoodieReadClient.addHoodieSupport(conf)`
{% include callout.html content="Don't instantiate a HoodieWriteClient against a table you don't own. Hoodie is a single writer & multiple reader system as of now. You may accidentally cause incidents otherwise.
" type="warning" %}
The second method uses a new feature in Spark 2.x, which allows for the work of HoodieInputFormat to be done via a path filter as below. This method uses Spark built-in optimizations for
reading Parquet files, just like queries on non-hoodie tables.
```
spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter]);
```
## Presto
Presto requires a [patch](https://github.com/prestodb/presto/pull/7002) (until the PR is merged) and the hoodie-hadoop-mr jar to be placed
into `<presto_install>/plugin/hive-hadoop2/`.
{% include callout.html content="Get involved to improve this integration [here](https://github.com/uber/hoodie/issues/81)" type="info" %}