[HUDI-2864] Fix README and scripts with current limitations of hive sync (#4129)
* Fix README with current limitations of hive sync
* Fix dep issue
* Fix Copy on Write flow

Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
@@ -29,34 +29,30 @@ The first thing you need to do to start using this connector is building it. In
- Install [kcat](https://github.com/edenhill/kcat)
- Install jq. `brew install jq`

After installing these dependencies, execute the following commands. This will build all the Hudi dependency jars,
including the fat packaged jar that contains all the dependencies required for a functional Hudi Kafka Connect Sink.

```bash
mvn package -DskipTests -pl packaging/hudi-kafka-connect-bundle -am
```
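
As a quick sanity check that the bundle jar was produced (an illustrative command; the exact jar name depends on the version being built):

```bash
ls packaging/hudi-kafka-connect-bundle/target/hudi-kafka-connect-bundle-*.jar
```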

Next, we need to make sure that the hudi sink connector bundle jar is in the Kafka Connect classpath. Note that the
connect classpath should be the same as the one configured in the connector configuration file.

```bash
cp $HUDI_DIR/packaging/hudi-kafka-connect-bundle/target/hudi-kafka-connect-bundle-0.10.0-SNAPSHOT.jar /usr/local/share/java/hudi-kafka-connect/
```

## Trying the connector

After building the package, we need to install the Apache Kafka
After installing these dependencies, follow steps based on your requirement.

### 1 - Starting the environment

For runtime dependencies, we encourage using the Confluent HDFS connector jars. We have tested our setup with version `10.1.0`.
After downloading the connector, copy the jars from the lib folder to the Kafka Connect classpath.
Either use confluent-hub to install the connector or download it from [here](https://tinyurl.com/yb472f79).

Copy the entire folder to the classpath that will be used by the Hudi Kafka Connector.

```bash
confluent-hub install confluentinc/kafka-connect-hdfs:10.1.0
cp confluentinc-kafka-connect-hdfs-10.1.0/* /usr/local/share/kafka/plugins/
```

Now, build the packaged jar that contains all the Hudi classes, including the Hudi Kafka Connector, and copy it
to the plugin path that contains all the other jars (`/usr/local/share/kafka/plugins/lib`).

```bash
cd ${HUDI_DIR}
mvn package -DskipTests -pl packaging/hudi-kafka-connect-bundle -am
cp $HUDI_DIR/packaging/hudi-kafka-connect-bundle/target/hudi-kafka-connect-bundle-0.10.0-SNAPSHOT.jar /usr/local/share/kafka/plugins/lib
```

Add `confluentinc-kafka-connect-hdfs-10.1.0/lib` to the plugin.path (comma separated) in $HUDI_DIR/hudi-kafka-connect/demo/connect-distributed.properties.
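
For illustration, the resulting worker property would look roughly like this (a sketch only; the paths are placeholders, and any entries already present in the demo config should be kept):

```properties
# $HUDI_DIR/hudi-kafka-connect/demo/connect-distributed.properties (illustrative values)
plugin.path=/usr/local/share/kafka/plugins,/path/to/confluentinc-kafka-connect-hdfs-10.1.0/lib
```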

### 2 - Set up the docker containers

@@ -80,7 +76,7 @@ Essentially, follow the steps listed here:

Bring up the docker containers
```bash
cd $HUDI_DIR/docker
cd ${HUDI_DIR}/docker
./setup_demo.sh
```

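Once the script finishes, a quick way to list the demo containers that came up (the `adhoc-2` container used in the querying steps later should be among them):

```bash
docker ps --format '{{.Names}}'
```
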
@@ -93,7 +89,7 @@ registries, we use Confluent schema registry. Download the
latest [confluent platform](https://docs.confluent.io/platform/current/installation/index.html) and run the schema
registry service.

NOTE: You might need to change the port from `8081` to `8082`.
NOTE: You must change the port from `8081` (default) to `8082` to avoid conflict.

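One way to do this is to edit the Schema Registry configuration before starting the service; a minimal sketch, assuming the standard `listeners` property and the usual config path in the Confluent distribution:

```properties
# $CONFLUENT_DIR/etc/schema-registry/schema-registry.properties (path assumed)
listeners=http://0.0.0.0:8082
```
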
```bash
cd $CONFLUENT_DIR
@@ -137,8 +133,8 @@ Note that if multiple workers need to be run, the webserver needs to be reconfig
successful running of the workers.

```bash
cd $KAFKA_HOME
./bin/connect-distributed.sh $HUDI_DIR/hudi-kafka-connect/demo/connect-distributed.properties
cd ${KAFKA_HOME}
./bin/connect-distributed.sh ${HUDI_DIR}/hudi-kafka-connect/demo/connect-distributed.properties
```

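For the multi-worker case mentioned above, each worker needs its own REST endpoint; a minimal sketch of the kind of change involved, assuming the standard Kafka Connect `listeners` worker property and a hypothetical copy of the demo config for the second worker:

```properties
# connect-distributed-worker2.properties (hypothetical second worker config)
listeners=http://localhost:8084
```
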
### 7 - To add the Hudi Sink to the Connector (delete it if you want to re-configure)
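
To delete an existing instance before re-configuring it, the connector can be removed through the same REST API (the same command is used in the Hive section further below):

```bash
curl -X DELETE http://localhost:8083/connectors/hudi-sink
```
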
@@ -155,6 +151,7 @@ curl -X POST -H "Content-Type:application/json" -d @$HUDI_DIR/hudi-kafka-connect
Now, you should see that the connector is created and tasks are running.

```bash
mkdir /tmp/hoodie/
curl -X GET -H "Content-Type:application/json" http://localhost:8083/connectors
["hudi-sink"]
curl -X GET -H "Content-Type:application/json" http://localhost:8083/connectors/hudi-sink/status | jq
@@ -191,52 +188,7 @@ total 5168
-rw-r--r-- 1 user wheel 440214 Sep 13 21:43 E200FA75DCD1CED60BE86BCE6BF5D23A-0_0-0-0_20210913214114.parquet
```

### 8- Querying via Hive

```bash
docker exec -it adhoc-2 /bin/bash
beeline -u jdbc:hive2://hiveserver:10000 \
  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
  --hiveconf hive.stats.autogather=false

# List Tables
0: jdbc:hive2://hiveserver:10000> show tables;
+---------------------+--+
|      tab_name       |
+---------------------+--+
| huditesttopic_ro    |
| huditesttopic_rt    |
+---------------------+--+
3 rows selected (1.199 seconds)
0: jdbc:hive2://hiveserver:10000>

# Look at partitions that were added
0: jdbc:hive2://hiveserver:10000> show partitions huditesttopic_rt;
+-------------------+--+
|     partition     |
+-------------------+--+
| date=partition_0  |
| date=partition_1  |
| date=partition_2  |
| date=partition_3  |
| date=partition_4  |
+-------------------+--+
1 row selected (0.24 seconds)

0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from huditesttopic_rt;
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
+----------------------+---------+----------------------+---------+------------+-----------+--+
```

### 9 - Run async compaction and clustering if scheduled
### 8 - Run async compaction and clustering if scheduled

When using Merge-On-Read (MOR) as the table type, async compaction and clustering can be scheduled when the Sink is
running. Inline compaction and clustering are disabled by default for performance reasons. By default, async
@@ -365,3 +317,68 @@ hoodie.write.concurrency.mode=single_writer

Note that you don't have to provide the instant time through `--instant-time`. In that case, the earliest scheduled
clustering is going to be executed.

### 9- Querying via Hive

In this section, we explain how one can test syncing of the Hudi table with the Hive server / Hive Metastore,
which enables querying via Hive, Presto, etc.

First, (re-)install a different connector that is configured to write the Hudi table to HDFS instead of the local filesystem.

```bash
curl -X DELETE http://localhost:8083/connectors/hudi-sink
curl -X POST -H "Content-Type:application/json" -d @$HUDI_DIR/hudi-kafka-connect/demo/config-sink-hive.json http://localhost:8083/connectors
```

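The `config-sink-hive.json` referenced above enables Hive sync on top of the base sink configuration. For illustration only, the kind of Hudi hive-sync options involved looks like the following (an assumed subset using standard Hudi option names; the file shipped in the repo is authoritative):

```json
{
  "hoodie.datasource.hive_sync.enable": "true",
  "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://hiveserver:10000",
  "hoodie.datasource.hive_sync.partition_fields": "date"
}
```
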
After running the connector, you can query the Hive server using the following steps:

```bash
docker exec -it adhoc-2 /bin/bash
beeline -u jdbc:hive2://hiveserver:10000 \
  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
  --hiveconf hive.stats.autogather=false

# List Tables
0: jdbc:hive2://hiveserver:10000> show tables;
+---------------------+--+
|      tab_name       |
+---------------------+--+
| huditesttopic_ro    |
| huditesttopic_rt    |
+---------------------+--+
3 rows selected (1.199 seconds)
0: jdbc:hive2://hiveserver:10000>

# Look at partitions that were added
0: jdbc:hive2://hiveserver:10000> show partitions huditesttopic_ro;
+-------------------+--+
|     partition     |
+-------------------+--+
| date=partition_0  |
| date=partition_1  |
| date=partition_2  |
| date=partition_3  |
| date=partition_4  |
+-------------------+--+
1 row selected (0.24 seconds)

0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from huditesttopic_ro;
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
+----------------------+---------+----------------------+---------+------------+-----------+--+
```

`Current Limitation:` The Hudi Kafka Connect sink uses `Merge-On-Read` by default, and inserts/appends the Kafka records
directly to the log file(s). Asynchronously, the compaction service can be executed to merge the log files into base files (Parquet format).
Generally, we support both Read-Optimized queries, which read only the Parquet base files, and Snapshot queries, which read and merge
records across base and log files. However, there is currently a limitation where we are not able to read records from
log files alone. Hence, queries for Hudi Kafka Connect will only return results after compaction merges the records into base files. Alternatively,
users have the option to reconfigure the table type to `COPY_ON_WRITE` in config-sink.json.
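
For illustration, that override would look roughly like the following entry in config-sink.json, assuming the sink honors the standard Hudi write option `hoodie.datasource.write.table.type` (check the file in the repo for the exact property name it uses):

```json
{
  "hoodie.datasource.write.table.type": "COPY_ON_WRITE"
}
```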