[HUDI-2325] Add hive sync support to kafka connect (#3660)
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>

The first thing you need to do to start using this connector is building it. In order to do that, you need to install the following dependencies:

- [Java 1.8+](https://openjdk.java.net/)
- [Apache Maven](https://maven.apache.org/)
- Install [kcat](https://github.com/edenhill/kcat)
- Install jq: `brew install jq`

After installing these dependencies, execute the following commands. This will install all the Hudi dependency jars,
including the fat packaged jar that contains all the dependencies required for a functional Hudi Kafka Connect Sink.

```bash
cd $HUDI_DIR
mvn clean -DskipTests install
```

Henceforth, incremental builds can be performed as follows.

```bash
mvn clean -pl hudi-kafka-connect install -DskipTests
mvn clean -pl packaging/hudi-kafka-connect-bundle install
mvn package -DskipTests -pl packaging/hudi-kafka-connect-bundle -am
```

Next, we need to make sure that the hudi sink connector bundle jar is in the Kafka Connect classpath. Note that the connect
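
As a rough sketch (the plugin directory below is just an example location, and the exact bundle jar name depends on the Hudi version you built), copying the packaged jar onto the Connect classpath can look like this:

```bash
# Copy the freshly built kafka-connect bundle jar into a directory that is on the Connect plugin.path
mkdir -p /usr/local/share/kafka/plugins
cp $HUDI_DIR/packaging/hudi-kafka-connect-bundle/target/hudi-kafka-connect-bundle-*.jar /usr/local/share/kafka/plugins/
```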

After building the package, we need to install Apache Kafka.

### 1 - Starting the environment

To try out the Connect Sink locally, set up a Kafka broker locally. Download the latest Apache Kafka from https://kafka.apache.org/downloads.
Once downloaded and built, run the Zookeeper server and Kafka server using the command line tools.
For runtime dependencies, we encourage using the confluent HDFS connector jars. We have tested our setup with version `10.1.0`.
After downloading the connector, copy the jars from the lib folder to the Kafka Connect classpath.

```bash
export KAFKA_HOME=/path/to/kafka_install_dir
cd $KAFKA_HOME
./bin/zookeeper-server-start.sh ./config/zookeeper.properties
./bin/kafka-server-start.sh ./config/server.properties
confluent-hub install confluentinc/kafka-connect-hdfs:10.1.0
```

Add `confluentinc-kafka-connect-hdfs-10.1.0/lib` to the plugin.path (comma separated) in $HUDI_DIR/hudi-kafka-connect/demo/connect-distributed.properties.
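
For reference, a sketch of what that property might end up looking like (the parent directories are placeholders; point them at wherever the HDFS connector and the Hudi bundle actually live on your machine):

```bash
# In $HUDI_DIR/hudi-kafka-connect/demo/connect-distributed.properties (illustrative paths only)
plugin.path=/path/to/confluentinc-kafka-connect-hdfs-10.1.0/lib,/usr/local/share/kafka/plugins
```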

### 2 - Set up the docker containers

To run the connector locally, we need kafka, zookeeper, hdfs, hive etc. To make the setup easier, we use the docker
containers from the hudi docker demo. Refer to [this link for the setup](https://hudi.apache.org/docs/docker_demo).

Essentially, follow the steps listed here:

/etc/hosts : The demo references many services running in containers by their hostnames. Add the following settings to /etc/hosts:

```bash
127.0.0.1 adhoc-1
127.0.0.1 adhoc-2
127.0.0.1 namenode
127.0.0.1 datanode1
127.0.0.1 hiveserver
127.0.0.1 hivemetastore
127.0.0.1 kafkabroker
127.0.0.1 sparkmaster
127.0.0.1 zookeeper
```

Bring up the docker containers and wait until the kafka cluster is up and running.

```bash
cd $HUDI_DIR/docker
./setup_demo.sh
```

The schema registry and kafka connector can be run from the host system directly (mac/linux).

### 3 - Set up the schema registry

Hudi leverages schema registry to obtain the latest schema when writing records. While it supports most popular schema
registries, we use Confluent schema registry. Download the
latest [confluent platform](https://docs.confluent.io/platform/current/installation/index.html) and run the schema
registry service.

NOTE: You might need to change the port from `8081` to `8082`.

```bash
cd $CONFLUENT_DIR
./bin/kafka-configs --zookeeper localhost --entity-type topics --entity-name _schemas --alter --add-config cleanup.policy=compact
./bin/schema-registry-start etc/schema-registry/schema-registry.properties
```
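
If you do change the port, update the listener in the schema registry properties before starting it. The property name and the curl check below are standard Confluent schema registry usage, shown here only as an illustration:

```bash
# etc/schema-registry/schema-registry.properties
listeners=http://0.0.0.0:8082

# Verify the registry is reachable and listing subjects
curl http://localhost:8082/subjects
```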

### 4 - Create the Hudi Control Topic for Coordination of the transactions

The control topic should only have `1` partition, since it's used to coordinate the Hudi write transactions across the multiple Connect tasks.

```bash
cd $KAFKA_HOME
./bin/kafka-topics.sh --create --topic hudi-control-topic --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092
```
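
To confirm the topic was created with a single partition, it can be described with the standard Kafka CLI:

```bash
./bin/kafka-topics.sh --describe --topic hudi-control-topic --bootstrap-server localhost:9092
```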

### 5 - Create the Hudi Topic for the Sink and insert data into the topic

Open a terminal to execute the following command:

The script lets you configure the number of batches of records to generate, with each batch containing a number of messages and idle time between the batches.

```bash
bash setupKafka.sh -n <num_kafka_messages_per_batch> -b <num_batches>
```
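
For example, assuming the script lives in the demo directory of the Hudi checkout, a small test run could look like this (the batch sizes are arbitrary):

```bash
cd $HUDI_DIR/hudi-kafka-connect/demo
bash setupKafka.sh -n 1000 -b 5
```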

### 6 - Run the Sink connector worker (multiple workers can be run)

Kafka Connect is a distributed platform, with the ability to run one or more workers (each running multiple tasks)
that process the records from the Kafka partitions for the same topic in parallel. We provide a properties file with default properties to start a Hudi connector.

```bash
cd $KAFKA_HOME
./bin/connect-distributed.sh $HUDI_DIR/hudi-kafka-connect/demo/connect-distributed.properties
```
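
Once a worker is up, its health can be checked through the Kafka Connect REST API. This assumes the default REST port `8083`; use whatever `rest.port` is configured in connect-distributed.properties:

```bash
# Lists the connectors registered with the worker (empty until the Hudi sink is added below)
curl http://localhost:8083/connectors
```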

### 7 - To add the Hudi Sink to the Connector (delete it if you want to re-configure)

Once the Connector has started, it will not run the Sink until the Hudi sink is added using the web API. The following
curl APIs can be used to delete and add a new Hudi Sink. Again, a default configuration is provided for the Hudi Sink, which can be changed as needed.
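
A rough sketch of those calls is shown below. The connector name `hudi-sink`, the REST port `8083`, and the JSON config path are assumptions; substitute the sink configuration file shipped with the demo:

```bash
# Delete an existing Hudi sink (harmless if none is registered yet)
curl -X DELETE http://localhost:8083/connectors/hudi-sink

# Add the Hudi sink using the demo configuration
curl -X POST -H "Content-Type: application/json" \
  --data @$HUDI_DIR/hudi-kafka-connect/demo/config-sink.json \
  http://localhost:8083/connectors
```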

Once the sink starts committing, you should see the Hudi table files written under the target base path:

```bash
ls -a /tmp/hoodie/hudi-test-topic
.  .hoodie  partition_1  partition_3
.. partition_0  partition_2  partition_4

ls -lt /tmp/hoodie/hudi-test-topic/.hoodie
total 72
-rw-r--r-- 1 user wheel 0 Sep 13 21:41 20210913214114.commit.requested
drwxr-xr-x 2 user wheel 64 Sep 13 21:41 archived

ls -l /tmp/hoodie/hudi-test-topic/partition_0
total 5168
-rw-r--r-- 1 user wheel 439332 Sep 13 21:43 2E0E6DB44ACC8479059574A2C71C7A7E-0_0-0-0_20210913214114.parquet
-rw-r--r-- 1 user wheel 440179 Sep 13 21:42 3B56FAAAE2BDD04E480C1CBACD463D3E-0_0-0-0_20210913214114.parquet
-rw-r--r-- 1 user wheel 440214 Sep 13 21:43 E200FA75DCD1CED60BE86BCE6BF5D23A-0_0-0-0_20210913214114.parquet
```

### 8 - Querying via Hive

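The `huditesttopic_ro` and `huditesttopic_rt` tables queried below are created by the Hive sync added in this change. As an illustrative sketch only, the sync is driven by Hudi's standard hive sync settings carried in the sink configuration; the key names below follow Hudi's documented `hive_sync` options, while the values are assumptions, so treat the demo sink config as the authoritative reference:

```bash
# Illustrative hive sync settings (values are assumptions based on the demo environment)
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:10000
hoodie.datasource.hive_sync.table=huditesttopic
hoodie.datasource.hive_sync.partition_fields=date
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
```

With the sink running, the tables can then be queried through beeline:
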
```bash
docker exec -it adhoc-2 /bin/bash
beeline -u jdbc:hive2://hiveserver:10000 \
  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
  --hiveconf hive.stats.autogather=false

# List Tables
0: jdbc:hive2://hiveserver:10000> show tables;
+---------------------+--+
|      tab_name       |
+---------------------+--+
| huditesttopic_ro    |
| huditesttopic_rt    |
+---------------------+--+
3 rows selected (1.199 seconds)
0: jdbc:hive2://hiveserver:10000>

# Look at partitions that were added
0: jdbc:hive2://hiveserver:10000> show partitions huditesttopic_rt;
+-------------------+--+
|     partition     |
+-------------------+--+
| date=partition_0  |
| date=partition_1  |
| date=partition_2  |
| date=partition_3  |
| date=partition_4  |
+-------------------+--+
5 rows selected (0.24 seconds)

0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from huditesttopic_rt;
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
+----------------------+---------+----------------------+---------+------------+-----------+--+
```

### 9 - Run async compaction and clustering if scheduled

When using Merge-On-Read (MOR) as the table type, async compaction and clustering can be scheduled when the Sink is
running. Inline compaction and clustering are disabled by default for performance reasons. By default, async