[HUDI-2428] Fix protocol and other issues after stress testing Hudi Kafka Connect (#3656)
* Fixes based on tests and some improvements
* Fix the issues after running stress tests
* Fixing checkstyle issues and updating README

Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>

 * See the License for the specific language governing permissions and
 * limitations under the License.
-->

# Quick Start (demo) guide for Kafka Connect Sink for Hudi

This repo contains a sample project that can be used to start off your own sink connector for Kafka Connect.

## Building the Hudi Sink Connector

The first thing you need to do to start using this connector is to build it. In order to do that, you need to install the following dependencies:

- [Java 1.8+](https://openjdk.java.net/)
- [Apache Maven](https://maven.apache.org/)
- [kcat](https://github.com/edenhill/kcat)

After installing these dependencies, execute the following commands. This will install all the Hudi dependency jars,
including the fat packaged jar that contains all the dependencies required for a functional Hudi Kafka Connect Sink.

```bash
cd $HUDI_DIR
mvn clean -DskipTests install
```

## Incremental Builds

Henceforth, incremental builds can be performed as follows.

```bash
mvn clean -pl hudi-kafka-connect install -DskipTests
mvn clean -pl packaging/hudi-kafka-connect-bundle install
```

## Put hudi connector in Kafka Connect classpath

Next, we need to make sure that the hudi sink connector bundle jar is in the Kafka Connect classpath. Note that the connect
classpath should be the same as the one configured in the connector configuration file.

```bash
cp $HUDI_DIR/packaging/hudi-kafka-connect-bundle/target/hudi-kafka-connect-bundle-0.10.0-SNAPSHOT.jar /usr/local/share/java/hudi-kafka-connect/
```

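As an alternative to relying on the classpath copy above, Kafka Connect can also discover the bundle through the worker's `plugin.path` setting. A minimal sketch of the relevant worker properties follows; the values are illustrative and not taken from the repo's config file:

```properties
# Broker the Connect worker talks to (started in step 1 below)
bootstrap.servers=localhost:9092
# Directory scanned for connector jars; the Hudi bundle jar lives
# under /usr/local/share/java/hudi-kafka-connect/ per the copy above
plugin.path=/usr/local/share/java
```
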
### 1 - Starting the environment

To try out the Connect Sink locally, set up a Kafka broker locally. Download the latest apache kafka from https://kafka.apache.org/downloads.
Once downloaded and built, run the Zookeeper server and Kafka server using the command line tools.

```bash
export KAFKA_HOME=/path/to/kafka_install_dir
cd $KAFKA_HOME
./bin/zookeeper-server-start.sh ./config/zookeeper.properties
./bin/kafka-server-start.sh ./config/server.properties
```

Wait until the kafka cluster is up and running.

### 2 - Set up the schema registry

Hudi leverages a schema registry to obtain the latest schema when writing records. While it supports most popular schema
registries, we use the Confluent schema registry here. Download the latest Confluent schema registry code from
https://github.com/confluentinc/schema-registry and start the schema registry service.

```bash
cd $CONFLUENT_DIR
./bin/schema-registry-start etc/schema-registry/schema-registry.properties
```

### 3 - Create the Hudi Control Topic for Coordination of the transactions

The control topic should only have `1` partition, since it is used to coordinate the Hudi write transactions across the multiple Connect tasks.

```bash
cd $KAFKA_HOME
./bin/kafka-topics.sh --delete --topic hudi-control-topic --bootstrap-server localhost:9092
./bin/kafka-topics.sh --create --topic hudi-control-topic --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092
```

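Since kcat was installed as one of the build dependencies, it can double as a quick sanity check that the control topic exists (this assumes the broker from step 1 is still running on localhost:9092):

```bash
# Print metadata for the control topic; it should report exactly 1 partition
kcat -L -b localhost:9092 -t hudi-control-topic
```
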
### 4 - Create the Hudi Topic for the Sink and insert data into the topic

Open a terminal to execute the following command:

```bash
cd $HUDI_DIR/demo/
bash setupKafka.sh -n <total_kafka_messages>
```

### 5 - Run the Sink connector worker (multiple workers can be run)

Kafka Connect is a distributed platform, with the ability to run one or more workers (each running multiple tasks)
that process the records from the Kafka partitions for the same topic in parallel. We provide a properties file with
default properties to start a Hudi connector.

Note that if multiple workers need to be run, the webserver needs to be reconfigured for subsequent workers to ensure
successful running of the workers.

```bash
cd $KAFKA_HOME
./bin/connect-distributed.sh $HUDI_DIR/hudi-kafka-connect/demo/connect-distributed.properties
```

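The note above about reconfiguring the webserver amounts to giving each additional worker its own REST port, since the default port is already taken by the first worker. A sketch of the override a second worker's copy of the properties file might carry (the port value is illustrative):

```properties
# The first worker's REST server already occupies the default port 8083,
# so a second worker on the same host must listen elsewhere
listeners=http://localhost:8084
```
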
### 6 - To add the Hudi Sink to the Connector (delete it if you want to re-configure)

Once the Connector has started, it will not run the Sink until the Hudi sink is added using the web API. The following
curl APIs can be used to delete and add a new Hudi Sink. Again, a default configuration is provided for the Hudi Sink,
which can be changed based on the desired properties.

```bash
curl -X DELETE http://localhost:8083/connectors/hudi-sink
curl -X POST -H "Content-Type:application/json" -d @$HUDI_DIR/hudi-kafka-connect/demo/config-sink.json http://localhost:8083/connectors
```

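The `config-sink.json` payload follows the standard Kafka Connect REST format: a connector `name` plus a `config` map. A minimal sketch of its shape is below; the connector class and topic values are placeholders, not taken from the repo, so consult the actual file under `hudi-kafka-connect/demo/`:

```json
{
  "name": "hudi-sink",
  "config": {
    "connector.class": "<Hudi sink connector class, see the demo file>",
    "tasks.max": "4",
    "topics": "<the Kafka topic populated in step 4>"
  }
}
```
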
Now, you should see that the connector is created and tasks are running.

```bash
curl -X GET -H "Content-Type:application/json" http://localhost:8083/connectors
["hudi-sink"]
curl -X GET -H "Content-Type:application/json" http://localhost:8083/connectors/hudi-sink/status | jq
```