[HUDI-394] Provide a basic implementation of test suite
262 hudi-integ-test/README.md Normal file
@@ -0,0 +1,262 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

This page describes in detail how to run end-to-end tests on a Hudi dataset. These tests help improve our confidence
in a release and can also be used to run large-scale performance benchmarks.

# Objectives

1. Test with different versions of core libraries and components such as `hdfs`, `parquet`, `spark`,
`hive` and `avro`.
2. Generate different types of workloads across different dimensions such as `payload size`, `number of updates`,
`number of inserts`, `number of partitions`
3. Perform multiple types of operations such as `insert`, `bulk_insert`, `upsert`, `compact`, `query`
4. Support custom post process actions and validations

# High Level Design

The Hudi test suite runs as a long-running Spark job. The suite is divided into the following high-level components:

## Workload Generation

This component generates the workload: `inserts`, `upserts`, etc.

## Workload Scheduling

Depending on the type of workload generated, data is either ingested into the target Hudi
dataset or the corresponding workload operation is executed. For example, compaction does not necessarily need a workload
to be generated/ingested but may only require an execution to be scheduled.

## Other actions/operations

The test suite supports different types of operations besides ingestion, such as Hive query execution, the clean action, etc.

# Usage instructions

## Entry class to the test suite

```
org.apache.hudi.testsuite.HoodieTestSuiteJob.java - Entry point of the Hudi test suite job. This
class wraps all the functionality required to run a configurable integration suite.
```

## Configurations required to run the job
```
org.apache.hudi.testsuite.HoodieTestSuiteJob.HoodieTestSuiteConfig - Config class that drives the behavior of the
integration test suite. This class extends the DeltaStreamer config. Look at the HoodieDeltaStreamer
documentation to learn about all the available configs applicable to your test suite.
```

## Generating a custom Workload Pattern

There are two ways to generate a workload pattern:

1. Programmatically

Choose to write up the entire DAG of operations programmatically; take a look at the `WorkflowDagGenerator` class
(a hedged sketch of a custom generator is shown after the spark-submit command below).
Once you're ready with the DAG you want to execute, simply pass the class name as follows:

```
spark-submit
...
...
--class org.apache.hudi.testsuite.HoodieTestSuiteJob
--workload-generator-classname org.apache.hudi.testsuite.dag.scheduler.<your_workflowdaggenerator>
...
```
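
For orientation, here is a minimal sketch of what such a custom generator could look like. The package, node and
builder-method names below (`WorkflowDag`, `DagNode`, `InsertNode`, `UpsertNode`, `Config.newBuilder()`,
`addChildNode`, ...) are assumptions paraphrased from the example generators bundled with the test suite and may not
match the exact signatures; treat `WorkflowDagGenerator` and the shipped example DAGs as the authoritative reference.

```java
// Illustrative sketch only: the imports, node classes and builder methods below are
// assumptions modelled on the example generators in this module; the real API may differ.
package org.apache.hudi.testsuite.dag.scheduler;

import java.util.Collections;

import org.apache.hudi.testsuite.configuration.DeltaConfig.Config; // assumed location of the node config
import org.apache.hudi.testsuite.dag.WorkflowDag;
import org.apache.hudi.testsuite.dag.WorkflowDagGenerator;
import org.apache.hudi.testsuite.dag.nodes.DagNode;
import org.apache.hudi.testsuite.dag.nodes.InsertNode;
import org.apache.hudi.testsuite.dag.nodes.UpsertNode;

public class SimpleInsertUpsertDagGenerator implements WorkflowDagGenerator {

  @Override
  public WorkflowDag build() {
    // Root node: insert a batch of records spread across a few partitions.
    DagNode root = new InsertNode(Config.newBuilder()
        .withNumRecordsToInsert(1000)
        .withNumInsertPartitions(3)
        .withRecordSize(1000)
        .build());

    // Child node: upsert a subset of the records produced by the insert above.
    DagNode upsert = new UpsertNode(Config.newBuilder()
        .withNumRecordsToUpdate(100)
        .withNumUpsertPartitions(2)
        .withRecordSize(1000)
        .build());

    // Wire the DAG: the upsert is scheduled only after the insert completes.
    root.addChildNode(upsert);
    return new WorkflowDag(Collections.singletonList(root));
  }
}
```

Once such a class is compiled into the test suite bundle, its fully qualified name is what you pass via
`--workload-generator-classname` above.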

2. YAML file

Choose to write up the entire DAG of operations in YAML; take a look at `complex-workload-dag-cow.yaml` or
`complex-workload-dag-mor.yaml`.
Once you're ready with the DAG you want to execute, simply pass the yaml file path as follows:

```
spark-submit
...
...
--class org.apache.hudi.testsuite.HoodieTestSuiteJob
--workload-yaml-path /path/to/your-workflow-dag.yaml
...
```

## Building the test suite

The test suite can be found in the `hudi-integ-test` module. Use the `prepare_integration_suite.sh` script to
build the test suite; you can provide different parameters to the script.

```
shell$ ./prepare_integration_suite.sh --help
Usage: prepare_integration_suite.sh
--spark-command, prints the spark command
-h, hdfs-version
-s, spark version
-p, parquet version
-a, avro version
-s, hive version
```

```
shell$ ./prepare_integration_suite.sh
....
....
Final command : mvn clean install -DskipTests
```

## Running on the cluster or in your local machine
Copy over the necessary files and jars that are required to your cluster and then run the following spark-submit
command after replacing the parameters with the correct values.
NOTE : The properties file should have all the necessary information required to ingest into a Hudi dataset. For more
information on what properties need to be set, take a look at the test suite section under the demo steps.
```
shell$ ./prepare_integration_suite.sh --spark-command
spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 --master prepare_integration_suite.sh --deploy-mode
--properties-file --class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob target/hudi-integ-test-0.6.0-SNAPSHOT.jar --source-class --source-ordering-field --input-base-path --target-base-path --target-table --props --storage-type --payload-class --workload-yaml-path --input-file-size --<deltastreamer-ingest>
```

## Running through a test-case (local)
Take a look at `TestHoodieTestSuiteJob` to see how you can run the entire suite using JUnit; a hedged sketch of such a test follows.
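
For example, a JUnit test that drives the suite locally might look roughly like the sketch below. The config field
names mirror the CLI options of `HoodieTestSuiteJob` (`--input-base-path`, `--workload-yaml-path`, ...), but the
exact field and method names (including `runTestSuite()`) are assumptions; use `TestHoodieTestSuiteJob` as the
authoritative reference.

```java
// Hypothetical sketch only: field and method names are assumptions mirroring the CLI
// options of HoodieTestSuiteJob; consult TestHoodieTestSuiteJob for the real API.
import org.apache.hudi.testsuite.HoodieTestSuiteJob;
import org.apache.hudi.testsuite.HoodieTestSuiteJob.HoodieTestSuiteConfig;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.Test;

public class MyLocalTestSuite {

  @Test
  public void runSmallCopyOnWriteSuite() throws Exception {
    HoodieTestSuiteConfig cfg = new HoodieTestSuiteConfig();
    cfg.targetBasePath = "/tmp/hudi-integ-test/output";            // --target-base-path
    cfg.inputBasePath = "/tmp/hudi-integ-test/input";              // --input-base-path
    cfg.targetTableName = "test_table";                            // --target-table
    cfg.propsFilePath = "path/to/test-source.properties";          // --props
    cfg.sourceClassName = "org.apache.hudi.utilities.sources.AvroDFSSource"; // --source-class
    cfg.sourceOrderingField = "timestamp";                         // --source-ordering-field
    cfg.workloadYamlPath = "path/to/your-workflow-dag.yaml";       // --workload-yaml-path
    cfg.storageType = "COPY_ON_WRITE";                             // --storage-type

    JavaSparkContext jsc = new JavaSparkContext("local[2]", "hudi-test-suite");
    try {
      // Assumed entry method; the test class in this module shows the actual invocation.
      new HoodieTestSuiteJob(cfg, jsc).runTestSuite();
    } finally {
      jsc.stop();
    }
  }
}
```
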
## Running an end-to-end test suite in a local Docker environment

```
docker exec -it adhoc-2 /bin/bash
# COPY_ON_WRITE tables
=========================
## Run the following command to start the test suite
spark-submit \
--packages com.databricks:spark-avro_2.11:4.0.0 \
--conf spark.task.cpus=1 \
--conf spark.executor.cores=1 \
--conf spark.task.maxFailures=100 \
--conf spark.memory.fraction=0.4 \
--conf spark.rdd.compress=true \
--conf spark.kryoserializer.buffer.max=2000m \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.memory.storageFraction=0.1 \
--conf spark.shuffle.service.enabled=true \
--conf spark.sql.hive.convertMetastoreParquet=false \
--conf spark.ui.port=5555 \
--conf spark.driver.maxResultSize=12g \
--conf spark.executor.heartbeatInterval=120s \
--conf spark.network.timeout=600s \
--conf spark.eventLog.overwrite=true \
--conf spark.eventLog.enabled=true \
--conf spark.yarn.max.executor.failures=10 \
--conf spark.sql.catalogImplementation=hive \
--conf spark.sql.shuffle.partitions=1000 \
--class org.apache.hudi.testsuite.HoodieTestSuiteJob $HUDI_TEST_SUITE_BUNDLE \
--source-ordering-field timestamp \
--target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
--input-base-path /user/hive/warehouse/hudi-integ-test-suite/input \
--target-table test_table \
--props /var/hoodie/ws/docker/demo/config/testsuite/test-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--source-limit 300000 \
--source-class org.apache.hudi.utilities.sources.AvroDFSSource \
--input-file-size 125829120 \
--workload-yaml-path /var/hoodie/ws/docker/demo/config/testsuite/complex-workflow-dag-cow.yaml \
--storage-type COPY_ON_WRITE \
--compact-scheduling-minshare 1 \
--hoodie-conf "hoodie.deltastreamer.source.test.num_partitions=100" \
--hoodie-conf "hoodie.deltastreamer.source.test.datagen.use_rocksdb_for_storing_existing_keys=false" \
--hoodie-conf "hoodie.deltastreamer.source.test.max_unique_records=100000000" \
--hoodie-conf "hoodie.embed.timeline.server=false" \
--hoodie-conf "hoodie.datasource.write.recordkey.field=_row_key" \
--hoodie-conf "hoodie.deltastreamer.source.dfs.root=/user/hive/warehouse/hudi-integ-test-suite/input" \
--hoodie-conf "hoodie.datasource.write.keygenerator.class=org.apache.hudi.ComplexKeyGenerator" \
--hoodie-conf "hoodie.datasource.write.partitionpath.field=timestamp" \
--hoodie-conf "hoodie.deltastreamer.schemaprovider.source.schema.file=/var/hoodie/ws/docker/demo/config/testsuite/source.avsc" \
--hoodie-conf "hoodie.datasource.hive_sync.assume_date_partitioning=false" \
--hoodie-conf "hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:10000/" \
--hoodie-conf "hoodie.datasource.hive_sync.database=testdb" \
--hoodie-conf "hoodie.datasource.hive_sync.table=test_table" \
--hoodie-conf "hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor" \
--hoodie-conf "hoodie.datasource.hive_sync.assume_date_partitioning=true" \
--hoodie-conf "hoodie.datasource.write.keytranslator.class=org.apache.hudi.DayBasedPartitionPathKeyTranslator" \
--hoodie-conf "hoodie.deltastreamer.schemaprovider.target.schema.file=/var/hoodie/ws/docker/demo/config/testsuite/source.avsc"
...
...
2019-11-03 05:44:47 INFO DagScheduler:69 - ----------- Finished workloads ----------
2019-11-03 05:44:47 INFO HoodieTestSuiteJob:138 - Finished scheduling all tasks
...
2019-11-03 05:44:48 INFO SparkContext:54 - Successfully stopped SparkContext
# MERGE_ON_READ tables
=========================
## Run the following command to start the test suite
spark-submit \
--packages com.databricks:spark-avro_2.11:4.0.0 \
--conf spark.task.cpus=1 \
--conf spark.executor.cores=1 \
--conf spark.task.maxFailures=100 \
--conf spark.memory.fraction=0.4 \
--conf spark.rdd.compress=true \
--conf spark.kryoserializer.buffer.max=2000m \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.memory.storageFraction=0.1 \
--conf spark.shuffle.service.enabled=true \
--conf spark.sql.hive.convertMetastoreParquet=false \
--conf spark.ui.port=5555 \
--conf spark.driver.maxResultSize=12g \
--conf spark.executor.heartbeatInterval=120s \
--conf spark.network.timeout=600s \
--conf spark.eventLog.overwrite=true \
--conf spark.eventLog.enabled=true \
--conf spark.yarn.max.executor.failures=10 \
--conf spark.sql.catalogImplementation=hive \
--conf spark.sql.shuffle.partitions=1000 \
--class org.apache.hudi.testsuite.HoodieTestSuiteJob $HUDI_TEST_SUITE_BUNDLE \
--source-ordering-field timestamp \
--target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
--input-base-path /user/hive/warehouse/hudi-integ-test-suite/input \
--target-table test_table \
--props /var/hoodie/ws/docker/demo/config/testsuite/test-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--source-limit 300000 \
--source-class org.apache.hudi.utilities.sources.AvroDFSSource \
--input-file-size 125829120 \
--workload-yaml-path /var/hoodie/ws/docker/demo/config/testsuite/complex-workflow-dag-mor.yaml \
--storage-type MERGE_ON_READ \
--compact-scheduling-minshare 1 \
--hoodie-conf "hoodie.deltastreamer.source.test.num_partitions=100" \
--hoodie-conf "hoodie.deltastreamer.source.test.datagen.use_rocksdb_for_storing_existing_keys=false" \
--hoodie-conf "hoodie.deltastreamer.source.test.max_unique_records=100000000" \
--hoodie-conf "hoodie.embed.timeline.server=false" \
--hoodie-conf "hoodie.datasource.write.recordkey.field=_row_key" \
--hoodie-conf "hoodie.deltastreamer.source.dfs.root=/user/hive/warehouse/hudi-integ-test-suite/input" \
--hoodie-conf "hoodie.datasource.write.keygenerator.class=org.apache.hudi.ComplexKeyGenerator" \
--hoodie-conf "hoodie.datasource.write.partitionpath.field=timestamp" \
--hoodie-conf "hoodie.deltastreamer.schemaprovider.source.schema.file=/var/hoodie/ws/docker/demo/config/testsuite/source.avsc" \
--hoodie-conf "hoodie.datasource.hive_sync.assume_date_partitioning=false" \
--hoodie-conf "hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:10000/" \
--hoodie-conf "hoodie.datasource.hive_sync.database=testdb" \
--hoodie-conf "hoodie.datasource.hive_sync.table=test_table" \
--hoodie-conf "hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor" \
--hoodie-conf "hoodie.datasource.hive_sync.assume_date_partitioning=true" \
--hoodie-conf "hoodie.datasource.write.keytranslator.class=org.apache.hudi.DayBasedPartitionPathKeyTranslator" \
--hoodie-conf "hoodie.deltastreamer.schemaprovider.target.schema.file=/var/hoodie/ws/docker/demo/config/testsuite/source.avsc"
...
...
2019-11-03 05:44:47 INFO DagScheduler:69 - ----------- Finished workloads ----------
2019-11-03 05:44:47 INFO HoodieTestSuiteJob:138 - Finished scheduling all tasks
...
2019-11-03 05:44:48 INFO SparkContext:54 - Successfully stopped SparkContext
```