[HUDI-3312] Fixing spark yaml and adding hive validation to integ test suite (#4731)

Sivabalan Narayanan, 2022-02-08 00:40:36 -05:00, committed via GitHub
parent 8ab6f17149
commit 0ab1a8ec80
14 changed files with 295 additions and 38 deletions

@@ -82,8 +82,8 @@ spark-submit
 2.YAML file
-Choose to write up the entire DAG of operations in YAML, take a look at `complex-dag-cow.yaml` or
-`complex-dag-mor.yaml`.
+Choose to write up the entire DAG of operations in YAML, take a look at `simple-deltastreamer.yaml` or
+`simple-deltastreamer.yaml`.
 Once you're ready with the DAG you want to execute, simply pass the yaml file path as follows:
 ```
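
The dags referenced in this hunk follow the test suite's YAML DAG format. A minimal sketch, modeled on the dags bundled under docker/demo/config/test-suite/ — node names, key names, and values here are illustrative assumptions, not the exact contents of simple-deltastreamer.yaml:

```yaml
# Illustrative sketch of a test-suite DAG; names and values are examples only.
dag_name: example-dag.yaml
dag_rounds: 1                    # number of times the whole dag is executed
dag_intermittent_delay_mins: 1   # delay between rounds
dag_content:
  first_insert:
    config:
      record_size: 1000
      num_partitions_insert: 1
      repeat_count: 2
      num_records_insert: 100
    type: InsertNode
    deps: none
  first_upsert:
    config:
      record_size: 1000
      num_partitions_upsert: 1
      repeat_count: 2
      num_records_upsert: 100
    type: UpsertNode
    deps: first_insert
```

Each node declares its operation via `type` and chains to its predecessor via `deps`, so the file as a whole describes the DAG of operations the suite executes.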
@@ -177,7 +177,7 @@ cd /opt
 Copy the integration tests jar into the docker container
 ```
-docker cp packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.10.0-SNAPSHOT.jar adhoc-2:/opt
+docker cp packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.11.0-SNAPSHOT.jar adhoc-2:/opt
 ```
 ```
@@ -217,7 +217,7 @@ spark-submit \
 --conf spark.driver.extraClassPath=/var/demo/jars/* \
 --conf spark.executor.extraClassPath=/var/demo/jars/* \
 --class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob \
-/opt/hudi-integ-test-bundle-0.10.0-SNAPSHOT.jar \
+/opt/hudi-integ-test-bundle-0.11.0-SNAPSHOT.jar \
 --source-ordering-field test_suite_source_ordering_field \
 --use-deltastreamer \
 --target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
@@ -227,7 +227,7 @@ spark-submit \
 --schemaprovider-class org.apache.hudi.integ.testsuite.schema.TestSuiteFileBasedSchemaProvider \
 --source-class org.apache.hudi.utilities.sources.AvroDFSSource \
 --input-file-size 125829120 \
---workload-yaml-path file:/var/hoodie/ws/docker/demo/config/test-suite/complex-dag-cow.yaml \
+--workload-yaml-path file:/var/hoodie/ws/docker/demo/config/test-suite/simple-deltastreamer.yaml \
 --workload-generator-classname org.apache.hudi.integ.testsuite.dag.WorkflowDagGenerator \
 --table-type COPY_ON_WRITE \
 --compact-scheduling-minshare 1 \
@@ -264,7 +264,7 @@ spark-submit \
 --conf spark.driver.extraClassPath=/var/demo/jars/* \
 --conf spark.executor.extraClassPath=/var/demo/jars/* \
 --class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob \
-/opt/hudi-integ-test-bundle-0.10.0-SNAPSHOT.jar \
+/opt/hudi-integ-test-bundle-0.11.0-SNAPSHOT.jar \
 --source-ordering-field test_suite_source_ordering_field \
 --use-deltastreamer \
 --target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
@@ -274,7 +274,7 @@ spark-submit \
 --schemaprovider-class org.apache.hudi.integ.testsuite.schema.TestSuiteFileBasedSchemaProvider \
 --source-class org.apache.hudi.utilities.sources.AvroDFSSource \
 --input-file-size 125829120 \
---workload-yaml-path file:/var/hoodie/ws/docker/demo/config/test-suite/complex-dag-mor.yaml \
+--workload-yaml-path file:/var/hoodie/ws/docker/demo/config/test-suite/simple-deltastreamer.yaml \
 --workload-generator-classname org.apache.hudi.integ.testsuite.dag.WorkflowDagGenerator \
 --table-type MERGE_ON_READ \
 --compact-scheduling-minshare 1 \
@@ -308,16 +308,16 @@ contents both via spark datasource and hive table via spark sql engine. Hive val
 If you have "ValidateDatasetNode" in your dag, do not replace hive jars as instructed above. Spark sql engine does not
 go well w/ hive2* jars. So, after running docker setup, follow the below steps.
 ```
-docker cp packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.10.0-SNAPSHOT.jar adhoc-2:/opt/
+docker cp packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.11.0-SNAPSHOT.jar adhoc-2:/opt/
 docker cp docker/demo/config/test-suite/test.properties adhoc-2:/opt/
 ```
 Also copy your dag of interest to adhoc-2:/opt/
 ```
-docker cp docker/demo/config/test-suite/complex-dag-cow.yaml adhoc-2:/opt/
+docker cp docker/demo/config/test-suite/simple-deltastreamer.yaml adhoc-2:/opt/
 ```
 For repeated runs, two additional configs need to be set. "dag_rounds" and "dag_intermittent_delay_mins".
-This means that your dag will be repeated for N times w/ a delay of Y mins between each round. Note: complex-dag-cow.yaml
+This means that your dag will be repeated for N times w/ a delay of Y mins between each round. Note: simple-deltastreamer.yaml
 already has all these configs set. So no changes required just to try it out.
 Also, ValidateDatasetNode can be configured in two ways. Either with "delete_input_data" set to true or without
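
The hunk above describes the "dag_rounds"/"dag_intermittent_delay_mins" configs and the ValidateDatasetNode's "delete_input_data" toggle. A hedged sketch of how these might appear in a dag file — node name, placement, and the `deps` target are illustrative assumptions modeled on the bundled test-suite dags:

```yaml
# Illustrative only. Top-level repeat configs: run the dag N times with a
# Y-minute delay between rounds.
dag_rounds: 3
dag_intermittent_delay_mins: 1
dag_content:
  # ... insert/upsert nodes elided ...
  last_validate:
    config:
      # When true, input data consumed so far is (per the docs) deleted after
      # validation, so each round validates only its own freshly ingested input.
      delete_input_data: true
    type: ValidateDatasetNode
    deps: first_upsert   # hypothetical predecessor node name
```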
@@ -457,7 +457,7 @@ spark-submit \
 --conf spark.driver.extraClassPath=/var/demo/jars/* \
 --conf spark.executor.extraClassPath=/var/demo/jars/* \
 --class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob \
-/opt/hudi-integ-test-bundle-0.10.0-SNAPSHOT.jar \
+/opt/hudi-integ-test-bundle-0.11.0-SNAPSHOT.jar \
 --source-ordering-field test_suite_source_ordering_field \
 --use-deltastreamer \
 --target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
@@ -467,7 +467,7 @@ spark-submit \
 --schemaprovider-class org.apache.hudi.integ.testsuite.schema.TestSuiteFileBasedSchemaProvider \
 --source-class org.apache.hudi.utilities.sources.AvroDFSSource \
 --input-file-size 125829120 \
---workload-yaml-path file:/opt/complex-dag-cow.yaml \
+--workload-yaml-path file:/opt/simple-deltastreamer.yaml \
 --workload-generator-classname org.apache.hudi.integ.testsuite.dag.WorkflowDagGenerator \
 --table-type COPY_ON_WRITE \
 --compact-scheduling-minshare 1 \
@@ -486,8 +486,8 @@ If you wish to enable metrics add below properties as well
 Few ready to use dags are available under docker/demo/config/test-suite/ that could give you an idea for long running
 dags.
 ```
-complex-dag-cow.yaml: simple 1 round dag for COW table.
-complex-dag-mor.yaml: simple 1 round dag for MOR table.
+simple-deltastreamer.yaml: simple 1 round dag for COW table.
+simple-deltastreamer.yaml: simple 1 round dag for MOR table.
 cow-clustering-example.yaml : dag with 3 rounds, in which inline clustering will trigger during 2nd iteration.
 cow-long-running-example.yaml : long running dag with 50 iterations. only 1 partition is used.
 cow-long-running-multi-partitions.yaml: long running dag wit 50 iterations with multiple partitions.