Fixing README for hudi test suite long running job (#2578)
parent 8c2197ae5e · commit 9f5e8cc7c3
@@ -270,20 +270,31 @@ spark-submit \
--compact-scheduling-minshare 1
```

## Running long running test suite in Local Docker environment

For a long running test suite, validation has to be done differently. The idea is to run the same dag repeatedly for N
iterations. Hence "ValidateDatasetNode" is introduced, which reads the entire input data and compares it against the
hudi contents, both via the spark datasource and via the hive table using the spark sql engine. Hive validation is
configurable.

If you have "ValidateDatasetNode" in your dag, do not replace the hive jars as instructed above, since the spark sql
engine does not go well with hive2* jars. So, after running the docker setup, follow the steps below.
```
docker cp packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar adhoc-2:/opt/
docker cp demo/config/test-suite/test.properties adhoc-2:/opt/
```

Also copy your dag of interest to adhoc-2:/opt/

```
docker cp demo/config/test-suite/complex-dag-cow.yaml adhoc-2:/opt/
```
For repeated runs, two additional configs need to be set: "dag_rounds" and "dag_intermittent_delay_mins". This means
that your dag will be repeated N times, with a delay of Y mins between each round. Note: complex-dag-cow.yaml already
has all these configs set, so no changes are required just to try it out.
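
For reference, this is roughly how the two configs sit at the top of a dag yaml. The dag_content body is abbreviated
in this sketch; refer to the bundled dags for complete node definitions.

```
dag_rounds: 10
dag_intermittent_delay_mins: 10
dag_content:
  # node definitions (Insert, Upsert, ValidateDatasetNode, ...) go here
```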

Also, ValidateDatasetNode can be configured in two ways: either with "delete_input_data: true" set, or with the config
left unset. When "delete_input_data" is set for ValidateDatasetNode, the entire input data will be deleted once
validation completes. So the suggestion is to use ValidateDatasetNode as the last node in the dag, with
"delete_input_data" set.
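
As a rough sketch, such a terminal validation node might look like the following. The node name "last_validate" and
the exact layout here are illustrative; see the bundled dags under docker/demo/config/test-suite/ for the
authoritative format.

```
last_validate:              # illustrative node name
  config:
    delete_input_data: true
  type: ValidateDatasetNode
```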

Example dag:
```
Insert
@@ -294,7 +305,7 @@ Example dag:
If the above dag is run with "dag_rounds" = 10 and "dag_intermittent_delay_mins" = 10, then this dag will run 10 times
with a 10 min delay between every run. At the end of every run, records written as part of that round will be
validated. At the end of each validation, all contents of the input are deleted.

To illustrate each round:
```
Round1:
insert => inputPath/batch1
@@ -322,6 +333,12 @@ may not scale beyond certain point since input data as well as hudi content's ke
every cycle.

Let's see an example where you don't set "delete_input_data" as part of validation.
```
Insert
Upsert
ValidateDatasetNode
```
Here is the illustration of each round:
```
Round1:
insert => inputPath/batch1
@@ -383,26 +400,13 @@ Above dag was just an example for illustration purposes. But you can make it com
Upsert
Validate w/ deletion
```

Once you have copied the jar, test.properties and your dag to adhoc-2:/opt/, you can run the following command to
execute the test suite job.
```
docker exec -it adhoc-2 /bin/bash
```

Sample COW command with repeated runs.
```
spark-submit \
--packages org.apache.spark:spark-avro_2.11:2.4.0 \
@@ -424,25 +428,45 @@ spark-submit \
--conf spark.driver.extraClassPath=/var/demo/jars/* \
--conf spark.executor.extraClassPath=/var/demo/jars/* \
--class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob \
/opt/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar \
--source-ordering-field test_suite_source_ordering_field \
--use-deltastreamer \
--target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
--input-base-path /user/hive/warehouse/hudi-integ-test-suite/input \
--target-table table1 \
--props test.properties \
--schemaprovider-class org.apache.hudi.integ.testsuite.schema.TestSuiteFileBasedSchemaProvider \
--source-class org.apache.hudi.utilities.sources.AvroDFSSource \
--input-file-size 125829120 \
--workload-yaml-path file:/opt/complex-dag-cow.yaml \
--workload-generator-classname org.apache.hudi.integ.testsuite.dag.WorkflowDagGenerator \
--table-type COPY_ON_WRITE \
--compact-scheduling-minshare 1 \
--clean-input \
--clean-output
```
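
As an optional sanity check while rounds are executing, you can confirm from another shell that output is
accumulating; the paths below are the same ones used by the command above:

```
docker exec -it adhoc-2 /bin/bash
hdfs dfs -ls /user/hive/warehouse/hudi-integ-test-suite/output/
```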

A few ready to use dags are available under docker/demo/config/test-suite/ that could give you an idea for long
running dags.
```
complex-dag-cow.yaml: simple 1 round dag for COW table.
complex-dag-mor.yaml: simple 1 round dag for MOR table.
cow-clustering-example.yaml: dag with 3 rounds, in which inline clustering will trigger during the 2nd iteration.
cow-long-running-example.yaml: long running dag with 50 iterations. Only 1 partition is used.
cow-long-running-multi-partitions.yaml: long running dag with 50 iterations and multiple partitions.
```
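
To try one of these instead of complex-dag-cow.yaml, copy it over the same way and point the job at it. For example,
using cow-long-running-example.yaml from the list above:

```
docker cp demo/config/test-suite/cow-long-running-example.yaml adhoc-2:/opt/
```

Then, in the spark-submit command above, set:

```
--workload-yaml-path file:/opt/cow-long-running-example.yaml \
```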

To run test suite jobs for a MOR table, pretty much any of these dags can be used as is. The only change is in the
spark-submit command, where you need to fix the table type.
```
--table-type MERGE_ON_READ
```
But if you have to switch from one table type to the other, ensure you clean up all test paths explicitly before
switching to a different table type.
```
hdfs dfs -rm -r /user/hive/warehouse/hudi-integ-test-suite/output/
hdfs dfs -rm -r /user/hive/warehouse/hudi-integ-test-suite/input/
```
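
For instance, assuming complex-dag-mor.yaml has been copied to adhoc-2:/opt/ as shown earlier, only these lines of the
COW command above would change (a fragment, not a full command):

```
--workload-yaml-path file:/opt/complex-dag-mor.yaml \
--table-type MERGE_ON_READ \
```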

As of now, "ValidateDatasetNode" uses spark data source and hive tables for comparison. Hence COW and the real time
view in MOR can be tested.