Fixing README for hudi test suite long running job (#2578)
## Running long running test suite in Local Docker environment

For the long running test suite, validation has to be done differently. The idea is to run the same dag in a repeated manner for N iterations. Hence "ValidateDatasetNode" is introduced, which will read the entire input data and compare it with the hudi contents, both via the spark datasource and via the hive table using the spark sql engine. Hive validation is configurable.
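As a rough sketch of how such a dag might be laid out (the `dag_rounds`, `dag_intermittent_delay_mins`, `dag_content` and `delete_input_data` keys appear elsewhere in this README; the node layout and the `validate_hive` toggle shown here are assumptions for illustration, not the exact schema):

```
dag_rounds: 2
dag_intermittent_delay_mins: 1
dag_content:
  first_insert:
    type: InsertNode        # illustrative node entry
    deps: none
  validate:
    type: ValidateDatasetNode
    deps: first_insert
    config:
      delete_input_data: true
      validate_hive: false  # hypothetical name for the configurable hive validation
```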
If you have "ValidateDatasetNode" in your dag, do not replace the hive jars as instructed above. The spark sql engine does not go well w/ hive2* jars. So, after running the docker setup, follow the steps below.
```
docker cp packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar adhoc-2:/opt/
docker cp demo/config/test-suite/test.properties adhoc-2:/opt/
```
Also copy your dag of interest to adhoc-2:/opt/

```
docker cp demo/config/test-suite/complex-dag-cow.yaml adhoc-2:/opt/
```
For repeated runs, two additional configs need to be set: "dag_rounds" and "dag_intermittent_delay_mins". This means that your dag will be repeated N times, with a delay of Y mins between each round. Note: complex-dag-cow.yaml already has all these configs set, so no changes are required just to try it out.
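Concretely, the two keys sit at the top level of the dag yaml (the values here mirror the 10-round, 10-minute example used below):

```
dag_rounds: 10                   # N: number of times the dag is repeated
dag_intermittent_delay_mins: 10  # Y: delay in minutes between rounds
```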
Also, ValidateDatasetNode can be configured in two ways: either with "delete_input_data: true" set, or with the config not set. When "delete_input_data" is set for ValidateDatasetNode, the entire input data will be deleted once validation is complete. So the suggestion is to use ValidateDatasetNode as the last node in the dag, with "delete_input_data".
Example dag:

```
Insert
...
```
If the above dag is run with "dag_rounds" = 10 and "dag_intermittent_delay_mins" = 10, then this dag will run 10 times, with a 10 min delay between every run. At the end of every run, the records written as part of that round will be validated. At the end of each validation, all contents of the input are deleted.

To illustrate each round:
```
Round1:
insert => inputPath/batch1
...
```

Running without deleting the input may not scale beyond a certain point, since the input data as well as the hudi contents keep growing with every cycle.
Let's see an example where you don't set "delete_input_data" as part of validation.

```
Insert
Upsert
ValidateDatasetNode
```

Here is the illustration of each round:
```
Round1:
insert => inputPath/batch1
...
```

The above dag was just an example for illustration purposes, but you can make it as complex as needed:

```
...
Validate w/o deleting
Upsert
Validate w/ deletion
```
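The two validation modes illustrated above can be sketched as a tiny plain-Python simulation (this illustrates the per-round bookkeeping only; it is not Hudi code):

```python
# Simulates rounds of: insert a new input batch, validate, optionally delete input.
def run_rounds(dag_rounds, delete_input_data):
    input_batches = []   # stands in for inputPath/batch*
    validated = []       # what ValidateDatasetNode sees each round
    for n in range(1, dag_rounds + 1):
        input_batches.append("batch%d" % n)    # insert writes a new batch
        validated.append(list(input_batches))  # validation reads the entire input
        if delete_input_data:
            input_batches.clear()              # input wiped after validation
    return validated

# With deletion, each round validates only its own batch.
print(run_rounds(3, True))   # [['batch1'], ['batch2'], ['batch3']]
# Without deletion, the validated input keeps growing every cycle.
print(run_rounds(3, False))  # [['batch1'], ['batch1', 'batch2'], ['batch1', 'batch2', 'batch3']]
```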
With this dag, you can set the two additional configs "dag_rounds" and "dag_intermittent_delay_mins" and have a long running test suite.

```
dag_rounds: 1
dag_intermittent_delay_mins: 10
dag_content:
Insert
Upsert
Delete
Validate w/o deleting
Insert
Rollback
Validate w/o deleting
Upsert
Validate w/ deletion
```
Sample COW command with repeated runs: once you have copied the jar, test.properties and your dag to adhoc-2:/opt/, you can run the following command to execute the test suite job.

```
docker exec -it adhoc-2 /bin/bash
```

Sample COW command:
```
spark-submit \
--packages org.apache.spark:spark-avro_2.11:2.4.0 \
...
--conf spark.driver.extraClassPath=/var/demo/jars/* \
--conf spark.executor.extraClassPath=/var/demo/jars/* \
--class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob \
/opt/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar \
--source-ordering-field test_suite_source_ordering_field \
--use-deltastreamer \
--target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
--input-base-path /user/hive/warehouse/hudi-integ-test-suite/input \
--target-table table1 \
--props test.properties \
--schemaprovider-class org.apache.hudi.integ.testsuite.schema.TestSuiteFileBasedSchemaProvider \
--source-class org.apache.hudi.utilities.sources.AvroDFSSource \
--input-file-size 125829120 \
--workload-yaml-path file:/opt/complex-dag-cow.yaml \
--workload-generator-classname org.apache.hudi.integ.testsuite.dag.WorkflowDagGenerator \
--table-type COPY_ON_WRITE \
--compact-scheduling-minshare 1 \
--clean-input \
--clean-output
```
A few ready to use dags are available under docker/demo/config/test-suite/ that could give you an idea for long running dags.
```
complex-dag-cow.yaml: simple 1 round dag for COW table.
complex-dag-mor.yaml: simple 1 round dag for MOR table.
cow-clustering-example.yaml: dag with 3 rounds, in which inline clustering will trigger during the 2nd iteration.
cow-long-running-example.yaml: long running dag with 50 iterations. Only 1 partition is used.
cow-long-running-multi-partitions.yaml: long running dag with 50 iterations, with multiple partitions.
```
To run test suite jobs for a MOR table, pretty much any of these dags can be used as is. The only change is in the spark-submit command: you need to fix the table type.

```
--table-type MERGE_ON_READ
```
But if you have to switch from one table type to the other, ensure you clean up all test paths explicitly before switching:

```
hdfs dfs -rm -r /user/hive/warehouse/hudi-integ-test-suite/output/
hdfs dfs -rm -r /user/hive/warehouse/hudi-integ-test-suite/input/
```
As of now, "ValidateDatasetNode" uses the spark datasource and hive tables for comparison. Hence COW and the real time view in MOR can be tested.
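To make the comparison idea concrete, here is a plain-Python sketch of the kind of keyed set difference that validation boils down to (the real node does this with the spark datasource and spark sql over the hive table; `_row_key` is just an illustrative field name, not necessarily the one the suite uses):

```python
# Compares input records against table contents by key, ignoring order,
# similar in spirit to a dataframe subtract/except in spark.
def diff_by_key(input_records, table_records, key="_row_key"):
    input_keys = {r[key] for r in input_records}
    table_keys = {r[key] for r in table_records}
    missing = input_keys - table_keys      # ingested but absent from the table
    unexpected = table_keys - input_keys   # present in the table but not in input
    return missing, unexpected

inp = [{"_row_key": "a"}, {"_row_key": "b"}]
tbl = [{"_row_key": "a"}, {"_row_key": "b"}, {"_row_key": "c"}]
print(diff_by_key(inp, tbl))  # (set(), {'c'}) -> one unexpected row
```

A real validation would also compare the non-key columns, but the pass/fail decision has the same shape: both difference sets must be empty.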