[HUDI-1331] Adding support for validating entire dataset and long running tests in test suite framework (#2168)

* trigger rebuild

* [HUDI-1156] Remove unused dependencies from HoodieDeltaStreamerWrapper Class (#1927)

* Adding support for validating records and long running tests in test suite framework

* Adding partial validate node

* Fixing spark session initiation in Validate nodes

* Fixing validation

* Adding hive table validation to ValidateDatasetNode

* Rebasing with latest commits from master

* Addressing feedback

* Addressing comments

Co-authored-by: lamber-ken <lamberken@163.com>
Co-authored-by: linshan-ma <mabin194046@163.com>
Sivabalan Narayanan
2020-12-26 12:29:24 -05:00
committed by GitHub
parent 3ec9270e8e
commit 8cf6a7223f
33 changed files with 1101 additions and 518 deletions


@@ -142,7 +142,9 @@ Start the Hudi Docker demo:
docker/setup_demo.sh
```
-NOTE: We need to make a couple of environment changes for Hive 2.x support. This will be fixed once Hudi moves to Spark 3.x
+NOTE: We need to make a couple of environment changes for Hive 2.x support. This will be fixed once Hudi moves to Spark 3.x.
+Execute the commands below only if you are using the Hudi query node in your dag; if not, this section is not required.
+Also, for longer-running tests, skip ahead to the next section.
```
docker exec -it adhoc-2 bash
@@ -214,7 +216,7 @@ spark-submit \
--conf spark.sql.catalogImplementation=hive \
--class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob \
/opt/hudi-integ-test-bundle-0.6.1-SNAPSHOT.jar \
---source-ordering-field timestamp \
+--source-ordering-field test_suite_source_ordering_field \
--use-deltastreamer \
--target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
--input-base-path /user/hive/warehouse/hudi-integ-test-suite/input \
@@ -253,7 +255,7 @@ spark-submit \
--conf spark.sql.catalogImplementation=hive \
--class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob \
/opt/hudi-integ-test-bundle-0.6.1-SNAPSHOT.jar \
---source-ordering-field timestamp \
+--source-ordering-field test_suite_source_ordering_field \
--use-deltastreamer \
--target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
--input-base-path /user/hive/warehouse/hudi-integ-test-suite/input \
@@ -267,3 +269,182 @@ spark-submit \
--table-type MERGE_ON_READ \
--compact-scheduling-minshare 1
```
For a long-running test suite, validation has to be done differently. The idea is to run the same dag repeatedly.
Hence "ValidateDatasetNode" is introduced, which reads the entire input data and compares it with the Hudi contents,
both via the Spark datasource and via the Hive table through the Spark SQL engine.
If you have "ValidateDatasetNode" in your dag, do not replace the Hive jars as instructed above, since the Spark SQL
engine does not work well with the hive2* jars. So, after running the docker setup, just copy test.properties and your
dag of interest and you are good to go.
For repeated runs, two additional configs need to be set: "dag_rounds" and "dag_intermittent_delay_mins". With these,
your dag will be repeated N times with a delay of Y minutes between rounds.
ValidateDatasetNode can be configured in two ways: with "delete_input_data: true" set, or not set.
When "delete_input_data" is set for ValidateDatasetNode, the entire input data is deleted once validation is complete.
The suggestion, therefore, is to use ValidateDatasetNode as the last node in the dag with "delete_input_data" set.
Example dag:
```
Insert
Upsert
ValidateDatasetNode with delete_input_data = true
```
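In the test suite's yaml format, the example dag above could be sketched roughly as follows. This is a minimal sketch: the node type names InsertNode/UpsertNode and config keys such as record_size and num_records_insert are assumptions modeled on the bundled sample dags under docker/demo/config/test-suite/, so check those files for the authoritative schema.

```yaml
dag_name: insert-upsert-validate.yaml
dag_rounds: 10
dag_intermittent_delay_mins: 10
dag_content:
  first_insert:
    config:
      record_size: 1000        # assumed knob names; see bundled sample dags
      num_records_insert: 100
    type: InsertNode
    deps: none
  first_upsert:
    config:
      record_size: 1000
      num_records_upsert: 100
    type: UpsertNode
    deps: first_insert
  validate:
    config:
      delete_input_data: true  # last node deletes the input after validation
    type: ValidateDatasetNode
    deps: first_upsert
```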
If the above dag is run with "dag_rounds" = 10 and "dag_intermittent_delay_mins" = 10, it will run 10 times with a
10-minute delay between runs. At the end of every run, the records written as part of that round are validated, and at
the end of each validation, all contents of the input are deleted.
For example, in the case of the above dag:
```
Round1:
insert => inputPath/batch1
upsert -> inputPath/batch2
Validate with delete_input_data = true
Validates contents from batch1 and batch2 are in hudi and ensures Row equality
Since "delete_input_data" is set, deletes contents from batch1 and batch2.
Round2:
insert => inputPath/batch3
upsert -> inputPath/batch4
Validate with delete_input_data = true
Validates contents from batch3 and batch4 are in hudi and ensures Row equality
Since "delete_input_data" is set, deletes contents from batch3 and batch4.
Round3:
insert => inputPath/batch5
upsert -> inputPath/batch6
Validate with delete_input_data = true
Validates contents from batch5 and batch6 are in hudi and ensures Row equality
Since "delete_input_data" is set, deletes contents from batch5 and batch6.
.
.
```
If you wish to do a cumulative validation, do not set "delete_input_data" in ValidateDatasetNode. But remember that this
may not scale beyond a certain point, since both the input data and the Hudi contents keep occupying disk space and grow
with every cycle.
Let's look at an example where "delete_input_data" is not set as part of validation.
```
Round1:
insert => inputPath/batch1
upsert -> inputPath/batch2
Validate: validates contents from batch1 and batch2 are in hudi and ensures Row equality
Round2:
insert => inputPath/batch3
upsert -> inputPath/batch4
Validate: validates contents from batch1 to batch4 are in hudi and ensures Row equality
Round3:
insert => inputPath/batch5
upsert -> inputPath/batch6
Validate: validates contents from batch1 to batch6 are in hudi and ensures Row equality
.
.
```
You could also have validations in the middle of your dag without setting "delete_input_data", setting it only in the
last node of the dag.
```
Round1:
insert => inputPath/batch1
upsert -> inputPath/batch2
Validate: validates contents from batch1 and batch2 are in hudi and ensures Row equality
insert => inputPath/batch3
upsert -> inputPath/batch4
Validate with delete_input_data = true
Validates contents from batch1 to batch4 are in hudi and ensures Row equality
Since "delete_input_data" is set to true, this node deletes contents from batch1 to batch4.
Round2:
insert => inputPath/batch5
upsert -> inputPath/batch6
Validate: validates contents from batch5 and batch6 are in hudi and ensures Row equality
insert => inputPath/batch7
upsert -> inputPath/batch8
Validate with delete_input_data = true
Validates contents from batch5 to batch8 are in hudi and ensures Row equality
Since "delete_input_data" is set to true, this node deletes contents from batch5 to batch8.
Round3:
insert => inputPath/batch9
upsert -> inputPath/batch10
Validate: validates contents from batch9 and batch10 are in hudi and ensures Row equality
insert => inputPath/batch11
upsert -> inputPath/batch12
Validate with delete_input_data = true
Validates contents from batch9 to batch12 are in hudi and ensures Row equality
Since "delete_input_data" is set to true, this node deletes contents from batch9 to batch12.
.
.
```
The above dag is just an example for illustration purposes; you can make it as complex as your needs require.
```
Insert
Upsert
Delete
Validate w/o deleting
Insert
Rollback
Validate w/o deleting
Upsert
Validate w/ deletion
```
With this dag, you can set the two additional configs "dag_rounds" and "dag_intermittent_delay_mins" and have a
long-running test suite.
```
dag_rounds: 1
dag_intermittent_delay_mins: 10
dag_content:
Insert
Upsert
Delete
Validate w/o deleting
Insert
Rollback
Validate w/o deleting
Upsert
Validate w/ deletion
```
Sample COW command with repeated runs:
```
spark-submit \
--packages org.apache.spark:spark-avro_2.11:2.4.0 \
--conf spark.task.cpus=1 \
--conf spark.executor.cores=1 \
--conf spark.task.maxFailures=100 \
--conf spark.memory.fraction=0.4 \
--conf spark.rdd.compress=true \
--conf spark.kryoserializer.buffer.max=2000m \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.memory.storageFraction=0.1 \
--conf spark.shuffle.service.enabled=true \
--conf spark.sql.hive.convertMetastoreParquet=false \
--conf spark.driver.maxResultSize=12g \
--conf spark.executor.heartbeatInterval=120s \
--conf spark.network.timeout=600s \
--conf spark.yarn.max.executor.failures=10 \
--conf spark.sql.catalogImplementation=hive \
--conf spark.driver.extraClassPath=/var/demo/jars/* \
--conf spark.executor.extraClassPath=/var/demo/jars/* \
--class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob \
/opt/hudi-integ-test-bundle-0.6.1-SNAPSHOT.jar \
--source-ordering-field test_suite_source_ordering_field \
--use-deltastreamer \
--target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
--input-base-path /user/hive/warehouse/hudi-integ-test-suite/input \
--target-table table1 \
--props test.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--source-class org.apache.hudi.utilities.sources.AvroDFSSource \
--input-file-size 125829120 \
--workload-yaml-path file:/var/hoodie/ws/docker/demo/config/test-suite/complex-dag-cow.yaml \
--workload-generator-classname org.apache.hudi.integ.testsuite.dag.WorkflowDagGenerator \
--table-type COPY_ON_WRITE \
--compact-scheduling-minshare 1
```
A ready-to-use dag, cow-per-round-mixed-validate.yaml, is available under docker/demo/config/test-suite/ and can give
you an idea of how long-running dags look.
As of now, "ValidateDatasetNode" uses the Spark datasource and Hive tables for comparison. Hence COW tables and the
real-time view of MOR tables can be tested.
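The row-equality check that "ValidateDatasetNode" performs can be pictured as a multiset comparison between the
accumulated input batches and the rows read back from Hudi. Below is a minimal pure-Python sketch of that idea; the
real node does this with Spark DataFrames and Spark SQL, and all names here are illustrative.

```python
from collections import Counter

def rows_match(input_rows, hudi_rows):
    """Row-equality check: every input row must appear in the Hudi snapshot
    the same number of times, and vice versa (a multiset comparison, since
    datasets may legitimately contain duplicate rows)."""
    def key(row):
        # Rows are dicts; make them hashable with a stable field order.
        return tuple(sorted(row.items()))
    return Counter(map(key, input_rows)) == Counter(map(key, hudi_rows))

# Illustrative batches: after an insert (batch1) and an upsert (batch2),
# the Hudi snapshot should contain the latest version of every record.
batch1 = [{"key": "r1", "val": 1}, {"key": "r2", "val": 2}]
batch2 = [{"key": "r1", "val": 10}]  # upsert overwrites r1
hudi_snapshot = [{"key": "r1", "val": 10}, {"key": "r2", "val": 2}]

# Expected rows = upserted records plus untouched inserts.
expected = batch2 + [r for r in batch1
                     if r["key"] not in {b["key"] for b in batch2}]
assert rows_match(expected, hudi_snapshot)
```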