From 9f5e8cc7c3789fd3658f52f960fb5cbc8e4efce9 Mon Sep 17 00:00:00 2001
From: Sivabalan Narayanan
Date: Thu, 25 Feb 2021 19:50:18 -0500
Subject: [PATCH] Fixing README for hudi test suite long running job (#2578)

---
 hudi-integ-test/README.md | 94 ++++++++++++++++++++++++---------------
 1 file changed, 59 insertions(+), 35 deletions(-)

diff --git a/hudi-integ-test/README.md b/hudi-integ-test/README.md
index ff64ed124..06de263e9 100644
--- a/hudi-integ-test/README.md
+++ b/hudi-integ-test/README.md
@@ -270,20 +270,31 @@ spark-submit \
 --compact-scheduling-minshare 1
 ```
 
-For long running test suite, validation has to be done differently. Idea is to run same dag in a repeated manner.
-Hence "ValidateDatasetNode" is introduced which will read entire input data and compare it with hudi contents both via
-spark datasource and hive table via spark sql engine.
+## Running long running test suite in Local Docker environment
+
+For the long running test suite, validation has to be done differently. The idea is to run the same dag repeatedly for
+N iterations. Hence "ValidateDatasetNode" is introduced, which reads the entire input data and compares it with the
+hudi contents, both via the spark datasource and via the hive table using the spark sql engine. Hive validation is
+configurable.
 
 If you have "ValidateDatasetNode" in your dag, do not replace hive jars as instructed above. Spark sql engine does not
-go well w/ hive2* jars. So, after running docker setup, just copy test.properties and your dag of interest and you are
-good to go ahead.
+go well with hive2* jars. So, after running the docker setup, follow the steps below.
+```
+docker cp packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar adhoc-2:/opt/
+docker cp demo/config/test-suite/test.properties adhoc-2:/opt/
+```
+Also copy your dag of interest to adhoc-2:/opt/
+```
+docker cp demo/config/test-suite/complex-dag-cow.yaml adhoc-2:/opt/
+```
 
+For repeated runs, two additional configs need to be set.
"dag_rounds" and "dag_intermittent_delay_mins". -This means that your dag will be repeated for N times w/ a delay of Y mins between each round. +This means that your dag will be repeated for N times w/ a delay of Y mins between each round. Note: complex-dag-cow.yaml +already has all these configs set. So no changes required just to try it out. + +Also, ValidateDatasetNode can be configured in two ways. Either with "delete_input_data" set to true or without +setting the config. When "delete_input_data" is set for ValidateDatasetNode, once validation is complete, entire input +data will be deleted. So, suggestion is to use this ValidateDatasetNode as the last node in the dag with "delete_input_data". -Also, ValidateDatasetNode can be configured in two ways. Either with "delete_input_data: true" set or not set. -When "delete_input_data" is set for ValidateDatasetNode, once validation is complete, entire input data will be deleted. -So, suggestion is to use this ValidateDatasetNode as the last node in the dag with "delete_input_data". Example dag: ``` Insert @@ -294,7 +305,7 @@ Example dag: If above dag is run with "dag_rounds" = 10 and "dag_intermittent_delay_mins" = 10, then this dag will run for 10 times with 10 mins delay between every run. At the end of every run, records written as part of this round will be validated. At the end of each validation, all contents of input are deleted. -For eg: incase of above dag, +To illustrate each round ``` Round1: insert => inputPath/batch1 @@ -322,6 +333,12 @@ may not scale beyond certain point since input data as well as hudi content's ke every cycle. Lets see an example where you don't set "delete_input_data" as part of Validation. +``` + Insert + Upsert + ValidateDatasetNode +``` +Here is the illustration of each round ``` Round1: insert => inputPath/batch1 @@ -382,27 +399,14 @@ Above dag was just an example for illustration purposes. 
But you can make it com Validate w/o deleting Upsert Validate w/ deletion -``` -With this dag, you can set the two additional configs "dag_rounds" and "dag_intermittent_delay_mins" and have a long -running test suite. - -``` -dag_rounds: 1 -dag_intermittent_delay_mins: 10 -dag_content: - Insert - Upsert - Delete - Validate w/o deleting - Insert - Rollback - Validate w/o deleting - Upsert - Validate w/ deletion - ``` -Sample COW command with repeated runs. +Once you have copied the jar, test.properties and your dag to adhoc-2:/opt/, you can run the following command to execute +the test suite job. +``` +docker exec -it adhoc-2 /bin/bash +``` +Sample COW command ``` spark-submit \ --packages org.apache.spark:spark-avro_2.11:2.4.0 \ @@ -424,25 +428,45 @@ spark-submit \ --conf spark.driver.extraClassPath=/var/demo/jars/* \ --conf spark.executor.extraClassPath=/var/demo/jars/* \ --class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob \ -/opt/hudi-integ-test-bundle-0.6.1-SNAPSHOT.jar \ +/opt/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar \ --source-ordering-field test_suite_source_ordering_field \ --use-deltastreamer \ --target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \ --input-base-path /user/hive/warehouse/hudi-integ-test-suite/input \ --target-table table1 \ --props test.properties \ ---schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \ +--schemaprovider-class org.apache.hudi.integ.testsuite.schema.TestSuiteFileBasedSchemaProvider \ --source-class org.apache.hudi.utilities.sources.AvroDFSSource \ --input-file-size 125829120 \ ---workload-yaml-path file:/var/hoodie/ws/docker/demo/config/test-suite/complex-dag-cow.yaml \ +--workload-yaml-path file:/opt/complex-dag-cow.yaml \ --workload-generator-classname org.apache.hudi.integ.testsuite.dag.WorkflowDagGenerator \ --table-type COPY_ON_WRITE \ ---compact-scheduling-minshare 1 +--compact-scheduling-minshare 1 \ +--clean-input +--clean-output ``` -A ready to use dag is available 
under docker/demo/config/test-suite/ that could give you an idea for long running +Few ready to use dags are available under docker/demo/config/test-suite/ that could give you an idea for long running dags. -cow-per-round-mixed-validate.yaml +``` +complex-dag-cow.yaml: simple 1 round dag for COW table. +complex-dag-mor.yaml: simple 1 round dag for MOR table. +cow-clustering-example.yaml : dag with 3 rounds, in which inline clustering will trigger during 2nd iteration. +cow-long-running-example.yaml : long running dag with 50 iterations. only 1 partition is used. +cow-long-running-multi-partitions.yaml: long running dag wit 50 iterations with multiple partitions. +``` + +To run test suite jobs for MOR table, pretty much any of these dags can be used as is. Only change is with the +spark-shell commnad, you need to fix the table type. +``` +--table-type MERGE_ON_READ +``` +But if you had to switch from one table type to other, ensure you clean up all test paths explicitly before switching to +a different table type. +``` +hdfs dfs -rm -r /user/hive/warehouse/hudi-integ-test-suite/output/ +hdfs dfs -rm -r /user/hive/warehouse/hudi-integ-test-suite/input/ +``` As of now, "ValidateDatasetNode" uses spark data source and hive tables for comparison. Hence COW and real time view in MOR can be tested.
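The table-type switch described at the end of the patch (clean both test paths, then resubmit with the other `--table-type`) can be sketched as a small helper script. This is a hypothetical convenience wrapper, not part of the repo: it only prints the commands for review rather than executing them, and the script name, argument handling, and abbreviated spark-submit line are illustrative assumptions; the HDFS paths follow the examples above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): prints the cleanup and submit
# steps needed before re-running the test suite with a different table type.
# Usage: sh switch-table-type.sh [COPY_ON_WRITE|MERGE_ON_READ]
TABLE_TYPE="${1:-MERGE_ON_READ}"

case "$TABLE_TYPE" in
  COPY_ON_WRITE|MERGE_ON_READ) ;;
  *) echo "unsupported table type: $TABLE_TYPE" >&2; exit 1 ;;
esac

# Always clean both test paths so leftovers from the previous table type
# cannot interfere with the new run.
echo "hdfs dfs -rm -r /user/hive/warehouse/hudi-integ-test-suite/output/"
echo "hdfs dfs -rm -r /user/hive/warehouse/hudi-integ-test-suite/input/"

# Re-run the same spark-submit command as above, changing only the table type
# (the full option list is elided here; see the sample COW command).
echo "spark-submit ... --table-type $TABLE_TYPE --clean-input --clean-output"
```

Printing instead of executing keeps the sketch safe to run anywhere; pipe the output to `sh` inside adhoc-2 only once the paths match your setup.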