
[HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests (#5501)

- Added pure immutable test yamls to integ test framework. Added SparkBulkInsertNode as part of it.
- Added delete_partition support to integ test framework using spark-datasource.
- Added a single yaml to test all non core write operations (insert overwrite, insert overwrite table and delete partitions)
- Added tests for 4 concurrent spark datasource writers (multi-writer tests).
- Fixed readme w/ sample commands for multi-writer.
Author: Sivabalan Narayanan
Date: 2022-05-12 21:01:55 -04:00
Committed by: GitHub
Parent: ecd47e7aae
Commit: 0cec955fa2
26 changed files with 970 additions and 46 deletions


@@ -522,6 +522,78 @@ Spark submit with the flag:
--saferSchemaEvolution
```
### Multi-writer tests
The integ test framework also supports multi-writer tests.
#### Multi-writer tests with deltastreamer and a spark data source writer
Sample spark-submit command to run one deltastreamer and one spark data source writer concurrently:
```shell
./bin/spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 \
--conf spark.task.cpus=3 --conf spark.executor.cores=3 \
--conf spark.task.maxFailures=100 --conf spark.memory.fraction=0.4 \
--conf spark.rdd.compress=true --conf spark.kryoserializer.buffer.max=2000m \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.memory.storageFraction=0.1 --conf spark.shuffle.service.enabled=true \
--conf spark.sql.hive.convertMetastoreParquet=false --conf spark.driver.maxResultSize=12g \
--conf spark.executor.heartbeatInterval=120s --conf spark.network.timeout=600s \
--conf spark.yarn.max.executor.failures=10 \
--conf spark.sql.catalogImplementation=hive \
--class org.apache.hudi.integ.testsuite.HoodieMultiWriterTestSuiteJob \
<HUDI_REPO_DIR>/packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.12.0-SNAPSHOT.jar \
--source-ordering-field test_suite_source_ordering_field \
--use-deltastreamer \
--target-base-path /tmp/hudi/output \
--input-base-paths "/tmp/hudi/input1,/tmp/hudi/input2" \
--target-table table1 \
--props-paths "file:<HUDI_REPO_DIR>/docker/demo/config/test-suite/multi-writer-local-1.properties,file:<HUDI_REPO_DIR>/docker/demo/config/test-suite/multi-writer-local-2.properties" \
--schemaprovider-class org.apache.hudi.integ.testsuite.schema.TestSuiteFileBasedSchemaProvider \
--source-class org.apache.hudi.utilities.sources.AvroDFSSource \
--input-file-size 125829120 \
--workload-yaml-paths "file:<HUDI_REPO_DIR>/docker/demo/config/test-suite/multi-writer-1-ds.yaml,file:<HUDI_REPO_DIR>/docker/demo/config/test-suite/multi-writer-2-sds.yaml" \
--workload-generator-classname org.apache.hudi.integ.testsuite.dag.WorkflowDagGenerator \
--table-type COPY_ON_WRITE \
--compact-scheduling-minshare 1 \
--input-base-path "dummyValue" \
--workload-yaml-path "dummyValue" \
--props "dummyValue" \
--use-hudi-data-to-generate-updates
```
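Each writer gets its own properties file (the `multi-writer-local-*.properties` files passed to `--props-paths` above). Those checked-in files are the source of truth; as a minimal sketch, assuming a ZooKeeper-based lock provider, the concurrency-control portion of such a file would look roughly like this (keys are standard Hudi multi-writer configs, values are placeholders):

```properties
# Sketch of the lock/concurrency keys a multi-writer properties file needs.
# Values are placeholders; see the checked-in multi-writer-local-*.properties
# files for the actual configuration.
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=<zk_host>
hoodie.write.lock.zookeeper.port=<zk_port>
hoodie.write.lock.zookeeper.lock_key=<lock_key>
hoodie.write.lock.zookeeper.base_path=<zk_base_path>
```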
#### Multi-writer tests with 4 concurrent spark data source writers
Sample spark-submit command to run 4 concurrent spark data source writers:
```shell
./bin/spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 \
--conf spark.task.cpus=3 --conf spark.executor.cores=3 \
--conf spark.task.maxFailures=100 --conf spark.memory.fraction=0.4 \
--conf spark.rdd.compress=true --conf spark.kryoserializer.buffer.max=2000m \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.memory.storageFraction=0.1 --conf spark.shuffle.service.enabled=true \
--conf spark.sql.hive.convertMetastoreParquet=false --conf spark.driver.maxResultSize=12g \
--conf spark.executor.heartbeatInterval=120s --conf spark.network.timeout=600s \
--conf spark.yarn.max.executor.failures=10 --conf spark.sql.catalogImplementation=hive \
--class org.apache.hudi.integ.testsuite.HoodieMultiWriterTestSuiteJob \
<BUNDLE_LOCATION>/hudi-integ-test-bundle-0.12.0-SNAPSHOT.jar \
--source-ordering-field test_suite_source_ordering_field \
--use-deltastreamer \
--target-base-path /tmp/hudi/output \
--input-base-paths "/tmp/hudi/input1,/tmp/hudi/input2,/tmp/hudi/input3,/tmp/hudi/input4" \
--target-table table1 \
--props-paths "file:<PROPS_LOCATION>/multi-writer-local-1.properties,file:<PROPS_LOCATION>/multi-writer-local-2.properties,file:<PROPS_LOCATION>/multi-writer-local-3.properties,file:<PROPS_LOCATION>/multi-writer-local-4.properties" \
--schemaprovider-class org.apache.hudi.integ.testsuite.schema.TestSuiteFileBasedSchemaProvider \
--source-class org.apache.hudi.utilities.sources.AvroDFSSource \
--input-file-size 125829120 \
--workload-yaml-paths "file:<PROPS_LOCATION>/multi-writer-1-sds.yaml,file:<PROPS_LOCATION>/multi-writer-2-sds.yaml,file:<PROPS_LOCATION>/multi-writer-3-sds.yaml,file:<PROPS_LOCATION>/multi-writer-4-sds.yaml" \
--workload-generator-classname org.apache.hudi.integ.testsuite.dag.WorkflowDagGenerator \
--table-type COPY_ON_WRITE \
--compact-scheduling-minshare 1 \
--input-base-path "dummyValue" \
--workload-yaml-path "dummyValue" \
--props "dummyValue" \
--use-hudi-data-to-generate-updates
```
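Each writer is driven by its own workload yaml (the `multi-writer-*-sds.yaml` files passed to `--workload-yaml-paths` above), describing a DAG of write operations. A hypothetical minimal sketch in the integ test suite's dag format follows; node names and config values here are illustrative, not taken from the checked-in files:

```yaml
# Hypothetical minimal workload dag for one writer; the checked-in
# multi-writer-*-sds.yaml files are the source of truth.
dag_name: multi-writer-sketch.yaml
dag_rounds: 1
dag_intermittent_delay_mins: 1
dag_content:
  first_insert:
    config:
      record_size: 1000
      repeat_count: 1
      num_records_insert: 100
    type: InsertNode
    deps: none
  first_upsert:
    config:
      record_size: 1000
      repeat_count: 1
      num_records_upsert: 50
    type: UpsertNode
    deps: first_insert
```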
## Automated tests for N yamls in Local Docker environment
Hudi provides a script to assist you in testing N yamls automatically. Check out the script under