
[HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests (#5501)

- Added pure immutable test yamls to integ test framework. Added SparkBulkInsertNode as part of it.
- Added delete_partition support to integ test framework using spark-datasource.
- Added a single yaml to test all non core write operations (insert overwrite, insert overwrite table and delete partitions)
- Added tests for 4 concurrent spark datasource writers (multi-writer tests).
- Fixed readme w/ sample commands for multi-writer.
Author: Sivabalan Narayanan
Date: 2022-05-12 21:01:55 -04:00
Committed by: GitHub
Parent: ecd47e7aae
Commit: 0cec955fa2
26 changed files with 970 additions and 46 deletions


@@ -522,6 +522,78 @@ Spark submit with the flag:
--saferSchemaEvolution
```
### Multi-writer tests
The integ test framework also supports multi-writer tests.
#### Multi-writer tests with deltastreamer and a spark data source writer
Sample spark-submit command to run one deltastreamer and one spark data source writer concurrently:
```shell
./bin/spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 \
--conf spark.task.cpus=3 --conf spark.executor.cores=3 \
--conf spark.task.maxFailures=100 --conf spark.memory.fraction=0.4 \
--conf spark.rdd.compress=true --conf spark.kryoserializer.buffer.max=2000m \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.memory.storageFraction=0.1 --conf spark.shuffle.service.enabled=true \
--conf spark.sql.hive.convertMetastoreParquet=false --conf spark.driver.maxResultSize=12g \
--conf spark.executor.heartbeatInterval=120s --conf spark.network.timeout=600s \
--conf spark.yarn.max.executor.failures=10 \
--conf spark.sql.catalogImplementation=hive \
--class org.apache.hudi.integ.testsuite.HoodieMultiWriterTestSuiteJob \
<HUDI_REPO_DIR>/packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.12.0-SNAPSHOT.jar \
--source-ordering-field test_suite_source_ordering_field \
--use-deltastreamer \
--target-base-path /tmp/hudi/output \
--input-base-paths "/tmp/hudi/input1,/tmp/hudi/input2" \
--target-table table1 \
--props-paths "file:<HUDI_REPO_DIR>/docker/demo/config/test-suite/multi-writer-local-1.properties,file:<HUDI_REPO_DIR>/docker/demo/config/test-suite/multi-writer-local-2.properties" \
--schemaprovider-class org.apache.hudi.integ.testsuite.schema.TestSuiteFileBasedSchemaProvider \
--source-class org.apache.hudi.utilities.sources.AvroDFSSource \
--input-file-size 125829120 \
--workload-yaml-paths "file:<HUDI_REPO_DIR>/docker/demo/config/test-suite/multi-writer-1-ds.yaml,file:<HUDI_REPO_DIR>/docker/demo/config/test-suite/multi-writer-2-sds.yaml" \
--workload-generator-classname org.apache.hudi.integ.testsuite.dag.WorkflowDagGenerator \
--table-type COPY_ON_WRITE \
--compact-scheduling-minshare 1 \
--input-base-path "dummyValue" \
--workload-yaml-path "dummyValue" \
--props "dummyValue" \
--use-hudi-data-to-generate-updates
```
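Each writer gets its own properties file (the `multi-writer-local-*.properties` files passed to `--props-paths` above). Those checked-in files are the source of truth; as a minimal sketch, assuming a ZooKeeper-based lock provider, the concurrency-control portion of such a file would look roughly like this (keys are standard Hudi multi-writer configs, values are placeholders):

```properties
# Sketch of the lock/concurrency keys a multi-writer properties file needs.
# Values are placeholders; see the checked-in multi-writer-local-*.properties
# files for the actual configuration.
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=<zk_host>
hoodie.write.lock.zookeeper.port=<zk_port>
hoodie.write.lock.zookeeper.lock_key=<lock_key>
hoodie.write.lock.zookeeper.base_path=<zk_base_path>
```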
#### Multi-writer tests with 4 concurrent spark data source writers
Sample spark-submit command to run 4 concurrent spark data source writers:
```shell
./bin/spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 \
--conf spark.task.cpus=3 --conf spark.executor.cores=3 \
--conf spark.task.maxFailures=100 --conf spark.memory.fraction=0.4 \
--conf spark.rdd.compress=true --conf spark.kryoserializer.buffer.max=2000m \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.memory.storageFraction=0.1 --conf spark.shuffle.service.enabled=true \
--conf spark.sql.hive.convertMetastoreParquet=false --conf spark.driver.maxResultSize=12g \
--conf spark.executor.heartbeatInterval=120s --conf spark.network.timeout=600s \
--conf spark.yarn.max.executor.failures=10 --conf spark.sql.catalogImplementation=hive \
--class org.apache.hudi.integ.testsuite.HoodieMultiWriterTestSuiteJob \
<BUNDLE_LOCATION>/hudi-integ-test-bundle-0.12.0-SNAPSHOT.jar \
--source-ordering-field test_suite_source_ordering_field \
--use-deltastreamer \
--target-base-path /tmp/hudi/output \
--input-base-paths "/tmp/hudi/input1,/tmp/hudi/input2,/tmp/hudi/input3,/tmp/hudi/input4" \
--target-table table1 \
--props-paths "file:<PROPS_LOCATION>/multi-writer-local-1.properties,file:<PROPS_LOCATION>/multi-writer-local-2.properties,file:<PROPS_LOCATION>/multi-writer-local-3.properties,file:<PROPS_LOCATION>/multi-writer-local-4.properties" \
--schemaprovider-class org.apache.hudi.integ.testsuite.schema.TestSuiteFileBasedSchemaProvider \
--source-class org.apache.hudi.utilities.sources.AvroDFSSource \
--input-file-size 125829120 \
--workload-yaml-paths "file:<PROPS_LOCATION>/multi-writer-1-sds.yaml,file:<PROPS_LOCATION>/multi-writer-2-sds.yaml,file:<PROPS_LOCATION>/multi-writer-3-sds.yaml,file:<PROPS_LOCATION>/multi-writer-4-sds.yaml" \
--workload-generator-classname org.apache.hudi.integ.testsuite.dag.WorkflowDagGenerator \
--table-type COPY_ON_WRITE \
--compact-scheduling-minshare 1 \
--input-base-path "dummyValue" \
--workload-yaml-path "dummyValue" \
--props "dummyValue" \
--use-hudi-data-to-generate-updates
```
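Each writer is driven by its own workload yaml (the `multi-writer-*-sds.yaml` files passed to `--workload-yaml-paths` above), describing a DAG of write operations. A hypothetical minimal sketch in the integ test suite's dag format follows; node names and config values here are illustrative, not taken from the checked-in files:

```yaml
# Hypothetical minimal workload dag for one writer; the checked-in
# multi-writer-*-sds.yaml files are the source of truth.
dag_name: multi-writer-sketch.yaml
dag_rounds: 1
dag_intermittent_delay_mins: 1
dag_content:
  first_insert:
    config:
      record_size: 1000
      repeat_count: 1
      num_records_insert: 100
    type: InsertNode
    deps: none
  first_upsert:
    config:
      record_size: 1000
      repeat_count: 1
      num_records_upsert: 50
    type: UpsertNode
    deps: first_insert
```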
## Automated tests for N yamls in Local Docker environment
Hudi provides a script to assist you in testing N yamls automatically. Check out the script under