
Usage of hudi-utilities-slim-bundle

Starting from version 0.11, Hudi provides hudi-utilities-slim-bundle, which excludes the hudi-spark-datasource modules. This bundle is intended to be used together with the Hudi Spark bundle when using hudi-utilities-bundle on its own causes problems with a specific Spark version.

Example with Spark 2.4.7

  • Build Hudi: mvn clean install -DskipTests
  • Run deltastreamer
bin/spark-submit \
  --driver-memory 4g --executor-memory 2g --num-executors 3 --executor-cores 1 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.driver.maxResultSize=1g \
  --conf spark.ui.port=6679 \
  --packages org.apache.spark:spark-avro_2.11:2.4.7 \
  --jars /path/to/hudi/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.12.0-SNAPSHOT.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls /path/to/hudi/packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.11-0.12.0-SNAPSHOT.jar` \
  --props `ls /path/to/hudi/dfs-source.properties` \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --source-ordering-field tpep_dropoff_datetime \
  --table-type COPY_ON_WRITE \
  --target-base-path file:///tmp/hudi-ny-taxi-spark24/ \
  --target-table ny_hudi_tbl \
  --op UPSERT \
  --continuous \
  --source-limit 5000000 \
  --min-sync-interval-seconds 60
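
The command above passes --props, but the properties file itself is not shown here. A minimal sketch of what such a file might contain for ParquetDFSSource with FilebasedSchemaProvider follows. The input directory, the schema file paths, and the column names VendorID and date_col are placeholder assumptions, not values taken from this README, and PROPS_FILE is a hypothetical variable used only for illustration.

```shell
# Sketch only: writes a hypothetical dfs-source.properties.
# All paths and the column names VendorID / date_col are assumed placeholders.
PROPS_FILE="${PROPS_FILE:-${TMPDIR:-/tmp}/dfs-source.properties}"

cat > "$PROPS_FILE" <<'EOF'
# Record key and partition path columns of the incoming records (assumed names)
hoodie.datasource.write.recordkey.field=VendorID
hoodie.datasource.write.partitionpath.field=date_col
# Directory that ParquetDFSSource reads parquet files from (assumed path)
hoodie.deltastreamer.source.dfs.root=file:///path/to/input/parquet
# Avro schema files read by FilebasedSchemaProvider (assumed paths)
hoodie.deltastreamer.schemaprovider.source.schema.file=file:///path/to/source.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=file:///path/to/target.avsc
EOF

echo "wrote $PROPS_FILE"
```

The resulting file path would then be passed as the --props argument. The same file can be reused across the Spark versions below, since none of these keys depend on the Spark or Scala version.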

Example with Spark 3.1.2

  • Build Hudi: mvn clean install -DskipTests -Dspark3.1 -Dscala-2.12
  • Run deltastreamer
bin/spark-submit \
  --driver-memory 4g --executor-memory 2g --num-executors 3 --executor-cores 1 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.driver.maxResultSize=1g \
  --conf spark.ui.port=6679 \
  --packages org.apache.spark:spark-avro_2.12:3.1.2 \
  --jars /path/to/hudi/packaging/hudi-spark-bundle/target/hudi-spark3.1-bundle_2.12-0.12.0-SNAPSHOT.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls /path/to/hudi/packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-0.12.0-SNAPSHOT.jar` \
  --props `ls /path/to/hudi/dfs-source.properties` \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --source-ordering-field tpep_dropoff_datetime \
  --table-type COPY_ON_WRITE \
  --target-base-path file:///tmp/hudi-ny-taxi-spark31/ \
  --target-table ny_hudi_tbl \
  --op UPSERT \
  --continuous \
  --source-limit 5000000 \
  --min-sync-interval-seconds 60

Example with Spark 3.2.0

  • Build Hudi: mvn clean install -DskipTests -Dspark3.2 -Dscala-2.12
  • Run deltastreamer
bin/spark-submit \
  --driver-memory 4g --executor-memory 2g --num-executors 3 --executor-cores 1 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.driver.maxResultSize=1g \
  --conf spark.ui.port=6679 \
  --packages org.apache.spark:spark-avro_2.12:3.2.0 \
  --jars /path/to/hudi/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.12.0-SNAPSHOT.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls /path/to/hudi/packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-0.12.0-SNAPSHOT.jar` \
  --props `ls /path/to/hudi/dfs-source.properties` \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --source-ordering-field tpep_dropoff_datetime \
  --table-type COPY_ON_WRITE \
  --target-base-path file:///tmp/hudi-ny-taxi-spark32/ \
  --target-table ny_hudi_tbl \
  --op UPSERT \
  --continuous \
  --source-limit 5000000 \
  --min-sync-interval-seconds 60
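
The three examples differ only in the Scala version, the spark-avro package version, and the name of the Spark bundle jar. As a rough sketch, the jar paths could be derived from the Spark version with a few helper functions. The function names, HUDI_HOME, and HUDI_VERSION are hypothetical and exist only for illustration; the mapping covers just the three Spark versions shown in this README.

```shell
# Hypothetical helpers: derive the two bundle jar paths used in the examples.
# HUDI_HOME, HUDI_VERSION and all function names are illustrative assumptions.
HUDI_HOME="${HUDI_HOME:-/path/to/hudi}"
HUDI_VERSION="${HUDI_VERSION:-0.12.0-SNAPSHOT}"

# Scala version paired with each Spark version in the examples.
scala_version_for() {
  case "$1" in
    2.4.*) echo "2.11" ;;
    3.1.*|3.2.*) echo "2.12" ;;
    *) echo "unsupported Spark version: $1" >&2; return 1 ;;
  esac
}

# Artifact name of the Spark bundle for each Spark version in the examples.
spark_bundle_name_for() {
  case "$1" in
    2.4.*) echo "hudi-spark-bundle" ;;
    3.1.*) echo "hudi-spark3.1-bundle" ;;
    3.2.*) echo "hudi-spark3.2-bundle" ;;
    *) echo "unsupported Spark version: $1" >&2; return 1 ;;
  esac
}

# Full path of the Spark bundle jar, as passed to --jars.
spark_bundle_jar() {
  local scala name
  scala="$(scala_version_for "$1")" || return 1
  name="$(spark_bundle_name_for "$1")" || return 1
  echo "${HUDI_HOME}/packaging/hudi-spark-bundle/target/${name}_${scala}-${HUDI_VERSION}.jar"
}

# Full path of the slim utilities bundle jar, used as the application jar.
slim_bundle_jar() {
  local scala
  scala="$(scala_version_for "$1")" || return 1
  echo "${HUDI_HOME}/packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_${scala}-${HUDI_VERSION}.jar"
}
```

With these in place, a wrapper script could pass "$(spark_bundle_jar 3.2.0)" to --jars and "$(slim_bundle_jar 3.2.0)" as the application jar, so that only the Spark version needs to change between runs.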