Improving out of box experience for data source
- Fixes #246
- Bump up default parallelism to 1500, to handle large upserts
- Add docs on S3 configuration & tuning tips with tested Spark knobs
- Fix bug to not duplicate hoodie metadata fields when the input dataframe is another hoodie dataset
- Improve speed of ROTablePathFilter by removing directory check
- Move to spark-avro 4.0 to handle issue with nested fields with the same name
- Keep AvroConversionUtils in sync with spark-avro 4.0
This commit is contained in:
committed by vinoth chandar
parent a97814462d
commit 85dd265b7b
@@ -17,7 +17,9 @@ There are two configurations required for Hoodie-S3 compatibility:
### AWS Credentials
The simplest way to use Hoodie with S3 is to configure your `SparkSession` or `SparkContext` with S3 credentials. Hoodie will automatically pick these up and talk to S3.
Alternatively, add the required configs to your core-site.xml, from which Hoodie can fetch them. Replace `fs.defaultFS` with your S3 bucket name, and Hoodie should be able to read from and write to the bucket.
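As a sketch of the `SparkSession`/`SparkContext` route (bucket name, credentials, and the application jar are placeholders; the `spark.hadoop.*` prefix is Spark's standard pass-through into the Hadoop configuration):

```
spark-submit \
  --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
  --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \
  --conf spark.hadoop.fs.defaultFS=s3a://your-bucket \
  your-app.jar
```

Any credentials set this way land in the Hadoop configuration that Hoodie reads, so no core-site.xml changes are needed.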
```
<property>
```

@@ -51,6 +53,22 @@ Add the required configs in your core-site.xml from where Hoodie can fetch them.

```
</property>
```
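The diff above elides the body of the properties; a minimal sketch of the core-site.xml entries involved (bucket name and keys are placeholders, and the property names are the standard Hadoop S3/S3A ones, not confirmed by this diff):

```
<property>
  <name>fs.defaultFS</name>
  <value>s3://your-s3-bucket</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>AWS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>AWS_SECRET_KEY</value>
</property>
```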
Utilities such as hoodie-cli or the deltastreamer tool can pick up S3 credentials via environment variables prefixed with `HOODIE_ENV_`. For example, the bash snippet below sets up such variables so the CLI can work on datasets stored in S3:
```
export HOODIE_ENV_fs_DOT_s3a_DOT_access_DOT_key=$accessKey
export HOODIE_ENV_fs_DOT_s3a_DOT_secret_DOT_key=$secretKey
export HOODIE_ENV_fs_DOT_s3_DOT_awsAccessKeyId=$accessKey
export HOODIE_ENV_fs_DOT_s3_DOT_awsSecretAccessKey=$secretKey
export HOODIE_ENV_fs_DOT_s3n_DOT_awsAccessKeyId=$accessKey
export HOODIE_ENV_fs_DOT_s3n_DOT_awsSecretAccessKey=$secretKey
export HOODIE_ENV_fs_DOT_s3n_DOT_impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```
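The `_DOT_` convention exists because `.` is not legal in environment variable names; a small bash sketch of the mapping back to Hadoop config keys (an illustration of the transformation, not Hoodie's actual implementation):

```shell
# Hypothetical illustration: strip the HOODIE_ENV_ prefix and
# turn each _DOT_ back into '.' to recover the Hadoop config key.
var="HOODIE_ENV_fs_DOT_s3a_DOT_access_DOT_key"
key="${var#HOODIE_ENV_}"   # drop the prefix -> fs_DOT_s3a_DOT_access_DOT_key
key="${key//_DOT_/.}"      # replace every _DOT_ with '.'
echo "$key"                # prints: fs.s3a.access.key
```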
### AWS Libs
AWS Hadoop libraries to add to your classpath:
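As an illustration (the artifact version below is a placeholder and must match your Hadoop distribution), the usual dependency is `hadoop-aws`, which pulls in the matching AWS SDK transitively and can be added at submit time:

```
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:2.7.3 \
  your-app.jar
```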