1
0
Files
hudi/docs/s3_filesystem.md
Vinoth Chandar 85dd265b7b Improving out of box experience for data source
- Fixes #246
 - Bump up default parallelism to 1500, to handle large upserts
 - Add docs on s3 confuration & tuning tips with tested spark knobs
 - Fix bug to not duplicate hoodie metadata fields when input dataframe is another hoodie dataset
 - Improve speed of ROTablePathFilter by removing directory check
 - Move to spark-avro 4.0 to handle issue with nested fields with same name
 - Keep AvroConversionUtils in sync with spark-avro 4.0
2018-06-10 19:16:44 -07:00

80 lines
2.3 KiB
Markdown

---
title: S3 Filesystem (experimental)
keywords: sql hive s3 spark presto
sidebar: mydoc_sidebar
permalink: s3_hoodie.html
toc: false
summary: In this page, we go over how to configure hoodie with S3 filesystem.
---
Hoodie works with HDFS by default. There is an experimental work going on Hoodie-S3 compatibility.
## AWS configs
There are two configurations required for Hoodie-S3 compatibility:
- Adding AWS Credentials for Hoodie
- Adding required Jars to classpath
### AWS Credentials
Simplest way to use Hoodie with S3, is to configure your `SparkSession` or `SparkContext` with S3 credentials. Hoodie will automatically pick this up and talk to S3.
Alternatively, add the required configs in your core-site.xml from where Hoodie can fetch them. Replace the `fs.defaultFS` with your S3 bucket name and Hoodie should be able to read/write from the bucket.
```
<property>
<name>fs.defaultFS</name>
<value>s3://ysharma</value>
</property>
<property>
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>AWS_KEY</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>AWS_SECRET</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>AWS_KEY</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>AWS_SECRET</value>
</property>
```
Utilities such as hoodie-cli or deltastreamer tool, can pick up s3 creds via environmental variable prefixed with `HOODIE_ENV_`. For e.g below is a bash snippet to setup
such variables and then have cli be able to work on datasets stored in s3
```
export HOODIE_ENV_fs_DOT_s3a_DOT_access_DOT_key=$accessKey
export HOODIE_ENV_fs_DOT_s3a_DOT_secret_DOT_key=$secretKey
export HOODIE_ENV_fs_DOT_s3_DOT_awsAccessKeyId=$accessKey
export HOODIE_ENV_fs_DOT_s3_DOT_awsSecretAccessKey=$secretKey
export HOODIE_ENV_fs_DOT_s3n_DOT_awsAccessKeyId=$accessKey
export HOODIE_ENV_fs_DOT_s3n_DOT_awsSecretAccessKey=$secretKey
export HOODIE_ENV_fs_DOT_s3n_DOT_impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```
### AWS Libs
AWS hadoop libraries to add to our classpath
- com.amazonaws:aws-java-sdk:1.10.34
- org.apache.hadoop:hadoop-aws:2.7.3