Reworking the deltastreamer tool
- Standardize version of jackson
- DFSPropertiesConfiguration replaces usage of commons PropertiesConfiguration
- Remove dependency on ConstructorUtils
- Throw an error if the ordering value is not present during key generation
- Switch to the shade plugin for hoodie-utilities
- Add support for consuming Confluent avro kafka serdes
- Support for Confluent schema registry
- KafkaSource now deals with skews nicely, by doing round-robin allocation of the source limit across partitions
- Add support for BULK_INSERT operations as well
- Pass the payload class config properly into HoodieWriteClient
- Fix documentation based on new usage
- Add tests on deltastreamer, sources and all new util classes
Committed by: vinoth chandar
Parent: fb95dbdedb
Commit: d58ddbd999
@@ -47,7 +47,7 @@ just for analytics. Finally, HBase does not support incremental processing primi
 A popular question, we get is : "How does Hoodie relate to stream processing systems?", which we will try to answer here. Simply put, Hoodie can integrate with
 batch (`copy-on-write storage`) and streaming (`merge-on-read storage`) jobs of today, to store the computed results in Hadoop. For Spark apps, this can happen via direct
 integration of Hoodie library with Spark/Spark streaming DAGs. In case of Non-Spark processing systems (eg: Flink, Hive), the processing can be done in the respective systems
-and later sent into a Hoodie table via a Kafka topic/HDFS intermediate file. (TODO: Need link to SQLStreamer/DeltaStreamer after reworking). In more conceptual level, data processing
+and later sent into a Hoodie table via a Kafka topic/HDFS intermediate file. At a more conceptual level, data processing
 pipelines just consist of three components : `source`, `processing`, `sink`, with users ultimately running queries against the sink to use the results of the pipeline.
 Hoodie can act as either a source or sink, that stores data on HDFS. Applicability of Hoodie to a given stream processing pipeline ultimately boils down to suitability
 of Presto/SparkSQL/Hive for your queries.
@@ -4,22 +4,117 @@ keywords: incremental processing
 sidebar: mydoc_sidebar
 permalink: incremental_processing.html
 toc: false
-summary: In this page, we will discuss incremental processing primitives that Hoodie has to offer.
+summary: In this page, we will discuss some available tools for ingesting data incrementally & consuming the changes.
 ---
 
 As discussed in the concepts section, the two basic primitives needed for [incrementally processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
 data using Hoodie are `upserts` (to apply changes to a dataset) and `incremental pulls` (to obtain a change stream/log from a dataset). This section
 discusses a few tools that can be used to achieve these on different contexts.
 
-{% include callout.html content="Instructions are currently only for Copy-on-write storage. When merge-on-read storage is added, these tools would be revised to add that support" type="info" %}
+## Ingesting/Writing Data
 
-## Upserts
-
-Upserts can be used to apply a delta or an incremental change to a Hoodie dataset. For e.g, the incremental changes could be from a Kafka topic or files uploaded to HDFS or
+Following means can be used to apply a delta or an incremental change to a Hoodie dataset. For e.g, the incremental changes could be from a Kafka topic or files uploaded to HDFS or
 even changes pulled from another Hoodie dataset.
 
-#### Via Spark Job
+#### DeltaStreamer Tool
+
+The `HoodieDeltaStreamer` utility provides a way to achieve all of these, by using the capabilities of `HoodieWriteClient`, and supports simple row-by-row ingestion (no transformations)
+from different sources such as DFS or Kafka.
+
+The tool is a spark job (part of hoodie-utilities), that provides the following functionality
+
+ - Ability to consume new events from Kafka, incremental imports from Sqoop or output of `HiveIncrementalPuller` or files under a folder on HDFS
+ - Support for json, avro or custom payload types for the incoming data
+ - New data is written to a Hoodie dataset, with support for checkpointing & schemas, and registered onto Hive
+
+Command line options describe its capabilities in more detail (first build hoodie-utilities using `mvn clean package`).
+
+```
+[hoodie]$ spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls hoodie-utilities/target/hoodie-utilities-*-SNAPSHOT.jar` --help
+Usage: <main class> [options]
+  Options:
+    --help, -h
+
+    --key-generator-class
+      Subclass of com.uber.hoodie.KeyGenerator to generate a HoodieKey from
+      the given avro record. Built in: SimpleKeyGenerator (uses provided field
+      names as recordkey & partitionpath. Nested fields specified via dot
+      notation, e.g: a.b.c)
+      Default: com.uber.hoodie.SimpleKeyGenerator
+    --op
+      Takes one of these values : UPSERT (default), INSERT (use when input is
+      purely new data/inserts to gain speed)
+      Default: UPSERT
+      Possible Values: [UPSERT, INSERT, BULK_INSERT]
+    --payload-class
+      subclass of HoodieRecordPayload, that works off a GenericRecord.
+      Implement your own, if you want to do something other than overwriting
+      existing value
+      Default: com.uber.hoodie.OverwriteWithLatestAvroPayload
+    --props
+      path to properties file on localfs or dfs, with configurations for
+      hoodie client, schema provider, key generator and data source. For
+      hoodie client props, sane defaults are used, but recommend use to
+      provide basic things like metrics endpoints, hive configs etc. For
+      sources, refer to individual classes for supported properties.
+      Default: file:///Users/vinoth/bin/hoodie/src/test/resources/delta-streamer-config/dfs-source.properties
+    --schemaprovider-class
+      subclass of com.uber.hoodie.utilities.schema.SchemaProvider to attach
+      schemas to input & target table data, built in options:
+      FilebasedSchemaProvider
+      Default: com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
+    --source-class
+      Subclass of com.uber.hoodie.utilities.sources to read data. Built-in
+      options: com.uber.hoodie.utilities.sources.{JsonDFSSource (default),
+      AvroDFSSource, JsonKafkaSource, AvroKafkaSource, HiveIncrPullSource}
+      Default: com.uber.hoodie.utilities.sources.JsonDFSSource
+    --source-limit
+      Maximum amount of data to read from source. Default: No limit. For e.g:
+      DFSSource => max bytes to read, KafkaSource => max events to read
+      Default: 9223372036854775807
+    --source-ordering-field
+      Field within source record to decide how to break ties between records
+      with same key in input data. Default: 'ts' holding unix timestamp of
+      record
+      Default: ts
+    --spark-master
+      spark master to use.
+      Default: local[2]
+  * --target-base-path
+      base path for the target hoodie dataset. (Will be created if did not
+      exist first time around. If exists, expected to be a hoodie dataset)
+  * --target-table
+      name of the target table in Hive
+```
+
+The tool takes a hierarchically composed property file and has pluggable interfaces for extracting data, key generation and providing schema. Sample configs for ingesting from kafka and dfs are
+provided under `hoodie-utilities/src/test/resources/delta-streamer-config`.
+
+For e.g: once you have Confluent Kafka and the Schema registry up & running, produce some test data using ([impressions.avro](https://docs.confluent.io/current/ksql/docs/tutorials/generate-custom-test-data.html) provided by the schema-registry repo)
+
+```
+[confluent-5.0.0]$ bin/ksql-datagen schema=../impressions.avro format=avro topic=impressions key=impressionid
+```
+
+and then ingest it as follows.
+
+```
+[hoodie]$ spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls hoodie-utilities/target/hoodie-utilities-*-SNAPSHOT.jar` \
+  --props file://${PWD}/hoodie-utilities/src/test/resources/delta-streamer-config/kafka-source.properties \
+  --schemaprovider-class com.uber.hoodie.utilities.schema.SchemaRegistryProvider \
+  --source-class com.uber.hoodie.utilities.sources.AvroKafkaSource \
+  --source-ordering-field impresssiontime \
+  --target-base-path file:///tmp/hoodie-deltastreamer-op --target-table uber.impressions \
+  --op BULK_INSERT
+```
+
+In some cases, you may want to convert your existing dataset into Hoodie, before you can begin ingesting new data. This can be accomplished using the `hdfsparquetimport` command on the `hoodie-cli`.
+Currently, there is support for converting parquet datasets.
+
+#### Via Custom Spark Job
 
 The `hoodie-spark` module offers the DataSource API to write any data frame into a Hoodie dataset. Following is how we can upsert a dataframe, while specifying the field names that need to be used
 for `recordKey => _row_key`, `partitionPath => partition` and `precombineKey => timestamp`
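The `--props` option in this hunk describes a hierarchically composed property file, with one root file carrying hoodie client, schema provider, key generator and source configs. Purely as an illustrative sketch — the file names and property keys below are hypothetical, not taken from this commit — such a composition could look like:

```properties
# dfs-source.properties -- hypothetical root file, for illustration only.
# 'include' splices another file in as if its lines were inline;
# later assignments override earlier ones.
include=base.properties

# hypothetical source/schema settings (not real Hoodie config keys)
example.source.input.path=/data/incoming
example.schemaprovider.source.schema.file=/schemas/source.avsc
```

The sample configs that actually ship with the commit are under `hoodie-utilities/src/test/resources/delta-streamer-config`.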
@@ -39,100 +134,6 @@ inputDF.write()
 
 Please refer to [configurations](configurations.html) section, to view all datasource options.
 
-
-#### DeltaStreamer Tool
-
-The `HoodieDeltaStreamer` utility provides the way to achieve all of these, by using the capabilities of `HoodieWriteClient`.
-
-The tool is a spark job (part of hoodie-utilities), that provides the following functionality
-
- - Ability to consume new events from Kafka, incremental imports from Sqoop or output of `HiveIncrementalPuller` or files under a folder on HDFS
- - Support json, avro or a custom payload types for the incoming data
- - New data is written to a Hoodie dataset, with support for checkpointing & schemas and registered onto Hive
-
-To understand more
-
-```
-[hoodie]$ spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer hoodie-utilities/target/hoodie-utilities-0.3.6-SNAPSHOT-bin.jar --help
-Usage: <main class> [options]
-  Options:
-    --help, -h
-      Default: false
-    --hoodie-client-config
-      path to properties file on localfs or dfs, with hoodie client config.
-      Sane defaultsare used, but recommend use to provide basic things like metrics
-      endpoints, hive configs etc
-    --key-generator-class
-      Subclass of com.uber.hoodie.utilities.common.KeyExtractor to generatea
-      HoodieKey from the given avro record. Built in: SimpleKeyGenerator (Uses provided
-      field names as recordkey & partitionpath. Nested fields specified via dot
-      notation, e.g: a.b.c)
-      Default: com.uber.hoodie.utilities.keygen.SimpleKeyGenerator
-    --key-generator-config
-      Path to properties file on localfs or dfs, with KeyGenerator configs. For
-      list of acceptable properites, refer the KeyGenerator class
-    --max-input-bytes
-      Maximum number of bytes to read from source. Default: 1TB
-      Default: 1099511627776
-    --op
-      Takes one of these values : UPSERT (default), INSERT (use when input is
-      purely new data/inserts to gain speed)
-      Default: UPSERT
-      Possible Values: [UPSERT, INSERT]
-    --payload-class
-      subclass of HoodieRecordPayload, that works off a GenericRecord. Default:
-      SourceWrapperPayload. Implement your own, if you want to do something other than overwriting
-      existing value
-      Default: com.uber.hoodie.utilities.deltastreamer.DeltaStreamerAvroPayload
-    --schemaprovider-class
-      subclass of com.uber.hoodie.utilities.schema.SchemaProvider to attach
-      schemas to input & target table data, built in options: FilebasedSchemaProvider
-      Default: com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
-    --schemaprovider-config
-      path to properties file on localfs or dfs, with schema configs. For list
-      of acceptable properties, refer the schema provider class
-    --source-class
-      subclass of com.uber.hoodie.utilities.sources.Source to use to read data.
-      built-in options: com.uber.hoodie.utilities.common.{DFSSource (default),
-      KafkaSource, HiveIncrPullSource}
-      Default: com.uber.hoodie.utilities.sources.DFSSource
-    --source-config
-      path to properties file on localfs or dfs, with source configs. For list
-      of acceptable properties, refer the source class
-    --source-format
-      Format of data in source, JSON (default), Avro. All source data is
-      converted to Avro using the provided schema in any case
-      Default: JSON
-      Possible Values: [AVRO, JSON, ROW, CUSTOM]
-    --source-ordering-field
-      Field within source record to decide how to break ties between records
-      with same key in input data. Default: 'ts' holding unix timestamp of record
-      Default: ts
-    --target-base-path
-      base path for the target hoodie dataset
-    --target-table
-      name of the target table in Hive
-
-```
-
-For e.g, followings ingests data from Kafka (avro records as the client example)
-
-```
-[hoodie]$ spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer hoodie-utilities/target/hoodie-utilities-0.3.6-SNAPSHOT-bin.jar \
-  --hoodie-client-config hoodie-utilities/src/main/resources/delta-streamer-config/hoodie-client.properties \
-  --key-generator-config hoodie-utilities/src/main/resources/delta-streamer-config/key-generator.properties \
-  --schemaprovider-config hoodie-utilities/src/main/resources/delta-streamer-config/schema-provider.properties \
-  --source-class com.uber.hoodie.utilities.sources.KafkaSource \
-  --source-config hoodie-utilities/src/main/resources/delta-streamer-config/source.properties \
-  --source-ordering-field rider \
-  --target-base-path file:///tmp/hoodie-deltastreamer-op \
-  --target-table uber.trips
-```
-
 #### Syncing to Hive
 
 Once new data is written to a Hoodie dataset, via tools like above, we need the ability to sync with Hive and reflect the table schema such that queries can pick up new columns and partitions. To do this, Hoodie provides a `HiveSyncTool`, which can be
@@ -164,7 +165,7 @@ Usage: <main class> [options]
 {% include callout.html content="Note that for now, due to jar mismatches between Spark & Hive, its recommended to run this as a separate Java task in your workflow manager/cron. This is getting fix [here](https://github.com/uber/hoodie/issues/123)" type="info" %}
 
-## Incremental Pull
+## Incrementally Pulling
 
 Hoodie datasets can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows since a specified commit timestamp.
 This, together with upserts, are particularly useful for building data pipelines where 1 or more source hoodie tables are incrementally pulled (streams/facts),
@@ -41,7 +41,6 @@ public class HDFSParquetImportCommand implements CommandMarker {
       @CliOption(key = "upsert", mandatory = false, unspecifiedDefaultValue = "false",
           help = "Uses upsert API instead of the default insert API of WriteClient") boolean useUpsert,
       @CliOption(key = "srcPath", mandatory = true, help = "Base path for the input dataset") final String srcPath,
-      @CliOption(key = "srcType", mandatory = true, help = "Source type for the input dataset") final String srcType,
       @CliOption(key = "targetPath", mandatory = true, help = "Base path for the target hoodie dataset") final String
           targetPath,
       @CliOption(key = "tableName", mandatory = true, help = "Table name") final String tableName,
@@ -57,7 +56,7 @@ public class HDFSParquetImportCommand implements CommandMarker {
       @CliOption(key = "sparkMemory", mandatory = true, help = "Spark executor memory") final String sparkMemory,
       @CliOption(key = "retry", mandatory = true, help = "Number of retries") final String retry) throws Exception {
 
-    validate(format, srcType);
+    (new HDFSParquetImporter.FormatValidator()).validate("format", format);
 
     boolean initialized = HoodieCLI.initConf();
     HoodieCLI.initFS(initialized);
@@ -81,9 +80,4 @@ public class HDFSParquetImportCommand implements CommandMarker {
     }
     return "Dataset imported to hoodie format";
   }
-
-  private void validate(String format, String srcType) {
-    (new HDFSParquetImporter.FormatValidator()).validate("format", format);
-    (new HDFSParquetImporter.SourceTypeValidator()).validate("srcType", srcType);
-  }
 }
@@ -65,7 +65,6 @@ public class HoodieTestDataGenerator {
       DEFAULT_THIRD_PARTITION_PATH
   };
   public static final int DEFAULT_PARTITION_DEPTH = 3;
-
   public static String TRIP_EXAMPLE_SCHEMA = "{\"type\": \"record\"," + "\"name\": \"triprec\"," + "\"fields\": [ "
       + "{\"name\": \"timestamp\",\"type\": \"double\"},"
       + "{\"name\": \"_row_key\", \"type\": \"string\"},"
@@ -78,6 +77,7 @@ public class HoodieTestDataGenerator {
       + "{\"name\":\"fare\",\"type\": \"double\"}]}";
  public static Schema avroSchema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(TRIP_EXAMPLE_SCHEMA));
  private static Random rand = new Random(46474747);
+
  private List<KeyPartition> existingKeysList = new ArrayList<>();
  private String[] partitionPaths;
 
@@ -162,10 +162,6 @@ public class HoodieTestDataGenerator {
     }
   }
 
-  public static void createSavepointFile(String basePath, String commitTime) throws IOException {
-    createSavepointFile(basePath, commitTime, HoodieTestUtils.getDefaultHadoopConf());
-  }
-
   public static void createSavepointFile(String basePath, String commitTime, Configuration configuration)
       throws IOException {
     Path commitFile = new Path(basePath + "/" + HoodieTableMetaClient.METAFOLDER_NAME
@@ -289,8 +285,11 @@ public class HoodieTestDataGenerator {
     return partitionPaths;
   }
 
-  static class KeyPartition {
+  public List<KeyPartition> getExistingKeysList() {
+    return existingKeysList;
+  }
+
+  public static class KeyPartition {
     HoodieKey key;
     String partitionPath;
   }
@@ -88,6 +88,7 @@
     <dependency>
       <groupId>com.fasterxml.jackson.core</groupId>
       <artifactId>jackson-annotations</artifactId>
+      <version>${fasterxml.version}</version>
     </dependency>
     <dependency>
       <groupId>org.apache.parquet</groupId>
@@ -0,0 +1,102 @@
+/*
+ * Copyright (c) 2017 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.uber.hoodie.common.util;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.util.HashSet;
+import java.util.Set;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+/**
+ * A simplified version of Apache commons - PropertiesConfiguration, that supports limited field types and hierarchical
+ * configurations within the same folder as the root file.
+ *
+ * Includes are denoted by the same include=filename.properties syntax, with a path relative to the root file's folder. Lines
+ * beginning with '#' are ignored as comments. Final values for properties are resolved by the order in which they are
+ * specified in the files, with included files treated as if they are inline.
+ *
+ * Note: Not reusing commons-configuration since it has too many conflicting runtime deps.
+ */
+public class DFSPropertiesConfiguration {
+
+  private static volatile Logger log = LogManager.getLogger(DFSPropertiesConfiguration.class);
+
+  private final FileSystem fs;
+
+  private final Path rootFile;
+
+  private final TypedProperties props;
+
+  // Keep track of files visited, to detect loops
+  private final Set<String> visitedFiles;
+
+  public DFSPropertiesConfiguration(FileSystem fs, Path rootFile, TypedProperties defaults) {
+    this.fs = fs;
+    this.rootFile = rootFile;
+    this.props = defaults;
+    this.visitedFiles = new HashSet<>();
+    visitFile(rootFile);
+  }
+
+  private String[] splitProperty(String line) {
+    int ind = line.indexOf('=');
+    String k = line.substring(0, ind).trim();
+    String v = line.substring(ind + 1).trim();
+    return new String[]{k, v};
+  }
+
+  private void visitFile(Path file) {
+    try {
+      if (visitedFiles.contains(file.getName())) {
+        throw new IllegalStateException("Loop detected; file " + file + " already referenced");
+      }
+      visitedFiles.add(file.getName());
+      BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)));
+      String line;
+      while ((line = reader.readLine()) != null) {
+        if (line.startsWith("#") || line.equals("") || !line.contains("=")) {
+          continue;
+        }
+        String[] split = splitProperty(line);
+        if (line.startsWith("include=") || line.startsWith("include =")) {
+          visitFile(new Path(rootFile.getParent(), split[1]));
+        } else {
+          props.setProperty(split[0], split[1]);
+        }
+      }
+      reader.close();
+    } catch (IOException ioe) {
+      log.error("Error reading in properties from dfs", ioe);
+      throw new IllegalArgumentException("Cannot read properties from dfs", ioe);
+    }
+  }
+
+  public DFSPropertiesConfiguration(FileSystem fs, Path rootFile) {
+    this(fs, rootFile, new TypedProperties());
+  }
+
+  public TypedProperties getConfig() {
+    return props;
+  }
+}
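The include-resolution semantics of `DFSPropertiesConfiguration` above (`#` comments, `include=` splicing, later assignments overriding earlier ones, loop detection on revisited files) can be sketched without the Hadoop `FileSystem` dependency. The class below is a self-contained rewrite against an in-memory map of file contents, for illustration only — it is not the class from this commit:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

/**
 * Standalone sketch of DFSPropertiesConfiguration's include resolution:
 * '#' starts a comment, 'include=<file>' splices another file in as if its
 * lines were inline, later assignments win, and revisiting a file is a loop.
 * The DFS FileSystem is replaced by a map from file name to file content.
 */
class IncludeAwareProps {

  private final Map<String, String> files;   // file name -> file content
  private final Properties props = new Properties();
  private final Set<String> visited = new HashSet<>();

  IncludeAwareProps(Map<String, String> files, String rootFile) {
    this.files = files;
    visit(rootFile);
  }

  private void visit(String file) {
    if (!visited.add(file)) {
      throw new IllegalStateException("Loop detected; file " + file + " already referenced");
    }
    for (String raw : files.get(file).split("\n")) {
      String line = raw.trim();
      if (line.startsWith("#") || !line.contains("=")) {
        continue; // comment, blank or malformed line
      }
      int ind = line.indexOf('=');
      String key = line.substring(0, ind).trim();
      String value = line.substring(ind + 1).trim();
      if (key.equals("include")) {
        visit(value); // splice the included file inline
      } else {
        props.setProperty(key, value); // last writer wins
      }
    }
  }

  Properties getConfig() {
    return props;
  }
}
```

Running this against a root file that includes another shows the "includes treated as inline, order decides" rule directly.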
@@ -19,6 +19,7 @@ package com.uber.hoodie.common.util;
 import com.uber.hoodie.common.model.HoodieRecordPayload;
 import com.uber.hoodie.exception.HoodieException;
 import java.lang.reflect.InvocationTargetException;
+import java.util.Arrays;
 import java.util.HashMap;
 import java.util.Map;
 
@@ -26,18 +27,22 @@ public class ReflectionUtils {
 
   private static Map<String, Class<?>> clazzCache = new HashMap<>();
 
+  private static Class<?> getClass(String clazzName) {
+    if (!clazzCache.containsKey(clazzName)) {
+      try {
+        Class<?> clazz = Class.forName(clazzName);
+        clazzCache.put(clazzName, clazz);
+      } catch (ClassNotFoundException e) {
+        throw new HoodieException("Unable to load class", e);
+      }
+    }
+    return clazzCache.get(clazzName);
+  }
+
   public static <T> T loadClass(String fqcn) {
     try {
-      if (clazzCache.get(fqcn) == null) {
-        Class<?> clazz = Class.<HoodieRecordPayload>forName(fqcn);
-        clazzCache.put(fqcn, clazz);
-      }
-      return (T) clazzCache.get(fqcn).newInstance();
-    } catch (ClassNotFoundException e) {
-      throw new HoodieException("Could not load class " + fqcn, e);
-    } catch (InstantiationException e) {
-      throw new HoodieException("Could not load class " + fqcn, e);
-    } catch (IllegalAccessException e) {
+      return (T) getClass(fqcn).newInstance();
+    } catch (InstantiationException | IllegalAccessException e) {
       throw new HoodieException("Could not load class " + fqcn, e);
     }
   }
@@ -49,18 +54,32 @@ public class ReflectionUtils {
       Object[] payloadArgs,
       Class<?>... constructorArgTypes) {
     try {
-      if (clazzCache.get(recordPayloadClass) == null) {
-        Class<?> clazz = Class.<HoodieRecordPayload>forName(recordPayloadClass);
-        clazzCache.put(recordPayloadClass, clazz);
-      }
-      return (T) clazzCache.get(recordPayloadClass).getConstructor(constructorArgTypes)
+      return (T) getClass(recordPayloadClass).getConstructor(constructorArgTypes)
           .newInstance(payloadArgs);
     } catch (InstantiationException | IllegalAccessException
         | InvocationTargetException | NoSuchMethodException e) {
       throw new HoodieException("Unable to instantiate payload class ", e);
-    } catch (ClassNotFoundException e) {
-      throw new HoodieException("Unable to instantiate payload class ", e);
     }
   }
+
+  /**
+   * Creates an instance of the given class. Use this version when dealing with interface types as constructor args.
+   */
+  public static Object loadClass(String clazz, Class<?>[] constructorArgTypes, Object... constructorArgs) {
+    try {
+      return getClass(clazz).getConstructor(constructorArgTypes).newInstance(constructorArgs);
+    } catch (InstantiationException | IllegalAccessException
+        | InvocationTargetException | NoSuchMethodException e) {
+      throw new HoodieException("Unable to instantiate class ", e);
+    }
+  }
+
+  /**
+   * Creates an instance of the given class. Constructor arg types are inferred.
+   */
+  public static Object loadClass(String clazz, Object... constructorArgs) {
+    Class<?>[] constructorArgTypes = Arrays.stream(constructorArgs)
+        .map(arg -> arg.getClass()).toArray(Class<?>[]::new);
+    return loadClass(clazz, constructorArgTypes, constructorArgs);
+  }
 }
|
|||||||
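The refactored pattern above — cache `Class` objects on first load, then instantiate either with explicit constructor-arg types or with types inferred from the arguments — can be sketched outside Hoodie as follows (class and method names here are illustrative, not the Hoodie API):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ReflectionSketch {
  // Cache classes by name so repeated loads skip Class.forName
  private static final Map<String, Class<?>> CACHE = new HashMap<>();

  private static Class<?> getClass(String name) {
    return CACHE.computeIfAbsent(name, n -> {
      try {
        return Class.forName(n);
      } catch (ClassNotFoundException e) {
        throw new IllegalArgumentException("Unable to load class " + n, e);
      }
    });
  }

  // Constructor arg types inferred from the runtime classes of the args;
  // this fails when the target constructor declares an interface/supertype parameter.
  public static Object newInstance(String name, Object... args) throws Exception {
    Class<?>[] argTypes = Arrays.stream(args).map(Object::getClass).toArray(Class<?>[]::new);
    return getClass(name).getConstructor(argTypes).newInstance(args);
  }

  public static void main(String[] args) throws Exception {
    Object sb = newInstance("java.lang.StringBuilder", "hello");
    System.out.println(sb); // hello
  }
}
```

The inference limitation is why the commit also adds the overload taking explicit `Class<?>[] constructorArgTypes`: `getConstructor` requires an exact parameter-type match, so a constructor declared against an interface cannot be found from the argument's concrete class.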
@@ -0,0 +1,87 @@
+/*
+ * Copyright (c) 2017 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ */
+
+package com.uber.hoodie.common.util;
+
+import java.io.Serializable;
+import java.util.Properties;
+
+/**
+ * Type-aware extension of {@link java.util.Properties}
+ */
+public class TypedProperties extends Properties implements Serializable {
+
+  public TypedProperties() {
+    super(null);
+  }
+
+  public TypedProperties(Properties defaults) {
+    super(defaults);
+  }
+
+  private void checkKey(String property) {
+    if (!containsKey(property)) {
+      throw new IllegalArgumentException("Property " + property + " not found");
+    }
+  }
+
+  public String getString(String property) {
+    checkKey(property);
+    return getProperty(property);
+  }
+
+  public String getString(String property, String defaultValue) {
+    return containsKey(property) ? getProperty(property) : defaultValue;
+  }
+
+  public int getInteger(String property) {
+    checkKey(property);
+    return Integer.valueOf(getProperty(property));
+  }
+
+  public int getInteger(String property, int defaultValue) {
+    return containsKey(property) ? Integer.valueOf(getProperty(property)) : defaultValue;
+  }
+
+  public long getLong(String property) {
+    checkKey(property);
+    return Long.valueOf(getProperty(property));
+  }
+
+  public long getLong(String property, long defaultValue) {
+    return containsKey(property) ? Long.valueOf(getProperty(property)) : defaultValue;
+  }
+
+  public boolean getBoolean(String property) {
+    checkKey(property);
+    return Boolean.valueOf(getProperty(property));
+  }
+
+  public boolean getBoolean(String property, boolean defaultValue) {
+    return containsKey(property) ? Boolean.valueOf(getProperty(property)) : defaultValue;
+  }
+
+  public double getDouble(String property) {
+    checkKey(property);
+    return Double.valueOf(getProperty(property));
+  }
+
+  public double getDouble(String property, double defaultValue) {
+    return containsKey(property) ? Double.valueOf(getProperty(property)) : defaultValue;
+  }
+}
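The accessor pattern the new class adds over plain `java.util.Properties` — throw on a missing key, or fall back to a caller-supplied default — can be exercised with a small self-contained stand-in (the real `TypedProperties` lives in hoodie-common; the key `int.prop` below is taken from the test fixtures in this commit):

```java
import java.util.Properties;

// Minimal stand-in mirroring the TypedProperties accessors added above.
class TypedPropsSketch extends Properties {
  int getInteger(String key) {
    if (!containsKey(key)) {
      throw new IllegalArgumentException("Property " + key + " not found");
    }
    return Integer.valueOf(getProperty(key));
  }

  int getInteger(String key, int defaultValue) {
    return containsKey(key) ? Integer.valueOf(getProperty(key)) : defaultValue;
  }
}

public class TypedPropsDemo {
  public static void main(String[] args) {
    TypedPropsSketch props = new TypedPropsSketch();
    props.setProperty("int.prop", "123");
    System.out.println(props.getInteger("int.prop"));    // 123
    System.out.println(props.getInteger("missing", 42)); // 42
  }
}
```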
@@ -0,0 +1,138 @@
+/*
+ * Copyright (c) 2017 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ */
+
+package com.uber.hoodie.common.util;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+import com.uber.hoodie.common.minicluster.HdfsTestService;
+import java.io.IOException;
+import java.io.PrintStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.junit.AfterClass;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+/**
+ * Tests basic functionality of {@link DFSPropertiesConfiguration}
+ */
+public class TestDFSPropertiesConfiguration {
+
+  private static String dfsBasePath;
+  private static HdfsTestService hdfsTestService;
+  private static MiniDFSCluster dfsCluster;
+  private static DistributedFileSystem dfs;
+
+  @BeforeClass
+  public static void initClass() throws Exception {
+    hdfsTestService = new HdfsTestService();
+    dfsCluster = hdfsTestService.start(true);
+    // Create a temp folder as the base path
+    dfs = dfsCluster.getFileSystem();
+    dfsBasePath = dfs.getWorkingDirectory().toString();
+    dfs.mkdirs(new Path(dfsBasePath));
+
+    // create some files.
+    Path filePath = new Path(dfsBasePath + "/t1.props");
+    writePropertiesFile(filePath, new String[]{
+        "", "#comment", "abc", // to be ignored
+        "int.prop=123", "double.prop=113.4", "string.prop=str", "boolean.prop=true", "long.prop=1354354354"
+    });
+
+    filePath = new Path(dfsBasePath + "/t2.props");
+    writePropertiesFile(filePath, new String[]{
+        "string.prop=ignored", "include=t1.props"
+    });
+
+    filePath = new Path(dfsBasePath + "/t3.props");
+    writePropertiesFile(filePath, new String[]{
+        "double.prop=838.3", "include = t2.props", "double.prop=243.4", "string.prop=t3.value"
+    });
+
+    filePath = new Path(dfsBasePath + "/t4.props");
+    writePropertiesFile(filePath, new String[]{
+        "double.prop=838.3", "include = t4.props"
+    });
+  }
+
+  @AfterClass
+  public static void cleanupClass() throws Exception {
+    if (hdfsTestService != null) {
+      hdfsTestService.stop();
+    }
+  }
+
+  private static void writePropertiesFile(Path path, String[] lines) throws IOException {
+    PrintStream out = new PrintStream(dfs.create(path, true));
+    for (String line : lines) {
+      out.println(line);
+    }
+    out.flush();
+    out.close();
+  }
+
+  @Test
+  public void testParsing() throws IOException {
+    DFSPropertiesConfiguration cfg = new DFSPropertiesConfiguration(dfs, new Path(dfsBasePath + "/t1.props"));
+    TypedProperties props = cfg.getConfig();
+    assertEquals(5, props.size());
+    try {
+      props.getString("invalid.key");
+      fail("Should error out here.");
+    } catch (IllegalArgumentException iae) { /* ignore */ }
+
+    assertEquals(123, props.getInteger("int.prop"));
+    assertEquals(113.4, props.getDouble("double.prop"), 0.001);
+    assertEquals(true, props.getBoolean("boolean.prop"));
+    assertEquals("str", props.getString("string.prop"));
+    assertEquals(1354354354, props.getLong("long.prop"));
+
+    assertEquals(123, props.getInteger("int.prop", 456));
+    assertEquals(113.4, props.getDouble("double.prop", 223.4), 0.001);
+    assertEquals(true, props.getBoolean("boolean.prop", false));
+    assertEquals("str", props.getString("string.prop", "default"));
+    assertEquals(1354354354, props.getLong("long.prop", 8578494434L));
+
+    assertEquals(456, props.getInteger("bad.int.prop", 456));
+    assertEquals(223.4, props.getDouble("bad.double.prop", 223.4), 0.001);
+    assertEquals(false, props.getBoolean("bad.boolean.prop", false));
+    assertEquals("default", props.getString("bad.string.prop", "default"));
+    assertEquals(8578494434L, props.getLong("bad.long.prop", 8578494434L));
+  }
+
+  @Test
+  public void testIncludes() {
+    DFSPropertiesConfiguration cfg = new DFSPropertiesConfiguration(dfs, new Path(dfsBasePath + "/t3.props"));
+    TypedProperties props = cfg.getConfig();
+
+    assertEquals(123, props.getInteger("int.prop"));
+    assertEquals(243.4, props.getDouble("double.prop"), 0.001);
+    assertEquals(true, props.getBoolean("boolean.prop"));
+    assertEquals("t3.value", props.getString("string.prop"));
+    assertEquals(1354354354, props.getLong("long.prop"));
+
+    try {
+      new DFSPropertiesConfiguration(dfs, new Path(dfsBasePath + "/t4.props"));
+      fail("Should error out on a self-included file.");
+    } catch (IllegalStateException ise) { /* ignore */ }
+  }
+}
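The include semantics this test pins down — `include=` pulls in another properties file, keys appearing after an include override the included values, and a self-include (t4) must fail with `IllegalStateException` — can be sketched over an in-memory file map. The helper names and the map-based file model are illustrative, not the actual `DFSPropertiesConfiguration` implementation:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class IncludeSketch {

  // Load one "file", recursively resolving include= directives and failing on cycles.
  static Properties load(Map<String, List<String>> files, String name, Set<String> visited) {
    if (!visited.add(name)) {
      throw new IllegalStateException("Loop detected! Already visited file " + name);
    }
    Properties props = new Properties();
    for (String line : files.get(name)) {
      String[] kv = line.split("=", 2);
      if (kv.length != 2) {
        continue; // blank lines, comments, and malformed lines are ignored
      }
      String key = kv[0].trim();
      String value = kv[1].trim();
      if (key.equals("include")) {
        props.putAll(load(files, value, visited)); // included values land here...
      } else {
        props.setProperty(key, value);             // ...and later keys override them
      }
    }
    return props;
  }
}
```

Run against the t3 fixture above, this yields `double.prop=243.4` (the value set after the include wins), matching the assertion in `testIncludes`.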
@@ -148,7 +148,10 @@
       <groupId>com.fasterxml.jackson.core</groupId>
       <artifactId>jackson-annotations</artifactId>
     </dependency>
+    <dependency>
+      <groupId>com.fasterxml.jackson.module</groupId>
+      <artifactId>jackson-module-scala_2.11</artifactId>
+    </dependency>
     <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
@@ -18,6 +18,7 @@
 
 package com.uber.hoodie;
 
+import com.uber.hoodie.exception.HoodieException;
 import java.io.Serializable;
 import org.apache.avro.generic.GenericRecord;
 
@@ -44,5 +45,8 @@ public abstract class BaseAvroPayload implements Serializable {
   public BaseAvroPayload(GenericRecord record, Comparable orderingVal) {
     this.record = record;
     this.orderingVal = orderingVal;
+    if (orderingVal == null) {
+      throw new HoodieException("Ordering value is null for record: " + record);
+    }
   }
 }
@@ -21,6 +21,8 @@ package com.uber.hoodie;
 import com.uber.hoodie.common.model.HoodieKey;
 import com.uber.hoodie.common.model.HoodieRecord;
 import com.uber.hoodie.common.model.HoodieRecordPayload;
+import com.uber.hoodie.common.util.ReflectionUtils;
+import com.uber.hoodie.common.util.TypedProperties;
 import com.uber.hoodie.config.HoodieCompactionConfig;
 import com.uber.hoodie.config.HoodieIndexConfig;
 import com.uber.hoodie.config.HoodieWriteConfig;
@@ -31,8 +33,6 @@ import java.io.IOException;
 import java.util.List;
 import java.util.Map;
 import org.apache.avro.generic.GenericRecord;
-import org.apache.commons.configuration.PropertiesConfiguration;
-import org.apache.commons.lang3.reflect.ConstructorUtils;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 
@@ -72,10 +72,9 @@ public class DataSourceUtils {
    * Create a key generator class via reflection, passing in any configs needed
    */
   public static KeyGenerator createKeyGenerator(String keyGeneratorClass,
-      PropertiesConfiguration cfg) throws IOException {
+      TypedProperties props) throws IOException {
     try {
-      return (KeyGenerator) ConstructorUtils
-          .invokeConstructor(Class.forName(keyGeneratorClass), (Object) cfg);
+      return (KeyGenerator) ReflectionUtils.loadClass(keyGeneratorClass, props);
     } catch (Throwable e) {
       throw new IOException("Could not load key generator class " + keyGeneratorClass, e);
     }
@@ -87,17 +86,17 @@ public class DataSourceUtils {
   public static HoodieRecordPayload createPayload(String payloadClass, GenericRecord record,
       Comparable orderingVal) throws IOException {
     try {
-      return (HoodieRecordPayload) ConstructorUtils.invokeConstructor(Class.forName(payloadClass),
-          (Object) record, (Object) orderingVal);
+      return (HoodieRecordPayload) ReflectionUtils
+          .loadClass(payloadClass, new Class<?>[]{GenericRecord.class, Comparable.class}, record, orderingVal);
     } catch (Throwable e) {
       throw new IOException("Could not create payload for class: " + payloadClass, e);
     }
   }
 
-  public static void checkRequiredProperties(PropertiesConfiguration configuration,
+  public static void checkRequiredProperties(TypedProperties props,
       List<String> checkPropNames) {
     checkPropNames.stream().forEach(prop -> {
-      if (!configuration.containsKey(prop)) {
+      if (!props.containsKey(prop)) {
         throw new HoodieNotSupportedException("Required property " + prop + " is missing");
       }
     });
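The validation in `checkRequiredProperties` is a simple fail-fast scan over the key list. A standalone sketch (plain `java.util.Properties` and `IllegalArgumentException` stand in for `TypedProperties` and `HoodieNotSupportedException`; the config key shown is illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class RequiredPropsSketch {

  // Throw on the first required key that is absent from the config.
  static void checkRequiredProperties(Properties props, List<String> checkPropNames) {
    checkPropNames.forEach(prop -> {
      if (!props.containsKey(prop)) {
        throw new IllegalArgumentException("Required property " + prop + " is missing");
      }
    });
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("hoodie.datasource.write.recordkey.field", "uuid");
    // passes: the key is present
    checkRequiredProperties(props, Arrays.asList("hoodie.datasource.write.recordkey.field"));
    // a list naming a key that was never set would throw instead
  }
}
```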
@@ -19,9 +19,9 @@
 package com.uber.hoodie;
 
 import com.uber.hoodie.common.model.HoodieKey;
+import com.uber.hoodie.common.util.TypedProperties;
 import java.io.Serializable;
 import org.apache.avro.generic.GenericRecord;
-import org.apache.commons.configuration.PropertiesConfiguration;
 
 /**
  * Abstract class to extend for plugging in extraction of
@@ -30,9 +30,9 @@ import org.apache.commons.configuration.PropertiesConfiguration;
  */
 public abstract class KeyGenerator implements Serializable {
 
-  protected transient PropertiesConfiguration config;
+  protected transient TypedProperties config;
 
-  protected KeyGenerator(PropertiesConfiguration config) {
+  protected KeyGenerator(TypedProperties config) {
     this.config = config;
   }
 
@@ -19,9 +19,9 @@
 package com.uber.hoodie;
 
 import com.uber.hoodie.common.model.HoodieKey;
+import com.uber.hoodie.common.util.TypedProperties;
 import com.uber.hoodie.exception.HoodieException;
 import org.apache.avro.generic.GenericRecord;
-import org.apache.commons.configuration.PropertiesConfiguration;
 
 /**
  * Simple key generator, which takes names of fields to be used for recordKey and partitionPath as
@@ -29,14 +29,16 @@ import org.apache.commons.configuration.PropertiesConfiguration;
  */
 public class SimpleKeyGenerator extends KeyGenerator {
 
+  private static final String DEFAULT_PARTITION_PATH = "default";
+
   protected final String recordKeyField;
 
   protected final String partitionPathField;
 
-  public SimpleKeyGenerator(PropertiesConfiguration config) {
-    super(config);
-    this.recordKeyField = config.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY());
-    this.partitionPathField = config
+  public SimpleKeyGenerator(TypedProperties props) {
+    super(props);
+    this.recordKeyField = props.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY());
+    this.partitionPathField = props
         .getString(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY());
   }
 
@@ -46,7 +48,16 @@ public class SimpleKeyGenerator extends KeyGenerator {
       throw new HoodieException(
           "Unable to find field names for record key or partition path in cfg");
     }
-    return new HoodieKey(DataSourceUtils.getNestedFieldValAsString(record, recordKeyField),
-        DataSourceUtils.getNestedFieldValAsString(record, partitionPathField));
+    String recordKey = DataSourceUtils.getNestedFieldValAsString(record, recordKeyField);
+    String partitionPath;
+    try {
+      partitionPath = DataSourceUtils.getNestedFieldValAsString(record, partitionPathField);
+    } catch (HoodieException e) {
+      // if field is not found, lump it into default partition
+      partitionPath = DEFAULT_PARTITION_PATH;
+    }
+
+    return new HoodieKey(recordKey, partitionPath);
   }
 }
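The new key-generation behavior — a missing record-key field is still an error, but a missing partition-path field now falls back to the `"default"` partition instead of failing the write — can be sketched with records modeled as flat maps (the real generator resolves possibly nested Avro fields; names below are illustrative):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class KeyGenSketch {
  private static final String DEFAULT_PARTITION_PATH = "default";

  static String[] getKey(Map<String, String> record, String recordKeyField, String partitionPathField) {
    String recordKey = record.get(recordKeyField);
    if (recordKey == null) {
      // the record key remains mandatory
      throw new IllegalArgumentException("Record key field " + recordKeyField + " not found");
    }
    // if the partition field is not found, lump the record into the default partition
    String partitionPath = record.getOrDefault(partitionPathField, DEFAULT_PARTITION_PATH);
    return new String[]{recordKey, partitionPath};
  }

  public static void main(String[] args) {
    Map<String, String> record = new HashMap<>();
    record.put("uuid", "rec-1");
    record.put("city", "sf");
    System.out.println(Arrays.toString(getKey(record, "uuid", "city")));     // [rec-1, sf]
    System.out.println(Arrays.toString(getKey(record, "uuid", "notThere"))); // [rec-1, default]
  }
}
```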
@@ -25,10 +25,10 @@ import java.util.{Optional, Properties}
 import com.uber.hoodie.DataSourceReadOptions._
 import com.uber.hoodie.DataSourceWriteOptions._
 import com.uber.hoodie.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import com.uber.hoodie.common.util.TypedProperties
 import com.uber.hoodie.config.HoodieWriteConfig
 import com.uber.hoodie.exception.HoodieException
 import org.apache.avro.generic.GenericRecord
-import org.apache.commons.configuration.PropertiesConfiguration
 import org.apache.hadoop.fs.Path
 import org.apache.log4j.LogManager
 import org.apache.spark.api.java.JavaSparkContext
@@ -121,10 +121,10 @@ class DefaultSource extends RelationProvider
     mapAsScalaMap(defaultsMap)
   }
 
-  def toPropertiesConfiguration(params: Map[String, String]): PropertiesConfiguration = {
-    val propCfg = new PropertiesConfiguration()
-    params.foreach(kv => propCfg.addProperty(kv._1, kv._2))
-    propCfg
+  def toProperties(params: Map[String, String]): TypedProperties = {
+    val props = new TypedProperties()
+    params.foreach(kv => props.setProperty(kv._1, kv._2))
+    props
   }
 
 
@@ -161,7 +161,7 @@ class DefaultSource extends RelationProvider
     // Convert to RDD[HoodieRecord]
     val keyGenerator = DataSourceUtils.createKeyGenerator(
       parameters(KEYGENERATOR_CLASS_OPT_KEY),
-      toPropertiesConfiguration(parameters)
+      toProperties(parameters)
     )
     val genericRecords: RDD[GenericRecord] = AvroConversionUtils.createRdd(df, structName, nameSpace)
     val hoodieRecords = genericRecords.map(gr => {
@@ -16,11 +16,10 @@
  *
  */
 
-import com.uber.hoodie.common.util.SchemaTestUtil
+import com.uber.hoodie.common.util.{SchemaTestUtil, TypedProperties}
 import com.uber.hoodie.exception.HoodieException
 import com.uber.hoodie.{DataSourceWriteOptions, OverwriteWithLatestAvroPayload, SimpleKeyGenerator}
 import org.apache.avro.generic.GenericRecord
-import org.apache.commons.configuration.PropertiesConfiguration
 import org.junit.Assert._
 import org.junit.{Before, Test}
 import org.scalatest.junit.AssertionsForJUnit
@@ -39,10 +38,10 @@ class DataSourceDefaultsTest extends AssertionsForJUnit {
   }
 
 
-  private def getKeyConfig(recordKeyFieldName: String, paritionPathField: String): PropertiesConfiguration = {
-    val props = new PropertiesConfiguration()
-    props.addProperty(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyFieldName)
-    props.addProperty(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, paritionPathField)
+  private def getKeyConfig(recordKeyFieldName: String, paritionPathField: String): TypedProperties = {
+    val props = new TypedProperties()
+    props.setProperty(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyFieldName)
+    props.setProperty(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, paritionPathField)
     props
   }
 
@@ -52,24 +51,26 @@ class DataSourceDefaultsTest extends AssertionsForJUnit {
     assertEquals("field1", hk1.getRecordKey)
     assertEquals("name1", hk1.getPartitionPath)
 
-    // recordKey field not specified
+    // partition path field not specified
     try {
-      val props = new PropertiesConfiguration()
-      props.addProperty(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "field1")
+      val props = new TypedProperties()
+      props.setProperty(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "field1")
       new SimpleKeyGenerator(props).getKey(baseRecord)
       fail("Should have errored out")
     } catch {
-      case e: HoodieException => {
+      case e: IllegalArgumentException => {
         // do nothing
       }
     };
 
-    // partitionPath field is null
+    // recordkey field not specified
     try {
-      new SimpleKeyGenerator(getKeyConfig("field1", null)).getKey(baseRecord)
+      val props = new TypedProperties()
+      props.setProperty(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partitionField")
+      new SimpleKeyGenerator(props).getKey(baseRecord)
       fail("Should have errored out")
     } catch {
-      case e: HoodieException => {
+      case e: IllegalArgumentException => {
         // do nothing
       }
     };
@@ -90,6 +91,11 @@ class DataSourceDefaultsTest extends AssertionsForJUnit {
         // do nothing
       }
     };
+
+    // if partition path can't be found, return default partition path
+    val hk3 = new SimpleKeyGenerator(getKeyConfig("testNestedRecord.userId", "testNestedRecord.notThere"))
+      .getKey(baseRecord);
+    assertEquals("default", hk3.getPartitionPath)
   }
 
   @Test def testOverwriteWithLatestAvroPayload() = {
@@ -42,27 +42,43 @@
|
|||||||
</plugin>
|
</plugin>
|
||||||
<plugin>
|
<plugin>
|
||||||
<groupId>org.apache.maven.plugins</groupId>
|
<groupId>org.apache.maven.plugins</groupId>
|
||||||
<artifactId>maven-assembly-plugin</artifactId>
|
<artifactId>maven-shade-plugin</artifactId>
|
||||||
<version>2.4.1</version>
|
<version>3.1.1</version>
|
||||||
<configuration>
|
|
||||||
<descriptors>
|
|
||||||
<descriptor>src/assembly/src.xml</descriptor>
|
|
||||||
</descriptors>
|
|
||||||
<archive>
|
|
||||||
<manifest>
|
|
||||||
<mainClass>com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer</mainClass>
|
|
||||||
</manifest>
|
|
||||||
</archive>
|
|
||||||
|
|
||||||
</configuration>
|
|
||||||
<executions>
|
<executions>
|
||||||
<execution>
|
<execution>
|
||||||
<id>make-assembly</id>
|
|
||||||
<!-- bind to the packaging phase -->
|
|
||||||
<phase>package</phase>
|
<phase>package</phase>
|
||||||
<goals>
|
<goals>
|
||||||
<goal>single</goal>
|
<goal>shade</goal>
|
||||||
</goals>
|
</goals>
|
||||||
|
<configuration>
|
||||||
|
<dependencyReducedPomLocation>${project.build.directory}/dependency-reduced-pom.xml
|
||||||
|
</dependencyReducedPomLocation>
|
||||||
|
<minimizeJar>true</minimizeJar>
|
||||||
|
<artifactSet>
|
||||||
|
<includes>
|
||||||
|
<include>com.uber.hoodie:hoodie-common</include>
|
||||||
|
<include>com.uber.hoodie:hoodie-client</include>
|
||||||
|
<include>com.uber.hoodie:hoodie-spark</include>
|
||||||
|
<include>com.uber.hoodie:hoodie-hive</include>
|
||||||
|
<include>com.uber.hoodie:hoodie-hadoop-mr</include>
|
||||||
|
<include>com.beust:jcommander</include>
|
||||||
|
<include>com.twitter:bijection-avro_2.11</include>
|
||||||
|
<include>com.twitter:bijection-core_2.11</include>
|
||||||
|
<include>org.apache.parquet:parquet-avro</include>
|
||||||
|
<include>com.twitter:parquet-avro</include>
|
||||||
|
<include>com.twitter.common:objectsize</include>
|
||||||
|
<include>io.confluent:kafka-avro-serializer</include>
|
||||||
|
<include>io.confluent:common-config</include>
|
||||||
|
<include>io.confluent:common-utils</include>
|
||||||
|
<include>io.confluent:kafka-schema-registry-client</include>
|
||||||
|
<include>org.apache.spark:spark-streaming-kafka-0-8_2.11</include>
|
||||||
|
<include>org.apache.kafka:kafka_2.11</include>
|
||||||
|
<include>com.yammer.metrics:metrics-core</include>
|
||||||
|
<include>com.101tec:zkclient</include>
|
||||||
|
<include>org.apache.kafka:kafka-clients</include>
|
||||||
|
</includes>
|
||||||
|
</artifactSet>
|
||||||
|
</configuration>
|
||||||
</execution>
|
</execution>
|
||||||
</executions>
|
</executions>
|
||||||
</plugin>
|
</plugin>
|
||||||
@@ -78,10 +94,18 @@
|
|||||||
</resources>
|
</resources>
|
||||||
</build>
|
</build>
|
||||||
|
|
||||||
|
<repositories>
|
||||||
|
<repository>
|
||||||
|
<id>confluent</id>
|
||||||
|
<url>http://packages.confluent.io/maven/</url>
|
||||||
|
</repository>
|
||||||
|
</repositories>
|
||||||
|
|
||||||
<dependencies>
|
<dependencies>
|
||||||
|
|
||||||
<dependency>
|
<dependency>
|
||||||
<groupId>org.apache.spark</groupId>
|
<groupId>com.fasterxml.jackson.module</groupId>
|
||||||
<artifactId>spark-sql_2.11</artifactId>
|
<artifactId>jackson-module-scala_2.11</artifactId>
|
||||||
</dependency>
|
</dependency>
|
||||||
|
|
||||||
<dependency>
|
<dependency>
|
||||||
@@ -108,21 +132,6 @@
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-hdfs</artifactId>
     <classifier>tests</classifier>
-    <!-- Need these exclusions to make sure JavaSparkContext can be setup. https://issues.apache.org/jira/browse/SPARK-1693 -->
-    <exclusions>
-      <exclusion>
-        <groupId>org.mortbay.jetty</groupId>
-        <artifactId>*</artifactId>
-      </exclusion>
-      <exclusion>
-        <groupId>javax.servlet.jsp</groupId>
-        <artifactId>*</artifactId>
-      </exclusion>
-      <exclusion>
-        <groupId>javax.servlet</groupId>
-        <artifactId>*</artifactId>
-      </exclusion>
-    </exclusions>
   </dependency>
   <dependency>
     <groupId>org.apache.hadoop</groupId>
@@ -184,6 +193,10 @@
       <groupId>javax.servlet</groupId>
       <artifactId>servlet-api</artifactId>
     </exclusion>
+    <exclusion>
+      <groupId>com.fasterxml.jackson.*</groupId>
+      <artifactId>*</artifactId>
+    </exclusion>
   </exclusions>
 </dependency>
 
@@ -219,23 +232,16 @@
 <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-client</artifactId>
-  <exclusions>
-    <exclusion>
-      <groupId>javax.servlet</groupId>
-      <artifactId>servlet-api</artifactId>
-    </exclusion>
-  </exclusions>
 </dependency>
 
 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.11</artifactId>
-  <exclusions>
-    <exclusion>
-      <groupId>javax.servlet</groupId>
-      <artifactId>servlet-api</artifactId>
-    </exclusion>
-  </exclusions>
+</dependency>
+
+<dependency>
+  <groupId>org.apache.spark</groupId>
+  <artifactId>spark-sql_2.11</artifactId>
 </dependency>
 
 <dependency>
@@ -291,6 +297,30 @@
     <version>0.9.2</version>
   </dependency>
 
+  <dependency>
+    <groupId>io.confluent</groupId>
+    <artifactId>kafka-avro-serializer</artifactId>
+    <version>3.0.0</version>
+  </dependency>
+
+  <dependency>
+    <groupId>io.confluent</groupId>
+    <artifactId>common-config</artifactId>
+    <version>3.0.0</version>
+  </dependency>
+
+  <dependency>
+    <groupId>io.confluent</groupId>
+    <artifactId>common-utils</artifactId>
+    <version>3.0.0</version>
+  </dependency>
+
+  <dependency>
+    <groupId>io.confluent</groupId>
+    <artifactId>kafka-schema-registry-client</artifactId>
+    <version>3.0.0</version>
+  </dependency>
+
 </dependencies>
 
</project>
@@ -1,50 +0,0 @@
-<!--
- ~ Copyright (c) 2016 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
- ~
- ~ Licensed under the Apache License, Version 2.0 (the "License");
- ~ you may not use this file except in compliance with the License.
- ~ You may obtain a copy of the License at
- ~
- ~      http://www.apache.org/licenses/LICENSE-2.0
- ~
- ~ Unless required by applicable law or agreed to in writing, software
- ~ distributed under the License is distributed on an "AS IS" BASIS,
- ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- ~ See the License for the specific language governing permissions and
- ~ limitations under the License.
--->
-
-<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.3"
-          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
-          xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.3 http://maven.apache.org/xsd/assembly-1.1.3.xsd">
-  <id>bin</id>
-  <formats>
-    <format>jar</format>
-  </formats>
-
-  <includeBaseDirectory>false</includeBaseDirectory>
-  <dependencySets>
-    <dependencySet>
-      <outputDirectory>/</outputDirectory>
-      <unpack>true</unpack>
-      <scope>runtime</scope>
-      <excludes>
-        <exclude>junit:junit</exclude>
-        <exclude>com.google.code.findbugs:*</exclude>
-        <exclude>org.apache.hadoop:*</exclude>
-        <exclude>org.apache.hbase:*</exclude>
-        <!--<exclude>log4j:*</exclude>-->
-        <!--<exclude>org.slf4j:*</exclude>-->
-        <!--<exclude>commons-dbcp:*</exclude>-->
-        <!--<exclude>org.apache.httpcomponents:*</exclude>-->
-      </excludes>
-      <unpackOptions>
-        <!--<excludes>-->
-        <!--<exclude>-->
-        <!--**/slf4j/**-->
-        <!--</exclude>-->
-        <!--</excludes>-->
-      </unpackOptions>
-    </dependencySet>
-  </dependencySets>
-</assembly>
@@ -75,7 +75,9 @@ public class HDFSParquetImporter implements Serializable {
       System.exit(1);
     }
     HDFSParquetImporter dataImporter = new HDFSParquetImporter(cfg);
-    dataImporter.dataImport(UtilHelpers.buildSparkContext(cfg.tableName, cfg.sparkMaster, cfg.sparkMemory), cfg.retry);
+    dataImporter
+        .dataImport(UtilHelpers.buildSparkContext("data-importer-" + cfg.tableName, cfg.sparkMaster, cfg.sparkMemory),
+            cfg.retry);
   }
 
   public int dataImport(JavaSparkContext jsc, int retry) throws Exception {
@@ -206,19 +208,6 @@ public class HDFSParquetImporter implements Serializable {
     }
   }
 
-  public static class SourceTypeValidator implements IValueValidator<String> {
-
-    List<String> validSourceTypes = Arrays.asList("hdfs");
-
-    @Override
-    public void validate(String name, String value) throws ParameterException {
-      if (value == null || !validSourceTypes.contains(value)) {
-        throw new ParameterException(String.format(
-            "Invalid source type: value:%s: supported source types:%s", value, validSourceTypes));
-      }
-    }
-  }
-
   public static class Config implements Serializable {
 
     @Parameter(names = {"--command", "-c"},
@@ -228,10 +217,6 @@ public class HDFSParquetImporter implements Serializable {
     @Parameter(names = {"--src-path",
         "-sp"}, description = "Base path for the input dataset", required = true)
     public String srcPath = null;
-    @Parameter(names = {"--src-type",
-        "-st"}, description = "Source type for the input dataset", required = true,
-        validateValueWith = SourceTypeValidator.class)
-    public String srcType = null;
     @Parameter(names = {"--target-path",
         "-tp"}, description = "Base path for the target hoodie dataset", required = true)
     public String targetPath = null;
@@ -93,8 +93,6 @@ public class HiveIncrementalPuller {
     public int maxCommits = 3;
     @Parameter(names = {"--help", "-h"}, help = true)
     public Boolean help = false;
-    @Parameter(names = {"--storageFormat"})
-    public String tempTableStorageFormat = "AVRO";
   }
 
   static {
@@ -207,12 +205,7 @@ public class HiveIncrementalPuller {
   }
 
   private String getStoredAsClause() {
-    if (config.tempTableStorageFormat.equalsIgnoreCase("json")) {
-      // Special case for json
-      // default json serde does not support having same key even if its under multiple depths
-      return "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE";
-    }
-    return "STORED AS " + config.tempTableStorageFormat;
+    return "STORED AS AVRO";
   }
 
   private void initHiveBeelineProperties(Statement stmt) throws SQLException {
@@ -80,7 +80,8 @@ public class HoodieCompactor {
       System.exit(1);
     }
     HoodieCompactor compactor = new HoodieCompactor(cfg);
-    compactor.compact(UtilHelpers.buildSparkContext(cfg.tableName, cfg.sparkMaster, cfg.sparkMemory), cfg.retry);
+    compactor.compact(UtilHelpers.buildSparkContext("compactor-" + cfg.tableName, cfg.sparkMaster, cfg.sparkMemory),
+        cfg.retry);
   }
 
   public int compact(JavaSparkContext jsc, int retry) {
@@ -20,22 +20,20 @@ package com.uber.hoodie.utilities;
 
 import com.uber.hoodie.HoodieWriteClient;
 import com.uber.hoodie.WriteStatus;
+import com.uber.hoodie.common.util.DFSPropertiesConfiguration;
 import com.uber.hoodie.common.util.ReflectionUtils;
+import com.uber.hoodie.common.util.TypedProperties;
 import com.uber.hoodie.config.HoodieCompactionConfig;
 import com.uber.hoodie.config.HoodieIndexConfig;
 import com.uber.hoodie.config.HoodieWriteConfig;
-import com.uber.hoodie.exception.HoodieIOException;
+import com.uber.hoodie.exception.HoodieException;
 import com.uber.hoodie.index.HoodieIndex;
-import com.uber.hoodie.utilities.exception.HoodieDeltaStreamerException;
 import com.uber.hoodie.utilities.schema.SchemaProvider;
 import com.uber.hoodie.utilities.sources.Source;
-import com.uber.hoodie.utilities.sources.SourceDataFormat;
 import java.io.IOException;
+import java.io.InputStream;
 import java.nio.ByteBuffer;
 import java.util.Optional;
-import org.apache.commons.configuration.ConfigurationException;
-import org.apache.commons.configuration.PropertiesConfiguration;
-import org.apache.commons.lang3.reflect.ConstructorUtils;
 import org.apache.hadoop.fs.FSDataInputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
@@ -52,42 +50,34 @@ import org.apache.spark.api.java.JavaSparkContext;
 public class UtilHelpers {
   private static Logger logger = LogManager.getLogger(UtilHelpers.class);
 
-  public static Source createSource(String sourceClass, PropertiesConfiguration cfg,
-      JavaSparkContext jssc, SourceDataFormat dataFormat, SchemaProvider schemaProvider)
+  public static Source createSource(String sourceClass, TypedProperties cfg,
+      JavaSparkContext jssc, SchemaProvider schemaProvider)
       throws IOException {
     try {
-      return (Source) ConstructorUtils.invokeConstructor(Class.forName(sourceClass), (Object) cfg,
-          (Object) jssc, (Object) dataFormat, (Object) schemaProvider);
+      return (Source) ReflectionUtils.loadClass(sourceClass,
+          new Class<?>[]{TypedProperties.class, JavaSparkContext.class, SchemaProvider.class},
+          cfg, jssc, schemaProvider);
     } catch (Throwable e) {
       throw new IOException("Could not load source class " + sourceClass, e);
     }
   }
 
   public static SchemaProvider createSchemaProvider(String schemaProviderClass,
-      PropertiesConfiguration cfg) throws IOException {
+      TypedProperties cfg, JavaSparkContext jssc) throws IOException {
     try {
-      return (SchemaProvider) ConstructorUtils.invokeConstructor(Class.forName(schemaProviderClass),
-          (Object) cfg);
+      return (SchemaProvider) ReflectionUtils.loadClass(schemaProviderClass, cfg, jssc);
     } catch (Throwable e) {
       throw new IOException("Could not load schema provider class " + schemaProviderClass, e);
     }
   }
 
   /**
-   * TODO: Support hierarchical config files (see CONFIGURATION-609 for sample)
   */
-  public static PropertiesConfiguration readConfig(FileSystem fs, Path cfgPath) {
+  public static DFSPropertiesConfiguration readConfig(FileSystem fs, Path cfgPath) {
     try {
-      FSDataInputStream in = fs.open(cfgPath);
-      PropertiesConfiguration config = new PropertiesConfiguration();
-      config.load(in);
-      in.close();
-      return config;
-    } catch (IOException e) {
-      throw new HoodieIOException("Unable to read config file at :" + cfgPath, e);
-    } catch (ConfigurationException e) {
-      throw new HoodieDeltaStreamerException("Invalid configs found in config file at :" + cfgPath,
-          e);
+      return new DFSPropertiesConfiguration(fs, cfgPath);
+    } catch (Exception e) {
+      throw new HoodieException("Unable to read props file at :" + cfgPath, e);
     }
   }
 
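The `createSource` change above swaps commons-lang3 `ConstructorUtils.invokeConstructor` for `ReflectionUtils.loadClass` with an explicit constructor signature, so a Source class whose constructor does not match fails fast instead of being fuzzily matched. A minimal self-contained sketch of that constructor-pinning pattern (the `JsonSource` class and `loadClass` helper below are illustrative stand-ins, not the PR's actual code):

```java
import java.lang.reflect.Constructor;
import java.util.Properties;

public class ReflectionDemo {

  // Illustrative stand-in for a deltastreamer Source implementation.
  public static class JsonSource {
    public final Properties props;

    public JsonSource(Properties props) {
      this.props = props;
    }
  }

  // Pins the exact constructor signature, in the spirit of ReflectionUtils.loadClass:
  // getConstructor(argTypes) throws immediately if no constructor with that shape exists.
  @SuppressWarnings("unchecked")
  public static <T> T loadClass(String clazz, Class<?>[] argTypes, Object... args) {
    try {
      Constructor<?> ctor = Class.forName(clazz).getConstructor(argTypes);
      return (T) ctor.newInstance(args);
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Could not load class " + clazz, e);
    }
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("hoodie.datasource.write.recordkey.field", "key");
    JsonSource source = loadClass(JsonSource.class.getName(),
        new Class<?>[]{Properties.class}, props);
    System.out.println(source.props.getProperty("hoodie.datasource.write.recordkey.field"));
  }
}
```

Passing a wrong `argTypes` array here surfaces as a `NoSuchMethodException` at load time, which is the fail-fast behavior the switch buys.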
@@ -117,24 +107,16 @@ public class UtilHelpers {
     return new String(buf.array());
   }
 
-  /**
-   * Build Spark Context for ingestion/compaction
-   * @return
-   */
-  public static JavaSparkContext buildSparkContext(String tableName, String sparkMaster, String sparkMemory) {
-    SparkConf sparkConf = new SparkConf().setAppName("hoodie-data-importer-" + tableName);
-    sparkConf.setMaster(sparkMaster);
-
-    if (sparkMaster.startsWith("yarn")) {
+  private static SparkConf buildSparkConf(String appName, String defaultMaster) {
+    SparkConf sparkConf = new SparkConf().setAppName(appName);
+    String master = sparkConf.get("spark.master", defaultMaster);
+    sparkConf.setMaster(master);
+    if (master.startsWith("yarn")) {
       sparkConf.set("spark.eventLog.overwrite", "true");
       sparkConf.set("spark.eventLog.enabled", "true");
     }
 
     sparkConf.set("spark.driver.maxResultSize", "2g");
     sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
-    sparkConf.set("spark.executor.memory", sparkMemory);
-
-    // Configure hadoop conf
     sparkConf.set("spark.hadoop.mapred.output.compress", "true");
     sparkConf.set("spark.hadoop.mapred.output.compression.codec", "true");
     sparkConf.set("spark.hadoop.mapred.output.compression.codec",
@@ -142,6 +124,20 @@ public class UtilHelpers {
     sparkConf.set("spark.hadoop.mapred.output.compression.type", "BLOCK");
 
     sparkConf = HoodieWriteClient.registerClasses(sparkConf);
+    return sparkConf;
+  }
+
+  public static JavaSparkContext buildSparkContext(String appName, String defaultMaster) {
+    return new JavaSparkContext(buildSparkConf(appName, defaultMaster));
+  }
+
+  /**
+   * Build Spark Context for ingestion/compaction
+   * @return
+   */
+  public static JavaSparkContext buildSparkContext(String appName, String sparkMaster, String sparkMemory) {
+    SparkConf sparkConf = buildSparkConf(appName, sparkMaster);
+    sparkConf.set("spark.executor.memory", sparkMemory);
     return new JavaSparkContext(sparkConf);
   }
 
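The hunk above factors the shared `SparkConf` setup into a private `buildSparkConf` that honors an already-set `spark.master` (e.g. supplied by `spark-submit`) and only falls back to the caller's default when none is present. That default-with-override logic can be sketched without Spark on the classpath; a plain `Map` stands in for `SparkConf` here, purely as an illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfDemo {

  // Mirrors the shape of buildSparkConf in the diff: prefer a preset
  // spark.master if one exists, otherwise fall back to the supplied default.
  public static Map<String, String> buildConf(Map<String, String> preset, String defaultMaster) {
    Map<String, String> conf = new HashMap<>(preset);
    String master = conf.getOrDefault("spark.master", defaultMaster);
    conf.put("spark.master", master);
    if (master.startsWith("yarn")) {
      // Event logging only makes sense when running on a cluster manager.
      conf.put("spark.eventLog.enabled", "true");
    }
    return conf;
  }

  public static void main(String[] args) {
    // No master preset: the default wins.
    System.out.println(buildConf(new HashMap<>(), "local[2]").get("spark.master"));

    // Master preset externally (as spark-submit would do): the preset wins.
    Map<String, String> preset = new HashMap<>();
    preset.put("spark.master", "yarn-cluster");
    System.out.println(buildConf(preset, "local[2]").get("spark.master"));
  }
}
```

This matters here because the old deltastreamer hardcoded `setMaster("local[2]")`; delegating to `SparkConf.get("spark.master", defaultMaster)` lets the launcher decide.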
@@ -185,4 +181,10 @@ public class UtilHelpers {
     logger.error(String.format("Import failed with %d errors.", errors.value()));
     return -1;
   }
+
+  public static TypedProperties readConfig(InputStream in) throws IOException {
+    TypedProperties defaults = new TypedProperties();
+    defaults.load(in);
+    return defaults;
+  }
 }
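The new `readConfig(InputStream)` overload above loads a `TypedProperties` bag straight from a stream. `TypedProperties` is Hoodie's subclass of `java.util.Properties` with typed getters; the loading path can be sketched with plain `Properties` (the property key below is illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class PropsDemo {

  // Same shape as the new UtilHelpers.readConfig(InputStream): load key=value
  // pairs from a stream into a properties bag and return it.
  public static Properties readConfig(InputStream in) throws IOException {
    Properties defaults = new Properties();
    defaults.load(in);
    return defaults;
  }

  public static void main(String[] args) throws IOException {
    String cfg = "hoodie.deltastreamer.source.limit=1000000\n";
    Properties props = readConfig(new ByteArrayInputStream(cfg.getBytes("UTF-8")));
    // A typed getter on TypedProperties would perform this parse internally.
    System.out.println(Long.parseLong(props.getProperty("hoodie.deltastreamer.source.limit")));
  }
}
```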
@@ -35,6 +35,8 @@ import com.uber.hoodie.common.table.HoodieTableMetaClient;
 import com.uber.hoodie.common.table.HoodieTimeline;
 import com.uber.hoodie.common.table.timeline.HoodieInstant;
 import com.uber.hoodie.common.util.FSUtils;
+import com.uber.hoodie.common.util.TypedProperties;
+import com.uber.hoodie.common.util.collection.Pair;
 import com.uber.hoodie.config.HoodieCompactionConfig;
 import com.uber.hoodie.config.HoodieIndexConfig;
 import com.uber.hoodie.config.HoodieWriteConfig;
@@ -44,9 +46,8 @@ import com.uber.hoodie.utilities.UtilHelpers;
 import com.uber.hoodie.utilities.exception.HoodieDeltaStreamerException;
 import com.uber.hoodie.utilities.schema.FilebasedSchemaProvider;
 import com.uber.hoodie.utilities.schema.SchemaProvider;
-import com.uber.hoodie.utilities.sources.DFSSource;
+import com.uber.hoodie.utilities.sources.JsonDFSSource;
 import com.uber.hoodie.utilities.sources.Source;
-import com.uber.hoodie.utilities.sources.SourceDataFormat;
 import java.io.IOException;
 import java.io.Serializable;
 import java.util.Arrays;
@@ -56,13 +57,10 @@ import java.util.Optional;
 import java.util.Properties;
 import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericRecord;
-import org.apache.commons.configuration.PropertiesConfiguration;
-import org.apache.commons.lang3.tuple.Pair;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
-import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 import scala.collection.JavaConversions;
@@ -77,7 +75,7 @@ public class HoodieDeltaStreamer implements Serializable {
 
   private static volatile Logger log = LogManager.getLogger(HoodieDeltaStreamer.class);
 
-  private static String CHECKPOINT_KEY = "deltastreamer.checkpoint.key";
+  public static String CHECKPOINT_KEY = "deltastreamer.checkpoint.key";
 
   private final Config cfg;
 
@@ -113,9 +111,15 @@ public class HoodieDeltaStreamer implements Serializable {
   private transient JavaSparkContext jssc;
 
 
-  public HoodieDeltaStreamer(Config cfg) throws IOException {
+  /**
+   * Bag of properties with source, hoodie client, key generator etc.
+   */
+  TypedProperties props;
+
+
+  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc) throws IOException {
     this.cfg = cfg;
-    this.jssc = getSparkContext();
+    this.jssc = jssc;
     this.fs = FSUtils.getFs(cfg.targetBasePath, jssc.hadoopConfiguration());
 
     if (fs.exists(new Path(cfg.targetBasePath))) {
@@ -126,61 +130,19 @@ public class HoodieDeltaStreamer implements Serializable {
       this.commitTimelineOpt = Optional.empty();
     }
 
-    //TODO(vc) Should these be passed from outside?
-    initSchemaProvider();
-    initKeyGenerator();
-    initSource();
-  }
-
-  private void initSource() throws IOException {
-    // Create the source & schema providers
-    PropertiesConfiguration sourceCfg = UtilHelpers.readConfig(fs, new Path(cfg.sourceConfigProps));
-    log.info("Creating source " + cfg.sourceClassName + " with configs : " + sourceCfg.toString());
-    this.source = UtilHelpers.createSource(cfg.sourceClassName, sourceCfg, jssc, cfg.sourceFormat,
-        schemaProvider);
-  }
-
-  private void initSchemaProvider() throws IOException {
-    PropertiesConfiguration schemaCfg = UtilHelpers.readConfig(fs,
-        new Path(cfg.schemaProviderConfigProps));
-    log.info(
-        "Creating schema provider " + cfg.schemaProviderClassName + " with configs : " + schemaCfg
-            .toString());
-    this.schemaProvider = UtilHelpers.createSchemaProvider(cfg.schemaProviderClassName, schemaCfg);
-  }
-
-  private void initKeyGenerator() throws IOException {
-    PropertiesConfiguration keygenCfg = UtilHelpers.readConfig(fs, new Path(cfg.keyGeneratorProps));
-    log.info("Creating key generator " + cfg.keyGeneratorClass + " with configs : " + keygenCfg
-        .toString());
-    this.keyGenerator = DataSourceUtils.createKeyGenerator(cfg.keyGeneratorClass, keygenCfg);
-  }
-
-
-  private JavaSparkContext getSparkContext() {
-    SparkConf sparkConf = new SparkConf()
-        .setAppName("hoodie-delta-streamer-" + cfg.targetTableName);
-    //sparkConf.setMaster(cfg.sparkMaster);
-    sparkConf.setMaster("local[2]");
-    sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
-    sparkConf.set("spark.driver.maxResultSize", "2g");
-
-    // Configure hadoop conf
-    sparkConf.set("spark.hadoop.mapred.output.compress", "true");
-    sparkConf.set("spark.hadoop.mapred.output.compression.codec", "true");
-    sparkConf.set("spark.hadoop.mapred.output.compression.codec",
-        "org.apache.hadoop.io.compress.GzipCodec");
-    sparkConf.set("spark.hadoop.mapred.output.compression.type", "BLOCK");
-
-    sparkConf = HoodieWriteClient.registerClasses(sparkConf);
+    this.props = UtilHelpers.readConfig(fs, new Path(cfg.propsFilePath)).getConfig();
+    log.info("Creating delta streamer with configs : " + props.toString());
+    this.schemaProvider = UtilHelpers.createSchemaProvider(cfg.schemaProviderClassName, props, jssc);
+    this.keyGenerator = DataSourceUtils.createKeyGenerator(cfg.keyGeneratorClass, props);
+    this.source = UtilHelpers.createSource(cfg.sourceClassName, props, jssc, schemaProvider);
+
     // register the schemas, so that shuffle does not serialize the full schemas
     List<Schema> schemas = Arrays.asList(schemaProvider.getSourceSchema(),
         schemaProvider.getTargetSchema());
-    sparkConf.registerAvroSchemas(JavaConversions.asScalaBuffer(schemas).toList());
-    return new JavaSparkContext(sparkConf);
+    jssc.sc().getConf().registerAvroSchemas(JavaConversions.asScalaBuffer(schemas).toList());
   }
 
-  private void sync() throws Exception {
+  public void sync() throws Exception {
     // Retrieve the previous round checkpoints, if any
     Optional<String> resumeCheckpointStr = Optional.empty();
     if (commitTimelineOpt.isPresent()) {
@@ -207,7 +169,7 @@ public class HoodieDeltaStreamer implements Serializable {
 
     // Pull the data from the source & prepare the write
     Pair<Optional<JavaRDD<GenericRecord>>, String> dataAndCheckpoint = source.fetchNewData(
-        resumeCheckpointStr, cfg.maxInputBytes);
+        resumeCheckpointStr, cfg.sourceLimit);
 
     if (!dataAndCheckpoint.getKey().isPresent()) {
       log.info("No new data, nothing to commit.. ");
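The rename from `maxInputBytes` to `sourceLimit` above pairs with this commit's KafkaSource change, which spreads the overall source limit across partitions round-robin so one skewed partition cannot consume the whole budget. The even-split-with-remainder idea can be sketched as follows (illustrative only; the actual KafkaSource allocation code is not part of this hunk):

```java
import java.util.Arrays;

public class AllocationDemo {

  // Spread sourceLimit as evenly as possible across numPartitions:
  // every partition gets the base share, and the remainder is handed
  // out one unit at a time, starting from partition 0.
  public static long[] allocate(long sourceLimit, int numPartitions) {
    long[] quotas = new long[numPartitions];
    long base = sourceLimit / numPartitions;
    long remainder = sourceLimit % numPartitions;
    for (int i = 0; i < numPartitions; i++) {
      quotas[i] = base + (i < remainder ? 1 : 0);
    }
    return quotas;
  }

  public static void main(String[] args) {
    // A limit of 10 over 3 partitions: no partition differs from
    // another by more than a single unit.
    System.out.println(Arrays.toString(allocate(10, 3)));
  }
}
```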
@@ -222,7 +184,7 @@ public class HoodieDeltaStreamer implements Serializable {
     });
 
     // Perform the write
-    HoodieWriteConfig hoodieCfg = getHoodieClientConfig(cfg.hoodieClientProps);
+    HoodieWriteConfig hoodieCfg = getHoodieClientConfig();
     HoodieWriteClient client = new HoodieWriteClient<>(jssc, hoodieCfg);
     String commitTime = client.startCommit();
     log.info("Starting commit : " + commitTime);
@@ -232,6 +194,8 @@ public class HoodieDeltaStreamer implements Serializable {
       writeStatusRDD = client.insert(records, commitTime);
     } else if (cfg.operation == Operation.UPSERT) {
       writeStatusRDD = client.upsert(records, commitTime);
+    } else if (cfg.operation == Operation.BULK_INSERT) {
+      writeStatusRDD = client.bulkInsert(records, commitTime);
     } else {
       throw new HoodieDeltaStreamerException("Unknown operation :" + cfg.operation);
    }
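With `BULK_INSERT` added above, the write path becomes a three-way dispatch over the `Operation` enum. A stripped-down sketch of that dispatch (the string return values stand in for the real `HoodieWriteClient` calls):

```java
public class OperationDemo {

  // Mirrors the diff's Operation enum after the change.
  public enum Operation { UPSERT, INSERT, BULK_INSERT }

  // Stand-in for the insert/upsert/bulkInsert branch in sync();
  // the real code invokes the corresponding HoodieWriteClient method.
  public static String dispatch(Operation op) {
    switch (op) {
      case INSERT:
        return "client.insert";
      case UPSERT:
        return "client.upsert";
      case BULK_INSERT:
        return "client.bulkInsert";
      default:
        throw new IllegalArgumentException("Unknown operation :" + op);
    }
  }

  public static void main(String[] args) {
    // JCommander's OperationConvertor does the same valueOf conversion
    // when parsing the --op command line flag.
    System.out.println(dispatch(Operation.valueOf("BULK_INSERT")));
  }
}
```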
@@ -245,157 +209,84 @@ public class HoodieDeltaStreamer implements Serializable {
     if (success) {
       log.info("Commit " + commitTime + " successful!");
       // TODO(vc): Kick off hive sync from here.
 
     } else {
       log.info("Commit " + commitTime + " failed!");
     }
     client.close();
   }
 
-  private HoodieWriteConfig getHoodieClientConfig(String hoodieClientCfgPath) throws Exception {
+  private HoodieWriteConfig getHoodieClientConfig() throws Exception {
     return HoodieWriteConfig.newBuilder().combineInput(true, true).withPath(cfg.targetBasePath)
-        .withAutoCommit(false).withCompactionConfig(HoodieCompactionConfig.newBuilder()
-            .withPayloadClass(
-                OverwriteWithLatestAvroPayload
-                    .class
-                    .getName()).build())
+        .withAutoCommit(false)
         .withSchema(schemaProvider.getTargetSchema().toString())
-        .forTable(cfg.targetTableName).withIndexConfig(
-            HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BLOOM).build())
-        .fromInputStream(fs.open(new Path(hoodieClientCfgPath))).build();
+        .withCompactionConfig(HoodieCompactionConfig.newBuilder().withPayloadClass(cfg.payloadClassName).build())
+        .forTable(cfg.targetTableName)
+        .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BLOOM).build())
+        .withProps(props).build();
   }
 
-  private enum Operation {
-    UPSERT, INSERT
+  public enum Operation {
+    UPSERT, INSERT, BULK_INSERT
   }
 
   private class OperationConvertor implements IStringConverter<Operation> {
 
     @Override
     public Operation convert(String value) throws ParameterException {
       return Operation.valueOf(value);
     }
   }
 
-  private class SourceFormatConvertor implements IStringConverter<SourceDataFormat> {
-
-    @Override
-    public SourceDataFormat convert(String value) throws ParameterException {
-      return SourceDataFormat.valueOf(value);
-    }
-  }
-
   public static class Config implements Serializable {
 
-    /**
-     * TARGET CONFIGS
-     **/
-    @Parameter(names = {
-        "--target-base-path"}, description = "base path for the target hoodie dataset", required = true)
+    @Parameter(names = {"--target-base-path"}, description = "base path for the target hoodie dataset. "
+        + "(Will be created if did not exist first time around. If exists, expected to be a hoodie dataset)",
+        required = true)
     public String targetBasePath;
 
     // TODO: How to obtain hive configs to register?
-    @Parameter(names = {
-        "--target-table"}, description = "name of the target table in Hive", required = true)
+    @Parameter(names = {"--target-table"}, description = "name of the target table in Hive", required = true)
     public String targetTableName;
 
-    @Parameter(names = {"--hoodie-client-config"}, description =
-        "path to properties file on localfs or "
-            + "dfs, with hoodie client config. "
-            + "Sane defaults"
-            + "are used, but recommend use to "
-            + "provide basic things like metrics "
-            + "endpoints, hive configs etc")
-    public String hoodieClientProps = null;
+    @Parameter(names = {"--props"}, description = "path to properties file on localfs or dfs, with configurations for "
+        + "hoodie client, schema provider, key generator and data source. For hoodie client props, sane defaults are "
+        + "used, but recommend use to provide basic things like metrics endpoints, hive configs etc. For sources, refer"
+        + "to individual classes, for supported properties.")
+    public String propsFilePath =
+        "file://" + System.getProperty("user.dir") + "/src/test/resources/delta-streamer-config/dfs-source.properties";
 
-    /**
+    @Parameter(names = {"--source-class"}, description = "Subclass of com.uber.hoodie.utilities.sources to read data. "
|
||||||
* SOURCE CONFIGS
|
+ "Built-in options: com.uber.hoodie.utilities.sources.{JsonDFSSource (default), AvroDFSSource, "
|
||||||
**/
|
+ "JsonKafkaSource, AvroKafkaSource, HiveIncrPullSource}")
|
||||||
@Parameter(names = {"--source-class"}, description =
|
public String sourceClassName = JsonDFSSource.class.getName();
|
||||||
"subclass of com.uber.hoodie.utilities.sources"
|
|
||||||
+ ".Source to use to read data. "
|
|
||||||
+ "built-in options: com.uber.hoodie.utilities"
|
|
||||||
+ ".common.{DFSSource (default), KafkaSource, "
|
|
||||||
+ "HiveIncrPullSource}")
|
|
||||||
public String sourceClassName = DFSSource.class.getName();
|
|
||||||
|
|
||||||
@Parameter(names = {"--source-config"}, description =
|
@Parameter(names = {"--source-ordering-field"}, description = "Field within source record to decide how"
|
||||||
"path to properties file on localfs or dfs, with "
|
+ " to break ties between records with same key in input data. Default: 'ts' holding unix timestamp of record")
|
||||||
+ "source configs. "
|
|
||||||
+ "For list of acceptable properties, refer "
|
|
||||||
+ "the source class", required = true)
|
|
||||||
public String sourceConfigProps = null;
|
|
||||||
|
|
||||||
@Parameter(names = {"--source-format"}, description =
|
|
||||||
"Format of data in source, JSON (default), Avro. "
|
|
||||||
+ "All source data is "
|
|
||||||
+ "converted to Avro using the provided "
|
|
||||||
+ "schema in any case", converter = SourceFormatConvertor.class)
|
|
||||||
public SourceDataFormat sourceFormat = SourceDataFormat.JSON;
|
|
||||||
|
|
||||||
@Parameter(names = {"--source-ordering-field"}, description =
|
|
||||||
"Field within source record to decide how"
|
|
||||||
+ " to break ties between "
|
|
||||||
+ " records with same key in input "
|
|
||||||
+ "data. Default: 'ts' holding unix "
|
|
||||||
+ "timestamp of record")
|
|
||||||
public String sourceOrderingField = "ts";
|
public String sourceOrderingField = "ts";
|
||||||
|
|
||||||
@Parameter(names = {"--key-generator-class"}, description =
|
@Parameter(names = {"--key-generator-class"}, description = "Subclass of com.uber.hoodie.KeyGenerator "
|
||||||
"Subclass of com.uber.hoodie.utilities"
|
+ "to generate a HoodieKey from the given avro record. Built in: SimpleKeyGenerator (uses "
|
||||||
+ ".common.KeyExtractor to generate"
|
+ "provided field names as recordkey & partitionpath. Nested fields specified via dot notation, e.g: a.b.c)")
|
||||||
+ "a HoodieKey from the given avro "
|
|
||||||
+ "record. Built in: SimpleKeyGenerator"
|
|
||||||
+ " (Uses provided field names as "
|
|
||||||
+ "recordkey & partitionpath. "
|
|
||||||
+ "Nested fields specified via dot "
|
|
||||||
+ "notation, e.g: a.b.c)")
|
|
||||||
public String keyGeneratorClass = SimpleKeyGenerator.class.getName();
|
public String keyGeneratorClass = SimpleKeyGenerator.class.getName();
|
||||||
|
|
||||||
@Parameter(names = {"--key-generator-config"}, description =
|
@Parameter(names = {"--payload-class"}, description = "subclass of HoodieRecordPayload, that works off "
|
||||||
"Path to properties file on localfs or "
|
+ "a GenericRecord. Implement your own, if you want to do something other than overwriting existing value")
|
||||||
+ "dfs, with KeyGenerator configs. "
|
|
||||||
+ "For list of acceptable properites, "
|
|
||||||
+ "refer the KeyGenerator class",
|
|
||||||
required = true)
|
|
||||||
public String keyGeneratorProps = null;
|
|
||||||
|
|
||||||
@Parameter(names = {"--payload-class"}, description =
|
|
||||||
"subclass of HoodieRecordPayload, that works off "
|
|
||||||
+ "a GenericRecord. "
|
|
||||||
+ "Default: SourceWrapperPayload. Implement "
|
|
||||||
+ "your own, if you want to do something "
|
|
||||||
+ "other than overwriting existing value")
|
|
||||||
public String payloadClassName = OverwriteWithLatestAvroPayload.class.getName();
|
public String payloadClassName = OverwriteWithLatestAvroPayload.class.getName();
|
||||||
|
|
||||||
@Parameter(names = {"--schemaprovider-class"}, description =
|
@Parameter(names = {"--schemaprovider-class"}, description = "subclass of com.uber.hoodie.utilities.schema"
|
||||||
"subclass of com.uber.hoodie.utilities"
|
+ ".SchemaProvider to attach schemas to input & target table data, built in options: FilebasedSchemaProvider")
|
||||||
+ ".schema.SchemaProvider "
|
|
||||||
+ "to attach schemas to input & target"
|
|
||||||
+ " table data, built in options: "
|
|
||||||
+ "FilebasedSchemaProvider")
|
|
||||||
public String schemaProviderClassName = FilebasedSchemaProvider.class.getName();
|
public String schemaProviderClassName = FilebasedSchemaProvider.class.getName();
|
||||||
|
|
||||||
@Parameter(names = {"--schemaprovider-config"}, description =
|
@Parameter(names = {"--source-limit"}, description = "Maximum amount of data to read from source. "
|
||||||
"path to properties file on localfs or dfs, with schema "
|
+ "Default: No limit For e.g: DFSSource => max bytes to read, KafkaSource => max events to read")
|
||||||
+ "configs. For list of acceptable properties, refer "
|
public long sourceLimit = Long.MAX_VALUE;
|
||||||
+ "the schema provider class", required = true)
|
|
||||||
public String schemaProviderConfigProps = null;
|
|
||||||
|
|
||||||
|
@Parameter(names = {"--op"}, description = "Takes one of these values : UPSERT (default), INSERT (use when input "
|
||||||
/**
|
+ "is purely new data/inserts to gain speed)",
|
||||||
* Other configs
|
|
||||||
**/
|
|
||||||
@Parameter(names = {
|
|
||||||
"--max-input-bytes"}, description = "Maximum number of bytes to read from source. Default: 1TB")
|
|
||||||
public long maxInputBytes = 1L * 1024 * 1024 * 1024 * 1024;
|
|
||||||
|
|
||||||
@Parameter(names = {"--op"}, description =
|
|
||||||
"Takes one of these values : UPSERT (default), INSERT (use when input "
|
|
||||||
+ "is purely new data/inserts to gain speed)",
|
|
||||||
converter = OperationConvertor.class)
|
converter = OperationConvertor.class)
|
||||||
public Operation operation = Operation.UPSERT;
|
public Operation operation = Operation.UPSERT;
|
||||||
|
|
||||||
|
@Parameter(names = {"--spark-master"}, description = "spark master to use.")
|
||||||
|
public String sparkMaster = "local[2]";
|
||||||
|
|
||||||
@Parameter(names = {"--help", "-h"}, help = true)
|
@Parameter(names = {"--help", "-h"}, help = true)
|
||||||
public Boolean help = false;
|
public Boolean help = false;
|
||||||
@@ -408,6 +299,8 @@ public class HoodieDeltaStreamer implements Serializable {
|
|||||||
cmd.usage();
|
cmd.usage();
|
||||||
System.exit(1);
|
System.exit(1);
|
||||||
}
|
}
|
||||||
new HoodieDeltaStreamer(cfg).sync();
|
|
||||||
|
JavaSparkContext jssc = UtilHelpers.buildSparkContext("delta-streamer-" + cfg.targetTableName, cfg.sparkMaster);
|
||||||
|
new HoodieDeltaStreamer(cfg, jssc).sync();
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
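Putting the reworked flags together, an invocation of the tool could look roughly like the sketch below. This is an illustration only: the jar name, the fully-qualified main class, and every path are assumptions, not taken from this commit; only the flag names come from the `Config` class above.

```sh
# Hypothetical launch of the reworked delta streamer.
# Main-class package and jar/paths are assumed, not from the diff.
spark-submit \
  --class com.uber.hoodie.utilities.HoodieDeltaStreamer \
  hoodie-utilities.jar \
  --target-base-path file:///tmp/hoodie/stock_ticks \
  --target-table stock_ticks \
  --props file:///path/to/dfs-source.properties \
  --source-class com.uber.hoodie.utilities.sources.JsonDFSSource \
  --source-ordering-field ts \
  --op BULK_INSERT \
  --spark-master local[2]
```

Note that the single `--props` file now carries hoodie client, schema provider, key generator and source configuration, replacing the separate `--hoodie-client-config`, `--source-config`, `--key-generator-config` and `--schemaprovider-config` files removed in this diff.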
TimestampBasedKeyGenerator.java

```diff
@@ -21,6 +21,7 @@ package com.uber.hoodie.utilities.keygen;
 import com.uber.hoodie.DataSourceUtils;
 import com.uber.hoodie.SimpleKeyGenerator;
 import com.uber.hoodie.common.model.HoodieKey;
+import com.uber.hoodie.common.util.TypedProperties;
 import com.uber.hoodie.exception.HoodieNotSupportedException;
 import com.uber.hoodie.utilities.exception.HoodieDeltaStreamerException;
 import java.io.Serializable;
@@ -30,7 +31,6 @@ import java.util.Arrays;
 import java.util.Date;
 import java.util.TimeZone;
 import org.apache.avro.generic.GenericRecord;
-import org.apache.commons.configuration.PropertiesConfiguration;
 
 /**
  * Key generator, that relies on timestamps for partitioning field. Still picks record key by name.
@@ -64,7 +64,7 @@ public class TimestampBasedKeyGenerator extends SimpleKeyGenerator {
         + ".dateformat";
   }
 
-  public TimestampBasedKeyGenerator(PropertiesConfiguration config) {
+  public TimestampBasedKeyGenerator(TypedProperties config) {
     super(config);
     DataSourceUtils.checkRequiredProperties(config,
         Arrays.asList(Config.TIMESTAMP_TYPE_FIELD_PROP, Config.TIMESTAMP_OUTPUT_DATE_FORMAT_PROP));
```
FilebasedSchemaProvider.java

```diff
@@ -20,14 +20,14 @@ package com.uber.hoodie.utilities.schema;
 
 import com.uber.hoodie.DataSourceUtils;
 import com.uber.hoodie.common.util.FSUtils;
+import com.uber.hoodie.common.util.TypedProperties;
 import com.uber.hoodie.exception.HoodieIOException;
 import java.io.IOException;
 import java.util.Arrays;
 import org.apache.avro.Schema;
-import org.apache.commons.configuration.PropertiesConfiguration;
-import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
+import org.apache.spark.api.java.JavaSparkContext;
 
 /**
  * A simple schema provider, that reads off files on DFS
@@ -37,33 +37,30 @@ public class FilebasedSchemaProvider extends SchemaProvider {
   /**
    * Configs supported
    */
-  static class Config {
-    private static final String SOURCE_SCHEMA_FILE_PROP = "hoodie.deltastreamer.filebased"
-        + ".schemaprovider.source.schema"
-        + ".file";
-    private static final String TARGET_SCHEMA_FILE_PROP = "hoodie.deltastreamer.filebased"
-        + ".schemaprovider.target.schema"
-        + ".file";
+  public static class Config {
+
+    private static final String SOURCE_SCHEMA_FILE_PROP = "hoodie.deltastreamer.schemaprovider"
+        + ".source.schema.file";
+    private static final String TARGET_SCHEMA_FILE_PROP = "hoodie.deltastreamer.schemaprovider"
+        + ".target.schema.file";
   }
 
   private final FileSystem fs;
 
   private final Schema sourceSchema;
 
-  private final Schema targetSchema;
+  private Schema targetSchema;
 
-  public FilebasedSchemaProvider(PropertiesConfiguration config) {
-    super(config);
-    this.fs = FSUtils.getFs(config.getBasePath(), new Configuration());
-    DataSourceUtils.checkRequiredProperties(config,
-        Arrays.asList(Config.SOURCE_SCHEMA_FILE_PROP, Config.TARGET_SCHEMA_FILE_PROP));
+  public FilebasedSchemaProvider(TypedProperties props, JavaSparkContext jssc) {
+    super(props, jssc);
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.SOURCE_SCHEMA_FILE_PROP));
+    this.fs = FSUtils.getFs(props.getString(Config.SOURCE_SCHEMA_FILE_PROP), jssc.hadoopConfiguration());
     try {
       this.sourceSchema = new Schema.Parser().parse(
-          fs.open(new Path(config.getString(Config.SOURCE_SCHEMA_FILE_PROP))));
-      this.targetSchema = new Schema.Parser().parse(
-          fs.open(new Path(config.getString(Config.TARGET_SCHEMA_FILE_PROP))));
+          fs.open(new Path(props.getString(Config.SOURCE_SCHEMA_FILE_PROP))));
+      if (props.containsKey(Config.TARGET_SCHEMA_FILE_PROP)) {
+        this.targetSchema = new Schema.Parser().parse(
+            fs.open(new Path(props.getString(Config.TARGET_SCHEMA_FILE_PROP))));
+      }
     } catch (IOException ioe) {
       throw new HoodieIOException("Error reading schema", ioe);
     }
@@ -76,6 +73,10 @@ public class FilebasedSchemaProvider extends SchemaProvider {
 
   @Override
   public Schema getTargetSchema() {
-    return targetSchema;
+    if (targetSchema != null) {
+      return targetSchema;
+    } else {
+      return super.getTargetSchema();
+    }
   }
 }
```
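For reference, the renamed schema-provider keys could be wired up in a properties file along these lines. The property names come from the `Config` class in this commit; the schema file paths are made-up placeholders. The target schema is now optional and falls back to the source schema when absent.

```properties
# FilebasedSchemaProvider configuration (paths are placeholders).
hoodie.deltastreamer.schemaprovider.source.schema.file=file:///path/to/source.avsc
# Optional; when omitted, the source schema is also used as the target schema.
hoodie.deltastreamer.schemaprovider.target.schema.file=file:///path/to/target.avsc
```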
SchemaProvider.java

```diff
@@ -18,22 +18,29 @@
 
 package com.uber.hoodie.utilities.schema;
 
+import com.uber.hoodie.common.util.TypedProperties;
 import java.io.Serializable;
 import org.apache.avro.Schema;
-import org.apache.commons.configuration.PropertiesConfiguration;
+import org.apache.spark.api.java.JavaSparkContext;
 
 /**
  * Class to provide schema for reading data and also writing into a Hoodie table
 */
 public abstract class SchemaProvider implements Serializable {
 
-  protected PropertiesConfiguration config;
+  protected TypedProperties config;
 
-  protected SchemaProvider(PropertiesConfiguration config) {
-    this.config = config;
+  protected JavaSparkContext jssc;
+
+  protected SchemaProvider(TypedProperties props, JavaSparkContext jssc) {
+    this.config = props;
+    this.jssc = jssc;
   }
 
   public abstract Schema getSourceSchema();
 
-  public abstract Schema getTargetSchema();
+  public Schema getTargetSchema() {
+    // by default, use source schema as target for hoodie dataset as well
+    return getSourceSchema();
+  }
 }
```
SchemaRegistryProvider.java (new file)

```diff
@@ -0,0 +1,71 @@
+/*
+ * Copyright (c) 2017 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *          http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ *
+ */
+
+package com.uber.hoodie.utilities.schema;
+
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.uber.hoodie.DataSourceUtils;
+import com.uber.hoodie.common.util.TypedProperties;
+import com.uber.hoodie.exception.HoodieIOException;
+import java.io.IOException;
+import java.net.URL;
+import java.util.Arrays;
+import org.apache.avro.Schema;
+import org.apache.spark.api.java.JavaSparkContext;
+
+/**
+ * Obtains latest schema from the Confluent/Kafka schema-registry
+ *
+ * https://github.com/confluentinc/schema-registry
+ */
+public class SchemaRegistryProvider extends SchemaProvider {
+
+  /**
+   * Configs supported
+   */
+  public static class Config {
+
+    private static final String SCHEMA_REGISTRY_URL_PROP = "hoodie.deltastreamer.schemaprovider.registry.url";
+  }
+
+  private final Schema schema;
+
+  private String fetchSchemaFromRegistry(String registryUrl) throws IOException {
+    URL registry = new URL(registryUrl);
+    ObjectMapper mapper = new ObjectMapper();
+    JsonNode node = mapper.readTree(registry.openStream());
+    return node.get("schema").asText();
+  }
+
+  public SchemaRegistryProvider(TypedProperties props, JavaSparkContext jssc) {
+    super(props, jssc);
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.SCHEMA_REGISTRY_URL_PROP));
+    String registryUrl = props.getString(Config.SCHEMA_REGISTRY_URL_PROP);
+    try {
+      this.schema = new Schema.Parser().parse(fetchSchemaFromRegistry(registryUrl));
+    } catch (IOException ioe) {
+      throw new HoodieIOException("Error reading schema from registry :" + registryUrl, ioe);
+    }
+  }
+
+  @Override
+  public Schema getSourceSchema() {
+    return schema;
+  }
+}
```
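The new registry-backed provider only needs a URL that returns a JSON body with a `schema` field. A sketch of the corresponding property, with a placeholder host and subject name (the URL shape follows the Confluent schema-registry REST convention; the property key is from the `Config` class in this commit):

```properties
# SchemaRegistryProvider configuration (host and subject are placeholders).
hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/my-topic-value/versions/latest
```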
AvroConvertor.java

```diff
@@ -27,7 +27,7 @@ import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericRecord;
 
 /**
- * Convert a variety of {@link SourceDataFormat} into Avro GenericRecords. Has a bunch of lazy
+ * Convert a variety of datum into Avro GenericRecords. Has a bunch of lazy
  * fields to circumvent issues around serializing these objects from driver to executors
  */
 public class AvroConvertor implements Serializable {
@@ -82,6 +82,10 @@ public class AvroConvertor implements Serializable {
     return jsonConverter.convert(json);
   }
 
+  public Schema getSchema() {
+    return new Schema.Parser().parse(schemaStr);
+  }
+
   public GenericRecord fromAvroBinary(byte[] avroBinary) throws IOException {
     initSchema();
```
AvroDFSSource.java (new file)

```diff
@@ -0,0 +1,47 @@
+/*
+ * Copyright (c) 2017 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *          http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ *
+ */
+
+package com.uber.hoodie.utilities.sources;
+
+import com.uber.hoodie.common.util.TypedProperties;
+import com.uber.hoodie.utilities.schema.SchemaProvider;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.mapred.AvroKey;
+import org.apache.avro.mapreduce.AvroKeyInputFormat;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+/**
+ * DFS Source that reads avro data
+ */
+public class AvroDFSSource extends DFSSource {
+
+  public AvroDFSSource(TypedProperties props, JavaSparkContext sparkContext, SchemaProvider schemaProvider) {
+    super(props, sparkContext, schemaProvider);
+  }
+
+  @Override
+  protected JavaRDD<GenericRecord> fromFiles(AvroConvertor convertor, String pathStr) {
+    JavaPairRDD<AvroKey, NullWritable> avroRDD = sparkContext.newAPIHadoopFile(pathStr,
+        AvroKeyInputFormat.class, AvroKey.class, NullWritable.class,
+        sparkContext.hadoopConfiguration());
+    return avroRDD.keys().map(r -> ((GenericRecord) r.datum()));
+  }
+}
```
AvroKafkaSource.java (new file)

```diff
@@ -0,0 +1,47 @@
+/*
+ * Copyright (c) 2017 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *          http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ *
+ */
+
+package com.uber.hoodie.utilities.sources;
+
+import com.uber.hoodie.common.util.TypedProperties;
+import com.uber.hoodie.utilities.schema.SchemaProvider;
+import io.confluent.kafka.serializers.KafkaAvroDecoder;
+import kafka.serializer.StringDecoder;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.streaming.kafka.KafkaUtils;
+import org.apache.spark.streaming.kafka.OffsetRange;
+
+/**
+ * Reads avro serialized Kafka data, based on the confluent schema-registry
+ */
+public class AvroKafkaSource extends KafkaSource {
+
+  public AvroKafkaSource(TypedProperties props, JavaSparkContext sparkContext, SchemaProvider schemaProvider) {
+    super(props, sparkContext, schemaProvider);
+  }
+
+  @Override
+  protected JavaRDD<GenericRecord> toAvroRDD(OffsetRange[] offsetRanges, AvroConvertor avroConvertor) {
+    JavaRDD<GenericRecord> recordRDD = KafkaUtils
+        .createRDD(sparkContext, String.class, Object.class, StringDecoder.class, KafkaAvroDecoder.class, kafkaParams,
+            offsetRanges).values().map(obj -> (GenericRecord) obj);
+    return recordRDD;
+  }
+}
```
@@ -20,8 +20,10 @@ package com.uber.hoodie.utilities.sources;
|
|||||||
|
|
||||||
import com.uber.hoodie.DataSourceUtils;
|
import com.uber.hoodie.DataSourceUtils;
|
||||||
import com.uber.hoodie.common.util.FSUtils;
|
import com.uber.hoodie.common.util.FSUtils;
|
||||||
|
import com.uber.hoodie.common.util.TypedProperties;
|
||||||
|
import com.uber.hoodie.common.util.collection.ImmutablePair;
|
||||||
|
import com.uber.hoodie.common.util.collection.Pair;
|
||||||
import com.uber.hoodie.exception.HoodieIOException;
|
import com.uber.hoodie.exception.HoodieIOException;
|
||||||
import com.uber.hoodie.exception.HoodieNotSupportedException;
|
|
||||||
import com.uber.hoodie.utilities.schema.SchemaProvider;
|
import com.uber.hoodie.utilities.schema.SchemaProvider;
|
||||||
import java.io.IOException;
|
import java.io.IOException;
|
||||||
import java.util.ArrayList;
|
import java.util.ArrayList;
|
||||||
@@ -30,31 +32,23 @@ import java.util.List;
|
|||||||
import java.util.Optional;
|
import java.util.Optional;
|
||||||
import java.util.stream.Collectors;
|
import java.util.stream.Collectors;
|
||||||
import org.apache.avro.generic.GenericRecord;
|
import org.apache.avro.generic.GenericRecord;
|
||||||
import org.apache.avro.mapred.AvroKey;
|
|
||||||
import org.apache.avro.mapreduce.AvroKeyInputFormat;
|
|
||||||
import org.apache.commons.configuration.PropertiesConfiguration;
|
|
||||||
import org.apache.commons.lang3.tuple.ImmutablePair;
|
|
||||||
import org.apache.commons.lang3.tuple.Pair;
|
|
||||||
import org.apache.hadoop.fs.FileStatus;
|
import org.apache.hadoop.fs.FileStatus;
|
||||||
import org.apache.hadoop.fs.FileSystem;
|
import org.apache.hadoop.fs.FileSystem;
|
||||||
import org.apache.hadoop.fs.LocatedFileStatus;
|
import org.apache.hadoop.fs.LocatedFileStatus;
|
||||||
import org.apache.hadoop.fs.Path;
|
import org.apache.hadoop.fs.Path;
|
||||||
import org.apache.hadoop.fs.RemoteIterator;
|
import org.apache.hadoop.fs.RemoteIterator;
|
||||||
import org.apache.hadoop.io.NullWritable;
|
|
||||||
import org.apache.spark.api.java.JavaPairRDD;
|
|
||||||
import org.apache.spark.api.java.JavaRDD;
|
import org.apache.spark.api.java.JavaRDD;
|
||||||
import org.apache.spark.api.java.JavaSparkContext;
|
import org.apache.spark.api.java.JavaSparkContext;
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Source to read data from a given DFS directory structure, incrementally
|
* Source to read data from a given DFS directory structure, incrementally
|
||||||
*/
|
*/
|
||||||
public class DFSSource extends Source {
|
public abstract class DFSSource extends Source {
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Configs supported
|
* Configs supported
|
||||||
*/
|
*/
|
||||||
static class Config {
|
static class Config {
|
||||||
|
|
||||||
private static final String ROOT_INPUT_PATH_PROP = "hoodie.deltastreamer.source.dfs.root";
|
private static final String ROOT_INPUT_PATH_PROP = "hoodie.deltastreamer.source.dfs.root";
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -62,50 +56,23 @@ public class DFSSource extends Source {
|
|||||||
|
|
||||||
private final transient FileSystem fs;
|
private final transient FileSystem fs;
|
||||||
|
|
||||||
public DFSSource(PropertiesConfiguration config, JavaSparkContext sparkContext,
|
public DFSSource(TypedProperties props, JavaSparkContext sparkContext, SchemaProvider schemaProvider) {
|
||||||
SourceDataFormat dataFormat, SchemaProvider schemaProvider) {
|
super(props, sparkContext, schemaProvider);
|
||||||
super(config, sparkContext, dataFormat, schemaProvider);
|
DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.ROOT_INPUT_PATH_PROP));
|
||||||
this.fs = FSUtils.getFs(config.getBasePath(), sparkContext.hadoopConfiguration());
|
this.fs = FSUtils.getFs(props.getString(Config.ROOT_INPUT_PATH_PROP), sparkContext.hadoopConfiguration());
|
||||||
DataSourceUtils.checkRequiredProperties(config, Arrays.asList(Config.ROOT_INPUT_PATH_PROP));
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
public static JavaRDD<GenericRecord> fromAvroFiles(final AvroConvertor convertor, String pathStr,
|
|
||||||
JavaSparkContext sparkContext) {
|
|
||||||
JavaPairRDD<AvroKey, NullWritable> avroRDD = sparkContext.newAPIHadoopFile(pathStr,
|
|
||||||
AvroKeyInputFormat.class, AvroKey.class, NullWritable.class,
|
|
||||||
sparkContext.hadoopConfiguration());
|
|
||||||
return avroRDD.keys().map(r -> ((GenericRecord) r.datum()));
|
|
||||||
}
|
|
||||||
|
|
||||||
-  public static JavaRDD<GenericRecord> fromJsonFiles(final AvroConvertor convertor, String pathStr,
-      JavaSparkContext sparkContext) {
-    return sparkContext.textFile(pathStr).map((String j) -> {
-      return convertor.fromJson(j);
-    });
-  }
-
-  public static JavaRDD<GenericRecord> fromFiles(SourceDataFormat dataFormat,
-      final AvroConvertor convertor, String pathStr, JavaSparkContext sparkContext) {
-    if (dataFormat == SourceDataFormat.AVRO) {
-      return DFSSource.fromAvroFiles(convertor, pathStr, sparkContext);
-    } else if (dataFormat == SourceDataFormat.JSON) {
-      return DFSSource.fromJsonFiles(convertor, pathStr, sparkContext);
-    } else {
-      throw new HoodieNotSupportedException("Unsupported data format :" + dataFormat);
-    }
-  }
   }
+
+  protected abstract JavaRDD<GenericRecord> fromFiles(final AvroConvertor convertor, String pathStr);

   @Override
   public Pair<Optional<JavaRDD<GenericRecord>>, String> fetchNewData(
-      Optional<String> lastCheckpointStr, long maxInputBytes) {
+      Optional<String> lastCheckpointStr, long sourceLimit) {

     try {
       // obtain all eligible files under root folder.
       List<FileStatus> eligibleFiles = new ArrayList<>();
       RemoteIterator<LocatedFileStatus> fitr = fs.listFiles(
-          new Path(config.getString(Config.ROOT_INPUT_PATH_PROP)), true);
+          new Path(props.getString(Config.ROOT_INPUT_PATH_PROP)), true);
       while (fitr.hasNext()) {
         LocatedFileStatus fileStatus = fitr.next();
         if (fileStatus.isDirectory() || IGNORE_FILEPREFIX_LIST.stream().filter(
@@ -130,13 +97,14 @@ public class DFSSource extends Source {
           continue;
         }

-        maxModificationTime = f.getModificationTime();
-        currentBytes += f.getLen();
-        filteredFiles.add(f);
-        if (currentBytes >= maxInputBytes) {
+        if (currentBytes + f.getLen() >= sourceLimit) {
           // we have enough data, we are done
           break;
         }
+        maxModificationTime = f.getModificationTime();
+        currentBytes += f.getLen();
+        filteredFiles.add(f);
       }

       // no data to read
@@ -153,7 +121,7 @@ public class DFSSource extends Source {
       final AvroConvertor avroConvertor = new AvroConvertor(schemaStr);

       return new ImmutablePair<>(
-          Optional.of(DFSSource.fromFiles(dataFormat, avroConvertor, pathStr, sparkContext)),
+          Optional.of(fromFiles(avroConvertor, pathStr)),
           String.valueOf(maxModificationTime));
     } catch (IOException ioe) {
       throw new HoodieIOException(
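The reordered loop in the DFSSource hunk above matters: the old code added a file and only then checked the byte budget, so a batch could overshoot `sourceLimit` by one file; the new code checks `currentBytes + f.getLen() >= sourceLimit` before adding. A standalone sketch of the same budgeting logic (the `FileInfo` type here is a hypothetical stand-in for the Hadoop `FileStatus` listing, not part of the Hoodie codebase):

```java
import java.util.ArrayList;
import java.util.List;

public class SourceLimitSketch {

  // Minimal stand-in for a file listing entry (illustration only).
  static class FileInfo {
    final String path;
    final long len;

    FileInfo(String path, long len) {
      this.path = path;
      this.len = len;
    }
  }

  // Select files in order, stopping before the next file would reach the byte budget.
  static List<FileInfo> selectWithinLimit(List<FileInfo> files, long sourceLimit) {
    List<FileInfo> selected = new ArrayList<>();
    long currentBytes = 0;
    for (FileInfo f : files) {
      if (currentBytes + f.len >= sourceLimit) {
        break; // we have enough data, we are done
      }
      currentBytes += f.len;
      selected.add(f);
    }
    return selected;
  }

  public static void main(String[] args) {
    List<FileInfo> files = new ArrayList<>();
    files.add(new FileInfo("a", 40));
    files.add(new FileInfo("b", 40));
    files.add(new FileInfo("c", 40));
    // With a 100-byte budget, only "a" and "b" fit; "c" would push the batch to 120.
    System.out.println(selectWithinLimit(files, 100).size()); // prints 2
  }
}
```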
@@ -20,6 +20,9 @@ package com.uber.hoodie.utilities.sources;

 import com.uber.hoodie.DataSourceUtils;
 import com.uber.hoodie.common.util.FSUtils;
+import com.uber.hoodie.common.util.TypedProperties;
+import com.uber.hoodie.common.util.collection.ImmutablePair;
+import com.uber.hoodie.common.util.collection.Pair;
 import com.uber.hoodie.exception.HoodieIOException;
 import com.uber.hoodie.utilities.schema.SchemaProvider;
 import java.io.IOException;
@@ -30,14 +33,15 @@ import java.util.List;
 import java.util.Optional;
 import java.util.stream.Collectors;
 import org.apache.avro.generic.GenericRecord;
-import org.apache.commons.configuration.PropertiesConfiguration;
-import org.apache.commons.lang3.tuple.ImmutablePair;
-import org.apache.commons.lang3.tuple.Pair;
+import org.apache.avro.mapred.AvroKey;
+import org.apache.avro.mapreduce.AvroKeyInputFormat;
 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.NullWritable;
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;

@@ -69,12 +73,12 @@ public class HiveIncrPullSource extends Source {
     private static final String ROOT_INPUT_PATH_PROP = "hoodie.deltastreamer.source.incrpull.root";
   }

-  public HiveIncrPullSource(PropertiesConfiguration config, JavaSparkContext sparkContext,
-      SourceDataFormat dataFormat, SchemaProvider schemaProvider) {
-    super(config, sparkContext, dataFormat, schemaProvider);
-    this.fs = FSUtils.getFs(config.getBasePath(), sparkContext.hadoopConfiguration());
-    DataSourceUtils.checkRequiredProperties(config, Arrays.asList(Config.ROOT_INPUT_PATH_PROP));
-    this.incrPullRootPath = config.getString(Config.ROOT_INPUT_PATH_PROP);
+  public HiveIncrPullSource(TypedProperties props, JavaSparkContext sparkContext,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, schemaProvider);
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.ROOT_INPUT_PATH_PROP));
+    this.incrPullRootPath = props.getString(Config.ROOT_INPUT_PATH_PROP);
+    this.fs = FSUtils.getFs(incrPullRootPath, sparkContext.hadoopConfiguration());
   }

   /**
@@ -110,7 +114,7 @@ public class HiveIncrPullSource extends Source {

   @Override
   public Pair<Optional<JavaRDD<GenericRecord>>, String> fetchNewData(
-      Optional<String> lastCheckpointStr, long maxInputBytes) {
+      Optional<String> lastCheckpointStr, long sourceLimit) {
     try {
       // find the source commit to pull
       Optional<String> commitToPull = findCommitToPull(lastCheckpointStr);
@@ -125,10 +129,10 @@ public class HiveIncrPullSource extends Source {
           fs.listStatus(new Path(incrPullRootPath, commitToPull.get())));
       String pathStr = commitDeltaFiles.stream().map(f -> f.getPath().toString())
           .collect(Collectors.joining(","));
-      String schemaStr = schemaProvider.getSourceSchema().toString();
-      final AvroConvertor avroConvertor = new AvroConvertor(schemaStr);
-      return new ImmutablePair<>(
-          Optional.of(DFSSource.fromFiles(dataFormat, avroConvertor, pathStr, sparkContext)),
+      JavaPairRDD<AvroKey, NullWritable> avroRDD = sparkContext.newAPIHadoopFile(pathStr,
+          AvroKeyInputFormat.class, AvroKey.class, NullWritable.class,
+          sparkContext.hadoopConfiguration());
+      return new ImmutablePair<>(Optional.of(avroRDD.keys().map(r -> ((GenericRecord) r.datum()))),
           String.valueOf(commitToPull.get()));
     } catch (IOException ioe) {
       throw new HoodieIOException(
@@ -18,12 +18,23 @@

 package com.uber.hoodie.utilities.sources;

+import com.uber.hoodie.common.util.TypedProperties;
+import com.uber.hoodie.utilities.schema.SchemaProvider;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
 /**
- * Format of the data within source.
+ * DFS Source that reads json data
  */
-public enum SourceDataFormat {
-  AVRO, // No conversion needed explicitly to avro
-  JSON, // we will try to convert to avro
-  ROW, // Will be added later, so we can plug/play with spark sources.
-  CUSTOM // the source is responsible for conversion to avro.
+public class JsonDFSSource extends DFSSource {
+
+  public JsonDFSSource(TypedProperties props, JavaSparkContext sparkContext, SchemaProvider schemaProvider) {
+    super(props, sparkContext, schemaProvider);
+  }
+
+  @Override
+  protected JavaRDD<GenericRecord> fromFiles(AvroConvertor convertor, String pathStr) {
+    return sparkContext.textFile(pathStr).map((String j) -> convertor.fromJson(j));
+  }
 }
@@ -0,0 +1,45 @@
+/*
+ * Copyright (c) 2017 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ *
+ */
+
+package com.uber.hoodie.utilities.sources;
+
+import com.uber.hoodie.common.util.TypedProperties;
+import com.uber.hoodie.utilities.schema.SchemaProvider;
+import kafka.serializer.StringDecoder;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.streaming.kafka.KafkaUtils;
+import org.apache.spark.streaming.kafka.OffsetRange;
+
+/**
+ * Read json kafka data
+ */
+public class JsonKafkaSource extends KafkaSource {
+
+  public JsonKafkaSource(TypedProperties properties, JavaSparkContext sparkContext, SchemaProvider schemaProvider) {
+    super(properties, sparkContext, schemaProvider);
+  }
+
+  @Override
+  protected JavaRDD<GenericRecord> toAvroRDD(OffsetRange[] offsetRanges, AvroConvertor avroConvertor) {
+    return KafkaUtils.createRDD(sparkContext, String.class, String.class, StringDecoder.class, StringDecoder.class,
+        kafkaParams, offsetRanges)
+        .values().map(jsonStr -> avroConvertor.fromJson(jsonStr));
+  }
+}
@@ -19,33 +19,26 @@
 package com.uber.hoodie.utilities.sources;

 import com.uber.hoodie.DataSourceUtils;
+import com.uber.hoodie.common.util.TypedProperties;
+import com.uber.hoodie.common.util.collection.ImmutablePair;
+import com.uber.hoodie.common.util.collection.Pair;
 import com.uber.hoodie.exception.HoodieNotSupportedException;
 import com.uber.hoodie.utilities.exception.HoodieDeltaStreamerException;
 import com.uber.hoodie.utilities.schema.SchemaProvider;
-import java.nio.charset.Charset;
 import java.util.Arrays;
 import java.util.Comparator;
 import java.util.HashMap;
 import java.util.HashSet;
-import java.util.List;
 import java.util.Optional;
-import java.util.Spliterator;
-import java.util.Spliterators;
 import java.util.stream.Collectors;
-import java.util.stream.Stream;
-import java.util.stream.StreamSupport;
 import kafka.common.TopicAndPartition;
-import kafka.serializer.DefaultDecoder;
 import org.apache.avro.generic.GenericRecord;
-import org.apache.commons.configuration.PropertiesConfiguration;
-import org.apache.commons.lang3.tuple.ImmutablePair;
-import org.apache.commons.lang3.tuple.Pair;
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.streaming.kafka.KafkaCluster;
-import org.apache.spark.streaming.kafka.KafkaUtils;
+import org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset;
 import org.apache.spark.streaming.kafka.OffsetRange;
 import scala.Predef;
 import scala.collection.JavaConverters;
@@ -59,10 +52,12 @@ import scala.util.Either;
 /**
  * Source to read data from Kafka, incrementally
  */
-public class KafkaSource extends Source {
+public abstract class KafkaSource extends Source {

   private static volatile Logger log = LogManager.getLogger(KafkaSource.class);

+  private static long DEFAULT_MAX_EVENTS_TO_READ = 1000000; // 1M events max
+
   static class CheckpointUtils {

@@ -72,6 +67,9 @@ public class KafkaSource extends Source {
     public static HashMap<TopicAndPartition, KafkaCluster.LeaderOffset> strToOffsets(
         String checkpointStr) {
       HashMap<TopicAndPartition, KafkaCluster.LeaderOffset> offsetMap = new HashMap<>();
+      if (checkpointStr.length() == 0) {
+        return offsetMap;
+      }
       String[] splits = checkpointStr.split(",");
       String topic = splits[0];
       for (int i = 1; i < splits.length; i++) {
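The checkpoint string that `offsetsToStr` now writes from `OffsetRange[]` (format: `topic,partition:offset,partition:offset,...`) is exactly what `strToOffsets` parses back, including the new empty-string guard. A minimal round-trip sketch of the format, using plain maps instead of Kafka's `TopicAndPartition`/`LeaderOffset` types (those are assumed away here):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class KafkaCheckpointSketch {

  // Serialize: topic first, then "partition:offset" pairs, comma separated.
  static String offsetsToStr(String topic, Map<Integer, Long> offsets) {
    return topic + "," + offsets.entrySet().stream()
        .map(e -> e.getKey() + ":" + e.getValue())
        .collect(Collectors.joining(","));
  }

  // Parse the same format back; an empty checkpoint yields an empty map.
  static Map<Integer, Long> strToOffsets(String checkpointStr) {
    Map<Integer, Long> offsetMap = new TreeMap<>();
    if (checkpointStr.length() == 0) {
      return offsetMap;
    }
    String[] splits = checkpointStr.split(",");
    // splits[0] is the topic; the rest are partition:offset pairs.
    for (int i = 1; i < splits.length; i++) {
      String[] kv = splits[i].split(":");
      offsetMap.put(Integer.parseInt(kv[0]), Long.parseLong(kv[1]));
    }
    return offsetMap;
  }

  public static void main(String[] args) {
    Map<Integer, Long> offsets = new TreeMap<>();
    offsets.put(0, 100L);
    offsets.put(1, 250L);
    String checkpoint = offsetsToStr("impressions", offsets);
    System.out.println(checkpoint);               // impressions,0:100,1:250
    System.out.println(strToOffsets(checkpoint)); // {0=100, 1=250}
  }
}
```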
@@ -83,46 +81,70 @@ public class KafkaSource extends Source {
     }

     /**
-     * String representation of checkpoint
-     * <p>
-     * Format: topic1,0:offset0,1:offset1,2:offset2, .....
+     * String representation of checkpoint <p> Format: topic1,0:offset0,1:offset1,2:offset2, .....
      */
-    public static String offsetsToStr(
-        HashMap<TopicAndPartition, KafkaCluster.LeaderOffset> offsetMap) {
+    public static String offsetsToStr(OffsetRange[] ranges) {
       StringBuilder sb = new StringBuilder();
       // atleast 1 partition will be present.
-      sb.append(offsetMap.entrySet().stream().findFirst().get().getKey().topic() + ",");
-      sb.append(offsetMap.entrySet().stream()
-          .map(e -> String.format("%s:%d", e.getKey().partition(), e.getValue().offset()))
+      sb.append(ranges[0].topic() + ",");
+      sb.append(Arrays.stream(ranges)
+          .map(r -> String.format("%s:%d", r.partition(), r.untilOffset()))
           .collect(Collectors.joining(",")));
       return sb.toString();
     }

-    public static OffsetRange[] computeOffsetRanges(
-        HashMap<TopicAndPartition, KafkaCluster.LeaderOffset> fromOffsetMap,
-        HashMap<TopicAndPartition, KafkaCluster.LeaderOffset> toOffsetMap) {
-      Comparator<OffsetRange> byPartition = (OffsetRange o1, OffsetRange o2) -> {
-        return Integer.valueOf(o1.partition()).compareTo(Integer.valueOf(o2.partition()));
-      };
-      List<OffsetRange> offsetRanges = toOffsetMap.entrySet().stream().map(e -> {
-        TopicAndPartition tp = e.getKey();
-        long fromOffset = -1;
-        if (fromOffsetMap.containsKey(tp)) {
-          fromOffset = fromOffsetMap.get(tp).offset();
-        }
-        return OffsetRange.create(tp, fromOffset, e.getValue().offset());
-      }).sorted(byPartition).collect(Collectors.toList());
-
-      OffsetRange[] ranges = new OffsetRange[offsetRanges.size()];
-      return offsetRanges.toArray(ranges);
+    /**
+     * Compute the offset ranges to read from Kafka, while handling newly added partitions, skews, event limits.
+     *
+     * @param fromOffsetMap offsets where we left off last time
+     * @param toOffsetMap offsets of where each partitions is currently at
+     * @param numEvents maximum number of events to read.
+     */
+    public static OffsetRange[] computeOffsetRanges(
+        HashMap<TopicAndPartition, LeaderOffset> fromOffsetMap,
+        HashMap<TopicAndPartition, LeaderOffset> toOffsetMap,
+        long numEvents) {
+
+      Comparator<OffsetRange> byPartition = (OffsetRange o1, OffsetRange o2) ->
+          Integer.valueOf(o1.partition()).compareTo(Integer.valueOf(o2.partition()));
+
+      // Create initial offset ranges for each 'to' partition, with from = to offsets.
+      OffsetRange[] ranges = new OffsetRange[toOffsetMap.size()];
+      toOffsetMap.entrySet().stream().map(e -> {
+        TopicAndPartition tp = e.getKey();
+        long fromOffset = fromOffsetMap.getOrDefault(tp, new LeaderOffset("", -1, 0)).offset();
+        return OffsetRange.create(tp, fromOffset, fromOffset);
+      }).sorted(byPartition).collect(Collectors.toList()).toArray(ranges);
+
+      long allocedEvents = 0;
+      java.util.Set<Integer> exhaustedPartitions = new HashSet<>();
+      // keep going until we have events to allocate and partitions still not exhausted.
+      while (allocedEvents < numEvents && exhaustedPartitions.size() < toOffsetMap.size()) {
+        long remainingEvents = numEvents - allocedEvents;
+        long eventsPerPartition = (long) Math
+            .ceil((1.0 * remainingEvents) / (toOffsetMap.size() - exhaustedPartitions.size()));
+
+        // Allocate the remaining events to non-exhausted partitions, in round robin fashion
+        for (int i = 0; i < ranges.length; i++) {
+          OffsetRange range = ranges[i];
+          if (!exhaustedPartitions.contains(range.partition())) {
+            long toOffsetMax = toOffsetMap.get(range.topicAndPartition()).offset();
+            long toOffset = Math.min(toOffsetMax, range.untilOffset() + eventsPerPartition);
+            if (toOffset == toOffsetMax) {
+              exhaustedPartitions.add(range.partition());
+            }
+            allocedEvents += toOffset - range.untilOffset();
+            ranges[i] = OffsetRange.create(range.topicAndPartition(), range.fromOffset(), toOffset);
+          }
+        }
+      }
+
+      return ranges;
     }

     public static long totalNewMessages(OffsetRange[] ranges) {
-      long totalMsgs = 0;
-      for (OffsetRange range : ranges) {
-        totalMsgs += Math.max(range.untilOffset() - range.fromOffset(), 0);
-      }
-      return totalMsgs;
+      return Arrays.asList(ranges).stream().mapToLong(r -> r.count()).sum();
     }
   }

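The allocation loop added to `computeOffsetRanges` is the heart of the skew handling: each pass splits the remaining event budget evenly across the partitions that still have unread data, so one hot partition cannot consume the whole `sourceLimit`. A standalone sketch of that round-robin loop on plain arrays, with Kafka's offset types assumed away (like the real loop, a single pass may allocate slightly past the budget before the outer check stops it):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RoundRobinAllocSketch {

  /**
   * Given per-partition backlogs (events available to read), allocate at most
   * roughly numEvents across partitions in round-robin passes.
   */
  static long[] allocate(long[] backlog, long numEvents) {
    long[] alloced = new long[backlog.length];
    Set<Integer> exhausted = new HashSet<>();
    long total = 0;
    // keep going while there is budget left and some partition still has data.
    while (total < numEvents && exhausted.size() < backlog.length) {
      long remaining = numEvents - total;
      long perPartition = (long) Math.ceil(
          (1.0 * remaining) / (backlog.length - exhausted.size()));
      for (int p = 0; p < backlog.length; p++) {
        if (exhausted.contains(p)) {
          continue;
        }
        // take up to perPartition more events, capped by what the partition has.
        long take = Math.min(backlog[p] - alloced[p], perPartition);
        alloced[p] += take;
        total += take;
        if (alloced[p] == backlog[p]) {
          exhausted.add(p);
        }
      }
    }
    return alloced;
  }

  public static void main(String[] args) {
    // Skewed backlog: partition 0 has 1000 events, partitions 1 and 2 only 10 each.
    // Pass 1 gives 100/10/10; pass 2 gives the remaining 180 to partition 0.
    long[] result = allocate(new long[]{1000, 10, 10}, 300);
    System.out.println(Arrays.toString(result)); // [280, 10, 10]
  }
}
```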
@@ -149,32 +171,31 @@ public class KafkaSource extends Source {
   * Configs to be passed for this source. All standard Kafka consumer configs are also respected
   */
  static class Config {

    private static final String KAFKA_TOPIC_NAME = "hoodie.deltastreamer.source.kafka.topic";
    private static final String DEFAULT_AUTO_RESET_OFFSET = "largest";
  }

-  private HashMap<String, String> kafkaParams;
-  private final String topicName;
+  protected HashMap<String, String> kafkaParams;
+  protected final String topicName;

-  public KafkaSource(PropertiesConfiguration config, JavaSparkContext sparkContext,
-      SourceDataFormat dataFormat, SchemaProvider schemaProvider) {
-    super(config, sparkContext, dataFormat, schemaProvider);
+  public KafkaSource(TypedProperties props, JavaSparkContext sparkContext, SchemaProvider schemaProvider) {
+    super(props, sparkContext, schemaProvider);
    kafkaParams = new HashMap<>();
-    Stream<String> keys = StreamSupport.stream(
-        Spliterators.spliteratorUnknownSize(config.getKeys(), Spliterator.NONNULL), false);
-    keys.forEach(k -> kafkaParams.put(k, config.getString(k)));
-    DataSourceUtils.checkRequiredProperties(config, Arrays.asList(Config.KAFKA_TOPIC_NAME));
-    topicName = config.getString(Config.KAFKA_TOPIC_NAME);
+    for (Object prop : props.keySet()) {
+      kafkaParams.put(prop.toString(), props.getString(prop.toString()));
+    }
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.KAFKA_TOPIC_NAME));
+    topicName = props.getString(Config.KAFKA_TOPIC_NAME);
  }

+  protected abstract JavaRDD<GenericRecord> toAvroRDD(OffsetRange[] offsetRanges, AvroConvertor avroConvertor);
+
  @Override
  public Pair<Optional<JavaRDD<GenericRecord>>, String> fetchNewData(
-      Optional<String> lastCheckpointStr, long maxInputBytes) {
+      Optional<String> lastCheckpointStr, long sourceLimit) {

    // Obtain current metadata for the topic
    KafkaCluster cluster = new KafkaCluster(ScalaHelpers.toScalaMap(kafkaParams));
@@ -192,7 +213,7 @@ public class KafkaSource extends Source {
    if (lastCheckpointStr.isPresent()) {
      fromOffsets = CheckpointUtils.strToOffsets(lastCheckpointStr.get());
    } else {
-      String autoResetValue = config
+      String autoResetValue = props
          .getString("auto.offset.reset", Config.DEFAULT_AUTO_RESET_OFFSET);
      if (autoResetValue.equals("smallest")) {
        fromOffsets = new HashMap(ScalaHelpers.toJavaMap(
@@ -206,40 +227,23 @@ public class KafkaSource extends Source {
      }
    }

-    // Always read until the latest offset
+    // Obtain the latest offsets.
    HashMap<TopicAndPartition, KafkaCluster.LeaderOffset> toOffsets = new HashMap(
        ScalaHelpers.toJavaMap(cluster.getLatestLeaderOffsets(topicPartitions).right().get()));

-    // Come up with final set of OffsetRanges to read (account for new partitions)
-    // TODO(vc): Respect maxInputBytes, by estimating number of messages to read each batch from
-    // partition size
-    OffsetRange[] offsetRanges = CheckpointUtils.computeOffsetRanges(fromOffsets, toOffsets);
+    // Come up with final set of OffsetRanges to read (account for new partitions, limit number of events)
+    long numEvents = Math.min(DEFAULT_MAX_EVENTS_TO_READ, sourceLimit);
+    OffsetRange[] offsetRanges = CheckpointUtils.computeOffsetRanges(fromOffsets, toOffsets, numEvents);
    long totalNewMsgs = CheckpointUtils.totalNewMessages(offsetRanges);
    if (totalNewMsgs <= 0) {
-      return new ImmutablePair<>(Optional.empty(),
-          lastCheckpointStr.isPresent() ? lastCheckpointStr.get()
-              : CheckpointUtils.offsetsToStr(toOffsets));
+      return new ImmutablePair<>(Optional.empty(), lastCheckpointStr.isPresent() ? lastCheckpointStr.get() : "");
    } else {
      log.info("About to read " + totalNewMsgs + " from Kafka for topic :" + topicName);
    }

-    // Perform the actual read from Kafka
-    JavaRDD<byte[]> kafkaRDD = KafkaUtils.createRDD(sparkContext, byte[].class, byte[].class,
-        DefaultDecoder.class, DefaultDecoder.class, kafkaParams, offsetRanges).values();
-
    // Produce a RDD[GenericRecord]
-    final AvroConvertor avroConvertor = new AvroConvertor(
-        schemaProvider.getSourceSchema().toString());
-    JavaRDD<GenericRecord> newDataRDD;
-    if (dataFormat == SourceDataFormat.AVRO) {
-      newDataRDD = kafkaRDD.map(bytes -> avroConvertor.fromAvroBinary(bytes));
-    } else if (dataFormat == SourceDataFormat.JSON) {
-      newDataRDD = kafkaRDD.map(
-          bytes -> avroConvertor.fromJson(new String(bytes, Charset.forName("utf-8"))));
-    } else {
-      throw new HoodieNotSupportedException("Unsupport data format :" + dataFormat);
-    }
-
-    return new ImmutablePair<>(Optional.of(newDataRDD), CheckpointUtils.offsetsToStr(toOffsets));
+    final AvroConvertor avroConvertor = new AvroConvertor(schemaProvider.getSourceSchema().toString());
+    JavaRDD<GenericRecord> newDataRDD = toAvroRDD(offsetRanges, avroConvertor);
+    return new ImmutablePair<>(Optional.of(newDataRDD), CheckpointUtils.offsetsToStr(offsetRanges));
  }
}
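The change above removes the `SourceDataFormat` switch from `fetchNewData` entirely: the base class keeps the shared checkpoint/limit logic and delegates decoding to one abstract hook (`toAvroRDD` for Kafka, `fromFiles` for DFS), which `JsonKafkaSource` and friends implement. The design, reduced to its template-method skeleton (all class names here are illustrative, not the real API):

```java
public class TemplateSourceSketch {

  // Base class owns the shared fetch pipeline; subclasses own only decoding.
  abstract static class BaseSource {
    final String fetch(String rawBatch) {
      // shared logic (checkpointing, limits) would live here in the real classes
      return decode(rawBatch);
    }

    protected abstract String decode(String rawBatch);
  }

  // One subclass per data format, replacing the old enum-driven if/else chain.
  static class JsonSource extends BaseSource {
    @Override
    protected String decode(String rawBatch) {
      return "json->avro(" + rawBatch + ")";
    }
  }

  static class AvroSource extends BaseSource {
    @Override
    protected String decode(String rawBatch) {
      return "avro(" + rawBatch + ")";
    }
  }

  public static void main(String[] args) {
    System.out.println(new JsonSource().fetch("batch1")); // json->avro(batch1)
    System.out.println(new AvroSource().fetch("batch1")); // avro(batch1)
  }
}
```

The payoff is that adding a new format means adding a subclass, with no edits to the shared fetch path.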
@@ -18,12 +18,12 @@

 package com.uber.hoodie.utilities.sources;

+import com.uber.hoodie.common.util.TypedProperties;
+import com.uber.hoodie.common.util.collection.Pair;
 import com.uber.hoodie.utilities.schema.SchemaProvider;
 import java.io.Serializable;
 import java.util.Optional;
 import org.apache.avro.generic.GenericRecord;
-import org.apache.commons.configuration.PropertiesConfiguration;
-import org.apache.commons.lang3.tuple.Pair;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;

@@ -32,32 +32,23 @@ import org.apache.spark.api.java.JavaSparkContext;
  */
 public abstract class Source implements Serializable {

-  protected transient PropertiesConfiguration config;
+  protected transient TypedProperties props;
   protected transient JavaSparkContext sparkContext;
-  protected transient SourceDataFormat dataFormat;
   protected transient SchemaProvider schemaProvider;

-  protected Source(PropertiesConfiguration config, JavaSparkContext sparkContext,
-      SourceDataFormat dataFormat, SchemaProvider schemaProvider) {
-    this.config = config;
+  protected Source(TypedProperties props, JavaSparkContext sparkContext, SchemaProvider schemaProvider) {
+    this.props = props;
     this.sparkContext = sparkContext;
-    this.dataFormat = dataFormat;
     this.schemaProvider = schemaProvider;
   }

   /**
-   * Fetches new data upto maxInputBytes, from the provided checkpoint and returns an RDD of the
+   * Fetches new data upto sourceLimit, from the provided checkpoint and returns an RDD of the
    * data, as well as the checkpoint to be written as a result of that.
    */
   public abstract Pair<Optional<JavaRDD<GenericRecord>>, String> fetchNewData(
-      Optional<String> lastCheckpointStr, long maxInputBytes);
-
-  public PropertiesConfiguration getConfig() {
-    return config;
-  }
+      Optional<String> lastCheckpointStr, long sourceLimit);
 }
@@ -1,19 +0,0 @@
-#
-# Copyright (c) 2017 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#      http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-#
-#
-hoodie.datasource.write.recordkey.field=_row_key
-hoodie.datasource.write.partitionpath.field=driver
@@ -1,29 +0,0 @@
-{
-  "type" : "record",
-  "name" : "triprec",
-  "fields" : [ {
-    "name" : "_row_key",
-    "type" : "string"
-  }, {
-    "name" : "rider",
-    "type" : "string"
-  }, {
-    "name" : "driver",
-    "type" : "string"
-  }, {
-    "name" : "begin_lat",
-    "type" : "double"
-  }, {
-    "name" : "begin_lon",
-    "type" : "double"
-  }, {
-    "name" : "end_lat",
-    "type" : "double"
-  }, {
-    "name" : "end_lon",
-    "type" : "double"
-  }, {
-    "name" : "fare",
-    "type" : "double"
-  } ]
-}
@@ -0,0 +1,162 @@
/*
 * Copyright (c) 2017 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 *
 */

package com.uber.hoodie.utilities;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;

import com.uber.hoodie.common.model.HoodieCommitMetadata;
import com.uber.hoodie.common.table.HoodieTableMetaClient;
import com.uber.hoodie.common.table.HoodieTimeline;
import com.uber.hoodie.common.table.timeline.HoodieInstant;
import com.uber.hoodie.common.util.DFSPropertiesConfiguration;
import com.uber.hoodie.common.util.TypedProperties;
import com.uber.hoodie.exception.DatasetNotFoundException;
import com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer;
import com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer.Operation;
import com.uber.hoodie.utilities.sources.TestDataSource;
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;
import org.apache.spark.sql.SQLContext;
import org.junit.After;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;

/**
 * Basic tests against {@link com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer}, by issuing bulk_inserts,
 * upserts, inserts. Check counts at the end.
 */
public class TestHoodieDeltaStreamer extends UtilitiesTestBase {

  private static volatile Logger log = LogManager.getLogger(TestHoodieDeltaStreamer.class);

  @BeforeClass
  public static void initClass() throws Exception {
    UtilitiesTestBase.initClass();

    // prepare the configs.
    UtilitiesTestBase.Helpers.copyToDFS("delta-streamer-config/base.properties", dfs, dfsBasePath + "/base.properties");
    UtilitiesTestBase.Helpers.copyToDFS("delta-streamer-config/source.avsc", dfs, dfsBasePath + "/source.avsc");
    TypedProperties props = new TypedProperties();
    props.setProperty("include", "base.properties");
    props.setProperty("hoodie.datasource.write.recordkey.field", "_row_key");
    props.setProperty("hoodie.datasource.write.partitionpath.field", "not_there");
    props.setProperty("hoodie.deltastreamer.schemaprovider.source.schema.file", dfsBasePath + "/source.avsc");
    UtilitiesTestBase.Helpers.savePropsToDFS(props, dfs, dfsBasePath + "/test-source.properties");
  }

  @AfterClass
  public static void cleanupClass() throws Exception {
    UtilitiesTestBase.cleanupClass();
  }

  @Before
  public void setup() throws Exception {
    super.setup();
    TestDataSource.initDataGen();
  }

  @After
  public void teardown() throws Exception {
    super.teardown();
    TestDataSource.resetDataGen();
  }

  static class TestHelpers {

    static HoodieDeltaStreamer.Config makeConfig(String basePath, Operation op) {
      HoodieDeltaStreamer.Config cfg = new HoodieDeltaStreamer.Config();
      cfg.targetBasePath = basePath;
      cfg.targetTableName = "hoodie_trips";
      cfg.sourceClassName = TestDataSource.class.getName();
      cfg.operation = op;
      cfg.sourceOrderingField = "timestamp";
      cfg.propsFilePath = dfsBasePath + "/test-source.properties";
      cfg.sourceLimit = 1000;
      return cfg;
    }

    static void assertRecordCount(long expected, String datasetPath, SQLContext sqlContext) {
      long recordCount = sqlContext.read().format("com.uber.hoodie").load(datasetPath).count();
      assertEquals(expected, recordCount);
    }

    static void assertCommitMetadata(String expected, String datasetPath, FileSystem fs, int totalCommits)
        throws IOException {
      HoodieTableMetaClient meta = new HoodieTableMetaClient(fs.getConf(), datasetPath);
      HoodieTimeline timeline = meta.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();
      HoodieInstant lastCommit = timeline.lastInstant().get();
      HoodieCommitMetadata commitMetadata = HoodieCommitMetadata.fromBytes(
          timeline.getInstantDetails(lastCommit).get(), HoodieCommitMetadata.class);
      assertEquals(totalCommits, timeline.countInstants());
      assertEquals(expected, commitMetadata.getMetadata(HoodieDeltaStreamer.CHECKPOINT_KEY));
    }
  }

  @Test
  public void testProps() throws IOException {
    TypedProperties props = new DFSPropertiesConfiguration(dfs, new Path(dfsBasePath + "/test-source.properties"))
        .getConfig();
    assertEquals(2, props.getInteger("hoodie.upsert.shuffle.parallelism"));
    assertEquals("_row_key", props.getString("hoodie.datasource.write.recordkey.field"));
  }

  @Test
  public void testDatasetCreation() throws Exception {
    try {
      dfs.mkdirs(new Path(dfsBasePath + "/not_a_dataset"));
      HoodieDeltaStreamer deltaStreamer = new HoodieDeltaStreamer(
          TestHelpers.makeConfig(dfsBasePath + "/not_a_dataset", Operation.BULK_INSERT), jsc);
      deltaStreamer.sync();
      fail("Should error out when pointed at a dir that is not a dataset");
    } catch (DatasetNotFoundException e) {
      // expected
      log.error("Expected error during dataset creation", e);
    }
  }

  @Test
  public void testBulkInsertsAndUpserts() throws Exception {
    String datasetBasePath = dfsBasePath + "/test_dataset";

    // Initial bulk insert
    HoodieDeltaStreamer.Config cfg = TestHelpers.makeConfig(datasetBasePath, Operation.BULK_INSERT);
    new HoodieDeltaStreamer(cfg, jsc).sync();
    TestHelpers.assertRecordCount(1000, datasetBasePath + "/*/*.parquet", sqlContext);
    TestHelpers.assertCommitMetadata("00000", datasetBasePath, dfs, 1);

    // No new data => no commits.
    cfg.sourceLimit = 0;
    new HoodieDeltaStreamer(cfg, jsc).sync();
    TestHelpers.assertRecordCount(1000, datasetBasePath + "/*/*.parquet", sqlContext);
    TestHelpers.assertCommitMetadata("00000", datasetBasePath, dfs, 1);

    // upsert() #1
    cfg.sourceLimit = 2000;
    cfg.operation = Operation.UPSERT;
    new HoodieDeltaStreamer(cfg, jsc).sync();
    TestHelpers.assertRecordCount(2000, datasetBasePath + "/*/*.parquet", sqlContext);
    TestHelpers.assertCommitMetadata("00001", datasetBasePath, dfs, 2);
  }
}
@@ -0,0 +1,137 @@
/*
 * Copyright (c) 2017 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 *
 */

package com.uber.hoodie.utilities;

import com.uber.hoodie.common.TestRawTripPayload;
import com.uber.hoodie.common.minicluster.HdfsTestService;
import com.uber.hoodie.common.model.HoodieRecord;
import com.uber.hoodie.common.util.TypedProperties;
import com.uber.hoodie.utilities.sources.TestDataSource;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;
import org.junit.After;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;

/**
 * Abstract test base that provides dfs and spark contexts.
 *
 * TODO(vc): this needs to be done across the board.
 */
public class UtilitiesTestBase {

  protected static String dfsBasePath;
  protected static HdfsTestService hdfsTestService;
  protected static MiniDFSCluster dfsCluster;
  protected static DistributedFileSystem dfs;
  protected transient JavaSparkContext jsc = null;
  protected transient SQLContext sqlContext;

  @BeforeClass
  public static void initClass() throws Exception {
    hdfsTestService = new HdfsTestService();
    dfsCluster = hdfsTestService.start(true);
    dfs = dfsCluster.getFileSystem();
    dfsBasePath = dfs.getWorkingDirectory().toString();
    dfs.mkdirs(new Path(dfsBasePath));
  }

  @AfterClass
  public static void cleanupClass() throws Exception {
    if (hdfsTestService != null) {
      hdfsTestService.stop();
    }
  }

  @Before
  public void setup() throws Exception {
    TestDataSource.initDataGen();
    jsc = UtilHelpers.buildSparkContext(this.getClass().getName() + "-hoodie", "local[2]");
    sqlContext = new SQLContext(jsc);
  }

  @After
  public void teardown() throws Exception {
    TestDataSource.resetDataGen();
    if (jsc != null) {
      jsc.stop();
    }
  }

  public static class Helpers {

    // to get hold of resources bundled with jar
    private static ClassLoader classLoader = Helpers.class.getClassLoader();

    public static void copyToDFS(String testResourcePath, FileSystem fs, String targetPath) throws IOException {
      BufferedReader reader = new BufferedReader(
          new InputStreamReader(classLoader.getResourceAsStream(testResourcePath)));
      PrintStream os = new PrintStream(fs.create(new Path(targetPath), true));
      String line;
      while ((line = reader.readLine()) != null) {
        os.println(line);
      }
      os.flush();
      os.close();
    }

    public static void savePropsToDFS(TypedProperties props, FileSystem fs, String targetPath) throws IOException {
      String[] lines = props.keySet().stream().map(k -> String.format("%s=%s", k, props.get(k))).toArray(String[]::new);
      saveStringsToDFS(lines, fs, targetPath);
    }

    public static void saveStringsToDFS(String[] lines, FileSystem fs, String targetPath) throws IOException {
      PrintStream os = new PrintStream(fs.create(new Path(targetPath), true));
      for (String l : lines) {
        os.println(l);
      }
      os.flush();
      os.close();
    }

    public static TypedProperties setupSchemaOnDFS() throws IOException {
      UtilitiesTestBase.Helpers.copyToDFS("delta-streamer-config/source.avsc", dfs, dfsBasePath + "/source.avsc");
      TypedProperties props = new TypedProperties();
      props.setProperty("hoodie.deltastreamer.schemaprovider.source.schema.file", dfsBasePath + "/source.avsc");
      return props;
    }

    public static String toJsonString(HoodieRecord hr) {
      try {
        return ((TestRawTripPayload) hr.getData()).getJsonData();
      } catch (IOException ioe) {
        return null;
      }
    }

    public static String[] jsonifyRecords(List<HoodieRecord> records) throws IOException {
      return records.stream().map(Helpers::toJsonString).toArray(String[]::new);
    }
  }
}
@@ -0,0 +1,104 @@
/*
 * Copyright (c) 2017 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 *
 */

package com.uber.hoodie.utilities.sources;

import static org.junit.Assert.assertEquals;

import com.uber.hoodie.common.HoodieTestDataGenerator;
import com.uber.hoodie.common.util.TypedProperties;
import com.uber.hoodie.common.util.collection.Pair;
import com.uber.hoodie.utilities.UtilitiesTestBase;
import com.uber.hoodie.utilities.schema.FilebasedSchemaProvider;
import java.io.IOException;
import java.util.Optional;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaRDD;
import org.junit.After;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;

/**
 * Basic tests against all subclasses of {@link DFSSource}.
 */
public class TestDFSSource extends UtilitiesTestBase {

  private FilebasedSchemaProvider schemaProvider;

  @BeforeClass
  public static void initClass() throws Exception {
    UtilitiesTestBase.initClass();
  }

  @AfterClass
  public static void cleanupClass() throws Exception {
    UtilitiesTestBase.cleanupClass();
  }

  @Before
  public void setup() throws Exception {
    super.setup();
    schemaProvider = new FilebasedSchemaProvider(Helpers.setupSchemaOnDFS(), jsc);
  }

  @After
  public void teardown() throws Exception {
    super.teardown();
  }

  @Test
  public void testJsonDFSSource() throws IOException {
    dfs.mkdirs(new Path(dfsBasePath + "/jsonFiles"));
    HoodieTestDataGenerator dataGenerator = new HoodieTestDataGenerator();

    TypedProperties props = new TypedProperties();
    props.setProperty("hoodie.deltastreamer.source.dfs.root", dfsBasePath + "/jsonFiles");
    JsonDFSSource jsonSource = new JsonDFSSource(props, jsc, schemaProvider);

    // 1. Extract without any checkpoint => get all the data, respecting sourceLimit
    assertEquals(Optional.empty(), jsonSource.fetchNewData(Optional.empty(), Long.MAX_VALUE).getKey());
    UtilitiesTestBase.Helpers.saveStringsToDFS(
        Helpers.jsonifyRecords(dataGenerator.generateInserts("000", 100)), dfs,
        dfsBasePath + "/jsonFiles/1.json");
    assertEquals(Optional.empty(), jsonSource.fetchNewData(Optional.empty(), 10).getKey());
    Pair<Optional<JavaRDD<GenericRecord>>, String> fetch1 = jsonSource.fetchNewData(Optional.empty(), 1000000);
    assertEquals(100, fetch1.getKey().get().count());

    // 2. Produce new data, extract new data
    UtilitiesTestBase.Helpers.saveStringsToDFS(
        Helpers.jsonifyRecords(dataGenerator.generateInserts("001", 10000)),
        dfs, dfsBasePath + "/jsonFiles/2.json");
    Pair<Optional<JavaRDD<GenericRecord>>, String> fetch2 = jsonSource.fetchNewData(
        Optional.of(fetch1.getValue()), Long.MAX_VALUE);
    assertEquals(10000, fetch2.getKey().get().count());

    // 3. Extract with previous checkpoint => gives same data back (idempotent)
    Pair<Optional<JavaRDD<GenericRecord>>, String> fetch3 = jsonSource.fetchNewData(
        Optional.of(fetch1.getValue()), Long.MAX_VALUE);
    assertEquals(10000, fetch3.getKey().get().count());
    assertEquals(fetch2.getValue(), fetch3.getValue());

    // 4. Extract with latest checkpoint => no new data returned
    Pair<Optional<JavaRDD<GenericRecord>>, String> fetch4 = jsonSource.fetchNewData(
        Optional.of(fetch2.getValue()), Long.MAX_VALUE);
    assertEquals(Optional.empty(), fetch4.getKey());
  }
}
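The four numbered steps above encode a general incremental-pull contract that the sources are tested against: `fetchNewData(lastCheckpoint, limit)` returns a batch plus a new checkpoint, re-fetching from an older checkpoint is idempotent, and fetching from the latest checkpoint yields nothing. The sketch below is a hypothetical, self-contained illustration of that contract (the class, its fields, and the integer-checkpoint encoding are invented for illustration and are not the actual `Source` API):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the checkpoint contract the source tests verify.
// Batches are stored in arrival order; a checkpoint is simply the index of
// the last batch already consumed, serialized as a string.
public class CheckpointContractSketch {

  private final List<List<String>> batches = new ArrayList<>();

  public void addBatch(List<String> records) {
    batches.add(records);
  }

  // Returns (records newer than the checkpoint, new checkpoint). Fetching with
  // an old checkpoint again returns the same data (idempotent); fetching with
  // the latest checkpoint returns an empty batch.
  public Map.Entry<List<String>, String> fetchNewData(Optional<String> lastCheckpoint) {
    int from = lastCheckpoint.map(c -> Integer.parseInt(c) + 1).orElse(0);
    List<String> out = new ArrayList<>();
    for (int i = from; i < batches.size(); i++) {
      out.addAll(batches.get(i));
    }
    String newCheckpoint = String.valueOf(Math.max(batches.size() - 1, from - 1));
    return new AbstractMap.SimpleEntry<>(out, newCheckpoint);
  }
}
```

Under this model the returned checkpoint is always safe to hand back on the next call, which is exactly what `HoodieDeltaStreamer` does by persisting it in the commit metadata under `CHECKPOINT_KEY`.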
@@ -0,0 +1,99 @@
/*
 * Copyright (c) 2017 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 *
 */

package com.uber.hoodie.utilities.sources;

import com.uber.hoodie.common.HoodieTestDataGenerator;
import com.uber.hoodie.common.model.HoodieRecord;
import com.uber.hoodie.common.util.TypedProperties;
import com.uber.hoodie.common.util.collection.ImmutablePair;
import com.uber.hoodie.common.util.collection.Pair;
import com.uber.hoodie.utilities.schema.SchemaProvider;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

/**
 * An implementation of {@link Source} that emits test upserts.
 */
public class TestDataSource extends Source {

  private static volatile Logger log = LogManager.getLogger(TestDataSource.class);

  // Static instance, helps with reuse across a test.
  private static HoodieTestDataGenerator dataGenerator;

  public static void initDataGen() {
    dataGenerator = new HoodieTestDataGenerator();
  }

  public static void resetDataGen() {
    dataGenerator = null;
  }

  public TestDataSource(TypedProperties props, JavaSparkContext sparkContext, SchemaProvider schemaProvider) {
    super(props, sparkContext, schemaProvider);
  }

  private GenericRecord toGenericRecord(HoodieRecord hoodieRecord) {
    try {
      Optional<IndexedRecord> recordOpt = hoodieRecord.getData().getInsertValue(dataGenerator.avroSchema);
      return (GenericRecord) recordOpt.get();
    } catch (IOException e) {
      return null;
    }
  }

  @Override
  public Pair<Optional<JavaRDD<GenericRecord>>, String> fetchNewData(Optional<String> lastCheckpointStr,
      long sourceLimit) {

    int nextCommitNum = lastCheckpointStr.isPresent() ? Integer.parseInt(lastCheckpointStr.get()) + 1 : 0;
    String commitTime = String.format("%05d", nextCommitNum);
    // No new data.
    if (sourceLimit <= 0) {
      return new ImmutablePair<>(Optional.empty(), commitTime);
    }

    // generate `sourceLimit` number of upserts each time.
    int numExistingKeys = dataGenerator.getExistingKeysList().size();
    int numUpdates = Math.min(numExistingKeys, (int) sourceLimit / 2);
    int numInserts = (int) sourceLimit - numUpdates;

    List<GenericRecord> records = new ArrayList<>();
    try {
      records.addAll(dataGenerator.generateUniqueUpdates(commitTime, numUpdates).stream()
          .map(this::toGenericRecord).collect(Collectors.toList()));
      records.addAll(dataGenerator.generateInserts(commitTime, numInserts).stream()
          .map(this::toGenericRecord).collect(Collectors.toList()));
    } catch (IOException e) {
      log.error("Error generating test data.", e);
    }

    JavaRDD<GenericRecord> avroRDD = sparkContext.<GenericRecord>parallelize(records, 4);
    return new ImmutablePair<>(Optional.of(avroRDD), commitTime);
  }
}
@@ -0,0 +1,186 @@
/*
 * Copyright (c) 2017 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 *
 */

package com.uber.hoodie.utilities.sources;

import static com.uber.hoodie.utilities.sources.KafkaSource.CheckpointUtils;
import static org.junit.Assert.assertEquals;

import com.uber.hoodie.common.HoodieTestDataGenerator;
import com.uber.hoodie.common.util.TypedProperties;
import com.uber.hoodie.common.util.collection.Pair;
import com.uber.hoodie.utilities.UtilitiesTestBase;
import com.uber.hoodie.utilities.schema.FilebasedSchemaProvider;
import java.io.IOException;
import java.util.HashMap;
import java.util.Optional;
import kafka.common.TopicAndPartition;
import org.apache.avro.generic.GenericRecord;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset;
import org.apache.spark.streaming.kafka.KafkaTestUtils;
import org.apache.spark.streaming.kafka.OffsetRange;
import org.junit.After;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;

/**
 * Tests against {@link KafkaSource}.
 */
public class TestKafkaSource extends UtilitiesTestBase {

  private static String TEST_TOPIC_NAME = "hoodie_test";

  private FilebasedSchemaProvider schemaProvider;
  private KafkaTestUtils testUtils;

  @BeforeClass
  public static void initClass() throws Exception {
    UtilitiesTestBase.initClass();
  }

  @AfterClass
  public static void cleanupClass() throws Exception {
    UtilitiesTestBase.cleanupClass();
  }

  @Before
  public void setup() throws Exception {
    super.setup();
    schemaProvider = new FilebasedSchemaProvider(Helpers.setupSchemaOnDFS(), jsc);
    testUtils = new KafkaTestUtils();
    testUtils.setup();
  }

  @After
  public void teardown() throws Exception {
    super.teardown();
    testUtils.teardown();
  }

  @Test
  public void testJsonKafkaSource() throws IOException {

    // topic setup.
    testUtils.createTopic(TEST_TOPIC_NAME, 2);
    HoodieTestDataGenerator dataGenerator = new HoodieTestDataGenerator();
    TypedProperties props = new TypedProperties();
    props.setProperty("hoodie.deltastreamer.source.kafka.topic", TEST_TOPIC_NAME);
    props.setProperty("metadata.broker.list", testUtils.brokerAddress());
    props.setProperty("auto.offset.reset", "smallest");
    props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    Source kafkaSource = new JsonKafkaSource(props, jsc, schemaProvider);

    // 1. Extract without any checkpoint => get all the data, respecting sourceLimit
    assertEquals(Optional.empty(), kafkaSource.fetchNewData(Optional.empty(), Long.MAX_VALUE).getKey());
    testUtils.sendMessages(TEST_TOPIC_NAME, Helpers.jsonifyRecords(dataGenerator.generateInserts("000", 1000)));
    Pair<Optional<JavaRDD<GenericRecord>>, String> fetch1 = kafkaSource.fetchNewData(Optional.empty(), 900);
    assertEquals(900, fetch1.getKey().get().count());

    // 2. Produce new data, extract new data
    testUtils.sendMessages(TEST_TOPIC_NAME, Helpers.jsonifyRecords(dataGenerator.generateInserts("001", 1000)));
    Pair<Optional<JavaRDD<GenericRecord>>, String> fetch2 = kafkaSource.fetchNewData(
        Optional.of(fetch1.getValue()), Long.MAX_VALUE);
    assertEquals(1100, fetch2.getKey().get().count());

    // 3. Extract with previous checkpoint => gives same data back (idempotent)
    Pair<Optional<JavaRDD<GenericRecord>>, String> fetch3 = kafkaSource.fetchNewData(
        Optional.of(fetch1.getValue()), Long.MAX_VALUE);
    assertEquals(fetch2.getKey().get().count(), fetch3.getKey().get().count());
    assertEquals(fetch2.getValue(), fetch3.getValue());

    // 4. Extract with latest checkpoint => no new data returned
    Pair<Optional<JavaRDD<GenericRecord>>, String> fetch4 = kafkaSource.fetchNewData(
        Optional.of(fetch2.getValue()), Long.MAX_VALUE);
    assertEquals(Optional.empty(), fetch4.getKey());
  }

  private static HashMap<TopicAndPartition, LeaderOffset> makeOffsetMap(int[] partitions, long[] offsets) {
    HashMap<TopicAndPartition, LeaderOffset> map = new HashMap<>();
    for (int i = 0; i < partitions.length; i++) {
      map.put(new TopicAndPartition(TEST_TOPIC_NAME, partitions[i]), new LeaderOffset("", -1, offsets[i]));
    }
    return map;
  }

  @Test
  public void testComputeOffsetRanges() {
    // test totalNewMessages()
    long totalMsgs = CheckpointUtils.totalNewMessages(new OffsetRange[]{
        OffsetRange.apply(TEST_TOPIC_NAME, 0, 0, 100),
        OffsetRange.apply(TEST_TOPIC_NAME, 0, 100, 200)
    });
    assertEquals(200, totalMsgs);

    // should consume all of the available data
    OffsetRange[] ranges = CheckpointUtils.computeOffsetRanges(
        makeOffsetMap(new int[]{0, 1}, new long[]{200000, 250000}),
        makeOffsetMap(new int[]{0, 1}, new long[]{300000, 350000}),
        1000000L
    );
    assertEquals(200000, CheckpointUtils.totalNewMessages(ranges));

    // should only consume up to the limit
    ranges = CheckpointUtils.computeOffsetRanges(
        makeOffsetMap(new int[]{0, 1}, new long[]{200000, 250000}),
        makeOffsetMap(new int[]{0, 1}, new long[]{300000, 350000}),
        10000
    );
    assertEquals(10000, CheckpointUtils.totalNewMessages(ranges));
    assertEquals(200000, ranges[0].fromOffset());
    assertEquals(205000, ranges[0].untilOffset());
    assertEquals(250000, ranges[1].fromOffset());
    assertEquals(255000, ranges[1].untilOffset());

    // should also consume from new partitions.
    ranges = CheckpointUtils.computeOffsetRanges(
        makeOffsetMap(new int[]{0, 1}, new long[]{200000, 250000}),
        makeOffsetMap(new int[]{0, 1, 2}, new long[]{300000, 350000, 100000}),
        1000000L
    );
    assertEquals(300000, CheckpointUtils.totalNewMessages(ranges));
    assertEquals(3, ranges.length);

    // for skewed offsets, does not starve any partition & can catch up
    ranges = CheckpointUtils.computeOffsetRanges(
        makeOffsetMap(new int[]{0, 1}, new long[]{200000, 250000}),
        makeOffsetMap(new int[]{0, 1, 2}, new long[]{200010, 350000, 10000}),
        100000
    );
    assertEquals(100000, CheckpointUtils.totalNewMessages(ranges));
    assertEquals(10, ranges[0].count());
    assertEquals(89990, ranges[1].count());
    assertEquals(10000, ranges[2].count());

    ranges = CheckpointUtils.computeOffsetRanges(
        makeOffsetMap(new int[]{0, 1}, new long[]{200000, 250000}),
        makeOffsetMap(new int[]{0, 1, 2}, new long[]{200010, 350000, 10000}),
        1000000
    );
    assertEquals(110010, CheckpointUtils.totalNewMessages(ranges));
    assertEquals(10, ranges[0].count());
    assertEquals(100000, ranges[1].count());
    assertEquals(10000, ranges[2].count());
  }
}
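The skewed-offsets expectations in `testComputeOffsetRanges` (10 / 89990 / 10000 from a limit of 100000) can be reproduced by an iterative fair-share split: give each partition an equal share of the remaining limit, cap it at that partition's backlog, and recycle the surplus to partitions that still have messages pending. This is an illustrative sketch under that assumption, not the actual `CheckpointUtils.computeOffsetRanges` code; the class and method names here are invented:

```java
import java.util.Arrays;

// Hypothetical sketch of skew-aware allocation of a source limit across
// Kafka partitions: small backlogs drain fully, the surplus flows to
// partitions with more pending messages, instead of a naive equal split.
public class OffsetAllocationSketch {

  // lags[i] = messages pending in partition i; returns messages to read per partition.
  static long[] allocate(long[] lags, long limit) {
    long[] alloc = new long[lags.length];
    long remaining = Math.min(limit, Arrays.stream(lags).sum());
    while (remaining > 0) {
      // count partitions that still have unread backlog
      int open = 0;
      for (int i = 0; i < lags.length; i++) {
        if (alloc[i] < lags[i]) {
          open++;
        }
      }
      if (open == 0) {
        break;
      }
      // equal share of what is left, capped per partition at its backlog
      long share = Math.max(1, remaining / open);
      for (int i = 0; i < lags.length && remaining > 0; i++) {
        long take = Math.min(share, Math.min(lags[i] - alloc[i], remaining));
        alloc[i] += take;
        remaining -= take;
      }
    }
    return alloc;
  }

  public static void main(String[] args) {
    // mirrors the skewed case above: lags 10 / 100000 / 10000, limit 100000
    System.out.println(Arrays.toString(allocate(new long[]{10, 100000, 10000}, 100000)));
    // prints [10, 89990, 10000]
  }
}
```

With a limit larger than the total backlog (the 1000000 case above), every partition is simply drained in full.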
@@ -15,4 +15,7 @@
 #
 #
 #
+# Common hoodie client configs
 hoodie.upsert.shuffle.parallelism=2
+hoodie.insert.shuffle.parallelism=2
+hoodie.bulkinsert.shuffle.parallelism=2
@@ -14,5 +14,14 @@
 # limitations under the License.
 #
 #
-hoodie.deltastreamer.filebased.schemaprovider.source.schema.file=file:///Users/vinoth/bin/hoodie/hoodie-utilities/src/main/resources/delta-streamer-config/source.avsc
-hoodie.deltastreamer.filebased.schemaprovider.target.schema.file=file:///Users/vinoth/bin/hoodie/hoodie-utilities/src/main/resources/delta-streamer-config/target.avsc
+#
+include=base.properties
+# Key generator props
+hoodie.datasource.write.recordkey.field=_row_key
+hoodie.datasource.write.partitionpath.field=driver
+# Schema provider props (change to absolute path based on your installation)
+hoodie.deltastreamer.filebased.schemaprovider.source.schema.file=file:///path/to/hoodie/hoodie-utilities/src/main/resources/delta-streamer-props/source.avsc
+hoodie.deltastreamer.filebased.schemaprovider.target.schema.file=file:///path/to/hoodie/hoodie-utilities/src/main/resources/delta-streamer-props/target.avsc
+# DFS Source
+hoodie.deltastreamer.source.dfs.root=file:///tmp/hoodie-dfs-input
+
@@ -15,9 +15,16 @@
 #
 #
 #
-# DFS Source
-hoodie.deltastreamer.source.dfs.root=file:///tmp/hoodie-dfs-input
+include=base.properties
+# Key fields, for kafka example
+hoodie.datasource.write.recordkey.field=impressionid
+hoodie.datasource.write.partitionpath.field=userid
+# schema provider configs
+hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/impressions-value/versions/latest
 # Kafka Source
-hoodie.deltastreamer.source.kafka.topic=uber_trips
+#hoodie.deltastreamer.source.kafka.topic=uber_trips
+hoodie.deltastreamer.source.kafka.topic=impressions
+#Kafka props
 metadata.broker.list=localhost:9092
 auto.offset.reset=smallest
+schema.registry.url=http://localhost:8081
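The kafka-source properties above mix hoodie-level keys with raw Kafka and Confluent registry client settings in one flat file. A minimal sketch of separating the two groups by the `hoodie.` prefix, so only the pass-through client params reach the Kafka consumer (plain `java.util.Properties` here; the actual tool loads these files via DFSPropertiesConfiguration, and the `KafkaSourceProps` class and its `split` method are hypothetical names for illustration):

```java
import java.io.StringReader;
import java.util.Properties;

public class KafkaSourceProps {

    // Splits a flat properties set into hoodie configs (index 0) and
    // pass-through Kafka/registry client params (index 1), keyed on
    // whether the property name starts with "hoodie.".
    public static Properties[] split(Properties all) {
        Properties hoodie = new Properties();
        Properties kafka = new Properties();
        for (String name : all.stringPropertyNames()) {
            (name.startsWith("hoodie.") ? hoodie : kafka)
                .setProperty(name, all.getProperty(name));
        }
        return new Properties[]{hoodie, kafka};
    }

    public static void main(String[] args) throws Exception {
        // a few keys from the kafka-source example above
        String text = String.join("\n",
            "hoodie.deltastreamer.source.kafka.topic=impressions",
            "metadata.broker.list=localhost:9092",
            "auto.offset.reset=smallest",
            "schema.registry.url=http://localhost:8081");
        Properties all = new Properties();
        all.load(new StringReader(text));
        Properties[] parts = split(all);
        System.out.println(parts[0].getProperty("hoodie.deltastreamer.source.kafka.topic")); // impressions
        System.out.println(parts[1].getProperty("metadata.broker.list")); // localhost:9092
    }
}
```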
@@ -1,7 +1,11 @@
 {
   "type" : "record",
   "name" : "triprec",
-  "fields" : [ {
+  "fields" : [
+    {
+      "name" : "timestamp",
+      "type" : "double"
+    }, {
     "name" : "_row_key",
     "type" : "string"
   }, {
86 pom.xml
@@ -116,6 +116,7 @@
     <maven-dependency-plugin.version>2.10</maven-dependency-plugin.version>
     <maven-jar-plugin.version>2.6</maven-jar-plugin.version>
     <maven-surefire-plugin.version>2.19.1</maven-surefire-plugin.version>
+    <fasterxml.version>2.6.5</fasterxml.version>
     <parquet.version>1.8.1</parquet.version>
     <junit.version>4.11</junit.version>
     <mockito.version>1.9.5</mockito.version>
@@ -376,6 +377,16 @@
       <artifactId>hadoop-client</artifactId>
       <version>${hadoop.version}-cdh${cdh.version}</version>
       <scope>provided</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.*</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>javax.servlet</groupId>
+          <artifactId>servlet-api</artifactId>
+        </exclusion>
+      </exclusions>
     </dependency>

     <dependency>
@@ -427,6 +438,12 @@
       <artifactId>hive-common</artifactId>
       <version>${hive.version}-cdh${cdh.version}</version>
       <scope>provided</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.*</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+      </exclusions>
     </dependency>
     <dependency>
       <groupId>org.apache.hadoop</groupId>
@@ -482,12 +499,28 @@
       <artifactId>spark-core_2.11</artifactId>
       <version>${spark.version}</version>
       <scope>provided</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.**</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>javax.servlet</groupId>
+          <artifactId>servlet-api</artifactId>
+        </exclusion>
+      </exclusions>
     </dependency>
     <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-sql_2.11</artifactId>
       <version>${spark.version}</version>
       <scope>provided</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.**</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+      </exclusions>
     </dependency>

     <dependency>
@@ -561,8 +594,33 @@
     <dependency>
       <groupId>com.fasterxml.jackson.core</groupId>
       <artifactId>jackson-annotations</artifactId>
-      <version>2.6.0</version>
+      <version>${fasterxml.version}</version>
     </dependency>
+
+    <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-core</artifactId>
+      <version>${fasterxml.version}</version>
+    </dependency>
+
+    <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-databind</artifactId>
+      <version>${fasterxml.version}</version>
+    </dependency>
+
+    <dependency>
+      <groupId>com.fasterxml.jackson.module</groupId>
+      <artifactId>jackson-module-scala_2.11</artifactId>
+      <version>${fasterxml.version}</version>
+    </dependency>
+
+    <dependency>
+      <groupId>org.codehaus.jackson</groupId>
+      <artifactId>jackson-core-asl</artifactId>
+      <version>1.9.13</version>
+    </dependency>
+
     <dependency>
       <groupId>org.codehaus.jackson</groupId>
       <artifactId>jackson-mapper-asl</artifactId>
@@ -573,6 +631,12 @@
       <groupId>org.apache.hive</groupId>
       <artifactId>hive-jdbc</artifactId>
       <version>${hive.version}-cdh${cdh.version}</version>
+      <exclusions>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.*</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+      </exclusions>
     </dependency>

     <dependency>
@@ -596,6 +660,26 @@
       <artifactId>hadoop-hdfs</artifactId>
       <classifier>tests</classifier>
       <version>${hadoop.version}-cdh${cdh.version}</version>
+      <exclusions>
+        <exclusion>
+          <groupId>org.codehaus</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+        <!-- Need these exclusions to make sure JavaSparkContext can be setup. https://issues.apache.org/jira/browse/SPARK-1693 -->
+        <exclusion>
+          <groupId>org.mortbay.jetty</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>javax.servlet.jsp</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>javax.servlet</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+      </exclusions>
       <scope>test</scope>
     </dependency>
     <dependency>
       <groupId>org.apache.hadoop</groupId>