Reworking the deltastreamer tool

- Standardize the version of Jackson
- DFSPropertiesConfiguration replaces usage of commons PropertiesConfiguration
- Remove dependency on ConstructorUtils
- Throw an error during key generation if the ordering value is not present (see the first sketch after this list)
- Switch to the shade plugin for hoodie-utilities
- Add support for consuming Confluent Avro Kafka serdes
- Support the Confluent schema registry (see the second sketch after this list)
- KafkaSource now handles skew gracefully, by round-robin allocation of the source limit across partitions (see the third sketch after this list)
- Add support for BULK_INSERT operations as well
- Pass the payload class config properly into HoodieWriteClient
- Fix documentation based on new usage
- Add tests for deltastreamer, sources, and all new util classes.
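
As a rough sketch of the stricter key-generation behavior above (the class and method names here are hypothetical, not Hudi's actual key generator): look up the configured ordering field on the incoming Avro record and fail fast when it is absent, rather than silently emitting a record with an undefined ordering value.

```java
import org.apache.avro.generic.GenericRecord;

// Hypothetical sketch, not the actual Hudi key generator: fail fast when
// the configured ordering field is missing from an incoming record.
public class OrderingValueCheck {

  /** Returns the record's ordering value, or throws if the field is absent/null. */
  static Comparable<?> orderingValue(GenericRecord record, String orderingField) {
    Object value = record.getSchema().getField(orderingField) == null
        ? null
        : record.get(orderingField);
    if (value == null) {
      throw new IllegalArgumentException(
          "Ordering field '" + orderingField + "' not found in record: " + record);
    }
    return (Comparable<?>) value;
  }
}
```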
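
Next, a minimal sketch of how a schema provider might pull the latest Avro schema for a subject from a Confluent schema registry. The `/subjects/{subject}/versions/latest` endpoint is the registry's documented REST API; the class name and the plain-HTTP approach are illustrative assumptions, not Hudi's actual implementation.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.avro.Schema;

import java.io.InputStream;
import java.net.URL;

// Illustrative sketch (not Hudi's actual class): fetch the latest registered
// Avro schema for a subject from a Confluent schema registry over REST.
public class RegistrySchemaFetcher {

  private static final ObjectMapper MAPPER = new ObjectMapper();

  static Schema fetchLatestSchema(String registryUrl, String subject) throws Exception {
    // Confluent REST API: GET /subjects/{subject}/versions/latest
    URL url = new URL(registryUrl + "/subjects/" + subject + "/versions/latest");
    try (InputStream in = url.openStream()) {
      // The response JSON carries the schema as an escaped string under "schema".
      String schemaJson = MAPPER.readTree(in).get("schema").asText();
      return new Schema.Parser().parse(schemaJson);
    }
  }

  public static void main(String[] args) throws Exception {
    Schema schema = fetchLatestSchema("http://localhost:8081", "my-topic-value");
    System.out.println(schema.toString(true));
  }
}
```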
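
Finally, a sketch of the round-robin idea behind the KafkaSource skew handling: instead of letting one hot partition consume the whole source limit, hand out the budget in even shares across partitions, capped by each partition's backlog, until the limit is exhausted. All names here are hypothetical, not Hudi's actual API.

```java
import java.util.Arrays;

// Hypothetical sketch of round-robin allocation of a total source limit
// across Kafka partitions; names are illustrative, not Hudi's actual API.
public class RoundRobinAllocator {

  /**
   * Splits sourceLimit events across partitions, capped by each partition's
   * backlog, so a single skewed partition cannot starve the others.
   */
  static long[] allocate(long sourceLimit, long[] backlog) {
    int n = backlog.length;
    long[] allocated = new long[n];
    long remaining = sourceLimit;
    boolean progress = true;
    while (remaining > 0 && progress) {
      progress = false;
      // Even share per partition this round; at least 1 so we always terminate.
      long share = Math.max(1, remaining / n);
      for (int i = 0; i < n && remaining > 0; i++) {
        long take = Math.min(share, Math.min(remaining, backlog[i] - allocated[i]));
        if (take > 0) {
          allocated[i] += take;
          remaining -= take;
          progress = true;
        }
      }
    }
    return allocated;
  }

  public static void main(String[] args) {
    // One hot partition next to two quiet ones: the limit is shared evenly
    // instead of being swallowed entirely by the hot partition.
    long[] backlog = {1_000_000L, 500L, 500L};
    System.out.println(Arrays.toString(allocate(1000, backlog)));  // [334, 333, 333]
  }
}
```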
Author: Vinoth Chandar
Date: 2018-08-04 03:35:30 -07:00
Committer: vinoth chandar
Parent: fb95dbdedb
Commit: d58ddbd999
49 changed files with 1919 additions and 754 deletions

```diff
@@ -47,7 +47,7 @@ just for analytics. Finally, HBase does not support incremental processing primi
 A popular question we get is: "How does Hoodie relate to stream processing systems?", which we will try to answer here. Simply put, Hoodie can integrate with
 batch (`copy-on-write storage`) and streaming (`merge-on-read storage`) jobs of today, to store the computed results in Hadoop. For Spark apps, this can happen via direct
 integration of the Hoodie library with Spark/Spark Streaming DAGs. For non-Spark processing systems (e.g. Flink, Hive), the processing can be done in the respective systems
-and later sent into a Hoodie table via a Kafka topic/HDFS intermediate file. (TODO: Need link to SQLStreamer/DeltaStreamer after reworking). At a more conceptual level, data processing
+and later sent into a Hoodie table via a Kafka topic/HDFS intermediate file. At a more conceptual level, data processing
 pipelines just consist of three components: `source`, `processing`, `sink`, with users ultimately running queries against the sink to use the results of the pipeline.
 Hoodie can act as either a source or a sink that stores data on HDFS. The applicability of Hoodie to a given stream processing pipeline ultimately boils down to the
 suitability of Presto/SparkSQL/Hive for your queries.
```