hudi/docs/comparison.md at 8adaca34548ffdb89615dfbb4a2d44d8f71bff55


Vinoth Chandar d58ddbd999 Reworking the deltastreamer tool

- Standardize version of jackson
 - DFSPropertiesConfiguration replaces usage of commons PropertiesConfiguration
 - Remove dependency on ConstructorUtils
 - Throw error if ordering value is not present, during key generation
 - Switch to shade plugin for hoodie-utilities
 - Added support for consumption for Confluent avro kafka serdes
 - Support for Confluent schema registry
 - KafkaSource now deals with skews nicely, by doing round robin allocation of source limit across partitions
 - Added support for BULK_INSERT operations as well
 - Pass in the payload class config properly into HoodieWriteClient
 - Fix documentation based on new usage
 - Adding tests on deltastreamer, sources and all new util classes.

2018-09-08 10:24:32 +08:00


| title | keywords | sidebar | permalink | toc |
|---|---|---|---|---|
| Comparison | usecases | mydoc_sidebar | comparison.html | false |

Hoodie fills a big void for processing data on top of HDFS, and thus mostly co-exists nicely with these technologies. However, it is useful to understand how Hoodie fits into the current big data ecosystem by contrasting it with a few related systems and bringing out the different tradeoffs these systems have accepted in their design.

Kudu

Apache Kudu is a storage system with goals similar to Hoodie's: bringing real-time analytics on petabytes of data via first-class support for upserts. A key differentiator is that Kudu also attempts to serve as a datastore for OLTP workloads, something that Hoodie does not aspire to be. Consequently, Kudu does not support incremental pulling (as of early 2017), something Hoodie does to enable incremental processing use cases.
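To make upserts and incremental pulling concrete, here is a minimal in-memory sketch. It is not Hoodie's API: `ToyTable`, `upsert`, and `incremental_pull` are hypothetical names, and small integers stand in for real commit times.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyTable:
    """Hypothetical stand-in for an upsertable table; not Hoodie's API."""

    # record key -> (commit time of the last write, value)
    rows: Dict[str, Tuple[int, str]] = field(default_factory=dict)
    last_commit: int = 0

    def upsert(self, records: Dict[str, str]) -> int:
        """Insert new keys and overwrite existing ones under one new commit."""
        self.last_commit += 1
        for key, value in records.items():
            self.rows[key] = (self.last_commit, value)
        return self.last_commit

    def incremental_pull(self, since_commit: int) -> List[Tuple[str, str]]:
        """Return only rows written after `since_commit`, sorted by key."""
        return sorted(
            (key, value)
            for key, (commit, value) in self.rows.items()
            if commit > since_commit
        )


table = ToyTable()
c1 = table.upsert({"a": "v1", "b": "v1"})
c2 = table.upsert({"b": "v2", "c": "v1"})  # "b" is updated, "c" is inserted

# A downstream job that already consumed commit c1 sees only the changes.
changed = table.incremental_pull(since_commit=c1)
print(changed)  # [('b', 'v2'), ('c', 'v1')]
```

The downstream job only needs to remember the last commit it consumed, which is what lets it avoid rescanning the whole table.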

Kudu diverges from a distributed file system abstraction and HDFS altogether, with its own set of storage servers talking to each other via Raft. Hoodie, on the other hand, is designed to work with an underlying Hadoop-compatible filesystem (HDFS, S3 or Ceph) and does not have its own fleet of storage servers, instead relying on Apache Spark to do the heavy lifting. Thus, Hoodie can be scaled easily, just like other Spark jobs, while Kudu would require hardware and operational support typical of datastores like HBase or Vertica. We have not, at this point, done any head-to-head benchmarks against Kudu (given RTTable is WIP). But if we were to go by the results shared by CERN, we expect Hoodie to be positioned as something that ingests Parquet with superior performance.

Hive Transactions

Hive Transactions/ACID is another similar effort, which tries to implement merge-on-read-like storage on top of the ORC file format. Understandably, this feature is heavily tied to Hive and other efforts like LLAP. Hive transactions do not offer the read-optimized storage option or the incremental pulling that Hoodie does. In terms of implementation choices, Hoodie leverages the full power of a processing framework like Spark, while the Hive transactions feature is implemented underneath by Hive tasks/queries kicked off by the user or the Hive metastore. Based on our production experience, embedding Hoodie as a library into existing Spark pipelines was much easier and less operationally heavy than the other approach. Hoodie is also designed to work with non-Hive engines like Presto/Spark and will incorporate file formats other than Parquet over time.
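As a rough illustration of the difference between a merge-on-read view and a read-optimized view, consider the simplified model below. It assumes updates are appended to a log and folded into base data by a compaction step. `ToyMergeOnRead` and its methods are hypothetical and do not reflect Hoodie or Hive internals.

```python
from typing import Dict, List, Tuple


class ToyMergeOnRead:
    """Hypothetical model of merge-on-read storage; not Hoodie or Hive code."""

    def __init__(self) -> None:
        self.base: Dict[str, str] = {}        # compacted data
        self.log: List[Tuple[str, str]] = []  # appended updates, in write order

    def write(self, key: str, value: str) -> None:
        """Writes are cheap appends; the base data is not rewritten yet."""
        self.log.append((key, value))

    def read_optimized(self) -> Dict[str, str]:
        """Fast view: base data only, so it may lag behind recent writes."""
        return dict(self.base)

    def read_merged(self) -> Dict[str, str]:
        """Fresh view: apply logged updates on top of the base at read time."""
        merged = dict(self.base)
        for key, value in self.log:
            merged[key] = value  # later log entries win
        return merged

    def compact(self) -> None:
        """Fold the log into the base so both views agree again."""
        self.base = self.read_merged()
        self.log.clear()


store = ToyMergeOnRead()
store.write("k1", "old")
store.compact()
store.write("k1", "new")

stale = store.read_optimized()  # {'k1': 'old'}
fresh = store.read_merged()     # {'k1': 'new'}

store.compact()
after = store.read_optimized()  # {'k1': 'new'}
```

The tradeoff in this toy model is that the read-optimized view never pays the merge cost but can lag, while the merged view is always current and does more work per read.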

HBase

Even though HBase is ultimately a key-value store for OLTP workloads, users often tend to associate HBase with analytics given its proximity to Hadoop. Given that HBase is heavily write-optimized, it supports sub-second upserts out of the box, and Hive-on-HBase lets users query that data. However, in terms of actual performance for analytical workloads, hybrid columnar storage formats like Parquet/ORC handily beat HBase, since these workloads are predominantly read-heavy. Hoodie bridges this gap between faster data and analytical storage formats. From an operational perspective, arming users with a library that provides faster data is more scalable than managing a big farm of HBase region servers just for analytics. Finally, HBase does not support incremental processing primitives such as commit times and incremental pull as first-class citizens, as Hoodie does.

Stream Processing

A popular question we get is: "How does Hoodie relate to stream processing systems?", which we will try to answer here. Simply put, Hoodie can integrate with today's batch (copy-on-write storage) and streaming (merge-on-read storage) jobs to store the computed results in Hadoop. For Spark apps, this can happen via direct integration of the Hoodie library with Spark/Spark Streaming DAGs. For non-Spark processing systems (e.g. Flink, Hive), the processing can be done in the respective systems and later sent into a Hoodie table via a Kafka topic or an intermediate HDFS file. At a more conceptual level, data processing pipelines consist of just three components: source, processing, and sink, with users ultimately running queries against the sink to use the results of the pipeline. Hoodie can act as either a source or a sink that stores data on HDFS. The applicability of Hoodie to a given stream processing pipeline ultimately boils down to the suitability of Presto/SparkSQL/Hive for your queries.
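The source, processing, sink decomposition can be sketched in a few lines. In this toy example a plain dictionary plays the role of the table, first as the sink of one pipeline and then as the source of the next. All names (`run_pipeline`, `sum_by_key`, `keep_large`, the sample events) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Record = Tuple[str, int]


def run_pipeline(
    source: Iterable[Record],
    process: Callable[[Iterable[Record]], Iterator[Record]],
    sink: Dict[str, int],
) -> None:
    """Pull records from the source, transform them, and upsert into the sink."""
    for key, value in process(source):
        sink[key] = value


def sum_by_key(records: Iterable[Record]) -> Iterator[Record]:
    """Processing step 1: total the values seen for each key."""
    totals: Dict[str, int] = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    yield from totals.items()


def keep_large(records: Iterable[Record]) -> Iterator[Record]:
    """Processing step 2: keep only keys whose total is at least 12."""
    return ((key, value) for key, value in records if value >= 12)


events = [("rider_1", 4), ("rider_2", 12), ("rider_1", 7)]

# First pipeline: the table is the sink.
totals_table: Dict[str, int] = {}
run_pipeline(events, sum_by_key, totals_table)

# Second pipeline: the same table is now the source.
large_table: Dict[str, int] = {}
run_pipeline(totals_table.items(), keep_large, large_table)

print(totals_table)  # {'rider_1': 11, 'rider_2': 12}
print(large_table)   # {'rider_2': 12}
```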

More advanced use cases revolve around the concepts of incremental processing, which effectively uses Hoodie even inside the processing engine to speed up typical batch pipelines. For example, Hoodie can be used as a state store inside a processing DAG (similar to how RocksDB is used by Flink). This is an item on the roadmap and will eventually happen as a Beam Runner.
