In the early days of data processing, batch-oriented infrastructure was an effective way to process and output data. Now that networks are moving to mobile, where real-time analytics are required to keep up with network demands and functionality, stream processing has become vital.
While batch processing stores data and processes it at a later time, often with separate programs for analyzing input and output data, stream processing takes a continual input and outputs data in near real time.
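The difference can be illustrated with a minimal Python sketch. It is not tied to any of the processors discussed here, and the function names `batch_average` and `streaming_average` are hypothetical, chosen only for this example.

```python
from typing import Iterable, Iterator, List


def batch_average(stored_readings: List[float]) -> float:
    """Batch style: the full dataset is stored first, then processed once."""
    return sum(stored_readings) / len(stored_readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Stream style: emit an updated result as each record arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0]
print(batch_average(readings))            # one result, after all data is in
print(list(streaming_average(readings)))  # one result per incoming record
```

The batch version cannot answer until every reading has been stored, while the streaming version always has a current answer and never needs the full dataset in memory.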
The keys to stream processing revolve around the same basic principles:
- Processes data streams as they occur
- Stores streaming data in a fault-tolerant way
- Scales across large clusters of machines
- Publishes stream records reliably
- Is easy to deploy
With these traits in mind, our researchers looked into four open source stream processors: Storm, Flink, Spark, and Kafka. Below we’ll give an overview of our findings to help you decide which real-time processor best suits your network.
Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm works by using your existing queuing and database technologies to process complex streams of data, separating and processing streams at different stages in the computation in order to meet your needs.
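The idea of processing a stream in separate stages can be sketched with plain Python generators. This is only an illustration of the staged model, not Storm's API; Storm itself calls its sources spouts and its processing stages bolts. The stage names below are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def sentence_source(lines: Iterable[str]) -> Iterator[str]:
    """Stage 1: emit raw records (stands in for a queue or database feed)."""
    for line in lines:
        yield line


def split_words(sentences: Iterable[str]) -> Iterator[str]:
    """Stage 2: split each sentence into lowercase words."""
    for sentence in sentences:
        for word in sentence.lower().split():
            yield word


def running_counts(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Stage 3: keep a running count per word and emit each update."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


# Each record flows through all three stages as soon as it is produced.
pipeline = running_counts(split_words(sentence_source(["the cat", "the dog"])))
for word, count in pipeline:
    print(word, count)
```

In a distributed system each stage would run in parallel on many machines; the sketch shows only how the stages compose.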
Storm’s main use cases include:
- Real-time analytics
- Online machine learning
- Continuous computation
- Distributed RPC, ETL, and more
Tests have shown Storm to be reliably fast, with benchmark speeds clocked at “over a million tuples processed per second per node.” Another big draw of Storm is its scalability, with parallel calculations running across multiple clusters of machines. Despite the complexity of the system, it is also fault-tolerant, automatically restarting nodes and repositioning the workload across nodes. Storm also boasts of its ease of use, with “standard configurations suitable for production on day one”. The Storm site contains many forums and tutorials to help walk any user through setup and get the system running.
Flink is an open-source framework for distributed stream processing that:
- Is stateful and fault-tolerant, and can seamlessly recover from failures while maintaining exactly-once application state
- Performs at large scale, running on thousands of nodes with very good throughput and latency characteristics
- Provides fault-tolerant state management
- Delivers accurate results, even with late or out-of-order data
- Offers flexible windowing for computing accurate results on unbounded datasets
- Keeps state that summarizes the data processed over time
- Uses a checkpointing mechanism to recover in the event of a failure
Flink processes data streams as true streams, i.e., data elements are immediately “pipelined” through a streaming program as soon as they arrive. This makes it possible to perform flexible window operations on streams. Flink can even handle late data in streams through the use of watermarks. Furthermore, Flink provides a very strong compatibility mode that makes it possible to run your existing Storm, MapReduce, … code on the Flink execution engine.
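How windows and watermarks interact can be sketched in plain Python. This is a simplified model rather than Flink's API: it assumes fixed-size (tumbling) windows, a watermark that trails the newest event time by a fixed bound, and that events arriving after their window has closed are dropped. The constants and the function name `windowed_sums` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SIZE = 10   # seconds per window (illustrative value)
MAX_LATENESS = 5   # how far the watermark trails the newest event time


def windowed_sums(events: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Sum values per event-time window.

    events: (event_time, value) pairs in arrival order, possibly out of order.
    Returns (window_start, sum) pairs in the order the windows are closed.
    """
    open_windows: Dict[int, int] = defaultdict(int)
    results: List[Tuple[int, int]] = []
    watermark = float("-inf")

    for event_time, value in events:
        window_start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
        if window_start + WINDOW_SIZE <= watermark:
            continue  # the window is already closed: the event is too late
        open_windows[window_start] += value

        # The watermark asserts that no older events are expected anymore.
        watermark = max(watermark, event_time - MAX_LATENESS)

        # Close every window whose end is at or before the watermark.
        for start in sorted(open_windows):
            if start + WINDOW_SIZE <= watermark:
                results.append((start, open_windows.pop(start)))

    # End of input: flush the windows that are still open.
    for start in sorted(open_windows):
        results.append((start, open_windows.pop(start)))
    return results


events = [(1, 1), (4, 1), (12, 1), (8, 1), (16, 1), (3, 1), (21, 1)]
print(windowed_sums(events))  # [(0, 3), (10, 2), (20, 1)]
```

The event at time 8 arrives after the event at time 12 but is still counted, because the watermark has not yet passed the end of its window. The event at time 3 arrives after that window has closed and is discarded.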
Flink is capable of high throughput and low latency, with side-by-side comparisons showing robust speeds compared to Storm.
Spark is mainly used for in-memory processing of batch data, but it also supports stream processing by wrapping data streams into small batches: it collects all data that arrives within a certain period of time and runs a regular batch program on the collected data. Spark is often used for machine learning because these algorithms tend to be iterative, which is the kind of workload Spark was designed for. Spark can cache datasets in memory for much faster repeated access, making it ideal for:
- Machine learning
- SQL workloads that require fast iterative access to datasets
- Workloads that run in conjunction with Apache Hadoop to “exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop.”
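The micro-batch approach described above, collecting whatever arrives within a time interval and then running an ordinary batch program on it, can be sketched in plain Python. This is not Spark's API; the helper names are hypothetical, and the sketch assumes the stream starts at time 0 and that arrival times never decrease.

```python
from typing import Iterable, Iterator, List, Tuple


def micro_batches(
    events: Iterable[Tuple[float, str]], interval: float
) -> Iterator[List[str]]:
    """Group (arrival_time, record) pairs into consecutive fixed-length intervals."""
    batch: List[str] = []
    batch_end = interval  # the stream is assumed to start at t = 0
    for arrival_time, record in events:
        # Close every interval that ended before this record arrived.
        while arrival_time >= batch_end:
            yield batch
            batch = []
            batch_end += interval
        batch.append(record)
    if batch:
        yield batch


def count_records(batch: List[str]) -> int:
    """A stand-in for the regular batch program run on each small batch."""
    return len(batch)


events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (3.4, "d")]
for batch in micro_batches(events, interval=1.0):
    print(batch, count_records(batch))
```

Note that an interval with no arrivals still produces an (empty) batch, and that no result can appear sooner than the end of its interval, which is why this style is near real time rather than record-at-a-time.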
According to their support handbook, Spark also includes “MLlib, a library that provides a growing set of machine algorithms for common data science techniques: Classification, Regression, Collaborative Filtering, Clustering and Dimensionality Reduction.” So if your system requires a lot of data science workflows, Spark and its abstraction layer could make it an ideal fit. Also, a recent Syncsort survey states that Spark has even managed to displace Hadoop in terms of visibility and popularity on the market.
For more complex transformations, Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing that compute “aggregations off of streams or join streams together.”
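To make "aggregations" and "joins" concrete, here is a minimal plain-Python sketch. It does not use the Kafka Streams API; the function names and sample data are hypothetical, and the join is simplified by assuming the second input has already been materialized as a lookup table.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def running_totals(purchases: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Aggregation: emit the updated total per key after every record."""
    totals: Dict[str, int] = defaultdict(int)
    for user, amount in purchases:
        totals[user] += amount
        yield user, totals[user]


def join_with_regions(
    totals: Iterable[Tuple[str, int]], regions: Dict[str, str]
) -> Iterator[Tuple[str, str, int]]:
    """Join: enrich each stream record with a value looked up by key."""
    for user, total in totals:
        yield user, regions.get(user, "unknown"), total


purchases = [("ann", 5), ("bob", 3), ("ann", 2)]
regions = {"ann": "EU"}
for row in join_with_regions(running_totals(purchases), regions):
    print(row)
```

Both steps keep state between records (the running totals and the lookup table), which is what separates this kind of processing from simple per-record transformations.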
Kafka helps solve many of the hard problems in stream processing:
- Handles out-of-order data
- Reprocesses input as code changes
- Performs stateful computations
- Uses the producer and consumer APIs for input
- Uses the group mechanism for fault tolerance among the stream processor instances
Kafka combines both distributed and traditional messaging systems, pairing them with a combination of storage and stream processing in a way that isn’t widely seen but is essential to Kafka’s infrastructure.
A distributed file system like HDFS allows storing static files for batch processing. Effectively, a system like this allows storing and processing historical data. A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives. Kafka uses a combination of the two to create a more measured streaming data pipeline, with lower latency, better storage reliability, and dependable integration with offline systems that may go down.
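The combination of stored history and future messages can be sketched as an append-only log that consumers read by offset. This is a toy in-memory model, not Kafka's client API; the class `MiniLog` and its methods are hypothetical.

```python
from typing import List, Tuple


class MiniLog:
    """An append-only log: records are kept, and each gets a numeric offset."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def append(self, record: str) -> int:
        """Store a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int) -> List[Tuple[int, str]]:
        """Return every (offset, record) pair at or after the given offset."""
        return list(enumerate(self._records[offset:], start=offset))


log = MiniLog()
log.append("order-1")
log.append("order-2")

# A consumer that subscribes late can still replay the stored history...
history = log.read_from(0)
next_offset = history[-1][0] + 1

# ...and then continue with records that arrive after it subscribed.
log.append("order-3")
print(history)                     # [(0, 'order-1'), (1, 'order-2')]
print(log.read_from(next_offset))  # [(2, 'order-3')]
```

Because the consumer only tracks an offset, a system that was offline for a while can resume from where it stopped instead of losing the messages it missed.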
In order to keep up with the changing nature of networking, data needs to be available and processed in a way that serves your business in real time. Figuring out which stream processor works for you is more important now than ever.