Apache Kafka, the open-source distributed streaming platform, has seen a meteoric rise in popularity over the last seven years, particularly among developers and engineers in the DevOps community. It was initially released in 2011.
Stephen O’Grady, co-founder and principal analyst at RedMonk, commenting on its growing popularity, said the main reason “it’s becoming more visible [is] because it’s a high-quality open-source project, but also because its ability to handle high-velocity streams of information is increasingly in demand for usage in servicing workloads like IoT, among others”.
What is Apache Kafka and Stream Processing?
Before big data can be analyzed, it must first be ingested and made available to enterprise users. This is the raison d’être of Apache Kafka. The streaming platform acts like the “central nervous system” of an enterprise or organization, gathering high-volume data about many aspects of its operations, including logs, application metrics, user activity, stock tickers and device usage. This information is made available as a real-time stream, allowing users to react as events occur.
Unlike batch processing, the traditional approach in data processing, stream processing works on a continual input, producing output in near real time. Stream processing has become essential in the fast-moving world of mobile, where real-time analytics are necessary to keep up with network demands and functionality.
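The distinction can be sketched in a few lines of Python (a toy illustration, not Kafka code): a batch job waits until the full dataset has been collected before computing anything, while a stream processor updates its result as each event arrives.

```python
# Toy contrast between batch and stream processing (illustration only).

def batch_average(events):
    # Batch: wait until all events are collected, then compute once.
    return sum(events) / len(events)

class StreamingAverage:
    # Stream: update the running result as each event arrives.
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, event):
        self.count += 1
        self.total += event
        return self.total / self.count  # a result is available immediately

events = [10, 20, 30, 40]
avg = StreamingAverage()
partials = [avg.update(e) for e in events]  # an answer after every event
final = batch_average(events)               # one answer at the very end
```

Both arrive at the same final answer; the difference is that the streaming version can act on each intermediate result without waiting for the batch to complete.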
A streaming platform has various key capabilities, the main three being:
- Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system
- Process data streams as they happen
- Store streaming data in a fault-tolerant durable manner
Kafka revolves around several basic principles, including that it:
- Scales across large clusters
- Processes streams of records reliably
- Is easy to set up and deploy
Kafka and the Competition
Kafka has various competitors, including Storm, Flink and Spark. What sets Kafka apart is that it combines distributed and traditional messaging systems with a unique and highly effective pairing of storage and stream processing. This combination enables a streaming data pipeline with more reliable storage and lower latency than most of its rivals, and assured integration with offline systems if networks go down.
The Kafka storage layer is basically a “massively scalable pub/sub message queue designed as a distributed transaction log”, making it a valuable tool for enterprise infrastructures to process streaming data. Kafka also easily connects to external systems (for data import and/or export) via Kafka Connect and provides Kafka Streams, a Java stream processing library.
Check out our detailed comparison post here.
The Rise of DevOps
A significant part of the rise in popularity of stream processing has been the growth of DevOps as a field. DevOps has become an increasingly important movement over the last few years as infrastructure has become more dispersed, and accordingly more complicated to manage. Configuration management has become an essential part of the infrastructure ecosystem, and DevOps has emerged in response – not as a profession, but, in AWS’ words, as a “combination of cultural philosophies, practices, and tools that increase an organization’s ability to deliver applications and services at high velocity: evolving and improving products at a faster pace than organizations using traditional software development and infrastructure management processes”.
DevOps also means that development and operations teams now work together as a single team, allowing engineers to work across the complete lifecycle of an application – from development and testing to deployment and operations – and to hone a wider range of skills.
Automation tends to be a key part of DevOps practice: a new type of technology stack and new kinds of tooling, such as Kafka, speed up processes that have historically been slow.
The Beginnings and Meteoric Growth of Kafka
Kafka was originally developed at LinkedIn to manage real-time streams of data from websites, applications and sensors, and was later donated to the Apache Software Foundation.
Kafka was released as open source software in 2011. It quickly gained the support of multiple high-profile companies, including Uber, Cisco, Netflix, Goldman Sachs and IBM. The list of companies using it today is even more impressive, with the likes of Airbnb, Spotify and Pinterest added to the roster. The platform is used in production by over a third of the Fortune 500.
In 2014, three of the creators behind Kafka (engineers Jun Rao, Jay Kreps, and Neha Narkhede) launched startup Confluent with the goal of helping enterprises use Kafka in production at scale.
“During our explosive growth phase at LinkedIn, we could not keep up with the growing user base and the data that could be used to help us improve the user experience,” said Narkhede. “What Kafka allows you to do is move data across the company and make it available as a continuously free-flowing stream within seconds to people who need to make use of it,” she explains. “And it does that at scale.”
She says the impact at LinkedIn was “transformational”. Indeed, in 2016 LinkedIn – along with Microsoft and Netflix – passed the threshold of processing over 1 trillion messages per day.
Confluent offers advanced management software through a subscription service to large enterprise companies who want to run Kafka for production systems. In 2016, the company grew subscriptions by over 700%. This growth was spread over multiple industries, including finance, healthcare, tech, retail and automotive.
Kreps, co-founder and CEO, attributed its success to the rise in streaming platforms within businesses. “We are seeing incredible demand for streaming platforms and real-time data from customers around the world”. He added, “The dramatic shift from batch processing to real-time streaming is changing how enterprises do business, and companies that realize the potential of streams can immediately respond to their customers, their supply chain or competitive threats.”
Apparently, Kreps named the software after Czech author Franz Kafka because the Kafka system is one “optimized for writing”, and he likes Kafka’s work.
Apache Kafka is built around the commit log: users can subscribe to it and publish the data held within Kafka to real-time applications or any number of other systems.
It is mainly used for two broad types of application:
- Building real-time streaming data pipelines that reliably pass data between systems or applications
- Building real-time streaming applications that transform or react to streams of data
Use cases include matching passengers and drivers at Uber, offering real-time analytics and predictive maintenance for smart homes at British Gas, and performing various real-time services across the LinkedIn platform.
Kafka runs as a cluster on one or more servers (called brokers) that can span multiple datacenters. It stores key-value messages that come from many processes called producers. The cluster stores these streams of records in partitions within different topics. Within a partition, messages are ordered by their offsets (their position within the partition) and indexed and stored together with a timestamp. Other processes, known as consumers, read messages from partitions. Each record essentially consists of a key, a value and a timestamp.
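The mechanics above can be modelled in a few lines of Python. This is a toy in-memory sketch of a single partition, not the real broker implementation: each appended record carries a key, a value and a timestamp, receives the next offset in sequence, and a consumer reads records back in offset order from its current position.

```python
import time

class Partition:
    """Toy model of a Kafka partition: an append-only, offset-indexed log."""
    def __init__(self):
        self.records = []

    def append(self, key, value, timestamp=None):
        offset = len(self.records)  # the offset is the record's position in the log
        self.records.append({
            "offset": offset,
            "key": key,
            "value": value,
            "timestamp": timestamp if timestamp is not None else time.time(),
        })
        return offset

    def read(self, from_offset=0):
        # A consumer reads records in offset order from its current position.
        return self.records[from_offset:]

p = Partition()
p.append("page_view", "/home")
p.append("page_view", "/pricing")
p.append("click", "signup-button")

# A consumer that has already processed offset 0 resumes from offset 1.
remaining = p.read(from_offset=1)
```

Because the log is append-only and offsets are simply positions, a consumer's entire state reduces to a single number: the next offset it wants to read.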
The Kafka architecture allows it to deliver huge streams of messages in a fault-tolerant way, leading to it replacing many of the conventional messaging systems such as JMS and AMQP.
The platform supports two kinds of topic: regular and compacted. Regular topics can be configured with a retention time or a space bound; Kafka may delete old data to free up space when records are older than the retention time or when a partition exceeds its space bound. Compacted topics, by contrast, don’t expire records based on time or space bounds. Instead, Kafka retains the latest record for each key, and users can delete a key’s data entirely by writing a tombstone message with a null value for that key.
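Log compaction can be illustrated with a short sketch (again a simplification of what the broker actually does): compaction keeps only the latest value seen for each key, and a null value acts as a tombstone that removes the key altogether.

```python
def compact(log):
    """Toy log compaction: keep only the latest record per key;
    a None value (tombstone) removes the key entirely."""
    latest = {}
    for key, value in log:   # records arrive in offset order
        latest[key] = value  # later records supersede earlier ones
    # Drop keys whose latest record is a tombstone.
    return {k: v for k, v in latest.items() if v is not None}

log = [
    ("user:1", "alice@old.example"),
    ("user:2", "bob@example.com"),
    ("user:1", "alice@new.example"),  # supersedes the first record
    ("user:2", None),                 # tombstone: delete user:2 entirely
]
state = compact(log)
```

This is why compacted topics suit changelog-style data such as database updates: the topic always holds at least the latest state for every key, however old, without growing without bound.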
Kafka’s Four Core APIs
- The Streams API lets an application act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics. It is written in Java, and its library permits the development of stateful stream-processing applications that are elastic, fault-tolerant and highly scalable.
- The Connector API enables the building and running of reusable producers or consumers that connect topics to existing applications or data systems. The Connect framework executes “connectors”, which implement the actual logic to read/write data from other systems. The Connect API determines the necessary programming interface for the generation of a custom connector.
- The Producer API permits an application to publish a stream of records to one or more Kafka topics.
- The Consumer API permits an application to subscribe to one or more topics and process the stream of records produced to them.
The Producer and Consumer APIs are built on top of Kafka’s messaging protocol and provide a reference implementation for Kafka producer and consumer clients in Java. The underlying protocol lets developers write their own producer or consumer clients in the programming language of their choice, meaning Kafka is not bound to the Java Virtual Machine (JVM) based ecosystem.
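The shape of the Producer and Consumer APIs can be sketched with a toy in-memory stand-in (this is not the real client library, and real clients talk to brokers over the network): a producer sends records to named topics, and a consumer subscribes to a topic and polls for records it has not yet seen.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a Kafka cluster: topics map to record lists."""
    def __init__(self):
        self.topics = defaultdict(list)

class ToyProducer:
    """Mimics the Producer API: publish records to a topic."""
    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, value):
        self.broker.topics[topic].append(value)

class ToyConsumer:
    """Mimics the Consumer API: subscribe to a topic and poll for new records."""
    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.position = 0  # tracked per consumer, like a committed offset

    def poll(self):
        records = self.broker.topics[self.topic][self.position:]
        self.position += len(records)
        return records

broker = ToyBroker()
producer = ToyProducer(broker)
consumer = ToyConsumer(broker, "metrics")

producer.send("metrics", {"cpu": 0.42})
producer.send("metrics", {"cpu": 0.57})
first_batch = consumer.poll()   # both records, in publish order
second_batch = consumer.poll()  # nothing new has arrived yet
```

Note that the broker does not track which consumer has read what; each consumer keeps its own position, which is what makes it cheap for many independent consumers to read the same topic.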
Monitoring Kafka Performance
Monitoring Kafka’s performance at scale has become very important now that it is integrated into so many enterprise-level infrastructures. This kind of monitoring is highly intensive, involving end-to-end tracking of metrics from brokers, consumers and producers, as well as monitoring of ZooKeeper, which Kafka uses to coordinate among consumers.
Monitoring platforms for Kafka include open-source options such as LinkedIn’s Burrow and paid ones such as Datadog. Kafka metrics can also be collected using tools bundled with Java, such as JConsole.
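One of the most watched Kafka metrics is consumer lag: the gap between the last offset written to a partition and the last offset a consumer group has committed. The arithmetic is simple; the partition numbers and offsets below are hypothetical.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition = log-end offset minus committed offset.
    A growing lag means consumers are falling behind producers."""
    return {
        partition: log_end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in log_end_offsets
    }

# Hypothetical snapshot of a three-partition topic.
log_end = {0: 1500, 1: 1498, 2: 1510}
committed = {0: 1500, 1: 1490, 2: 1200}
lag = consumer_lag(log_end, committed)  # partition 2 is falling behind
```

Tools such as Burrow essentially automate this calculation across all consumer groups and alert when lag trends upward rather than holding steady.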
Confluent and the Commercial Version of Apache Kafka
Confluent adds numerous enterprise and production-scale features and support to the regular Kafka environment, including administration, data management, operations tools and thorough testing capabilities. It makes all your company’s data readily available as real-time streams in a single platform: the goal being to “let you focus on how to derive business value from your data rather than worrying about the underlying mechanics of how data is shuttled, shuffled, switched, and sorted between various systems”.
It also enables the hosting of customers’ Kafka environments in the cloud, in addition to providing consultation services on how to maximize the impact of Kafka deployments across a wide range of use cases.
Uses can range from enabling batch Big Data analysis with Hadoop and feeding monitoring systems in real-time, to large volume data integration tasks, which involve a high-throughput, industrial-strength extraction, transformation, and load (ETL) backbone.
Confluent Open Source is available as a free download. Confluent Enterprise is available via a subscription service.
Confluent Partner Program
Confluent also started a partner program to “enable a rapidly growing ecosystem around Kafka and Confluent”. Its goal is to offer partners the resources, technical expertise and guidance to “integrate, implement and innovate with Apache Kafka and Confluent”. Existing partners include big players such as AWS, Google Cloud and Microsoft Azure, along with around 100 other companies, including Infosys, Accenture, Arcadia Data and Cognizant.