Apache Pulsar Stream Processing System Becomes Top-Level Project

October 18, 2018

Categories

The Apache Software Foundation has announced that Pulsar has graduated to become its latest Top Level Project. Apache explains Pulsar as a “next-generation, Open Source distributed publish-and-subscribe messaging system designed for scalability, flexibility, and no data loss.” By graduating Pulsar to Top Level status, Apache hopes to reach a wider community of users and contributors, and build a stronger ecosystem.

What is Pulsar?

Pulsar is a scalable, low latency messaging platform. It runs on commodity hardware. Originally, it was developed at Yahoo, and the initial goal for Pulsar was to create a multi-tenant, scalable messaging system, one that could fulfill a wide side of use-cases. It was originally created as a solution to Yahoo’s challenges with multiple messaging systems, and the problems Yahoo experienced around the multiple teams deploying them. Released in 2016, Pulsar has been used in many Yahoo applications, including Mail, Finance, Sports, Gemini Ads, and Sherpa – Yahoo’s distributed key-value service. As a result, Pulsar has run in production at the scale of Yahoo for over three years. It provides simple pub-sub and queue semantics over topics and has a lightweight compute framework. It provides automatic cursor management for subscribers, as well as cross-datacenter replication.

The two traditional messaging models are queuing and publish-subscribe. Queuing is point-to-point, and allows you to divide up data processing over multiple consumer instances, so that your processing can be scaled. Publish-subscribe, meanwhile, is a broadcast model, where, instead of a message being delivered to one consumer, it is broadcast to all consumers.

Pulsar generalizes queuing and publish-subscribe in a unified messaging API. Producers publish messages to topics, and messages are broadcast to different subscriptions. Consumers can subscribe to those subscriptions to consume messages. Consumers who have subscribed to the same subscription have flexibility in how they consume their messages – exclusively, failover, and shared. Shared subscription (with round robin delivery) allows applications to divide up processing across consumers in the same subscription, just like with a queue. One of the differences to Pulsar versus other messaging systems, is that it allows you to scale the number of active consumers even beyond the number of partitions within a topic.

Uniquely, at the time that it was launched, Pulsar was designed for deployment as a hosted service for public and private cloud – this was not offered by any available open source system. Part of the design rationale behind Pulsar was to make it more cost effective to use a single deployment of Pulsar instead of requiring different teams to operate their own messaging solutions. The other advantage of using one system is that it does not require in-depth knowledge to configure, monitor, and troubleshoot different solutions effectively. By using a single system, cluster servers are better utilized, a dev-ops approach can be taken by multiple teams, and there can be more effective capacity planning using expected peak usage, as well as projected growth.

Some of Pulsar’s features include:

Deployment

You can deploy lightweight compute logic via an API, and you do not need to run your own stream processing engine
Pulsar can be deployed across different cloud providers with data replicated across them. This can be done without being tied to proprietary cloud APIs
You can deploy on bare metal or Kubernetes cluster on premises and/or in the cloud, the Google Kubernetes Engine, and AWS. It has REST Admin API for provisioning, administration, tools, and monitoring
You can download and use Pulsar on a single node for development and testing purposes

Low Latency

It is designed for low publish latency (less than 5ms)

Geo-replication

It has geo-replication – it was designed for configurable data replication between data centers across multiple regions. Users can configure the geographical regions where topics need to be replicated to, and can enable replication of data between Pulsar clusters automatically. Data is then continuously replicated to remote clusters. If there is a network failure across data centers, data is stored. It is then continuously retried until replication is successful. Geo-replication to multiple data centers in different geographical regions can be deployed in several configurations, including: between public and private clouds, between public clouds, and from edge data centers to either public or private clouds. This geo-replication is available out-of-the-box

Multi-tenant

It was built as a multi-tenant system, and supports Isolation, Authentication, Authorization and Quotas

Zero Data Loss

There is zero data loss regardless of errors, power failures, etc., which is achieved through the use of bookies (BookKeeper servers) running in storage nodes. Messages received by a Pulsar broker are sent to a set of bookie nodes. A bookie then saves a copy in memory as well as writing the data to a WAL. Before an acknowledgement is sent to the broker, the log is forced to stable storage. The WAL ensures the data is not lost, and it does so even if the machine fails, and then comes back again. Pulsar’s messages can also survive multiple node failures by replicating each message to multiple bookies. Configured number of replicas write the message to bookie. After they have done so successfully, Pulsar sends an acknowledgement to the producer
Pulsar’s layered architecture isolates the storage mechanism from the broker, which allows you to scale brokers and bookies independently and containerize ZooKeeper, broker, and bookies. The configuration and state of the cluster is provided by ZooKeeper

I/O Isolation

Pulsar has persistent message storage that is based on Apache BookKeeper. Between read and write levels, it provides IO-level isolation. It does this by using different paths of execution for reads and writes. When reads and writes share a single path of execution, consumer lag can produce general performance degradation. I/O isolation can achieve lower and more predictable push latency. This is true even when disks are saturated with heavy read activity

Multi-language API

Applications can interact with Pulsar with official client libraries for C++, Java and Python. Users can write their application in their language of choice and run them in production. Across languages, the APIs are intuitive and consistent. To accommodate different application styles, Pulsar clients provide support for both synchronous and asynchronous read and write operations. Whether synchronous or asynchronous, the semantics are the same. Either the API method blocks until the operation completes, or it returns a future object, and that future object can be used to track completion

Zero Rebalancing

There is zero rebalancing time. Any new broker added to the cluster is immediately available for writes and reads and does not spend any time rebalancing across the cluster. Thanks to the underlying Distributed Log Architecture, new bookies added to the cluster are immediately ready for writes

Scalability

Pulsar has both high-throughput scalability, and scalability in terms of the number of topics
By driving disks close to their maximum I/O bandwidth, Pulsar achieves high throughput. Pulsar varies throughputs based on message size, for example, with a 1 KB message, Pulsar saturates the disk at 120 MB/sec, but for a 10 byte message, throughput has been measured at 1.8 M/sec. Both cases have a publish latency below 5ms at the 99pct, and in both cases, a single publisher writes to a single topic with one partition
Pulsar can scale from hundreds to millions of topics while providing consistent performance. Storing data for a topic in a dedicated file or directory will have issues scaling because I/O will be scattered across the disk. Periodically, these files will have to be flushed. Pulsar data is stored in bookies, so messages from different topics are aggregated, sorted and stored in large files, then indexes

Security

Clients can authenticate themselves, since Pulsar supports a pluggable authentication mechanism. There are two authentication providers supported right out-of-the-box – TLS Authentication and Athenz, but Pulsar can be configured to support multiple authentication providers. The authentication providers establish the identity of a client by assigning the client a role token, which is used to determine which operations the client is authorized to perform

Inbuilt Load Balancers and Service Discovery

Pulsar’s inbuilt load balancers distribute loads across all brokers internally
Pulsar has inbuilt service discovery. This identifies, using a single endpoint, where and how to connect to brokers

Achieving Top Level Project Status

A Top Level Project (TLP) is a project that has received the highest status and can be considered part of the Apache Software Foundation (ASF). To do so, they have to go through an incubator period. For any project or database that wishes to become part of the Foundation’s effort, this incubator period guarantees that all donations are in accordance with ASF legal standards, and it ensures that new communities, which are developed always adhere to the Apache Foundation’s guiding principles. After completing the incubation period, Pulsar achieved its status as a Top Level Project.

Matteo Merli, Vice President of Apache Pulsar, said, “We are very proud of reaching this important milestone. This is the testament to all work done over the years by all the contributors, before and after starting our journey within The Apache Software Foundation. During the incubation process, it has been amazing to see the community grow and the project mature at such a high pace. The last year has seen the evolution of Pulsar from its original messaging core into an integrated platform for data in motion. We are thrilled to continue to drive the innovation in this exciting and fast moving space.”

Apache vs. Kafka

Comparisons are being made between Pulsar and another ASF project, Kafka. Kafka was developed at LinkedIn. Messaging and data pipelines are the two top uses for Kafka. Merli had this to say about Apache and Kafka, “There is a big overlap in the use cases for the two systems, but the original designs were very different.”

Pulsar’s messaging model unifies queuing and streaming into a single API, without having to set up one thing for queuing and one for streaming. While Pulsar was designed for shared data consumption, Kafka was not. Pulsar has a two-layer design – a stateless layer of brokers that receive and deliver messages, and a stateful persistence layer with bookies that provide low-latency, durable storage. This provides strong data guarantees and enables users to configure a retention period for messages. This retention period remains even after all subscriptions have consumed them.

Sijie Guo, Co-Founder of Streamlio, summed up the difference between the two, saying, “Apache Pulsar combines high-performance streaming (which Apache Kafka pursues) and flexible traditional queuing (which RabbitMQ pursues) into a unified messaging model and API. Pulsar gives you one system for both streaming and queuing, with the same high performance, using a unified API.”

Here is a breakdown that lists similarities and differences:

Concepts

Pulsar is producer-topic-consumer group-consumer
Kafka is producer-topic-subscription consumer

Consumption

Kafka has no shared consumption. It is more focused on streaming, exclusive messaging on partitions
Pulsar has streaming via exclusive, failover subscription and queuing via shared subscription

Acking

For Kafka, it depends on the version. Prior to 0.8, offsets are stored in ZooKeeper, but after 0.8 offsets are stored on offset topics
Pulsar has streaming via exclusive, failover subscription and queuing via shared subscription

Retention

In Kafka, messages are deleted based on retention. If a message is unread by the consumer before the retention period, it will lose data
Pulsar only deletes messages after all subscriptions consume them. There is no data loss even if the consumers of a subscription are down for a long time. Then, messages are kept for a retention period even after all subscriptions consume them

TTL

Kafka has no TTL support
Pulsar supports message TTL

In published results of a stream processing benchmark, it was found that Pulsar has up to a 150% performance improvement over Kafka, while maintaining up to a 60% lower latency. Pulsar also has other advantages over Kafka:

For multi-data-center geographic mirroring, Pulsar offers native support.
While Kafka’s MirrorMaker had a number of issues for even two data centers within Yahoo, Pulsar is running full-mesh active active in 10
Pulsar has zero data loss. When Kafka was developed, some data loss was acceptable
Scaling Kafka can be difficult, which is not true of Pulsar
Pulsar allows you to add nodes on the fly without impacting production