Bizety: Research & Consulting

Apache Pulsar Stream Processing System Becomes Top-Level Project

The Apache Software Foundation has announced that Pulsar has graduated to become its latest Top Level Project. Apache describes Pulsar as a “next-generation, Open Source distributed publish-and-subscribe messaging system designed for scalability, flexibility, and no data loss.” With Pulsar’s graduation to Top Level status, Apache hopes to reach a wider community of users and contributors and to build a stronger ecosystem.

What is Pulsar?

Pulsar is a scalable, low-latency messaging platform that runs on commodity hardware. It was originally developed at Yahoo, where the initial goal was to create a multi-tenant, scalable messaging system that could serve a wide set of use cases. It grew out of Yahoo’s difficulties with running multiple messaging systems and with having multiple teams deploy them. Released in 2016, Pulsar has been used in many Yahoo applications, including Mail, Finance, Sports, Gemini Ads, and Sherpa, Yahoo’s distributed key-value service, and it has run in production at Yahoo scale for over three years. It provides simple pub-sub and queue semantics over topics, a lightweight compute framework, automatic cursor management for subscribers, and cross-datacenter replication.

The two traditional messaging models are queuing and publish-subscribe. Queuing is point-to-point, and allows you to divide up data processing over multiple consumer instances, so that your processing can be scaled. Publish-subscribe, meanwhile, is a broadcast model, where, instead of a message being delivered to one consumer, it is broadcast to all consumers.
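The difference between the two models can be sketched in a few lines of Python. This is a toy illustration, not code from any messaging system: the function names `deliver_queue` and `deliver_pubsub` and the sample messages are invented for the example.

```python
from collections import defaultdict
from itertools import cycle


def deliver_queue(messages, consumers):
    """Point-to-point: each message goes to exactly one consumer (round robin)."""
    received = defaultdict(list)
    targets = cycle(consumers)
    for msg in messages:
        received[next(targets)].append(msg)
    return dict(received)


def deliver_pubsub(messages, consumers):
    """Broadcast: every consumer receives a copy of every message."""
    return {consumer: list(messages) for consumer in consumers}


messages = ["m1", "m2", "m3", "m4"]
consumers = ["a", "b"]

queue_result = deliver_queue(messages, consumers)    # work is split between a and b
pubsub_result = deliver_pubsub(messages, consumers)  # a and b each see all four messages
```

With queuing, adding consumers spreads the same workload more thinly; with publish-subscribe, adding consumers multiplies the number of deliveries.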

Pulsar generalizes queuing and publish-subscribe in a unified messaging API. Producers publish messages to topics, and each message is broadcast to every subscription on the topic. Consumers attach to a subscription to consume its messages. Consumers attached to the same subscription can consume in one of three modes: exclusive, failover, or shared. A shared subscription (with round-robin delivery) lets applications divide processing across the consumers in that subscription, just as with a queue. One difference between Pulsar and other messaging systems is that it lets you scale the number of active consumers beyond the number of partitions in a topic.
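The subscription model described above can be approximated with a small in-memory simulation. This is a sketch under stated assumptions, not the Pulsar client API: the `Topic` and `Subscription` classes are hypothetical, the failover mode simply treats the earliest-attached consumer as the active one, and messages published while a subscription has no consumers are dropped rather than kept as a backlog.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    mode: str  # "exclusive", "failover", or "shared"
    consumers: list = field(default_factory=list)
    received: dict = field(default_factory=dict)
    _next: int = 0  # round-robin position for shared mode

    def attach(self, name):
        if self.mode == "exclusive" and self.consumers:
            raise RuntimeError("exclusive subscription already has a consumer")
        self.consumers.append(name)
        self.received[name] = []

    def detach(self, name):
        self.consumers.remove(name)

    def deliver(self, msg):
        if not self.consumers:
            return  # toy simplification: no backlog is kept
        if self.mode == "shared":
            # Round robin across every attached consumer.
            target = self.consumers[self._next % len(self.consumers)]
            self._next += 1
        else:
            # Exclusive: the only consumer. Failover: assumed to be the
            # earliest-attached consumer still connected.
            target = self.consumers[0]
        self.received[target].append(msg)


class Topic:
    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, sub_name, consumer, mode):
        sub = self.subscriptions.setdefault(sub_name, Subscription(mode))
        sub.attach(consumer)
        return sub

    def publish(self, msg):
        # Every subscription gets its own copy of the message (pub-sub);
        # within a subscription, the mode decides which consumer sees it.
        for sub in self.subscriptions.values():
            sub.deliver(msg)


topic = Topic()
audit = topic.subscribe("audit", "auditor", "exclusive")
workers = topic.subscribe("workers", "w1", "shared")
topic.subscribe("workers", "w2", "shared")
topic.subscribe("workers", "w3", "shared")
standby = topic.subscribe("standby", "primary", "failover")
topic.subscribe("standby", "backup", "failover")

for i in range(6):
    topic.publish(f"m{i}")
standby.detach("primary")  # simulate the active consumer disconnecting
topic.publish("m6")
```

In this run the single, unpartitioned topic feeds three active consumers in the "workers" subscription, which is the queue-like behavior the shared mode provides, while "audit" and "standby" each still receive every message.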

When it launched, Pulsar was unusual in being designed for deployment as a hosted service for public and private cloud, something no available open source system offered at the time. Part of the design rationale was to make a single shared deployment of Pulsar more cost effective than having different teams operate their own messaging solutions. Another advantage of using one system is that teams do not need in-depth knowledge of several different solutions in order to configure, monitor, and troubleshoot them effectively. With a single system, cluster servers are better utilized, multiple teams can take a dev-ops approach, and capacity planning can be based on expected peak usage and projected growth.

Some of Pulsar’s features include:

Deployment

Low Latency

Geo-replication

Multi-tenant

Zero Data Loss

I/O Isolation

Multi-language API

Zero Rebalancing

Scalability

Security

Inbuilt Load Balancers and Service Discovery

Achieving Top Level Project Status

A Top Level Project (TLP) is a project that has received the highest status and can be considered part of the Apache Software Foundation (ASF). To reach that status, a project has to go through an incubation period. For any project or codebase that wishes to become part of the Foundation’s efforts, this incubation period ensures that all donations are in accordance with ASF legal standards and that the new communities being developed adhere to the Apache Foundation’s guiding principles. After completing the incubation period, Pulsar achieved its status as a Top Level Project.

Matteo Merli, Vice President of Apache Pulsar, said, “We are very proud of reaching this important milestone. This is the testament to all work done over the years by all the contributors, before and after starting our journey within The Apache Software Foundation. During the incubation process, it has been amazing to see the community grow and the project mature at such a high pace. The last year has seen the evolution of Pulsar from its original messaging core into an integrated platform for data in motion. We are thrilled to continue to drive the innovation in this exciting and fast moving space.”

Pulsar vs. Kafka

Comparisons are being made between Pulsar and another ASF project, Kafka, which was developed at LinkedIn. Messaging and data pipelines are the two top uses for Kafka. Merli had this to say about Pulsar and Kafka: “There is a big overlap in the use cases for the two systems, but the original designs were very different.”

Pulsar’s messaging model unifies queuing and streaming into a single API, so there is no need to set up one system for queuing and another for streaming. While Pulsar was designed for shared data consumption, Kafka was not. Pulsar has a two-layer design: a stateless layer of brokers that receive and deliver messages, and a stateful persistence layer of bookies that provide low-latency, durable storage. This provides strong data guarantees and lets users configure a retention period for messages. Messages are kept for this retention period even after all subscriptions have consumed them.
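The retention behavior can be illustrated with a toy model. This is a minimal sketch, not Pulsar’s storage implementation: the `RetainedTopic` class and its methods are hypothetical, time is passed in explicitly as a number of seconds, and the retention period is assumed to be measured from publish time. A message is removed only once every subscription has acknowledged it and its retention period has elapsed.

```python
class RetainedTopic:
    """Toy message log with per-subscription cursors and a retention period."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.log = []      # (message_id, publish_time, payload)
        self.cursors = {}  # subscription name -> next unacknowledged message id
        self._next_id = 0

    def subscribe(self, name):
        self.cursors[name] = self._next_id

    def publish(self, payload, now):
        self.log.append((self._next_id, now, payload))
        self._next_id += 1

    def ack_through(self, name, message_id):
        # Cumulative acknowledgment: everything up to message_id is consumed.
        self.cursors[name] = max(self.cursors[name], message_id + 1)

    def trim(self, now):
        """Drop messages that are fully acknowledged AND past retention."""
        fully_acked_before = min(self.cursors.values(), default=self._next_id)
        self.log = [
            (mid, ts, payload)
            for mid, ts, payload in self.log
            if mid >= fully_acked_before or now - ts < self.retention_seconds
        ]
        return [mid for mid, _, _ in self.log]


topic = RetainedTopic(retention_seconds=60)
topic.subscribe("a")
topic.subscribe("b")
topic.publish("x", now=0)   # message id 0
topic.publish("y", now=10)  # message id 1
topic.ack_through("a", 1)   # subscription a has consumed both messages
topic.ack_through("b", 0)   # subscription b has consumed only message 0

after_ack = topic.trim(now=30)     # message 0 is fully acked but still within retention
after_expiry = topic.trim(now=61)  # message 0 is now past retention and removed
after_later = topic.trim(now=100)  # message 1 stays: b has not acknowledged it
```

The point of the sketch is the first `trim` call: message 0 has been consumed by every subscription, yet it remains available until the retention period runs out.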

Sijie Guo, Co-Founder of Streamlio, summed up the difference between the two, saying, “Apache Pulsar combines high-performance streaming (which Apache Kafka pursues) and flexible traditional queuing (which RabbitMQ pursues) into a unified messaging model and API. Pulsar gives you one system for both streaming and queuing, with the same high performance, using a unified API.”

Here is a breakdown that lists similarities and differences:

Concepts

Consumption

Acking

Retention

TTL

Published results of a stream processing benchmark found that Pulsar delivered up to a 150% performance improvement over Kafka, with up to 60% lower latency. Pulsar also has other advantages over Kafka:

Copyright secured by Digiprove © 2018