Open Source Stream Processing: Flink vs Spark vs Storm vs Kafka

In the early days of data processing, batch-oriented infrastructure was a perfectly good way to process and output data. But as networks have shifted toward mobile and real-time analytics have become necessary to keep up with demand, stream processing has become vital.

Breaking Down the ELK Stack

The ELK Stack offers a way to take data from any source on your network, analyze it, and visualize it, all in real time. Recently, it has been gaining traction as a new leader in the open source market for log analytics and visualization. Usage has grown steadily over the years, with monthly downloads exceeding 500,000 and companies like Google, Netflix, and LinkedIn relying on it for their analytics.

The stack itself consists of three parts: Elasticsearch, Logstash, and Kibana. Together these pieces form a full analytics system, which we’ll break down below.

Elasticsearch

Elasticsearch is the storage and search engine at the core of the stack: it indexes information into searchable data storage, letting you adjust mappings and gain performance benefits along the way. Some features include:

  • Interactive Search Analytics: search your data with real time insights and advanced analytics to help optimize your network performance
  • Scalability: As in the name, Elasticsearch offers incredible elasticity with a massively distributed structure that allows you to start small and scale horizontally by adding more nodes as your network expands
  • Reliability: Elasticsearch clusters help to detect failed nodes in order to organize and distribute data automatically so your data is always accessible and secure. Any changes are recorded in transaction logs on multiple nodes to minimize data losses
  • Multitenancy: clusters can contain more than one index that can be grouped together or searched individually. Index aliases allow administrators to present filtered views of an index that can be updated transparently to your application
  • Extensive Search Parameters: Elasticsearch offers full-text searching capabilities with a query API that supports multilingual search, geolocation, contextual suggestions, autocomplete and result snippets
  • Automation: simply index a JSON document and it will automatically structure the data, making it searchable, with customization options
  • RESTful API: everything is API driven through a simple RESTful API using JSON over HTTP (see the sketch after this list)
  • Free: released under the Apache 2.0 license and built on Apache Lucene, it is fully open source, allowing you to download, configure, and modify it based on your needs
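
As a concrete illustration of the last two points, here is a minimal sketch (in Python, using the requests library) of indexing a JSON document and running a full-text search over Elasticsearch's RESTful JSON-over-HTTP API. The node address, index name, and document fields are assumptions for the example, not anything prescribed by Elasticsearch:

```python
import requests  # assumes the requests library and a local Elasticsearch node

ES = "http://localhost:9200"          # hypothetical single-node cluster
INDEX = "logs"                        # hypothetical index name

# Index a JSON document; Elasticsearch infers a mapping automatically.
doc = {"service": "web-01", "status": 502, "message": "upstream request timeout"}
requests.put(f"{ES}/{INDEX}/_doc/1", json=doc, params={"refresh": "true"}).raise_for_status()

# Full-text search the index through the same RESTful, JSON-over-HTTP API.
query = {"query": {"match": {"message": "timeout"}}}
resp = requests.get(f"{ES}/{INDEX}/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_source"])
```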

Logstash

Logstash is an open source data collection tool that organizes data across multiple sources and ships log data to Elasticsearch. Features include:

  • Logs and Metrics: handles various types of logging data such as syslog, Windows event logs, networking and firewall logs, Apache and log4j for Java
  • Metrics collected from Ganglia, collectd, NetFlow, JMX, and many other infrastructure and application platforms over TCP and UDP
  • Builds events from HTTP requests and can poll HTTP endpoints on demand (see the sketch after this list)
  • Unifies all your data streams
  • Wide uses for logging data from any source, such as IoT use cases, collecting data from any sort of connected sensor on any device, from cars, to homes to phones
  • Many tuning options to modify your Logstash pipeline, including pattern matching, geo mapping (it can derive geo coordinates from IP addresses), and dynamic lookup capabilities
  • Integration with Grok filters to help structure otherwise unstructured data
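
As a rough sketch of the HTTP path mentioned above, the snippet below posts a JSON event to a Logstash pipeline configured with an http input. The port, field names, and device are hypothetical; the point is simply that any source able to issue an HTTP request can feed the pipeline, which then filters the event and ships it to Elasticsearch:

```python
import json
import time
import requests  # assumes a Logstash pipeline with an http input listening on port 8080

LOGSTASH = "http://localhost:8080"    # hypothetical http input endpoint

event = {
    "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "host": "sensor-42",              # hypothetical IoT device
    "temperature_c": 21.7,
    "message": "periodic sensor reading",
}

# Logstash accepts the JSON body, runs it through its filter chain
# (grok, geoip, lookups, ...), and forwards it to the configured outputs.
resp = requests.post(LOGSTASH, data=json.dumps(event),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()
```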

Kibana

Kibana takes the information from the data store and presents it in graphical form for log analysis. This includes:

  • Integration with Elasticsearch and Logstash: helps to visualize both structured and unstructured data that is indexed into Elasticsearch
  • Easily create charts, plots, histograms, maps, etc.
  • Visualize all of the analytics provided by Elasticsearch to put your data to practical use and make it digestible for your team
  • User friendly: easy to share dashboards and to set up on its own web server

How It Compares

Since its inception in 2010, the ELK Stack has been disrupting the log analytics industry. The leading tool beforehand was Splunk, founded in 2003, which quickly grew into a global product; with the rise of the ELK Stack, Splunk has finally found some competition. The biggest advantage Splunk held over open source logging tools was organization and reliability: open source projects often fail because they cannot offer the same enterprise-grade support as privately funded companies. But the ELK Stack seems to have finally broken that mold.

And while the ELK Stack may be gaining users, it’s not the only open source logging tool making waves. Recently, Fastly published a post about how they are using Graylog, an open source competitor with some similarities to ELK.

How Fastly Scales The Network

When most people hear the term “networking,” they think of cables, routers, and switches. Every CDN does this kind of networking, building out infrastructure to grow their network. But Fastly does a different kind of networking, and it all revolves around software.

The Fastly Architecture

At the start, the Fastly founders put their money into commodity hardware because they couldn’t afford dedicated networking boxes. That left them needing another way to perform all the basic network functions, which is how their software came about. So they accidentally ended up in a nice place.

For traditional CDNs, networking is pure infrastructure: something you buy as a sunk cost, with a fixed feature set that depreciates over time. But for Fastly, the network is the application: a set of functions that you can expose to services and eventually serve to the customer.

Cutting Through Marketing Fluff

To market to customers, a lot of CDNs will try to sell you “features” that aren’t really features at all. First off, anytime CDNs discuss networking, it is always in the form of a map. To them, networking means having a global presence, often because they have fixed networks that cannot be changed or adapted the way Fastly’s software can. So, to you the customer, they prefer to sell their network based on reach.

Another feature that CDNs often try to use for marketing purposes is the number of PoPs. Yes, as more PoPs are added the RTT gets lower, which is a feature, but after a certain number it will not affect you or your platform. There is a limit to how close to end users you can get, where adding more data centers does nothing for your specific needs. Once you get to the optimal level, it’s better to focus on other features that will directly help or hinder your performance.

Bandwidth is another buzzword that is often flaunted as a feature, but again it is just a number. Given all the other factors such as traffic flow, capacity, economies of scale, and so on, you get to a point where it doesn’t matter how much more bandwidth you add to your network; it will perform at roughly the same level.

The same goes for latency; after a certain point, a change in the perceived latency isn’t going to have any notable effects on the system, even though CDNs market it to you as a significant indicator of network performance. A lot of other optimization factors and features carry more importance when it comes to performance, and below we’ll address three key aspects of a CDN, and how the Fastly network manages them.

Load Balancing

One main issue all CDNs face is that the internet is constantly throwing packets at a PoP, which is really just routers with servers underneath. To manage this traffic, you need a load balancer, whose job is to spread traffic efficiently and evenly. You also have the option of DNS load balancing, where you simply publish the IPs and requests are round-robined naturally. But with traditional CDNs, doing PoP-level and host-level balancing in the same infrastructure gets really complicated.

In order to perform, load balancers require state, and more state means more cost. But without a load balancer to protect your network, traffic peaks can cause the system to fail. So there’s a balance to strike between protecting your network and saving on cost.

To combat some of these load balancing issues, some CDNs apply Equal-Cost Multi-Path (ECMP) routing. With ECMP, the routing table has multiple entries for a destination network, each sending traffic to a different PoP, which means it’s stateless and therefore cheaper. If you take out a server, though, the routing table changes and existing connections are reset.
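
A minimal sketch of why ECMP is stateless, and why pulling a next hop resets connections: path choice is just a hash of the flow tuple taken modulo the number of equal-cost entries, so nothing is remembered per connection, but shrinking the path list remaps live flows. The PoP names and addresses below are made up for illustration.

```python
import hashlib

def pick_path(src_ip, src_port, dst_ip, dst_port, paths):
    """Stateless ECMP-style selection: hash the flow tuple onto one of the
    equal-cost next hops; no per-connection state is kept anywhere."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    bucket = int.from_bytes(hashlib.sha1(key).digest()[:4], "big")
    return paths[bucket % len(paths)]

paths = ["pop-a", "pop-b", "pop-c", "pop-d"]
flow = ("198.51.100.7", 52311, "203.0.113.10", 443)

print(pick_path(*flow, paths))        # stable while the path set is unchanged
print(pick_path(*flow, paths[:-1]))   # remove one entry: the modulo changes,
                                      # so this flow may land on a different PoP
                                      # and its TCP connection gets reset
```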

Netflix and Mantis

The mantis shrimp is something of a contradiction. It’s a small yet immensely strong and powerful creature, capable of far more than its size suggests, and above all else it has incredible vision: compared to humans, who have only 3 types of photoreceptors in their eyes, the mantis shrimp has 16. It’s this shrimp that helped inspire Netflix’s stream-processing platform, Mantis.

Over the past 8 years, Netflix has exploded to over 75 million members in over 190 countries, watching 125 million hours of content every day. To support this customer base and maintain their standing in the evolving world of video streaming, identifying issues in their system is critical. For the Netflix team, spotting issues at the application and service level has always been relatively easy; where they have struggled is with addressing issues on individual devices.

Mantis gives Netflix teams access to real-time events and lets them combat issues by building low-latency, high-throughput stream-processing apps on top of those events. This helps them detect and mitigate specific issues across regions and devices.

The problem Netflix faced was that they produce billions of events and metrics daily, yet all that data wasn’t being processed in a way that made it useful; they constantly lacked the right data to address the problem at hand. Mantis helps them build highly granular, real-time insight applications that give the Netflix team more visibility into interactions with their devices and their AWS services. There’s a long trail through the system where any number of things can go wrong, and Mantis helps Netflix see it more clearly.
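
Mantis jobs themselves are written in Java against Netflix’s reactive APIs, but the flavor of what such a job computes can be sketched in a few lines of Python: a tumbling window over a stream of device events, counting playback errors per device type. The event schema here is made up for illustration.

```python
from collections import Counter

def error_counts(events, window_seconds=60):
    """Tumbling-window aggregation over an event stream: yields, for each closed
    window, a per-device count of playback errors. 'events' is any iterable of
    dicts like {"ts": 1461110400, "device": "ps4", "type": "playback_error"}."""
    window_start, counts = None, Counter()
    for ev in events:
        if window_start is None:
            window_start = ev["ts"]
        if ev["ts"] - window_start >= window_seconds:
            yield window_start, dict(counts)              # emit the closed window
            window_start, counts = ev["ts"], Counter()
        if ev["type"] == "playback_error":
            counts[ev["device"]] += 1

# Usage: feed a live stream and alert when one device type spikes in a window.
sample = [
    {"ts": 0,  "device": "ps4", "type": "playback_error"},
    {"ts": 5,  "device": "ios", "type": "start_play"},
    {"ts": 61, "device": "ps4", "type": "playback_error"},
]
for start, per_device in error_counts(sample):
    print(start, per_device)                              # -> 0 {'ps4': 1}
```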

Mantis Architecture

Mantis was built 100% with the cloud in mind, which helps reduce operational overhead and developer hours. It manages a cluster of servers that run stream-processing jobs, using Apache Mesos to create a shared resource pool and an open-source library called Fenzo to allocate resources among the different jobs.

Inside the Mantis architecture there are two main clusters that manage the job processes.

  • Master Cluster: consists of the managing components that coordinate and distribute all of the work
    • Resource Manager: assigns resources to a job worker using Fenzo
    • Job Manager: manages the operations of a job, dealing with metadata, SLAs, artifact locations, job topology and life cycle
  • Agent Cluster: when a user submits a job, it runs as one or more workers on the agent cluster
    • Instances: the agent cluster consists of multiple instances in pools running the jobs

Who Makes Pan Dulce, Integrates Open Source Ingredients and Delivers On Time?

There’s no doubt that Netflix is at the top of their game, serving over 75 million global members. But the remarkable thing for the streaming industry is that they built the foundation of their site using open source code. This post will summarize the tools and techniques Netflix used to turn source code into a movie/TV streaming juggernaut and may shed light on the future of network infrastructure.

First, it’s important to look at Netflix’s process for turning source code into a running site.

  1.  They built the code using Nebula: a collection of Gradle plugins that they created
  2.  Changes are applied to a central git repository
  3.  A Jenkins job executes Nebula, building, testing and packaging the application for deployment
  4.  Builds are “baked” into Amazon Machine Images
  5.  Spinnaker pipelines are used to deploy and promote the code changes

The Build

To begin the build, they created Nebula, a set of plugins for the Gradle build system that supports building, testing, and packaging Java applications. They chose Gradle because it let them easily write their own testable plugins while keeping build files small. In a typical build.gradle file, they declare a few Java dependencies and apply three Gradle plugins that are part of Nebula.

  1. “nebula:” internal Gradle plugin that provides configurations for integration
  2. “nebula.dependency-lock:” allows the project to generate a .lock file that pins resolved dependency versions
  3. “netflix.ospackage-tomcat:” an OS packaging plugin used when the repository being built is an application

Integration

Once the code has been built and tested with Nebula, it is pushed to a central git repository for integration. At this step, dedicated teams choose the git workflow that works best for each codebase.

After changes are pushed, a Jenkins job is triggered. Jenkins is an open source automation server, built in Java, that provides plugins for building, testing, deploying, and automating projects. Currently, Netflix runs 25 Jenkins servers in AWS.

Each Jenkins job uses Nebula to build, test, and package the application code. If the repository being built is a library, Nebula publishes the .jar to their artifact repository; if it is an application, the netflix.ospackage-tomcat plugin kicks in and the application is bundled into a Debian or RPM package and published to the package repository for the next step.

Pan Dulce

Netflix created something they call “The Bakery,” which exposes an API that creates Amazon Machine Images (AMIs) from the code that was built. An AMI provides the information required to launch individual virtual servers in the cloud, called “instances.” When you launch, you specify an AMI and can start as many instances from it as you need.

Fastly Taking Security To A Whole New Level

On February 25, Fastly shared the first part of their speaker series on security with Alex Pinto, Chief Data Scientist of Niddel and leader of the MLSec Project. The topic of his presentation was data science, machine learning, and security. The biggest issue Pinto wanted to address is the marketing fluff that surrounds AI/machine learning in security.

A lot of companies use “machine learning” as a buzzword to seem on-trend, but Pinto stresses that this just hurts buyers and investors, who don’t know whether they are buying the right products or funding the right developments. He urged wariness of certain heavy-handed marketing terms such as:

  • Big data security analytics
  • Hyper-dimensional security analytics
  • 3rd generation artificial intelligence
  • Secure because of math
  • Math>Malware

All of these tend to be used as marketing terms without much detail to back them up. Big data security analytics is usually sold as a mythical creature that promises nothing will go wrong and that it can solve all the problems others can’t. The reason this needs to be addressed is that without the right security features, your network could suffer lasting damage and extensive costs.

Difficulty of Evaluation

One of the main problems that comes with security is questioning whether your system is fully protected or not, and a lot of this has to do with the fact that many security people don’t share information. Public data sets are incredibly scarce in this industry, which makes having a benchmark of your protection a lot more difficult.

Anomaly Detection

Pinto then discussed the issues that arise when using anomaly detection with machine learning. Anomaly detection works best for well-defined processes on simpler systems: it compares observations against a known baseline, which makes it good at watching single, consistently measurable variables. It is the basis for much of fraud prevention, and DevOps teams use it to monitor the stats on their dashboards.

The problem with anomaly detection in security, though, is that the data tends to be too diverse. Anomaly detection is essentially measuring a distance: if a data point is far outside the normal range, the distance tells you it is problematic. But the more dimensions you add, the more ways there are to look at the data, and those distances begin to look the same.
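
The effect is easy to demonstrate numerically. The sketch below (using NumPy, with randomly generated data rather than real security telemetry) measures how much farther the farthest point is than the nearest point from a query; as the number of dimensions grows, that relative spread collapses, which is exactly why purely distance-based anomaly detection struggles on diverse, high-dimensional security data.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_spread(dims, n=2000):
    """(farthest - nearest) / nearest distance from a random query point to a
    random data set. As this ratio approaches 0, 'anomalous' and 'normal'
    points become indistinguishable by distance alone."""
    data = rng.standard_normal((n, dims))
    query = rng.standard_normal(dims)
    dists = np.linalg.norm(data - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dims in (1, 2, 10, 100, 1000):
    print(f"{dims:>4} dims: relative spread = {relative_spread(dims):.2f}")
```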

API Management: Feature Set Comparison

Previously, we talked with the founder of Varnish Software, Per Buer, about the advancements Varnish has made over the past couple of years and how they continue to grow in the open source caching market. One topic that came up was how Varnish’s API Engine compares to established API management products such as TIBCO Mashery, Apigee, Kong, and Tyk. Buer explained that Varnish stacks up well on performance, but that the others may offer richer feature sets. Given that assertion, we decided to follow up with a side-by-side comparison of the four API engines to shed light on the competition in the API market.

The World of APIs

TIBCO Mashery

Last August, TIBCO Software, which specializes in analytics, integration, and event processing, announced that it had acquired Mashery, an API management company. The goal of the acquisition, according to the company’s press release, was to expand TIBCO’s API management solutions “to help enterprises transform into digital businesses, integrate more channels for new services and experiences, and empower developers to accelerate innovation.”

Mashery already had a strong track record in the world of API management, having been named a leader in Gartner’s Application Services Governance Magic Quadrant. Now, under TIBCO, they continue to expand their API engine with an extensive feature set.

  • API Portal: Allows you to create a customizable portal for enrolling and supporting partners
    • Provisioning tactics that let you issue live keys or require activation by a moderator for security
    • Mashery I/O Docs lets developers execute calls directly from API documents. It also shows them how many calls they’re making, the methods they’re using, and so on
  • API Security: PCI-Compliant for data security, HITRUST CSF Certified to handle health information, SSAE 16 SOC 2 Type 1 compliant
    • Also have OAuth 2.0 accelerator for secure access to sensitive user data and SSL support
  • API Analytics: Helps to identify key trends and behaviors that impact business, while gaining a deeper understanding of API traffic, performance and growth.
    • Helps make reports more user friendly, transforming raw data into more visible insights
    • Provides you with the information to scale your API Management infrastructure, uncover new business outlets and so on.
  • API Traffic: Offers three types of infrastructure
    • Cloud: global PoPs, dynamic scaling and monitoring features, fast time to market, focus on core competency
    • Local: Fully integrated central dashboard for API policies, management and reporting. Removes network latency, handles internal traffic, and provides on-premises security and control.
    • Hybrid: Contains a single integrated Web dashboard for access policy configuration, partner administration and reporting. Ultimate flexibility, “single pane of glass” control, right tools for various use cases
  • API Packager: gives management teams the power to negotiate custom API access, reduces work for IT, and provides fine-tuned resource control.
    • With the packager no coding is required, allowing for business-side product creation with lifecycle management and response filters.

Do It Yourself CDN: Netflix, Comcast and Facebook

Over the past couple years, video streaming, gaming, photo sharing and other forms of high-data usage media consumption have exploded across network platforms. More and more people are online, requesting more and more content, and at the crux of this all is the CDN. CDNs are there to keep content moving as the stable backbone of the network, but an emerging trend in the market has implications that some believe might render traditional, commercial CDNs an expendable middleman in their own industry. Is there any truth to this?

The Trend: DIY CDNs

The trend has been that more and more companies are building their own CDNs. Large companies like Netflix, Comcast and Facebook, which all operate in very different forms, have all begun to shift towards this unifying trend of building their own CDNs. All three are using their own CDNs in varying ways, which has a definite impact on how we look at the future CDN industry. But do Netflix, Comcast, and Facebook represent the general marketplace?

Netflix Open Connect

Back in June of 2012, Netflix announced the rollout of their CDN called Open Connect. With Open Connect, Netflix relies far less on third-party CDNs like Akamai, Limelight, and Level 3, which at the time were the only companies with the capacity to support the bandwidth Netflix required. Several factors came into play when Netflix decided to build their own CDN.

First of all, from a logistics standpoint it made more sense. As the number of their subscribers grew, Netflix was receiving more and more complaints about diminishing streaming quality because their vendors could not support the traffic. Control of their user experience and content was in the hands of CDNs that could not match the growth Netflix required, so the logical next step was to build a CDN tailored to their own specific needs.

Is ESI Making a Comeback?

Akamai, the CDN industry juggernaut, has been the dominant CDN for years, with almost two decades of experience under its belt. In that time it developed a markup language, Edge Side Includes (ESI), to address the problem of dynamic web page assembly. Since dynamic content can’t be cached, a page containing dynamic elements would normally have to be retrieved from the origin and regenerated for the user in full. To address this problem, Akamai developed ESI to assemble dynamic content at the edge and speed up load times for sites that require dynamic content retrieval.

Back in 2001, Akamai released its ESI Language Specification 1.0, a markup language of XML-based ESI tags that stand in for the dynamic elements. The tags tell the edge-side processing agents what dynamic content needs to be added to the page so that assembly can be completed at the edge. Because the dynamic elements are added after the fact, the page itself can now be cached, which speeds up overall assembly time.
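
Conceptually, the edge does something like the sketch below: it caches the static template, then splices freshly fetched fragments into the esi:include placeholders at request time. This is a Python illustration of the idea, not Akamai’s implementation, and it handles only the include tag, ignoring ESI’s conditionals, variables, and error handling. The URLs involved are hypothetical.

```python
import re
import requests  # fragment and template URLs here are hypothetical

TEMPLATE_CACHE = {}                                   # the static shell is cacheable
ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]+)"\s*/>')

def assemble(template_url):
    """Serve a cached template and splice in dynamic fragments at request time,
    in the spirit of ESI's <esi:include> tag."""
    template = TEMPLATE_CACHE.get(template_url)
    if template is None:
        template = requests.get(template_url).text    # cache miss: fetch once
        TEMPLATE_CACHE[template_url] = template
    # Replace each include tag with freshly fetched, uncacheable content.
    return ESI_INCLUDE.sub(lambda m: requests.get(m.group(1)).text, template)
```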

While this seems like a great fix for a problem that has plagued caching servers for years, ESI was never widely used outside of Akamai’s servers. Due to its specificity and complexity, ESI was never implemented by other CDNs and over the years lost relevance. But now, almost a decade and a half after Akamai’s initial release, Fastly has announced that they have begun supporting ESI alongside Varnish and their VCL language. So what does this mean for ESI and the world of dynamic content assembly and scaling infrastructure? Let’s break it down and find out.

Akamai EdgeSuite 5.0

ESI was developed as an open specification, co-authored by Akamai and 14 other industry leaders, to provide a uniform programming model for building dynamic pages at the edge of the Internet. The most current version, EdgeSuite 5.0, builds on that premise and offers new extensions, including:

  • esi:break
  • user-defined functions
  • a new range operator
  • bitwise behavior options
  • new ESI functions

Features

ESI’s main feature is the ability to improve site performance by caching objects that contain dynamically generated elements at the edge of the Internet. It also offers reliability, fault tolerance, and scalability. From the user manual, here’s a list of the features EdgeSuite provides:

  • Inclusion: central to the ESI framework is the ability to fetch and include fragments to build pages, with each fragment subject to its own configuration and control: its own time-to-live in cache, revalidation instructions, and so on
  • Integration with Edge Transformation: include fragments processed by edge transformation services
  • Environmental Variables: supports the use of CGI environment variables such as cookie information and POST responses
  • User-defined Variables: supports a range of user-defined variables
  • Functions: ESI offers functions to perform evaluations, set HTTP headers, set redirects, create time stamps, and so on
  • Conditional Logic: supports conditional logic based on Boolean expressions and environmental variable comparisons
  • Iteration: ESI provides logic to iterate through lists and dictionaries
  • Secure Processing: supports SSL processing, automatically fetching a fragment securely if the template is secure
  • Exception and Error Handling: allows you to specify alternative objects and default behaviors such as serving default HTML in the event that the origin site is down

Open Source UDP File Transfer Tool Comparison

When it comes to Internet protocols, TCP has been the dominant protocol used across the web to form connections. TCP helps computers communicate by breaking large data sets into individual packets, transmitting them, and then reassembling them in the original order once they have been received. But as file sizes grew and latency became an issue, the User Datagram Protocol (UDP) gained popularity. UDP-based transfer tools pick up the slack by moving very large files at speeds that TCP struggles to match over high-latency links.

When comparing the two protocols, the main difference is that UDP sends packets without waiting for acknowledgments, which means lower overhead and latency. TCP, on the other hand, sends packets in order and waits for confirmation that they arrived before sending more. To better understand the pros and cons of each protocol, below is a basic comparison of the two:

| Feature | TCP | UDP |
| --- | --- | --- |
| Connection | Connection-oriented: messages navigate the internet one connection at a time | Connectionless: a single program sends out a load of packets all at once |
| Usage | Used by applications needing high reliability, where time is less critical | Used when applications need fast transmission |
| Used by other protocols | HTTP, HTTPS, FTP, SMTP, Telnet | DNS, DHCP, TFTP, SNMP, RIP, VoIP |
| Reliability | All transferred data is guaranteed to arrive in the order specified | No guarantee of arrival; ordering has to be managed by the application layer |
| Header size | 20 bytes | 8 bytes |
| Data flow control | Yes; requires three packets to set up a socket connection before user data can be sent | No flow control option |
| Error checking | Yes | Yes, but with no recovery options |
| Acknowledgments | Yes, required before the next transfer takes place | No, which allows for quicker transfers |
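
To make the connection-handling difference concrete, here is a minimal sketch using Python’s standard socket module (the loopback address and port are arbitrary): the UDP datagram leaves immediately with no handshake and no delivery guarantee, while the TCP side must complete a handshake with a listener before any data moves.

```python
import socket

MSG, ADDR = b"hello", ("127.0.0.1", 9999)   # arbitrary loopback endpoint

# UDP: connectionless. One datagram is sent immediately; there is no handshake
# and no guarantee it arrives, or arrives in order.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(MSG, ADDR)
udp.close()

# TCP: connection-oriented. connect() runs the three-way handshake before any
# data moves, and the stack retransmits until the data is acknowledged.
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    tcp.connect(ADDR)      # raises if nothing is listening; the UDP send above would not
    tcp.sendall(MSG)
finally:
    tcp.close()
```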

There’s a clear tradeoff between the two when it comes to speed versus reliability, but as UDP-based tools have matured, they have become more and more trustworthy. Recently, Google announced that QUIC, its open source UDP-based protocol, carries roughly 50% of Chrome’s traffic, a percentage expected to increase in the coming years. So as the market shifts and the tools continue to advance, UDP is quickly becoming the basis for the file transfer tools of the future. So which UDP tool should you use?

Choosing a UDP

When it comes to choosing a UDP transfer tool, there are two main options: buy one from a commercial vendor or install open source software. Open source tools can be less polished and less user friendly, but there are plenty of options out there, so paying a vendor to manage and configure your transfer tool is not strictly necessary.

Currently, there are six main UDP file transfer tools available as open source.

  • Tsunami UDP Protocol: Uses TCP control and UDP data for transfer over high speed long distance networks. It was designed specifically to offer more throughput than possible with TCP over the same networks.
  • UDT UDP-based Data Transfer: Designed to support global data transfer of terabyte sized data sets, using UDP to transfer the bulk of its data with reliability control mechanisms.
  • Enet: Main goal is to provide a thin, simple and robust network communication layer on top of UDP with reliable and ordered delivery of packets, which they accomplish by stripping it of higher level networking features.
  • UFTP: Specializes in distributing large files to a large number of receivers, especially when data distribution takes place over a satellite link, where delay makes TCP communication inefficient. It has been used widely by The Wall Street Journal to send WSJ pages over satellite to remote printing plants.
  • GridFTP: Unlike the others, GridFTP is based on FTP rather than UDP, but it addresses the same problems that arise when using plain TCP.
  • Google QUIC: An experimental UDP-based network protocol designed at Google to support multiplexed connections between two endpoints, provide security protection equivalent to TLS/SSL, and reduce latency and bandwidth usage.

Below is a feature comparison chart to help you understand the side-by-side differences between these tools.

| Feature | Tsunami | UDT | ENet | UFTP | GridFTP | Google QUIC |
| --- | --- | --- | --- | --- | --- | --- |
| Multi-threaded | No | No | Yes | No | Yes | Yes |
| Protocol overhead | 20% | 10% | NA | ~10% | 6-8%, same as TCP | NA |
| Encryption | No | No | No | Yes | Yes | Yes |
| C++ source code | Yes | Yes | Yes | Yes | Yes | Yes |
| Java source code | No | Partial | No | No | No | No |
| Command line | No | No | No | Yes | Yes | Yes |
| Distribution packages | Source code only | Source code only | Source only | Yes | Yes | Yes |
| UDP-based point-to-point | Yes | Yes | Yes | Yes | No | Yes |
| Firewall friendly | No | Partial, no auto-detection | No | Partial, no auto-detection | No | No, has had issues with stateful firewalls |
| Congestion control | Yes | Yes | NA | Yes | Yes, using TCP | Yes |
| Automatic retry and resume | No | No | Yes | No, manual resume: yes | Yes | Yes |
| Jumbo packets | No | Yes | No | Yes, up to 8800 bytes | Yes | NA |
| Support for packet loss | No | No | Yes | No | Yes | Yes |