Bizety: Research & Consulting

Data Serialization – Protocol Buffers vs Thrift vs Avro

What is Data Serialization?

Data serialization refers to the process of translating data structures or object state into a format that can be stored (for example, in a memory buffer or file) or transmitted, and reconstructed at a different point. When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. This process is not straightforward for most complex objects, and serialization of object-oriented objects does not include the methods with which they were previously linked. The phrase marshalling an object is also used in reference to serializing an object.

Deserialization refers to the reverse operation: extracting a data structure from a series of bytes. This happens once the serialized data has been transmitted from the source machine to the destination machine.
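As a minimal illustration of this round trip, the Python sketch below serializes a small data structure to JSON bytes and deserializes it back into an equivalent object. The `task` dictionary is a hypothetical example, not taken from any of the frameworks discussed here.

```python
import json

# A simple in-memory data structure (hypothetical example).
task = {"taskId": "TK-2190809", "description": "Close Jira ticket OPS-12345"}

# Serialization: translate the structure into bytes suitable for
# storage or transmission.
payload = json.dumps(task).encode("utf-8")

# Deserialization: reconstruct a semantically identical clone
# from the series of bytes.
clone = json.loads(payload.decode("utf-8"))
```

Note that the clone is equal to the original but is a distinct object; only the data survives the round trip, not object identity (or, for object-oriented objects, their methods).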

What are its Uses?

Serialization enables the saving of an object state and its later recreation. Typical uses include persisting objects to files or databases, exchanging data between processes or over a network (for example, in remote procedure calls), and caching or replicating object state across systems.

Data Serialization Formats

There is a wide variety of data serialization formats, including XML, JSON, BSON, YAML, MessagePack, Protocol Buffers, Thrift and Avro. The choice of format for an application depends on a variety of factors, including data complexity, the need for human readability, and latency and storage space concerns.

XML serves as the reference benchmark for the other formats, as it was the original widely adopted implementation. JSON is often described as faster and more lightweight.

Here we will look at three newer frameworks: Thrift, Protocol Buffers and Avro, all of which offer efficient, cross-language serialization of data using a schema, and code generation for Java. Each has a different set of strengths. All three also support schema evolution: the schema can be changed, producers and consumers can hold different versions of the schema simultaneously, and everything continues to work. In a large production system, this is a highly valuable feature, as it allows you to update different elements of the system independently, at different times, without compatibility concerns.

They share certain features, including an interface definition language (IDL) for describing schemas, compact binary encodings, and code generation for multiple target languages.

Protocol Buffers

There has been a lot of hype around Google’s protocol buffers (also known as protobufs and PB). They were designed in 2001 as an optimized server request/response protocol and an alternative to XML. They remained a proprietary solution at Google until 2008, when they were open-sourced. Igor Anischenko, Java Developer at Lohika, describes them as “the glue to all Google services”, and “battle-tested, very stable and well trusted”. Indeed, Google uses Protocol Buffers as the foundation for a custom remote procedure call (RPC) system that underpins virtually all its intermachine communication.

There is excitement around the potential for Protobufs to speed things up, both because binary formats are usually quicker than text formats, and because the data model for messaging can be generated in a wide range of languages from the protobuf IDL file. Protocol buffers are language-neutral and platform-neutral, and Google describes them as “smaller, faster and simpler” than XML.

Protocol buffers currently support generated code in C++, Java, Objective-C and Python; with the newer proto3 language version, you can also work with C#, Dart, Go and Ruby. When they were first developed, Google’s preferred language was C++, which helps explain why Protobuf is strongly typed and has an independent schema file. PB was designed to be layered over an existing RPC mechanism.
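As a sketch of what such an independent schema file looks like, the hypothetical `.proto` definition below describes a simple task message (the file name, package and field names are assumptions for illustration). The field numbers (1, 2) identify fields in the binary encoding, which is what makes schema evolution possible: new fields with new numbers can be added without breaking readers compiled against the older schema.

```proto
// task.proto — hypothetical example schema (proto3 syntax).
syntax = "proto3";

package taskmanager;

// A task message; each field has a unique number used on the wire.
message Task {
  string task_id = 1;
  string description = 2;
  // A later schema version could add, e.g., `int64 due_date = 3;`
  // without breaking consumers built against this version.
}
```

Running `protoc` with an output flag such as `--java_out` or `--python_out` on a file like this generates the data model classes in the chosen language.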

Examples of projects using Protobufs include Google itself, ActiveMQ (which uses protobufs for its message store) and Netty.

Apache Thrift

Apache Thrift was developed at Facebook in 2007 by an ex-Google employee and is used extensively there. It is now an open Apache project, originally hosted in Apache’s Incubator. Its goal was to be the next-generation PB, with more extensive features and support for more languages. The Apache Thrift software framework combines a software stack with a code generation engine to build services that work between a wide range of languages, including C++, C#, Cocoa, Erlang, Java, JavaScript, Node.js, OCaml, Perl, PHP, Smalltalk and others. Its IDL syntax is supposed to be cleaner than PB’s, and it offers a full client/server stack for RPC calls (which Protobufs does not). It also offers additional data structures, such as Map and Set.

Here is a summary of an example of a simple service that communicates using Thrift, developed by Software Developer Kevin Sookocheff: a Task Manager deployed as a service using Thrift. One of the initial steps is to define your API. The API for the Task Manager exposes a single endpoint for adding tasks, which essentially transfers a task from the manager’s list to an employee. As a REST endpoint, the API would look like this:

POST /api/v1/tasks/create
with payload:
{
   "taskId": "TK-2190809",
   "description": "Close Jira ticket OPS-12345"
}
When the employee receives a task, an acknowledgement is sent back containing the task ID received, which would look something like this:
200 SUCCESS
with payload:
{
   "taskId": "TK-2190809"
}

Thrift divides a service API into four different elements:

  1. Types – types define a common type system, which is translated into different language implementations;
  2. Transport – transports detail how data is migrated from one service to another (this can involve TCP sockets, HTTP requests, or files on disk as a transport mechanism);
  3. Protocol – protocols detail the encoding/decoding of data types e.g. the same service is able to communicate using a binary protocol, JSON, or XML;
  4. Processors – the processing of incoming requests can be separated from transport functions, data types and protocol, letting the client and server concentrate exclusively on business logic.

The Task Manager service can be defined through the creation of a Thrift definition file in a new Go project directory.
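A minimal sketch of such a definition file, based on the request and acknowledgement payloads shown above, might look like the following (the file name, namespace and field IDs are assumptions for illustration):

```thrift
// task.thrift — hypothetical Thrift IDL for the Task Manager service.
namespace go taskmanager

// The task being transferred from manager to employee.
struct Task {
  1: string taskId,
  2: string description
}

// The acknowledgement returned to the caller.
struct TaskAck {
  1: string taskId
}

service TaskManager {
  // Transfers a task to an employee and returns an acknowledgement.
  TaskAck createTask(1: Task task)
}
```

From a file like this, `thrift --gen go task.thrift` (or `--gen java`, and so on) generates the client and server stubs for the chosen language.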

After Thrift is installed, client and server code can be generated from your Thrift IDL file for the language of your choice. This enables the decoupling of the serialization protocol, transport method and programming language from the application’s business logic.

Further details of the implementation can be found here.

Some examples of applications, projects and organizations that use Thrift include Facebook, the Cassandra project, HBase (for a cross-language API), Hadoop (which supports access to its HDFS API through Thrift bindings), LastFM, ThriftDB, Scribe and Evernote (which uses Thrift for its public API).

Apache Avro

Apache Avro is a more recent serialization system that relies on schemas. Avro data is serialized together with its schema: the schema used when writing the data is always present when the data is read, and when Avro data is stored in a file, its schema is stored along with it, meaning the file can be processed later by any program. Avro offers official support for several languages, including C, C++, Java, Python and Ruby. It is also an RPC framework and is released under the Apache License 2.0.
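Avro schemas are themselves written in JSON. A hypothetical schema for the task record used in the Thrift example above might look like this (the record and namespace names are assumptions):

```json
{
  "type": "record",
  "name": "Task",
  "namespace": "taskmanager",
  "fields": [
    {"name": "taskId", "type": "string"},
    {"name": "description", "type": "string"}
  ]
}
```

Because this schema travels with the serialized data, a reader does not need generated code in order to interpret a Task record.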

Avro’s functionality is similar to that of Thrift and Protocol Buffers. It provides rich data structures, a compact binary data format, a container file for storing persistent data, remote procedure calls (RPC) and simple integration with dynamic languages. However, there are a few advantages unique to Avro: dynamic typing means data can be read and written without code generation; data is untagged, since the schema present at read time makes per-field identifiers unnecessary, yielding a more compact encoding; and fields are resolved by name, so there are no manually assigned field IDs to manage when the schema changes.

Copyright secured by Digiprove © 2020