Bizety: Research & Consulting

Data Serialization – Protocol Buffers vs Thrift vs Avro

What is Data Serialization?

Data serialization refers to the process of translating data structures or object state into a format that can be stored (for example, in a memory buffer or file) or transmitted, and reconstructed at a different point. When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. This process is not straightforward for most complex objects, and serialization of object-oriented objects does not include the methods with which they were previously linked. The phrase marshalling an object is also used in reference to serializing an object.

Deserialization refers to the reverse operation: extracting a data structure from a series of bytes. This happens once the serialized data has been transmitted from the source machine to the destination machine.
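As a minimal illustration of this round trip, the Python sketch below serializes a small data structure to JSON bytes and deserializes it back into an equivalent object. The `task` dictionary is a hypothetical example, not taken from any of the frameworks discussed here.

```python
import json

# A simple in-memory data structure (hypothetical example).
task = {"taskId": "TK-2190809", "description": "Close Jira ticket OPS-12345"}

# Serialization: translate the structure into bytes suitable for
# storage or transmission.
payload = json.dumps(task).encode("utf-8")

# Deserialization: reconstruct a semantically identical clone
# from the series of bytes.
clone = json.loads(payload.decode("utf-8"))
```

Note that the clone is equal to the original but is a distinct object; only the data survives the round trip, not object identity (or, for object-oriented objects, their methods).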

What are its Uses?

Serialization enables the saving of an object state and its later recreation. Typical uses include persisting objects to files or databases, exchanging data between processes or over a network (for example, in remote procedure calls), and caching or replicating object state across systems.

Data Serialization Formats

There is a wide variety of data serialization formats, including XML, JSON, BSON, YAML, MessagePack, Protocol Buffers, Thrift and Avro. The choice of format for an application depends on a variety of factors, including data complexity, the need for human readability, and latency and storage space concerns.

XML serves as the reference benchmark for the other formats, as it was the original widely adopted implementation. JSON is often described as faster and more lightweight.

Here we will look at three newer frameworks: Thrift, Protocol Buffers and Avro, all of which offer efficient, cross-language serialization of data using a schema, and code generation for Java. Each has a different set of strengths. All three also support schema evolution: the schema can be changed, producers and consumers can hold different versions of the schema simultaneously, and everything continues to work. In a large production system, this is a highly valuable feature, as it allows you to update different elements of the system independently, at different times, without compatibility concerns.

They share certain features, including an interface definition language (IDL) for describing schemas, compact binary encodings, and code generation for multiple target languages.

Protocol Buffers

There has been a lot of hype around Google’s protocol buffers (also known as protobufs and PB). They were designed in 2001 as an optimized server request/response protocol and an alternative to XML. They remained a proprietary solution at Google until 2008, when they were open-sourced. Igor Anischenko, Java Developer at Lohika, describes them as “the glue to all Google services”, and “battle-tested, very stable and well trusted”. Indeed, Google uses Protocol Buffers as the foundation for a custom remote procedure call (RPC) system that underpins virtually all its intermachine communication.

There is excitement around the potential for Protobufs to speed things up, both because binary formats are usually quicker than text formats, and because the data model for messaging can be generated in a wide range of languages from the protobuf IDL file. Protocol buffers are language-neutral and platform-neutral, and Google describes them as “smaller, faster and simpler” than XML.

Protocol buffers currently support generated code in C++, Java, Objective-C and Python; with the newer proto3 language version, you can also work with C#, Dart, Go and Ruby. When they were first developed, Google’s preferred language was C++, which helps explain why Protobuf is strongly typed and has an independent schema file. PB was designed to be layered over an existing RPC mechanism.
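As a sketch of what such an independent schema file looks like, the hypothetical `.proto` definition below describes a simple task message (the file name, package and field names are assumptions for illustration). The field numbers (1, 2) identify fields in the binary encoding, which is what makes schema evolution possible: new fields with new numbers can be added without breaking readers compiled against the older schema.

```proto
// task.proto — hypothetical example schema (proto3 syntax).
syntax = "proto3";

package taskmanager;

// A task message; each field has a unique number used on the wire.
message Task {
  string task_id = 1;
  string description = 2;
  // A later schema version could add, e.g., `int64 due_date = 3;`
  // without breaking consumers built against this version.
}
```

Running `protoc` with an output flag such as `--java_out` or `--python_out` on a file like this generates the data model classes in the chosen language.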

Examples of projects using Protobufs include Google itself, ActiveMQ (which uses protobufs for its message store) and Netty.

Apache Thrift

Apache Thrift was developed at Facebook in 2007 by an ex-Google employee and is used extensively there. It is now an open Apache project, originally hosted in Apache’s Incubator. Its goal was to be the next-generation PB, with more extensive features and support for more languages. The Apache Thrift software framework combines a software stack with a code generation engine to build services that work between a wide range of languages, including C++, C#, Cocoa, Erlang, Java, JavaScript, Node.js, OCaml, Perl, PHP, Smalltalk and others. Its IDL syntax is supposed to be cleaner than PB’s, and it offers a full client/server stack for RPC calls (which Protobufs does not). It also offers additional data structures, such as Map and Set.

Here is a summary of an example of a simple service that communicates using Thrift, developed by Software Developer Kevin Sookocheff: a Task Manager deployed as a service using Thrift. One of the initial steps is to define your API. The API for the Task Manager exposes a single endpoint for adding tasks, which essentially transfers a task from the manager’s list to an employee. As a REST endpoint, the API would look like this:

POST /api/v1/tasks/create
with payload:
{
   "taskId": "TK-2190809",
   "description": "Close Jira ticket OPS-12345"
}
When the employee receives a task, an acknowledgement is sent back containing the task ID received, which would look something like this:
200 SUCCESS
with payload:
{
   "taskId": "TK-2190809"
}

Thrift divides a service API into four different elements:

  1. Types – types define a common type system, which is translated into different language implementations;
  2. Transport – transports detail how data is migrated from one service to another (this can involve TCP sockets, HTTP requests, or files on disk as a transport mechanism);
  3. Protocol – protocols detail the encoding/decoding of data types e.g. the same service is able to communicate using a binary protocol, JSON, or XML;
  4. Processors – the processing of incoming requests can be separated from transport functions, data types and protocol, letting the client and server concentrate exclusively on business logic.

The Task Manager service can be defined through the creation of a Thrift definition file in a new Go project directory.
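A minimal sketch of such a definition file, based on the request and acknowledgement payloads shown above, might look like the following (the file name, namespace and field IDs are assumptions for illustration):

```thrift
// task.thrift — hypothetical Thrift IDL for the Task Manager service.
namespace go taskmanager

// The task being transferred from manager to employee.
struct Task {
  1: string taskId,
  2: string description
}

// The acknowledgement returned to the caller.
struct TaskAck {
  1: string taskId
}

service TaskManager {
  // Transfers a task to an employee and returns an acknowledgement.
  TaskAck createTask(1: Task task)
}
```

From a file like this, `thrift --gen go task.thrift` (or `--gen java`, and so on) generates the client and server stubs for the chosen language.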

After Thrift is installed, client and server code can be generated from your Thrift IDL file for the language of your choice. This enables the decoupling of the serialization protocol, transport method and programming language from the application’s business logic.

Further details of the implementation can be found here.

Some examples of applications, projects and organizations that use Thrift include Facebook, the Cassandra project, HBase (for a cross-language API), Hadoop (which supports access to its HDFS API through Thrift bindings), LastFM, ThriftDB, Scribe and Evernote (which uses Thrift for its public API).

Apache Avro

Apache Avro is a more recent serialization system that relies on schemas. Avro data is serialized together with its schema: the schema used when writing the data is always present when the data is read, and when Avro data is stored in a file, its schema is stored along with it, meaning the file can be processed later by any program. Avro offers official support for several languages, including C, C++, Java, Python and Ruby. It is also an RPC framework and is released under the Apache License 2.0.
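Avro schemas are themselves written in JSON. A hypothetical schema for the task record used in the Thrift example above might look like this (the record and namespace names are assumptions):

```json
{
  "type": "record",
  "name": "Task",
  "namespace": "taskmanager",
  "fields": [
    {"name": "taskId", "type": "string"},
    {"name": "description", "type": "string"}
  ]
}
```

Because this schema travels with the serialized data, a reader does not need generated code in order to interpret a Task record.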

Avro’s functionality is similar to that of Thrift and Protocol Buffers. It provides rich data structures, a compact binary data format, a container file for storing persistent data, remote procedure calls (RPC) and simple integration with dynamic languages. However, there are a few advantages unique to Avro: dynamic typing means data can be read and written without code generation; data is untagged, since the schema present at read time makes per-field identifiers unnecessary, yielding a more compact encoding; and fields are resolved by name, so there are no manually assigned field IDs to manage when the schema changes.

Copyright secured by Digiprove © 2020