Getting started with Protocol Buffers and Python

In this post, I am going to talk about Protocol Buffers and how you can use them in Python for passing messages across networks. Protocol Buffers, or Protobuf for short, are used for data serialization and deserialization. Before I discuss Protobuf, I would like to discuss data serialization and deserialization first.

Data Serialization and Deserialization

According to Wikipedia:

Serialization is the process of translating a data structure or object state into a format that can be stored (for example, in a file or memory data buffer) or transmitted (for example, over a computer network) and reconstructed later (possibly in a different computer environment).

In simple words, you convert simple and complex data structures and objects into byte streams so that they can be transferred to other machines or stored somewhere.

Data deserialization is the reverse process: it restores the byte stream to the original format for further usage.

Serialization and deserialization are not something Google invented; most programming languages support them. Python provides pickle, PHP provides the serialize function, and many other languages offer similar features. The issue is that these serialization/deserialization mechanisms are not standardized and are only useful if both the source and the destination use the same language. There was a need for a standard format that could be used across systems irrespective of the underlying programming language.
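
For instance, here is a minimal sketch of Python's built-in pickle at work; note that only another Python program can reliably read the bytes it produces:

import pickle

employee = {"Name": "Adnan", "age": 20}

blob = pickle.dumps(employee)   # serialize: Python object -> bytes
print(pickle.loads(blob))       # deserialize: bytes -> Python object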

You might wonder: hey, why not just use XML or JSON? The issue is that these formats are text-based; they can get huge if the data structure is big, which can slow things down when sent across the network. Protobuf, on the other hand, is a binary format and can be much smaller when a large amount of data is sent.

Protobuf is not the first binary data serialization format, though: BSON (Binary JSON), developed by MongoDB, and MessagePack provide similar facilities.

Google's Protobuf is not all about data exchange. It also provides a set of rules to define and exchange these messages. On top of that, it is heavily used in gRPC, Google's Remote Procedure Call framework. Hence, if you are planning to work with gRPC in any programming language, you must have an idea of how to write .proto files.

Installation

In order to use .proto based messages, you must have the proto compiler installed on your machine, which you can download from here based on your operating system. Since I am on macOS, I simply used Homebrew to install it. Once it is installed correctly, you can run protoc --version to check the installed version. As of now, the latest version, libprotoc 3.9.1, is installed on my machine.

Now let’s write our first proto message.
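
Here is a minimal test.proto along the lines of what we will walk through below (I am assuming proto3 syntax; the package name tutorial is just a placeholder):

syntax = "proto3";

package tutorial;

message Employees {
  string Name = 1;
  int32 age = 2;
}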


The very first line dictates which version of the protobuf syntax you are using. If you do not specify one, it defaults to version 2. The second line packages your message to avoid naming conflicts. For this particular message, you may ignore it. The package line behaves differently in different programming languages after code generation: in Python it makes no difference, in Go and Java it is used as the package name, and in C++ it is used as a namespace.

The next line starts with message followed by the message type; in our case it is Employees. The message then consists of fields and their types, here one string field and one int32 field. If you have worked in languages like C or Go, you will find it similar to a struct.

There are two fields mentioned here: Name of type string and age of type int32. Under proto2 syntax you can make a field mandatory by putting required before the type name; by default, all fields are optional. There are three types of rules (illustrated in the sketch after this list):

  • required: this field is mandatory (proto2 only; it was removed in proto3).
  • optional: this field may be left unset.
  • repeated: this field can be repeated zero or more times.
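
As a quick illustration, here is a hypothetical proto2-style message that uses all three rules (proto2 because required does not exist in proto3):

syntax = "proto2";

message Team {
  required string name = 1;     // must be set before serializing
  optional string motto = 2;    // may be left unset
  repeated string members = 3;  // zero or more values
}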

You will have noticed that each field is assigned a number. This is not a typical variable assignment; these numbers are tags used to identify each field in the binary encoding. Just to clarify: this numbering is not for ordering purposes, as many suggest; you cannot guarantee the order in which these fields will be encoded.

Alright, the message is ready and now we have to generate code for it. For that purpose, we will be using protoc, a compiler that can generate code in multiple languages. Since I am working in Python, I will use the following command:

protoc -I=. --python_out=. ./test.proto

  • -I sets the path where protoc looks for dependencies and proto files. I passed a dot (.) for the current directory.
  • --python_out sets the path where the generated Python file will be stored. Again, it is generated in the current folder.
  • The last, unnamed argument is the path of the proto file itself.

When it runs, it generates a file named <PROTOFILE NAME>_pb2.py. In our case the input is test.proto, so the generated file name is test_pb2.py.

You do not need to worry about the contents of the generated file at this moment. For now, all you need to be concerned about is how to use it. I am creating a new file, test.py, that contains the following code:

import test_pb2 as Test  # the module protoc generated from test.proto

obj = Test.Employees()  # instantiate the message type defined in test.proto
obj.Name = 'Adnan'
obj.age = 20
print(obj.SerializeToString())  # serialize the message into bytes

It is pretty simple. I imported the generated module and instantiated an Employees object; this Employees comes from the message type we defined in the .proto file. You then simply assign values to the fields. The last line prints the serialized bytes. When I run the program it prints:

b'\n\x05Adnan\x10\x14'
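
Those nine bytes are the standard protobuf wire format. Each field begins with a tag byte computed as (field_number << 3) | wire_type, which you can verify by hand:

raw = b'\n\x05Adnan\x10\x14'

assert raw[0] == (1 << 3) | 2   # 0x0A: field 1 (Name), wire type 2 (length-delimited)
assert raw[1] == 5              # length of the string that follows
assert raw[2:7] == b'Adnan'     # the five UTF-8 bytes of the value
assert raw[7] == (2 << 3) | 0   # 0x10: field 2 (age), wire type 0 (varint)
assert raw[8] == 20             # 0x14: the varint-encoded value 20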

What if I set age to a string? It gives the error:

Traceback (most recent call last):
  File "test.py", line 5, in <module>
    obj.age = "20"
TypeError: '20' has type str, but expected one of: int, long

Awesome, so it also validates whether data of the correct type is being passed. Now let's save this output in a binary file.

What if we save something similar in JSON format, for comparison? The code now looks like this:

import test_pb2 as Test

obj = Test.Employees()
obj.Name = 'Adnan'
obj.age = 20
print(obj.SerializeToString())

# write the serialized protobuf bytes to a binary file
with open('output.bin','wb') as f:
    f.write(obj.SerializeToString())

# write the equivalent data as JSON text, for comparison
json_data = '{"Name": "Adnan","Age": 20}'

with open('output.json','w',encoding='utf8') as f:
    f.write(json_data)

Now if I do an ls -l it shows the following:

➜  LearningProtoBuffer ls -l output*
-rw-r--r--  1 AdnanAhmad  staff   9 May 13 18:05 output.bin
-rw-r--r--  1 AdnanAhmad  staff  27 May 13 18:05 output.json

The JSON file has a size of 27 bytes while the protobuf one takes only 9 bytes. You can easily figure out who the winner is here. Imagine sending a stream of JSONified data; it would definitely increase latency. By sending the data in binary format you can squeeze out many bytes and increase the efficiency of your system.
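
And since deserialization is the other half of the story, here is a minimal sketch that reads output.bin back into a message using the generated class's ParseFromString method:

import test_pb2 as Test

# read the serialized bytes back and restore the original message
with open('output.bin', 'rb') as f:
    data = f.read()

restored = Test.Employees()
restored.ParseFromString(data)      # deserialize: bytes -> message
print(restored.Name, restored.age)  # prints: Adnan 20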

Conclusion

In this post, you learned how you can efficiently send and store large amounts of data by using Protocol Buffers, which generate code in many languages and thus make it easier to exchange data without worrying about the underlying programming language.


If you like this post, subscribe to my blog for future updates.
