Getting started with Apache Avro and Python
Learn how to create and consume Apache Avro-based data for faster, more efficient data transfer.

In this post, I am going to talk about Apache Avro, an open-source data serialization system that is being used by tools like Spark, Kafka, and others for big data processing.

What is Apache Avro

According to Wikipedia:

Avro is a row-oriented remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Avro uses a schema to structure the data that is being encoded. It has two different types of schema languages; one for human editing (Avro IDL) and another which is more machine-readable based on JSON.

Basically, Avro is a language-independent data serialization system developed by the father of Hadoop, Doug Cutting. Before discussing Avro further, let me briefly explain data serialization and its advantages.

What is Data Serialization and Deserialization

Data serialization is the process of converting complex objects (arrays, dicts, lists, class instances, JSON, etc.) into byte streams so that they can be stored or transferred to other machines. The point of serializing data is to be able to exchange it among computers that have different architectures, hardware, or operating systems. Once the data is received at the other end, it can be converted back to its original form; that process is called deserialization.
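To make this concrete, here is a minimal sketch of a serialization round trip using Python’s built-in pickle module (nothing to do with Avro, just to illustrate the idea):

import pickle

# A complex Python object: a dict containing a list
user = {"name": "Alyssa", "hobbies": ["coding", "reading"]}

# Serialization: turn the object into a byte stream
data = pickle.dumps(user)
print(type(data))        # <class 'bytes'>

# Deserialization: rebuild the original object from the bytes
restored = pickle.loads(data)
print(restored == user)  # True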

Now that you know what it is all about, let’s dig in and play with some code.

Development and Installation

There are two libraries that are currently used in Python applications. One is simply called avro, which you can find on PyPI, and the other is fastavro, which claims to be faster than the former. Both work in the same way. Since we are working on a toy example, the former library is sufficient for us. So, as always, use the typical pip tool to install it:

pip install avro
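If you would rather try fastavro, it installs the same way:

pip install fastavro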

Avro Schema

An Avro schema is actually a JSON structure. You can say that the Avro format is a combination of the serialized data and a JSON schema used for validation purposes. So before we create our Avro file, which has the extension .avro, we will create its schema.

{"namespace": "me.adnansiddiqi",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "age",  "type": ["int", "null"]},
     {"name": "gender", "type": ["string", "null"]}
 ]
}

OK, so I have come up with the schema above, which, as you can see, is a JSON structure. We mention a namespace first. It is nothing but a string. It usually follows the Java package naming convention, that is, the reverse of your domain name, but this is not mandatory; here I used the reverse of my blog URL. After that you mention the type of your schema, which here is record. There are other types too, like enum, array, etc. Then we mention the name of the schema, which is User here. Next comes fields, which can hold one or more field definitions. Each field has mandatory attributes like name and type and optional ones like doc and aliases. The doc attribute is used to document your field, while aliases lets you give the field names other than the one mentioned in name. So far so good. We will save this schema in a file called users.avsc.
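Just for illustration, a field that uses the optional doc and aliases attributes could look like this (a hypothetical variation, not part of our final users.avsc):

{"name": "age", "type": ["int", "null"],
 "doc": "Age of the user in years",
 "aliases": ["years_old"]}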

Now we will write the code that reads the schema from the schema file and then appends a few records to the Avro file. Later, we will read the records back and display them. Let’s write the code!

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Parse the schema we saved earlier
schema = avro.schema.parse(open("users.avsc").read())

# DatumWriter serializes each record according to the schema;
# DataFileWriter writes the records to the .avro container file
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "age": 25, "gender": "female"})
writer.append({"name": "Ahmad", "age": 35, "gender": "male"})
writer.close()

# DatumReader deserializes each record read back from the file
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
    print('===================')
reader.close()

After importing the necessary modules, the very first thing I do is read the schema file. DatumWriter is responsible for translating the data into the Avro format according to the given schema, and DataFileWriter is responsible for writing the data to the file. After appending a couple of records, we close the writer. DataFileReader works similarly; the only difference is that DatumReader() is responsible for deserializing the data stored in users.avro and making it available for display. When you run the code, it displays the following:

python play.py
{'name': 'Alyssa', 'age': 25, 'gender': 'female'}
===================
{'name': 'Ahmad', 'age': 35, 'gender': 'male'}
===================
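Since I mentioned fastavro earlier, here is a sketch of what the same round trip could look like with it, based on fastavro’s parse_schema, writer, and reader functions:

from fastavro import parse_schema, reader, writer

# The same schema, this time defined inline as a Python dict
schema = parse_schema({
    "namespace": "me.adnansiddiqi",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": ["int", "null"]},
        {"name": "gender", "type": ["string", "null"]},
    ],
})

records = [
    {"name": "Alyssa", "age": 25, "gender": "female"},
    {"name": "Ahmad", "age": 35, "gender": "male"},
]

# Write all the records in one call...
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# ...and iterate over them on the way back
with open("users.avro", "rb") as f:
    for user in reader(f):
        print(user)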

So far so good? Now let’s make a small change in the schema. I changed the age field to dob, and when I run the code again, it gives the following error:

python play.py
Traceback (most recent call last):
  File "play.py", line 8, in <module>
    writer.append({"name": "Alyssa", "age": 25,"gender":"female"})
  File "/Users/AdnanAhmad/Data/anaconda3/lib/python3.7/site-packages/avro/datafile.py", line 303, in append
    self.datum_writer.write(datum, self.buffer_encoder)
  File "/Users/AdnanAhmad/Data/anaconda3/lib/python3.7/site-packages/avro/io.py", line 771, in write
    raise AvroTypeException(self.writer_schema, datum)
avro.io.AvroTypeException: The datum {'name': 'Alyssa', 'age': 25, 'gender': 'female'} is not an example of the schema {
  "type": "record",
  "name": "User",
  "namespace": "me.adnansiddiqi",
  "fields": [
    {
      "type": "string",
      "name": "name"
    },
    {
      "type": [
        "int",
        "null"
      ],
      "name": "dob"
    },
    {
      "type": [
        "string",
        "null"
      ],
      "name": "gender"
    }
  ]
}

Cool, no? Similarly, if you make changes in the data that do not match the schema, it will scream and tell you to fix things up.
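For example, keeping the schema intact but appending a record whose age is a string rather than an int (or null) triggers the same AvroTypeException:

# Using the same schema and imports as before:
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
# age violates ["int", "null"], so this append raises avro.io.AvroTypeException
writer.append({"name": "Alyssa", "age": "twenty-five", "gender": "female"})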

Conclusion

I hope you have learned a bit about Apache Avro and how Python lets you use it to transfer data across devices and systems. I have only scratched the surface; there is much more to it. Avro is also heavily used by Apache Spark and Kafka for data transfer.

If you like this post then you should subscribe to my blog for future updates.
