Create your first ETL Pipeline in Apache Beam and Python
Learn how to use Apache Beam to create efficient Pipelines for your applications.

This post is part of the Data Engineering and ETL Series.

In this post, I am going to introduce another ETL tool for your Python applications, called Apache Beam.

What is Apache Beam?

According to Wikipedia:

Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing.

Unlike Airflow and Luigi, Apache Beam is not a server. Rather, it is a programming model that comes with a set of APIs. Currently, they are available for the Java, Python and Go programming languages. A typical Apache Beam based pipeline looks like this:

A linear pipeline starts with one input collection, sequentially applies three transforms, and ends with one output collection.

(Image Source: https://beam.apache.org/images/design-your-pipeline-linear.svg)

From the left, the data is acquired (extract) from a database, then it goes through multiple transformation steps, and finally it is stored (load) into a database.

Let’s discuss a few of the terms used in Apache Beam:

  • Pipeline:- A pipeline encapsulates the entire data processing task, from acquiring the data to loading it into a datastore.
  • PCollection:- It is a collection of data. The data can be bounded, that is, coming from a fixed source, or unbounded, that is, coming from one or more streams.
  • PTransform:- It is a processing operation applied to a PCollection, often working on each element, that produces a new PCollection.
  • Runners:- A portable API layer that lets the same pipeline be executed on different engines, or runners. Currently, it supports Direct Runner (for local development or testing purposes), Apache Apex, Apache Flink, Gearpump, Apache Spark and Google Cloud Dataflow.

Development

Use pip to install the Python SDK:

pip install apache-beam

The Apache Beam SDK is now installed, so we will create a simple pipeline that reads lines from a text file, converts them to lowercase and then reverses them.

Below is the complete program for the pipeline:

After importing the necessary libraries, we create a PipelineOptions object to configure the pipeline. For instance, if you are using Google Cloud Dataflow then you’d pass the necessary options, such as the project and region. You can also pass command line arguments through its flags parameter, for instance, input and output file names.

The very first step is data ingestion or acquisition: we call ReadFromText to read the data from a text file. The pipe sign (|) is an overloaded operator that applies a PTransform to a PCollection. If you have used Linux pipes then it should not be difficult for you to understand.

In our case, the collection is produced by ReadFromText and then passed to ToLower() via ParDo. ParDo applies a function to each element of a PCollection; here, it runs on each line of the file. If you add a print() call in ToLower(), you will see that it iterates over the collection and prints each line, which is stored in the element variable. We construct a dictionary and store the content of each element in it. After that, the data goes through another PTransform, this time ToReverse, to reverse the content.

In the end, the data is stored in a text file by calling WriteToText. When you use the file_name_suffix parameter, it creates output files with the proper extension; for us, it created processed-00000-of-00001.txt. By default, Apache Beam may create multiple output files, as is common practice when working on distributed systems. The original content of the file is:

Coronavirus cases in Pakistan doubled in one day with total tally at 106 on Monday

Dubai: Coronavirus cases in Pakistan has risen to 106 on Monday after the Sindh Provincial authorities confirmed 53 new cases.

Sindh is now the worst-hit province in Pakistan with 88 out of 106 total coronavirus cases reported until Monday afternoon. This is the single largest increase in novel coronavirus cases in the country as of today.

Murtaza Wahab, Advisor to Sindh Chief Minister on Law, Anti-Corruption Establishment, and Information, tweeted on Monday that 50 people who returned from the quarantine at Taftan border with Iran had tested positive for COVID-19

 

The first transformation converts all characters to lowercase, for instance:

coronavirus cases in pakistan doubled in one day with total tally at 106 on monday

dubai: coronavirus cases in pakistan has risen to 106 on monday after the sindh provincial authorities confirmed 53 new cases.

As you can see, I have returned a list of dictionaries. The reason is that if I do not do this, it returns one character per line, so coronavirus becomes:

c
o
r
o
n
a
v
i
r
u
s

which is undesirable. This happens because ParDo treats whatever the function returns as an iterable of output elements, and iterating over a Python string yields its characters one at a time. Returning a list of dictionaries makes each line a single element. In the last transformation, I return the result without any dictionary.

Upon running, a file named processed-00000-of-00001.txt is created, which will contain the following content:

As you can see, the file now contains the same content, but lowercased and with each line reversed.
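
To see where the one-character-per-line output mentioned earlier can come from, here is a plain-Python sketch that needs no Beam at all. The helper emit_all is hypothetical, not a Beam API; it only mimics a runner looping over whatever a processing function returns.

```python
def emit_all(returned):
    """Hypothetical stand-in for a runner: every item of the return value
    becomes a separate output element."""
    return [item for item in returned]


def to_lower_bare(line):
    # Returns a bare string, which is itself an iterable of characters.
    return line.lower()


def to_lower_wrapped(line):
    # Returns a one-element list, so the whole line stays together.
    return [line.lower()]


print(emit_all(to_lower_bare("Corona")))     # ['c', 'o', 'r', 'o', 'n', 'a']
print(emit_all(to_lower_wrapped("Corona")))  # ['corona']
```

Wrapping the result in a list, or using yield inside the function, keeps each line as one element.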

Conclusion

Apache Beam is a nice tool for writing your ETL based applications. I just discussed the basics of it. You can integrate it with cloud solutions as well as store the data into a typical database. I hope you will find it useful and will try it in your next application.

As always, the code is available on GitHub.

 
