This post is part of the Data Engineering and ETL series.
In this post, I am going to introduce another ETL tool for your Python applications, called Apache Beam.
What is Apache Beam?
According to Wikipedia:
Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing.
Unlike Airflow and Luigi, Apache Beam is not a server. It is rather a programming model that comes with a set of APIs, currently available for the Java, Python, and Go programming languages. A typical Apache Beam pipeline looks like the one below:
(Image Source: https://beam.apache.org/images/design-your-pipeline-linear.svg)
From the left, the data is acquired (extract) from a database, then it goes through multiple transformation steps, and finally it is stored (load) into a database.
Let’s discuss a few of the terms used in Apache Beam (a small code sketch follows the list):
- Pipeline: A pipeline encapsulates the entire data processing task, from data acquisition to loading the results into a datastore.
- PCollection: A collection of data. The data can be bounded, coming from a fixed source, or unbounded, coming from one or more streams.
- PTransform: A processing operation applied to each element of a PCollection.
- Runners: A portable API layer that lets the same pipeline execute on different engines. Currently supported runners include the Direct Runner (for local development and testing), Apache Apex, Apache Flink, Apache Gearpump, Apache Spark, and Google Cloud Dataflow.
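To make these terms concrete, here is a minimal toy sketch of my own (not this post's pipeline) that wires them together on the Direct Runner:

```python
import apache_beam as beam

# Pipeline: encapsulates the whole job; the with-block runs it on exit,
# using the Direct Runner by default.
with beam.Pipeline() as p:
    lines = p | beam.Create(['Hello', 'Beam'])  # a bounded PCollection
    lower = lines | beam.Map(str.lower)         # a PTransform, applied per element
    lower | beam.Map(print)                     # prints: hello, beam
```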
Development
Use pip to install the Python SDK:
pip install apache-beam
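If you plan to run your pipelines on Google Cloud Dataflow, the package also provides a gcp extra that pulls in the Google Cloud dependencies:
pip install 'apache-beam[gcp]'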
The Apache Beam SDK is now installed, so let's create a simple pipeline that reads lines from a text file, converts them to lowercase, and then reverses them.
Below is the complete program for the pipeline:
```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ToLower(beam.DoFn):
    def process(self, element):
        # Wrap the result in a list so Beam emits the whole string
        # as a single element (more on this below).
        return [{'Data': element.lower()}]


class ToReverse(beam.DoFn):
    def process(self, el):
        d = el['Data']
        return [d[::-1]]


if __name__ == '__main__':
    in_file = 'news.txt'
    out_file = 'processed'
    options = PipelineOptions()
    # The with-block runs the pipeline when it exits.
    with beam.Pipeline(options=options) as p:
        r = (
            p
            | beam.io.ReadFromText(in_file)
            | beam.ParDo(ToLower())
            | beam.ParDo(ToReverse())
            | beam.io.WriteToText(out_file, file_name_suffix='.txt')
        )
```
After importing the necessary libraries, we create a `PipelineOptions` object for configuration purposes. For instance, if you are using Google Cloud Dataflow, you would pass the necessary options such as the project name. You can also pass command-line arguments, for instance input and output file names, through its `flags` parameter.
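For example, here is a sketch of custom options using the documented `_add_argparse_args` hook; the `--input` and `--output` flag names are my own illustration, not built-in Beam flags:

```python
from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Hypothetical flags for this pipeline; the names are our own choice.
        parser.add_argument('--input', default='news.txt')
        parser.add_argument('--output', default='processed')

# Parse an explicit list of flags (it reads sys.argv if none is given).
options = MyOptions(['--input', 'news.txt', '--output', 'processed'])
print(options.input, options.output)  # news.txt processed
```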
The very first step is data ingestion or acquisition: we call `ReadFromText` to read the data from a text file. The pipe sign (`|`) is an overloaded operator that applies a `PTransform` to a `PCollection`. If you have used Linux pipes, this should not be difficult to understand. In our case, the collection is produced by `ReadFromText` and then passed to `ToLower()` via `ParDo`. `ParDo` applies a function to each element of a `PCollection`; in our case, it runs on each line of the file. If you add a `print()` call in the `ToLower()` function, you will see that it iterates over the collection and prints each line, which arrives in the `element` variable.
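For instance, a quick way to inspect the elements while debugging (the same `ToLower`, with a print added):

```python
class ToLower(beam.DoFn):
    def process(self, element):
        print(element)  # each call receives one line of the input file
        return [{'Data': element.lower()}]
```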
We construct a dictionary (a JSON-like object) and store the content of each element in it. After that, it goes through another `PTransform`, this time `ToReverse`, which reverses the content. In the end, the data is stored in a text file by calling `WriteToText`. The `file_name_suffix` parameter makes it create output files with a proper extension; for us it created processed-00000-of-00001.txt. By default, Apache Beam creates multiple output files (shards), as is common practice when working with distributed systems.
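If you prefer a single output file for a small local job like this one, `WriteToText` accepts a `num_shards` parameter, though forcing one shard limits parallelism on a real cluster:

```python
# Same write step, but forcing a single output shard.
beam.io.WriteToText(out_file, file_name_suffix='.txt', num_shards=1)
```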
The original content of the input file, news.txt, is:
Coronavirus cases in Pakistan doubled in one day with total tally at 106 on Monday
Dubai: Coronavirus cases in Pakistan has risen to 106 on Monday after the Sindh Provincial authorities confirmed 53 new cases.
Sindh is now the worst-hit province in Pakistan with 88 out of 106 total coronavirus cases reported until Monday afternoon. This is the single largest increase in novel coronavirus cases in the country as of today.
Murtaza Wahab, Advisor to Sindh Chief Minister on Law, Anti-Corruption Establishment, and Information, tweeted on Monday that 50 people who returned from the quarantine at Taftan border with Iran had tested positive for COVID-19
The first transformation converts all characters to lowercase, for instance:
coronavirus cases in pakistan doubled in one day with total tally at 106 on monday
dubai: coronavirus cases in pakistan has risen to 106 on monday after the sindh provincial authorities confirmed 53 new cases.
As you can see, I returned a list of dictionaries from `ToLower`. The reason is that if I return the string directly, each line comes out one character at a time, so coronavirus becomes:
c
o
r
o
n
a
v
i
r
u
s
which is undesirable. This happens because `process()` must return an iterable of output elements, and a Python string is itself an iterable of its characters, so Beam emits each character as a separate element. Wrapping the string in a list makes it a one-element iterable, so the whole line is emitted intact. In the last transformation, I return the reversed string in a list without any dictionary. Upon running, a file named processed-00000-of-00001.txt is created with the following content:
yadnom no 601 ta yllat latot htiw yad eno ni delbuod natsikap ni sesac surivanoroc
.sesac wen 35 demrifnoc seitirohtua laicnivorp hdnis eht retfa yadnom no 601 ot nesir sah natsikap ni sesac surivanoroc :iabud
.yadot fo sa yrtnuoc eht ni sesac surivanoroc levon ni esaercni tsegral elgnis eht si siht .noonretfa yadnom litnu detroper sesac surivanoroc latot 601 fo tuo 88 htiw natsikap ni ecnivorp tih-tsrow eht won si hdnis
91-divoc rof evitisop detset dah nari htiw redrob natfat ta enitnarauq eht morf denruter ohw elpoep 05 taht yadnom no deteewt ,noitamrofni dna ,tnemhsilbatse noitpurroc-itna ,wal no retsinim feihc hdnis ot rosivda ,bahaw azatrum
As you can see, the file contains the same content, but each line is reversed.
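As a side note, the dictionary wrapper is not strictly required: since `process()` only needs to return an iterable, a generator works just as well. A minimal sketch of the same two transforms using `yield` (my own variant, not this post's code):

```python
import apache_beam as beam

class ToLower(beam.DoFn):
    def process(self, element):
        # Yielding emits the whole lowercased line as one element,
        # instead of letting Beam iterate over the string's characters.
        yield element.lower()

class ToReverse(beam.DoFn):
    def process(self, element):
        yield element[::-1]
```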
Conclusion
Apache Beam is a nice tool for writing ETL-based applications, and I have only covered the basics here. You can integrate it with cloud platforms as well as load the data into a typical database. I hope you find it useful and try it in your next application.
As always, the code is available on GitHub.