This post is part of the Data Engineering Series. In previous posts, I discussed writing ETLs in Bonobo, Spark, and Airflow. In this post, I introduce another ETL tool, Luigi, which was developed by Spotify. Earlier, I discussed writing basic ETL pipelines here, here and here. Bonobo is cool for writing ETL pipelines, but the world is not all about writing ETL pipelines to automate things. There are other use cases in which you have to perform tasks in a certain order, once or periodically. For instance: monitoring cron jobs transferring data from one place to another; automating your DevOps operations; periodically fetching data from websites and…
-
Create your first ETL in Luigi
-
Schedule web scrapers with Apache Airflow
This post is part of the Data Engineering Series. In the previous post, I discussed Apache Airflow and its basic concepts, configuration, and usage. In this post, I am going to discuss how you can schedule your web scrapers with the help of Apache Airflow. I will be using the same example I used in the Apache Kafka and Elasticsearch examples, that is, scraping https://allrecipes.com, because the purpose here is to use Airflow. If you want to learn about scraping, you may check the entire series here. So, we will work on a workflow consisting of these tasks: parse_recipes: It will parse individual recipes. download_image: It downloads the recipe image. store_data: Finally store image…
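To make the idea of an ordered workflow concrete, here is a minimal sketch in plain Python. It does not use Airflow's API. It only illustrates how the three tasks named above could be run in dependency order with the standard library's `graphlib`. The linear dependency chain, the function bodies, and the `context` dictionary are all assumptions made for illustration; the excerpt does not spell them out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical stand-ins for the three tasks named in the excerpt.
# Real tasks would scrape, download, and persist data.
def parse_recipes(context):
    context["recipes"] = ["recipe placeholder"]


def download_image(context):
    context["images"] = [f"image for {r}" for r in context["recipes"]]


def store_data(context):
    context["stored"] = list(zip(context["recipes"], context["images"]))


TASKS = {
    "parse_recipes": parse_recipes,
    "download_image": download_image,
    "store_data": store_data,
}

# Assumed dependencies: each task maps to the set of tasks
# that must finish before it can start.
DEPENDENCIES = {
    "parse_recipes": set(),
    "download_image": {"parse_recipes"},
    "store_data": {"download_image"},
}


def run_workflow():
    """Run every task once, in an order that respects DEPENDENCIES."""
    context = {}
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name](context)
    return order, context


if __name__ == "__main__":
    order, context = run_workflow()
    print(order)
```

A scheduler such as Airflow adds what this sketch leaves out: running the workflow periodically, retrying failed tasks, and tracking the state of each run.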