Schedule web scrapers with Apache Airflow

This post is part of the Data Engineering Series. In the previous post, I discussed Apache Airflow and its basic concepts, configuration, and usage. In this post, I am going to discuss how you can schedule your web scrapers with the help of Apache Airflow. I will be using the same example I used in Apache Kafka […]

Getting started with Apache Airflow

This post is part of the Data Engineering Series. In this post, I am going to discuss Apache Airflow, a workflow management system developed by Airbnb. Earlier, I discussed writing basic ETL pipelines in Bonobo. Bonobo is cool for writing ETL pipelines, but the world is not all about writing ETL pipelines to automate things. […]

Data Engineering Series – An Intro

So I just realized that I am back here after a month or so. I was busy at work and traveling out of the country. I am starting a new series of sorts, which I call the Data Engineering Series, in which I will be discussing different tools. Of course, I am not able to discuss the […]

Getting started with Elasticsearch in Python

In this post, I am going to discuss Elasticsearch and how you can integrate it with different Python apps. What is Elasticsearch? Elasticsearch (ES) is a distributed and highly available open-source search engine that is built on top of Apache Lucene. It is written in Java and is thus available for many platforms. You store […]

5 strategies to write unblock-able web scrapers in Python

People who read my posts in the scraping series often contact me to ask how they can write scrapers that don’t get blocked. It is very difficult to write a scraper that NEVER gets blocked, but you can increase the life of your web scraper by implementing a few strategies. Today I am going to […]

How to set up PHP 7.1 and Apache 2.2 on Amazon Linux

OK, I had no plan to write this post, but I recently spent quite some time figuring this out, so I thought I would write it up for myself and for others who run into the same issue and would otherwise struggle with this simple thing. Alright, let’s proceed! Below are the details of my Amazon distro: […]

Getting started with Python and IPFS

In this post, I am going to discuss how you can use decentralized IPFS in your Python apps for storing different kinds of data. What is IPFS? From Wikipedia: InterPlanetary File System (IPFS) is a protocol and network designed to create a content-addressable, peer-to-peer method of storing and sharing hypermedia in a distributed file system. […]