Schedule web scrapers with Apache Airflow

This post is part of the Data Engineering Series. In the previous post, I discussed Apache Airflow and its basic concepts, configuration, and usage. In this post, I am going to discuss how you can schedule your web scrapers with the help of Apache Airflow. I will be using the same example I used in Apache Kafka […]

5 strategies to write unblock-able web scrapers in Python

People who read my posts in the scraping series often contact me to ask how they can write scrapers that don’t get blocked. It is very difficult to write a scraper that NEVER gets blocked, but you can increase the life of your web scraper by implementing a few strategies. Today I am going to […]

Implementing Beanstalk to create a scalable web scraper

Image Credit (http://blog.hqc.sk.ca/wp-content/uploads/2012/12/Queue-2012-12-11.jpg) Queues are often used to make applications scalable by offloading data and processing it later. In this post I am going to use the Beanstalk queue management system in Python. Before I get into the real task, allow me to give a brief intro to Beanstalk. What is Beanstalk? From the official website: Beanstalk […]

Write your first web crawler in Python Scrapy

The scraping series would not be complete without discussing Scrapy. In this post I am going to write a web crawler that will scrape data from OLX’s Electronics & Appliances items. Before I get into the code, how about a brief intro to Scrapy itself? What is Scrapy? From Wikipedia: Scrapy (/ˈskreɪpi/ skray-pee) is a […]

Write a Gmail autoresponder by using Python Selenium

In earlier posts (here and here) I discussed how to use the Python requests and BeautifulSoup libraries to access and scrape a website. This time I am going to make a simple Gmail autoresponder that responds to a certain mail. Before I discuss how to do it, a few words about Selenium and why it is going to make […]

Write your first web scraper in Python with BeautifulSoup

OK, so I am going to write the simplest web scraper in Python with the help of libraries like requests and BeautifulSoup. Before I move further, allow me to discuss what web/HTML scraping is. What is Web scraping? According to Wikipedia: Web scraping (web harvesting or web data extraction) is a computer software technique of extracting […]

A subreddit for web scrapers

I recently realized that I am in love with data scraping. Thanks to Python and BeautifulSoup for getting me into this. In the past few months I have scraped data from sites like Amazon, Rakuten and NewEgg. I have started a subreddit for developers who love to scrape sites for fun (or for work). I named […]