scraping

ScrapeGen – Tool for generating Python scrapers
A simple Python tool that generates a requests/bs4-based web scraper

OK, so I was kind of bored last week and thought of coming up with something anyway, even if it is useless. So I set aside a few hours and came up with ScrapeGen. What is ScrapeGen? It is a simple Python-based command-line tool that generates Python web scrapers based on rules and details entered in a YAML file. When it runs, it generates a new file. The rules are turned into separate functions, which are then called from the main parsing method. View the Demo: Why is it needed? This kind of tool could be good for companies and individuals who write many parsers and hardcode the rules within the .py files. Imagine the rule…

Create your first web scraper with Scraper API and Python

Recently I came across a tool that takes care of many of the issues you usually face while scraping websites. The tool is called Scraper API, and it provides an easy-to-use REST API to scrape different kinds of websites (simple, JS-enabled, CAPTCHA-protected, etc.) with ease. Before I proceed further, allow me to introduce Scraper API. What is Scraper API? If you visit their website, you’ll find their mission statement: Scraper API handles proxies, browsers, and CAPTCHAs, so you can get the HTML from any web page with a simple API call! As it suggests, it offers you all the things to deal with the issues…

Schedule web scrapers with Apache Airflow

This post is part of the Data Engineering Series. In the previous post, I discussed Apache Airflow and its basic concepts, configuration, and usage. In this post, I am going to discuss how you can schedule your web scrapers with the help of Apache Airflow. I will be using the same example I used with Apache Kafka and Elasticsearch, that is, scraping https://allrecipes.com, because the purpose here is to use Airflow. In case you want to learn about scraping, you may check the entire series here. So, we will work on a workflow consisting of these tasks: parse_recipes: It will parse individual recipes. download_image: It downloads the recipe image. store_data: Finally store image…

5 strategies to write unblockable web scrapers in Python

Introduction People who read my posts in the scraping series often contact me to ask how they can write scrapers that don’t get blocked. It is very difficult to write a scraper that NEVER gets blocked, but you can increase the life of your web scraper by implementing a few strategies. Today I am going to discuss them. User-Agent The very first thing you need to take care of is setting the User-Agent. A user agent is a tool that works on behalf of the user and tells the server which web browser the user is using to visit the website. Many websites do not let you view the content…
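
As a rough illustration of the User-Agent point, here is a minimal sketch that attaches an explicit User-Agent header to a request. It uses the standard library's urllib.request only to stay dependency-free; with requests, the same dictionary would be passed as the headers argument of requests.get. The BROWSER_UA string and the build_request helper are hypothetical names made up for this example and are not taken from the post.

```python
from urllib.request import Request

# Hypothetical User-Agent string (an assumption for this sketch);
# substitute the one your own browser actually sends.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


# Build the request only; no network call is made here.
req = build_request("https://example.com/")

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing the resulting object to urllib.request.urlopen would send the request with that header.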

Implementing beanstalk to create a scalable web scraper

Image Credit (http://blog.hqc.sk.ca/wp-content/uploads/2012/12/Queue-2012-12-11.jpg) Queues are often used to make applications scalable by offloading data and processing it later. In this post I am going to use the Beanstalk queue management system in Python. Before I get into the real task, allow me to give a brief intro to Beanstalk. What is Beanstalk? From the official website: Beanstalk is a simple, fast work queue. Its interface is generic, but was originally designed for reducing the latency of page views in high-volume web applications by running time-consuming tasks asynchronously. Its daemon is called beanstalkd, which you can run on *nix-based machines. Since I am on OSX, I ran brew install beanstalkd to install it. Once…

Write your first web crawler in Python Scrapy

The scraping series would not be complete without discussing Scrapy. In this post I am going to write a web crawler that will scrape data from OLX’s Electronics & Appliances items. Before I get into the code, how about a brief intro to Scrapy itself? What is Scrapy? From Wikipedia: Scrapy (/ˈskreɪpi/ skray-pee)[1] is a free and open source web crawling framework, written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general purpose web crawler.[2] It is currently maintained by Scrapinghub Ltd., a web scraping development and services company. A web crawling framework which has done all…

Write a Gmail autoresponder by using Python Selenium

In earlier posts (here and here) I discussed how to use the Python requests and BeautifulSoup libraries to access and scrape a website. This time I am going to make a simple Gmail autoresponder that responds to a certain mail. Before I discuss how to do it, a few words about Selenium and why it is going to make our life easier. Advantages of Selenium What does one achieve with Selenium that a lightweight solution based on Python requests and BeautifulSoup does not offer? Selenium automates browser activities by simulating clicks and other events, and makes it easier to access information that only becomes available after JavaScript has executed on the page. Since it’s automating…

How to speed up your Python web scraper by using multiprocessing

In earlier posts, here and here, I discussed how to write a scraper and make it secure and foolproof. These things are good to implement but not enough to make it fast and efficient. In this post, I am going to show how changing a few lines of code can speed up your web scraper by X times. Keep reading! If you remember the post, I scraped the detail page of OLX. Now, usually, you end up on this page after going through the listing of such entries. First, I will make a script without multiprocessing, we will see why it is not good, and then a scraper…

How to develop an efficient web scraper in Python

Last week I was working on a web scraper for a client who needed to get around a million records from a real estate website. After a certain point, the scraper stopped working, and the reason was that I had forgotten to put in certain checks, as I was expecting the client would not go that route, but he DID! A few days back I shared a post about how to write a basic scraper in Python using BeautifulSoup. In this post, I am going to discuss how to make your scraper more foolproof and user-friendly for non-technical people. 1- Check 200 status code It is always good to check the…
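
The first check named above, the 200 status code, could be sketched roughly as follows. The helper name body_if_ok is hypothetical and not from the post; it only shows the idea of refusing to parse a response unless the server reported success.

```python
def body_if_ok(status_code, body):
    """Return the body only for a 200 response; otherwise return None."""
    if status_code != 200:
        return None
    return body


# With requests, this would typically be used as:
#     response = requests.get(url)
#     html = body_if_ok(response.status_code, response.text)
# The calls below use made-up values so the sketch runs without a network.
print(body_if_ok(200, "<html>ok</html>"))
print(body_if_ok(404, "<html>not found</html>"))
```

Returning None lets the caller skip or log a failed page instead of feeding an error page to the parser.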

Write your first web scraper in Python with BeautifulSoup

Ok, so I am going to write the simplest web scraper in Python with the help of libraries like requests and BeautifulSoup. Before I move further, allow me to discuss what web/HTML scraping is. What is web scraping? According to Wikipedia: Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. This is accomplished by either directly implementing the Hypertext Transfer Protocol (on which the Web is based), or embedding a web browser. So we use the scraping technique to access data from web pages and make it useful for various purposes (e.g., analysis, aggregation, etc.). Scraping is not the only way to…
