6 things to develop an efficient web scraper in Python

Last week I was working on a web scraper for a client who needed around a million records from a real estate website. After a certain point the scraper stopped working, and the reason was that I had left out certain checks because I did not expect the client to use it that way. But he DID!

A few days back I shared a post about how to write a basic scraper in Python using BeautifulSoup. In this post I am going to discuss how to make your scraper more foolproof and user-friendly for non-technical people.

1- Check 200 status code

It is always good to check the HTTP status code early and proceed accordingly.

This is good:

This is better:
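As a rough sketch of the two variants (one possible reading; the function names, the optional `session` argument and the 5-second timeout are assumptions made for illustration): the first nests the work under a 200 check, while the second bails out early and reports the status so that failures stay visible.

```python
import requests


def fetch_good(url, session=None):
    """Good: only use the body when the server answered 200."""
    session = session or requests.Session()
    response = session.get(url, timeout=5)
    if response.status_code == 200:
        return response.text
    return None


def fetch_better(url, session=None):
    """Better: bail out early and say why, so failures can be logged."""
    session = session or requests.Session()
    response = session.get(url, timeout=5)
    if response.status_code != 200:
        print(f"Skipping {url}: HTTP {response.status_code}")
        return None
    return response.text
```

Passing the session in as an argument is only a convenience: it lets you reuse one connection across many pages and swap in a stub when testing.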

2- Never Trust HTML

Yep, especially if you can’t control it. Web scraping depends on the HTML DOM, so a simple change in an element or class name could break your entire script. The best way to deal with this is to check whether the lookup returned None before using the result.

Here I am checking whether the CSS selector returned something legitimate; if it did, the script proceeds further.
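A minimal sketch of such a check with BeautifulSoup, assuming a hypothetical `.price` selector and function name (neither comes from the real estate project):

```python
from bs4 import BeautifulSoup


def extract_price(html):
    """Return the text of the first `.price` element, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(".price")  # select_one returns None when nothing matches
    if node is None:
        # The markup changed or the element is absent; let the caller decide.
        return None
    return node.get_text(strip=True)
```

Returning None instead of raising lets the calling loop log the page and move on to the next one.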

3- Set headers

Python Requests does not force you to send request headers, but some smart websites will not let you read anything important unless certain headers are set. I once faced a situation where the HTML I was seeing in the browser was different from what I was getting via my script. So, it is always good to make your requests look as legitimate as you can. The least you should do is set a User-Agent.
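One way to do this, sketched with a placeholder User-Agent string (copy the one your own browser sends) and a hypothetical `build_session` helper:

```python
import requests

# Placeholder value for illustration; use the string your own browser sends.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
}


def build_session():
    """Create a session that sends HEADERS with every request it makes."""
    session = requests.Session()
    session.headers.update(HEADERS)
    return session
```

For a one-off call, `requests.get(url, headers=HEADERS)` achieves the same thing; the session just saves repeating the argument.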

 

4- Set timeout

One of the issues with Python Requests is that, if you don’t specify a timeout, it will keep waiting till its last breath: by default a request never times out. This might be fine in certain situations but not in the majority of cases. Therefore, it’s always good to set a timeout value for each request. Here I am setting the timeout to 5 seconds.
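A small sketch of a request with a 5-second timeout; the function name and the optional `session` argument are assumptions for illustration:

```python
import requests


def fetch_with_timeout(url, session=None, timeout=5):
    """Fetch a page, giving up if the server stays silent for `timeout` seconds."""
    session = session or requests.Session()
    try:
        response = session.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        print(f"Timed out after {timeout}s: {url}")
        return None
    return response.text
```

Note that the value limits how long Requests waits for the connection and for each chunk of the response, not the total download time. A tuple such as `timeout=(3, 10)` sets the connect and read limits separately.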

5- Exception handling

It is always good to implement exception handling. It not only helps avoid an unexpected exit of the script but also helps you log errors and informational messages. When using Python Requests I prefer to catch exceptions like this:

Check the very last one. It tells the program that if someone terminates it with Ctrl+C, it should wrap things up first and then exit. This is useful if you are storing information in a file and want to dump everything at the time of exit.
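As a rough sketch, the handlers could be arranged from most specific to most general, with `KeyboardInterrupt` last. The `crawl` function and its arguments are hypothetical names used only for illustration:

```python
import requests


def crawl(urls, session=None):
    """Fetch each URL, skipping failures and stopping cleanly on Ctrl+C."""
    session = session or requests.Session()
    pages = []
    for url in urls:
        try:
            response = session.get(url, timeout=5)
            response.raise_for_status()  # turn 4xx/5xx answers into HTTPError
            pages.append(response.text)
        except requests.exceptions.HTTPError as ex:
            print(f"HTTP error for {url}: {ex}")
        except requests.exceptions.Timeout:
            print(f"Timed out: {url}")
        except requests.exceptions.ConnectionError as ex:
            print(f"Connection problem for {url}: {ex}")
        except requests.exceptions.RequestException as ex:
            # Base class of all Requests errors; catches anything not handled above.
            print(f"Request failed for {url}: {ex}")
        except KeyboardInterrupt:
            print("Interrupted by user; stopping early.")
            break
    return pages
```

Because the loop breaks instead of crashing, whatever was collected before the interruption is still returned to the caller.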

6- Efficient File Handling

One of the functions of a web scraper is to store data, either in a database or in flat files such as CSV or text. If you are scraping a large amount of data, it is not a good idea to do an I/O operation inside the loop. Let me show you how I do it:

Here I am calling a function (though you are not bound to do it like that) that appends records to a list. Once it’s done, or if the program gets terminated, the entire list is dumped to the file in a single go before exit. Much better than multiple I/Os.

I hope you find it useful. Please share your experience of how else one could make a scraper super efficient.
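A minimal sketch of this buffer-then-dump pattern, assuming hypothetical helper names (`save_records`, `scrape_all`) and a `scrape_one` callable that returns one record per URL:

```python
def save_records(records, path):
    """Append all buffered records to the file in a single write."""
    if not records:
        return
    with open(path, "a", encoding="utf-8") as record_file:
        record_file.write("\n".join(records) + "\n")


def scrape_all(urls, scrape_one, path="records_file.txt"):
    """Collect one record per URL in memory, then write them all at once."""
    records = []
    try:
        for url in urls:
            records.append(scrape_one(url))
    except KeyboardInterrupt:
        print("Interrupted; saving what was collected so far.")
    finally:
        # Runs on normal completion, on Ctrl+C and on unexpected errors.
        save_records(records, path)
    return records
```

The trade-off is that everything lives in memory until the end, so a hard kill of the process loses the buffer.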




6 responses to “6 things to develop an efficient web scraper in Python”

  1. Tejas says:

    Hi Adnan, could you please elaborate on your last point with a better example? It would be really appreciated. Thanks in advance.

    • admin says:

      I’ll try my best.

      There are many ways to store data in a file or DB. One thing people usually do is store data in each iteration. This is not a good practice, as you are doing I/O in each iteration. Imagine you are downloading 100K records: if you do an I/O in each iteration, it means you opened a file, stored your data and closed it 100K times. You might want to do this because you want to store data as soon as it’s available.

      What I proposed does the same thing but more efficiently: I keep all the records in a list, and when the run ends (either due to an error or normal termination), the finally block stores all of those 100K records as a single string after joining the list entries, which performs better. That is a 1:100K ratio of writes.

      HTH

  2. ZeeD26 says:

    Your last example suffers from one small problem in these lines:

    try:
        # file to store state based URLs
        record_file = open('records_file.txt', 'a+')
        record_file.write("\n".join(property_urls))
        record_file.close()
    except Exception as ex:
        print("Unable to store records in CSV file. Technical details below.\n")
        print(str(ex))

    If something goes south while writing to record_file it is not closed properly. Better to use the context manager:

    try:
        # file to store state based URLs
        with open('records_file.txt', 'a+') as record_file:
            record_file.write("\n".join(property_urls))
    except Exception as ex:
        print("Unable to store records in CSV file. Technical details below.\n")
        print(str(ex))

  3. Ilya says:

    Or use Scrapy!