
How to develop an efficient web scraper in Python

Last week I was working on a web scraper for a client who needed to pull around a million records from a real estate website. Beyond a certain point the scraper stopped working, and the reason was that I had left out certain checks, assuming the client would never go down that route. But he DID!

A few days back I shared a post about how to write a basic scraper in Python using Beautiful Soup. In this post, I am going to discuss how to make your scraper more foolproof and friendlier for non-technical people.

1- Check 200 status code

It is always good to check the HTTP status code early and proceed accordingly.

This is good:

if r.status_code == 200:
    # Proceed further

This is better:

if r.status_code != 200:
    return False
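
For example, a small fetch helper like this (just a sketch; fetch_page and the call to requests.get are placeholders for whatever you already have) bails out early and keeps the rest of the code flat:

import requests

def fetch_page(url):
    r = requests.get(url)
    # Anything other than 200 OK means we should not try to parse the page
    if r.status_code != 200:
        return False
    return r.text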

2- Never Trust HTML

Yep, especially if you can’t control it. Web scraping depends on the HTML DOM; a simple change in an element or class name could break your entire script. The best way to deal with it is to check whether your selector actually returned something before you use it.

page_count = soup.select('.pager-pages > li > a')
if page_count:
    # do your stuff
else:
    # ALERT!! Send notification to Admin

Here I am checking whether the CSS selector returned something legitimate; if it did, I proceed further.
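
A minimal sketch of that pattern, assuming you already have a notify_admin helper (hypothetical here) hooked up to e-mail or whatever alerting you use:

def get_pager_links(soup):
    # select() returns an empty list if the markup has changed
    links = soup.select('.pager-pages > li > a')
    if not links:
        # hypothetical helper: e-mail, log file, whatever you prefer
        notify_admin("Pager selector returned nothing -- markup may have changed")
        return []
    return [a.get('href') for a in links]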

3- Set headers

Python Requests does not force you to set request headers when sending requests, but there are a few smart websites that will not let you read anything important unless certain headers are set. I once faced a situation where the HTML I was seeing in the browser was different from what I was getting via my script. So, it is always good to make your requests look as legitimate as you can. The least you should do is set a User-Agent.

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

r = requests.get(url, headers=headers, timeout=5)
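
If you are sending many requests, one option (just a sketch; the listings URL is a placeholder) is to attach the headers to a requests.Session once so every request sends them automatically:

import requests

session = requests.Session()
session.headers.update({
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'})

# every request made through the session carries the headers set above
r = session.get('https://www.example.com/listings', timeout=5)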

 

4- Set timeout

One of the issues with Python Requests is that, if you don’t specify a timeout, it will keep trying till its last breath. That might be fine under certain conditions, but not in the majority of cases. Therefore, it’s always good to set a timeout value for each request. Here I am setting the timeout to 5 seconds.

r = requests.get(url, headers=headers, timeout=5)
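
Requests also accepts a tuple if you want separate connect and read timeouts; the exact numbers below are just an example:

# wait at most 3 seconds to connect and 10 seconds for the server to respond
r = requests.get(url, headers=headers, timeout=(3, 10))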

5- Exception handling

It is always good to implement exception handling. Not only does it help to avoid an unexpected exit of the script, it can also help you log errors and send notifications. When using Python Requests I prefer to catch exceptions like this:

try:
    # your logic is here

except requests.ConnectionError as e:
    print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
    print(str(e))
except requests.Timeout as e:
    print("OOPS!! Timeout Error")
    print(str(e))
except requests.RequestException as e:
    print("OOPS!! General Error")
    print(str(e))
except KeyboardInterrupt:
    print("Someone closed the program")

Check the very last one. It tells the program that if someone wants to terminate it by pressing Ctrl+C, it should wrap things up first and then exit. This is handy if you are storing information in a file and want to dump it all at the time of exit.
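
Putting points 1 to 5 together, the fetch step of a scraper might look roughly like this (a sketch; get_html is my own naming, not a fixed recipe). KeyboardInterrupt is deliberately not caught here so the calling loop can do the wrap-up:

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

def get_html(url):
    try:
        r = requests.get(url, headers=headers, timeout=5)
        if r.status_code != 200:
            return None
        return r.text
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        print(str(e))
    except requests.ConnectionError as e:
        print("OOPS!! Connection Error. Make sure you are connected to Internet.")
        print(str(e))
    except requests.RequestException as e:
        print("OOPS!! General Error")
        print(str(e))
    return None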

6- Efficient File Handling

One of the jobs of a web scraper is to store data, either in a DB or in flat files like CSV/text. If you are scraping a large amount of data, it is not a good idea to perform an I/O operation inside the loop. Let me show you how I do it:

try:
    property_urls = []
    # collect everything in memory; the file is written only once, at the end
    property_urls.extend(a_func_return_record())
except requests.ConnectionError as e:
    print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
    print(str(e))
except requests.Timeout as e:
    print("OOPS!! Timeout Error")
    print(str(e))
except requests.RequestException as e:
    print("OOPS!! General Error")
    print(str(e))
except KeyboardInterrupt:
    print("Someone closed the program")
finally:
    # runs on normal completion, on error and on Ctrl+C alike
    print("Total Records = " + str(len(property_urls)))
    try:
        # file to store state based URLs
        record_file = open('records_file.txt', 'a+')
        record_file.write("\n".join(property_urls))
        record_file.close()
    except Exception as ex:
        print("Unable to store records in a file. Technical details below.\n")
        print(str(ex))

Here I am calling a function (though you are not bound to do it like that) that appends records to a list. Once it’s done, or the program gets terminated, the finally block dumps the entire list into a file in a single go. Much better than multiple I/O operations.

7- Using Sitemap

Whenever you write a scraper to fetch multiple records from a site, you tend to look for the page where all such links are listed, write a scraper to fetch those links, and then parse them individually. That is fine, but what if you could do it more smartly? Check the sitemap.xml file. Sitemaps are used by search engines to index pages, and many sites simply dump their individual listings into the sitemap. What you could do is grab all the relevant URLs from the XML file and then parse those pages. That way you not only get your desired records, but the chances of getting blocked by the website are also reduced, since you collect all the URLs at once. Sitemap files also often include sub-sitemaps in .gz format; you can download those compressed files, decompress them, and parse them individually.
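
As a rough sketch of that idea (the sitemap URL below is a placeholder, and I am relying on Beautiful Soup’s 'xml' parser, which needs lxml installed), you could collect the <loc> entries like this, decompressing .gz sub-sitemaps first:

import gzip

import requests
from bs4 import BeautifulSoup

def sitemap_urls(sitemap_url):
    r = requests.get(sitemap_url, timeout=5)
    if r.status_code != 200:
        return []
    content = r.content
    # sub-sitemaps are often served as compressed .xml.gz files
    if sitemap_url.endswith('.gz'):
        content = gzip.decompress(content)
    soup = BeautifulSoup(content, 'xml')
    # every <loc> tag holds the URL of a page or of another sitemap
    return [loc.text.strip() for loc in soup.find_all('loc')]

# e.g. sitemap_urls('https://www.example.com/sitemap.xml')  # placeholder URL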

I hope you will find it useful. Please share your experience about how else one could make a scraper super-efficient.

Planning to write a book about Web Scraping in Python. Click here to give your feedback

If you like this post then you should subscribe to my blog for future updates.




6 Comments

  • Tejas

Hi Adnan, can you please elaborate more on your last point with a better example? It would be really appreciated. Thanks in advance.

    • admin

I’ll try my best.

There are many ways you can store data in a file or DB. One that people usually pick is to store data on each iteration. This is not good practice, as you are doing I/O on every iteration. Imagine you are downloading 100K records; if you do an I/O operation on each iteration, that means 100K times you opened a file, stored your data and closed it. People like to do it this way because they want to store data as soon as it’s available.

What I proposed does the same thing but more efficiently: I store all the records in a list, and when the run ends (either due to an error or normal termination), the finally block stores all of those 100K records as a single string after joining the list entries, which is much better in terms of performance. A 1:100K ratio.

      HIH

  • ZeeD26

    Your last example suffers from one small problem in these lines:

    try:
        # file to store state based URLs
        record_file = open('records_file.txt', 'a+')
        record_file.write("\n".join(property_urls))
        record_file.close()
    except Exception as ex:
        print("Unable to store records in a file. Technical details below.\n")
        print(str(ex))

    If something goes south while writing to record_file, it is not closed properly. Better to use a context manager:

    try:
        # file to store state based URLs
        with open('records_file.txt', 'a+') as record_file:
            record_file.write("\n".join(property_urls))
    except Exception as ex:
        print("Unable to store records in a file. Technical details below.\n")
        print(str(ex))