How to speed up your Python web scraper by using multiprocessing

 

In earlier posts, here and here, I discussed how to write a scraper and make it secure and foolproof. Those things are good to implement, but they are not enough to make the scraper fast and efficient.

In this post I am going to show how changing a few lines of code can speed up your web scraper by X times. Keep reading!

If you remember that post, I scraped a detail page of OLX. Usually you end up on such a page after going through a listing of entries. First I will write a script without multiprocessing, we will see why it is not good enough, and then a scraper with multiprocessing.

OK, the goal is to access the listing page and fetch all the entry URLs from it. For the sake of simplicity I am not covering the pagination part. So, let's get into the code.

Here's the gist that accesses the listing page, then parses each entry and fetches its information.
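In short, the serial version looks roughly like this (a condensed sketch: the CSS selectors and the user-agent string follow the full code posted in the comments at the end of this post, and parse() here extracts only a few of the fields):

from time import sleep

import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}


# fetch the listing page and return the links to the individual entries
def get_listing(url):
    links = []
    r = requests.get(url, headers=headers, timeout=10)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, 'lxml')
        anchors = soup.select('#offers_table table > tbody > tr > td > h3 > a')
        links = [a['href'].strip() for a in anchors]
    return links


# parse a single detail page and return its fields as a comma-delimited string
def parse(url):
    title_text = location_text = price_text = '-'
    r = requests.get(url, headers=headers, timeout=10)
    sleep(2)  # a 2-second delay between requests, to stay polite
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, 'lxml')
        title = soup.find('h1')
        if title is not None:
            title_text = title.text.strip()
        location = soup.find('strong', {'class': 'c2b small'})
        if location is not None:
            location_text = location.text.strip()
        price = soup.select('div > .xxxx-large')
        if price:
            price_text = price[0].text.strip('Rs').replace(',', '')
    return ','.join([url, title_text, location_text, price_text])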

 

I have divided the script into two functions: get_listing(), which accesses the listing page, parses it, saves the links into a list, and returns that list; and parse(url), which takes an individual URL, parses the information, and returns a comma-delimited string.

I then call get_listing() to get the list of links, use the mighty list comprehension to parse each individual entry, and save ALL the info into a CSV file.
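In code, that part looks roughly like this (it assumes the get_listing() and parse() sketched above; data.csv is just an example file name):

# serial version: fetch the listing once, then parse the links one by one
cars_links = get_listing('https://www.olx.com.pk/cars/')
cars_info = [parse(link) for link in cars_links]

if cars_info:
    with open('data.csv', 'a+') as f:
        f.write('\n'.join(cars_info))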

Then I executed the script with the time command, which measures how long a process takes to run.

On my computer it came out to around 6 minutes for 50 records. There is a 2-second delay in each of the 50 iterations, which accounts for about 1 minute 40 seconds of that, so the scraping itself takes somewhere around four and a half minutes.

Now I am going to make a few lines of changes and get it running in parallel.

Keep reading!

The first change is to use a new Python module, multiprocessing:
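from multiprocessing import Pool

Pool is the only name we need from it for this script.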

From the documentation:

multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.

 

So unlike threads, each subprocess gets its own interpreter and its own GIL, so the lock no longer gets in the way. It reminds me a bit of MapReduce, though it is obviously not the same. Now note the following lines:
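(This is a reconstructed sketch of those lines; the names pool, records, and cars_links match the full code in the comments at the end of this post.)

pool = Pool(10)
records = pool.map(parse, cars_links)
pool.terminate()
pool.join()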

Pool plays an important role here: it tells how many subprocesses should be spawned at a time. Here I passed 10, which means 10 URLs will be processed at a time.

In the second line, the first argument of map() is the function that will be run across the processes and the second argument is the list of links. In our case there are 50 links, so they will be handled in 5 batches: 10 URLs are accessed and parsed in one go, and map() returns the collected data as a list.

The third line actually terminates the worker processes: on *nix it sends SIGTERM and on Windows it uses TerminateProcess().

The last line, join(), in simple words makes sure to avoid zombie processes and lets all the processes end gracefully.

If you don't use terminate() and join(), the ONLY issue you'd have is that many zombie or defunct processes would keep occupying your machine for no reason. I'm sure you definitely don't want that.

Or, you can use a context manager to make it even simpler. Thanks to KimPeek on Reddit Python for this tip!
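With a context manager it looks roughly like this; the pool is terminated automatically when the with block exits, so you no longer have to call terminate() yourself:

with Pool(10) as p:
    records = p.map(parse, cars_links)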

Alright, I ran this script the same way, with the time command.

Here there is the same 2-second delay per request, but since the requests are processed in parallel the whole run took only around 22 seconds. I reduced the pool size to 5 and it was still quite good!

I hope you will be implementing multiprocessing to speed up your next web scrapers. Give your feedback in the comments and let everyone know how it could be made even better than this one. Thanks.

As usual, the code is available on GitHub.

If you like this post then you should subscribe to my blog for future updates.

If you like my work then you can donate in bitcoins.

2 responses to “How to speed up your Python web scraper by using multiprocessing”

  1. Goran says:

    Check this out…. it’s even faster 🙂

    import requests
    from bs4 import BeautifulSoup
    from time import sleep
    from multiprocessing import Pool
    from multiprocessing import cpu_count
    from datetime import datetime

    # proxy used for the listing request only
    proxies = {'http': '64.183.94.45:8080'}

    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

    t1 = datetime.now()


    # fetch the listing page and return the links to the individual ads
    def get_listing(url):
        links = []
        r = requests.get(url, headers=headers, proxies=proxies, timeout=10)
        if r.status_code == 200:
            soup = BeautifulSoup(r.text, 'lxml')
            listing_section = soup.select('#offers_table table > tbody > tr > td > h3 > a')
            links = [link['href'].strip() for link in listing_section]
        return links


    # parse a single item to get information
    def parse(url):
        r = requests.get(url, headers=headers, timeout=10)
        sleep(2)

        info = []
        title_text = '-'
        location_text = '-'
        price_text = '-'
        images = '-'
        description_text = '-'

        if r.status_code == 200:
            print('Processing..' + url)
            soup = BeautifulSoup(r.text, 'lxml')

            title = soup.find('h1')
            if title is not None:
                title_text = title.text.strip()

            location = soup.find('strong', {'class': 'c2b small'})
            if location is not None:
                location_text = location.text.strip()

            price = soup.select('div > .xxxx-large')
            if price:
                price_text = price[0].text.strip('Rs').replace(',', '')

            image_links = soup.select('#bigGallery > li > a')
            images = '^'.join(image['href'].strip() for image in image_links)

            description = soup.select('#textContent > p')
            if description:
                description_text = description[0].text.strip()

        info.append(url)
        info.append(title_text)
        info.append(location_text)
        info.append(price_text)
        info.append(images)
        info.append(description_text)

        return ','.join(info)


    if __name__ == '__main__':
        # fetch the listing links in the main process only
        cars_links = get_listing('https://www.olx.com.pk/cars/')

        # spawn twice as many worker processes as there are CPU cores
        pool = Pool(cpu_count() * 2)
        with pool as p:
            records = p.map(parse, cars_links)

        if len(records) > 0:
            with open('data_parallel.csv', 'a+') as f:
                f.write('\n'.join(records))

        t2 = datetime.now()
        total = t2 - t1
        print("Scraping finished in: %s" % total)