In earlier posts, here and here, I discussed how to write a scraper and how to make it secure and foolproof. Those things are good to implement, but not enough to make it fast and efficient.
In this post, I am going to show how a change of a few lines of code can speed up your web scraper by X times. Keep reading!
If you remember that post, I scraped the detail page of OLX. Usually, you end up on such a page after going through a listing of entries. First, I will write a script without multiprocessing, and we will see why that is not good; then I will write a scraper with multiprocessing.
OK, the goal is to access this page and fetch all URLs from the page. For the sake of simplicity, I am not covering the pagination part. So, let’s get into code.
Here's the gist for accessing the listing from the URL and then parsing and fetching the information of each entry.
https://gist.github.com/kadnan/07fa4c3fd2f46f8a079ace39be90e211
I have divided the script into two functions: get_listing(), which accesses the listing page, parses it, saves the links in a list and returns it; and parse(url), which takes an individual URL, parses the info and returns a comma-delimited string.
I am then calling get_listing() to get the list of links and then using the mighty list comprehension to get a list of individually parsed entries, saving ALL the info in a CSV file.
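The full script is in the gist above; purely for illustration, here is a stripped-down sketch of the sequential flow, not the exact gist code. The CSS selector is illustrative and only the title field is extracted, to keep it short:

from time import sleep

import requests
from bs4 import BeautifulSoup


def get_listing(url):
    # Fetch the listing page and return the links to the individual detail pages.
    r = requests.get(url, timeout=10)
    soup = BeautifulSoup(r.text, 'lxml')
    anchors = soup.select('#offers_table table > tbody > tr > td > h3 > a')  # illustrative selector
    return [a['href'].strip() for a in anchors]


def parse(url):
    # Fetch one detail page and return its fields as a comma-delimited string.
    sleep(2)  # polite delay before each request
    r = requests.get(url, timeout=10)
    soup = BeautifulSoup(r.text, 'lxml')
    title = soup.find('h1')
    title_text = title.text.strip() if title else '-'
    return ','.join([url, title_text])


if __name__ == '__main__':
    cars_links = get_listing('https://www.olx.com.pk/cars/')
    records = [parse(link) for link in cars_links]  # one link at a time, sequentially
    with open('data_seq.csv', 'w') as f:
        f.write('\n'.join(records))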
Then I executed the script using the time command:
Adnans-MBP:~ AdnanAhmad$ time python listing_seq.py
which calculates the time a process takes. On my computer it returns:
real    5m49.168s
user    0m2.876s
sys     0m0.198s
Hmm... around 6 minutes for 50 records. There is a 2-second delay in each iteration, which accounts for roughly a minute and 40 seconds of that, so the actual fetching and parsing still takes over 4 minutes.
Now I am going to change a few lines and make it run in parallel.
Keep reading!
The first change is using a new Python module, multiprocessing:
from multiprocessing import Pool
From the documentation:
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.
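To see what that buys you, here is a tiny self-contained example (not from the scraper, just an illustration): four calls that each sleep for two seconds finish in roughly two seconds of wall-clock time with a pool of four workers, instead of eight seconds sequentially.

from multiprocessing import Pool
from time import sleep, time


def work(n):
    sleep(2)  # stand-in for a slow network request
    return n * n


if __name__ == '__main__':
    start = time()
    with Pool(4) as p:  # four worker processes
        results = p.map(work, [1, 2, 3, 4])
    print(results, round(time() - start, 1))  # [1, 4, 9, 16] in roughly 2 seconds, not 8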
So unlike the threading module, where every thread shares one interpreter and its lock, each subprocess gets its own interpreter, so the GIL applies per process rather than across your workers. It reminds me a bit of MapReduce, though it is obviously not the same thing. Now note the following lines:
p = Pool(10)  # Pool tells how many at a time
records = p.map(parse, cars_links)
p.terminate()
p.join()
The Pool here plays an important role: it tells how many subprocesses should be spawned at a time. I passed 10, which means 10 URLs will be processed at a single time.
In the second line, the first argument to map() is the function to be run in the worker processes and the second argument is the list of links. In our case there are 50 links, so there will be 5 rounds: 10 URLs are accessed and parsed in one go, and the results come back as a list.
The third line terminates the worker processes; on *nix it sends SIGTERM, and on Windows it uses TerminateProcess().
The last line, join(), in simple words makes sure there are no zombie processes left behind and that everything ends gracefully.
If you don't use terminate() and join(), the ONLY issue you'd have is many zombie or defunct processes occupying your machine for no reason. I'm sure you definitely don't want that.
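Putting those pieces together, a minimal sketch of how the parallel script could be wired up looks like this. The import from listing_seq.py is hypothetical; it just stands for reusing the same get_listing() and parse() functions from the sequential version:

from multiprocessing import Pool

# Hypothetical import: get_listing() and parse() are the same two functions
# described above, reused from the sequential script (listing_seq.py).
from listing_seq import get_listing, parse

if __name__ == '__main__':
    cars_links = get_listing('https://www.olx.com.pk/cars/')

    p = Pool(10)                        # 10 worker processes at a time
    records = p.map(parse, cars_links)  # blocks until every link has been parsed
    p.terminate()
    p.join()

    with open('data_parallel.csv', 'w') as f:
        f.write('\n'.join(records))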
Or, you can use a context manager to make it simpler still. Thanks to KimPeek on Reddit's r/Python for this tip!
with Pool(10) as p:
    records = p.map(parse, cars_links)
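The pool's context manager calls terminate() for you when the with block exits, so the explicit terminate() and join() calls are no longer needed.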
Alright, I ran this script:
Adnans-MBP:~ AdnanAhmad$ time python list_parallel.py
And the time it took:
real    0m22.884s
user    0m2.748s
sys     0m0.363s
The same 2-second delay applies per request, but since everything is processed in parallel the whole run took around 22 seconds. I also reduced the pool size to 5, and it was still quite good:
With Pool(5):

real    0m43.695s
user    0m2.829s
sys     0m0.336s
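Those timings line up with a quick back-of-the-envelope estimate. Roughly speaking, the 50 URLs are handled in waves of pool_size workers and each wave pays at least the 2-second sleep; Pool.map does not literally process in fixed waves, but the bound comes out close:

import math

def min_sleep_seconds(n_urls, pool_size, delay=2):
    # Lower bound from the per-request sleep alone; network and parsing time come on top.
    return math.ceil(n_urls / pool_size) * delay

print(min_sleep_seconds(50, 10))  # 10  -> observed ~23s total
print(min_sleep_seconds(50, 5))   # 20  -> observed ~44s total
print(min_sleep_seconds(50, 1))   # 100 -> the delay baked into the sequential run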
I hope you'll be using multiprocessing to speed up your next web scrapers. Give your feedback in the comments and let everyone know how it could be made even better. Thanks.
Writing scrapers is an interesting journey, but you can hit a wall if the site blocks your IP. As an individual, you can't afford expensive proxies either. Scraper API provides an affordable and easy-to-use API that lets you scrape websites without any hassle. You do not need to worry about getting blocked, because Scraper API uses proxies by default to access websites. On top of that, you do not need to worry about Selenium either, since Scraper API also provides a headless browser. I have also written a post about how to use it.
Click here to sign up with my referral link, or enter promo code adnan10 to get a 10% discount. If you do not get the discount, just let me know via email on my site and I will be sure to help you out.
As usual, the code is available on Github.
Planning to write a book about Web Scraping in Python. Click here to give your feedback
2 Comments
Goran
Check this out…. it’s even faster 🙂
import requests
from bs4 import BeautifulSoup
from time import sleep
from multiprocessing import Pool
from multiprocessing import cpu_count
from datetime import datetime

proxies = {'http': '64.183.94.45:8080'}

t1 = datetime.now()


# fetch the listing page (via the proxy) and collect the detail-page links
def get_listing(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    links = None
    r = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    if r.status_code == 200:
        html = r.text
        soup = BeautifulSoup(html, 'lxml')
        listing_section = soup.select('#offers_table table > tbody > tr > td > h3 > a')
        links = [link['href'].strip() for link in listing_section]
    return links


# parse a single item to get information
def parse(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    r = requests.get(url, headers=headers, timeout=10)
    sleep(2)
    info = []
    title_text = '-'
    location_text = '-'
    price_text = '-'
    images = '-'
    description_text = '-'
    if r.status_code == 200:
        print('Processing..' + url)
        html = r.text
        soup = BeautifulSoup(html, 'lxml')
        title = soup.find('h1')
        if title is not None:
            title_text = title.text.strip()
        location = soup.find('strong', {'class': 'c2b small'})
        if location is not None:
            location_text = location.text.strip()
        price = soup.select('div > .xxxx-large')
        if price:  # select() returns a list, so check that it is non-empty
            price_text = price[0].text.strip('Rs').replace(',', '')
        images = soup.select('#bigGallery > li > a')
        img = [image['href'].strip() for image in images]
        images = '^'.join(img)
        description = soup.select('#textContent > p')
        if description:
            description_text = description[0].text.strip()
    info.append(url)
    info.append(title_text)
    info.append(location_text)
    info.append(price_text)
    info.append(images)
    return ','.join(info)


cars_links = get_listing('https://www.olx.com.pk/cars/')

if __name__ == '__main__':
    pool = Pool(cpu_count() * 2)  # size the pool from the number of CPU cores
    with pool as p:
        records = p.map(parse, cars_links)
    if len(records) > 0:
        with open('data_parallel.csv', 'a+') as f:
            f.write('\n'.join(records))
    t2 = datetime.now()
    total = t2 - t1
    print("Scraping finished in: %s" % total)
Adnan
Better if you put it up as a gist and explain why it is so much faster? 🙂