How to speed up your Python web scraper by using multiprocessing
In this post I am going to show how changing a few lines of code can speed up your web scraper many times over. Keep reading!
If you remember the earlier post, I scraped the detail page of OLX. Usually you end up on such a page after going through a listing of entries. First, I will write a script without multiprocessing, we will see why that is not good, and then a scraper with multiprocessing.
OK, the goal is to access the listing page and fetch all the entry URLs from it. For the sake of simplicity I am not covering pagination. So, let’s get into the code.
Here’s the gist of accessing the listing from the URL and then parsing and fetching the information of each entry.
I have divided the script into two functions: get_listing(), which accesses the listing page, parses it, saves the entry URLs in a list and returns it, and parse(url), which takes an individual URL, parses the info and returns a comma-delimited string.
I then call get_listing() to get the list of links, use a mighty list comprehension to get a list of individual parsed entries, and save ALL the info in a CSV file.
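If the embedded gist doesn’t load for you, here is a minimal sketch of what the sequential version might look like. The listing URL, the link-matching pattern and the output filename are placeholders I made up for illustration, not the exact ones from the original script:

import time

import requests
from bs4 import BeautifulSoup

LISTING_URL = 'https://www.olx.com.pk/cars/'  # placeholder listing URL
HEADERS = {'User-Agent': 'Mozilla/5.0'}


def get_listing():
    # Access the listing page, parse it and return the entry URLs in a list.
    response = requests.get(LISTING_URL, headers=HEADERS)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = []
    for anchor in soup.select('a[href]'):
        href = anchor['href']
        if '/item/' in href:          # placeholder pattern for detail-page links
            links.append(href)
    return links


def parse(url):
    # Take an individual URL, parse the info and return a comma-delimited string.
    time.sleep(2)                     # the 2-second delay mentioned in the post
    response = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.title.get_text(strip=True) if soup.title else ''
    return '{},{}'.format(url, title)


if __name__ == '__main__':
    cars_links = get_listing()
    records = [parse(link) for link in cars_links]   # one URL at a time, sequentially
    with open('cars.csv', 'w') as f:
        f.write('\n'.join(records))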
Then I executed the script using the time command:
Adnans-MBP:~ AdnanAhmad$ time python listing_seq.py
which measures how long a process takes. On my computer it returned:
Hmm... around 6 minutes for 50 records. There’s a 2-second delay in each iteration, which adds up to roughly a minute and a half across the 50 requests, so the scraping itself takes about four and a half minutes.
Now I am going to change a few lines and make it run in parallel.
The first change is importing from a new Python module, multiprocessing:
from multiprocessing import Pool
From the documentation:
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.
So unlike threads, each worker runs in its own process with its own interpreter, which means the Global Interpreter Lock applies per process rather than being shared. It reminds me a bit of MapReduce, though it’s obviously not the same thing. Now note the following lines:
p = Pool(10) # Pool tells how many at a time
records = p.map(parse, cars_links)
p.terminate()
p.join()
Pool plays an important role here: it tells how many worker processes should be spawned at a time. Here I passed 10, which means 10 URLs will be processed at a time.
In the second line, the first argument to map() is the function that will be run in the worker processes and the second argument is the list of links. In our case there are 50 links, so there will be 5 batches: 10 URLs will be accessed and parsed in one go, and the results come back as a list.
The third line, terminate(), stops the worker processes; on *nix it sends SIGTERM and on Windows it uses TerminateProcess().
The last line, join(), in simple words waits for the worker processes to finish so they all end gracefully instead of lingering as zombies.
If you don’t use terminate() and join(), the ONLY issue you’d have is that many zombie or defunct processes would keep occupying your machine for no reason. I’m sure you definitely don’t want that.
By the way, in Python 3 you can also use Pool as a context manager, which terminates the pool for you when the block exits:
with Pool(10) as p:
records = p.map(parse, cars_links)
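Putting it all together, the main block of the parallel script might look like the following minimal sketch, reusing the get_listing() and parse() functions from before (the output filename is again just a placeholder; the if __name__ == '__main__' guard matters because multiprocessing may re-import the script when it starts the worker processes):

from multiprocessing import Pool

if __name__ == '__main__':
    cars_links = get_listing()
    with Pool(10) as p:                      # 10 worker processes, cleaned up on exit
        records = p.map(parse, cars_links)   # 10 URLs fetched and parsed at a time
    with open('cars_parallel.csv', 'w') as f:
        f.write('\n'.join(records))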
Alright, I ran this script:
Adnans-MBP:~ AdnanAhmad$ time python list_parallel.py
And the time it took:
Here there’s the same 2-second delay, but since the URLs were processed in parallel it only took around 22 seconds. I even reduced the pool size to 5 and it was still quite good!
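If you want to experiment with the pool size yourself, one quick way is to time the map() call from inside the script instead of using the external time command. This is just a sketch, and the sizes compared are arbitrary:

import time
from multiprocessing import Pool

if __name__ == '__main__':
    cars_links = get_listing()
    for size in (5, 10, 20):                 # arbitrary pool sizes to compare
        start = time.perf_counter()
        with Pool(size) as p:
            p.map(parse, cars_links)
        print('Pool({}) took {:.1f} seconds'.format(size, time.perf_counter() - start))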
I hope you’ll use multiprocessing to speed up your next web scrapers. Give your feedback in the comments and let everyone know how it could be made even better. Thanks.
As usual, the code is available on GitHub.