In earlier posts, here and here I discussed how to write a scraper and make it secure and foolproof. These things are good to implement but not good enough to make it fast and efficient. In this post, I am going to show how a change of a few lines of code can speed up your web scraper by X times. Keep reading! If you remember the post, I scraped the detail page of OLX. Now, usually, you end up to this page after going thru the listing of such entries. First, I will make a script without multiprocessing, we will see why is it not good and then a scraper…
-
-
How to develop an efficient web scraper in Python
Last week I was working on a web scraper for a client who needed to get around a million records from a real estate website. After a certain level, the scraper stopped working and the reason was I forgot to put certain checks as I was expecting the client would not go for that route but he DID! A few days back I shared a post about how to write a basic scraper in Python by using Beautifulsoup. In this post, I am going to discuss how to make your scraper more foolproof and user-friendly for non-technical people. 1- Check 200 status code It is always good to check the…
-
Write your first web scraper in Python with Beautifulsoup
Ok, so I am going to write the simplest web scraper in Python with the help of libraries like requests and BeautifulSoup. Before I move further, allow me to discuss what’s web/HTML scraping. What is Web scraping? According to Wikipedia: Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. This is accomplished by either directly implementing the Hypertext Transfer Protocol (on which the Web is based), or embedding a web browser. So use scraping technique to access the data from web pages and make it useful for various purposes (e.g: Analysis, aggregation etc). Scraping is not the only way to…
-
A subreddit for web scrappers
I recently realized that I am in love with Data Scraping. Thanks to Python and Beautifulsoup to get me into this. In past few months I have scraped data from sites like Amazon, Rakuten and NewEgg. I have started a subreddit for developers who love to scrap sites for fun( or for work). I named it Scraping the Web. You can join here and submit your entries. Happy Scraping!