  • Create your first Web scraper in Go with goQuery
    A beginner's tutorial on writing a web scraper for Yelp in Go.

    Planning to write a book about Web Scraping in Python. Click here to give your feedback. I have been covering web scraping on this blog for a long time, but the posts were mostly in Python: requests, Selenium, and the Scrapy framework are all Python-based. Scraping is not limited to a specific language, though. Any language that provides libraries for an HTTP client and an HTML parser can be used for web scraping, and Go is one of them. Go is a compiled, statically typed language and could be very beneficial to write efficient and…

  • Create your first web scraper with ScrapingBee API and Python
    Learn how to use a cloud-based scraping API to scrape web pages without getting blocked.

    In this post, I am going to discuss another cloud-based scraping tool that takes care of many of the issues you usually face while scraping websites. The tool is called ScrapingBee. What is ScrapingBee? If you visit their website, you will find something like this: "ScrapingBee API handles headless browsers and rotates proxies for you." As it suggests, it offers everything you need to deal with the issues you usually come across while writing your scrapers, especially the availability of proxies and headless scraping. No installation…

  • Develop AirBnb Parser in Python

    So I am starting a new scraping series called ScrapeTheFamous, in which I will parse some famous websites and discuss my development process. The posts will use Scraper API for fetching pages, which frees me from worrying about blocking and rendering dynamic sites, since Scraper API takes care of both. Anyway, the first post is about Airbnb. We will scrape some important data points from it: we will collect a list of rental URLs, then fetch the data and store it in JSON format. So let's start! The URL we…

  • Advanced Proxy Use for Web Scraping

    Guest Post by Vytautas Kirjazovas from Oxylabs.io. In the eyes of many, web scraping is an art. It is safe to say that the majority of web scraping enthusiasts have faced bans from websites more than once during their careers. Web scraping is a challenging task, and it's more common than you think to see your crawlers getting banned by websites. In this article, we'll talk about more advanced ways to use proxies for web scraping. There are some key components that you should take into account with web scraping to avoid getting banned too quickly: Set browser-like headers: a User-Agent that can be found in real life; a Referer header; other…
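
    As a rough illustration of the "browser-like headers" point in the excerpt above, here is a minimal sketch using only Python's standard library. The header values, the example URL, and the helper name build_request are assumptions made for illustration and do not come from the article. The request object is only built here, never sent.

```python
from urllib.request import Request

# Hypothetical header values for illustration; pick ones that match
# a browser you have actually observed in real traffic.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Referer": "https://www.google.com/",
}


def build_request(url, headers=None):
    """Return a Request carrying browser-like headers; nothing is sent."""
    return Request(url, headers=dict(headers or BROWSER_HEADERS))


# example.com is a placeholder target, not a site from the article.
req = build_request("https://example.com/")
```

    Passing req to urllib.request.urlopen would send it with these headers. Note that urllib normalizes header names internally, so the User-Agent value is read back with req.get_header("User-agent").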

  • ScrapeGen – Tool for generating Python scrapers
    A simple Python tool that generates a requests/bs4-based web scraper

    OK, so I was kind of bored last week and thought of coming up with something, even if it is useless. So I set aside a few hours and came up with ScrapeGen. What is ScrapeGen? It is a simple Python-based command-line tool that generates Python web scrapers based on rules and details entered in a YAML file. When it runs, it generates a new file. The rules are turned into separate functions, which are then called from the main parsing method. View the Demo: Why is it needed? Such a tool could be good for companies and individuals who write many parsers and hardcode the rules within the .py files. Imagine the rule…

  • Scraping dynamic websites using Scraper API and Python
    Learn how to efficiently and easily scrape modern JavaScript-enabled websites or Single Page Applications without installing a headless browser and Selenium

    In the last post of the scraping series, I showed how you can use Scraper API to scrape websites through proxies, which reduces your chance of getting blocked. Today I am going to show how you can use Scraper API to scrape websites that use AJAX to render data with the help of JavaScript, Single Page Applications (SPAs), or websites built with frameworks like ReactJS, AngularJS, or VueJS. I will be working on the same code I wrote in the introductory post. Let's work on a simple example. There is a website called HttpBin that tells you your IP. If you load it in a browser, it will tell your…

  • Create your first web scraper with Scraper API and Python

    Recently I came across a tool that takes care of many of the issues you usually face while scraping websites. The tool is called Scraper API, and it provides an easy-to-use REST API for scraping different kinds of websites (simple, JS-enabled, CAPTCHA-protected, etc.) with ease. Before I proceed further, allow me to introduce Scraper API. What is Scraper API? If you visit their website, you'd find their mission statement: "Scraper API handles proxies, browsers, and CAPTCHAs, so you can get the HTML from any web page with a simple API call!" As it suggests, it is offering you all the things to deal with the issues…

  • Schedule web scrapers with Apache Airflow

    This post is part of the Data Engineering Series. In the previous post, I discussed Apache Airflow and its basic concepts, configuration, and usage. In this post, I am going to discuss how you can schedule your web scrapers with the help of Apache Airflow. I will reuse the example from the Apache Kafka and Elasticsearch posts, that is, scraping https://allrecipes.com, because the focus here is on Airflow rather than on the scraper itself. In case you want to learn about scraping, you may check the entire series here. So, we will work on a workflow consisting of these tasks: parse_recipes: it parses individual recipes. download_image: it downloads the recipe image. store_data: Finally store image…

  • 5 strategies to write unblock-able web scrapers in Python

    People who read my posts in the scraping series often contact me to ask how they can write scrapers that don't get blocked. It is very difficult to write a scraper that NEVER gets blocked, but you can extend the life of your web scraper by implementing a few strategies. Today I am going to discuss them. User-Agent: The very first thing you need to take care of is setting the user-agent. A user agent is software that acts on behalf of the user, and its User-Agent string tells the server which web browser the user is using to visit the website. Many websites do not let you view the content if…

  • Implementing beanstalk to create a scalable web scraper

    Image Credit (http://blog.hqc.sk.ca/wp-content/uploads/2012/12/Queue-2012-12-11.jpg) Queues are often used to make applications scalable by offloading data and processing it later. In this post I am going to use the Beanstalk queue management system in Python. Before I get into the real task, allow me to give a brief intro to Beanstalk. What is Beanstalk? From the official website: "Beanstalk is a simple, fast work queue. Its interface is generic, but was originally designed for reducing the latency of page views in high-volume web applications by running time-consuming tasks asynchronously." It runs as a daemon called beanstalkd, which you can run on *nix-based machines. Since I am on OSX, I called brew install beanstalkd to install it. Once…
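
    The offload-now, process-later idea in the excerpt above can be sketched without a Beanstalk server. The example below deliberately swaps in Python's standard-library queue.Queue as an in-process stand-in; it is not the Beanstalk client API, and the function names and the placeholder "scraped" step are assumptions made for illustration.

```python
import queue
import threading


def worker(jobs, results):
    """Consume URLs from the queue until a None sentinel arrives."""
    while True:
        url = jobs.get()
        if url is None:
            jobs.task_done()
            break
        # Placeholder for the real fetch-and-parse step.
        results.append(f"scraped {url}")
        jobs.task_done()


def run(urls):
    """Producer side: enqueue the URLs and let one worker drain them."""
    jobs = queue.Queue()
    results = []
    consumer = threading.Thread(target=worker, args=(jobs, results))
    consumer.start()
    for url in urls:
        jobs.put(url)
    jobs.put(None)  # sentinel telling the worker to stop
    jobs.join()     # block until every queued item is marked done
    consumer.join()
    return results
```

    With a real work queue such as Beanstalk, the producer and the worker would typically be separate processes talking to the queue server, so queued jobs would not depend on the producer staying alive.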