Advanced Proxy Use for Web Scraping

Guest Post by Vytautas Kirjazovas from Oxylabs.io

In the eyes of many, web scraping is an art. Most web scraping enthusiasts have been banned by websites more than once during their careers. Web scraping is a challenging task, and it’s more common than you think to see your crawlers getting banned. In this article, we’ll talk about more advanced ways to use proxies for web scraping.

There are some key components that you should take into account with web scraping to avoid getting banned too quickly:

  1. Set browser-like headers:
    • A User-Agent string that real browsers actually send.
    • A Referer header.
    • The other default request headers a browser sends.
  2. Introduce delays between scraping jobs.
  3. Use appropriate proxies for each scraping scenario.
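
The first two points can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended configuration: the header values, the helper names `browser_like_headers` and `polite_pause`, and the 2 to 5 second range are all assumptions made for the example.

```python
import random
import time


def browser_like_headers(referer: str) -> dict[str, str]:
    """Return request headers resembling those a desktop browser sends."""
    # Hypothetical values: copy the headers your own browser sends
    # (visible in its developer tools) instead of relying on these.
    return {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "Referer": referer,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }


def polite_pause(min_seconds: float = 2.0, max_seconds: float = 5.0) -> float:
    """Sleep for a random time so requests are not evenly spaced."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

A scraper would pass the result of `browser_like_headers(...)` as the headers of each request and call `polite_pause()` between requests.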

Proxies

Proxies have proved to be an essential component of web scraping. A proxy is a third-party service used to re-route your requests between source and destination. When you use a proxy, the website you visit can no longer identify your IP address and instead sees the IP address of the proxy. This only applies if the re-routing proxy is configured correctly and does not leak information.

How are proxies made?

There is no rocket science behind it, at least when it comes to data center proxies. You can easily make a proxy yourself. All you need to do is get a VPS or dedicated server and set up proxy software. The most common choice for Unix systems is Squid. Bear in mind that the default Squid configuration will not perform as well as expected, as it’s likely to leak information about itself. Some tweaking is definitely required.

Is getting one proxy IP enough?

No. Many websites track how many requests they receive from a single IP address, and if that number exceeds what a human could plausibly send, they simply ban that IP, or worse, start delivering false content. For example, e-commerce websites and airlines, instead of returning 503 or similar HTTP errors to your queries, quite often return inaccurate prices for their products and services. At first, you might not even notice, since it can be rather difficult to trace.

What if you want to gather thousands of pages of the site without getting banned?

The answer is to create multiple identities. In other words, you should use various IP addresses. By using multiple IP addresses, either one at a time in random order or simultaneously, and with a delay between the requests sent from each IP, you can scrape a good amount of data without raising suspicion at the data source.
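
One way to picture this is a small pool that hands out proxies at random but refuses to reuse any single proxy until its delay has passed. The sketch below is only an illustration: the class name `ProxyPool`, the 5 second default, and the injectable `clock` parameter are assumptions, not part of any library.

```python
import random
import time


class ProxyPool:
    """Hand out proxies at random, never reusing one before its delay has passed."""

    def __init__(self, proxies, min_delay=5.0, clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._min_delay = min_delay
        self._clock = clock
        # Earliest time at which each proxy may be used again.
        self._ready_at = {proxy: float("-inf") for proxy in proxies}

    def acquire(self):
        """Return a rested proxy, or None if every proxy is still cooling down."""
        now = self._clock()
        rested = [p for p, ready in self._ready_at.items() if ready <= now]
        if not rested:
            return None
        proxy = random.choice(rested)
        self._ready_at[proxy] = now + self._min_delay
        return proxy
```

When `acquire()` returns `None`, the caller can simply wait and try again, which keeps the per-IP request rate bounded no matter how many workers share the pool.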

How to get a large pool of IP addresses and how to choose a service provider?

Well, you can google ‘free proxy list,’ and you’ll find some links to GitHub repositories. Then choose an actively managed IP pool and find the list of free proxy IP addresses that you can use. In fact, multiple websites provide free proxy lists, but be cautious, as these proxies are usually untested, and many have been dead and unused for months.

That said, if you’re serious about web scraping, you should choose a leading web scraping service provider. Reputable proxy service providers are the way to go for the reliability and scalability of web scraping tasks. To add, it saves a lot of time in setting up the infrastructure for web scraping in the first place. These proxies almost flawlessly simulate organic users’ behavior, detect and manage bans, and thus assist in a complete web scraping solution. Some providers also have additional tools and features that make it easier to integrate, such as in-house proxy rotation.

Keep in mind that a more prominent website won’t overlook multiple IPs making repeated requests to the server for a long time. Nowadays, sites have become much more advanced, and a lot of planning and preparation goes into developing an infrastructure for successful web scraping. We suggest using ready-to-deploy web scraping infrastructures so you don’t have to worry about your IPs getting banned, especially if you are not comfortable maintaining extensive scraping infrastructure. For instance, the leading proxy service provider Oxylabs can manage IP rotation on your behalf, leaving you to work with the collected data instead of focusing on data gathering procedures.

Furthermore, Oxylabs have an unmatched scraping-as-a-service solution within the market – Real-Time Crawler – which excels in effortlessly capturing web data in a hassle-free manner. All you need to do is indicate which pages you want to scrape, be it online marketplaces, e-commerce sites, search engines, or any URL in general. In fact, it could be a more cost-effective solution than building an in-house data gathering solution supported by proxies. And, it’s a given that it will save you time, allowing you to place focus on data analysis to capture actionable insights.

However, if you plan to create your own custom web scraper supported by proxies, I recommend keeping in mind the following vital components of effective proxy management:

Proxy rotation

With a single proxy server, every request reaches the web server from the proxy’s IP address instead of our own, which hides our identity. However, even with a large pool of proxies, the website’s server might still track you by monitoring repeated requests from the same pool of IPs. Hence, it’s important to rotate these proxies after a predefined interval of time.
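
Interval-based rotation can be sketched as follows. The class name `TimedRotator`, the 60 second default, and the injectable `clock` are assumptions made for the illustration.

```python
import itertools
import time


class TimedRotator:
    """Keep one proxy active and move to the next after `interval` seconds."""

    def __init__(self, proxies, interval=60.0, clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._interval = interval
        self._clock = clock
        self._current = next(self._cycle)
        self._switched_at = clock()

    def current(self):
        """Return the active proxy, rotating first if the interval has elapsed."""
        now = self._clock()
        if now - self._switched_at >= self._interval:
            self._current = next(self._cycle)
            self._switched_at = now
        return self._current
```

The rotator walks through the list in the order given, so shuffling the list before passing it in avoids handing the target site a predictable sequence.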

With the Python requests library, you can configure proxies by passing a dictionary to the proxies argument.
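
A minimal sketch of that call, assuming the third-party `requests` package is installed. The helper names `build_proxies` and `fetch` are hypothetical, as is any proxy address you pass in.

```python
import requests


def build_proxies(proxy_url: str) -> dict[str, str]:
    """Map both URL schemes to one proxy, in the shape `requests` expects."""
    return {"http": proxy_url, "https": proxy_url}


def fetch(url: str, proxy_url: str, timeout: float = 10.0) -> requests.Response:
    """GET `url` through the given proxy."""
    return requests.get(url, proxies=build_proxies(proxy_url), timeout=timeout)
```

For example, `fetch("https://example.com/", "http://203.0.113.10:3128")` would send the request through the placeholder proxy at 203.0.113.10, which you would replace with an address from your own pool.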

You can also use a Scrapy middleware called scrapy-rotating-proxies for your proxy rotation. Here’s sample code for using scrapy-rotating-proxies:

pip install scrapy-rotating-proxies

Required code:
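
A minimal sketch of the project settings, following the setup described in the scrapy-rotating-proxies README. The proxy addresses are placeholders.

```python
# settings.py of your Scrapy project

# Placeholder proxies: replace with addresses from your own pool.
ROTATING_PROXY_LIST = [
    "203.0.113.10:8000",
    "203.0.113.11:8031",
]

DOWNLOADER_MIDDLEWARES = {
    # Picks a proxy for each request and tracks which ones are alive.
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    # Marks proxies as dead when a response looks like a ban.
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```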

After this, all requests will be proxied using one of the proxies from the ROTATING_PROXY_LIST setting.

4 pillars for successful web scraping with proxies

  1. Automate the free proxies: You can use free proxy lists, but remember that free proxies are temporary, and they frequently expire. So, you must automate the process of updating your expired proxies and fetch another list of fresh proxies. Or, as mentioned earlier, if you are serious about your web scraping tasks, you should use one of the leading proxy providers in the market.
  2. Randomize the proxies: Don’t use proxies in sequential order. Modern anti-scraping plugins are quick to spot a pattern in the proxy addresses. Consider the example below:
    • An example of serial proxies that are too easy for anomaly detection:

    192.1.1.1
    192.1.1.2
    192.1.1.3
    192.1.1.4
    192.2.1.1

    • Instead, use your proxies in the following order:

    192.1.1.4
    192.1.1.1
    192.2.1.1
    192.1.1.3
    192.1.1.2

  3. User-agent: User-Agent is an HTTP request header sent with the requests you make to a server. It’s a characteristic string that gives information about the operating system, browser, and device type. Every request that you make with a browser contains user-agent information in the format shown below:

    User-Agent: Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>

    If too many requests arrive with the same user-agent header, the activity looks suspicious. Before you know it, your IPs will get banned. One workaround is to rotate user-agents, sending a randomly chosen one with each request. There are several resources available online to get user-agents for our use:
    1. https://developers.whatismybrowser.com/useragents/explore/software_name/chrome/
    2. https://github.com/tamimibrahim17/List-of-user-agents

    Once you’ve gathered a list of user-agents, you can rotate them with Python’s requests library or with the Scrapy framework, which has a middleware called Scrapy-UserAgents for this purpose. It is highly recommended to refer to the official documentation to understand the correct usage.

    Keep in mind that the User-Agent should be up to date. Using a heavily outdated User-Agent will, without a doubt, raise some eyebrows.

  4. Category of proxies: Depending on how they’re hosted, we broadly classify proxies into three categories: Mobile, Residential, and Datacenter IPs.
    1. Mobile IPs: As the name suggests, these are the IPs of private mobile devices, which are very inconspicuous and immensely difficult for anti-scraping agents to detect. However, this added stealth comes at a high cost, and mobile IPs are rather challenging to acquire. Also, in most cases, using mobile proxies is overkill: the websites you will be trying to scrape are unlikely to have anti-bot measures so advanced that mobile or even residential IPs are required. Finally, it’s important to stress that mobile IPs are less reliable because of where they come from, and they are also considerably slower than regular proxies.
    2. Residential IPs: Residential IPs are IP addresses allocated to regular internet users by Internet Service Providers. Residential proxies reflect typical user behavior and are an excellent option for almost any web scraping job. They’re cheaper than mobile IPs and are readily available. In our opinion, this is the best type of proxy: fast, reliable, and extremely difficult to identify.
    3. Datacenter IPs: Datacenter IPs are IPs allocated to servers in a data center. These IPs are very stable and hardly change, which makes them less flexible. Another problem is that it’s challenging to get a very diverse set of IPs, since most proxies are likely to come from one large IP range, which makes them easier to identify when deployed in scraping jobs. However, they’re very cheap, and with the right tools, data center proxies can prove to be effective for most web scraping jobs.
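
The first three pillars can be sketched together in Python. This is an illustration under stated assumptions, not a complete solution: the proxy addresses and user-agent strings are placeholders, and the helper names `prune_dead` and `request_settings` are hypothetical.

```python
import random

# Placeholder values: supply your own proxy pool and current user-agent strings.
PROXIES = [
    "http://203.0.113.10:3128",
    "http://198.51.100.24:3128",
    "http://192.0.2.77:3128",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def prune_dead(proxies, is_alive):
    """Pillar 1: keep only proxies that pass the caller's health check."""
    return [proxy for proxy in proxies if is_alive(proxy)]


def request_settings(proxies, user_agents, rng=random):
    """Pillars 2 and 3: choose a random proxy and user-agent for one request."""
    proxy = rng.choice(proxies)
    return {
        "proxies": {"http": proxy, "https": proxy},
        "headers": {"User-Agent": rng.choice(user_agents)},
    }
```

With the requests library, the returned dictionary can be unpacked straight into a call such as `requests.get(url, timeout=10, **request_settings(PROXIES, USER_AGENTS))`. The health check passed to `prune_dead` is left to the caller, for instance a function that tries a test request through the proxy and returns whether it succeeded.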

Conclusion

If you rely on web scraping in your day-to-day tasks, choosing Oxylabs, which can provide ongoing support for your projects, should be a no-brainer. Nobody likes it when proxies get blocked during a more intense web scraping session, and having experts on hand will help to avoid headaches. Mention the promo code Adnan100 to get a $100 discount on any pricing plan.

