Create Amazon Scraper in Python using Scraper API
Learn how to create an Amazon scraper in Python to scrape product details like price, ASIN, etc.

In this post of ScrapingTheFamous, I am going to write a scraper that will scrape data from Amazon. I do not need to tell you what Amazon is. You are here because you already know about it 🙂

So, we are going to write two different scripts: the first, fetch.py, will fetch the URLs of individual listings and save them in a text file. The second, parse.py, will have a function that takes an individual listing URL, scrapes the data, and returns it as a record ready to be saved in JSON format.

I will be using the Scraper API service for fetching pages, which frees me from all worries about IP blocking and rendering dynamic sites, since it takes care of everything.
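The core pattern is simple: instead of requesting Amazon directly, you send the target URL to the Scraper API endpoint along with your API key, and it returns the page HTML through its proxy pool. A minimal sketch of that pattern (YOUR_API_KEY is a placeholder for your own key):

import requests

# Scraper API fetches the target URL for you, rotating proxies behind the scenes.
payload = {'api_key': 'YOUR_API_KEY', 'url': 'https://www.amazon.com/'}
r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)
print(r.status_code, len(r.text))

Both scripts below follow exactly this shape.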

The first script fetches the listings of a category. So let's do it!

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    headers = {
        'authority': 'www.amazon.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'rtt': '600',
        'downlink': '1.5',
        'ect': '3g',
        'upgrade-insecure-requests': '1',
        'dnt': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://google.com',
        'accept-language': 'en-US,en;q=0.9,ur;q=0.8,zh-CN;q=0.7,zh;q=0.6',
    }

    API_KEY = None
    links_file = 'links.txt'
    links = []

    # Strip the trailing newline so the key is passed cleanly as a query param.
    with open('API_KEY.txt', encoding='utf8') as f:
        API_KEY = f.read().strip()

    URL_TO_SCRAPE = 'https://www.amazon.com/s?i=electronics&rh=n%3A172541%2Cp_n_feature_four_browse-bin%3A12097501011&lo=image'

    payload = {'api_key': API_KEY, 'url': URL_TO_SCRAPE, 'render': 'false'}

    r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)

    if r.status_code == 200:
        text = r.text.strip()
        soup = BeautifulSoup(text, 'lxml')
        # Product title links sit inside <h2> tags on the results page.
        links_section = soup.select('h2 > .a-link-normal')
        for link in links_section:
            url = 'https://amazon.com' + link['href']
            links.append(url)

    if len(links) > 0:
        with open(links_file, 'a+', encoding='utf8') as f:
            f.write('\n'.join(links))

        print('Links stored successfully.')

So here is the script. I picked the electronics category; you may choose any category you want. I also arranged the relevant headers. You can either pick them manually via the Chrome Inspector or use https://curl.trillworks.com/ to generate them for you by copying the cURL version of the request.

I am calling the ScraperAPI endpoint, passing both the api_key and the target URL in the payload. I am using the h2 > .a-link-normal selector because there are many .a-link-normal links that are not required, so the h2 > prefix makes sure only the product title links are picked.
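To see why the h2 > prefix matters, here is a tiny illustration on a made-up HTML fragment (the markup is hypothetical; only the class name mirrors Amazon's):

from bs4 import BeautifulSoup

# One title link inside an <h2>, one unrelated link outside it.
html = '''
<h2><a class="a-link-normal" href="/Some-Product/dp/B000000000">Product title</a></h2>
<a class="a-link-normal" href="/gp/help">Help page</a>
'''
soup = BeautifulSoup(html, 'lxml')

print(len(soup.select('.a-link-normal')))       # 2 -- includes the noise
print(len(soup.select('h2 > .a-link-normal')))  # 1 -- only the title link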

Once we have the links, we save them in a text file.

The next part of the post is about parsing the product info:

import requests
from bs4 import BeautifulSoup


def parse(url):
    record = {}
    headers = {
        'authority': 'www.amazon.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'rtt': '1000',
        'downlink': '1.5',
        'ect': '3g',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-US,en;q=0.9,ur;q=0.8,zh-CN;q=0.7,zh;q=0.6',
    }

    # API_KEY is a module-level global, set in the __main__ block below.
    payload = {'api_key': API_KEY, 'url': url, 'render': 'false'}

    r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)

    if r.status_code == 200:
        data = r.text.strip()
        soup = BeautifulSoup(data, 'lxml')
        # Defaults so the record stays valid even if a section is missing.
        title = price = availability = features = asin = None
        title_section = soup.select('#productTitle')
        price_section = soup.select('#priceblock_ourprice')
        availability_section = soup.select('#availability')
        features_section = soup.select('#feature-bullets')
        # The canonical <link> URL ends with the product's ASIN.
        asin_section = soup.find('link', {'rel': 'canonical'})

        if title_section:
            title = title_section[0].text.strip()

        if price_section:
            price = price_section[0].text.strip()

        if availability_section:
            availability = availability_section[0].text.strip()

        if features_section:
            features = features_section[0].text.strip()
        if asin_section:
            asin_url = asin_section['href']
            asin = asin_url.split('/')[-1]

        record = {'title': title, 'price': price, 'availability': availability, 'asin': asin, 'features': features}

    return record


if __name__ == '__main__':
    API_KEY = None

    with open('API_KEY.txt', encoding='utf8') as f:
        API_KEY = f.read().strip()

    result = parse('https://www.amazon.com/Bambino-Seconds-Stainless-Japanese-Automatic-Leather/dp/B07B49QG1H/')
    print(result)

Pretty straightforward. I fetched the title, scraped the Amazon ASIN, and a few other fields. You can scrape many other things, like Amazon reviews, as well. The record comes back as a Python dict, which is trivial to dump as JSON.
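To tie both scripts together and actually write the JSON file, a small driver could loop over links.txt and dump all records at once. A sketch, assuming fetch.py has already produced links.txt and parse.py sits in the same folder:

import json

import parse  # the parse.py script above

# parse() reads API_KEY as a module-level global, so set it before calling.
with open('API_KEY.txt', encoding='utf8') as f:
    parse.API_KEY = f.read().strip()

# Read the listing URLs collected by fetch.py.
with open('links.txt', encoding='utf8') as f:
    links = [line.strip() for line in f if line.strip()]

records = [parse.parse(link) for link in links]

# Store all scraped records in JSON format.
with open('products.json', 'w', encoding='utf8') as f:
    json.dump(records, f, indent=2)

print('Saved %d records.' % len(records))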

Conclusion

In this post, you learned how you can scrape Amazon data easily by using Scraper API in Python. You can enhance this script to suit your needs, for example turning it into a price monitoring script or an ASIN scraper.

Writing scrapers is an interesting journey, but you can hit a wall if the site blocks your IP. As an individual, you can't afford expensive proxies either. Scraper API provides an affordable and easy-to-use API that lets you scrape websites without any hassle. You do not need to worry about getting blocked, because Scraper API uses proxies by default to access websites. On top of that, you do not need to worry about Selenium either, since Scraper API also provides a headless browser facility. I have also written a post about how to use it.
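For JavaScript-heavy pages, the only change needed is flipping the render flag in the payload; Scraper API then loads the page in its headless browser before returning the HTML:

# Same request shape as before, but rendered in a headless browser first.
payload = {'api_key': API_KEY, 'url': URL_TO_SCRAPE, 'render': 'true'}
r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)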

Click here to sign up with my referral link, or enter the promo code adnan10 and you will get a 10% discount. In case you do not get the discount, just let me know via email on my site and I will surely help you out.

If you like this post, then you should subscribe to my blog for future updates.
