Develop an AliExpress Scraper in Python with Scraper API

This is another post in ScrapeTheFamous, a series in which I parse some famous websites and discuss my development process. The posts use Scraper API for fetching and parsing, which frees me from all worries about blocking and rendering dynamic sites since Scraper API takes care of everything.

In this post, we are going to scrape AliExpress, a Chinese B2C e-commerce portal.

The script I am going to build consists of two parts, or rather, two functions: fetch and parse. The fetch function will accept a category URL and return the links of all individual items on that page, while parse will take an individual entry's URL and return a few data points in JSON format. So, let's begin!

import requests
from bs4 import BeautifulSoup


def fetch(url):
    links = []
    headers = {
        'authority': 'www.aliexpress.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"macOS"',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-US,en;q=0.9,ur;q=0.8,zh-CN;q=0.7,zh;q=0.6',
    }

    # Read the Scraper API key, stripping any trailing newline.
    with open('API_KEY.txt', encoding='utf8') as f:
        API_KEY = f.read().strip()

    URL_TO_SCRAPE = url
    BASE_URL = 'https://www.aliexpress.com'

    # render=true asks Scraper API to execute the page's JavaScript before returning the HTML.
    payload = {'api_key': API_KEY, 'url': URL_TO_SCRAPE, 'render': 'true'}

    r = requests.get('http://api.scraperapi.com', params=payload, timeout=60, headers=headers)

    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')
        # Product links on the category page carry the _3KNwG class.
        links_section = soup.select('._3KNwG')
        for link in links_section:
            links.append(BASE_URL + link['href'])

    return links

The API_KEY.txt file contains the Scraper API key. After fetching the page, I select all elements that have the class ._3KNwG, extract their links, and store them in a list. You may store them in a file or database instead; it's up to you (see the sketch right after this paragraph).
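If you would rather persist the links than keep them in memory, a minimal sketch could look like the snippet below. The helper and the file name links.txt are my own assumptions, not part of the original script; swap in a database write if that suits you better.

def save_links(links, file_name='links.txt'):
    # Hypothetical helper: writes one link per line to a plain text file.
    with open(file_name, 'w', encoding='utf8') as f:
        f.write('\n'.join(links))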

The second function is parse(), shown below:

def parse(url):
    title = ''
    price = ''
    image = ''
    store_name = ''
    headers = {
        'authority': 'www.aliexpress.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"macOS"',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-US,en;q=0.9,ur;q=0.8,zh-CN;q=0.7,zh;q=0.6',
    }

    # Read the Scraper API key, stripping any trailing newline.
    with open('API_KEY.txt', encoding='utf8') as f:
        API_KEY = f.read().strip()

    # country_code pins the proxy location to Pakistan; keep_headers forwards my custom headers.
    payload = {'api_key': API_KEY, 'url': url, 'render': 'true', 'country_code': 'pk', 'keep_headers': 'true'}

    r = requests.get('http://api.scraperapi.com', params=payload, timeout=60, headers=headers)

    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')
        title_section = soup.select('.product-title-text')
        if title_section:
            title = title_section[0].text.strip()

        price_section = soup.select('.uniform-banner-box-price')
        if price_section:
            price = price_section[0].text.strip()

        image_section = soup.select('.image-viewer img')
        if image_section:
            image = image_section[0]['src']

        store_section = soup.select('.shop-name a')
        if store_section:
            store_name = store_section[0].text.strip()
    record = {'title': title, 'price': price, 'image': image, 'store': store_name}
    return record

You need to pay attention to the payload variable, where I introduced two new parameters: country_code and keep_headers. I set country_code because AliExpress is available in multiple countries. Since Scraper API uses proxies by default, it was picking a proxy IP from a random country and was not displaying prices in local Pakistani currency; by setting the country code I force Scraper API to use Pakistan-based proxies. It is quite a useful feature because many websites only allow local traffic and block visitors from abroad. The keep_headers parameter tells Scraper API to forward my custom headers instead of substituting its own. The rest of the code is self-explanatory. The reason I set render to true is that AliExpress generates its content via JavaScript.
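To tie the two functions together, a minimal driver could look like the sketch below. The category URL is a hypothetical placeholder; any AliExpress category listing page should work in its place.

import json

if __name__ == '__main__':
    # Hypothetical category URL; substitute a real AliExpress category page.
    category_url = 'https://www.aliexpress.com/category/100003070/men-clothing.html'

    for link in fetch(category_url):
        record = parse(link)
        # parse() returns a dict, so dump it as JSON for storage or display.
        print(json.dumps(record))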

Conclusion

So in this post, you learned how to scrape e-commerce sites like AliExpress with Scraper API to get the data you need. You do not have to worry about proxy IPs, nor do you have to pay hundreds of dollars for them, which matters especially when you are an individual or working at a startup. Companies spend hundreds of dollars a month on proxy IPs alone.

Oh, and if you sign up here with my referral link or enter the promo code adnan10, you'll get a 10% discount. If you do not get the discount, just let me know via email on my site and I'll be sure to help you out.

If you like this post, you should subscribe to my blog for future updates.
