This is another post in the ScrapeTheFamous series, in which I parse some famous websites and discuss my development process. The posts use Scraper API for fetching, which frees me from worrying about blocking and about rendering dynamic sites, since Scraper API takes care of both.
In this post, we are going to scrape AliExpress, a Chinese B2C portal for online shopping.
The script I am going to write consists of two parts, or rather, two functions: fetch and parse. fetch accepts a category URL and returns the links of all individual items on that page, while parse takes an individual item's URL and returns a few data points in JSON format. So, let's begin!
import requests
from bs4 import BeautifulSoup


def fetch(url):
    links = []
    headers = {
        'authority': 'www.aliexpress.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"macOS"',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-US,en;q=0.9,ur;q=0.8,zh-CN;q=0.7,zh;q=0.6',
    }

    # Read the Scraper API key from a local file.
    # .strip() guards against a trailing newline in the key file.
    with open('API_KEY.txt', encoding='utf8') as f:
        API_KEY = f.read().strip()

    URL_TO_SCRAPE = url
    BASE_URL = 'https://www.aliexpress.com'

    # Route the request through Scraper API with JS rendering enabled.
    payload = {'api_key': API_KEY, 'url': URL_TO_SCRAPE, 'render': 'true'}
    r = requests.get('http://api.scraperapi.com', params=payload, timeout=60, headers=headers)

    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')
        # Each product link on the category page carries this class.
        links_section = soup.select('._3KNwG')
        for link in links_section:
            links.append(BASE_URL + link['href'])

    return links
The API_KEY.txt file contains the Scraper API key. After the request succeeds, I select all elements that carry the class ._3KNwG, build absolute URLs from them, and store them in a list(). You may store them in a file or database instead; that part is up to you.
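If you would rather persist the links than keep them in memory, a minimal sketch could look like the following. The category URL and the links.txt filename are placeholders I made up for illustration:

# Minimal sketch: save the fetched links to a text file, one URL per line.
# The category URL below is a made-up placeholder, not a real AliExpress category.
category_url = 'https://www.aliexpress.com/category/000000/example.html'
links = fetch(category_url)
with open('links.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(links))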
The second function is parse(), and it looks like this:
def parse(url):
    title = ''
    price = ''
    image = ''
    store_name = ''
    record = {}
    headers = {
        'authority': 'www.aliexpress.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"macOS"',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-US,en;q=0.9,ur;q=0.8,zh-CN;q=0.7,zh;q=0.6',
    }

    with open('API_KEY.txt', encoding='utf8') as f:
        API_KEY = f.read().strip()

    # country_code pins the proxy location; keep_headers forwards our own headers.
    payload = {'api_key': API_KEY, 'url': url, 'render': 'true', 'country_code': 'pk', 'keep_headers': 'true'}
    r = requests.get('http://api.scraperapi.com', params=payload, timeout=60, headers=headers)

    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')

        # Each field is guarded so a missing element leaves the default ''.
        title_section = soup.select('.product-title-text')
        if title_section:
            title = title_section[0].text.strip()

        price_section = soup.select('.uniform-banner-box-price')
        if price_section:
            price = price_section[0].text.strip()

        image_section = soup.select('.image-viewer img')
        if image_section:
            image = image_section[0]['src']

        store_section = soup.select('.shop-name a')
        if store_section:
            store_name = store_section[0].text.strip()

        record = {'title': title, 'price': price, 'image': image, 'store': store_name}

    return record
Pay attention to the payload variable, where I introduced two new parameters: country_code and keep_headers. I set country_code because AliExpress is available in multiple countries: since Scraper API uses proxies by default, it was picking a proxy IP from a random country and therefore not displaying the price in local Pakistani currency. By setting the country I force Scraper API to use Pakistan-based proxies. It is quite a useful feature, because many websites serve local traffic only and block global visitors. The rest of the code is self-explanatory. I set render to true because AliExpress generates its content via JavaScript.
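To tie everything together, a small driver can feed each link returned by fetch() into parse() and dump the records as JSON, as promised at the start. This is only a sketch; the category URL and the items.json output filename are placeholders of my own:

import json
import time

if __name__ == '__main__':
    # Placeholder category URL; substitute a real AliExpress category page.
    category_url = 'https://www.aliexpress.com/category/000000/example.html'
    records = []
    for link in fetch(category_url):
        record = parse(link)
        if record:
            records.append(record)
        time.sleep(1)  # small pause between requests to stay gentle
    # Write all records out as JSON.
    with open('items.json', 'w', encoding='utf8') as f:
        json.dump(records, f, indent=2, ensure_ascii=False)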
Conclusion
So in this post, you learned how to scrape e-commerce sites like AliExpress and extract the data you want by using Scraper API. You do not have to worry about proxy IPs, nor do you have to pay hundreds of dollars for them, which matters especially when you are an individual or working at a startup; companies spend hundreds of dollars a month on proxy IPs alone.
Oh, and if you sign up here with my referral link or enter the promo code adnan10, you'll get a 10% discount. If you do not get the discount, just let me know via email on my site and I'll be sure to help you out.