I am starting a new scraping series called ScrapeTheFamous, in which I will parse some famous websites and discuss my development process. The posts will use Scraper API to fetch the pages, which frees me from worrying about getting blocked or rendering dynamic sites, since Scraper API takes care of both.
Anyway, the first post is about Airbnb. We will scrape a list of rental listings, pull a few important data points from each one, and store the data in JSON format. So let’s start!
The URL we will be using is: https://www.airbnb.com/s/Karachi--Sindh--Pakistan/homes?query=Karachi%2C%20Sindh%2C%20Pakistan
Above is the screenshot of the listings for the city of Karachi. You can pick as much data as you want, but for this example I am only picking the title, price, number of guests, listing URL, and number of bedrooms.
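If you want to try a different city, the search URL above can be assembled in code. The sketch below simply mirrors the shape of the Karachi URL; the function name `build_search_url` is hypothetical, and the assumption that Airbnb uses the same pattern for every location is mine, inferred from this single example.

```python
from urllib.parse import quote

BASE = "https://www.airbnb.com/s"


def build_search_url(city: str, region: str, country: str) -> str:
    """Build a search URL shaped like the Karachi example (pattern is assumed)."""
    parts = [city, region, country]
    # Path slug: parts joined with a double hyphen, e.g. Karachi--Sindh--Pakistan
    slug = "--".join(parts)
    # Query value: "City, Region, Country" percent-encoded (", " becomes %2C%20)
    query = quote(", ".join(parts), safe="")
    return f"{BASE}/{slug}/homes?query={query}"


print(build_search_url("Karachi", "Sindh", "Pakistan"))
```

Names that contain spaces may need extra handling in the path slug; this sketch does not attempt that.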
    import json

    import requests
    from bs4 import BeautifulSoup

    # get_price(), get_guests() and get_bed() are defined earlier in the same file.

    if __name__ == '__main__':
        price = '-'
        bedroom = '-'
        guests = '-'
        url = '-'
        title = '-'
        records = []

        with open('API_KEY.txt', encoding='utf8') as f:
            # strip() drops a trailing newline that would corrupt the key
            API_KEY = f.read().strip()

        URL_TO_SCRAPE = 'https://www.airbnb.com/s/Karachi--Sindh--Pakistan/homes?query=Karachi%2C%20Sindh%2C%20Pakistan'
        payload = {'api_key': API_KEY, 'url': URL_TO_SCRAPE, 'render': 'false'}
        r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)

        if r.status_code == 200:
            html = r.text.strip()
            soup = BeautifulSoup(html, 'lxml')
            listing_section = soup.select('._fhph4u ._8ssblpx')

            for item in listing_section:
                link_section = item.select('a')
                if link_section:
                    url = link_section[0]['href']
                    title = link_section[0]['aria-label']

                # Extracting HTML per item
                html_ = item.prettify()
                price = get_price(html_)  # extracted, but not stored in the record below
                guests = get_guests(html_)
                bedroom = get_bed(html_)
                records.append({'title': title, 'url': url, 'guests': guests, 'bedroom': bedroom})

            if len(records) > 0:
                # 'w' instead of 'a+': appending a second JSON array would make the file invalid
                with open('airbnb.json', 'w', encoding='utf8') as f:
                    f.write(json.dumps(records))

        print('Done')
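To make the request step a bit more concrete: the target URL travels as the `url` query parameter of the Scraper API call, so it has to be percent-encoded, otherwise its own `?query=...` part would be mixed up with the API's parameters. `requests` does this encoding for you when you pass `params`. The sketch below shows roughly the same thing with only the standard library; `build_api_url` is a hypothetical helper name and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API_ENDPOINT = "http://api.scraperapi.com"
TARGET = ("https://www.airbnb.com/s/Karachi--Sindh--Pakistan/homes"
          "?query=Karachi%2C%20Sindh%2C%20Pakistan")


def build_api_url(api_key: str, target_url: str) -> str:
    """Return the full request URL for the same payload used in the post."""
    payload = {"api_key": api_key, "url": target_url, "render": "false"}
    # urlencode percent-encodes every value, including the nested URL
    return f"{API_ENDPOINT}?{urlencode(payload)}"


full_url = build_api_url("YOUR_API_KEY", TARGET)
print(full_url)

# Decoding the query string gives back the exact target URL.
decoded = parse_qs(urlsplit(full_url).query)
print(decoded["url"][0] == TARGET)
```

Printing the full URL is also a handy way to debug a request that returns an unexpected status code.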
You can see a few functions here: get_price(), get_guests() and get_bed(). These functions return the data we need. I pass the HTML of each listing item to them, and they extract the data from it.
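The real helpers live in the author's repository. Purely as an illustration of the idea, here is a minimal sketch of what a guest-count helper might look like. The name `get_guests_sketch`, the pattern, and the sample HTML are all assumptions of mine, not the author's code or Airbnb's actual markup.

```python
import re


def get_guests_sketch(h: str) -> str:
    """Return the guest count found in an item's HTML, or '-' if there is none."""
    # Assumes the count appears as text such as "4 guests" or "1 guest"
    match = re.search(r"(\d+)\s+guests?", h)
    return match.group(1) if match else "-"


# Made-up snippet standing in for the HTML of one listing item
sample = "<div>4 guests · 2 bedrooms · 2 beds · 1 bath</div>"
print(get_guests_sketch(sample))
print(get_guests_sketch("<div>no details here</div>"))
```

Returning '-' when nothing matches keeps the record shape consistent with the defaults set at the top of the main block.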
I am not showing all the functions here; you can download the full code from Github. I am only giving the code of get_bed here.
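    import re

    def get_bed(h):
        _bedroom = '-'
        regex = r"(\d+) bedroom"
        matches = re.finditer(regex, h, re.MULTILINE)

        for matchNum, match in enumerate(matches, start=1):
            for groupNum in range(0, len(match.groups())):
                # Groups are 1-indexed; the pattern has a single group (the number)
                groupNum = groupNum + 1
                _bedroom = match.group(groupNum)
                break

        # Falls back to '-' when the item's HTML has no "<n> bedroom" text
        return _bedroom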
As you can see, I am using a regex here. Parsing is not all about BeautifulSoup. You can use any tool that helps you get the data. If you want, you can use BeautifulSoup here as well. It is all up to you. The data is then saved in the records list, which is converted to a JSON structure and saved in a .json file. If all goes well, it will generate a file like the one below:
    [
      {
        "title": "Exquisite apartment in Downtown Defence karachi!!!",
        "url": "/rooms/27884577?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "6",
        "bedroom": "2"
      },
      {
        "title": "1 bed studio with kitchen+Wifi+Netflix (2nd Floor)",
        "url": "/rooms/40790188?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "2",
        "bedroom": "1"
      },
      {
        "title": "Adda's Pent House",
        "url": "/rooms/15671232?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "2",
        "bedroom": "1"
      },
      {
        "title": "2 bed Studio, Opposite Beach+Netflix+kitchen",
        "url": "/rooms/41527230?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "4",
        "bedroom": "2"
      },
      {
        "title": "Stay a while",
        "url": "/rooms/36153212?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "2",
        "bedroom": "1"
      },
      {
        "title": "Comfortable Studio Apartment near the Beach",
        "url": "/rooms/33680248?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "2",
        "bedroom": "1"
      },
      {
        "title": "Western luxury apartment in the heart of DHA",
        "url": "/rooms/39086647?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "5",
        "bedroom": "3"
      },
      {
        "title": "Beachfront, budgeted room with shared kitchen",
        "url": "/rooms/38830813?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "2",
        "bedroom": "1"
      },
      {
        "title": "Contemporary home DHA bedroom 1 KARACHI",
        "url": "/rooms/15259352?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "4",
        "bedroom": "1"
      },
      {
        "title": "2 Bedroom Apartment in DHA- Phase 6",
        "url": "/rooms/36055227?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "4",
        "bedroom": "2"
      },
      {
        "title": "Penthouse",
        "url": "/rooms/21764173?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "2",
        "bedroom": "1"
      },
      {
        "title": "Private room for females only.Fully equipped house",
        "url": "/rooms/41933011?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "2",
        "bedroom": "1"
      },
      {
        "title": "Business Centre Karachi apartment",
        "url": "/rooms/29350827?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "2",
        "bedroom": "1"
      },
      {
        "title": "Fully furnished Entire apartment-DHA PH 6",
        "url": "/rooms/33603135?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "6",
        "bedroom": "3"
      },
      {
        "title": "Very Neat and Clean Bedroom Available",
        "url": "/rooms/33308076?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "2",
        "bedroom": "1"
      },
      {
        "title": "Comfee Room",
        "url": "/rooms/41255914?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "2",
        "bedroom": "1"
      },
      {
        "title": "Business DHA Room",
        "url": "/rooms/38717735?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "3",
        "bedroom": "1"
      },
      {
        "title": "Independent Apartment in Top Vicinity of Karachi",
        "url": "/rooms/38446232?location=Karachi%2C%20Sindh%2C%20Pakistan&previous_page_section_name=1000&federated_search_id=9ad24778-f627-4cd5-93d5-e6027ca4699b",
        "guests": "2",
        "bedroom": "1"
      }
    ]
Looks good, no?
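Two things stand out in the output: the URLs are relative, and the counts are strings. If you need absolute links and real numbers later, a small post-processing step can handle that. This is only a sketch under my own assumptions: `normalize` and `to_int` are hypothetical helpers, and the sample record is a shortened version of one entry above with the query string dropped and a '-' placeholder added for illustration.

```python
import json
from urllib.parse import urljoin

BASE_URL = "https://www.airbnb.com"


def to_int(value: str):
    """Convert '2' to 2; return None for the '-' placeholder or other text."""
    return int(value) if value.isdigit() else None


def normalize(record: dict) -> dict:
    """Return a copy of the record with an absolute URL and integer counts."""
    return {
        "title": record["title"],
        "url": urljoin(BASE_URL, record["url"]),
        "guests": to_int(record["guests"]),
        "bedroom": to_int(record["bedroom"]),
    }


raw = json.loads(
    '[{"title": "Penthouse", "url": "/rooms/21764173", "guests": "2", "bedroom": "-"}]'
)
clean = [normalize(r) for r in raw]
print(json.dumps(clean, indent=2))
```

Mapping the '-' placeholder to None (null in JSON) makes missing values easy to filter out later.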
Conclusion
In this post, you learned how to create an Airbnb parser in Python using Scraper API. You do not have to worry about proxy IPs, nor do you have to pay hundreds of dollars for them, which matters especially when you are an individual or working at a startup. The company I work with spends hundreds of dollars every month just on proxy IPs.
Oh, and if you sign up here with my referral link or enter the promo code adnan10, you will get a 10% discount. If you do not get the discount, just let me know via email on my site and I’d be glad to help you out.
The code is available on Github.