Python

Create your first web scraper with ScrapingBee API and Python
Learn how to use cloud based Scraping API to scrape web pages without getting blocked.

In this post, I am going to discuss another cloud-based scraping tool that takes care of many of the issues you usually face while scraping websites. This platform has been introduced by ScrapingBee, a cloud-based Scraping tool. What is ScrapingBee If you visit their website, you will find something like below: ScrapingBee API handles headless browsers and rotates proxies for you. As it suggests, it is offering you all the things to deal with the issues you usually come across while writing your scrapers, especially the availability of proxies and headless scraping. No installation of web drivers for Selenium, yay! Development ScrapingBee is based on REST API hence it can…

Read More
Develop AirBnb Parser in Python

Planning to write a book about Web Scraping in Python. Click here to give your feedback So I am starting a new scraping series, called, ScrapeTheFamous, in which I will be parsing some famous websites and will discuss my development process. The posts will be using Scraper API for parsing purposes which makes me free from all worries blocking and rendering dynamic sites since Scraper API takes care of everything. Anyways, the first post is about Airbnb. We will be scraping some important data points from it. We will be scraping a list of rental URL and fetch and store data in JSON format. So let’s start! The URL we…

Read More
Advanced Proxy Use for Web Scraping

Guest Post by Vytautas Kirjazovas from Oxylabs.io In the eyes of many, web scraping is an art. It is safe to state that the majority of web scraping enthusiasts have faced bans from websites more than once during their careers. Web scraping is a challenging task, and it’s more common than you think to see your crawlers getting banned by websites. In this article, we’ll talk about more advanced ways to use proxies for web scraping. There are some key components that you should take into account with web scraping to avoid getting banned too quickly: Set browser-like headers User-Agent that can be found in real life. Referer header. Other…

Read More
Create a simple image search engine in OpenCV and Flask

I recently started playing with OpenCV, an open-source Computer Vision library for image processing. Luckily there are Python bindings available. More luck that the guys like Adrian has done a great service by releasing both book and blog on a similar topic. I have also made a demo which can see below. Conclusion So this was a basic image processing tutorial in OpenCV. It is not a mature product as it can’t tell you about unique colors in a picture. It is just telling you informaiton based on pixel colors rather than performing color segmentation and clustering based on a ML algorithm. So what is it all about? Well, it…

Read More
Create your first web scraper with Scraper API and Python

Recently I come across a tool that takes care of many of the issues you usually face while scraping websites. The tool is called Scraper API which provides an easy to use REST API to scrape a different kind of websites(Simple, JS enabled, Captcha, etc) with quite an ease. Before I proceed further, allow me to introduce Scraper API. What is Scraper API If you visit their website you’d find their mission statement: Scraper API handles proxies, browsers, and CAPTCHAs, so you can get the HTML from any web page with a simple API call! As it suggests, it is offering you all the things to deal with the issues…

Read More
Create your first ETL Pipeline in Apache Spark and Python

In this post, I am going to discuss Apache Spark and how you can create simple but robust ETL pipelines in it. You will learn how Spark provides APIs to transform different data format into Data frames and SQL for analysis purpose and how one data source could be transformed into another without any hassle. What is Apache Spark? According to Wikipedia: Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. From Official Website: Apache Spark™ is a unified analytics engine for large-scale data processing. In short, Apache Spark is a framework which is…

Read More
Getting started with Apache Cassandra and Python

In this post, I am going to talk about Apache Cassandra, its purpose, usage, configuration, and setting up a cluster and in the end, how can you access it in your Python applications. At the end of the post, you should have an idea of it and could start playing it for your next project. What is Apache Cassandra? According to Wikipedia: Apache Cassandra is a free and open-source, distributed, wide column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters,[1] with asynchronous masterless replication…

Read More
Getting started with Apache Airflow

This post is the part of Data Engineering Series. In this post, I am going to discuss Apache Airflow, a workflow management system developed by Airbnb. Earlier I had discussed writing basic ETL pipelines in Bonobo. Bonobo is cool for write ETL pipelines but the world is not all about writing ETL pipelines to automate things. There are other use cases in which you have to perform tasks in a certain order once or periodically. For instance: Monitoring Cron jobs transferring data from one place to other. Automating your DevOps operations. Periodically fetching data from websites and update the database for your awesome price comparison system. Data processing for recommendation based…

Read More
Getting started with Apache Kafka in Python

This post is the part of Data Engineering Series. In this post, I am going to discuss Apache Kafka and how Python programmers can use it for building distributed systems. What is Apache Kafka? Apache Kafka is an open-source streaming platform that was initially built by LinkedIn. It was later handed over to Apache foundation and open sourced it in 2011. According to Wikipedia: Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue architected…

Read More
Getting started with Elasticsearch in Python

The updated version of this post for Elasticsearch 7.x is available here. In this post, I am going to discuss Elasticsearch and how you can integrate it with different Python apps. What is ElasticSearch? ElasticSearch (ES) is a distributed and highly available open-source search engine that is built on top of Apache Lucene. It’s an open-source which is built in Java thus available for many platforms. You store unstructured data in JSON format which also makes it a NoSQL database. So, unlike other NoSQL databases ES also provides search engine capabilities and other related features. ElasticSearch Use Cases You can use ES for multiple purposes, a couple of them given below: You…

Read More