• Getting started with Apache Avro and Python
    Learn how to create and consume Apache Avro-based data for more efficient data transfer.

    In this post, I am going to talk about Apache Avro, an open-source data serialization system that is being used by tools like Spark, Kafka, and others for big data processing. What is Apache Avro? According to Wikipedia: Avro is a row-oriented remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Avro uses a schema…
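
    As a taste of what the post covers, here is a minimal sketch of defining a schema and serializing records in Python with the fastavro package (the post itself may use the official avro package; the schema and values below are made up):

    ```python
    # A minimal sketch: define an Avro schema, write records to a compact binary
    # file, then read them back. Uses fastavro; the "User" schema is illustrative.
    from fastavro import writer, reader, parse_schema

    schema = parse_schema({
        "name": "User",
        "type": "record",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"},
        ],
    })

    records = [{"name": "Adnan", "age": 30}, {"name": "Jane", "age": 25}]

    # Serialize to a compact binary Avro file (the schema travels with the data)
    with open("users.avro", "wb") as out:
        writer(out, schema, records)

    # Deserialize: each record comes back as a plain Python dict
    with open("users.avro", "rb") as fo:
        for record in reader(fo):
            print(record)
    ```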

  • Create your first REST API in Django Rest Framework
    A step-by-step guide to creating APIs in Django Rest Framework

    In this post, I am going to talk about Django Rest Framework, or DRF. DRF is used to create RESTful APIs in Django which can later be consumed by various apps: mobile, web, desktop, etc. We will discuss how to install DRF on your machine and then write our APIs for a system. Before we discuss DRF, let’s talk a bit about REST itself. What is REST? From Wikipedia: Representational state transfer (REST) is a software architectural style that defines a set of constraints to be used for creating Web services. Web services that conform to the REST architectural style, called RESTful Web services, provide interoperability between…
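
    To give an idea of how little code DRF needs, here is a minimal sketch of a serializer and viewset for a hypothetical Book model (the model, app, and field names are illustrative, not taken from the post):

    ```python
    # A minimal DRF sketch: a ModelSerializer plus a ModelViewSet give you full
    # CRUD endpoints. The Book model and its fields are hypothetical.
    from rest_framework import routers, serializers, viewsets

    from myapp.models import Book  # hypothetical Django model


    class BookSerializer(serializers.ModelSerializer):
        class Meta:
            model = Book
            fields = ["id", "title", "author"]


    class BookViewSet(viewsets.ModelViewSet):
        # list/retrieve/create/update/delete come for free
        queryset = Book.objects.all()
        serializer_class = BookSerializer


    # Wire the viewset into URLs with a router (include router.urls in urls.py)
    router = routers.DefaultRouter()
    router.register(r"books", BookViewSet)
    ```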

  • Create your first web scraper with ScrapingBee API and Python
    Learn how to use a cloud-based scraping API to scrape web pages without getting blocked.

    In this post, I am going to discuss another cloud-based scraping tool that takes care of many of the issues you usually face while scraping websites. The platform has been introduced by ScrapingBee. What is ScrapingBee? If you visit their website, you will find something like the following: ScrapingBee API handles headless browsers and rotates proxies for you. As it suggests, it offers everything you need to deal with the issues you usually come across while writing your scrapers, especially the availability of proxies and headless scraping. No installation of web drivers for Selenium, yay! Development: ScrapingBee is based on a REST API, hence it can…
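
    For flavour, a hedged sketch of what a ScrapingBee call from Python can look like; check their docs for the current endpoint and parameters, as the ones below reflect my understanding only:

    ```python
    # A sketch of one ScrapingBee request via the requests library. The endpoint and
    # parameter names (api_key, url, render_js) are assumptions to verify in the docs.
    import requests

    response = requests.get(
        "https://app.scrapingbee.com/api/v1/",
        params={
            "api_key": "YOUR_API_KEY",        # placeholder
            "url": "https://example.com",
            "render_js": "false",             # "true" for JavaScript-heavy pages
        },
    )
    print(response.status_code)
    print(response.text[:500])  # first 500 characters of the returned HTML
    ```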

  • Develop AirBnb Parser in Python

    Planning to write a book about Web Scraping in Python. Click here to give your feedback. So I am starting a new scraping series, called ScrapeTheFamous, in which I will be parsing some famous websites and discussing my development process. The posts will use Scraper API for parsing purposes, which frees me from all worries about blocking and rendering dynamic sites, since Scraper API takes care of everything. Anyway, the first post is about Airbnb. We will scrape some important data points from it. We will scrape a list of rental URLs and fetch and store the data in JSON format. So let’s start! The URL we…
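
    A rough sketch of that approach: fetch one listing through Scraper API and dump a couple of fields to JSON. The endpoint, parameters, listing URL, and selectors below are assumptions; real Airbnb markup will differ.

    ```python
    # Sketch: fetch a (hypothetical) Airbnb listing via Scraper API, pull a field
    # or two with BeautifulSoup, and store the result as JSON.
    import json

    import requests
    from bs4 import BeautifulSoup

    API_KEY = "YOUR_SCRAPERAPI_KEY"                       # placeholder
    listing_url = "https://www.airbnb.com/rooms/123456"   # hypothetical rental URL

    response = requests.get(
        "http://api.scraperapi.com",
        params={"api_key": API_KEY, "url": listing_url, "render": "true"},
    )

    soup = BeautifulSoup(response.text, "html.parser")
    data = {
        "url": listing_url,
        "title": soup.title.get_text(strip=True) if soup.title else None,
    }

    with open("listing.json", "w") as f:
        json.dump(data, f, indent=2)
    ```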

  • Advanced Proxy Use for Web Scraping

    Guest Post by Vytautas Kirjazovas from Oxylabs.io. In the eyes of many, web scraping is an art. It is safe to state that the majority of web scraping enthusiasts have faced bans from websites more than once during their careers. Web scraping is a challenging task, and it’s more common than you think to see your crawlers getting banned by websites. In this article, we’ll talk about more advanced ways to use proxies for web scraping. There are some key components that you should take into account with web scraping to avoid getting banned too quickly: set browser-like headers, such as a User-Agent that can be found in real life and a Referer header. Other…
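
    A minimal sketch of those recommendations with requests: browser-like headers plus a proxy picked from a pool (the proxy addresses and User-Agent string are placeholders):

    ```python
    # Sketch: send browser-like headers and route the request through a proxy
    # chosen at random from a pool. All credentials/hosts below are placeholders.
    import random

    import requests

    PROXIES = [
        "http://user:pass@proxy1.example.com:8000",
        "http://user:pass@proxy2.example.com:8000",
    ]

    headers = {
        # A User-Agent that can be found in real life, plus a plausible Referer
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
        ),
        "Referer": "https://www.google.com/",
    }

    proxy = random.choice(PROXIES)
    response = requests.get(
        "https://example.com",
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=30,
    )
    print(response.status_code)
    ```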

  • Create a simple image search engine in OpenCV and Flask

    I recently started playing with OpenCV, an open-source computer vision library for image processing. Luckily, there are Python bindings available. Even luckier, people like Adrian have done a great service by releasing both a book and a blog on the topic. I have also made a demo, which you can see below. Conclusion: So this was a basic image processing tutorial in OpenCV. It is not a mature product, as it can’t tell you about the unique colors in a picture. It just gives you information based on pixel colors rather than performing color segmentation and clustering based on an ML algorithm. So what is it all about? Well, it…
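
    In that pixel-colour spirit, a small sketch of computing per-channel colour histograms with OpenCV (the image file name is hypothetical):

    ```python
    # Sketch: load an image and report the most frequent intensity per colour channel.
    # This is plain pixel counting, not segmentation or clustering.
    import cv2

    image = cv2.imread("sample.jpg")  # OpenCV loads images in BGR order

    for index, name in enumerate(("blue", "green", "red")):
        hist = cv2.calcHist([image], [index], None, [256], [0, 256])
        peak = int(hist.argmax())
        print(f"{name}: most frequent intensity = {peak}")
    ```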

  • Create your first web scraper with Scraper API and Python

    Recently I came across a tool that takes care of many of the issues you usually face while scraping websites. The tool is called Scraper API, and it provides an easy-to-use REST API to scrape different kinds of websites (simple, JS-enabled, CAPTCHA-protected, etc.) with ease. Before I proceed further, allow me to introduce Scraper API. What is Scraper API? If you visit their website, you’d find their mission statement: Scraper API handles proxies, browsers, and CAPTCHAs, so you can get the HTML from any web page with a simple API call! As it suggests, it offers you all the things to deal with the issues…
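
    The “simple API call” in that mission statement boils down to a single GET; the endpoint and parameter names below are to the best of my knowledge, so verify against the docs:

    ```python
    # Sketch: one GET through Scraper API returns the target page's content
    # (httpbin's /ip endpoint conveniently shows the proxy IP being used).
    import requests

    response = requests.get(
        "http://api.scraperapi.com",
        params={"api_key": "YOUR_API_KEY", "url": "https://httpbin.org/ip"},
    )
    print(response.text)
    ```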

  • Create your first ETL Pipeline in Apache Spark and Python

    In this post, I am going to discuss Apache Spark and how you can create simple but robust ETL pipelines in it. You will learn how Spark provides APIs to transform different data formats into DataFrames and SQL for analysis purposes, and how one data source can be transformed into another without any hassle. What is Apache Spark? According to Wikipedia: Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. From the official website: Apache Spark™ is a unified analytics engine for large-scale data processing. In short, Apache Spark is a framework which is…
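
    To make the “DataFrames plus SQL” idea concrete, a short PySpark sketch that reads a CSV, queries it with SQL, and writes the result as Parquet (paths and column names are made up):

    ```python
    # Sketch of a tiny ETL flow in PySpark: extract from CSV, transform with SQL,
    # load into Parquet. File paths and columns are illustrative only.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("simple-etl").getOrCreate()

    # Extract: load a CSV into a DataFrame
    df = spark.read.csv("sales.csv", header=True, inferSchema=True)

    # Transform: query the DataFrame with plain SQL
    df.createOrReplaceTempView("sales")
    summary = spark.sql(
        "SELECT country, SUM(amount) AS total FROM sales GROUP BY country"
    )

    # Load: write the transformed data out in another format
    summary.write.mode("overwrite").parquet("sales_summary.parquet")

    spark.stop()
    ```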

  • Getting started with Apache Cassandra and Python

    In this post, I am going to talk about Apache Cassandra: its purpose, usage, and configuration, setting up a cluster, and, in the end, how you can access it from your Python applications. By the end of the post, you should have an idea of it and be able to start playing with it for your next project. What is Apache Cassandra? According to Wikipedia: Apache Cassandra is a free and open-source, distributed, wide column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication…
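
    As a preview of the Python part, a minimal sketch using the DataStax cassandra-driver package (the keyspace, table, and contact point are hypothetical):

    ```python
    # Sketch: connect to a local Cassandra node, create a keyspace and table,
    # insert a row, and read it back. Names below are illustrative.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])   # contact point(s) of your cluster
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.set_keyspace("demo")
    session.execute(
        "CREATE TABLE IF NOT EXISTS users (id int PRIMARY KEY, name text)"
    )

    session.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (1, "Adnan"))

    for row in session.execute("SELECT id, name FROM users"):
        print(row.id, row.name)

    cluster.shutdown()
    ```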

  • Getting started with Apache Airflow

    This post is part of the Data Engineering series. In this post, I am going to discuss Apache Airflow, a workflow management system developed by Airbnb. Earlier, I had discussed writing basic ETL pipelines in Bonobo. Bonobo is cool for writing ETL pipelines, but the world is not all about writing ETL pipelines to automate things. There are other use cases in which you have to perform tasks in a certain order, once or periodically. For instance: monitoring cron jobs; transferring data from one place to another; automating your DevOps operations; periodically fetching data from websites and updating the database for your awesome price comparison system; data processing for recommendation-based…
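
    For a feel of how such ordered, scheduled tasks look in Airflow, a minimal DAG sketch with two dependent tasks (imports follow the older 1.x layout that matches the era of this post; newer versions move the operators around):

    ```python
    # Sketch: a daily DAG where "fetch_data" must finish before "process_data" runs.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

    dag = DAG(
        dag_id="simple_workflow",
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",  # run once per day
    )

    fetch = BashOperator(task_id="fetch_data", bash_command="echo fetching", dag=dag)
    process = BashOperator(task_id="process_data", bash_command="echo processing", dag=dag)

    # Declare the ordering: fetch runs before process
    fetch >> process
    ```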