• Schedule web scrapers with Apache Airflow

     This post is the part of Data Engineering Series. In the previous post, I discussed Apache Airflow and it’s basic concepts, configuration, and usage. In this post, I am going to discuss how can you schedule your web scrapers with help of Apache Airflow. I will be using the same example I used in Apache Kafka and Elastic Search example that is scraping https://allrecipes.com  because the purpose is to use Airflow. In case you want to learn about scraping you may check the entire series here. So, we will work on a workflow consist of tasks: parse_recipes: It will parse individual recipes. download_image: It downloads recipe image. store_data: Finally store image…

  • Getting started with Apache Airflow

    This post is the part of Data Engineering Series. In this post, I am going to discuss Apache Airflow, a workflow management system developed by Airbnb. Earlier I had discussed writing basic ETL pipelines in Bonobo. Bonobo is cool for write ETL pipelines but the world is not all about writing ETL pipelines to automate things. There are other use cases in which you have to perform tasks in a certain order once or periodically. For instance: Monitoring Cron jobs transferring data from one place to other. Automating your DevOps operations. Periodically fetching data from websites and update the database for your awesome price comparison system. Data processing for recommendation based…

  • Data Engineering Series – An Intro

    So I just realized that I am here after a month or so. I was busy at work and traveling out of the country. I am starting a kind of new series, I say it Data Engineering Series in which I will be discussing different tools. Of course, I am not able to discuss the entire concept of Data Engineering neither I know it as I will be learning myself. What is Data Engineering? Data Engineering is all about developing, maintaining systems that are responsible for transferring data in large volumes and make it available for analysts and data scientists to use it for analyzing and data modeling. Data engineering…

  • Windows Desktop Applications Automation using AutoIt

    Guest Post by Ahmed Memon AutoIt is a language & framework to automate (mainly) Windows GUI application interactions. Though it has its own language but there are wrappers available for Python, C# etc too. This write-up target developer who want to automate Windows Desktop Applications but either do not know the right tool or want to enhance their skills using AutoIt. Just like with anything, there is either informed or amateur approach towards solving a problem. Developers who have never written Windows GUI applications will write AutoIt code that will depend too much on interactions with active GUI. While we can get maximum out of AutoIt when we rely less on…

  • Getting started with Apache Kafka in Python

    This post is the part of Data Engineering Series. In this post, I am going to discuss Apache Kafka and how Python programmers can use it for building distributed systems. What is Apache Kafka? Apache Kafka is an open-source streaming platform that was initially built by LinkedIn. It was later handed over to Apache foundation and open sourced it in 2011. According to Wikipedia: Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue architected…

  • Getting started with Elasticsearch in Python

    In this post, I am going to discuss Elasticsearch and how you can integrate with different Python apps. What is ElasticSearch? ElasticSearch (ES) is a distributed and highly available open-source search engine that is built on top of Apache Lucene. It’s an open-source which is built in Java thus available for many platforms. You store unstructured data in JSON format which also makes it a NoSQL database. So, unlike other NoSQL databases ES also provides search engine capabilities and other related features. ElasticSearch Use Cases You can use ES for multiple purposes, a couple of them given below: You are running a website that provides lots of dynamic content; be it an…

  • How to create a custom token on Stellar network in Python

    A few months back I made a post about Stellar that how you can use it in your Python applications. In this post, I am going to discuss that how you can create your own custom token, a.k.a, a coin programmatically in Python. Before I get into the code, I’d like to discuss what are tokens and their background, how they are different from Alt-coins and some Stellar network concepts. This post is lengthy so read it when you have ample time to read. What are Tokens? The term token is not new and many of us would have experienced the application of it one way or other. Tokens are…

  • 5 strategies to write unblock-able web scrapers in Python

    People who read my posts in scraping series often contacted me to know how could they write scrapers that don’t get blocked. It is very difficult to write a scraper that NEVER gets blocked but yes, you can increase the life of your web scraper by implementing a few strategies. Today I am going to discuss them. User-Agent The very first thing you need to take care of is setting the user-agent. User Agent is a tool that works on behalf of the user and tells the server about which web browser the user is using for visiting the website. Many websites do not let you view the content if…

  • Develop your first ETL job in Python using bonobo

    In this post I am going to discuss how you can write ETL jobs in Python by using  Bonobo library. Before I get into the library itself, allow me to discuss about ETL itself and why is it needed? What is ETL? ETL is actually short form of Extract, Transform and Load, a process in which data is acquired, changed/processes and then finally get loaded into data warehouse/database(s). You can extract data from data sources like Files, Website or some Database, transform the acquired data and then load the final version into database for business usage. You may ask, Why ETL?, well, what ETL does, many of you might already been doing…

  • How to setup PHP7.1, Apache 2.2 on Amazon Linux

    OK I had no plan to make this post but recently I spent quite a few time to figure it out so thought to make it as a post for self and others who come across issue to deal with this simply thing otherwise. Alright, let’s proceed! Below is the details of my Amazon Distro: [crayon-5c15e8d507483294305355/] Repo setting Before we proceed, first make sure that you are downloading stuff from the right repository. Remove all irrelevant repos, specially remi-safe. [crayon-5c15e8d50748b174022662/] You can see list of available repos, if all goes well you should see something like below: [crayon-5c15e8d50748e059229943/] Alright, we now have correct repo setup, it’s time to install things…