  • Getting started with Apache Airflow

    This post is part of the Data Engineering Series. In this post, I am going to discuss Apache Airflow, a workflow management system developed by Airbnb. Earlier I discussed writing basic ETL pipelines in Bonobo. Bonobo is great for writing ETL pipelines, but the world is not all about writing ETL pipelines to automate things. There are other use cases in which you have to perform tasks in a certain order, once or periodically. For instance: monitoring cron jobs; transferring data from one place to another; automating your DevOps operations; periodically fetching data from websites and updating the database for your awesome price comparison system; data processing for recommendation based…

  • Getting started with Apache Kafka in Python

    This post is part of the Data Engineering Series. In this post, I am going to discuss Apache Kafka and how Python programmers can use it for building distributed systems. What is Apache Kafka? Apache Kafka is an open-source streaming platform that was initially built by LinkedIn. It was later handed over to the Apache Software Foundation and open sourced in 2011. According to Wikipedia: Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue architected…

  • Getting started with Elasticsearch in Python

    In this post, I am going to discuss Elasticsearch and how you can integrate it with different Python apps. What is Elasticsearch? Elasticsearch (ES) is a distributed and highly available open-source search engine built on top of Apache Lucene. It is written in Java, so it is available on many platforms. You store unstructured data in JSON format, which also makes it a NoSQL database. Unlike other NoSQL databases, ES also provides search engine capabilities and other related features. Elasticsearch Use Cases: You can use ES for multiple purposes; a couple of them are given below: You are running a website that provides lots of dynamic content; be it an…

  • How to create a custom token on Stellar network in Python

    A few months back I wrote a post about Stellar and how you can use it in your Python applications. In this post, I am going to discuss how you can create your own custom token, a.k.a. a coin, programmatically in Python. Before I get into the code, I’d like to discuss what tokens are and their background, how they differ from altcoins, and some Stellar network concepts. This post is lengthy, so read it when you have ample time. What are Tokens? The term token is not new, and many of us will have experienced an application of it one way or another. Tokens are…

  • 5 strategies to write unblock-able web scrapers in Python

    People who read my posts in the scraping series often contact me to ask how they can write scrapers that don’t get blocked. It is very difficult to write a scraper that NEVER gets blocked, but you can extend the life of your web scraper by implementing a few strategies. Today I am going to discuss them. User-Agent: The very first thing you need to take care of is setting the user-agent. A user agent is software that acts on behalf of the user and tells the server which web browser the user is using to visit the website. Many websites do not let you view the content if…
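As a rough illustration of the user-agent strategy mentioned in this excerpt, here is a minimal sketch using only Python's standard library. The URL and the user-agent string are hypothetical placeholders, and the original post may well use a different HTTP library; the sketch only builds the request and does not send it.

```python
from urllib.request import Request

# Hypothetical values chosen for illustration only.
URL = "https://example.com/"
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url: str, user_agent: str) -> Request:
    """Return a Request that identifies itself with the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request(URL, USER_AGENT)

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Passing `request` to `urllib.request.urlopen` would send it with that header instead of the default `Python-urllib/x.y` identifier, which is an easy signal for a site to block.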

  • Develop your first ETL job in Python using bonobo

    In this post I am going to discuss how you can write ETL jobs in Python using the Bonobo library. Before I get into the library itself, allow me to discuss ETL itself and why it is needed. What is ETL? ETL is short for Extract, Transform, and Load, a process in which data is acquired, changed/processed, and finally loaded into a data warehouse or database(s). You can extract data from sources like files, websites, or a database, transform the acquired data, and then load the final version into a database for business usage. You may ask, why ETL? Well, what ETL does, many of you might already have been doing…
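To make the three ETL stages in this excerpt concrete, here is a minimal sketch in plain Python. It deliberately does not use Bonobo; the sample CSV data, the `products` table, and the function names are all assumptions made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for a file or website export.
RAW_CSV = "name,price\nwidget, 9.50\ngadget,12\n"


def extract(text):
    """Extract: read rows from the CSV source as dictionaries."""
    yield from csv.DictReader(io.StringIO(text))


def transform(rows):
    """Transform: tidy the name and convert the price to a float."""
    for row in rows:
        yield row["name"].strip().title(), float(row["price"])


def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
    conn.executemany("INSERT INTO products VALUES (?, ?)", records)
    conn.commit()


# An in-memory SQLite database stands in for the real warehouse.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

rows = conn.execute("SELECT name, price FROM products ORDER BY name").fetchall()
print(rows)
```

Because each stage is a generator feeding the next, rows stream through the pipeline one at a time rather than being held in memory all at once.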

  • Getting started with Python and IPFS

    In this post I am going to discuss how you can use the decentralized IPFS in your Python apps for storing different kinds of data. What is IPFS? From Wikipedia: InterPlanetary File System (IPFS) is a protocol and network designed to create a content-addressable, peer-to-peer method of storing and sharing hypermedia in a distributed file system. IPFS was initially designed by Juan Benet, and is now an open-source project developed with help from the community. In simple terms, it’s Amazon S3 on Blockchain. All of your information is available on a decentralized network across nodes, which makes the system not only scalable but reliable as well, since data which is fed into it…

  • Introduction to Exploratory Data Analysis in Python

    Recently I finished up the Python Graph series, using Matplotlib to represent data in different types of charts. In this post I am giving a brief intro to exploratory data analysis (EDA) in Python with the help of pandas and matplotlib. What is Exploratory data analysis? According to Wikipedia: In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. You can say that EDA is a statistician’s way of storytelling where you explore…

  • Implementing beanstalk to create a scaleable web scraper

    Image Credit (http://blog.hqc.sk.ca/wp-content/uploads/2012/12/Queue-2012-12-11.jpg) Queues are often used to make applications scalable by offloading data and processing it later. In this post I am going to use the Beanstalk queue management system in Python. Before I get into the real task, allow me to give a brief intro to Beanstalk. What is Beanstalk? From the official website: Beanstalk is a simple, fast work queue. Its interface is generic, but was originally designed for reducing the latency of page views in high-volume web applications by running time-consuming tasks asynchronously. Its daemon is called beanstalkd, which you can run on *nix-based machines. Since I am on OSX, I ran brew install beanstalkd to install it. Once…

  • Develop database driven applications in Python with Peewee

    Most applications these days interact with a database, especially RDBMS-based engines (DB engines that support SQL). Like other languages, Python provides native and third-party libraries to interact with databases. Normally you have to write SQL queries for CRUD operations. That’s OK, but at times things get messy: The Big Boss decided to move from MySQL to.. MSSQL and you have no choice but to nod and make your queries compatible with the other DB engine. You have to make multiple queries just to retrieve a single piece of info from the…