Write your first web scraper in Python with BeautifulSoup

OK, so I am going to write the simplest web scraper in Python, with the help of libraries like requests and BeautifulSoup. Before I move further, allow me to discuss what web/HTML scraping is.

What is Web scraping?

According to Wikipedia:

Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. This is accomplished by either directly implementing the Hypertext Transfer Protocol (on which the Web is based), or embedding a web browser.


So, scraping techniques are used to access data from web pages and make it useful for various purposes (e.g. analysis, aggregation, etc.). Scraping is not the only way to extract data from a website or web application; there are other ways to achieve the goal, for instance using an API/SDK provided by the application, or RSS/Atom feeds. If such facilities are already provided, scraping should be the last resort, especially when there are questions about the legality of the activity.

OK, so I am going to write the scraper in Python. Python is not the only language that can be used for the purpose; almost all languages provide some way to access a webpage and parse HTML. I use Python as a personal choice, due to the simplicity of the language itself and its available libraries.

Now, I have to pick some web page to fetch the required information from. I picked OLX for the purpose; the page I am going to scrape is an ad's detail page, a car ad in this case.

Suppose I am going to build a price comparison system for the used cars available on OLX. For that, I need the data available periodically in my local database. To achieve this, I'd have to go to each car ad's page, parse the data, and put it in the local DB. For simplicity, I am not covering crawling the listing page and dumping the data of each item individually.

On the page there is some useful information for my price comparison system: title, price, location, images, owner name, and description.

Let's grab the title first. One of the prerequisites of writing a web scraper is that you should be good enough at using the HTML inspector provided in web browsers, because if you don't know which markup or tag to pick, you can't get the required data. This Chrome tutorial should be helpful if you have not used inspectors before. OK, below is the HTML of the ad title:

Chrome HTML Inspector



As you can see, it's pretty simple: the ad's title is written in an H1 tag with the CSS classes brkword lheight28. An important thing: make sure you narrow down your selection as much as you can, especially when you are picking a single entry, because if you just pick a generic HTML tag, chances are that such a tag exists more than once on the page. That is not common for the H1 tag, but since the class names were specified, I picked them as well. In this particular scenario there's only one instance of the H1 tag, so even if I picked H1 without the classes, I'd get the required info anyway.

OK, now I am going to write code. The first task is to access the page, check whether it's available, and then access its HTML.

Here, I access the page by using Python's requests library. You may use others, like urllib, as well, but since I am fond of the simplicity provided by requests, I don't look for any other library at all.

The page was accessed, and I also passed a User-Agent in the headers. It's not mandatory, but it often happens that a site does not let you access the page if certain headers are not sent with the HTTP request. Besides that, it's also good to come out clean and let the server know you are trying to be nice by providing a User-Agent.
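A minimal sketch of this fetching step (the OLX URL below is a placeholder, not the exact ad from the post):

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; my-first-scraper/0.1)"}

def get_soup(url):
    """Fetch a page and return a BeautifulSoup object, or None on failure."""
    response = requests.get(url, headers=HEADERS)
    if response.status_code != 200:
        # Page unavailable: don't try to parse anything.
        return None
    return BeautifulSoup(response.text, "html.parser")

# soup = get_soup("https://www.olx.com.pk/item/some-car-ad")  # placeholder URL
```

Checking the status code before parsing keeps the scraper from choking on error pages.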

After creating the BeautifulSoup object, I access the H1 tag. There are multiple ways to do it, but mostly I rely on the select() method, since it lets you use CSS selectors. This one was simple, so I just used the find() method. More details are given in BeautifulSoup's documentation.

You might wonder why I am verifying whether the H1 tag exists when it's obviously right there. The reason is that you don't have control over other people's websites. There is a probability that they change the markup or site structure for some reason, and your code stops working. So the best practice is to check each element's existence and proceed further only if it exists. It can also help you in logging errors and sending notifications to site administrators or a log management system, so that the problem can be rectified as soon as possible.

Once the title object is found, I get its text and use strip() to remove all whitespace. Never trust data retrieved from a website; you should always clean and transform it based on your needs. Next, I got the location, which is similar to what I did for the title, and then the price.
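Here is how the title extraction looks on a stand-in snippet that mimics the markup described above (the H1 classes are the ones from the inspector screenshot):

```python
from bs4 import BeautifulSoup

# A trimmed-down stand-in for the real ad page markup.
html = '<h1 class="brkword lheight28">\n  Honda Civic 2008 \n</h1>'
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.find("h1", class_="brkword lheight28")
if title_tag is not None:           # never assume the markup is still there
    title = title_tag.get_text().strip()
else:
    title = None
```

The existence check is exactly the defensive pattern discussed above: if OLX changes its markup, `title` simply becomes `None` instead of the script crashing.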

Here, as you can see, I used the select() method, which finds elements by CSS selectors. I could achieve the same result by using find(), but just for the sake of variety I picked this one. select() returns a list, because multiple elements can fulfill the criteria in the case of selectors. I picked the one and only element that matched the criteria and extracted the price. Next, I picked the URLs of the images and the description.
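A sketch of the select() step; the snippet and its class names are stand-ins, since the real OLX price markup is not shown here:

```python
from bs4 import BeautifulSoup

html = """
<div class="price-section">
  <strong>Rs 1,650,000</strong>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() always returns a list, even when only one element matches.
matches = soup.select("div.price-section strong")
price = matches[0].get_text().strip() if matches else None
```

Indexing `matches[0]` only after checking the list is non-empty is the same existence-check habit as before.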

Finally, I created a dictionary object. It is not part of scraping; I created it just to produce the data in a standard format for further processing. You may also convert it into XML, CSV, or whatever you want.

The code is available on Github

If you like this post then you should subscribe to my blog for future updates.


How I wrote my first Machine Learning program in 3 days

A few weeks back I was intrigued by Per Harald Borgen's post Machine Learning in a Week, which made learning and implementing a Machine Learning algorithm on a real dataset in a week look simple. He laid down a framework for how, as a programmer, one can get into ML without worrying about heavy maths and statistics. It was a good excuse to give this ML thing a chance, which I had been trying to do for many years after completing a Coursera course.

Alright. So on Monday I started my mission. I had to find a good dataset to achieve the task. Initially I wanted to use NASA's meteorite landings data, but due to my lack of knowledge of both the data and ML algorithms, I could not find a way to use it for prediction. I stopped for a while and decided to do some learning. As Per suggested, I headed to Udacity's Introduction to Machine Learning course with great hopes. I had attended Coursera's Machine Learning intro course in 2010 and, to be honest, I did not feel comfortable with its heavy theory, maths, and stats. This course was like a breath of fresh air for me. Unlike Coursera's course, it was more practical. They also used Python, unlike the Octave used in the Coursera course, which is not commonly used among developers. Python's scikit-learn is an awesome library that makes you smile by hiding all the complexities of the algorithms.

After watching the intro lectures on Naive Bayes, a supervised learning algorithm, I thought of using cricket stats for my work and figuring out who could win in the future based on existing conditions. Pakistan recently became the No. 1 Test cricket team after a recent England tour, so naturally I was inclined to find out something about Pakistan's performance in Test matches against England since the beginning.

Step 1: Data Acquisition

The data was in front of me; all I needed was to get it into CSV format for further processing. I have been doing data scraping in Python for a long time, so it was not a difficult task for me. Data acquisition is the most important part of finding answers to your questions, and as a programmer you should know how to acquire it. Below is the code that accesses ESPN Cricinfo and fetches, parses, and stores the required data.
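The actual script is linked at the end of the post; its output stage looks roughly like this, with the scraped rows stubbed out (the column names are assumptions based on the features discussed below):

```python
import csv

# In the real script these rows come from parsing the Cricinfo results table
# (e.g. with requests + BeautifulSoup); they are stubbed here.
rows = [
    {"Team": "Pakistan", "Toss": "won", "Bat": "1st", "Result": "won"},
    {"Team": "Pakistan", "Toss": "lost", "Bat": "2nd", "Result": "lost"},
]

with open("pak_eng_tests.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Team", "Toss", "Bat", "Result"])
    writer.writeheader()        # header row makes the CSV self-describing
    writer.writerows(rows)
```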

The script will create a CSV file with all the tabular data. Now let's move on to the next step: transforming and cleaning the data.

Step 2: Data Cleaning and Transformation

The raw data is available in text format. In order to use it in algorithms, we need to convert it to numerical format where possible. I am using another awesome Python library, pandas, which does all the heavy lifting required for data analysis. For the sake of simplicity I am taking only 3 parameters: Toss and Bat as features, and Result as the label. Features are the input parameters that help the model learn, whereas the label is the tag that should be associated with each record.

Final Step: Training and Predicting

Alright, we have both training and testing data available; it's time to load the scikit-learn library, load our training data, and predict.

I picked the Naive Bayes algorithm; it is efficient compared to other supervised learning algorithms, and easy to use as well.
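Training and predicting with scikit-learn's Gaussian Naive Bayes looks roughly like this (toy data here, so the accuracy number is meaningless; the post's real features come from the CSV built earlier):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy feature rows: [toss_won, batted_first]; labels: 1 = won, 0 = lost.
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0], [1, 0], [0, 1]]
y = [1, 1, 0, 0, 1, 0, 1, 0]

# 67-33 split, as in the post.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

clf = GaussianNB()
clf.fit(X_train, y_train)                           # train on the 67%
accuracy = accuracy_score(y_test, clf.predict(X_test))  # score on the 33%
```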

The program returned an accuracy of ~30%, which is not good at all. Accuracy helps you figure out whether your algorithm works well. There are methods available, like cross-validation, to fine-tune and boost the model, but that's out of scope at this moment. OK, I kept playing with the data. The total dataset consists of only 81 records. I divided it into training and test data in a 67-33 ratio. On decreasing the test records, the accuracy shot up to 75% with only 3 test records. I asked in a few forums, and they considered it normal due to the repetition of similar data as well as the small dataset.

Anyway, that's how my first program was written. I wrote the same program using SVM, which did not produce much different output, for the reasons given above.

It was a good exercise. Surprisingly, I achieved it a bit early, thanks to some knowledge I already had as well as the availability of an awesome ML library. I finally learnt how to make training and testing data.

I am not stopping here; I am further exploring the Udacity course as well as the real-world datasets available on Kaggle, which should be good enough to excite you.

The code is also available on Github


Mbstring not found error while installing mailparse PHP extension

Today I had to install mailparse via PEAR. While installing, I came across the error mbstring not found, despite it being right there. One of the simplest ways to fix it is given below.

Though I am working on MAMP, this will work for anyone, since it's part of the PHP source.

Go to your PHP Source Library. In my case it is:

and add the following lines to mbfilter.h

and then run the PEAR command again:

If all goes well it should give something like:

Manually add an entry for the extension in your php.ini file, and it should be available for use.

Currently Reading – Chrome extension about social book reading


I love reading books, in fact lots of books. I'm sure many of you read books as well. Books on different topics and subjects help you expand your horizons of knowledge and creativity. Often it happens that you extract an idea from a book, give it your own touch, and produce a product which people love to use.

In the age of social media, when people love to let others know about their daily activities by updating a status, it's not surprising that there are social sites which let your friends know what you're reading. Goodreads is one of them; it is a kind of Facebook for books.

A month and a half back I got the idea to come up with a tool that lets your email recipients know about your book reading habits. Lo and behold, Currently Reading came into being.

Currently Reading is actually a Chrome extension for Gmail that adds an entry in your existing signature about the book you’re reading.

How Does it Work?

Well, see yourself!

Currently Reading Steps

You search the book you’re reading, select it and it automagically appears in your email signature.

Go to the Currently Reading website to learn more about how to use it.

Happy Reading!!

Online JSON Diff Viewer

Just came across this awesome tool to compare JSON. Yeah, at times it's needed. Check it out: http://json-diff.com/

The thing I like most is the interface.


Develop your first Facebook messenger bot in PHP

Facebook Messenger Bot - F8 Conference (Credit: Independent.co.uk)

Facebook recently announced a Bot platform for its Messenger, which provides businesses and individuals another way to communicate with people.

What is a Chat bot?

A computer program designed to simulate conversation with human users, especially over the Internet.

Chat bot in PHP

When I heard of it, my very first thought was to write a bot in PHP. I started looking for an SDK released by Facebook for this, but none was present. I headed over to the documentation, which provided good information for starters.

OK! So without wasting further time, let's build our first bot.


In order to create an FB bot, you will need two things to host it: a Facebook Page and a Facebook App (both are covered in the steps below). The Page will be like the home of the bot: people will visit the page and click on the Message option to interact with your bot. For example, suppose Pizza Hut introduces a bot for order-related operations. They could integrate, or host, the bot on their official page; a fan could then just click on the Message button and send messages to order a pizza, get new deals, and so on, and would get replies as if some human representative were responding. It all depends on how efficient a bot is. Facebook puts no limitation in this regard.

I am going to create a Time bot, which will tell you the current time using a Time API that provides different options for retrieving time. For our bot, we are just fetching the latest (NOW) time. I will go step by step:


Step 1: Create Facebook Page:

I am going to create the bot's Page first. This page will actually be the entry point of communication between the bot and your page fans/users. Do note that it is not necessary to create a separate page just for the bot; you may use an existing fan page for this purpose. For the sake of this tutorial I am assuming that you have never created a page before. Visit https://www.facebook.com/pages/create/ and you will see something like this (as of April 2016):

Create Facebook Page


I picked the Entertainment option. The next steps ask for different options, which you can always skip.

Alright! So my page is ready, and something like this should be visible for you as well:

Facebook Fan page




Step 2: Create Facebook App:

Alright, go to https://developers.facebook.com/apps and click on the Add a New App button. Make sure you have a developer account; otherwise you will not be able to access the developer dashboard.

Facebook Create App

When you click on it, it shows a window asking what kind of app you are going to make. I picked Basic Setup, given at the bottom, entered the required information (Display Name & Contact Email), and hit the Create App ID button.

Create Facebook App ID

After a captcha, you will be redirected to your app page where you will see the details.

Facebook App page


On the left sidebar you will see a Messenger option. When you click on it, it shows an introduction to the Messenger Platform and explains why and how these bots will be helpful.


Messenger Platform

Click on Get Started and it will show a new dashboard page for your newly created app, which is going to be hooked into the Messenger Platform.

Messenger App Dashboard

Now we need to do a few things to set up the bot. As you can see, you are asked for a few things: an Access Token/Page Token, so that Facebook knows where you want to host the bot; a Webhook, the URL of your script that will receive messages from your users and respond to them (it will also hold the logic of your bot); and Permissions, that is, what the bot should be able to do when communicating with users. OK, first set the page which you just created. I am selecting TimBot. Since I, as a normal Facebook user, am going to use this page for the very first time, it will ask for permissions as it normally does.

Once all goes well, you will get your Page Token like this. Save it somewhere, as it will be used as access_token while sending messages.

Facebook Page Access Token



Now we have to set up our Webhook. Facebook requires an https:// URL, which means you simply can't use localhost while developing. You can either upload your script somewhere that allows SSL-based requests, or you can use a tunneling tool that exposes your localhost to the outside world. Luckily such tools are available, and they are FREE as well. I'd recommend ngrok for this purpose. Once it's unzipped, go to the folder and run the command:

As a free user you are not allowed to give your own subdomain. Once it starts working, it shows something like this:

nGrok in Action

As you can see, it gives you two forwarding URLs. Since we are interested in the https one, we will focus on that. ngrok also provides an interface for viewing requests to your newly created domain, which helps you debug how your webhook page is being accessed by the Messenger Platform. For that purpose, open a separate tab with the URL http://localhost:4040/inspect/http, and there you can see all the details of each request.

nGrok Web Interface in action
Now that I have the URL, all I have to do is set up my Webhook for the Time Bot. Click on the Setup Webhooks option and you'll see something like this:

Facebook Messenger Webhook



Here I entered the ngrok-based URL and a Verify Token, which can be ANY string, and checked the subscription fields. If you hit Verify and Save now, you will get an error:



What does it mean? When the Webhook accesses the URL, it first checks the verification token before doing anything further; if the token is not present or is incorrect, it gives the error you see above.

It's time to open your IDE and write some code. Create a file (in my case it's index.php; in your case it could be any other file) and write the code that verifies your Webhook.


$verify_token holds the value you entered in the Verify Token field.
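The PHP handler itself is in the linked code; the check it performs is simple enough to sketch. This is shown in Python purely to illustrate the logic: Facebook sends hub.verify_token and hub.challenge as query parameters, and your script must echo the challenge back only when the token matches.

```python
VERIFY_TOKEN = "my_secret_token"  # must match what you typed in the Verify Token field

def handle_verification(query_params):
    """Return the challenge string if the token matches, else None."""
    if query_params.get("hub.verify_token") == VERIFY_TOKEN:
        return query_params.get("hub.challenge")
    return None

# Messenger's verification request looks roughly like:
# GET /index.php?hub.mode=subscribe&hub.verify_token=my_secret_token&hub.challenge=12345
```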

Now all seems set, so let's try again. Hurray! If all goes well, you will see something like this:

How about we test our Webhook by sending a message to our bot?

Step 3: Sending and Receiving Messages

Before you start sending/receiving messages, you need to subscribe your app to the page. Go to the command prompt and, assuming cURL is installed, run the following command.

Here access_token is the token I got in an earlier step for the Time Bot page, the one I asked you to save somewhere. If it works, you should see a success message on the console:

Now go to your bot page and send a message.



If the hooks and subscription work fine, you should see a request hit on your ngrok web interface (http://localhost:4040/inspect/http):

Facebook Messenger Bot Message Structure



Now you will know the power of the ngrok web interface: here we can see the structured message the Messenger Platform delivers. In the next step we will decode this JSON into an array and fetch the sender ID and the message.

The first line gets the message payload from Facebook, and the next converts the JSON into a PHP associative array.

In the next couple of lines I am getting the sender ID, the ID of the person messaging the bot, and the message itself.

Wait... while you are debugging requests on ngrok, you may come across an error:

This issue arises if you are using PHP 5.6. The best bet is to create an .htaccess file with the following content.

Now the error should be gone. This particular error will not go away by using ini_set().

I am setting some basic rules for my bot: it should only tell the time if the message contains certain words.
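The post implements these rules in PHP; the idea can be sketched like this (Python here just to show the rule logic, and the wording of the replies is hypothetical):

```python
from datetime import datetime

def reply_for(message):
    """Very basic rule: only answer when the message mentions the time."""
    if "time" in message.lower():
        return "Current time: " + datetime.now().strftime("%H:%M:%S")
    # Anything else gets a polite fallback instead of silence.
    return "Sorry, I only understand questions about the time."
```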

Just to check that all is going well, the message to be printed should be visible in the ngrok debugger.

OK, now we have to message the user back. We have the sender's ID and we have prepared the message; it's time to send it.

First, we will create the JSON structure of the message, as specified by the Facebook platform, and then make a POST call via cURL. If everything is coded correctly, you should see the bot in action!
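The post's code does this in PHP with cURL; the payload structure and the Send API call can be sketched like this (Python for illustration; v2.6 was the Graph API version current at the time):

```python
import json
import urllib.request

ACCESS_TOKEN = "PAGE_ACCESS_TOKEN"  # the page token saved earlier
SEND_API = "https://graph.facebook.com/v2.6/me/messages?access_token=" + ACCESS_TOKEN

def build_payload(sender_id, text):
    """Message structure expected by the Messenger Send API."""
    return {"recipient": {"id": sender_id}, "message": {"text": text}}

def send_message(sender_id, text):
    """POST the reply back to the user (the equivalent of the cURL call)."""
    request = urllib.request.Request(
        SEND_API,
        data=json.dumps(build_payload(sender_id, text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(request)  # fires the actual POST
```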

As you can see, it covers both valid and invalid messages, as per our rules.

Facebook Bot in Action


Alright! It's your time to show your creativity and create your first awesome bot. If your bot goes public, you can add a little code on your website to invite others; check the documentation for that.

Code is available on Github

Do you have an existing web app or Facebook page and want to integrate a bot with it? Let me know and I can help you in this regard. Contact me at kadnan(at)gmail.com


How to remove local git branches?

If you are working on a project with lots of feature branches, you will want to remove all of them once your deployment is done and all feature branches are merged into master. The following simple command can help you get rid of them.

Here DB- is the prefix of your feature branches. So if you have branches like DB-1 and DB-2, the command lists the branches, greps all names matching DB-, and then removes them. xargs rules!
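The command itself is not visible above; based on the description (git branch piped through grep and xargs), it is presumably the one-liner below, demonstrated here in a throwaway repo so nothing real gets deleted:

```shell
# Set up a disposable repo with two sample feature branches.
cd "$(mktemp -d)" && git init -q
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "init"
git branch DB-1 && git branch DB-2

# The actual cleanup: list local branches, keep only the DB- ones,
# and force-delete each via xargs.
git branch | grep "DB-" | xargs git branch -D
```

Note that -D force-deletes, so make sure the branches really are merged (or disposable) first.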


Laravel: How to avoid redirecting to home after login?

If you are working on a Laravel 5.1 based project and using the Auth class for authentication, chances are you have faced the issue of being redirected to home after login.

While the docs say that you can override the behaviour by setting the $redirectPath value, the thing is, it still does not work. After wasting a reasonable amount of time, I figured out the issue.

By default, the Auth controller's constructor is defined as:


This middleware actually maps to the RedirectIfAuthenticated class. If you go to its definition you will find:


Hardcoded, right? So the best fix is to omit the following line from the constructor:

Then add the following protected method to override the default redirection:


And it should work then.