A few weeks back I was intrigued by Per Harald Borgen’s post Machine Learning in a Week, which boiled the whole process of learning and implementing a Machine Learning algorithm on a real dataset down to a single week. He laid out a framework showing how a programmer can get into ML without worrying about heavy maths and statistics. It was a good excuse to give ML a chance, something I had been meaning to do for years after completing the Coursera course.
Alright. So this Monday I started my mission. I had to find a good dataset for the task. Initially I wanted to use NASA’s meteorite landings data, but because I knew too little about both the data and ML algorithms, I could not find a way to use it for prediction. I paused here and decided to do some learning first. As Per suggested, I headed to Udacity’s Introduction to Machine Learning course with great hopes. I had attended Coursera’s Machine Learning intro course in 2010 and, TBH, I did not feel comfortable with the heavy theory, maths and stats. The Udacity course was like a fresh breeze for me. Unlike Coursera’s course it was more practical. It also used Python, whereas the Coursera course used Octave, which is not commonly used among developers. Python’s scikit-learn is an awesome library that makes you smile by hiding all the complexities of the algorithms.
After watching the intro lectures on Naive Bayes, a supervised learning algorithm, I thought of using cricket stats for my work to figure out who could win in the future based on existing conditions. Pakistan recently became the No.1 Test cricket team after its England tour, so naturally I was inclined to look into Pakistan’s performance in Test matches against England since the beginning.
Step 1: Data Acquisition
The data was in front of me; all I needed was to get it into CSV format for further processing. I have been scraping data in Python for a long time, so it was not a difficult task for me. Data acquisition is the most important part of finding answers to your questions, and as a programmer you should know how to do it. Below is the code that accesses ESPN Cricinfo, then fetches, parses and stores the required data.

    """
    This will grab the data from CricInfo Site about TestMatch Played by
    Pakistan against England from 1954-till now
    """
    import requests
    from bs4 import BeautifulSoup

    url = 'http://stats.espncricinfo.com/ci/engine/team/7.html?class=1;opposition=1;template=results;type=team;view=results'
    r = requests.get(url)
    html = r.text

    # Create soup object
    soup = BeautifulSoup(html, 'lxml')
    recs = soup.select('tbody > .data1')

    with open('data_england_test.csv', 'w') as file:
        for i in range(2, len(recs)):
            # All <td> cells of the current table row
            single = recs[i].findAll('td')
            # Write the first seven cells as one comma-separated line
            file.write(','.join(cell.text for cell in single[:7]) + '\n')

The script creates a CSV file with all the tabular data. Now let's move on to the next step: transforming and cleaning the data.
Step 2: Data Cleaning and Transformation
The raw data is available in text format. In order to use it in algorithms we need to convert it to numerical format where possible. I am using another awesome Python library, pandas, which does all the heavy lifting required for data analysis. For the sake of simplicity I am taking only 3 parameters: Toss and Bat as features and Result as the label. Features are the input parameters that help the model learn and make predictions, whereas the label is the tag associated with each record, that is, the outcome we want the model to predict.
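To make the feature/label distinction concrete, here is a minimal sketch in plain Python. The lookup tables and the `encode_row` helper are hypothetical names introduced only for illustration; the numeric codes mirror the ones used in this post (lost/draw/won as -1/0/1, batting 1st/2nd as -1/1).

```python
# Hypothetical per-column lookup tables; the codes mirror the encoding
# used in this post (lost/draw/won -> -1/0/1, 1st/2nd -> -1/1).
RESULT_CODES = {"lost": -1, "draw": 0, "won": 1}
TOSS_CODES = {"lost": -1, "won": 1}
BAT_CODES = {"1st": -1, "2nd": 1}


def encode_row(row):
    """Return (features, label) for one match given as a dict of strings."""
    features = [TOSS_CODES[row["Toss"]], BAT_CODES[row["Bat"]]]
    label = RESULT_CODES[row["Result"]]
    return features, label


# A made-up match record, not taken from the real dataset.
match = {"Result": "won", "Toss": "lost", "Bat": "2nd"}
features, label = encode_row(match)
print(features, label)
```

Keeping one table per column makes it explicit that "won" and "lost" mean different things in the Toss and Result columns, even though they happen to share the same codes here.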
    import pandas as pd

    file_name = 'data_england_test.csv'
    fields = ['Result', 'Toss', 'Bat']
    df = pd.read_csv(file_name, skipinitialspace=True, usecols=fields)
    df.to_csv('data_england_test_filter.csv', index=False)

    # Convert features and labels into digits
    df_replace = df.replace(['lost', 'draw', 'won', '1st', '2nd'], [-1, 0, 1, -1, 1])

    dataset_length = len(df_replace)
    # Use 67% of the data for training
    ratio = 0.67
    split_index = round(dataset_length * ratio)
    train_data_df = df_replace[:split_index]  # first 67% of data
    test_data_df = df_replace[split_index:]   # rest for testing, no overlap with training

    # Create the respective CSV files
    train_data_df.to_csv('train.csv', index=False)
    test_data_df.to_csv('test.csv', index=False)
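The arithmetic of the split is easy to check on its own. This is a minimal sketch in plain Python, assuming a hypothetical helper `split_rows` and 81 placeholder rows standing in for the match records. The property worth verifying is that both slices use the same cut index, so no row ends up in both training and testing data.

```python
def split_rows(rows, ratio=0.67):
    """Split rows, in order, into a training part and a testing part."""
    cut = round(len(rows) * ratio)  # index of the first test row
    return rows[:cut], rows[cut:]


# Placeholder rows: integers standing in for the 81 match records.
rows = list(range(81))
train, test = split_rows(rows)

print(len(train), len(test))       # sizes of the two parts
print(set(train).isdisjoint(test)) # True when no row is shared
```

With 81 rows and a ratio of 0.67 the cut lands at row 54, which gives 54 training rows and 27 testing rows.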
Final Step: Training and Predicting
Alright, we have both training and testing data available. It’s time to load the scikit-learn library, train on our training data and predict.

    """
    Labels : Lost, Draw, Won [-1,0,1]

    Features
    ==========
    Toss(Lost,Won) = [-1,1]
    Bat(First, Second) = [-1,1]
    """
    # Import the Gaussian Naive Bayes model and the accuracy metric
    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    # Assigning features and labels.
    # skip_header=1 skips the column-name row that to_csv wrote, so it is
    # not parsed as a data record.
    features = np.genfromtxt('train.csv', delimiter=',', usecols=(1, 2), dtype=int, skip_header=1)
    labels = np.genfromtxt('train.csv', delimiter=',', usecols=(0), dtype=int, skip_header=1)
    features_test = np.genfromtxt('test.csv', delimiter=',', usecols=(1, 2), dtype=int, skip_header=1)
    labels_test = np.genfromtxt('test.csv', delimiter=',', usecols=(0), dtype=int, skip_header=1)

    # Create a Gaussian classifier
    model = GaussianNB()

    # Train the model using the training set
    model.fit(features, labels)

    # Predict output
    predicted = model.predict(features_test)
    # print(predicted)
    acc = accuracy_score(labels_test, predicted)
    print(acc)

I picked the Naive Bayes algorithm because it is efficient compared to other supervised learning algorithms and easy to use as well.
The program returned an accuracy of ~30%, which is not good at all. Accuracy, the fraction of test records predicted correctly, helps to figure out whether your algorithm works well. There are methods like cross validation to fine-tune and evaluate the model more reliably, but that is out of scope at this moment. OK, I kept playing with the data. The total dataset consists of only 81 records. I divided it into training and test data in a 67-33 ratio. When I reduced the number of test records, accuracy shot up to 75% with only 3 record sets. I asked in a few forums and people considered it normal, given the repetition of similar data as well as the small dataset.
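For readers curious about cross validation, here is a rough sketch of the idea in plain Python. It is not the code used in this post: `k_fold_indices` and `majority_baseline_accuracy` are hypothetical helpers, the labels are made up, and the "model" is just a baseline that always predicts the most common training label. In practice scikit-learn's `cross_val_score` from `sklearn.model_selection` does the fold handling for a real estimator such as `GaussianNB`.

```python
from collections import Counter


def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size


def majority_baseline_accuracy(labels, k=5):
    """Average accuracy of always predicting the most common training label."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        most_common = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
        hits = sum(1 for i in test_idx if labels[i] == most_common)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


# Made-up labels for illustration only, not the real match results.
labels = [0, 0, 1, -1, 0, 1, 0, -1, 0, 0]
baseline = majority_baseline_accuracy(labels, k=5)
print(baseline)
```

Every record is used for testing exactly once, so the averaged score depends less on which few rows happen to land in the test set. A baseline like this is also a useful yardstick: a classifier that scores below it is not learning anything from the features.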
Anyway, that’s how my first program was written. I also wrote the same program using SVM, and the output was not much different, for the reasons given above.
It was a good exercise. Surprisingly, I finished it a bit earlier than planned, thanks to some knowledge I already had as well as the availability of an awesome ML library. I finally learnt how to prepare training and testing data.
I am not stopping here. I am exploring the Udacity course further, as well as the real-world datasets available on Kaggle, which should be good enough to excite you.
The code is also available on GitHub.