Tuesday, November 4, 2014

Exercise to compare classifier performance


This is an exercise to compare the performance of different machine learning algorithms (classifiers) in terms of both accuracy and speed. It is not a comprehensive study; it's merely a trial to come up with a process for comparing performance. The caret package in R helps here, since it can run different classifiers with just a change of one parameter (the method argument of train).

The data used is an income data set that contains details about individuals - age, years of education, gender and work hours per week. It also has their salary range as a binary variable - whether they earn more than 50K or not. The goal of the classifier is to predict the income range based on the other attributes. R is used for the exercise.


Loading and exploring the data set


#load the income data set and inspect its structure
data <- read.csv("incomedata.csv")
str(data)
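Since caret's train() treats a factor outcome as a classification problem, it is worth confirming up front that the target column is a factor. A minimal check, assuming the column is named Salary as in the code below:

#coerce the target to a factor if read.csv left it as character
if (!is.factor(data$Salary)) {
  data$Salary <- as.factor(data$Salary)
}
table(data$Salary)   #class counts for the binary target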

Let us look at the correlation between the predictors and the target.
library(psych)
pairs.panels(data)   #scatterplot matrix with correlation coefficients

[Correlation plot of the predictors and Salary, produced by pairs.panels()]
All four predictors are weak, with correlation coefficients against Salary only in the twenties (roughly 0.2 to 0.3). So it will be interesting to see how different algorithms perform on this data.
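To see those coefficients numerically rather than reading them off the plot, the binary target can be coded as 0/1 and correlated with the numeric predictors. A minimal sketch (only the Salary column name comes from the code in this post; everything else is generic):

#point-biserial correlations of the numeric predictors with Salary
salary01 <- as.numeric(data$Salary == levels(data$Salary)[2])
num_cols <- sapply(data, is.numeric)
round(cor(data[, num_cols], salary01), 2)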


Model Building and Prediction


Split training and testing sets

library(caret)
inTrain <- createDataPartition(y=data$Salary,p=0.7,list=FALSE)
training <- data[inTrain,]
testing <- data[-inTrain,]
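One small refinement not in the original code: seed the random number generator before createDataPartition so the split is reproducible, and confirm that the stratified split preserved the class proportions. A sketch (the seed value is arbitrary):

set.seed(1234)   #any fixed seed makes the partition reproducible
inTrain <- createDataPartition(y=data$Salary,p=0.7,list=FALSE)
training <- data[inTrain,]
testing <- data[-inTrain,]
prop.table(table(training$Salary))   #class proportions in training...
prop.table(table(testing$Salary))    #...should closely match testing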

Predict using different models

#list of all algorithms (caret method names)
predlist <- c("bagFDA","lda","LogitBoost",
              "nb","nnet","rf","rpart","svmRadialCost",
              "C5.0", "glm")

results <- data.frame( Algorithm=character(), Duration=numeric(), Accuracy=numeric(),
                       stringsAsFactors=FALSE)

#loop through algorithm list, build each model, predict on the test set,
#and record execution time (in minutes) and accuracy

for (i in 1:length(predlist)) {
  pred <- predlist[i]
  print(paste("Algorithm = ",pred ))
  startTime <- as.integer(Sys.time())
  
  model <- train( Salary ~ ., data=training, method=pred)
  predicted <- predict(model, testing)
  cm <- confusionMatrix(predicted, testing$Salary)
  endTime <- as.integer(Sys.time())
  
  results[i,1] <- pred
  results[i,2] <- round((endTime-startTime)/60, 1)                   #duration in minutes
  results[i,3] <- round(as.numeric(cm$overall["Accuracy"]) * 100, 2) #accuracy in percent
  
}

results
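A small optional follow-up, not in the original post: order the table by accuracy and plot accuracy against execution time, which makes the trade-off discussed below easier to see.

#sort by accuracy (best first), breaking ties by speed
results[order(-results$Accuracy, results$Duration), ]

#quick base-R plot of the accuracy vs. time trade-off
plot(results$Duration, results$Accuracy,
     xlab="Execution time (minutes)", ylab="Accuracy (%)",
     main="Classifier accuracy vs. execution time")
text(results$Duration, results$Accuracy, labels=results$Algorithm, pos=3, cex=0.7)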

Analyzing results

Classifier                                  Accuracy (%)   Execution Time (minutes)
Linear Discriminant Analysis                    80.38              0.0
Classification and Regression Trees             80.86              0.2
Generalized Linear Models                       80.53              0.2
Boosted Logistic Regression                     79.01              1.4
Decision Trees (C5.0)                           80.90              4.5
Naïve Bayes                                     80.94              4.9
Neural Networks                                 80.80              6.4
Random Forest                                   80.84             15.1
Bagging (Flexible Discriminant Analysis)        81.15             23.2
Support Vector Machines                         81.17             66.3

The table above shows the results of the comparison. Some thoughts:

1. Different classifiers perform differently depending on the data set. How strongly the predictors correlate with the target, and how much the predictions vary between runs, go a long way in determining how the algorithms rank; on some data sets all of the algorithms end up predicting much the same way.

2. The accuracy differences between many of them are only a point or two. Are these differences statistically significant? One quick way to check is sketched after this list.

3. There are huge differences in execution times. For similar levels of accuracy, some algorithms run orders of magnitude faster. This indicates that for any data science project, trying out multiple algorithms and comparing their results is a worthwhile exercise, and it helps in choosing the most efficient algorithm for the problem in question.
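On question 2, since every classifier is scored on the same test set, a paired comparison such as McNemar's test on two models' right/wrong indicators is a reasonable first check. A minimal sketch: the prediction vectors below are assumptions, since the loop above would have to keep each model's test-set predictions (for example in a named list called preds) for this to run.

#assumed: preds[["svmRadialCost"]] and preds[["lda"]] hold each model's test-set predictions
svm_right <- preds[["svmRadialCost"]] == testing$Salary   #TRUE where SVM was correct
lda_right <- preds[["lda"]] == testing$Salary             #TRUE where LDA was correct

#McNemar's test focuses on the cases where the two models disagree
mcnemar.test(table(svm_right, lda_right))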

Here is a further comparison of accuracy (%) for the same classifiers on different data sets.


Classifier                                  Income Prediction   SPAM Filter   Credit Approval
Linear Discriminant Analysis                     80.38              91.95          72.00
Classification and Regression Trees              80.86              88.59          66.00
Generalized Linear Models                        80.53              92.62          71.67
Boosted Logistic Regression                      79.01              90.60          69.33
Decision Trees (C5.0)                            80.90              89.26          71.00
Naïve Bayes                                      80.94              87.92          68.33
Neural Networks                                  80.80              91.28          71.67
Random Forest                                    80.84              89.93          70.67
Bagging (Flexible Discriminant Analysis)         81.15              93.96          70.33
Support Vector Machines                          81.17              89.93          72.33
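To produce a table like this, the comparison loop above can be wrapped in a small helper and run once per data set. A minimal sketch; the function name, the spam and credit data frames, and their target column names are all assumptions, not part of the original post.

#hypothetical helper: run the same split/train/score steps on any data frame
compare_classifiers <- function(df, target, methods=predlist, p=0.7) {
  idx <- createDataPartition(y=df[[target]], p=p, list=FALSE)
  train_set <- df[idx,]
  test_set  <- df[-idx,]
  form <- as.formula(paste(target, "~ ."))

  sapply(methods, function(m) {
    model <- train(form, data=train_set, method=m)
    cm <- confusionMatrix(predict(model, test_set), test_set[[target]])
    round(as.numeric(cm$overall["Accuracy"]) * 100, 2)
  })
}

#usage (data frame and column names here are assumptions):
#compare_classifiers(spamdata, "type")
#compare_classifiers(creditdata, "approved")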
