Tuesday, November 4, 2014

Exercise to compare classifier performance

This is an exercise to compare the performance of different machine learning algorithms (classifiers) in terms of both accuracy and speed. It is not a comprehensive study; it's merely a trial to come up with a process for comparing performance. R's caret package helps here, since it can run different classifiers by changing a single parameter.

The data used is an income data set that contains details about individuals - age, years of education, gender and work hours per week. It also has their salary range as a binary variable - whether they earn more than 50K or not. The goal of the classifier is to predict the income range based on the other attributes. R is used for the exercise.

Loading and exploring the data set

library(caret)
data <- read.csv("incomedata.csv", stringsAsFactors = TRUE)

Let us look at the correlation between the predictors and the target.

Correlation Plot

All four predictors are weak, with correlation coefficients against Salary in the 0.2 range. So it will be interesting to see how different algorithms perform on this data.
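As a rough illustration of how such coefficients can be computed, the sketch below correlates numeric predictors with a 0/1 encoding of the target. It runs on simulated data, not the income data set: the column names Age, Education and WorkHours, the level labels, and the simulated relationship are all assumptions made for the example.

```r
# Simulated stand-in for the income data (assumed column names, not the real file).
set.seed(42)
n <- 500
toy <- data.frame(
  Age       = round(runif(n, 18, 65)),
  Education = round(runif(n, 8, 20)),
  WorkHours = round(runif(n, 20, 60))
)

# Make the target depend weakly on the predictors, purely for illustration.
score <- 0.03 * toy$Age + 0.15 * toy$Education + 0.03 * toy$WorkHours +
  rnorm(n, sd = 2)
toy$Salary <- factor(ifelse(score > median(score), ">50K", "<=50K"))

# Encode the binary target as 0/1 so cor() returns a point-biserial correlation.
target <- as.numeric(toy$Salary == ">50K")
predictors <- toy[, c("Age", "Education", "WorkHours")]
correlations <- sapply(predictors, function(x) cor(x, target))
print(round(correlations, 2))
```

A categorical predictor such as gender would first need a numeric encoding (for example 0/1) before it could be included in the same way.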

Model Building and Prediction

Split training and testing sets

inTrain <- createDataPartition(y=data$Salary,p=0.7,list=FALSE)
training <- data[inTrain,]
testing <- data[-inTrain,]
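For a factor outcome, createDataPartition samples within each class so that the class proportions are roughly preserved in both sets. The base-R sketch below illustrates that idea on simulated labels; it is not caret's implementation, and the label vector and class sizes are made up for the example.

```r
# Simulated binary labels (hypothetical class sizes, not the real data).
set.seed(1)
salary <- factor(rep(c("<=50K", ">50K"), times = c(700, 300)))

# Sample 70% of the row indices within each class, then combine them.
train_idx <- unlist(lapply(
  split(seq_along(salary), salary),
  function(idx) sample(idx, size = round(0.7 * length(idx)))
))

# Class proportions in the full data and in the training subset.
full_props  <- prop.table(table(salary))
train_props <- prop.table(table(salary[train_idx]))
print(full_props)
print(train_props)
```

Calling set.seed before the split, as in the sketch, also makes the partition reproducible between runs.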

Predict using different models

#list of all algorithms
predlist <- c("bagFDA","lda","LogitBoost",
              "C5.0", "glm")

results <- data.frame(Algorithm = character(), Duration = numeric(), Accuracy = numeric(),
                      stringsAsFactors = FALSE)

#loop through algorithm list and perform model building and prediction

for (i in seq_along(predlist)) {
  pred <- predlist[i]
  print(paste("Algorithm =", pred))
  startTime <- as.integer(Sys.time())
  model <- train(Salary ~ ., data = training, method = pred)
  predicted <- predict(model, testing)
  cm <- confusionMatrix(predicted, testing$Salary)
  endTime <- as.integer(Sys.time())
  results[i, 1] <- pred
  results[i, 2] <- endTime - startTime  # duration in whole seconds
  results[i, 3] <- round(as.numeric(cm$overall["Accuracy"]) * 100, 2)  # accuracy in percent
}
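The loop measures time by differencing Sys.time() truncated to whole seconds. One alternative, sketched here under assumptions, is base R's system.time(), which reports elapsed time with sub-second resolution. The example times a plain glm() fit on simulated data; the data frame sim and its columns are invented for the illustration and stand in for the caret train() call.

```r
# Simulated data (hypothetical columns), just to have something to fit.
set.seed(7)
n <- 2000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(ifelse(sim$x1 + rnorm(n) > 0, "yes", "no"))

# system.time() evaluates the expression and returns user, system and elapsed times.
timing <- system.time(
  fit <- glm(y ~ x1 + x2, data = sim, family = binomial)
)
elapsed_seconds <- unname(timing["elapsed"])
cat("Elapsed:", elapsed_seconds, "seconds\n")
```

Dividing elapsed_seconds by 60 would give minutes, the unit used in the results table.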


Analyzing results

Classifier                                  Accuracy (%)   Execution Time (minutes)
Linear Discriminant Analysis                80.38          0.0
Classification and Regression Trees         80.86          0.2
Generalized Linear Models                   80.53          0.2
Boosted Logistic Regression                 79.01          1.4
Decision Trees (C5.0)                       80.9           4.5
Naïve Bayes                                 80.94          4.9
Neural Networks                             80.8           6.4
Random Forest                               80.84          15.1
Bagging (Flexible Discriminant Analysis)    81.15          23.2
Support Vector Machines                     81.17          66.3

The table above shows the results of the comparison. Some thoughts:

1. Different classifiers perform differently depending on the data set. Correlation among the predictors and run-to-run variability in predictions both go a long way in influencing the algorithms. Sometimes all algorithms predict the same way.

2. Accuracy differs by at most about two percentage points across the classifiers (79.01% to 81.17%). Are these differences statistically significant?

3. There are huge differences in execution time, from well under a minute for Linear Discriminant Analysis to 66.3 minutes for Support Vector Machines. This indicates that for any data science project, trying out multiple algorithms and comparing their results is a worthwhile exercise. For similar levels of accuracy, some of them run far faster, which helps in choosing the most efficient algorithm for the problem in question.
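One rough way to approach the significance question in point 2 is a two-proportion test on the counts of correct predictions. The sketch below compares the highest and lowest accuracies from the table using base R's prop.test(). The test-set size n_test is an assumption, since the post does not state it, so the resulting p-value is illustrative only.

```r
# Accuracies from the table; the test-set size is assumed, not taken from the post.
n_test <- 9000
acc <- c(svm = 0.8117, logitboost = 0.7901)

# Convert accuracies to counts of correct predictions.
correct <- round(acc * n_test)

# Two-sample test of equal proportions (treats the two samples as independent).
sig_test <- prop.test(x = correct, n = c(n_test, n_test))
print(sig_test$p.value)
```

Because both classifiers are scored on the same test rows, a paired test such as mcnemar.test() on the per-row correct/incorrect outcomes would be more appropriate when the individual predictions are available; the version above is an approximation that needs only the reported accuracies.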

Here is a further comparison of accuracy (%) across different data sets.

Classifier                                  Income Prediction   SPAM Filter   Credit Approval
Linear Discriminant Analysis                80.38               91.95         72
Classification and Regression Trees         80.86               88.59         66
Generalized Linear Models                   80.53               92.62         71.67
Boosted Logistic Regression                 79.01               90.6          69.33
Decision Trees (C5.0)                       80.9                89.26         71
Naïve Bayes                                 80.94               87.92         68.33
Neural Networks                             80.8                91.28         71.67
Random Forest                               80.84               89.93         70.67
Bagging (Flexible Discriminant Analysis)    81.15               93.96         70.33
Support Vector Machines                     81.17               89.93         72.33


  1. Hi,

    This is a great article.

    Does your experience suggest that visualizing the relationship between the predictors (fewer than 10) and the target variable would point to the best, or top three, classifiers? If so, can you go into more detail?

    I also think knowing the inherent differences and similarities between the models would help determine the best pick.

