This exercise compares the performance of different machine learning algorithms (classifiers) in terms of both accuracy and speed. It is not comprehensive; it's merely a trial at coming up with a process for comparing performance. The caret package for R helps here, since it can run different classifiers by changing a single parameter.
The data used is an income data set that contains details about individuals: age, years of education, gender and work hours per week. It also has their salary range as a binary variable: whether they earn more than 50K or not. The goal of the classifier is to predict the income range from the other attributes. R is used for the exercise.
Loading and exploring the data set
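# stringsAsFactors = TRUE keeps Salary a factor (no longer the default since R 4.0)
data <- read.csv("incomedata.csv", stringsAsFactors = TRUE)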
str(data)
Let us look at the correlation between the predictors and the target.
library(psych)
pairs.panels(data)
[Figure: Correlation Plot]
All four predictors are weak, with correlation coefficients against Salary in the 0.2 range. It will be interesting to see how different algorithms perform on this data.
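The correlations read off the plot can also be computed directly. The following is a minimal sketch, not the code behind the figure: `to_numeric` and `cor_with_target` are hypothetical helper names, and the toy data frame (columns `age` and `hours`, labels `low` and `high`) is invented purely to make the block runnable. It assumes that non-numeric columns can reasonably be replaced by the integer codes of their factor levels before taking a Pearson correlation.

```r
# Convert a column to numbers: numeric columns pass through,
# anything else becomes the integer codes of its factor levels.
to_numeric <- function(col) {
  if (is.numeric(col)) col else as.numeric(as.factor(col))
}

# Pearson correlation of every other column with the target column.
cor_with_target <- function(df, target = "Salary") {
  num <- as.data.frame(lapply(df, to_numeric))
  predictors <- setdiff(names(num), target)
  cors <- cor(num[predictors], num[[target]])
  setNames(as.vector(cors), predictors)
}

# Toy data, invented for illustration only (not the income data set)
toy <- data.frame(
  age    = c(25, 32, 47, 51, 38, 29, 60, 44),
  hours  = c(40, 45, 50, 38, 60, 20, 45, 55),
  Salary = factor(c("low", "low", "high", "high", "high", "low", "high", "high"),
                  levels = c("low", "high"))
)
toy_cors <- cor_with_target(toy)
print(round(toy_cors, 2))
```

On the real data the call would be `cor_with_target(data)`, assuming the target column is named `Salary` as in the rest of the post.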
Model Building and Prediction
Split training and testing sets
library(caret)
inTrain <- createDataPartition(y=data$Salary,p=0.7,list=FALSE)
training <- data[inTrain,]
testing <- data[-inTrain,]
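`createDataPartition` draws the training rows at random, so the split, and every number derived from it, changes between runs unless a seed is set first. When the outcome is a factor, it also samples within each class so that both sets keep roughly the same class balance. The sketch below illustrates that idea in base R. It is not caret's implementation; `stratified_split` is a hypothetical helper name and the labels are invented.

```r
# Return row indices for a training set holding a fraction p of each class.
stratified_split <- function(y, p = 0.7) {
  by_class <- split(seq_along(y), y)
  picked <- lapply(by_class, function(rows) {
    rows[sample.int(length(rows), size = round(p * length(rows)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)  # arbitrary seed, fixed only so the split is repeatable
labels <- factor(rep(c("low", "high"), times = c(70, 30)))  # invented labels
train_idx <- stratified_split(labels, p = 0.7)

# Class counts in the training and testing portions
print(table(labels[train_idx]))
print(table(labels[-train_idx]))
```

In the post's own code, calling `set.seed()` with any fixed value just before `createDataPartition` should be enough to make the split repeatable.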
Predict using different models
#list of all algorithms
predlist <- c("bagFDA","lda","LogitBoost",
"nb","nnet","rf","rpart","svmRadialCost",
"C5.0", "glm")
results <- data.frame( Algorithm=character(), Duration=numeric(), Accuracy=numeric(),
stringsAsFactors=FALSE)
#loop through algorithm list and perform model building and prediction
for (i in seq_along(predlist)) {
  pred <- predlist[i]
  print(paste("Algorithm =", pred))
  # Time training, prediction and scoring together, in whole seconds
  startTime <- as.integer(Sys.time())
  model <- train(Salary ~ ., data = training, method = pred)
  predicted <- predict(model, testing)
  # Named cm rather than matrix to avoid masking base::matrix
  cm <- confusionMatrix(predicted, testing$Salary)
  endTime <- as.integer(Sys.time())
  results[i, 1] <- pred
  results[i, 2] <- endTime - startTime
  results[i, 3] <- round(as.numeric(cm$overall["Accuracy"]) * 100, 2)
}
results
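The loop records each duration in whole seconds, while the summary of such a run is easier to read in minutes and sorted by time. A minimal sketch of that post-processing follows. `time_minutes` and `summarise_runs` are hypothetical helpers, and the `demo` data frame holds made-up numbers, not measurements from this exercise.

```r
# Time a function call with sub-second resolution; returns the call's
# value together with the elapsed wall-clock time in minutes.
time_minutes <- function(f, ...) {
  elapsed <- system.time(value <- f(...))[["elapsed"]]
  list(value = value, minutes = elapsed / 60)
}

# Sort a results data frame by duration and add a column in minutes.
summarise_runs <- function(results) {
  out <- results[order(results$Duration), ]
  out$Minutes <- round(out$Duration / 60, 1)
  rownames(out) <- NULL
  out
}

# Made-up numbers, for illustration only
demo <- data.frame(
  Algorithm = c("a", "b", "c"),
  Duration  = c(120, 6, 90),   # seconds
  Accuracy  = c(80, 81, 79),
  stringsAsFactors = FALSE
)
demo_summary <- summarise_runs(demo)
print(demo_summary)

timed <- time_minutes(sum, 1:10)
```

`time_minutes` could wrap the `train` call in the loop if finer timing than whole seconds is wanted; that is a design choice rather than something the exercise requires.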
Analyzing results
Classifier | Accuracy (%) | Execution Time (minutes) |
Linear Discriminant Analysis | 80.38 | 0.0 |
Classification and Regression Trees | 80.86 | 0.2 |
Generalized Linear Models | 80.53 | 0.2 |
Boosted Logistic Regression | 79.01 | 1.4 |
Decision Trees (C5.0) | 80.9 | 4.5 |
Naïve Bayes | 80.94 | 4.9 |
Neural Networks | 80.8 | 6.4 |
Random Forest | 80.84 | 15.1 |
Bagging (Flexible Discriminant Analysis) | 81.15 | 23.2 |
Support Vector Machines | 81.17 | 66.3 |
The table above shows the results of comparison. Some thoughts.
1. Classifiers perform differently depending on the data set. The correlation between predictors and the variability of predictions between runs both go a long way in influencing the algorithms. Sometimes all algorithms predict almost the same way.
2. The accuracies differ by at most about two percentage points (79.01 to 81.17). Are these differences statistically significant?
3. There are huge differences in execution time. This indicates that, for any data science project, trying out multiple algorithms and comparing their results is a worthwhile exercise. For similar levels of accuracy, some algorithms run far faster than others, which helps in choosing the most efficient one for the problem in question.
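One way to approach the significance question in point 2 is McNemar's test, which compares two classifiers evaluated on the same test rows by looking only at the rows where exactly one of them is right. The sketch below uses base R's `mcnemar.test`. `compare_classifiers` is a hypothetical helper, and the label vectors are invented for illustration rather than taken from the income data.

```r
# McNemar's test on the correct/incorrect agreement of two classifiers
# that were scored against the same true labels.
compare_classifiers <- function(truth, pred_a, pred_b) {
  truth <- as.character(truth)
  a_ok <- as.character(pred_a) == truth
  b_ok <- as.character(pred_b) == truth
  agreement <- table(
    A_correct = factor(a_ok, levels = c(TRUE, FALSE)),
    B_correct = factor(b_ok, levels = c(TRUE, FALSE))
  )
  mcnemar.test(agreement)
}

# Invented labels: classifier A is wrong on 10 rows, classifier B on 30,
# and 6 of those rows are shared.
truth  <- rep(c("low", "high"), times = c(60, 40))
pred_a <- truth
pred_a[1:10] <- "high"
pred_b <- truth
pred_b[5:34] <- "high"

mcnemar_result <- compare_classifiers(truth, pred_a, pred_b)
print(mcnemar_result)
```

To apply this to the exercise, the predictions of each model would have to be kept (the loop above overwrites `predicted` on every pass) and passed in pairs together with `testing$Salary`.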
Here is a further comparison of accuracy (%) across different data sets.
Classifier | Income Prediction | SPAM Filter | Credit Approval |
Linear Discriminant Analysis | 80.38 | 91.95 | 72 |
Classification and Regression Trees | 80.86 | 88.59 | 66 |
Generalized Linear Models | 80.53 | 92.62 | 71.67 |
Boosted Logistic Regression | 79.01 | 90.6 | 69.33 |
Decision Trees (C5.0) | 80.9 | 89.26 | 71 |
Naïve Bayes | 80.94 | 87.92 | 68.33 |
Neural Networks | 80.8 | 91.28 | 71.67 |
Random Forest | 80.84 | 89.93 | 70.67 |
Bagging (Flexible Discriminant Analysis) | 81.15 | 93.96 | 70.33 |
Support Vector Machines | 81.17 | 89.93 | 72.33 |