This is an exercise to compare the performance of different machine learning algorithms (classifiers) in terms of both accuracy and speed. It is not a comprehensive study; it's merely a trial at devising a process for comparing performance. The caret package for R helps here, since it can run different classifiers by changing a single parameter.

The data used is an income data set that contains details about individuals - age, years of education, gender and work hours per week. It also has their salary range as a binary variable - whether they earn more than 50K or not. The goal of the classifier is to predict the income range based on the other attributes. R is used for the exercise.

### Loading and exploring the data set

data <- read.csv("incomedata.csv", stringsAsFactors = TRUE)  # keep text columns as factors (the default before R 4.0)

str(data)

Let us look at the correlation between the predictors and the target.

library(psych)

pairs.panels(data)

Figure: Correlation plot of the predictors and Salary

All four predictors are weak, with correlation coefficients against Salary in the 0.2 range. It will therefore be interesting to see how different algorithms perform on this data.
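The coefficients shown in the plot can also be computed directly. The following is a minimal sketch, assuming numeric predictors and a target coded as 0/1. The data frame `toy`, its column names, and the simulated relationship are made up for illustration and are not the income data; a factor predictor such as gender would first need a numeric encoding.

```r
# Hypothetical toy data standing in for the income data set.
set.seed(42)
n <- 500
toy <- data.frame(
  Age       = round(runif(n, 18, 65)),
  Education = round(runif(n, 8, 20)),
  WorkHours = round(runif(n, 20, 60))
)

# Simulated target, weakly related to each predictor (an assumption).
score <- 0.03 * toy$Age + 0.10 * toy$Education + 0.02 * toy$WorkHours + rnorm(n)
toy$Salary <- as.integer(score > median(score))

# Pearson correlation of each predictor with the 0/1 target
# (equivalent to the point-biserial correlation).
predictors <- setdiff(names(toy), "Salary")
target_cor <- sapply(toy[predictors], function(x) cor(x, toy$Salary))
print(round(target_cor, 2))
```

Sorting the result by absolute value gives a quick ranking of which predictors carry the most linear signal.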

### Model Building and Prediction

#### Split training and testing sets

library(caret)

inTrain <- createDataPartition(y=data$Salary,p=0.7,list=FALSE)

training <- data[inTrain,]

testing <- data[-inTrain,]
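When the outcome is a factor, `createDataPartition` samples within each class, so the training and testing sets keep roughly the same class proportions. The base R sketch below illustrates that idea on made-up data; it is not caret's implementation, and the `toy` data frame is hypothetical. It also fixes the random seed, since without one the split (and hence the accuracy figures) changes on every run.

```r
# Hypothetical toy data: an imbalanced two-class target.
set.seed(123)  # fix the seed so the split is reproducible
toy <- data.frame(
  x = rnorm(200),
  Salary = factor(rep(c("<=50K", ">50K"), times = c(150, 50)))
)

# Sample 70% of the row indices within each class.
idx_by_class <- split(seq_len(nrow(toy)), toy$Salary)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions should match between the two sets.
print(prop.table(table(toy_train$Salary)))
print(prop.table(table(toy_test$Salary)))
```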

#### Predict using different models

**#list of all algorithms**

predlist <- c("bagFDA","lda","LogitBoost",

"nb","nnet","rf","rpart","svmRadialCost",

"C5.0", "glm")

results <- data.frame( Algorithm=character(), Duration=numeric(), Accuracy=numeric(),

stringsAsFactors=FALSE)

**#loop through algorithm list and perform model building and prediction**

for (i in seq_along(predlist)) {

  pred <- predlist[i]

  print(paste("Algorithm =", pred))

  startTime <- Sys.time()  # wall-clock start; covers training, prediction and scoring

  model <- train(Salary ~ ., data = training, method = pred)

  predicted <- predict(model, testing)

  cm <- confusionMatrix(predicted, testing$Salary)  # named cm to avoid masking base::matrix

  endTime <- Sys.time()

  results[i, 1] <- pred

  results[i, 2] <- as.numeric(difftime(endTime, startTime, units = "secs"))  # seconds; the table below reports minutes

  results[i, 3] <- round(as.numeric(cm$overall["Accuracy"]) * 100, 2)  # accuracy as a percentage

}

results
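Timing can also be taken with `system.time`, which reports elapsed time at sub-second resolution, and wrapping each fit in `tryCatch` lets a loop continue when one method fails (for example, because a required package is missing). The sketch below uses a hypothetical helper, `time_fit`, and a stand-in `glm` model on the built-in `mtcars` data so that it runs without caret; it is an illustration, not part of the original exercise.

```r
# Hypothetical helper: run a fitting function and return its result plus the
# elapsed wall-clock time in seconds; an error yields NA instead of aborting.
time_fit <- function(fit_fun) {
  fit <- NULL
  elapsed <- tryCatch(
    system.time(fit <- fit_fun())[["elapsed"]],
    error = function(e) NA_real_
  )
  list(fit = fit, seconds = elapsed)
}

# Usage with a stand-in model (logistic regression on built-in data).
ok  <- time_fit(function() glm(am ~ wt, data = mtcars, family = binomial))
bad <- time_fit(function() stop("method not installed"))

print(ok$seconds)   # a small non-negative number
print(bad$seconds)  # NA; a loop could move on to the next method
```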

### Analyzing results

| Classifier | Accuracy (%) | Execution Time (minutes) |
|---|---|---|
| Linear Discriminant Analysis | 80.38 | 0.0 |
| Classification and Regression Trees | 80.86 | 0.2 |
| Generalized Linear Models | 80.53 | 0.2 |
| Boosted Logistic Regression | 79.01 | 1.4 |
| Decision Trees (C5.0) | 80.9 | 4.5 |
| Naïve Bayes | 80.94 | 4.9 |
| Neural Networks | 80.8 | 6.4 |
| Random Forest | 80.84 | 15.1 |
| Bagging (Flexible Discriminant Analysis) | 81.15 | 23.2 |
| Support Vector Machines | 81.17 | 66.3 |

The table above shows the results of the comparison. Some observations:

1. Classifiers perform differently depending on the data set. The correlation among the predictors and the variability of predictions between runs strongly influence how each algorithm fares. Sometimes all algorithms predict in much the same way.

2. The accuracies differ by at most about two percentage points (79.01% to 81.17%). Whether these differences are statistically significant remains an open question.

3. Execution times differ enormously, from well under a minute for Linear Discriminant Analysis to 66.3 minutes for Support Vector Machines. For any data science project, trying out multiple algorithms and comparing their results is therefore a worthwhile exercise: at similar levels of accuracy, some run far faster than others, which helps in choosing the most efficient algorithm for the problem at hand.
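One possible way to approach the significance question in point 2 is to put a confidence interval around each accuracy and to run McNemar's test on the paired predictions of two classifiers. The sketch below uses simulated labels and predictions; the test set size, the class balance, and the two accuracy levels are assumptions chosen for illustration, not the results reported above.

```r
# Hypothetical data: true labels and predictions from two classifiers
# on the same test set (simulated, not the author's results).
set.seed(1)
n <- 2000
truth  <- rbinom(n, 1, 0.25)
pred_a <- ifelse(runif(n) < 0.81, truth, 1 - truth)  # about 81% accurate
pred_b <- ifelse(runif(n) < 0.79, truth, 1 - truth)  # about 79% accurate

correct_a <- pred_a == truth
correct_b <- pred_b == truth

# Exact binomial 95% confidence interval for each accuracy.
ci_a <- binom.test(sum(correct_a), n)$conf.int
ci_b <- binom.test(sum(correct_b), n)$conf.int
print(round(rbind(A = ci_a, B = ci_b), 3))

# McNemar's test on the paired correct/incorrect outcomes.
paired <- table(A = factor(correct_a, levels = c(FALSE, TRUE)),
                B = factor(correct_b, levels = c(FALSE, TRUE)))
test <- mcnemar.test(paired)
print(test$p.value)
```

If the intervals overlap heavily, or the McNemar p-value is large, a difference of a point or two in accuracy is hard to distinguish from noise, which makes execution time a reasonable tie-breaker.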

Here is a further comparison of accuracy (%) across different data sets.

| Classifier | Income Prediction | SPAM Filter | Credit Approval |
|---|---|---|---|
| Linear Discriminant Analysis | 80.38 | 91.95 | 72 |
| Classification and Regression Trees | 80.86 | 88.59 | 66 |
| Generalized Linear Models | 80.53 | 92.62 | 71.67 |
| Boosted Logistic Regression | 79.01 | 90.6 | 69.33 |
| Decision Trees (C5.0) | 80.9 | 89.26 | 71 |
| Naïve Bayes | 80.94 | 87.92 | 68.33 |
| Neural Networks | 80.8 | 91.28 | 71.67 |
| Random Forest | 80.84 | 89.93 | 70.67 |
| Bagging (Flexible Discriminant Analysis) | 81.15 | 93.96 | 70.33 |
| Support Vector Machines | 81.17 | 89.93 | 72.33 |
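A simple way to summarize a table like this is to rank the classifiers within each data set and average the ranks. The sketch below copies the accuracy figures from the table above; the shortened classifier names and the mean-rank summary are choices made here for illustration, not part of the original analysis.

```r
# Accuracy (%) copied from the table above; classifier names abbreviated.
acc <- data.frame(
  Classifier = c("LDA", "CART", "GLM", "LogitBoost", "C5.0",
                 "Naive Bayes", "Neural Net", "Random Forest",
                 "Bagged FDA", "SVM"),
  Income = c(80.38, 80.86, 80.53, 79.01, 80.90, 80.94, 80.80, 80.84, 81.15, 81.17),
  Spam   = c(91.95, 88.59, 92.62, 90.60, 89.26, 87.92, 91.28, 89.93, 93.96, 89.93),
  Credit = c(72.00, 66.00, 71.67, 69.33, 71.00, 68.33, 71.67, 70.67, 70.33, 72.33)
)

datasets <- c("Income", "Spam", "Credit")

# Rank within each data set (1 = most accurate; ties share the average rank).
ranks <- sapply(acc[datasets], function(x) rank(-x))
acc$MeanRank <- rowMeans(ranks)

# Best classifier per data set, and the overall ordering by mean rank.
best <- sapply(acc[datasets], function(x) acc$Classifier[which.max(x)])
print(best)
print(acc[order(acc$MeanRank), c("Classifier", "MeanRank")])
```

No single classifier is best on all three data sets, which is consistent with the observation that performance depends on the data.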
