Tuesday, November 4, 2014

Exercise to compare classifier performance


This is an exercise to compare the performance of different machine learning algorithms (classifiers) in terms of both accuracy and speed. It is not a comprehensive study; it's merely a trial to come up with a process for comparing performance. The caret package in R helps here, since it can run different classifiers with just a change of one parameter (the method argument of train).

The data used is an income data set that contains details about individuals - age, years of education, gender and work hours per week. It also has their salary range as a binary variable - whether they earn more than 50K or not. The goal of the classifier is to predict the income range based on the other attributes. R is used for the exercise.


Loading and exploring the data set


#load the income data set and inspect its structure
data <- read.csv("incomedata.csv")
str(data)
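Since caret's train() treats a factor outcome as a classification problem, it is worth confirming up front that the target column is a factor. A minimal check, assuming the column is named Salary as in the code below:

#coerce the target to a factor if read.csv left it as character
if (!is.factor(data$Salary)) {
  data$Salary <- as.factor(data$Salary)
}
table(data$Salary)   #class counts for the binary target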

Let us look at the correlation between the predictors and the target.
library(psych)
pairs.panels(data)   #scatterplot matrix with correlation coefficients

[Correlation plot of the predictors and Salary, produced by pairs.panels()]
All four predictors are weak, with correlation coefficients against Salary only in the twenties (roughly 0.2 to 0.3). So it will be interesting to see how different algorithms perform on this data.
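To see those coefficients numerically rather than reading them off the plot, the binary target can be coded as 0/1 and correlated with the numeric predictors. A minimal sketch (only the Salary column name comes from the code in this post; everything else is generic):

#point-biserial correlations of the numeric predictors with Salary
salary01 <- as.numeric(data$Salary == levels(data$Salary)[2])
num_cols <- sapply(data, is.numeric)
round(cor(data[, num_cols], salary01), 2)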


Model Building and Prediction


Split training and testing sets

library(caret)
inTrain <- createDataPartition(y=data$Salary,p=0.7,list=FALSE)
training <- data[inTrain,]
testing <- data[-inTrain,]
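One small refinement not in the original code: seed the random number generator before createDataPartition so the split is reproducible, and confirm that the stratified split preserved the class proportions. A sketch (the seed value is arbitrary):

set.seed(1234)   #any fixed seed makes the partition reproducible
inTrain <- createDataPartition(y=data$Salary,p=0.7,list=FALSE)
training <- data[inTrain,]
testing <- data[-inTrain,]
prop.table(table(training$Salary))   #class proportions in training...
prop.table(table(testing$Salary))    #...should closely match testing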

Predict using different models

#list of all algorithms (caret method names)
predlist <- c("bagFDA","lda","LogitBoost",
              "nb","nnet","rf","rpart","svmRadialCost",
              "C5.0", "glm")

results <- data.frame( Algorithm=character(), Duration=numeric(), Accuracy=numeric(),
                       stringsAsFactors=FALSE)

#loop through algorithm list, build each model, predict on the test set,
#and record execution time (in minutes) and accuracy

for (i in 1:length(predlist)) {
  pred <- predlist[i]
  print(paste("Algorithm = ",pred ))
  startTime <- as.integer(Sys.time())
  
  model <- train( Salary ~ ., data=training, method=pred)
  predicted <- predict(model, testing)
  cm <- confusionMatrix(predicted, testing$Salary)
  endTime <- as.integer(Sys.time())
  
  results[i,1] <- pred
  results[i,2] <- round((endTime-startTime)/60, 1)                   #duration in minutes
  results[i,3] <- round(as.numeric(cm$overall["Accuracy"]) * 100, 2) #accuracy in percent
  
}

results
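A small optional follow-up, not in the original post: order the table by accuracy and plot accuracy against execution time, which makes the trade-off discussed below easier to see.

#sort by accuracy (best first), breaking ties by speed
results[order(-results$Accuracy, results$Duration), ]

#quick base-R plot of the accuracy vs. time trade-off
plot(results$Duration, results$Accuracy,
     xlab="Execution time (minutes)", ylab="Accuracy (%)",
     main="Classifier accuracy vs. execution time")
text(results$Duration, results$Accuracy, labels=results$Algorithm, pos=3, cex=0.7)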

Analyzing results

Classifier                                  Accuracy (%)   Execution Time (minutes)
Linear Discriminant Analysis                    80.38              0.0
Classification and Regression Trees             80.86              0.2
Generalized Linear Models                       80.53              0.2
Boosted Logistic Regression                     79.01              1.4
Decision Trees (C5.0)                           80.90              4.5
Naïve Bayes                                     80.94              4.9
Neural Networks                                 80.80              6.4
Random Forest                                   80.84             15.1
Bagging (Flexible Discriminant Analysis)        81.15             23.2
Support Vector Machines                         81.17             66.3

The table above shows the results of the comparison. Some thoughts:

1. Different classifiers perform differently depending on the data set. How strongly the predictors correlate with the target, and how much the predictions vary between runs, go a long way in determining how the algorithms rank; on some data sets all of the algorithms end up predicting much the same way.

2. The accuracy differences between many of them are only a point or two. Are these differences statistically significant? One quick way to check is sketched after this list.

3. There are huge differences in execution times. For similar levels of accuracy, some algorithms run orders of magnitude faster. This indicates that for any data science project, trying out multiple algorithms and comparing their results is a worthwhile exercise, and it helps in choosing the most efficient algorithm for the problem in question.
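On question 2, since every classifier is scored on the same test set, a paired comparison such as McNemar's test on two models' right/wrong indicators is a reasonable first check. A minimal sketch: the prediction vectors below are assumptions, since the loop above would have to keep each model's test-set predictions (for example in a named list called preds) for this to run.

#assumed: preds[["svmRadialCost"]] and preds[["lda"]] hold each model's test-set predictions
svm_right <- preds[["svmRadialCost"]] == testing$Salary   #TRUE where SVM was correct
lda_right <- preds[["lda"]] == testing$Salary             #TRUE where LDA was correct

#McNemar's test focuses on the cases where the two models disagree
mcnemar.test(table(svm_right, lda_right))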

Here is a further comparison of accuracy (%) for the same classifiers on different data sets.


Classifier                                  Income Prediction   SPAM Filter   Credit Approval
Linear Discriminant Analysis                     80.38              91.95          72.00
Classification and Regression Trees              80.86              88.59          66.00
Generalized Linear Models                        80.53              92.62          71.67
Boosted Logistic Regression                      79.01              90.60          69.33
Decision Trees (C5.0)                            80.90              89.26          71.00
Naïve Bayes                                      80.94              87.92          68.33
Neural Networks                                  80.80              91.28          71.67
Random Forest                                    80.84              89.93          70.67
Bagging (Flexible Discriminant Analysis)         81.15              93.96          70.33
Support Vector Machines                          81.17              89.93          72.33
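To produce a table like this, the comparison loop above can be wrapped in a small helper and run once per data set. A minimal sketch; the function name, the spam and credit data frames, and their target column names are all assumptions, not part of the original post.

#hypothetical helper: run the same split/train/score steps on any data frame
compare_classifiers <- function(df, target, methods=predlist, p=0.7) {
  idx <- createDataPartition(y=df[[target]], p=p, list=FALSE)
  train_set <- df[idx,]
  test_set  <- df[-idx,]
  form <- as.formula(paste(target, "~ ."))

  sapply(methods, function(m) {
    model <- train(form, data=train_set, method=m)
    cm <- confusionMatrix(predict(model, test_set), test_set[[target]])
    round(as.numeric(cm$overall["Accuracy"]) * 100, 2)
  })
}

#usage (data frame and column names here are assumptions):
#compare_classifiers(spamdata, "type")
#compare_classifiers(creditdata, "approved")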
