Big Data Science Practice: Impact of target class proportions on accuracy of classification

Friday, March 20, 2015

Impact of target class proportions on accuracy of classification

When we build classification models from training data, the proportion of target classes affects the accuracy of predictions. This experiment measures how large that effect is.

Let us say you are trying to predict which visitors to your website will buy a product. You collect historical data about the visitors' characteristics and actions, and also whether they bought something or not. This is the model building data set. The "Buy Decision" variable becomes the target variable we are trying to predict. It has two possible values: "yes" and "no". If 70% of the records in the training data set have the value "no", then the proportion of classes is 70-30 between "no" and "yes".
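As a small illustration, the class proportion can be read directly off the target column. This is only a sketch: it assumes the labels are held in a plain Python list, and the `class_proportions` helper and the sample data are hypothetical, not taken from the original experiment.

```python
from collections import Counter


def class_proportions(labels):
    """Return each class's share of the records as a fraction of the total."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical "Buy Decision" column: 7 "no" records and 3 "yes" records,
# which corresponds to the 70-30 split described above.
buy_decision = ["no"] * 7 + ["yes"] * 3
proportions = class_proportions(buy_decision)
print(proportions)
```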

If we build a model using this data set, how does this proportion affect the overall accuracy of the model's predictions? Will the accuracy be higher if the ratio is 50-50 than if it is 90-10? To test this, we ran multiple classification iterations on a base data set. For each iteration, we drew a random subset of the base data set with a different proportion of "no" to "yes". The total number of records was the same for all iterations. We then split the subset into training and testing sets, each retaining the same proportion of class values. Finally, we built a classification model on the training set and used it to predict the test set. For each iteration, we measured the following:

Overall accuracy
Accuracy of "No" predictions - how well we predict "No"
Accuracy of "Yes" predictions - how well we predict "Yes".
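The sampling, splitting and measuring steps of one iteration could be sketched roughly as follows. This is not the code used for the experiment. It assumes each record is a `(features, label)` pair, it reads "accuracy of Yes predictions" as the share of actual "yes" records that were predicted correctly, and it leaves out the classifier itself because the post does not say which one was used. All function names are hypothetical.

```python
import random


def sample_with_proportion(records, yes_fraction, total, rng):
    """Draw `total` records so that `yes_fraction` of them are labelled "yes"."""
    yes_rows = [r for r in records if r[1] == "yes"]
    no_rows = [r for r in records if r[1] == "no"]
    n_yes = round(total * yes_fraction)
    return rng.sample(yes_rows, n_yes) + rng.sample(no_rows, total - n_yes)


def stratified_split(records, test_fraction, rng):
    """Split into train and test sets, keeping the class mix the same in both."""
    train, test = [], []
    for label in ("yes", "no"):
        rows = [r for r in records if r[1] == label]
        rng.shuffle(rows)
        n_test = round(len(rows) * test_fraction)
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test


def accuracy_report(actual, predicted):
    """Return overall accuracy plus the accuracy within each actual class."""
    pairs = list(zip(actual, predicted))
    report = {"overall": sum(a == p for a, p in pairs) / len(pairs)}
    for label in ("yes", "no"):
        hits = [a == p for a, p in pairs if a == label]
        report[label] = sum(hits) / len(hits) if hits else float("nan")
    return report


rng = random.Random(0)
# Hypothetical base data: 1,000 "yes" and 1,000 "no" records, one dummy feature each.
base = [((i,), "yes") for i in range(1000)] + [((i,), "no") for i in range(1000)]

# One iteration at 20% "yes": same total size, then a proportion-preserving split.
subset = sample_with_proportion(base, yes_fraction=0.2, total=500, rng=rng)
train, test = stratified_split(subset, test_fraction=0.3, rng=rng)

# Hand-written predictions stand in for a real model's output on a test set.
actual = ["no", "no", "no", "no", "yes"]
predicted = ["no", "no", "no", "yes", "no"]
report = accuracy_report(actual, predicted)
print(report)
```

Repeating the sampling step with `yes_fraction` set to each value of interest, and fitting a model on `train` before scoring it on `test`, would give one set of three accuracy figures per iteration.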

The results are shown in this chart. The X-axis shows the percentage of "Yes" records in the data for that iteration. The three lines show the accuracy levels being measured.

The findings are as follows:

1. When the proportion of a specific class is high, its prediction accuracy is also very high. Conversely, when the proportion of that class is low, its accuracy is also very low. The larger class "biases" the model towards itself because it has more samples in the training data set.

2. The overall accuracy is higher when one class has a larger proportion than the other, and lower when the classes are in equal proportion. This is because the larger class skews the accuracy computation towards itself: it contributes more records to both the numerator (correct predictions) and the denominator (all predictions).
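The arithmetic behind this skew can be made explicit. If per-class accuracy means the share of each actual class that is predicted correctly (an assumption about how the figures were computed), then overall accuracy is the average of the per-class accuracies weighted by class share. The sketch below uses a deliberately extreme, hypothetical model that always predicts "no"; it is not the model from the experiment.

```python
def overall_accuracy(yes_share, yes_accuracy, no_accuracy):
    """Overall accuracy as the class-share-weighted mean of per-class accuracies."""
    return yes_share * yes_accuracy + (1 - yes_share) * no_accuracy


# A model that always answers "no" gets every "no" right and every "yes" wrong,
# so its overall accuracy depends only on how rare "yes" is in the data.
always_no = {
    yes_share: overall_accuracy(yes_share, yes_accuracy=0.0, no_accuracy=1.0)
    for yes_share in (0.1, 0.3, 0.5)
}
for yes_share, accuracy in always_no.items():
    print(f"yes share {yes_share:.0%}: overall accuracy {accuracy:.0%}")
```

With a 90-10 split this model scores 90% overall while never identifying a single buyer, which is why overall accuracy on its own can look better on skewed data.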

3. When the proportions are equal, all three accuracy levels are the same. Although this level is lower, it might be the desired equilibrium because the model predicts all classes equally well.

This shows that we should be sensitive to the target class proportions in the data set. To build models, it is recommended that we choose a data set with equal proportions of all classes. This way, the model equally "represents" the characteristics of each class.
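One simple way to obtain equal proportions is to randomly undersample the larger class until it matches the smaller one. The sketch below is one possible approach under that assumption, not necessarily how the data sets in this experiment were built. It again treats each record as a `(features, label)` pair, and `undersample_to_balance` is a hypothetical helper. Note that undersampling discards data from the larger class.

```python
import random


def undersample_to_balance(records, rng):
    """Randomly drop records from larger classes until all match the smallest."""
    by_label = {}
    for record in records:
        by_label.setdefault(record[1], []).append(record)
    smallest = min(len(rows) for rows in by_label.values())
    balanced = []
    for rows in by_label.values():
        balanced.extend(rng.sample(rows, smallest))
    rng.shuffle(balanced)  # avoid leaving the classes grouped together
    return balanced


rng = random.Random(42)
# Hypothetical 90-10 data set: 90 "no" records and 10 "yes" records.
skewed = [((i,), "no") for i in range(90)] + [((i,), "yes") for i in range(10)]
balanced = undersample_to_balance(skewed, rng)

yes_count = sum(1 for r in balanced if r[1] == "yes")
no_count = sum(1 for r in balanced if r[1] == "no")
print(len(balanced), yes_count, no_count)
```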

Comments:

Anonymous, March 21, 2015 at 6:07 PM
Thanks for this.

Wondering if there is an optimal lower bound: a minimum training set size that works across all classes?
Sort of a minimal spanning tree, one that is sufficient to describe the class without shared characteristics blurring the edges?

Does it come to complete enumeration of valid values to match potential class members,
then something like Bayes to produce a most likely weighted result - or support multiple membership/multimodal?

Surely crisply defined classes are easier to detect - dogs, cats vs. all mammals which bring in whales, etc.

Anonymous, March 23, 2015 at 12:29 PM
This is a well-known issue called the class imbalance problem. Just google it: there are tens of thousands of papers on it, with recommendations on how to deal with it.