Friday, March 20, 2015

Impact of target class proportions on accuracy of classification


When we build classification models from training data, the proportion of target classes affects the accuracy of predictions. This experiment measures how large that effect is.

Let us say you are trying to predict which visitors to your website will buy a product. You collect historical data about each visitor's characteristics and actions, and also whether they bought something or not. This is the model-building data set. The "Buy Decision" variable is the target variable we are trying to predict. It has two possible values: "yes" and "no". If 70% of the records in the training data set have "no" in them, then the proportion of classes is 70-30 between "no" and "yes".

If we build a model using this data set, what is the impact of this proportion on the overall accuracy of the model's predictions? Will the accuracy be higher if the ratio is 50-50 than if it is 90-10? To test this, we ran multiple iterations of classification. For each iteration, we drew a random data set from a base data set, with a different proportion of "no" to "yes" each time. The total number of records remained the same for all iterations. We then split the data set into training and testing sets that retained the same proportion of class values, built a classification model on the training set, and predicted the test set. For each iteration, we measured the following:
  • Overall accuracy
  • Accuracy of "No" predictions - how well we predict "No"
  • Accuracy of "Yes" predictions - how well we predict "Yes".
The results are shown in this chart. The X-axis shows the percentage of "Yes" records in the data for that iteration. The three lines show the accuracy levels being measured.
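
The procedure described above can be sketched in Python using only the standard library. The post does not name its data set or its classifier, so the synthetic one-feature data (`make_base_data`) and the simple threshold model (`fit_threshold`) below are illustrative stand-ins, and every function name is hypothetical. Per-class accuracy is assumed here to mean the fraction of actual records of a class that were predicted correctly.

```python
import random

def make_base_data(n_per_class, rng):
    """Assumed synthetic base data: (feature, label) pairs, one numeric feature."""
    no = [(rng.gauss(0.0, 1.0), "no") for _ in range(n_per_class)]
    yes = [(rng.gauss(1.5, 1.0), "yes") for _ in range(n_per_class)]
    return no + yes

def sample_with_proportion(records, yes_fraction, total, rng):
    """Draw `total` records with the requested share of "yes" labels."""
    yes = [r for r in records if r[1] == "yes"]
    no = [r for r in records if r[1] == "no"]
    n_yes = round(total * yes_fraction)
    return rng.sample(yes, n_yes) + rng.sample(no, total - n_yes)

def stratified_split(records, test_fraction, rng):
    """Split into (train, test) so both keep the same class proportions."""
    train, test = [], []
    for label in ("yes", "no"):
        group = [r for r in records if r[1] == label]
        rng.shuffle(group)
        cut = round(len(group) * test_fraction)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

def fit_threshold(train):
    """Stand-in model: predict "yes" when feature > t, with t chosen to
    maximize accuracy on the training set."""
    values = sorted(x for x, _ in train)
    candidates = [values[0] - 1.0] + values  # first candidate means "always yes"
    best_t, best_correct = None, -1
    for t in candidates:
        correct = sum((x > t) == (y == "yes") for x, y in train)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

def per_class_accuracy(actual, predicted):
    """Overall accuracy plus, per class, the share of actual records predicted correctly."""
    result = {"overall": sum(a == p for a, p in zip(actual, predicted)) / len(actual)}
    for label in ("yes", "no"):
        pairs = [(a, p) for a, p in zip(actual, predicted) if a == label]
        result[label] = sum(a == p for a, p in pairs) / len(pairs) if pairs else float("nan")
    return result

def run_experiment(total=1000, test_fraction=0.3, seed=0):
    """One iteration per "yes" percentage from 10 to 90, same total each time."""
    rng = random.Random(seed)
    base = make_base_data(total, rng)
    rows = []
    for pct_yes in range(10, 100, 10):
        data = sample_with_proportion(base, pct_yes / 100, total, rng)
        train, test = stratified_split(data, test_fraction, rng)
        t = fit_threshold(train)
        actual = [y for _, y in test]
        predicted = ["yes" if x > t else "no" for x, _ in test]
        rows.append((pct_yes, per_class_accuracy(actual, predicted)))
    return rows

if __name__ == "__main__":
    for pct_yes, acc in run_experiment():
        print(pct_yes, acc)
```

Each row pairs the percentage of "yes" records with the three accuracy figures, which is the shape of data the chart plots. The numbers this sketch produces come from synthetic data and should not be read as the experiment's results.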



The findings are as follows:

1. When the proportion of a specific class is high, its prediction accuracy is also very high. Conversely, if the proportion of that class is low, its accuracy is also very low. This shows that the larger class "biases" the model towards it, since it has more samples in the training data set.

2. The overall accuracy is higher when one of the classes has a higher proportion than the other. It is lower when the classes are of equal proportion. This is again because the larger class skews the accuracy computation towards it, since it has more representation in both the numerator and the denominator.

3. When the proportions are equal, all three accuracy levels are the same. While this is a lower level, this might be the desired equilibrium because we have a model that can predict all classes equally well.
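
The skew described in finding 2 can be made concrete with a little arithmetic. If per-class accuracy is taken to mean the fraction of actual records of a class that are predicted correctly (an assumption about how the measurements were defined), then overall accuracy is the average of the two class accuracies weighted by class proportion in the test set. The function name and the input numbers below are hypothetical, chosen only to illustrate the weighting.

```python
def overall_accuracy(yes_fraction, yes_accuracy, no_accuracy):
    """Overall accuracy as the proportion-weighted mean of per-class accuracies."""
    return yes_fraction * yes_accuracy + (1 - yes_fraction) * no_accuracy

# Hypothetical 10-90 split: the weak "yes" accuracy barely dents the overall figure.
print(round(overall_accuracy(0.10, 0.40, 0.95), 3))

# Hypothetical 50-50 split with equal class accuracies: overall matches both.
print(round(overall_accuracy(0.50, 0.75, 0.75), 3))
```

With a 10-90 split, the majority class contributes 90% of the weight, so a high overall figure can hide poor accuracy on the minority class.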

This shows that we should be sensitive to the target class proportions in the data set. To build models, it's recommended that we choose a data set that has equal proportions of all classes. This way the model equally "represents" the characteristics of each class.
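
One way to obtain equal proportions is to downsample the larger classes to the size of the smallest one. The post does not say how its balanced sets were drawn, so the sketch below is only one possible approach; `balance_by_downsampling` and its `label_of` parameter are hypothetical names.

```python
import random
from collections import defaultdict

def balance_by_downsampling(records, label_of, rng=None):
    """Return a shuffled subset of `records` with equal counts of every class.

    `label_of` is a function that extracts the class label from a record.
    Larger classes are randomly cut down to the size of the smallest class.
    """
    rng = rng or random.Random()
    groups = defaultdict(list)
    for record in records:
        groups[label_of(record)].append(record)
    smallest = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Usage: a 70-30 data set becomes 30-30.
visitors = [("visitor", "no")] * 70 + [("visitor", "yes")] * 30
balanced = balance_by_downsampling(visitors, lambda r: r[1], random.Random(0))
print(len(balanced))
```

Downsampling discards records from the larger class, so it trades data volume for balance.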




16 comments:

  1. Thanks for this.

    Wondering if there is an optimal lower bound - a number which is the optimal training set, to use across all classes?
    Sort of a minimal spanning tree, which is sufficient in describing the class without sharing characteristics blurring edges?

    Does it come to complete enumeration of valid values to match potential class members,
    then something like Bayes to produce a most likely weighted result - or support multiple membership/multimodal?

    Surely crisply defined classes are easier to detect - dogs, cats vs. all mammals which bring in whales, etc.

  2. This is a well-known issue called the class imbalance problem. Just google it - there are tens of thousands of papers on it with recommendations on how to deal with it.
