Monday, October 27, 2014
Predictions - Effect of unique number of target classes on accuracy
In a machine learning classification problem, the target variable is a categorical (nominal) variable that has a set of unique values, or classes. It could be a simple two-class target variable like "approve application?" with classes (values) of "yes" or "no". Sometimes the classes indicate ranges like "Excellent", "Good", etc. for a target variable like satisfaction score. We might also convert continuous variables like test scores (1 - 100) into classes like grades (A, B, C, etc.).
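As a small illustration of converting a continuous variable into classes, here is a minimal sketch that maps test scores to letter grades. The cut-off values and the helper name `score_to_grade` are assumptions made for the example; the post does not specify any particular grading scheme.

```python
from bisect import bisect_right

# Hypothetical cut-offs (assumption): lower bounds for grades D, C, B and A.
CUTOFFS = [60, 70, 80, 90]
GRADES = ["F", "D", "C", "B", "A"]

def score_to_grade(score):
    """Map a numeric test score (1 - 100) to a letter-grade class."""
    # bisect_right counts how many cut-offs the score has reached or passed.
    return GRADES[bisect_right(CUTOFFS, score)]

scores = [55, 60, 79, 80, 95]
grades = [score_to_grade(s) for s in scores]
print(grades)
```

Each cut-off is a class boundary: a score of 79 and a score of 80 differ by one point but land in different classes.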
This experiment looks at the effect of the number of unique classes in the target variable on the accuracy of the prediction. The hypothesis is that accuracy will go down as the number of classes increases. This is because, with each additional class boundary, there is an additional chance that a predicted sample ends up on the wrong side of a boundary.
For this experiment, I used a data set of blood pressure levels. Each observation contains the patient's demographics and the actual systolic blood pressure measured. The blood pressure value is then binned into multiple classes (blood pressure ranges). Prediction of the blood pressure range is then done for a varying number of bins (classes). The results are then tabulated as follows.
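The procedure described above can be sketched roughly as follows. This is not the code used for the experiment: the data is synthetic, with age as the only demographic feature; the bins are equal-width; and the classifier is a simple nearest-centroid rule. All of these are assumptions chosen to keep the example self-contained.

```python
import random

random.seed(0)

# Synthetic stand-in data (assumption): age plus a noisy systolic reading.
def make_sample():
    age = random.uniform(20, 80)
    systolic = 100 + 0.5 * age + random.gauss(0, 10)
    return age, systolic

data = [make_sample() for _ in range(2000)]
train, test = data[:1500], data[1500:]

def bin_index(value, low, high, n_bins):
    """Equal-width bin number (0 .. n_bins - 1), clamped to the training range."""
    idx = int((value - low) / (high - low) * n_bins)
    return min(max(idx, 0), n_bins - 1)

def accuracy_for(n_bins):
    """Bin the target into n_bins classes, then train and score a classifier."""
    low = min(bp for _, bp in train)
    high = max(bp for _, bp in train)
    # Nearest-centroid classifier: mean age of each class in the training set.
    ages_by_class = {}
    for age, bp in train:
        ages_by_class.setdefault(bin_index(bp, low, high, n_bins), []).append(age)
    centroids = {c: sum(a) / len(a) for c, a in ages_by_class.items()}
    hits = 0
    for age, bp in test:
        predicted = min(centroids, key=lambda c: abs(centroids[c] - age))
        hits += predicted == bin_index(bp, low, high, n_bins)
    return hits / len(test)

accuracies = {k: accuracy_for(k) for k in (2, 4, 6, 8, 10)}
for k, acc in accuracies.items():
    print(f"{k:2d} classes: accuracy {acc:.2f}")
```

The same features and the same underlying measurements are used for every run; only the number of bins changes, so any difference in accuracy comes from how finely the target is divided.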
The experiment confirms the hypothesis. Accuracy drops sharply as the number of classes in the target variable increases. The drop does taper off beyond 8 classes.