Monday, October 27, 2014

Predictions - Effect of the number of unique target classes on accuracy




When we perform classification, the target variable is a categorical (nominal) variable that has a set of unique values or classes. It could be a simple two-class target like "approve application?" with classes (values) of "yes" or "no". Sometimes the classes indicate ranges, like "Excellent", "Good" etc. for a target variable such as a satisfaction score. We might also convert continuous variables like test scores (1 - 100) into classes like grades (A, B, C etc.).
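As a quick illustration of the last case, a continuous score can be binned into grade classes with R's cut() function. The scores and break points below are made up purely for illustration.

scores <- c(95, 82, 67, 48, 73)
grades <- cut(scores, breaks = c(0, 50, 60, 70, 80, 100),
              labels = c("F", "D", "C", "B", "A"))
grades
## [1] A A C F B
## Levels: F D C B A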

This experiment looks at the effect of the number of unique classes in the target variable on the accuracy of the prediction. The hypothesis is that accuracy will go down as the number of classes increases, because with each additional class boundary there is an additional chance that a predicted sample ends up on the wrong side of a boundary.

For this experiment, I used a data set of blood pressure levels. Each observation contains the patient's demographics and the actual systolic blood pressure measured. The value of the blood pressure is then binned into multiple classes (blood pressure ranges). Prediction of the blood pressure range is then done for a varying number of bins (classes), and the results are tabulated below.
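The following is only a minimal sketch of how such an experiment could be set up in R. The file name, column names, and the choice of a decision tree (rpart) as the classifier are assumptions for illustration, not necessarily the exact setup used here.

library(rpart)

bp <- read.csv("BloodPressure.csv")   # hypothetical file: demographics + systolic value
for (k in c(2, 4, 6, 8, 10)) {
  bp$bpClass <- cut(bp$systolic, breaks = k)       # bin the continuous target into k classes
  train <- sample(nrow(bp), floor(0.7 * nrow(bp))) # simple 70/30 train/test split
  model <- rpart(bpClass ~ . - systolic, data = bp[train, ], method = "class")
  pred  <- predict(model, bp[-train, ], type = "class")
  cat(k, "classes -> accuracy", round(mean(pred == bp$bpClass[-train]), 3), "\n")
}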


The experiment confirms the hypothesis. Accuracy drops sharply as the number of classes in the target variable increases, though it tapers off beyond a size of about 8.






Sunday, October 19, 2014

K Means Clustering - Effect of random seed


When the k-means clustering algorithm runs, it uses a randomly generated seed to determine the starting centroids of the clusters (see the Wikipedia article on k-means clustering).

If the feature variables exhibit patterns that naturally group them into visible clusters, then the starting seed will not have an impact on the final cluster memberships. However, if the data is evenly distributed, we might end up with different cluster memberships depending on the initial random seed. An example of such behavior is shown below.

R is used for the experiment. The code to load the data and the contents of the data are as follows. We try to group the samples based on two feature variables - age and bmi.

data <- read.csv("AgeBMIEven.csv")   # load the age/BMI observations
str(data)
## 'data.frame':    1338 obs. of  2 variables:
##  $ age: int  19 18 28 33 32 31 46 37 37 60 ...
##  $ bmi: num  27.9 33.8 33 22.7 28.9 ...
summary(data)
##       age            bmi      
##  Min.   :18.0   Min.   :16.0  
##  1st Qu.:27.0   1st Qu.:26.3  
##  Median :39.0   Median :30.4  
##  Mean   :39.2   Mean   :30.7  
##  3rd Qu.:51.0   3rd Qu.:34.7  
##  Max.   :64.0   Max.   :53.1
plot(data$age, data$bmi)
[Figure: scatter plot of bmi vs. age]

As we can see from the above plot, the data points are distributed almost evenly across the scatter plot, so the initial cluster center positions will affect the final cluster shapes and memberships. We run the clustering 4 times, grouping the data into 4 clusters each time, and plot the cluster outputs here.

par(mfrow = c(2, 2))                         # arrange the four plots in a 2x2 grid
for (i in 1:4) {
  clusters <- kmeans(data, 4)                # new random starting centroids each run
  plot(data$age, data$bmi, col = clusters$cluster)
}

[Figure: 2x2 grid of cluster plots from the four k-means runs]


Each time the clustering algorithm runs, it picks a new random seed, and that seems to impact the shapes and memberships of the clusters. The first two runs generate the same groups, but the next two give different groupings of the data. Setting the seed explicitly to a specific value is required to generate the same results every time.
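A minimal sketch of the same loop with the seed fixed before each run (42 is an arbitrary choice); every iteration now starts from the same centroids and produces identical clusters.

par(mfrow = c(2, 2))
for (i in 1:4) {
  set.seed(42)                               # same seed -> same starting centroids every run
  clusters <- kmeans(data, 4)
  plot(data$age, data$bmi, col = clusters$cluster)
}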