Big Data Science Practice: K Means Clustering

Sunday, October 19, 2014

K Means Clustering - Effect of random seed

When the k-means clustering algorithm runs, it picks the starting centroids of the clusters at random, so the result depends on the seed of the random number generator (see the Wikipedia article on k-means clustering).

If the feature variables form visibly separated groups, the starting centroids will usually have little or no impact on the final cluster memberships. However, if the data is evenly distributed, we might end up with different cluster members depending on the initial random centroids. An example of such behavior is shown below.

R is used for the experiment. The code to load the data and the contents of the data are as follows. We try to group the samples based on two feature variables - age and bmi.

data <- read.csv("AgeBMIEven.csv")

str(data)

## 'data.frame':    1338 obs. of  2 variables:
##  $ age: int  19 18 28 33 32 31 46 37 37 60 ...
##  $ bmi: num  27.9 33.8 33 22.7 28.9 ...

summary(data)

##       age            bmi
##  Min.   :18.0   Min.   :16.0  
##  1st Qu.:27.0   1st Qu.:26.3  
##  Median :39.0   Median :30.4  
##  Mean   :39.2   Mean   :30.7  
##  3rd Qu.:51.0   3rd Qu.:34.7  
##  Max.   :64.0   Max.   :53.1

plot(data$age, data$bmi)

As we can see from the above plot, the data points are spread almost evenly across the scatter plot, so the initial cluster centers are likely to affect the final cluster shapes and memberships. We run the clustering four times, each time grouping the data into 4 clusters, and plot the cluster outputs here.

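par(mfrow = c(2, 2))  # 2 x 2 grid: one panel per run
for (i in 1:4) {
  # No seed is set, so each run starts from different random centroids
  clusters <- kmeans(data, 4)
  # Color each point by its assigned cluster
  plot(data$age, data$bmi, col = clusters$cluster)
}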
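Plots can be hard to compare by eye, so one possible way to check whether two runs produced the same grouping is to cross-tabulate their labels. The sketch below is only an illustration: `same_partition` is a hypothetical helper, and the uniformly distributed `toy` data is a synthetic stand-in for AgeBMIEven.csv, not the real file. Cluster numbers are arbitrary, so the helper treats two labelings as equal when every cluster in one run maps to exactly one cluster in the other.

```r
# Hypothetical helper: TRUE when two label vectors describe the same grouping,
# ignoring how the clusters happen to be numbered.
same_partition <- function(a, b) {
  overlap <- table(a, b) > 0  # which pairs of clusters share any points
  all(rowSums(overlap) == 1) && all(colSums(overlap) == 1)
}

# Synthetic stand-in for the age/bmi data (an assumption, not the real file).
set.seed(1)
toy <- data.frame(age = runif(500, 18, 64), bmi = runif(500, 16, 53))

# Four independent runs, as in the loop above, keeping only the labels.
runs <- lapply(1:4, function(i) kmeans(toy, 4)$cluster)

# Compare every later run against the first one.
sapply(runs[-1], same_partition, a = runs[[1]])
```

A `FALSE` in the result means that run split the points differently from the first run, not merely that the cluster numbers were shuffled.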

Each time the clustering algorithm runs, it draws new random starting centroids, and that seems to affect the shapes and memberships of the clusters. The first two runs generate the same groups, but the next two give different groupings of the data. To get the same results every time, set the seed explicitly to a fixed value before running the clustering.
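A minimal sketch of fixing the seed, assuming synthetic uniform data in place of AgeBMIEven.csv (the `toy` data frame and the seed value 2014 are illustrative choices, not part of the original experiment). Calling `set.seed()` with the same value immediately before each `kmeans()` call makes both calls draw the same starting centroids.

```r
# Synthetic stand-in for the age/bmi data (an assumption, not the real file).
set.seed(42)
toy <- data.frame(age = runif(500, 18, 64), bmi = runif(500, 16, 53))

# Reset the seed right before each call so kmeans() picks the same
# starting centroids both times.
set.seed(2014)
first <- kmeans(toy, 4)

set.seed(2014)
second <- kmeans(toy, 4)

# Same seed and same data, so the cluster labels match exactly.
identical(first$cluster, second$cluster)
```

Setting the seed once before a loop makes the whole sequence of runs repeatable, but the individual runs inside the loop can still differ from each other. The `nstart` argument of `kmeans()` (for example `kmeans(toy, 4, nstart = 25)`) tries several random starts and keeps the best one, which may also reduce the run-to-run variation.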

Comments:

Anonymous, October 27, 2014 at 2:04 AM
This is why I use Fuzzy C-Means. By calculating the associativity of each node with each centroid and controlling the error function we can guarantee the same result and an even distribution each time.

Lucian, October 29, 2014 at 1:05 AM
Nice test. Could you please repeat the experiment with kmeans++ (http://en.wikipedia.org/wiki/K-means%2B%2B) for initial centroid positioning, to see whether the same variance of output occurs?