How do I determine k when using k-means clustering?

Question

I've been studying k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?

User · Answer

I'm surprised nobody has mentioned this excellent article: http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf

After following several other suggestions, I finally came across this article while reading this blog: https://datasciencelab.wordpress.com/2014/01/21/selection-of-k-in-k-means-clustering-reloaded/

After that I implemented it in Scala, an implementation which, for my use cases, provides really good results. Here's the code:

    import breeze.linalg.DenseVector
    import Kmeans.{Features, _}
    import nak.cluster.{Kmeans => NakKmeans}

    import scala.collection.immutable.IndexedSeq
    import scala.collection.mutable.ListBuffer

    /*
    https://datasciencelab.wordpress.com/2014/01/21/selection-of-k-in-k-means-clustering-reloaded/
     */
    class Kmeans(features: Features) {
      // Returns (f(k), alpha, dispersion, centroids) for a given k.
      def fkAlphaDispersionCentroids(k: Int, dispersionOfKMinus1: Double = 0d, alphaOfKMinus1: Double = 1d): (Double, Double, Double, Features) = {
        if (1 == k || 0d == dispersionOfKMinus1) (1d, 1d, 1d, Vector.empty)
        else {
          val featureDimensions = features.headOption.map(_.size).getOrElse(1)
          val (dispersion, centroids: Features) = new NakKmeans[DenseVector[Double]](features).run(k)
          val alpha =
            if (2 == k) 1d - 3d / (4d * featureDimensions)
            else alphaOfKMinus1 + (1d - alphaOfKMinus1) / 6d
          val fk = dispersion / (alpha * dispersionOfKMinus1)
          (fk, alpha, dispersion, centroids)
        }
      }

      // Evaluates f(k) for k = 1..maxK, feeding each result into the next step.
      def fks(maxK: Int = maxK): List[(Double, Double, Double, Features)] = {
        val fadcs = ListBuffer[(Double, Double, Double, Features)](fkAlphaDispersionCentroids(1))
        var k = 2
        while (k <= maxK) {
          val (fk, alpha, dispersion, features) = fadcs(k - 2)
          fadcs += fkAlphaDispersionCentroids(k, dispersion, alpha)
          k += 1
        }
        fadcs.toList
      }

      // Picks the entry with the smallest f(k) and returns its dispersion and centroids.
      def detK: (Double, Features) = {
        val vals = fks().minBy(_._1)
        (vals._3, vals._4)
      }
    }

    object Kmeans {
      val maxK = 10
      type Features = IndexedSeq[DenseVector[Double]]
    }
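For readers who do not use Scala, the f(K) evaluation described in the linked article can be sketched in Python. This is only a rough sketch, not the implementation above: it assumes you have already run k-means for K = 1, 2, ... and collected the distortions (within-cluster sums of squared distances) in a list. The name `f_k_values` and the sample distortions are made up for illustration.

```python
def f_k_values(distortions, n_dims):
    """Return f(K) for K = 1..len(distortions).

    distortions[K-1] is the total within-cluster distortion S_K from a
    k-means run with K clusters; n_dims is the number of features.
    """
    f_values = []
    alpha = 1.0  # placeholder; the weight alpha is only used for K >= 2
    for k in range(1, len(distortions) + 1):
        if k == 1:
            f_values.append(1.0)
            continue
        if k == 2:
            alpha = 1.0 - 3.0 / (4.0 * n_dims)
        else:
            alpha = alpha + (1.0 - alpha) / 6.0
        s_k = distortions[k - 1]
        s_prev = distortions[k - 2]
        f_values.append(s_k / (alpha * s_prev) if s_prev != 0 else 1.0)
    return f_values

# Hypothetical distortions for K = 1..5 on 2-dimensional data.
distortions = [1000.0, 400.0, 90.0, 80.0, 72.0]
f_values = f_k_values(distortions, n_dims=2)
best_k = f_values.index(min(f_values)) + 1  # the K with the smallest f(K)
print(best_k)
```

The K with the smallest f(K) is taken as the suggested number of clusters, which is what `detK` does in the Scala version.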

User · Answer

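    # Assumed inputs (not defined in this snippet):
    #   num_data    - DataFrame of numeric columns
    #   ncluster    - number of clusters to use for each column
    #   missing_num - Series with the number of missing values per column
    #   count_row   - total number of rows
    import pandas as pd
    from sklearn.cluster import KMeans

    km = []
    for i in range(num_data.shape[1]):
        kmeans = KMeans(n_clusters=ncluster[i])  # we take number of cluster bandwidth theory
        ndata = num_data.iloc[:, [i]].dropna()
        ndata['labels'] = kmeans.fit_predict(ndata.values)
        cluster = ndata
        co = cluster.groupby(['labels'])[cluster.columns[0]].count()   # count for frequency
        me = cluster.groupby(['labels'])[cluster.columns[0]].median()  # median
        ma = cluster.groupby(['labels'])[cluster.columns[0]].max()     # maximum
        mi = cluster.groupby(['labels'])[cluster.columns[0]].min()     # minimum
        stat = pd.concat([mi, ma, me, co], axis=1)  # add all columns
        stat['variable'] = stat.columns[1]          # column name change
        stat.columns = ['Minimum', 'Maximum', 'Median', 'count', 'variable']
        l = []
        for j in range(ncluster[i]):
            n = [mi.loc[j], ma.loc[j]]
            l.append(n)
        stat['Class'] = l
        stat = stat.sort_values('Minimum')
        stat = stat[['variable', 'Class', 'Minimum', 'Maximum', 'Median', 'count']]
        if missing_num.iloc[i] > 0:
            # extra row that holds the missing values
            stat.loc[ncluster[i]] = 0
            if stat.iloc[ncluster[i], 5] == 0:
                stat.iloc[ncluster[i], 5] = missing_num.iloc[i]
                stat.iloc[ncluster[i], 0] = stat.iloc[0, 0]
        stat['Percentage'] = stat['count'] * 100 / count_row  # frequency percentage
        stat['Cumulative Percentage'] = stat['Percentage'].cumsum()
        km.append(stat)
    cluster = pd.concat(km, axis=0)  # see documentation for more info
    cluster = cluster.round({'Minimum': 2, 'Maximum': 2, 'Median': 2, 'Percentage': 2, 'Cumulative Percentage': 2})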

User · Answer

Yes, you can find the best number of clusters using the elbow method, but I found it troublesome to extract the number of clusters from the elbow graph with a script. You can look at the elbow graph and find the elbow point yourself, but finding it from a script was a lot of work.

So another option is to use the silhouette method. In my case, the result from the silhouette method agreed completely with the result from the elbow method in R.

Here's what I did:

    # Dataset for clustering
    n = 150
    g = 6
    set.seed(g)
    d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
                    y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
    mydata <- d

    # Plot 3x2 plots
    par(mfrow=c(3,2))

    # Plot the original dataset
    plot(mydata$x, mydata$y, main="Original Dataset")

    # Scree plot to determine the number of clusters
    wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
    for (i in 2:15) {
      wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
    }
    plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

    # Ward hierarchical clustering
    d <- dist(mydata, method = "euclidean")  # distance matrix
    fit <- hclust(d, method="ward.D")
    plot(fit)                                # display dendrogram
    groups <- cutree(fit, k=5)               # cut tree into 5 clusters
    # draw dendrogram with red borders around the 5 clusters
    rect.hclust(fit, k=5, border="red")

    # Silhouette analysis for determining the number of clusters
    library(fpc)
    library(cluster)  # provides pam()
    asw <- numeric(20)
    for (k in 2:20)
      asw[[k]] <- pam(mydata, k)$silinfo$avg.width
    k.best <- which.max(asw)

    cat("silhouette-optimal number of clusters:", k.best, "\n")
    plot(pam(d, k.best))

    # K-means cluster analysis
    fit <- kmeans(mydata, k.best)
    mydata
    # get cluster means
    aggregate(mydata, by=list(fit$cluster), FUN=mean)
    # append cluster assignment
    mydata <- data.frame(mydata, clusterid=fit$cluster)
    plot(mydata$x, mydata$y, col = fit$cluster, main="K-means Clustering results")

Hope it helps!

User · Answer

Hi, I'll keep the explanation simple and straightforward. I like to determine the number of clusters using the NbClust library.

As for how to use the NbClust function to determine the right number of clusters: you can check the actual project on GitHub, with actual data and clusters. As an extension, the project also runs the kmeans algorithm using the resulting number of centers.

GitHub project link: https://github.com/RutvijBhutaiya/Thailand-Customer-Engagement-Facebook

User · Answer

One possible answer is to use a metaheuristic such as a genetic algorithm to find k. The idea is simple: draw random values of k within some range and use a measure such as the silhouette as the fitness function of the genetic algorithm, then pick the best k according to that fitness function.

https://en.wikipedia.org/wiki/Silhouette_(clustering)

User · Answer

Look at this paper, "Learning the k in k-means" by Greg Hamerly and Charles Elkan. It uses a Gaussian test to determine the right number of clusters. Also, the authors claim that this method is better than BIC, which is mentioned in the accepted answer.

User · Answer

Another approach is to use Self-Organizing Maps (SOM) to find the optimal number of clusters. The SOM is an unsupervised neural network method that needs only the input data to perform the clustering. This approach was used in a paper about customer segmentation.

The reference of the paper is:

Abdellah Amine et al., "Customer Segmentation Model in E-commerce Using Clustering Techniques and LRFM Model: The Case of Online Stores in Morocco", World Academy of Science, Engineering and Technology, International Journal of Computer and Information Engineering, Vol. 9, No. 8, 2015, 1999-2010.

User · Answer

Maybe some beginner like me is looking for a code example. Information on silhouette_score is available here.

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    range_n_clusters = [2, 3, 4]  # clusters range you want to select
    # toy sample data; silhouette_score needs more rows than the largest n_clusters
    dataToFit = [[12, 23], [112, 46], [45, 23], [10, 20], [110, 50], [48, 25]]
    best_clusters = 0             # best cluster number which you will get
    previous_silh_avg = -1.0      # the silhouette score lies in [-1, 1]

    for n_clusters in range_n_clusters:
        clusterer = KMeans(n_clusters=n_clusters)
        cluster_labels = clusterer.fit_predict(dataToFit)
        silhouette_avg = silhouette_score(dataToFit, cluster_labels)
        if silhouette_avg > previous_silh_avg:
            previous_silh_avg = silhouette_avg
            best_clusters = n_clusters

    # Final KMeans for best_clusters
    kmeans = KMeans(n_clusters=best_clusters, random_state=0).fit(dataToFit)

User · Answer

You can choose the number of clusters by visually inspecting your data points, but you will soon realize that there is a lot of ambiguity in this process for all except the simplest data sets. This is not always bad, because you are doing unsupervised learning and there's some inherent subjectivity in the labeling process. Here, having previous experience with that particular problem or something similar will help you choose the right value.

If you want some hint about the number of clusters that you should use, you can apply the elbow method.

First of all, compute the sum of squared errors (SSE) for some values of k (for example 2, 4, 6, 8, etc.). The SSE is defined as the sum of the squared distances between each member of a cluster and its centroid. Mathematically:

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in c_i} \mathrm{dist}(x, c_i)^2$$

If you plot k against the SSE, you will see that the error decreases as k gets larger; this is because when the number of clusters increases, the clusters become smaller, so the distortion is also smaller. The idea of the elbow method is to choose the k after which the SSE stops decreasing sharply. This produces an "elbow effect" in the graph, as you can see in the following picture:

[Figure: SSE plotted against k, with an elbow at k = 6]

In this case, k = 6 is the value that the elbow method has selected. Take into account that the elbow method is a heuristic and, as such, it may or may not work well in your particular case. Sometimes there is more than one elbow, or no elbow at all. In those situations you usually end up choosing the best k by evaluating how well k-means performs in the context of the particular clustering problem you are trying to solve.
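The SSE definition above translates almost directly into code. Here is a minimal pure-Python sketch, assuming points are tuples of numbers and each centroid is the mean of its cluster. The helper names `centroid` and `sse` and the toy clusters are illustrative only.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sse(clusters):
    """Sum over all clusters of squared Euclidean distances to the centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, c))
    return total

# Toy data: the same six points grouped into one cluster and into two clusters.
one_cluster = [[(0, 0), (0, 2), (2, 0), (10, 10), (10, 12), (12, 10)]]
two_clusters = [[(0, 0), (0, 2), (2, 0)], [(10, 10), (10, 12), (12, 10)]]
print(sse(one_cluster), sse(two_clusters))
```

In practice you would obtain the clusters from a k-means run for each candidate k and plot the resulting SSE values against k.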

User · Answer

Assuming you have a matrix of data called DATA, you can perform partitioning around medoids with estimation of the number of clusters (by silhouette analysis) like this:

    library(fpc)
    maxk <- 20  # arbitrary here, you can set this to whatever you like
    estimatedK <- pamk(dist(DATA), krange=1:maxk)$nc

User · Answer

If you use MATLAB (any version since 2013b, that is), you can use the function evalclusters to find out what the optimal k should be for a given dataset.

This function lets you choose among 3 clustering algorithms: kmeans, linkage and gmdistribution.

It also lets you choose among 4 clustering evaluation criteria: CalinskiHarabasz, DaviesBouldin, gap and silhouette.

User · Answer

First build a minimum spanning tree of your data. Removing the K-1 most expensive edges splits the tree into K clusters, so you can build the MST once, look at cluster spacings / metrics for various K, and take the knee of the curve.

This works only for single-linkage clustering, but for that it's fast and easy. Plus, MSTs make good visuals. See for example the MST plot under stats.stackexchange "visualization software for clustering".
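The MST procedure above can be sketched in pure Python. This is one possible reading, under the assumption of Euclidean distances on a small dataset (Prim's algorithm here is quadratic in the number of points). The function names and sample points are hypothetical.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known connection of each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def mst_clusters(points, k):
    """Drop the k-1 most expensive MST edges and label the connected components."""
    edges = sorted(mst_edges(points))
    kept = edges[: len(edges) - (k - 1)]
    root = list(range(len(points)))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]  # path compression
            i = root[i]
        return i

    for _, i, j in kept:
        root[find(i)] = find(j)
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(points))]

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
labels = mst_clusters(points, k=3)
print(labels)
```

Because the tree is built once, calling `mst_clusters` for several values of k only repeats the cheap edge-removal step, which is what makes comparing different K inexpensive.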

User · Answer

If you don't know the number of clusters k to provide as a parameter to k-means, there are four ways to find it automatically:

1. G-means algorithm: it discovers the number of clusters automatically, using a statistical test to decide whether to split a k-means center into two. This algorithm takes a hierarchical approach to detecting the number of clusters, based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution (a continuous function which approximates the exact binomial distribution of events); if it does not, the algorithm splits the cluster. It starts with a small number of centers, say one cluster only (k=1), then splits it into two centers (k=2), and splits each of these two centers again (k=4), giving four centers in total. If G-means does not accept these four centers, then the answer is the previous step: two centers in this case (k=2). This is the number of clusters your dataset will be divided into. G-means is very useful when you do not have an estimate of the number of clusters you will get after grouping your instances. Notice that an inconvenient choice for the "k" parameter might give you wrong results. The parallel version of g-means is called p-means.

2. x-means: a newer algorithm that efficiently searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure. This version of k-means finds the number k and also accelerates k-means.

3. Online k-means or streaming k-means: it allows k-means to run by scanning the whole data once, and it finds the optimal number k automatically. Spark implements it.

4. MeanShift algorithm: it is a nonparametric clustering technique which does not require prior knowledge of the number of clusters and does not constrain the shape of the clusters. Mean shift clustering aims to discover "blobs" in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates and form the final set of centroids.
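The mean shift description above (move each candidate to the mean of the points in its region, then drop near-duplicates) can be illustrated with a small pure-Python sketch. It assumes a flat kernel with a fixed, hand-picked `bandwidth` and treats candidates closer than half the bandwidth as duplicates. Both choices are simplifications for illustration, not what any particular library does.

```python
import math

def mean_shift(points, bandwidth, n_iter=50, tol=1e-6):
    """Flat-kernel mean shift; returns the list of distinct centroids found."""
    candidates = [tuple(map(float, p)) for p in points]
    for _ in range(n_iter):
        moved = 0.0
        updated = []
        for c in candidates:
            # Mean of all data points within `bandwidth` of the candidate.
            near = [p for p in points if math.dist(p, c) <= bandwidth]
            new_c = tuple(sum(x) / len(near) for x in zip(*near))
            moved = max(moved, math.dist(c, new_c))
            updated.append(new_c)
        candidates = updated
        if moved < tol:
            break
    # Post-processing: keep only one candidate per group of near-duplicates.
    centroids = []
    for c in candidates:
        if all(math.dist(c, kept) > bandwidth / 2 for kept in centroids):
            centroids.append(c)
    return centroids

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids = mean_shift(points, bandwidth=3)
print(len(centroids), centroids)
```

The number of clusters is simply the number of centroids that survive the filtering step, so it depends on the bandwidth rather than on a preset k.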

User · Answer

I worked on a Python package, kneed (Kneedle algorithm). It finds the number of clusters dynamically as the point where the curve starts to flatten. Given a set of x and y values, kneed will return the knee point of the function. The knee point is the point of maximum curvature. Here is the sample code:

    y = [7342.1301373073857, 6881.7109460930769, 6531.1657905495022,
         6356.2255554679778, 6209.8382535595829, 6094.9052166741121,
         5980.0191582610196, 5880.1869867848218, 5779.8957906367368,
         5691.1879324562778, 5617.5153566271356, 5532.2613232619951,
         5467.352265375117, 5395.4493783888756, 5345.3459908298091,
         5290.6769823693812, 5243.5271656371888, 5207.2501206569532,
         5164.9617535255456]

    x = range(1, len(y) + 1)

    from kneed import KneeLocator
    kn = KneeLocator(x, y, curve='convex', direction='decreasing')

    print(kn.knee)

User · Answer

I used the solution I found here: http://efavdb.com/mean-shift/ and it worked very well for me:

    from sklearn.cluster import MeanShift, estimate_bandwidth
    from sklearn.datasets import make_blobs
    import matplotlib.pyplot as plt
    from itertools import cycle

    # Generate sample data
    centers = [[1, 1], [-.75, -1], [1, -1], [-3, 2]]
    X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.6)

    # Compute clustering with MeanShift

    # The bandwidth can be automatically estimated
    bandwidth = estimate_bandwidth(X, quantile=.1,
                                   n_samples=500)
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    ms.fit(X)
    labels = ms.labels_
    cluster_centers = ms.cluster_centers_

    n_clusters_ = labels.max() + 1

    # Plot result
    plt.figure(1)
    plt.clf()

    colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
    for k, col in zip(range(n_clusters_), colors):
        my_members = labels == k
        cluster_center = cluster_centers[k]
        plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
        plt.plot(cluster_center[0], cluster_center[1],
                 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=14)
    plt.title('Estimated number of clusters: %d' % n_clusters_)
    plt.show()

User · Answer

My idea is to use the silhouette coefficient to find the optimal number of clusters k. A detailed explanation is here.
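To make the silhouette coefficient concrete, here is a small pure-Python sketch of the mean silhouette over all points. It assumes Euclidean distance and scores singleton clusters as 0. The name `mean_silhouette` and the toy labelings are illustrative, and for real data a library implementation is likely the better choice.

```python
import math

def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention used here for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = mean_silhouette(points, [0, 0, 1, 1])  # labels match the two groups
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

To choose k, you would compute this score for the labels produced by k-means at each candidate k and keep the k with the highest value.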

User · Answer

Basically, you want to find a balance between two variables: the number of clusters (k) and the average variance of the clusters. You want to minimize the former while also minimizing the latter. Of course, as the number of clusters increases, the average variance decreases (up to the trivial case of k=n and variance=0).

As always in data analysis, there is no one true approach that works better than all others in all cases. In the end, you have to use your own best judgement. For that, it helps to plot the number of clusters against the average variance (which assumes that you have already run the algorithm for several values of k). Then you can use the number of clusters at the knee of the curve.
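If you would rather not read the knee off the plot by eye, one simple heuristic (an assumption here, and only one of several ways to define a knee) is to pick the point that lies farthest from the straight line joining the first and last points of the curve. The variance values below are made up.

```python
def knee_index(xs, ys):
    """Index of the point farthest from the line joining the curve's endpoints."""
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    # Normalise both axes to [0, 1] so that their scales are comparable.
    nx = [(x - x0) / (x1 - x0) for x in xs]
    ny = [(y - y0) / (y1 - y0) for y in ys]
    # After normalisation the chord is the line y = x, so the distance of a
    # point from it is proportional to |y - x|.
    gaps = [abs(b - a) for a, b in zip(nx, ny)]
    return gaps.index(max(gaps))

ks = [1, 2, 3, 4, 5, 6, 7, 8]
avg_variance = [100.0, 60.0, 30.0, 12.0, 10.0, 9.0, 8.5, 8.0]  # made-up values
knee_k = ks[knee_index(ks, avg_variance)]
print(knee_k)
```

This is only a rough stand-in for judgement: on curves with several bends or no clear bend it will still return some k, so it is worth checking the result against the plot.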

User · Answer

You can maximize the Bayesian Information Criterion (BIC):

$$\mathrm{BIC}(C \mid X) = L(X \mid C) - \frac{p}{2} \log n$$

where L(X | C) is the log-likelihood of the dataset X according to model C, p is the number of parameters in the model C, and n is the number of points in the dataset. See "X-means: extending K-means with efficient estimation of the number of clusters" by Dan Pelleg and Andrew Moore in ICML 2000.

Another approach is to start with a large value for k and keep removing centroids (reducing k) until it no longer reduces the description length. See "MDL principle for robust vector quantisation" by Horst Bischof, Ales Leonardis, and Alexander Selb in Pattern Analysis and Applications vol. 2, p. 59-72, 1999.

Finally, you can start with one cluster, then keep splitting clusters until the points assigned to each cluster have a Gaussian distribution. In "Learning the k in k-means" (NIPS 2003), Greg Hamerly and Charles Elkan show some evidence that this works better than BIC, and that BIC does not penalize the model's complexity strongly enough.
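As a small worked example of the BIC formula above, the sketch below scores a few candidate values of k and keeps the one with the highest BIC. The log-likelihoods are made-up numbers, and the parameter count (k - 1 mixing weights, k * d centroid coordinates and one shared variance) is an assumption about the model, not something fixed by the formula.

```python
import math

def bic(log_likelihood, n_params, n_points):
    """BIC(C | X) = L(X | C) - (p / 2) * log(n); larger is better in this form."""
    return log_likelihood - (n_params / 2.0) * math.log(n_points)

def n_params(k, n_dims):
    """Assumed parameter count for k spherical clusters with a shared variance."""
    return (k - 1) + k * n_dims + 1

# Hypothetical log-likelihoods of the same 1000 two-dimensional points
# under models with k = 1..4 clusters.
log_likelihoods = {1: -5200.0, 2: -4300.0, 3: -4280.0, 4: -4275.0}
scores = {
    k: bic(ll, n_params(k, n_dims=2), n_points=1000)
    for k, ll in log_likelihoods.items()
}
best_k = max(scores, key=scores.get)
print(best_k)
```

The log-likelihood keeps improving slightly as k grows, but the penalty term grows with the parameter count, so the maximum here falls at an intermediate k.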

User · Answer

There is something called the rule of thumb. It says that the number of clusters can be calculated by

$$k = (n/2)^{0.5}$$

where n is the total number of elements in your sample. You can check the veracity of this information in the following paper:

http://www.ijarcsms.com/docs/paper/volume1/issue6/V1I6-0015.pdf

There is also another method called G-means, which assumes your data follows a Gaussian (normal) distribution. It consists of increasing k until all your k groups follow a Gaussian distribution. It requires a lot of statistics but can be done. Here is the source:

http://papers.nips.cc/paper/2526-learning-the-k-in-k-means.pdf

I hope this helps!
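The rule of thumb is a one-line computation. In this sketch, rounding to the nearest integer and the lower bound of 1 are assumptions added so that the result is a usable cluster count. The function name is illustrative.

```python
import math

def rule_of_thumb_k(n):
    """Estimate k as sqrt(n / 2), rounded to the nearest integer (at least 1)."""
    return max(1, round(math.sqrt(n / 2)))

print(rule_of_thumb_k(200))
```

Since the estimate depends only on the sample size and not on the data itself, it is probably best treated as a starting point for the other methods rather than a final answer.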
