About CMYK & Study

The K-means algorithm identifies a certain number of centroids within a data set, a centroid being the arithmetic mean of all the data points belonging to a particular cluster. The algorithm then allocates every data point to the nearest cluster as it attempts to keep the clusters as small as possible (the ‘means’ in K-means refers to the task of averaging the data or finding the centroid). At the same time, K-means attempts to keep the other clusters as different as possible.

K-means is very effective in capturing structure and making data inferences if the clusters have a uniform, spherical shape. But if the clusters have more complex geometric shapes, the algorithm does a poor job of clustering the data. Another shortcoming of K-means is that the algorithm does not allow data points distant from one another to share the same cluster, regardless of whether they belong in the cluster. K-means does not itself learn the number of clusters from the data, rather that information must be pre-defined. And finally, when there is overlapping between or among clusters, K-means cannot determine how to assign data points where the overlap occurs.