
I am working with the K-means clustering algorithm for unsupervised learning.

Is the following dataset suitable for k-means clustering? Why or why not? The dataset has only two features.

[Image of the dataset; not reproduced here]

nbro
Debugger

1 Answer


One problem with clustering algorithms is that they will typically find you a solution, i.e. they will split your data set into clusters, and they will report a structure even if there isn't one. Your data looks like it could consist of about 5 to 7 clusters, but it could equally well be just 2, or only 1.
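To illustrate the first point, here is a minimal sketch that runs a plain k-means (Lloyd's algorithm, written out by hand so nothing beyond the standard library is needed) on uniform random noise. The function name `kmeans`, the choice of k = 4 and the 300 noise points are my own assumptions for the demonstration and have nothing to do with your data.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                        + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels

# Uniform noise on the unit square: there is no cluster structure here.
rng = random.Random(42)
noise = [(rng.random(), rng.random()) for _ in range(300)]

labels = kmeans(noise, k=4)
sizes = [labels.count(c) for c in range(4)]
print(sizes)  # the algorithm still reports a partition of the noise
```

The algorithm hands back a partition regardless, which is why the result has to be assessed afterwards rather than taken at face value.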

What you need to do after the clustering is to assess the quality of the result. I recommend having a look at Finding Groups in Data by Kaufman & Rousseeuw. They discuss various clustering algorithms and also a procedure that works out how cohesive your clusters are. Though it is 30 years old, it is an excellent book on the topic.

You also have the issue of choosing a value for k in your clustering. I usually start with two and increase it from there; at each step I compute the cohesion of the result using their method, and I stop once the score no longer improves. This is an objective way of finding a good value for k, and it usually yields a reasonable clustering result.
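A rough sketch of that loop, assuming the cohesion measure in question is the silhouette width covered in Kaufman & Rousseeuw (if you prefer a different measure, swap out `mean_silhouette`). The helper names, the synthetic three-blob data, the range of k and the five restarts per k are all assumptions made for illustration.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def mean_silhouette(points, labels):
    """Average silhouette width: near 1 is cohesive, near 0 is ambiguous."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of p's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical test data: three Gaussian blobs, 40 points each.
rng = random.Random(1)
blob_centers = [(0, 0), (5, 5), (0, 5)]
data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
        for cx, cy in blob_centers for _ in range(40)]

# For each k, keep the best score over a few random restarts,
# since k-means can get stuck in a poor local optimum.
scores = {
    k: max(mean_silhouette(data, kmeans(data, k, seed=s)) for s in range(5))
    for k in range(2, 7)
}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

This version scans a fixed range of k and picks the maximum instead of stopping at the first drop, which is a little more robust when the score is not monotone. The silhouette here is O(n²) in the number of points, so for large data sets you would want to compute it on a sample.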

The ultimate test, of course, is whether the result makes sense to you when you look at it. No clustering algorithm can do that for you.

Oliver Mason