Is this dataset with only two features suitable for clustering with k-means?

Question

I am working with the K-means clustering algorithm for unsupervised learning.

Is the following dataset suitable for the k-means clustering task or not? Why or why not? The dataset has only two features.

score 2 · Answer 1 · answered May 18 '20 at 09:37

One problem with clustering algorithms is that they will typically find you a solution, i.e. they will split your data set into clusters, even if there is no real structure in the data. Your data looks like it could consist of about 5 to 7 clusters, but it could equally well be just 2, or only 1.

What you need to do after the clustering is to assess the quality of the result. I recommend having a look at Finding Groups in Data by Kaufman & Rousseeuw. They discuss various clustering algorithms and also a procedure that works out how cohesive your clusters are. Though it is 30 years old, it is an excellent book on the topic.

You also have the issue of choosing a value for k in your clustering: I usually start with k = 2 and increase it from there, computing the cohesion of the result at each step using their method, and keep the k that gives the best score. This is an objective way of finding a good value for k, and it usually gives a reasonable clustering result.
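
The loop over k could be sketched roughly as follows. The answer does not name the cohesion measure, so this sketch assumes it is the mean silhouette width, one of the measures discussed by Kaufman & Rousseeuw. The data is synthetic (three made-up blobs standing in for your two features), and the helper names `kmeans` and `mean_silhouette` are hypothetical, written with the standard library only.

```python
import math
import random


def kmeans(points, k, rng, n_iter=100):
    """Plain Lloyd's k-means on tuples; returns one cluster label per point."""
    centers = rng.sample(points, k)  # initialise from k distinct data points
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members)
                                         for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette width (assumed cohesion measure); ranges from -1 to 1."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0  # not defined for a single cluster; treated as 0 here
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Synthetic two-feature data (an assumption, not the data from the question).
rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in blob_centers for _ in range(40)]

# Start at k = 2 and increase; keep the best score over a few random restarts,
# because k-means results depend on the initial centers.
scores = {}
for k in range(2, 7):
    scores[k] = max(mean_silhouette(points, kmeans(points, k, rng))
                    for _ in range(5))
    print(f"k={k}: mean silhouette = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("k with the best score:", best_k)
```

Scores close to 1 indicate compact, well-separated clusters; if the score stays low for every k, that is a hint that the data may not have much cluster structure at all.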

The ultimate test, of course, is whether the result makes sense to you when you look at it. No clustering algorithm can do that for you.
