How to refine K-means clustering on a data set?

Question

I'm working with a data set in which each record is stored as a string such as AxByCyA, where A, B and C are actions and v, w, x, y and z are the times between actions (each letter represents an interval of time). It's worth noting that B cannot occur without A, and C cannot occur without B. C is the action I'm attempting to study (i.e., I'd like to be able to predict whether a user will do C based on their prior actions).

I intend to create 2 clusters: people who do C and those who don't.

From this data set, I build a training array to run the scikit-learn (Python) k-means algorithm on, containing the number of As, the number of Bs, the mean time between actions (the average of all intervals) and the standard deviation of the intervals.
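As a rough illustration of the feature array described above, here is a minimal sketch. The numeric values assigned to the interval letters v to z are purely hypothetical (the question does not say what durations they represent), and the function name `extract_features` is made up for this example.

```python
import statistics

# Hypothetical mapping from interval letters to numeric durations;
# the question does not state what v..z actually stand for.
INTERVAL_VALUES = {"v": 1.0, "w": 2.0, "x": 3.0, "y": 4.0, "z": 5.0}


def extract_features(sequence):
    """Return [number of As, number of Bs, mean interval, std of intervals]."""
    # Keep only the interval letters and convert them to numbers.
    intervals = [INTERVAL_VALUES[ch] for ch in sequence if ch in INTERVAL_VALUES]
    mean_gap = statistics.fmean(intervals) if intervals else 0.0
    # Population standard deviation; defined as 0 when fewer than two intervals exist.
    std_gap = statistics.pstdev(intervals) if len(intervals) > 1 else 0.0
    return [sequence.count("A"), sequence.count("B"), mean_gap, std_gap]


features = extract_features("AxByCyA")
```

Each user string becomes one row of four numbers; stacking these rows gives the array passed to k-means.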

This gives me an overall success rate of 82% on the test set, but is there anything I can do to improve the accuracy?

pasaba por aqui · Accepted Answer · 2018-05-09T10:09:30.840

5

The usual parameters to adjust in k-means are:

Number of clusters (recall that several clusters can share the same label).
Distance definition (Euclidean is the most basic; a Gaussian-based distance can be an improvement).
Selection of the initial cluster positions.
Data preprocessing (data normalization, etc.).
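A minimal sketch of how three of these knobs might look in scikit-learn follows. The data is synthetic and stands in for the real feature array (an assumption made only so the example runs), and the choice of 4 clusters is arbitrary. Note that scikit-learn's `KMeans` works with Euclidean distance only, so changing the distance definition would need a different estimator and is not shown here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): rows are [n_A, n_B, mean_gap, std_gap].
rng = np.random.default_rng(0)
X_no_c = rng.normal(loc=[2.0, 0.5, 4.0, 1.0], scale=0.3, size=(50, 4))
X_c = rng.normal(loc=[5.0, 3.0, 2.0, 0.5], scale=0.3, size=(50, 4))
X = np.vstack([X_no_c, X_c])
y = np.array([0] * 50 + [1] * 50)  # 1 = user did C, 0 = user did not

# Data preprocessing: put counts and times on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)

# More clusters than labels, k-means++ seeding, and several restarts.
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_scaled)

# Several clusters can share a label: map each cluster to its majority label.
cluster_to_label = {
    int(c): int(np.bincount(y[cluster_ids == c]).argmax())
    for c in np.unique(cluster_ids)
}
y_pred = np.array([cluster_to_label[int(c)] for c in cluster_ids])
accuracy = float((y_pred == y).mean())
```

With real data, the scaler and the clustering would be fitted on the training split only, and `kmeans.predict` applied to the scaled test split before computing accuracy.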

edited May 09 '18 at 10:09

answered May 09 '18 at 09:54
