1

Disclaimer: I am a novice in the world of machine learning, so please excuse my ignorance.

My dataset consists of features such as age and days since last visit. The information is medical, and none of it is geometric; it is simply data pertaining to particular clients.

The goal is to assign each record in my dataset one of three labels. The dataset is not labeled, meaning I'm dealing with an unsupervised learning problem. My dataset consists of ~20,000 records, but this will increase linearly over time. The data is nearly all floats, with some strings that can easily be converted into floats. Using this cheat sheet from the scikit-learn site for selecting a solution, KMeans clustering seems like a potential solution, but I've read that high dimensionality can make KMeans unhelpful. I'm not married to a particular implementation either. I currently have a KMeans implementation using TensorFlow in Python, but I am open to alternatives.

My question is: which approaches should I explore further that might be better suited to my particular situation?

Patel Sunil
Tory

2 Answers

1

I would recommend having a look at Finding Groups in Data, which is a very readable introduction to clustering methods. It gives a good overview of a number of different algorithms, both partitioning and hierarchical (agglomerative as well as divisive). As far as I remember, source code for the various algorithms is available on the web somewhere.

I am sure you will find a fitting algorithm for your problem in there.
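To give a feel for the hierarchical family, here is a minimal sketch using scikit-learn's `AgglomerativeClustering`, cut at three groups to match the question. It is not taken from the book; the synthetic data, the four features, and the choice of Ward linkage are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the client records: three loose groups, 4 float features.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [6, 6, 6, 6], [-6, 6, -6, 6]], dtype=float)
X = np.vstack([c + rng.normal(size=(100, 4)) for c in centers])

# Standardise so that no single feature dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# Bottom-up (agglomerative) merging, stopped once three clusters remain.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X_scaled)

# Number of records assigned to each of the three clusters.
cluster_sizes = np.bincount(labels)
print(cluster_sizes)
```

Keep in mind that agglomerative methods scale roughly quadratically with the number of records, so with ~20,000 rows and growing this may become slow or memory-hungry.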

Oliver Mason
1

This is supposed to be a comment but I haven't got enough reputation to do that.

In addition to what @the complexitytheorist has said, I recommend taking a closer look at your data first, using dimensionality reduction and visualisation methods such as PCA and t-SNE. A better understanding of the data can save you a lot of work.
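As a rough sketch of that first look, something like the following could work with scikit-learn. The random data, the eight features, and the `perplexity` value are placeholders I made up; you would substitute your own feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real records: 300 rows, 8 float features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))

# Standardise so features on large scales (e.g. days) do not dominate.
X_scaled = StandardScaler().fit_transform(X)

# PCA: linear projection to 2 components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Fraction of the total variance retained by each of the 2 components.
explained = pca.explained_variance_ratio_

# t-SNE: non-linear 2-D embedding, intended for visualisation only.
# perplexity must be smaller than the number of samples.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_scaled)

print(X_pca.shape, X_tsne.shape, explained)
```

Scatter-plotting `X_pca` and `X_tsne` (for instance with matplotlib) then shows whether any visible group structure exists before you commit to a clustering algorithm.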

Then you can choose which clustering algorithm to use, for example KMeans or DBSCAN as a start.
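A minimal sketch of trying both with scikit-learn is below. The blob data is synthetic, and the `eps` and `min_samples` values for DBSCAN are guesses that would need tuning on real data; only `n_clusters=3` comes from the question.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real records: 300 rows, 5 float features.
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# KMeans: the number of clusters is fixed up front (three, as in the question).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Silhouette score lies in [-1, 1]; higher means better-separated clusters.
kmeans_silhouette = silhouette_score(X_scaled, kmeans_labels)

# DBSCAN: finds the number of clusters itself; the label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
n_dbscan_clusters = len(set(dbscan_labels) - {-1})

print(kmeans_silhouette, n_dbscan_clusters)
```

Note that DBSCAN will not necessarily return exactly three clusters, so if three labels is a hard requirement KMeans is the more direct fit.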

Kevin. Fang