3

I have a dataset of unlabelled emails that fall into distinct categories (around a dozen). I want to classify them, along with new emails that arrive in the future, in a dynamic manner. I know that there are dynamic clustering techniques that allow the clusters to evolve over time ('dynamic-means' being one of them). However, I would also like to be able to start with a predefined set of classes (or clusters/centroids), as I know for a fact what the types of those emails will be.

Furthermore, I need some guidance in terms of what vectorisation technique to use for my type of data. Would creating a term matrix using TF-IDF be sufficient? I assume that the data I am dealing with could be differentiated on the basis of keyword occurrence, but I cannot tell to what degree. Are there more sophisticated vectorisation techniques based more on the text semantics? Are they worth exploring?

nbro

2 Answers

1

It sounds like you are trying to do some kind of semi-supervised learning. In semi-supervised learning, some data points are labelled (you know which class they belong to) and others are not. There are classification algorithms designed specifically for this kind of problem, such as the transductive SVM. Personally, I have not found these techniques to be more effective than simply discarding the unlabelled data and treating the problem as purely supervised, but YMMV.

TF-IDF remains fairly popular, as do n-gram-based approaches. A more modern vectorization to consider is word2vec, which replaces sparse bag-of-words style vectors with dense word embeddings, a more meaningful feature space in which words used in similar contexts end up close together.
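To make the TF-IDF idea concrete, here is a minimal sketch using only the Python standard library. The tokenizer, the smoothed IDF formula, the function names and the three example emails are all assumptions chosen for illustration; in practice you would more likely use a library implementation such as scikit-learn's `TfidfVectorizer`.

```python
import math
from collections import Counter

PUNCT = ".,!?;:()\"'"

def tokenize(text):
    # Naive tokenizer (an assumption): split on whitespace, strip punctuation, lowercase.
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def tfidf_vectors(docs):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Smoothed inverse document frequency (one common variant).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({t: v / norm for t, v in vec.items()} if norm else {})
    return vectors

def cosine(a, b):
    # Both vectors are already unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Hypothetical example emails.
emails = [
    "Your invoice for March is attached",
    "Invoice overdue: please pay the attached invoice",
    "Team lunch on Friday, please RSVP",
]
vecs = tfidf_vectors(emails)
sim_invoice = cosine(vecs[0], vecs[1])  # two invoice emails share keywords
sim_lunch = cosine(vecs[0], vecs[2])    # no shared keywords at all
print(sim_invoice > sim_lunch)
```

The two invoice emails get a positive similarity because they share keywords, while the lunch email shares none with the first one. This is exactly the limitation that embedding-based approaches try to address: two emails about the same topic that use different words look unrelated under TF-IDF.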

John Doucette
0

I would also like to be able to start with a predefined set of classes (or clusters/centroids) as I know for a fact what the types of those emails will be.

This is not a clustering problem, but a semi-supervised learning problem. If you don't have labeled data yet, then create some labels. You might also want to look into "active learning".

One approach is:

  1. For each category, label 5 samples by hand
  2. Train a classifier on them (e.g. TF-IDF features and a small neural network)
  3. Let the neural network label the rest of your dataset
  4. Check the predictions where it was most confident, and the ones where the probabilities were most evenly spread across all classes. Correct them where necessary and use them to quickly create more labels.
  5. Maybe Amazon Mechanical Turk is an option to quickly generate more labels
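
The selection in step 4 can be sketched roughly as follows. This assumes the classifier returns one probability per class for every email; the function names, the email ids and the probabilities are made up for illustration. "Most confident" is read here as the highest top-class probability, and "most evenly spread" as the highest entropy.

```python
import math

def entropy(probs):
    # Shannon entropy: highest when the probabilities are evenly spread.
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_review(predictions, k=2):
    """predictions maps an email id to its list of class probabilities.

    Returns (k most confident ids, k most uncertain ids).
    """
    by_confidence = sorted(predictions, key=lambda i: max(predictions[i]), reverse=True)
    by_entropy = sorted(predictions, key=lambda i: entropy(predictions[i]), reverse=True)
    return by_confidence[:k], by_entropy[:k]

# Hypothetical classifier output for four emails and three classes.
predictions = {
    "email_a": [0.97, 0.02, 0.01],
    "email_b": [0.34, 0.33, 0.33],
    "email_c": [0.60, 0.30, 0.10],
    "email_d": [0.05, 0.90, 0.05],
}
most_confident, most_uncertain = select_for_review(predictions, k=2)
print(most_confident)  # quick to verify, cheap extra labels
print(most_uncertain)  # labelling these tells the model the most
```

The confident predictions are fast to confirm or reject, while the uncertain ones are where a manual label adds the most information. After labelling both groups, retrain and repeat.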
Martin Thoma