I am dealing with an intent classification task on an Italian customer service data set.

I have roughly 1.5k sentences and 29 classes (imbalanced).

According to the literature, a good approach is to rebalance the training data by generating synthetic data, oversampling, or undersampling, using for example the SMOTE algorithm.

I also want to use a cross-validation mechanism (stratified k-fold) to be more confident in the results.

I also know that accuracy is not the right metric to look at here; I should use precision, recall, and the confusion matrix instead.

Is it possible to combine k-fold cross-validation and oversampling (or undersampling) techniques?

Alfonso

1 Answer

It is straightforward to combine k-fold cross-validation with a technique like oversampling or undersampling.

The key point is to apply the balance-restoration technique inside the cross-validation loop: resample only the training portion of each fold, never the validation fold, otherwise synthetic samples derived from held-out data leak into training and inflate your scores. In the scikit-learn ecosystem, I believe the imbalanced-learn package provides a 'pipeline' object that bundles a sampler and a classifier and applies the sampler only during fitting, so it behaves correctly inside cross-validation.
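A minimal sketch of that setup, assuming the imbalanced-learn (imblearn) package is installed; the synthetic dataset, logistic-regression classifier, and all parameter values below are placeholders, not choices taken from the question:

```python
# Sketch: SMOTE inside cross-validation via imblearn's Pipeline,
# which resamples only the training folds of each split.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data standing in for the vectorized Italian corpus.
X, y = make_classification(n_samples=1500, n_classes=3, n_informative=10,
                           weights=[0.7, 0.2, 0.1], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),           # applied to training folds only
    ("clf", LogisticRegression(max_iter=1000)),
])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=skf, scoring="f1_macro")
print(f"macro-F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Macro-averaged F1 is used here because, with many imbalanced classes, a plain accuracy score would be dominated by the majority intents.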

Precision and recall (macro-averaged, given the class imbalance) are probably a fine starting place for measuring performance.
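If you also want the per-class precision/recall and the confusion matrix the question mentions, one option (a sketch reusing pipe, X, y, and skf from the snippet above) is to collect out-of-fold predictions:

```python
# Sketch: out-of-fold predictions give per-class precision/recall and a
# confusion matrix computed only on data each fold's model never trained on.
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(pipe, X, y, cv=skf)
print(classification_report(y, y_pred))  # per-class precision, recall, F1
print(confusion_matrix(y, y_pred))
```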

John Doucette