I am dealing with an intent classification task on an Italian customer service data set.

I have roughly 1.5k sentences and 29 classes (imbalanced).

According to the literature, a good approach is to rebalance the training data by generating synthetic data, oversampling, or undersampling, using for example the SMOTE algorithm.

I also want to use a cross-validation mechanism (stratified k-fold) to be more confident in the results.

I also know that accuracy is not the right metric to look at here; I should use precision, recall, and the confusion matrix instead.

Is it possible to combine k-fold cross-validation and oversampling (or undersampling) techniques?

Alfonso

1 Answer

It is straightforward to combine k-fold cross-validation with a technique like oversampling or undersampling.

The key point is to apply the balance-restoration technique inside the cross-validation loop: resample only the training portion of each fold, never the validation fold, otherwise synthetic samples derived from held-out data leak into training and inflate your scores. In the scikit-learn ecosystem, I believe the imbalanced-learn package provides a 'pipeline' object that bundles a sampler and a classifier and applies the sampler only during fitting, so it behaves correctly inside cross-validation.
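A minimal sketch of that setup, assuming the imbalanced-learn (imblearn) package is installed; the synthetic dataset, logistic-regression classifier, and all parameter values below are placeholders, not choices taken from the question:

```python
# Sketch: SMOTE inside cross-validation via imblearn's Pipeline,
# which resamples only the training folds of each split.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data standing in for the vectorized Italian corpus.
X, y = make_classification(n_samples=1500, n_classes=3, n_informative=10,
                           weights=[0.7, 0.2, 0.1], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),           # applied to training folds only
    ("clf", LogisticRegression(max_iter=1000)),
])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=skf, scoring="f1_macro")
print(f"macro-F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Macro-averaged F1 is used here because, with many imbalanced classes, a plain accuracy score would be dominated by the majority intents.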

Precision and recall (macro-averaged, given the class imbalance) are probably a fine starting place for measuring performance.
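If you also want the per-class precision/recall and the confusion matrix the question mentions, one option (a sketch reusing pipe, X, y, and skf from the snippet above) is to collect out-of-fold predictions:

```python
# Sketch: out-of-fold predictions give per-class precision/recall and a
# confusion matrix computed only on data each fold's model never trained on.
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(pipe, X, y, cv=skf)
print(classification_report(y, y_pred))  # per-class precision, recall, F1
print(confusion_matrix(y, y_pred))
```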

John Doucette