I have a dataset of texts, each identified by an ID number. For new, incoming texts I would like to predict the best-matching ID number. I considered multi-class text classification, but I am not sure it is the right approach, since most ID numbers have only one text; in that case I wouldn't even have a test set. Can up-sampling help? Or is there an approach other than classification for this kind of problem?

The data set looks like this:

id1 'text1'
id2 'text2'
id3 'text3'
id3 'text4'
id3 'text5'
id4 'text6'
...
id200 'text170'

I would appreciate any guidance to find the best approach for this problem.

Fara
  • You won't be able to classify if there are too many kinds of IDs – Dan D Mar 11 '21 at 02:40
  • And theoretically, you won't be able to classify if the data are single sample → multiple IDs; it must be many samples → single ID – Dan D Mar 11 '21 at 02:43
  • If your texts are simple (share the same words/phrases/subparts), then approaches like edit distance might work (Levenshtein distance or anything similar). If they are more complex, e.g. different words but similar meaning, then you can use pretrained models like BERT to get embeddings and classify based on the distance between embeddings – SajanGohil Nov 01 '22 at 12:16
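
A minimal sketch of the nearest-match idea from the last comment, using only the standard library's difflib for a rough string-similarity score. The example data and variable names are hypothetical; for meaning-based matching, the similarity function could be swapped for cosine similarity over embeddings from a pretrained model (e.g. the sentence-transformers package, installed separately).

    # Nearest-match ID lookup by string similarity (difflib, standard library).
    # For semantic matching, replace `similarity` with cosine similarity over
    # embeddings from a pretrained model (e.g. sentence-transformers).
    from difflib import SequenceMatcher

    # Hypothetical labelled data: one or a few texts per ID.
    labelled = [
        ("id1", "invoice overdue, please send payment reminder"),
        ("id2", "reset my account password"),
        ("id3", "how do I cancel my subscription"),
        ("id3", "cancel subscription before next billing cycle"),
    ]

    def similarity(a: str, b: str) -> float:
        """Rough similarity in [0, 1] based on matching subsequences."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def best_match_id(new_text: str) -> tuple[str, float]:
        """Return the ID whose text is most similar to the new text."""
        return max(
            ((id_, similarity(new_text, text)) for id_, text in labelled),
            key=lambda pair: pair[1],
        )

    if __name__ == "__main__":
        print(best_match_id("I want to cancel the subscription"))  # e.g. ('id3', ...)

One practical point with this kind of nearest-neighbour matching: it does not need many examples per ID, but you would likely want a similarity threshold below which a new text is flagged as "no good match" rather than forced onto the closest ID.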