
I read that GPT tokenizers perform badly on non-English languages because there is too little training text for the BPE merges. So,
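To make the premise concrete: a rough, stdlib-only illustration of why byte-level BPE is costly for Hindi. Devanagari codepoints take 3 bytes each in UTF-8, so a byte-level tokenizer with few learned Hindi merges can emit several tokens per character, while common English words often compress to a single token. (This shows the byte-count gap only; actual token counts depend on the learned merge table.)

```python
# Compare raw UTF-8 byte counts, the worst case for a byte-level BPE
# tokenizer that has learned no merges for the script.
hindi = "नमस्ते"    # "namaste": 6 codepoints in Devanagari
english = "hello"   # 5 codepoints in ASCII

print(len(hindi), len(hindi.encode("utf-8")))      # 6 codepoints -> 18 bytes
print(len(english), len(english.encode("utf-8")))  # 5 codepoints -> 5 bytes
```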

Given a non-English prompt, say in Hindi, why not just:

  1. GPT makes an API call to a machine translator to convert the Hindi prompt to English
  2. Tokenize the English prompt and let GPT generate its output in English
  3. Translate the output back to Hindi and return it to the user
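The three steps above can be sketched as a simple wrapper. Everything here is hypothetical: `translate` and `generate` are placeholder stand-ins for a real MT API and a real LLM call, included only so the control flow is runnable.

```python
def translate(text, src, dst):
    # Placeholder for a real machine-translation API call (hypothetical).
    return f"[{src}->{dst}] {text}"

def generate(english_prompt):
    # Placeholder for the English-only LLM generation step (hypothetical).
    return f"answer to: {english_prompt}"

def answer_in_hindi(hindi_prompt):
    english_prompt = translate(hindi_prompt, "hi", "en")  # step 1: Hindi -> English
    english_output = generate(english_prompt)             # step 2: generate in English
    return translate(english_output, "en", "hi")          # step 3: English -> Hindi
```

One design consequence worth noting: any error the translator makes in step 1 propagates through generation and is compounded by step 3, so the pipeline's quality is bounded by the weaker of the two systems.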

Would this perform better or worse than the existing tokenization of non-English languages?

Assume we have a good translator at hand.
