I read that GPT tokenizers perform badly on non-English languages because of the lack of training text for the BPE merges, so non-English text gets split into many more tokens.
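For example, you can see the inflated token counts with OpenAI's tiktoken library (a quick check I put together; cl100k_base is the GPT-3.5/GPT-4 encoding, and the Hindi sentence is my own rough translation of the English one):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "How is the weather today?"
hindi = "आज मौसम कैसा है?"  # rough Hindi equivalent of the English sentence

# Devanagari text has far fewer BPE merges in the vocabulary,
# so the Hindi string typically encodes to several times more tokens.
print(len(enc.encode(english)))
print(len(enc.encode(hindi)))
```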
So, given a non-English prompt, let's say in Hindi, why not just:
- GPT makes an API call to a machine translator to convert the Hindi prompt to English
- Tokenize the English prompt and let GPT generate its output in English
- Translate the output back to Hindi and return it to the user
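Concretely, I'm imagining a wrapper like this (a minimal sketch; `translate` and `generate` are hypothetical placeholders, not real APIs):

```python
def translate(text: str, source: str, target: str) -> str:
    # Placeholder: call a machine-translation service here.
    return text

def generate(english_prompt: str) -> str:
    # Placeholder: call the GPT API with the English prompt here.
    return english_prompt

def answer_in_hindi(hindi_prompt: str) -> str:
    english_prompt = translate(hindi_prompt, source="hi", target="en")  # step 1
    english_output = generate(english_prompt)                           # step 2
    return translate(english_output, source="en", target="hi")          # step 3
```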
Does this perform better or worse than the existing approach of tokenizing non-English text directly?
(Assuming we have a good translator at hand.)