I read that GPT tokenizers perform badly on non-English languages because of the lack of training text for the BPE merges, so non-English text gets split into many more tokens.
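For example, you can see the inflated token counts with OpenAI's tiktoken library (a quick check I put together; cl100k_base is the GPT-3.5/GPT-4 encoding, and the Hindi sentence is my own rough translation of the English one):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "How is the weather today?"
hindi = "आज मौसम कैसा है?"  # rough Hindi equivalent of the English sentence

# Devanagari text has far fewer BPE merges in the vocabulary,
# so the Hindi string typically encodes to several times more tokens.
print(len(enc.encode(english)))
print(len(enc.encode(hindi)))
```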
So, given a non-English prompt, let's say in Hindi, why not just:
- GPT makes an API call to a machine translator to convert the Hindi prompt to English
- Tokenize the English prompt and let GPT generate its output in English
- Translate the output back to Hindi and return it to the user
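Concretely, I'm imagining a wrapper like this (a minimal sketch; `translate` and `generate` are hypothetical placeholders, not real APIs):

```python
def translate(text: str, source: str, target: str) -> str:
    # Placeholder: call a machine-translation service here.
    return text

def generate(english_prompt: str) -> str:
    # Placeholder: call the GPT API with the English prompt here.
    return english_prompt

def answer_in_hindi(hindi_prompt: str) -> str:
    english_prompt = translate(hindi_prompt, source="hi", target="en")  # step 1
    english_output = generate(english_prompt)                           # step 2
    return translate(english_output, source="en", target="hi")          # step 3
```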
Does this perform better or worse than the existing approach of tokenizing non-English text directly?
(Assuming we have a good translator at hand.)