BPE tokenizers are the standard for modern LLMs. By default, most add a prefix space (add_prefix_space), so that John went away is pretokenized to [_John][_went][_away]. To preserve reversibility, the leading space is stripped again when decoding, so the round trip reproduces the original text. This lets the tokenizer map a word at the beginning of a sentence to the same token as that word anywhere else.
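
To make this concrete, here is a minimal sketch using the Hugging Face transformers library with the gpt2 tokenizer (my assumption; the exact token strings may differ for other models). GPT-2's byte-level BPE marks a leading space with "Ġ", which plays the role of the "_" above:

```python
# Minimal sketch (assumes the Hugging Face transformers library and the gpt2
# vocabulary). It shows that the same word tokenizes differently with and
# without a leading space; the outputs in the comments are illustrative.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

print(tok.tokenize("John went away"))    # ['John', 'Ġwent', 'Ġaway']
print(tok.tokenize(" John went away"))   # ['ĠJohn', 'Ġwent', 'Ġaway']

# Reversibility: gpt2 does not add a prefix space by default, so decoding the
# encoded ids reproduces the input exactly, leading space included.
ids = tok.encode(" John went away")
print(repr(tok.decode(ids)))             # ' John went away'
```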

But what about cases where natural text has no leading space? Examples include poetry (John went away\nJohn will come back another day\n), quotations (He yelled "John, get down!"), CJK text (which has no spaces at all, so a prefix space only sets the first word apart from the rest), and computer code.
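
The poetry case is easy to see with the same sketch as above (again assuming the gpt2 tokenizer; the expected output is illustrative). The word after the line break gets the space-less variant of the token, even though the identical word mid-line gets the "Ġ"-prefixed form:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Expected (illustrative): the second "John" has no leading space because it
# follows a newline, so it is tokenized without the "Ġ" prefix.
# ['John', 'Ġwent', 'Ġaway', 'Ċ', 'John', 'Ġwill', 'Ġcome', 'Ġback']
print(tok.tokenize("John went away\nJohn will come back"))
```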

Of course, having these words map to a different token isn't a show-stopper: the LLM can simply learn embeddings for both _John and John. But if that's the case, what have we gained by adding the leading space in the first place?
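
For what it's worth, a quick check (again assuming the gpt2 vocabulary) suggests both variants do exist as separate entries, i.e. the model really does carry two embeddings for the same surface word:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
vocab = tok.get_vocab()

# Both the space-prefixed and the bare form are separate vocabulary entries,
# each with its own id and therefore its own learned embedding.
print("John" in vocab, "ĠJohn" in vocab)   # expected: True True
```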
