
Popular tokenizers use a special symbol such as "Ġ" (BPE) or "▁" (SentencePiece) to represent space. What is the reasoning behind this?

I did try searching for the answer. I got two types of explanations, but they don't explain anything to me.

Some languages such as Japanese don't typically use space to separate words.

That's true, but tokenizing such a language won't make use of the special character either. I don't understand how having a token like "Ġhorse" is any different from " horse" in this scenario.

Let's assume we want to tokenize the made-up language Spaceless English, which is English but without whitespace. tokenize("I'mridingahorse") -> ["I'm", "riding", "ahorse"]. No need for spaces at all, let alone a special character.

In fact, this seems like an argument in favor of using a plain space as opposed to a special symbol.

Some tokens have a space in them, and this helps us differentiate between spaces in the original text and spaces in the tokens.

For example "New York" might be a token.

This kinda makes sense, but I still don't quite get why we would ever need this distinction. How is "ĠNew York" different from " New York"? The single token " New York" is still distinguishable from the pair of tokens [" New", " York"]. A tokenizer would employ some kind of greedy algorithm to tokenize text (see the sketch below)... and even if it's not greedy but an exhaustive search, it would still tokenize "I'm going to New York" as ["I'm", " going", " to", " New York"], the same way we would expect it to tokenize "extracurricular" as, say, ["extra", "curricul", "ar"] and not ["ex", "trac", "u", "rr", "ic", "u", "lar"], and the same way we would expect it to work on Spaceless English.
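To make that concrete, here is a rough sketch of the kind of greedy longest-match tokenizer I have in mind; the vocabulary below is made up purely for illustration:

```python
# Rough sketch of a greedy longest-match tokenizer; the vocabulary is made up.
def tokenize(text, vocab):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Fall back to a single character if nothing in the vocab matches.
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"I'm", " going", " to", " New York", " New", " York",
         "extra", "curricul", "ar"}
print(tokenize("I'm going to New York", vocab))
# ["I'm", ' going', ' to', ' New York']
print(tokenize("extracurricular", vocab))
# ['extra', 'curricul', 'ar']
```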

Note that I'm looking at the problem only through the lens of converting text to tokens and tokens to text. Perhaps the need for these characters has to do with creating the token list based on the training corpus?

1 Answer


I'm over a year late, but I figured I'd give you an answer anyway. It's not what you proposed, though.

  1. The reason we don't want spaces in subword types (i.e. the strings in the vocabulary; they're only called "tokens" when you segment a text using the vocabulary) is historical: the paper that popularised subword tokenisers for NLP, Sennrich et al. (2016), proposed repurposing the BPE compression algorithm as a tokenisation algorithm. BPE learns binary token merges (t1,t2) -> t1 + t2, and the way Sennrich et al. stored these merges on disk was in a merges.txt file where every line represented one merge, encoded as the left type, SPACE, the right type, NEWLINE (see the sketch after this list). So, to parse this file, there can be no spaces or newlines in the types of the vocabulary. Sennrich et al.'s preprocessor already took care of this, so they didn't have to worry about it.

  2. The reason that Ġ represents a space is somewhat coincidental. Radford et al. (2019) realised that using Unicode characters for GPT-2's alphabet was a bad idea, because even just limiting Unicode to actual alphabets would inflate the vocabulary. So instead, they used the 256 possible bytes as the alphabet, and thus converted text to its UTF-8 bytes before applying BPE. Now, to store that BPE tokeniser to disk, they wanted the vocabulary (and, for BPE tokenisers, the merges) to be interpretable to humans, so they wanted to save it as characters and not bytes. This gave two problems: (1) byte 32 represents a space, which can't be kept as-is for the reason above, and (2) some other bytes don't correspond to any character at all, so they can't be rendered like regular characters. Their solution: map each such problematic byte, in order, to the next unused Unicode codepoint starting at 256 (sketched below). That's how space (byte 32) became Ġ (codepoint 288 == 256 + 32), byte 31 became ğ (codepoint 287 == 256 + 31), and so on.
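To illustrate point 1, here is a minimal sketch of how such a merges.txt file could be parsed (simplified; this is not the exact loader of any particular library). It should make clear why a space inside a type would break the format:

```python
# Minimal sketch of parsing a BPE merges.txt file: one merge per line, written
# as "left_type right_type" with a single separating space. A space inside a
# type would make a line ambiguous, hence no spaces in the vocabulary.
def load_merges(path):
    merges = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):  # skip blanks / version headers
                continue
            left, right = line.split(" ")  # breaks if a type contained a space
            merges.append((left, right))
    return merges
```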
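And to illustrate point 2, a sketch of the byte-to-character mapping, modelled on the bytes_to_unicode function published with the GPT-2 code (slightly simplified here):

```python
# Sketch of the GPT-2-style byte-to-character mapping: bytes that render fine
# as characters keep their own codepoint; every other byte (space, control
# characters, ...) is shifted, in order, onto the codepoints starting at 256.
def bytes_to_unicode():
    # Bytes that can be shown as-is: printable ASCII minus space, plus most of Latin-1.
    keep = (list(range(ord("!"), ord("~") + 1))
            + list(range(ord("¡"), ord("¬") + 1))
            + list(range(ord("®"), ord("ÿ") + 1)))
    mapping = {b: chr(b) for b in keep}
    n = 0
    for b in range(256):
        if b not in mapping:           # problematic byte: space, control chars, ...
            mapping[b] = chr(256 + n)  # next unused codepoint after 255
            n += 1
    return mapping

m = bytes_to_unicode()
print(m[32])  # 'Ġ' (codepoint 288): bytes 0..32 are all problematic, so 256 + 32
print(m[31])  # 'ğ' (codepoint 287)
print("".join(m[b] for b in " horse".encode("utf-8")))  # 'Ġhorse'
```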

As for the long underscore (▁) in SentencePiece: this is actually hardcoded all across the package, rather than being an algorithmic coincidence. It's just because Taku Kudo needed a replacement character for spaces (again, for the reason above) and liked how ▁ looked.

P.S.: SentencePiece is a software package. It implements two subword tokenisers: BPE and ULM. BPE tokenisers trained with SentencePiece, which is the most frequently used implementation of BPE training, still use ▁ because (1) it is hardcoded in the package and (2) SentencePiece uses a character alphabet rather than a byte mapping.
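For completeness, a small usage sketch with the sentencepiece Python package; the corpus file name, the vocabulary size and the exact output pieces are placeholders, since they depend on what you train on:

```python
import sentencepiece as spm

# Train a toy BPE model; "corpus.txt" and vocab_size are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="toy_bpe",
    vocab_size=1000, model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="toy_bpe.model")
print(sp.encode("I'm going to New York", out_type=str))
# e.g. ['▁I', "'m", '▁going', '▁to', '▁New', '▁York'] -- spaces surface as ▁
```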

Mew