Why do tokenization algorithms replace spaces with an underscore "_"?

Question

I recently read an article on Byte-Pair Encoding (BPE) in which spaces are replaced with an underscore (_). Put another way, an underscore is added after each word/token.

I don't understand the intuition behind adding an underscore. Couldn't we do without it? We could also use any other character that is rarely used, so why an underscore? And we could add the marker at the start of the word, or at both the start and the end, so why add it at the end?

Site: https://medium.com/@hsinhungw/understanding-byte-pair-encoding-fd196ebfe93f

Neil Slater · Accepted Answer · 2024-05-23T08:09:42.577

As you have spotted, there is no particular reason to choose a specific character. However, the author of the article is operating under some constraints that I would guess guided the choices:

The eventual encoding will include a mix of complete words with endings, word-ending partials, middle-of-word partials and single characters. For consistency, the encoding needs some way to tell these apart.
The end-of-word marker is an important adaptation of the more general BPE compression technique:

This symbol is important as it marks word boundaries, which prevents the algorithm from confusing the end of one word with the start of another.

Given the need for a marker, it helps to use a value that does not appear anywhere else in the data, that is, within any string that needs to be represented. Here _ is not necessarily a universal or even a good choice in general, but it's fine for the data in the article.
It helps if readers can see the marker. That makes using a space a bit awkward, as a space can easily be overlooked. It also rules out more general choices of non-displaying character codes, which are what many systems will use in practice.
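
To make the role of the end-of-word marker concrete, here is a minimal sketch of BPE merge learning in Python. It is not the article's code: the function names (`word_to_symbols`, `count_pairs`, `merge_pair`, `learn_bpe`) and the toy corpus are my own assumptions, and ties between equally frequent pairs are simply broken by first occurrence.

```python
from collections import Counter

END = "_"  # end-of-word marker (assumed; any symbol absent from the data works)

def word_to_symbols(word, end=END):
    """Split a word into characters and append the end-of-word marker."""
    return tuple(word) + (end,)

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    words = Counter(corpus.split())
    vocab = {word_to_symbols(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges, vocab

merges, vocab = learn_bpe("low low low lower", 3)
print(merges)
print(vocab)
```

After three merges on this toy corpus, the whole word "low" has become the single token `low_`, while the "low" inside "lower" is a separate token `low` with no marker. Without the marker the two would be indistinguishable, which is the confusion between word endings and word middles described above.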

Using underscore to represent spaces or joins between words has precedent in a lot of software. It can be seen but doesn't visually interfere with reading the other characters as separate.

In more complex datasets, where underscore (or whichever character) could appear within the strings, you have a choice of either finding a (perhaps not real) character code or byte sequence that is not in the data, or adding a special parsing rule for the end of tokens. Both of these are added complications that the article can save for another time.
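
As a rough sketch of the first option, the marker can be picked by checking the data. The helper name `pick_marker` and the candidate list are hypothetical choices for illustration, and the fallback to a Unicode Private Use Area code point is just one possible way to get a "not real" character.

```python
def pick_marker(text, candidates=("_", "</w>", "\u2581")):
    """Return a marker string that does not occur anywhere in `text`."""
    # Prefer a visible, human-readable marker if one is free.
    for marker in candidates:
        if marker not in text:
            return marker
    # Otherwise fall back to an unused Private Use Area code point
    # (U+E000 to U+F8FF), which has no assigned meaning in Unicode.
    for code_point in range(0xE000, 0xF900):
        marker = chr(code_point)
        if marker not in text:
            return marker
    raise ValueError("no unused marker found in the searched range")

print(pick_marker("low lower newest"))   # underscore is free here
print(pick_marker("snake_case names"))   # underscore is taken, so "</w>" is used
```

The trade-off is the one described above: a visible marker is easier for a reader to follow, while an invisible or unassigned code point is safer for arbitrary data.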
