There are several subword tokenization algorithms, such as byte-pair encoding (BPE), WordPiece, and SentencePiece. What's the difference between them?
1 Answer
Byte-pair encoding (BPE) begins with a vocabulary of individual characters and iteratively merges the most frequent pair of adjacent symbols (characters or subwords) in the corpus into a larger subword unit, continuing until a predefined vocabulary size is reached. Because rare words can then be broken into meaningful subword units rather than being mapped to an unknown token, it handles rare and out-of-vocabulary words well, and it is particularly useful for morphologically rich languages.
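To make the merge loop concrete, here is a minimal sketch of BPE training in Python (a toy corpus and a fixed number of merges, not production code):

```python
import re
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, corpus):
    """Merge every standalone occurrence of the pair into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in corpus.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker and counts.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):          # a real tokenizer keeps merging until the target vocabulary size
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(pair, corpus)
    print(pair)
```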
All of the subword tokens learned from the corpus are stored in a token vocabulary; in the case of GPT-3.5 and GPT-4 (the cl100k_base encoding), its size is 100,256.
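You can inspect that vocabulary with the tiktoken library (an assumption about tooling on my part; note that the reported count also includes a handful of special tokens):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")     # encoding used by GPT-3.5 / GPT-4
print(enc.n_vocab)                             # vocabulary size, including special tokens
ids = enc.encode("Tokenizers handle uncommonwords gracefully")
print([enc.decode([i]) for i in ids])          # the individual subword pieces
```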
WordPiece is similar to BPE, but instead of merging the most frequent pair, it chooses the merge that maximizes the likelihood of the training corpus under a language model, so it tends to produce subwords that are more semantically meaningful within that corpus.
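As a rough sketch of training a WordPiece model yourself, the Hugging Face tokenizers library exposes a WordPiece trainer (the path corpus.txt is a placeholder; the library scores candidate merges by pair frequency divided by the frequencies of the parts, a practical stand-in for full likelihood maximization):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(vocab_size=5000, special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train(["corpus.txt"], trainer)       # corpus.txt: one sentence per line
print(tokenizer.encode("unaffable").tokens)    # continuation pieces carry a '##' prefix
```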
BERT's tokenizer is a WordPiece model, a subword strategy similar to byte-pair encoding. Its vocabulary contains about 30,000 entries, and any token not found in the vocabulary is replaced by [UNK].
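To see this in practice, loading BERT's pretrained tokenizer through the transformers library (again an assumption about tooling) shows both the subword splitting and the [UNK] fallback:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.vocab_size)             # 30,522 entries for bert-base-uncased
print(tokenizer.tokenize("unaffable"))  # split into WordPiece pieces prefixed with '##'
print(tokenizer.tokenize("☃"))          # a character missing from the vocabulary -> ['[UNK]']
```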
SentencePiece treats the input text as a raw sequence of bytes or Unicode characters without assuming any predefined word boundaries. This allows it to tokenize without relying on whitespace or other language-specific delimiters, which makes it language-independent and highly adaptable.
From the SentencePiece paper (Kudo and Richardson, 2018): "While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences... Such language dependent processing also makes it hard to train multilingual NMT models (Johnson et al., 2016), as we have to carefully manage the configurations of pre- and postprocessors per language, while the internal deep neural architectures are language independent."
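A minimal end-to-end example with the sentencepiece package (corpus.txt is a hypothetical raw-text file, one sentence per line):

```python
import sentencepiece as spm

# Train directly on raw sentences; no pre-tokenization or whitespace splitting required.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # hypothetical raw corpus
    model_prefix="sp_demo",    # writes sp_demo.model and sp_demo.vocab
    vocab_size=8000,
    model_type="unigram",      # "bpe" is also supported
)

sp = spm.SentencePieceProcessor(model_file="sp_demo.model")
print(sp.encode("Hello world, こんにちは世界", out_type=str))  # '▁' marks original whitespace
```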