It is my understanding that LLMs are trained by tokenizing text and creating embeddings of those tokens, and that they internally operate on those embeddings.

To me, this suggests that you can't feed an LLM words that didn't exist in its training data, and likewise that it can't generate words that weren't in the training data.

However, in the screenshot below, ChatGPT seems to handle made-up words just fine.

How does ChatGPT achieve this?

[Screenshot of a ChatGPT conversation in which it responds sensibly to made-up words]

1 Answer

LLMs like GPT-3 typically use subword tokenization, for example Byte Pair Encoding (BPE) in the GPT family. These methods break words down into smaller pieces, so the model doesn't work only with whole words but also with word parts (subwords) that appear frequently in the training data. The model then maps each of these tokens to an embedding.
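
As a rough illustration, here is a minimal sketch using OpenAI's tiktoken library (the made-up word "flumboozle" and the toy embedding matrix are illustrative assumptions, not anything ChatGPT-specific): any string the tokenizer can encode becomes a sequence of known token IDs, and each ID simply indexes a row of the embedding matrix.

```python
import numpy as np
import tiktoken

# BPE vocabulary used by GPT-3.5/GPT-4-era models; GPT-3 itself used the
# older "r50k_base" vocabulary, but the principle is the same.
enc = tiktoken.get_encoding("cl100k_base")

word = "flumboozle"  # invented word, almost certainly absent from training data
ids = enc.encode(word)
print(ids, [enc.decode([i]) for i in ids])  # still splits into known subword pieces

# Toy stand-in for the model's embedding matrix: each token ID just indexes a
# row, so even a never-before-seen word arrives as familiar subword embeddings.
emb = np.random.randn(enc.n_vocab, 16)  # (vocab_size, embedding_dim)
print(emb[ids].shape)  # (number_of_tokens, 16)
```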

Therefore, if you type a made-up or novel word, ChatGPT can still generate something that seems reasonable, because the underlying model has learned typical word formation (e.g., prefixes, suffixes, and common word parts) and the probabilistic associations between subword tokens, and it uses this knowledge to predict the likely continuation of the text. For example, if a typo like "rel8ivity" appears in a prompt at inference time, tokenization may still split it into pieces such as "rel", "8", and "ivity", which the model associates with "relativity" because those subwords co-occur with it throughout the enormous training corpus. This lets LLMs see through minor typos and still recover the most probable intended meaning.
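
You can check the typo example the same way. Note that the exact split depends on the vocabulary, so "rel"/"8"/"ivity" is one plausible outcome rather than a guarantee:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["relativity", "rel8ivity"]:
    pieces = [enc.decode([i]) for i in enc.encode(word)]
    print(f"{word!r} -> {pieces}")

# The typo's subword pieces typically overlap with the clean word's pieces,
# which is what lets the model infer the intended meaning despite the "8".
```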
