
I was learning about tokenization (WordPiece) and how there is a normalization step before it that removes consecutive whitespace from the input text, since runs of whitespace are normally not significant (a quick demonstration follows the two bullet points below).

But that got me wondering:

  • how do LLMs still understand Python and other programming languages whose block structure is defined by indentation?

  • how do LLMs still understand ASCII art?
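For example, with a WordPiece tokenizer such as BERT's, the indentation simply vanishes. A minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (any BERT-style model should behave similarly):

    # WordPiece normalization collapses/strips whitespace before splitting,
    # so the indentation contributes no tokens at all.
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    code = "if x:\n        print(x)"
    print(tokenizer.tokenize(code))
    # Expect something like ['if', 'x', ':', 'print', '(', 'x', ')'] --
    # the eight-space indentation has disappeared entirely.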

1 Answer


While some tokenizers, such as WordPiece, normalize whitespace by collapsing tabs and runs of spaces into a single space, modern LLM tokenizers based on byte-level BPE generally preserve whitespace structure, which lets the models learn code and ASCII art from their vast training corpora. The tokenizers used by newer models also compress that whitespace more efficiently.

Wasteful Whitespace: Indentation in Python can be tokenized as individual spaces, bloating sequences and making code harder for the LLM to understand. Newer tokenizers are getting better at this!... BPE is the algorithm behind most modern tokenizers. It works by finding common pairs of bytes in the training data and compressing them into single tokens. This strikes a balance between vocabulary size and efficiency, allowing LLMs to process text more effectively.
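To make the merging idea concrete, here is a toy sketch of the core BPE training loop (illustrative only, not any particular library's implementation). On indented code, runs of spaces are so frequent that multi-space tokens appear within the first few merges:

    from collections import Counter

    def bpe_train(text, num_merges=20):
        # Start from individual characters (byte-level BPE starts from bytes).
        tokens = list(text)
        merges = []
        for _ in range(num_merges):
            pairs = Counter(zip(tokens, tokens[1:]))
            if not pairs:
                break
            best = max(pairs, key=pairs.get)  # most frequent adjacent pair
            merges.append(best)
            # Replace every occurrence of the best pair with one merged symbol.
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    merged.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            tokens = merged
        return merges

    sample = "    if x:\n        print(x)\n    else:\n        pass\n"
    print(bpe_train(sample))
    # The pair (' ', ' ') is merged almost immediately, so four- and
    # eight-space indents end up as only one or two tokens.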

This tokenizer inefficiency can be observed by contrasting tokenizers. Let's take a Python FizzBuzz script... We can observe it in the tiktokenizer tool with the GPT-2 encoder selected: note that it reports a token count of 149. Now select cl100k_base (the tokenizer used by GPT-4) and notice that, for the exact same string, the token count drops significantly to 77.
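You can reproduce the same kind of comparison locally with the tiktoken library. The FizzBuzz below is a stand-in rather than the exact script from the original example, so the counts will differ from 149/77, but the gap between the two encoders shows up the same way:

    import tiktoken

    fizzbuzz = """
    for i in range(1, 101):
        if i % 15 == 0:
            print("FizzBuzz")
        elif i % 3 == 0:
            print("Fizz")
        elif i % 5 == 0:
            print("Buzz")
        else:
            print(i)
    """

    for name in ("gpt2", "cl100k_base"):
        enc = tiktoken.get_encoding(name)
        print(name, len(enc.encode(fizzbuzz)))
    # gpt2 spends roughly one token per leading space, while cl100k_base
    # merges whole indentation runs, so its count comes out much lower.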

Furthermore, LLMs track token order via positional encoding, so even if spaces are individual tokens, the model knows their sequential position. This helps reconstruct indentation levels and ASCII art layouts.
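A minimal sketch of the classic sinusoidal positional encoding (Vaswani et al., 2017) illustrates this. Real LLMs may use learned or rotary position embeddings instead, but the principle is the same: identical space tokens at different positions receive different position vectors.

    import numpy as np

    def sinusoidal_positions(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]      # token index in the sequence
        dim = np.arange(d_model)[None, :]      # embedding dimension index
        angle = pos / np.power(10000, (2 * (dim // 2)) / d_model)
        enc = np.zeros((seq_len, d_model))
        enc[:, 0::2] = np.sin(angle[:, 0::2])  # even dimensions: sine
        enc[:, 1::2] = np.cos(angle[:, 1::2])  # odd dimensions: cosine
        return enc

    pe = sinusoidal_positions(seq_len=8, d_model=16)
    # Two consecutive space tokens share the same token embedding, but rows
    # pe[3] and pe[4] differ, so the model can tell "column 4" from "column 5".
    print(pe.shape)  # (8, 16)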
