6

I understand that GPT models process input text by converting words into tokens and then into embedding vectors, rather than processing the text letter by letter. Given this approach, I am curious how a model like ChatGPT can identify the first (or n-th) letter of a given word. Can anyone explain the underlying mechanism or provide any insights into this capability?


Peyman

2 Answers

1

LLMs like the GPT family use tokenization to transform text into numbers. The GPT models offered by OpenAI don't use a separate token for each letter. Instead, they split words into short, commonly occurring parts; for example, "backpack" may be split into "back" and "pack".
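As a rough illustration of that splitting, here is a minimal sketch of a greedy longest-match subword tokenizer. The vocabulary `TOY_VOCAB` and the function `toy_tokenize` are made up for this example; the real GPT tokenizers use byte-pair encoding with a much larger learned vocabulary, so actual splits and IDs will differ.

```python
# Hypothetical toy vocabulary: subword string -> token ID.
# These are NOT OpenAI's real tokens or IDs.
TOY_VOCAB = {
    "back": 0, "pack": 1, "ing": 2,
    "b": 3, "a": 4, "c": 5, "k": 6, "p": 7, "i": 8, "n": 9, "g": 10,
}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Split a word greedily into the longest vocabulary pieces, left to right."""
    ids = []
    pos = 0
    while pos < len(word):
        # Try the longest remaining substring first, then shorter ones.
        for end in range(len(word), pos, -1):
            piece = word[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {word[pos]!r}")
    return ids


print(toy_tokenize("backpack"))  # [0, 1]
```

The point is that the model receives only the IDs (here `[0, 1]`), not the individual letters.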

You can check the tokenization on the OpenAI website and with the tiktoken library.

When I ran long English texts through tiktoken, I didn't observe any token IDs higher than 100,000. This may suggest that there are around 100,000 different tokens in English.

There is a high chance that the models have learned the length and the order of the letters in each of these tokens, in exactly the same manner that they've learned other things.
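To make that idea concrete: if a model has effectively memorized the spelling of each token, the n-th letter of a word follows from the token sequence alone. The sketch below stands in for that learned knowledge with an explicit table. `TOKEN_SPELLING` and `nth_letter` are hypothetical names used only for illustration, and a real model does not store a literal lookup table like this.

```python
# Hypothetical stand-in for what a model might have learned:
# token ID -> the letters that token spells.
TOKEN_SPELLING = {0: "back", 1: "pack"}


def nth_letter(token_ids, n, spelling=TOKEN_SPELLING):
    """Return the n-th letter (1-based) of the word formed by token_ids."""
    if n < 1:
        raise ValueError("n must be at least 1")
    remaining = n
    for token_id in token_ids:
        piece = spelling[token_id]
        if remaining <= len(piece):
            return piece[remaining - 1]
        # The letter lies in a later token; skip this token's letters.
        remaining -= len(piece)
    raise IndexError("word is shorter than n letters")


print(nth_letter([0, 1], 1))  # b
print(nth_letter([0, 1], 6))  # a
```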

GPT-3.5 struggled with tasks involving text structure (such as the length of words). GPT-4 performs a bit better; this may result from further aligning and fine-tuning the model on datasets that force it to learn the letters and length of each token.

(Screenshot: GPT-3.5 struggling with counting letters.)

-2

It's actually tokenizing each letter. That is the reason why asking it to count the number of letters in more than a few words will result in basically a guessing game, and it will more often than not result in wrong answers.

Now, if you prompt it to follow a more resource-demanding methodology, namely (1) identify the words of text x one by one, (2) number each letter within the identified word, (3) total up the number of letters within that particular word, (4) start over with the next word, and (5) continue until you reach the end of the text, then it will only be able to process maybe 10 to 15 average-length words. It has to express the steps in concepts (tokens), including every single letter, then associate a number with each, and so on.
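The five steps above can be sketched as ordinary code. This is only an illustration of the procedure being prompted for, not of how the model works internally; `count_letters_step_by_step` is a hypothetical helper, and treating only alphabetic characters as letters is an assumption of this sketch.

```python
def count_letters_step_by_step(text):
    """Return (word, numbered_letters, letter_count) for each word in text."""
    results = []
    for word in text.split():  # (1) take the words one by one
        letters = [ch for ch in word if ch.isalpha()]  # ignore punctuation
        numbered = list(enumerate(letters, start=1))  # (2) number each letter
        total = len(numbered)  # (3) total for this word
        results.append((word, numbered, total))  # (4)-(5) move on until the end
    return results


for word, numbered, total in count_letters_step_by_step("Hello, world"):
    print(word, numbered, total)
```

Note how much output even two short words produce: every letter is written out alongside its number, which is why the same procedure spelled out in a chat consumes so many tokens.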

When you do the above, you will see GPT-4 either simply cut the cr*p and say, "[a]nd this is how you continue until you finish", or, if you keep insisting that it go through the entirety of text x, run out of tokens and leave the output unfinished, at which point you have to prompt again with "Continue", "carry on", etc.

This would mean either that GPT-4 in ChatGPT does not actually have 4,096 tokens (it does!), or that these tasks are extremely token-demanding.

This is only a high-level understanding of the issue, but it generally suggests that the model doesn't work the answer out from a pre-existing "database" that was sort of hardcoded into it. The initial wrong answers show this: it sort of guesses instead of following a strict methodology that would avoid errors. (If it did have such a database, the initial answer would be correct, and it wouldn't be so resource-demanding to actually run the "math" for this task.)

If needed, I can provide an entire example chat of this behavior.