Questions tagged [tokenization]
16 questions
5
votes
1 answer
How do LLMs tokenize Python (significant whitespace)?
I was learning about tokenization (WordPiece) and how there is a normalization step beforehand that removes consecutive whitespace from the input text, since it is not normally significant.
But that got me wondering:
how do LLMs still…
Johannes Schaub - litb
- 175
- 5
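A minimal sketch related to the question above (assuming the tiktoken package, which the question does not name): GPT-style byte-level BPE tokenizers do not normalize whitespace away, so Python indentation survives as tokens of its own.

    # Minimal sketch (assumes tiktoken): runs of spaces become their own tokens,
    # so Python's significant indentation is preserved through tokenization.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    code = "def f(x):\n    return x + 1"
    pieces = [enc.decode([t]) for t in enc.encode(code)]
    print(pieces)  # the four-space indent appears among the decoded pieces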
1
vote
0 answers
GPT tokenization of non-English languages
I read that GPT tokenizers perform poorly on non-English languages because of the lack of training text for the BPE merges. So,
given a non-English prompt, let's say Hindi, why not just have
GPT make an API call to a machine translator to convert the Hindi…
Aditya Manjunatha
- 11
- 2
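To make the premise of the question above concrete, here is a minimal sketch (assuming tiktoken; the exact counts depend on the encoding) comparing token counts for an English and a Hindi sentence of similar length.

    # Minimal sketch (assumes tiktoken): token counts for English vs. Hindi.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    english = "Hello, how are you today?"
    hindi = "नमस्ते, आज आप कैसे हैं?"
    print(len(enc.encode(english)), len(enc.encode(hindi)))
    # Hindi typically yields noticeably more tokens per word with this vocabulary.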
1
vote
1 answer
How do LLMs deal with unknown words?
It is my understanding that LLMs are trained by tokenizing text and creating embeddings of those tokens, and that they internally deal with those embeddings.
To me, this suggests that you can't feed in words that didn't exist in the training data. It should also…
yasar
- 111
- 2
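A minimal sketch touching the question above (assuming the transformers library and the bert-base-uncased checkpoint, neither named in the question): a word the tokenizer has never seen is split into known subword pieces, so there is no "unknown word" at the embedding layer, only rarer pieces.

    # Minimal sketch (assumes transformers + bert-base-uncased): unseen words
    # are decomposed into subword units from the fixed vocabulary.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    print(tok.tokenize("blorptastic"))  # several '##'-prefixed subword pieces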
1
vote
1 answer
For an LLM, how is the token dictionary built?
For an LLM, how was it decided how large the vocabulary is and how many unique tokens exist for the English language?
Jack Miles
- 31
- 4
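A minimal sketch relevant to the question above (assuming the Hugging Face `tokenizers` library, not named in the question): the vocabulary size is a hyperparameter passed to the trainer before BPE learning starts, not a property of English itself.

    # Minimal sketch (assumes the `tokenizers` library): vocab_size is chosen
    # up front; BPE then learns merges until that budget (or the data) runs out.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])

    corpus = ["a tiny toy corpus just to illustrate the API"]  # hypothetical data
    tokenizer.train_from_iterator(corpus, trainer)
    print(tokenizer.get_vocab_size())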
0
votes
0 answers
How do BPE Tokenizers with `add_prefix_space` handle natural language, such as quotations or poetry, where there won't be a prefix space?
BPE tokenizers are the standard for modern LLMs. By default, most apply add_prefix_space, so that John went away is pretokenized to [_John][_went][_away]. To preserve reversibility, the leading space is removed on round-trip. This allows the tokenizer…
SRobertJames
- 63
- 1
- 6
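A minimal sketch for the question above (assuming transformers and the gpt2 checkpoint, which the question does not name): the first word of a raw string gets no leading-space marker unless one is added.

    # Minimal sketch (assumes transformers + gpt2): compare tokenization with
    # and without a leading space.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    print(tok.tokenize("John went away"))   # first piece has no 'Ġ' space marker
    print(tok.tokenize(" John went away"))  # every piece carries the 'Ġ' marker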
0
votes
0 answers
Why would the tokenizer for an encoder-decoder model for machine translation use bos_token_id == eos_token_id? How does it know when a sequence ends?
I see on this PyTorch model Helsinki-NLP/opus-mt-fr-en (HuggingFace), which is an encoder-decoder model for machine translation:
"bos_token_id": 0,
"eos_token_id": 0,
in its config.json.
Why set bos_token_id == eos_token_id? How does it know…
Franck Dernoncourt
- 3,473
- 2
- 21
- 39
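A minimal sketch for inspecting the ids the question above refers to (assuming transformers and network access to the Hugging Face Hub):

    # Minimal sketch (assumes transformers): read the special-token ids straight
    # off the model config instead of the raw config.json.
    from transformers import AutoConfig

    cfg = AutoConfig.from_pretrained("Helsinki-NLP/opus-mt-fr-en")
    print(cfg.bos_token_id, cfg.eos_token_id,
          cfg.pad_token_id, cfg.decoder_start_token_id)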
0
votes
1 answer
Why is the maximum number of output tokens usually 4096?
LLMs' maximum number of output tokens is usually capped at 4096. Is there a reason for this?
user88306
0
votes
0 answers
LLM tokenization for metadata
I want to train a transformer on a language dataset. The dataset has different types of text, e.g. webpage text, social media posts, e-commerce site posts, research articles, stories, and so on. In addition, the timestamp of the text's generation is also…
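One common way to expose such metadata to a tokenizer is to register extra special tokens and prepend them to each document. A minimal sketch (assuming transformers and the gpt2 checkpoint; the marker names are hypothetical):

    # Minimal sketch (assumes transformers + gpt2): hypothetical metadata markers
    # added as special tokens; the model's embeddings would also need resizing
    # with model.resize_token_embeddings(len(tok)).
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    markers = ["<webpage>", "<social>", "<ecommerce>", "<year=2021>"]  # hypothetical
    tok.add_special_tokens({"additional_special_tokens": markers})

    ids = tok("<webpage> <year=2021> Some page text ...")["input_ids"]
    print(tok.convert_ids_to_tokens(ids))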
0
votes
0 answers
Fine-tuning MARBERT for Tunisian dialect: question about the tokenizer
I want to fine-tune MARBERT for Tunisian dialect text classification, using this dataset: https://www.kaggle.com/datasets/waalbannyantudre/tunisian-arabizi-dialect-data-sentiment-analysis
I have tested the tokenizer:
text example: "bravo slim riahii…
0
votes
1 answer
What are the types of tokenizers?
There are several, like byte-pair encoding (BPE), WordPiece, SentencePiece, etc. What's the difference between them?
user88306
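The differences show up directly in the pieces each algorithm produces. A minimal sketch (assuming transformers and these three public checkpoints): gpt2 uses byte-level BPE, bert-base-uncased uses WordPiece, and xlnet-base-cased uses a SentencePiece (unigram) model.

    # Minimal sketch (assumes transformers): compare the output of three
    # tokenizer families on the same text.
    from transformers import AutoTokenizer

    for name in ["gpt2", "bert-base-uncased", "xlnet-base-cased"]:
        tok = AutoTokenizer.from_pretrained(name)
        print(name, tok.tokenize("tokenization algorithms differ"))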
0
votes
0 answers
How to 'induce' or 'teach' a pretrained model to a continuous Transformer token-pruning algorithm
I am currently looking for ways to improve Transformer performance in image processing, especially in image segmentation. I found this paper by Kong, Z., et al. called "SPViT: Enabling Faster Vision Transformers via Latency-aware Soft Token Pruning"…
RedSean
- 1
- 1
0
votes
1 answer
How to convert a positionally encoded predicted embedding from a decoder to its matching token?
Is it valid to just subtract the positional encoding from a predicted output if the decoder was also positionally encoded? Or does masking take care of this problem, so that the decoder should only learn the embedding (not positionally encoded)?
Yanall Boutros
- 3
- 1
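A toy sketch related to the question above (PyTorch assumed; all sizes are hypothetical): a common scheme maps a decoder output vector back to a token by comparing it against the plain token embedding matrix (no positional encoding), rather than subtracting positions.

    # Toy sketch (PyTorch; hypothetical sizes): nearest-token lookup by dot
    # product against the non-positional embedding matrix.
    import torch
    import torch.nn as nn

    vocab_size, d_model = 100, 16
    emb = nn.Embedding(vocab_size, d_model)

    decoder_out = torch.randn(d_model)        # stand-in for a decoder output vector
    logits = decoder_out @ emb.weight.T       # similarity to every token embedding
    predicted_token_id = logits.argmax().item()
    print(predicted_token_id)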
0
votes
0 answers
How can I change the tokens BERT uses so that each digit is a separate token?
Rather than have the tokenizer generate this sort of thing:
"$1009 Dollars" => ["$", "100#", "9", "Dollars"]
I'd like to have:
"$1009 Dollars" => ["$", "1#", "0#", "0#", "9", "Dollars"]
Is that possible? I know I'd need to add all the n# tokens,…
slim
- 101
- 2
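One workaround for the question above that avoids touching the vocabulary is to pre-split digits before tokenization. A minimal sketch (assuming transformers and the bert-base-uncased checkpoint, which the question does not name):

    # Minimal sketch (assumes transformers + bert-base-uncased): insert spaces
    # between consecutive digits so each digit is pre-tokenized on its own.
    import re
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")

    def split_digits(text: str) -> str:
        # "1009" -> "1 0 0 9"
        return re.sub(r"(?<=\d)(?=\d)", " ", text)

    print(tok.tokenize(split_digits("$1009 Dollars")))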
0
votes
0 answers
How to add/embed categorical features in a transformer network?
I would like to give more context to my transformers by adding some metadata related to each token. This metadata is mostly categorical (3 fields, with 3 possible values for each field).
In addition to the positional embedding (same shape as the tokens,…
JulienG
- 1
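A toy sketch of one way to do what the question above describes (PyTorch assumed; sizes are hypothetical): one embedding table per categorical field, summed into the token embeddings alongside the positional embedding.

    # Toy sketch (PyTorch; hypothetical sizes): sum token embeddings with one
    # embedding table per categorical metadata field.
    import torch
    import torch.nn as nn

    vocab_size, d_model, n_fields, n_values = 1000, 64, 3, 3
    tok_emb = nn.Embedding(vocab_size, d_model)
    field_embs = nn.ModuleList(nn.Embedding(n_values, d_model) for _ in range(n_fields))

    token_ids = torch.randint(0, vocab_size, (8, 16))        # (batch, seq)
    field_ids = torch.randint(0, n_values, (n_fields, 8, 16))

    x = tok_emb(token_ids)
    for i, emb in enumerate(field_embs):
        x = x + emb(field_ids[i])                            # add metadata embeddings
    # x would then get positional embeddings added and feed the transformer as usual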
0
votes
1 answer
Why do tokenization algorithms replace the space with an underscore "_"?
I recently read an article on Byte-Pair Encoding (BPE), in which the space is replaced with an underscore (_), or, we could say, an underscore is added to each word/token.
I did not understand the intuition for adding an underscore; we could just…
Aamod Thakur
- 45
- 3
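A minimal sketch illustrating the marker in the question above (assuming transformers and the t5-small checkpoint, which uses SentencePiece): the '▁' symbol stands in for the space, so joining the pieces and swapping '▁' back to ' ' recovers the original string.

    # Minimal sketch (assumes transformers + t5-small): the word-boundary marker
    # makes tokenization reversible.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("t5-small")
    pieces = tok.tokenize("New York is big")
    print(pieces)
    print("".join(pieces).replace("\u2581", " ").strip())  # round-trips to the input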