Questions tagged [tokenization]
16 questions
5
votes
1 answer
How do LLMs tokenize Python (significant whitespace)?
I was learning about tokenization (WordPiece) and how there is a normalization step beforehand that removes consecutive whitespace from the input text, since it is not normally significant.
But that got me wondering:
how do LLMs still…
Johannes Schaub - litb
- 175
- 5
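A minimal sketch related to the question above (assuming the tiktoken package, which the question does not name): GPT-style byte-level BPE tokenizers do not normalize whitespace away, so Python indentation survives as tokens of its own.

    # Minimal sketch (assumes tiktoken): runs of spaces become their own tokens,
    # so Python's significant indentation is preserved through tokenization.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    code = "def f(x):\n    return x + 1"
    pieces = [enc.decode([t]) for t in enc.encode(code)]
    print(pieces)  # the four-space indent appears among the decoded pieces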
1
vote
0 answers
GPT tokenization of non-English languages
I read that GPT tokenizers perform poorly on non-English languages because of the lack of training text for the BPE merges. So,
given a non-English prompt, let's say Hindi, why not just have
GPT make an API call to a machine translator to convert the Hindi…
Aditya Manjunatha
- 11
- 2
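To make the premise of the question above concrete, here is a minimal sketch (assuming tiktoken; the exact counts depend on the encoding) comparing token counts for an English and a Hindi sentence of similar length.

    # Minimal sketch (assumes tiktoken): token counts for English vs. Hindi.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    english = "Hello, how are you today?"
    hindi = "नमस्ते, आज आप कैसे हैं?"
    print(len(enc.encode(english)), len(enc.encode(hindi)))
    # Hindi typically yields noticeably more tokens per word with this vocabulary.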
1
vote
1 answer
How do LLMs deal with unknown words?
It is my understanding that LLMs are trained by tokenizing text and creating embeddings of those tokens, and that they internally deal with those embeddings.
To me, this suggests that you can't feed in words that didn't exist in the training data. It should also…
yasar
- 111
- 2
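A minimal sketch touching the question above (assuming the transformers library and the bert-base-uncased checkpoint, neither named in the question): a word the tokenizer has never seen is split into known subword pieces, so there is no "unknown word" at the embedding layer, only rarer pieces.

    # Minimal sketch (assumes transformers + bert-base-uncased): unseen words
    # are decomposed into subword units from the fixed vocabulary.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    print(tok.tokenize("blorptastic"))  # several '##'-prefixed subword pieces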
1
vote
1 answer
For an LLM, how is the token dictionary built?
For an LLM, how was it decided how large the vocabulary is and how many unique tokens exist for the English language?
Jack Miles
- 31
- 4
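A minimal sketch relevant to the question above (assuming the Hugging Face `tokenizers` library, not named in the question): the vocabulary size is a hyperparameter passed to the trainer before BPE learning starts, not a property of English itself.

    # Minimal sketch (assumes the `tokenizers` library): vocab_size is chosen
    # up front; BPE then learns merges until that budget (or the data) runs out.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])

    corpus = ["a tiny toy corpus just to illustrate the API"]  # hypothetical data
    tokenizer.train_from_iterator(corpus, trainer)
    print(tokenizer.get_vocab_size())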
0
votes
0 answers
How do BPE Tokenizers with `add_prefix_space` handle natural language, such as quotations or poetry, where there won't be a prefix space?
BPE tokenizers are the standard for modern LLMs. By default, most apply add_prefix_space, so that John went away is pretokenized to [_John][_went][_away]. To preserve reversibility, the leading space is removed on round-trip. This allows the tokenizer…
SRobertJames
- 63
- 1
- 6
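A minimal sketch for the question above (assuming transformers and the gpt2 checkpoint, which the question does not name): the first word of a raw string gets no leading-space marker unless one is added.

    # Minimal sketch (assumes transformers + gpt2): compare tokenization with
    # and without a leading space.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    print(tok.tokenize("John went away"))   # first piece has no 'Ġ' space marker
    print(tok.tokenize(" John went away"))  # every piece carries the 'Ġ' marker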
0
votes
0 answers
Why would the tokenizer for an encoder-decoder model for machine translation use bos_token_id == eos_token_id? How does it know when a sequence ends?
I see on this PyTorch model Helsinki-NLP/opus-mt-fr-en (HuggingFace), which is an encoder-decoder model for machine translation:
"bos_token_id": 0,
"eos_token_id": 0,
in its config.json.
Why set bos_token_id == eos_token_id? How does it know…
Franck Dernoncourt
- 3,473
- 2
- 21
- 39
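A minimal sketch for inspecting the ids the question above refers to (assuming transformers and network access to the Hugging Face Hub):

    # Minimal sketch (assumes transformers): read the special-token ids straight
    # off the model config instead of the raw config.json.
    from transformers import AutoConfig

    cfg = AutoConfig.from_pretrained("Helsinki-NLP/opus-mt-fr-en")
    print(cfg.bos_token_id, cfg.eos_token_id,
          cfg.pad_token_id, cfg.decoder_start_token_id)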
0
votes
1 answer
Why is the maximum number of output tokens usually 4096?
LLMs' maximum number of output tokens is usually capped at 4096. Is there a reason for this?
user88306
0
votes
0 answers
LLM tokenization for metadata
I want to train a transformer on a language dataset. The dataset has different types of text, e.g. webpage text, social media posts, e-commerce site posts, research articles, stories, and so on. In addition, the timestamp of the text's generation is also…
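One common way to expose such metadata to a tokenizer is to register extra special tokens and prepend them to each document. A minimal sketch (assuming transformers and the gpt2 checkpoint; the marker names are hypothetical):

    # Minimal sketch (assumes transformers + gpt2): hypothetical metadata markers
    # added as special tokens; the model's embeddings would also need resizing
    # with model.resize_token_embeddings(len(tok)).
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    markers = ["<webpage>", "<social>", "<ecommerce>", "<year=2021>"]  # hypothetical
    tok.add_special_tokens({"additional_special_tokens": markers})

    ids = tok("<webpage> <year=2021> Some page text ...")["input_ids"]
    print(tok.convert_ids_to_tokens(ids))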
0
votes
0 answers
Fine-tuning MARBERT for Tunisian dialect: question about the tokenizer
I want to fine-tune MARBERT for Tunisian dialect text classification, using this dataset: https://www.kaggle.com/datasets/waalbannyantudre/tunisian-arabizi-dialect-data-sentiment-analysis
I have tested the tokenizer:
text example: "bravo slim riahii…
0
votes
1 answer
What are the types of tokenizers?
There are several, like byte-pair encoding (BPE), WordPiece, SentencePiece, etc. What's the difference between them?
user88306
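The differences show up directly in the pieces each algorithm produces. A minimal sketch (assuming transformers and these three public checkpoints): gpt2 uses byte-level BPE, bert-base-uncased uses WordPiece, and xlnet-base-cased uses a SentencePiece (unigram) model.

    # Minimal sketch (assumes transformers): compare the output of three
    # tokenizer families on the same text.
    from transformers import AutoTokenizer

    for name in ["gpt2", "bert-base-uncased", "xlnet-base-cased"]:
        tok = AutoTokenizer.from_pretrained(name)
        print(name, tok.tokenize("tokenization algorithms differ"))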
0
votes
0 answers
How to 'induce' or 'teach' a pretrained model to a continuous Transformer token-pruning algorithm
I am currently looking for ways to improve Transformer performance in image processing, especially in image segmentation. I found this paper by Kong, Z., et al. called "SPViT: Enabling Faster Vision Transformers via Latency-aware Soft Token Pruning"…
RedSean
- 1
- 1
0
votes
1 answer
How to convert a positionally encoded predicted embedding from a decoder to its matching token?
Is it valid to just subtract the positional encoding from a predicted output if the decoder was also positionally encoded? Or does masking take care of this problem, so that the decoder should only learn the embedding (not positionally encoded)?
Yanall Boutros
- 3
- 1
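A toy sketch related to the question above (PyTorch assumed; all sizes are hypothetical): a common scheme maps a decoder output vector back to a token by comparing it against the plain token embedding matrix (no positional encoding), rather than subtracting positions.

    # Toy sketch (PyTorch; hypothetical sizes): nearest-token lookup by dot
    # product against the non-positional embedding matrix.
    import torch
    import torch.nn as nn

    vocab_size, d_model = 100, 16
    emb = nn.Embedding(vocab_size, d_model)

    decoder_out = torch.randn(d_model)        # stand-in for a decoder output vector
    logits = decoder_out @ emb.weight.T       # similarity to every token embedding
    predicted_token_id = logits.argmax().item()
    print(predicted_token_id)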
0
votes
0 answers
How can I change the tokens BERT uses so that each digit is a separate token?
Rather than have the tokenizer generate this sort of thing:
"$1009 Dollars" => ["$", "100#", "9", "Dollars"]
I'd like to have:
"$1009 Dollars" => ["$", "1#", "0#", "0#", "9", "Dollars"]
Is that possible? I know I'd need to add all the n# tokens,…
slim
- 101
- 2
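One workaround for the question above that avoids touching the vocabulary is to pre-split digits before tokenization. A minimal sketch (assuming transformers and the bert-base-uncased checkpoint, which the question does not name):

    # Minimal sketch (assumes transformers + bert-base-uncased): insert spaces
    # between consecutive digits so each digit is pre-tokenized on its own.
    import re
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")

    def split_digits(text: str) -> str:
        # "1009" -> "1 0 0 9"
        return re.sub(r"(?<=\d)(?=\d)", " ", text)

    print(tok.tokenize(split_digits("$1009 Dollars")))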
0
votes
0 answers
How to add/embed categorical features in a transformer network?
I would like to give more context to my transformers by adding some metadata related to each token. This metadata is mostly categorical (3 fields, with 3 possible values for each field).
In addition to the positional embedding (same shape as the tokens,…
JulienG
- 1
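A toy sketch of one way to do what the question above describes (PyTorch assumed; sizes are hypothetical): one embedding table per categorical field, summed into the token embeddings alongside the positional embedding.

    # Toy sketch (PyTorch; hypothetical sizes): sum token embeddings with one
    # embedding table per categorical metadata field.
    import torch
    import torch.nn as nn

    vocab_size, d_model, n_fields, n_values = 1000, 64, 3, 3
    tok_emb = nn.Embedding(vocab_size, d_model)
    field_embs = nn.ModuleList(nn.Embedding(n_values, d_model) for _ in range(n_fields))

    token_ids = torch.randint(0, vocab_size, (8, 16))        # (batch, seq)
    field_ids = torch.randint(0, n_values, (n_fields, 8, 16))

    x = tok_emb(token_ids)
    for i, emb in enumerate(field_embs):
        x = x + emb(field_ids[i])                            # add metadata embeddings
    # x would then get positional embeddings added and feed the transformer as usual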
0
votes
1 answer
Why do tokenization algorithms replace the space with an underscore "_"?
I recently read an article on Byte-Pair Encoding (BPE), in which the space is replaced with an underscore (_), or, we could say, an underscore is added to each word/token.
I did not understand the intuition for adding an underscore; we could just…
Aamod Thakur
- 45
- 3
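A minimal sketch illustrating the marker in the question above (assuming transformers and the t5-small checkpoint, which uses SentencePiece): the '▁' symbol stands in for the space, so joining the pieces and swapping '▁' back to ' ' recovers the original string.

    # Minimal sketch (assumes transformers + t5-small): the word-boundary marker
    # makes tokenization reversible.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("t5-small")
    pieces = tok.tokenize("New York is big")
    print(pieces)
    print("".join(pieces).replace("\u2581", " ").strip())  # round-trips to the input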