Rather than have the tokenizer generate this sort of thing:
"$1009 Dollars" => ["$", "100#", "9", "Dollars"]
I'd like to have:
"$1009 Dollars" => ["$", "1#", "0#", "0#", "9", "Dollars"]
Is that possible? I know I'd need to add all the single-digit ##n tokens to the vocabulary, but I'd also need to remove a lot of the existing multi-digit tokens. I'm guessing I'd need to build my transformer from scratch?
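To make the goal concrete, one workaround I've considered is pre-splitting runs of digits myself before calling the tokenizer (a rough sketch with a regex I wrote for this question, not anything built into the library):

import re

text = "$1009 Dollars"
# insert a space between every pair of adjacent digits
spaced = re.sub(r"(?<=\d)(?=\d)", " ", text)
print(spaced)  # $1 0 0 9 Dollars

But that makes every digit a standalone word, so the model loses the signal that the digits belong to one number, which is exactly what the ##-style continuation tokens are supposed to carry.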
For reference, here's some example code that tokenizes this string:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
inputs = tokenizer("$1009 Dollars", return_tensors="pt")
ids = inputs["input_ids"][0]  # first (and only) sequence in the batch
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens)
Here's the result I get:
['[CLS]', '$', '100', '##9', 'dollars', '[SEP]']
How can I control what tokens go into my model?
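For what it's worth, I also found the Digits pre-tokenizer in the tokenizers library and tried composing it with the fast tokenizer's existing backend pre-tokenizer. As far as I understand it, this splits digit runs into separate words (so, again, no ## prefixes), which is close to but not quite what I'm after:

from transformers import AutoTokenizer
from tokenizers import pre_tokenizers

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
# run the normal BERT pre-tokenizer first, then split digit runs
tokenizer.backend_tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    tokenizer.backend_tokenizer.pre_tokenizer,
    pre_tokenizers.Digits(individual_digits=True),
])

inputs = tokenizer("$1009 Dollars", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
# I'd expect: ['[CLS]', '$', '1', '0', '0', '9', 'dollars', '[SEP]']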