Rather than have the tokenizer generate this sort of thing:
"$1009 Dollars" => ["$", "100#", "9", "Dollars"]
I'd like to have:
"$1009 Dollars" => ["$", "1#", "0#", "0#", "9", "Dollars"]
Is that possible? I know I'd need to add all the single-digit ##n tokens to the vocabulary, but I'd also need to remove a lot of the existing multi-digit tokens. I'm guessing I'd need to build my transformer from scratch?
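To make the goal concrete, one workaround I've considered is pre-splitting runs of digits myself before calling the tokenizer (a rough sketch with a regex I wrote for this question, not anything built into the library):

import re

text = "$1009 Dollars"
# insert a space between every pair of adjacent digits
spaced = re.sub(r"(?<=\d)(?=\d)", " ", text)
print(spaced)  # $1 0 0 9 Dollars

But that makes every digit a standalone word, so the model loses the signal that the digits belong to one number, which is exactly what the ##-style continuation tokens are supposed to carry.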
For reference, here's some example code that tokenizes this string:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
inputs = tokenizer("$1009 Dollars", return_tensors="pt")
ids = inputs["input_ids"][0]  # first (and only) sequence in the batch
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens)
Here's the result I get:
['[CLS]', '$', '100', '##9', 'dollars', '[SEP]']
How can I control what tokens go into my model?
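For what it's worth, I also found the Digits pre-tokenizer in the tokenizers library and tried composing it with the fast tokenizer's existing backend pre-tokenizer. As far as I understand it, this splits digit runs into separate words (so, again, no ## prefixes), which is close to but not quite what I'm after:

from transformers import AutoTokenizer
from tokenizers import pre_tokenizers

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
# run the normal BERT pre-tokenizer first, then split digit runs
tokenizer.backend_tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    tokenizer.backend_tokenizer.pre_tokenizer,
    pre_tokenizers.Digits(individual_digits=True),
])

inputs = tokenizer("$1009 Dollars", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
# I'd expect: ['[CLS]', '$', '1', '0', '0', '9', 'dollars', '[SEP]']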