
Rather than have the tokenizer generate this sort of thing:

"$1009 Dollars" => ["$", "100#", "9", "Dollars"]

I'd like to have:

"$1009 Dollars" => ["$", "1#", "0#", "0#", "9", "Dollars"]

Is that possible? I know I'd need to add all the n# tokens, but I'd need to remove a lot of tokens as well. I'm guessing I'd need to build my transformer from scratch?
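To make the goal concrete, here's a plain-Python sketch of the per-digit split I'm after, applied to the token list the current tokenizer produces (split_digits is just a hypothetical helper I wrote for illustration, not part of transformers):

```python
def split_digits(tokens):
    """Expand each multi-digit token into single-digit pieces.

    The first digit keeps its current prefix (none, or BERT's '##'
    continuation marker); every following digit gets '##'.
    """
    out = []
    for tok in tokens:
        # Separate an existing '##' continuation prefix from the token body.
        prefix, body = ("##", tok[2:]) if tok.startswith("##") else ("", tok)
        if body.isdigit() and len(body) > 1:
            out.append(prefix + body[0])
            out.extend("##" + d for d in body[1:])
        else:
            out.append(tok)
    return out

print(split_digits(["$", "100", "##9", "dollars"]))
# ['$', '1', '##0', '##0', '##9', 'dollars']
```

This only post-processes the tokenizer's output, though; the question remains whether the tokenizer itself can be made to emit tokens this way.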

For reference, here's some example code that tokenizes this string:

from transformers import AutoTokenizer, BertModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")

inputs = tokenizer("$1009 Dollars", return_tensors="pt")
ids = inputs["input_ids"]
for token_id in ids:
    token = tokenizer.convert_ids_to_tokens(token_id)
    print(token)

Here's the result I get:

['[CLS]', '$', '100', '##9', 'dollars', '[SEP]']

How can I control what tokens go into my model?
