
I want to fine-tune MARBERT for Tunisian dialect text classification, using this dataset: https://www.kaggle.com/datasets/waalbannyantudre/tunisian-arabizi-dialect-data-sentiment-analysis
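For context, the setup I have in mind looks roughly like this (a minimal sketch; the UBC-NLP/MARBERT Hub checkpoint and num_labels=2 for binary sentiment are my assumptions):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# MARBERT checkpoint on the Hugging Face Hub (assumed ID)
model_name = "UBC-NLP/MARBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# num_labels=2 assumes the dataset's sentiment labels are binary
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Quick forward pass on one example to check shapes
batch = tokenizer(
    ["bravo slim riahii hay la3bed li tkhdem fi bladha"],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.shape)  # torch.Size([1, 2])
```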

I have tested the tokenizer:

text example: "bravo slim riahii hay la3bed li tkhdem fi bladha"

tokenizer: ['[CLS]', 'bra', '##vo', 'sl', '##im', 'r', '##iah', '##ii', 'hay',
'la', '##3', '##be', '##d', 'li', 't', '##kh', '##de', '##m', 'fi', 'bl',
'##adh', '##a', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]',
'[PAD]', '[PAD]', '[PAD]', '[PAD]']
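For reference, this token list can be reproduced with something like the following (assuming the tokenizer comes from the UBC-NLP/MARBERT checkpoint; padding to max_length=32 is what produces the trailing [PAD] tokens):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERT")

text = "bravo slim riahii hay la3bed li tkhdem fi bladha"

# Fixed-length padding reproduces the trailing [PAD] tokens above
enc = tokenizer(text, padding="max_length", max_length=32, truncation=True)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
```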

Can anyone tell me if I can use this tokenizer, or must I build my own?
