I want to train a transformer language model on a text dataset. The dataset contains different types of text, e.g. webpage text, social media posts, e-commerce site posts, research articles, stories, and so on. In addition, the timestamp of when each text was generated is also available. All of these texts contain many paragraphs.
The model's task will be to complete a paragraph given its starting text. The expected input size will be about 10% of the final generated paragraph. My concern is about tokenization and metadata inclusion.
For each paragraph, I have two additional pieces of data that I am calling metadata: (1) source (e.g. research article, social media post, etc.) and (2) timestamp. Training my model with these two pieces of metadata should be significantly beneficial to the model's accuracy.
My question is: how can I include these two metadata fields in the model? I have thought of a couple of ways:
(1) I will define token IDs for each source and for the date values corresponding to the timestamp, and prepend them to the paragraph. Something like this:
<paragraph>
<source>sourceToken</source>
<day>dayToken</day>
<month>monthToken</month>
<year>yearToken</year>
<hour>hourToken</hour>
<minute>minuteToken</minute>
<second>secondToken</second>
<content>
textToken0, textToken1, textToken2, ...
</content>
</paragraph>
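
To make option (1) concrete, here is a minimal Python sketch of how I imagine building a training sequence. The ID ranges and helper names are hypothetical choices for illustration, not any real tokenizer API; hour, minute, and second tokens would follow the same pattern as the day tokens.

SOURCES = ["webpage", "social_media_post", "ecommerce_post", "research_article", "story"]

# Hypothetical regions of the token-ID space reserved for metadata tokens.
SOURCE_BASE = 60000              # one ID per source type
DAY_BASE    = 60010              # days 1..31   -> IDs 60011..60041
MONTH_BASE  = 60042              # months 1..12 -> IDs 60043..60054
YEAR_BASE   = 60055              # years offset from 1990

def metadata_tokens(source: str, day: int, month: int, year: int) -> list[int]:
    return [
        SOURCE_BASE + SOURCES.index(source),
        DAY_BASE + day,
        MONTH_BASE + month,
        YEAR_BASE + (year - 1990),
    ]

def build_example(source, day, month, year, text_token_ids):
    # Metadata tokens are prepended, so the model always sees them first.
    return metadata_tokens(source, day, month, year) + text_token_ids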
(2) I can start and end each paragraph with its corresponding source token and then put the long-integer timestamp value inside the source element. For example, for a social media post it could look like this:
<socialMediaPost>
<timestamp>longIntValue</timestamp>
<paragraph>
textToken0, textToken1, textToken2, ...
</paragraph>
</socialMediaPost>
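
A corresponding sketch for option (2), again with hypothetical token IDs standing in for the <socialMediaPost> start and end tokens:

SOCIAL_MEDIA_OPEN, SOCIAL_MEDIA_CLOSE = 61000, 61001   # hypothetical start/end IDs

def build_example(timestamp_token_ids, text_token_ids):
    # Source token at the start and end, timestamp tokens right after the opening one.
    return [SOCIAL_MEDIA_OPEN] + timestamp_token_ids + text_token_ids + [SOCIAL_MEDIA_CLOSE]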
For (2), the longIntValue for the timestamp will eventually become 4 separate uint16 values, as I am planning to define the entire token space with 16-bit integers.
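Here is a sketch of the split I have in mind (the function name is my own):

def timestamp_to_uint16_tokens(ts: int) -> list[int]:
    # Split a 64-bit integer into four 16-bit chunks, most significant first.
    return [(ts >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]

# Example: timestamp_to_uint16_tokens(1700000000) -> [0, 0, 25939, 61696]

One thing I notice is that the four chunks can take any value in 0..65535, so they would overlap with ordinary token IDs; I assume their fixed position right after the opening source token is what would let the model tell them apart.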
Now my question is: which of the two makes more sense? I would appreciate references to articles or conference proceedings that could help deepen my knowledge in this area.