
Is it valid to just subtract the positional encoding from a predicted output if the decoder input was also positionally encoded? Or does masking take care of this problem, so that the decoder should only learn the embedding (not the positionally encoded version)?

Kulin Patel

1 Answer


There is no need to remove positional encodings from the predicted output. In most Transformer architectures, positional encodings are added to the token embeddings to give the model information about sequence order, since the Transformer encoder/decoder blocks have no inherent sequence awareness. Once the positional encodings are added to the token embeddings, they become part of the representation the model learns to work with, and the decoder uses that representation to generate the output.
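For concreteness, here is a minimal PyTorch sketch of that addition step using the sinusoidal scheme from the original Transformer paper; the vocabulary size, model dimension, and sequence length are made-up values for illustration only. The point is that the decoder layers see the sum, and nothing later subtracts the positional part back out.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                   # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# Hypothetical sizes, chosen only for this example.
vocab_size, d_model, seq_len = 1000, 512, 16
embedding = torch.nn.Embedding(vocab_size, d_model)
token_ids = torch.randint(0, vocab_size, (seq_len,))

# The decoder layers operate on this sum; the positional part is never removed.
decoder_input = embedding(token_ids) + sinusoidal_positional_encoding(seq_len, d_model)
```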

Both the encoder and decoder learn to interpret and use this positional information as part of the sequence processing. This integrated learning process allows the model to capture both the meaning of the tokens and their positional relationships within the sequence.

Subtracting positional encodings after the model has made predictions is not recommended. Masking controls which positions each token is allowed to attend to; it doesn't "undo" the effects of positional encoding.
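To illustrate the distinction, here is a sketch of a standard additive causal (look-ahead) mask in PyTorch, assuming the common convention where the mask is added to the attention scores before the softmax. It only blocks attention to future positions; it never touches the positional encodings themselves.

```python
import torch

seq_len = 5  # illustrative length
# Strictly-upper-triangular -inf entries block attention to future positions;
# positions on or below the diagonal stay at 0 and remain attendable.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
print(causal_mask)  # row i can attend to positions 0..i only
```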

There's no need to "take care of this problem": positional encodings are, in fact, essential to the model's ability to generate outputs.

To convert a predicted embedding from the decoder into its matching token, you typically pass it through a linear layer followed by a softmax. The linear layer maps the embedding back to vocabulary space, and the softmax gives a probability distribution over all possible tokens. The token with the highest probability is then selected as the predicted token.
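As a sketch of that final step (again with made-up sizes, and using greedy argmax selection for simplicity):

```python
import torch

d_model, vocab_size = 512, 1000           # illustrative values only
decoder_output = torch.randn(d_model)     # stand-in for one position's decoder output

to_vocab = torch.nn.Linear(d_model, vocab_size)   # output projection to vocabulary space
logits = to_vocab(decoder_output)                 # (vocab_size,) unnormalized scores
probs = torch.softmax(logits, dim=-1)             # probability distribution over tokens
predicted_token_id = torch.argmax(probs)          # greedy pick; sampling is also common
```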

Kulin Patel