
As per "Attention Is All You Need" etc., the positional encoding is added to the embedded word vector at the input. My knee-jerk reaction is that this would muddle the "signal" of the word vector. Since the word vector itself is not preserved after the addition, this additive "noise" could, for instance, make one word look like a different word in a different position: $w_a + p_a = x = w_b + p_b$.
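
For concreteness, here is a minimal NumPy sketch of the additive scheme I mean (the sinusoidal encoding from the paper; the embedding values are random placeholders, not real word vectors):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal encoding from 'Attention Is All You Need':
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    """
    positions = np.arange(max_len)[:, None]                            # (max_len, 1)
    div_terms = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)
    pe[:, 1::2] = np.cos(positions / div_terms)
    return pe

seq_len, d_model = 10, 512
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(seq_len, d_model))   # placeholder embeddings

# The model input is simply the element-wise sum w + p, so nothing in x
# itself separates the word contribution from the position contribution.
x = word_vectors + sinusoidal_positional_encoding(seq_len, d_model)
```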

Is efficiency (keeping the input dimension smaller) the main reason to add the positional information rather than concatenate it, or is addition theoretically sound?
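
The concatenation alternative I have in mind looks like this (reusing the `sinusoidal_positional_encoding` helper from the sketch above; the dimension sizes are arbitrary, and this is not what the paper does):

```python
import numpy as np

seq_len, d_model, d_pos = 10, 512, 64
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(seq_len, d_model))    # placeholder embeddings
pos = sinusoidal_positional_encoding(seq_len, d_pos)  # helper defined above

# Concatenation keeps the word and position signals in disjoint dimensions,
# but the input width grows from d_model to d_model + d_pos, so every
# downstream projection (Q/K/V, feed-forward) becomes larger.
x_concat = np.concatenate([word_vectors, pos], axis=-1)  # shape (10, 576)
```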

Edit: I think a source of confusion in all of this is whether the model adopts a pre-trained word embedding, like Word2Vec, or trains its embedding from scratch. Adding positional encoding to pre-trained embeddings might throw off the attention dot product?
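
What I mean, ignoring the learned query/key projections for simplicity: the raw dot product between two position-encoded inputs expands as

$$(w_a + p_a)\cdot(w_b + p_b) = w_a\cdot w_b + w_a\cdot p_b + p_a\cdot w_b + p_a\cdot p_b,$$

so the score is no longer just the word–word similarity $w_a\cdot w_b$ that a pre-trained space like Word2Vec was trained to encode; the cross terms mix the positional vectors into it.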

SuaveSouris

0 Answers