
After going through both the "Illustrated Transformer" and "Annotated Transformer" blog posts, I still don't understand how the sinusoidal encodings represent the position of elements in the input sequence.
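
For reference, the encoding defined in "Attention Is All You Need" (the one both posts walk through) assigns to each position $pos$ a vector whose components alternate between sines and cosines of geometrically spaced frequencies:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$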

Is the idea that each row of the encoding matrix (one row per token position in the input sequence) gets a unique waveform, and that for any fixed offset $k$ the encoding of position $pos + k$ can be expressed as a linear function of the encoding of position $pos$, so the transformer can learn relative positions between rows via linear transformations? A small numeric check of this property is sketched below.
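
To make the question concrete, here is a minimal NumPy sketch (the function names are mine) verifying numerically that a single matrix $M_k$, built from $2 \times 2$ rotations and independent of $pos$, maps $PE(pos)$ to $PE(pos+k)$:

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal encodings, one row per position (d_model assumed even)."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model/2)
    angles = pos / (10000 ** (i / d_model))      # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

d_model, k = 8, 3
pe = positional_encoding(50, d_model)

# Build M_k explicitly: block-diagonal, one 2x2 rotation per frequency.
# Using sin(a + b) = sin(a)cos(b) + cos(a)sin(b) and
#       cos(a + b) = cos(a)cos(b) - sin(a)sin(b),
# rotating the (sin, cos) pair by k * omega shifts the position by k.
M = np.zeros((d_model, d_model))
for i in range(0, d_model, 2):
    omega = 1.0 / 10000 ** (i / d_model)
    c, s = np.cos(k * omega), np.sin(k * omega)
    M[i:i + 2, i:i + 2] = [[c, s], [-s, c]]

# The same M_k works at every position.
for pos in range(len(pe) - k):
    assert np.allclose(pe[pos + k], M @ pe[pos])
print(f"PE(pos + {k}) == M_k @ PE(pos) for every pos")
```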
