For questions about the concept of attention in artificial intelligence and machine learning. Attention-like mechanisms have been used successfully in natural language processing and computer vision tasks, such as machine translation. For a review of attention-based mechanisms used in NLP, take a look at "Attention in Natural Language Processing" by Andrea Galassi et al.
Questions tagged [attention]
180 questions
34 votes, 1 answer
How does the (decoder-only) transformer architecture work?
How does the (decoder-only) transformer architecture, used in impressive models such as GPT-4, work?
Robin van Hoorn (2,780)
27 votes, 1 answer
What exactly are the "parameters" in GPT-3's 175 billion parameters and how are they chosen/generated?
When I studied neural networks, the parameters were things like the learning rate and batch size. But even GPT-3's arXiv paper does not explain what exactly the parameters are; it only gives a small hint that they might just be sentences.
Even tutorial…
Nav (491)
23 votes, 2 answers
Why does GPT-2 Exclude the Transformer Encoder?
After looking into transformers, BERT, and GPT-2, from what I understand, GPT-2 essentially uses only the decoder part of the original transformer architecture and uses masked self-attention that can only look at prior tokens.
Why does GPT-2 not…
Athena Wisdom (381)
22 votes, 1 answer
What is the intuition behind the dot product attention?
I am watching the video Attention Is All You Need by Yannic Kilcher.
My question is: what is the intuition behind the dot product attention?
$$A(q, K, V) = \sum_i \frac{e^{q \cdot k_i}}{\sum_j e^{q \cdot k_j}} v_i$$
becomes:
$$A(Q,K, V) = \text{softmax}(QK^T)V$$
DRV (1,843)
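The formula in the question above is easy to check numerically. Below is a minimal NumPy sketch of the unscaled version, $\text{softmax}(QK^T)V$; the function name, shapes, and toy data are illustrative assumptions, not anything taken from the paper.

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Unscaled dot-product attention: softmax(Q K^T) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Each query scores every key with a dot product; the softmax turns the
    scores into weights that mix the value vectors.
    """
    scores = Q @ K.T                                 # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # (n_queries, d_v)

# Toy example: 2 queries against 3 key/value pairs.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(dot_product_attention(Q, K, V).shape)  # (2, 4)
```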
21 votes, 3 answers
What kind of word embedding is used in the original transformer?
I am currently trying to understand transformers.
To start, I read Attention Is All You Need and also this tutorial.
What makes me wonder is the word embedding used in the model. Is word2vec or GloVe being used? Are the word embeddings trained from…
Bert Gayus (645)
18 votes, 3 answers
Why are embeddings added, not concatenated?
Let's consider the following example from BERT
I cannot understand why "the input embeddings are the sum of the token embeddings, the segmentation embeddings, and the position embeddings". The thing is, these embeddings carry different types of…
nalzok (401)
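A small sketch of what "summed" means in practice may help; the sizes and lookup tables below are illustrative stand-ins, not BERT's actual configuration.

```python
import numpy as np

# Illustrative sizes (not BERT's real values).
vocab_size, n_segments, max_len, d_model = 100, 2, 16, 8
rng = np.random.default_rng(0)

# Three separate learned lookup tables.
token_emb    = rng.normal(size=(vocab_size, d_model))
segment_emb  = rng.normal(size=(n_segments, d_model))
position_emb = rng.normal(size=(max_len, d_model))

token_ids   = np.array([5, 17, 42, 7])   # a toy input sequence
segment_ids = np.array([0, 0, 1, 1])     # sentence A vs. sentence B
positions   = np.arange(len(token_ids))

# The input representation is the element-wise sum, so all three tables
# must share d_model; concatenation would instead grow the hidden size.
x = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(x.shape)  # (4, 8)
```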
14 votes, 1 answer
Why does a transformer not use an activation function following the multi-head attention layer?
I was hoping someone could explain to me why, in the transformer model from the "Attention Is All You Need" paper, there is no activation applied after the multi-head attention layer or after the residual connections. It seems to me that there are…
chasep255 (193)
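For reference, in the original architecture the attention sub-layer is followed only by a residual addition and layer normalization; the sole nonlinearity sits inside the position-wise feed-forward sub-layer. Below is a rough NumPy sketch of that layer layout; the shapes, initialization, and stand-in attention function are assumptions for illustration only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(x, attention, W1, b1, W2, b2):
    # Sub-layer 1: multi-head attention, then residual + LayerNorm.
    # Note: no activation function is applied here.
    x = layer_norm(x + attention(x))
    # Sub-layer 2: position-wise feed-forward network; this is where the
    # ReLU nonlinearity lives in the original architecture.
    ffn = np.maximum(0, x @ W1 + b1) @ W2 + b2
    return layer_norm(x + ffn)

# Toy usage with a stand-in "attention" (identity) just to exercise shapes.
d_model, d_ff, n = 8, 32, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(encoder_layer(x, lambda t: t, W1, b1, W2, b2).shape)  # (4, 8)
```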
13 votes, 2 answers
What is different in each head of a multi-head attention mechanism?
I have a difficult time understanding the "multi-head" notion in the original transformer paper. What makes the learning in each head unique? Why doesn't the neural network learn the same set of parameters for each attention head? Is it because we…
mhsnk (253)
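One way to see why the heads do not collapse into copies of each other: each head has its own projection matrices with its own random initialization, so they start, and stay, in different subspaces. A minimal sketch with illustrative sizes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, n_heads = 16, 4
d_head = d_model // n_heads
rng = np.random.default_rng(0)

# One independently initialized (and independently trained) projection per
# head: the differing starting points break the symmetry, so gradient
# descent has no reason to drive the heads toward identical weights.
W_Q = rng.normal(size=(n_heads, d_model, d_head))
W_K = rng.normal(size=(n_heads, d_model, d_head))
W_V = rng.normal(size=(n_heads, d_model, d_head))
W_O = rng.normal(size=(n_heads * d_head, d_model))

x = rng.normal(size=(6, d_model))  # a toy sequence of 6 token vectors
heads = []
for h in range(n_heads):
    Q, K, V = x @ W_Q[h], x @ W_K[h], x @ W_V[h]
    heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)

out = np.concatenate(heads, axis=-1) @ W_O  # (6, d_model)
print(out.shape)
```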
13 votes, 1 answer
What is the intuition behind the attention mechanism?
Attention is one of the most influential ideas in deep learning. The main idea behind the attention technique is that it allows the decoder to "look back" at the complete input and extract significant information that is useful in decoding.
I am…
Pluviophile (1,293)
13 votes, 2 answers
Is there any artificial intelligence that possesses "concentration"?
Humans can do multiple tasks at the same time (e.g. reading while listening to music), but we memorize information from less focused sources with worse efficiency than we do from our main focus or task.
Do such things exist in the case of artificial…
Zoltán Schmidt (643)
12 votes, 1 answer
Is there a proper initialization technique for the weight matrices in multi-head attention?
Self-attention layers have 4 learnable tensors (in the vanilla formulation):
Query matrix $W_Q$
Key matrix $W_K$
Value matrix $W_V$
Output matrix $W_O$
Nice illustration from https://jalammar.github.io/illustrated-transformer/
However, I do not…
spiridon_the_sun_rotator (2,852)
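Whether there is a single "proper" scheme is exactly what the question asks; a common default in practice is to treat each projection as an ordinary linear layer and apply Xavier/Glorot initialization to it. A minimal sketch under that assumption (the dimensions are illustrative):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

d_model, d_head, n_heads = 512, 64, 8
rng = np.random.default_rng(0)

# Each of the four learnable tensors initialized as a plain linear layer.
W_Q = xavier_uniform(d_model, n_heads * d_head, rng)
W_K = xavier_uniform(d_model, n_heads * d_head, rng)
W_V = xavier_uniform(d_model, n_heads * d_head, rng)
W_O = xavier_uniform(n_heads * d_head, d_model, rng)
print(W_Q.shape, W_O.shape)  # (512, 512) (512, 512)
```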
12 votes, 1 answer
In Computer Vision, what is the difference between a transformer and attention?
Having studied computer vision for a while, I still cannot understand what the difference between a transformer and attention is.
novice (123)
12 votes, 3 answers
What is the purpose of Decoder mask (triangular mask) in Transformer?
I'm trying to implement the transformer model using this tutorial. In the decoder block of the Transformer model, a mask is passed to "pad and mask future tokens in the input received by the decoder". This mask is added to the attention weights.
import…
Uchiha Madara (173)
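A small sketch of such a mask may make its purpose concrete: future positions receive $-\infty$ before the softmax, so their attention weights become exactly zero. The helper names and sizes below are illustrative assumptions, not the tutorial's code.

```python
import numpy as np

def causal_mask(seq_len):
    """Triangular mask: position i may only attend to positions <= i."""
    allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Disallowed (future) positions get -inf, so softmax assigns them weight 0.
    return np.where(allowed, 0.0, -np.inf)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))            # raw attention scores for 4 tokens
weights = softmax(scores + causal_mask(4))  # mask is added before the softmax
print(np.round(weights, 2))                 # upper triangle is exactly 0
```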
12 votes, 2 answers
Why don't people use nonlinear activation functions after projecting the query key value in attention?
It seems like doing this would add much-needed nonlinearity; otherwise, we're just doing linear transformations.
This observation applies to…
user3180 (648)
12 votes, 3 answers
Why is dot product attention faster than additive attention?
In section 3.2.1 of Attention Is All You Need the claim is made that:
Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a…
user3180 (648)
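The practical difference is easy to see in code: dot-product scoring is a single matrix multiplication, whereas additive (Bahdanau-style) scoring runs a small feed-forward network over every query–key pair. A rough sketch with illustrative shapes and parameter names:

```python
import numpy as np

rng = np.random.default_rng(0)
n_q, n_k, d_k, d_a = 5, 7, 16, 16
Q, K = rng.normal(size=(n_q, d_k)), rng.normal(size=(n_k, d_k))

# Dot-product scoring: one matrix multiplication, which maps directly onto
# highly optimized BLAS/GPU kernels.
scores_dot = Q @ K.T / np.sqrt(d_k)  # (n_q, n_k)

# Additive scoring: score(q, k) = v^T tanh(W1 q + W2 k), evaluated for every
# query-key pair, which needs broadcasting, a tanh, and an extra projection.
W1, W2 = rng.normal(size=(d_k, d_a)), rng.normal(size=(d_k, d_a))
v = rng.normal(size=(d_a,))
scores_add = np.tanh((Q @ W1)[:, None, :] + (K @ W2)[None, :, :]) @ v  # (n_q, n_k)

print(scores_dot.shape, scores_add.shape)
```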