Questions tagged [encoder-decoder]

36 questions
19
votes
4 answers

What exactly is a hidden state in an LSTM and RNN?

I'm working on a project where we use an encoder-decoder architecture. We decided to use an LSTM for both the encoder and decoder due to its hidden states. In my specific case, the hidden state of the encoder is passed to the decoder, and this…
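
A minimal, illustrative sketch of the setup the question describes (not taken from the question; assumes PyTorch), showing the encoder's final hidden and cell states being handed to the decoder:

    import torch
    import torch.nn as nn

    enc = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
    dec = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)

    src = torch.randn(8, 10, 32)         # (batch, src_len, features)
    tgt = torch.randn(8, 7, 32)          # (batch, tgt_len, features)

    _, (h_n, c_n) = enc(src)             # h_n, c_n: (1, batch, 64) -- the encoder's "hidden state"
    dec_out, _ = dec(tgt, (h_n, c_n))    # the decoder starts from the encoder's state
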
6
votes
2 answers

What are the differences between seq2seq and encoder-decoder architectures?

I've read many tutorials online that use both words interchangeably. When I search and find that they are the same, why not just use one word since they have the same definition?
user78615
6
votes
1 answer

In which situations is it helpful to use the encoder, the decoder, or both in a transformer model?

I have some questions about using (encoder / decoder / encoder-decoder) transformer models, including (language) transformers and Vision Transformers. The overall form of a transformer consists of an encoder and a decoder. Depending on the model, you…
Yang
4
votes
1 answer

Why do we need both encoder and decoder in sequence to sequence prediction?

Why do we need both encoder and decoder in sequence to sequence prediction? We could just have a single RNN that, given input $x$, outputs some value $y(t)$ and hidden state $h(t)$. Next, given $h(t)$ and $y(t)$, the next output $y(t+1)$ and hidden…
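
A rough sketch (hypothetical, PyTorch) of the single-RNN alternative the question describes, where each output $y(t)$ is fed back as the next input:

    import torch
    import torch.nn as nn

    cell = nn.RNNCell(input_size=16, hidden_size=16)
    out_proj = nn.Linear(16, 16)

    x = torch.randn(1, 16)        # initial input x
    h = torch.zeros(1, 16)        # initial hidden state
    outputs = []
    for t in range(5):            # unroll a few steps
        h = cell(x, h)            # h(t)
        y = out_proj(h)           # y(t)
        outputs.append(y)
        x = y                     # feed y(t) back in as the next input
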
3
votes
0 answers

What is the input (and its shape) to the K/V/Q of the self-attention in EACH decoder block of a language-translation Transformer during inference?

The Transformer model from the original Attention paper has a decoder that works differently during inference than during training. I'm trying to understand the shapes used in the decoder (both the self-attention and enc-dec-attention blocks), but it's very…
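
For orientation, a hedged shape walk-through (assuming d_model = 512, a source length of 10, a KV cache, and t tokens generated so far; exact conventions vary by implementation):

    # Illustrative shapes at inference step t
    t = 4
    q_self   = (1, 512)    # self-attention query: projection of the newest target token only
    kv_self  = (t, 512)    # cached keys/values for the t tokens generated so far
    q_cross  = (1, 512)    # cross-attention query, from the decoder self-attention output
    kv_cross = (10, 512)   # keys/values projected from the encoder output, fixed for the whole decode
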
2
votes
1 answer

Difference between encoder-decoder and decoder-only transformers w.r.t. loss

What is the difference between an encoder-decoder transformer and a decoder-only transformer with regard to the loss calculation? Specifically, how does the loss signal differ? And how does this relate to token efficiency? As far as I understand, an…
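
A hedged sketch of the usual distinction (assuming teacher forcing and cross-entropy; conventions vary): an encoder-decoder model computes the loss only over decoder/target positions, while a decoder-only model sees source and target concatenated and can mask the source positions out of the loss:

    import torch
    import torch.nn.functional as F

    # Toy shapes: logits (seq_len, vocab), labels (seq_len,)
    logits = torch.randn(12, 1000)
    labels = torch.randint(0, 1000, (12,))

    # Decoder-only with a concatenated [source ; target] sequence:
    # ignore the 5 source positions by setting their labels to -100.
    labels_masked = labels.clone()
    labels_masked[:5] = -100
    loss = F.cross_entropy(logits, labels_masked, ignore_index=-100)
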
2
votes
1 answer

Why can decoder-only transformers be so good at machine translation?

In my understanding, encoder-decoder transformers for translation are trained with sentence or text pairs. How can it be explained in simple (high-level) terms that decoder-only transformers (e.g. GPT) are so good at machine translation, even though…
2
votes
0 answers

Combining GANs and NLP for AI-Based Programming: Generating Input-Output Templates for Computer Functions

I would like to combine GANs and NLP to create a system that can take an input and generate an appropriate output. For example, given the input 9 to the power of 2, the system would output pow(9,2). I am not entirely sure how to research this, but I…
1
vote
1 answer

Should encoder only or decoder only transformers be used for name origin identification?

For example, identifying which mythology a god is from by providing its name.
user1678860
1
vote
1 answer

Why does the DETR model for object detection have 17.8M parameters in transformer blocks?

The DETR paper specifies that there are 17.8M parameters in transformer blocks, given a hidden dimension of 256, 6 encoder blocks, and 6 decoder blocks. But if I use the approximate formula 28 x N x d^2 as derived in this post for an encoder-decoder…
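
Just to make the gap the question asks about explicit, the arithmetic for the approximate formula it cites (the formula itself is the question's assumption):

    N, d = 6, 256
    approx = 28 * N * d ** 2      # formula from the linked post
    print(approx)                 # 11010048, i.e. about 11.0M vs. the 17.8M reported for DETR
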
1
vote
1 answer

What makes a transformer encoder different from its decoder?

Encoders are commonly used for understanding sequences (like classification, NER, etc.), while decoders are used for generative tasks in an autoregressive manner (e.g., GPT-style text generation). Decoders are autoregressive because they generate…
1
vote
1 answer

Is an autoencoder model encoder-only or encoder-decoder?

I'm writing up the different model architectures used in NLP, namely encoder-only, decoder-only, and encoder-decoder models, and have come across what seems to be a naming inconsistency. For decoder-only models it seems that they can be referred to as…
1
vote
0 answers

What’s more efficient in multihead attention: multiply QKV by $W_i$ then split or linearly project QKV $h$ times into dimensions $d_k$?

I’m looking to bridge two implementations of multihead attention. Approach 1: Multiply and Split. Each of the queries, keys, and values is multiplied by a separate square weight matrix of size (embedding size, embedding size). The transformed…
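
A small illustrative sketch (shapes only, assuming PyTorch) of why the two formulations carry the same parameters: one (d_model, d_model) matrix split into h heads is equivalent to h separate (d_model, d_k) projections stacked side by side:

    import torch

    d_model, h = 512, 8
    d_k = d_model // h                      # 64
    x = torch.randn(3, d_model)             # 3 tokens

    # Approach 1: one big projection, then split into heads
    W_q = torch.randn(d_model, d_model)
    q_split = (x @ W_q).reshape(3, h, d_k)

    # Approach 2: h separate (d_model, d_k) projections
    W_q_heads = W_q.reshape(d_model, h, d_k)
    q_heads = torch.stack([x @ W_q_heads[:, i, :] for i in range(h)], dim=1)

    print(torch.allclose(q_split, q_heads))  # True
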
1
vote
0 answers

How to Interpret Cross Attention

I am a bit confused about what cross attention mechanisms are doing. I understand that the currently decoded output is usually the query and the conditioning/input (from an encoder) is the key and value. The query is multiplied by the key to make an…
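
A minimal cross-attention sketch (illustrative only; single head, projections omitted) matching the description in the question: the decoder state supplies the query, the encoder output supplies the keys and values:

    import torch
    import torch.nn.functional as F

    d = 64
    dec_state = torch.randn(1, d)        # query: current decoder position
    enc_out = torch.randn(10, d)         # keys/values: encoder outputs

    scores = dec_state @ enc_out.T / d ** 0.5     # (1, 10) attention scores
    weights = F.softmax(scores, dim=-1)           # how much to attend to each source token
    context = weights @ enc_out                   # (1, d) weighted sum of encoder outputs
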
1
vote
1 answer

Why do Transformer decoders use masked self attention when producing new tokens?

I've been reading that transformer decoders use masked self attention so that the decoder can't cheat by looking ahead. For example, when predicting the 6th token in the sequence we shouldn't have access to the 7th token. However, why can't the…
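
For reference, a small sketch of the causal mask applied during training (assumed PyTorch), which is what prevents, say, position 6 from attending to position 7:

    import torch

    T = 8
    scores = torch.randn(T, T)                         # raw attention scores
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float('-inf'))   # block attention to future positions
    weights = torch.softmax(scores, dim=-1)
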