Questions tagged [decoder]
10 questions
3
votes
1 answer
Aren't context lengths for transformers an artificial restriction?
Let's focus on the case of decoder-only transformers, where I am using Algorithm 10 from "Formal Algorithms for Transformers" by Mary Phuong and Marcus Hutter as a reference.
[Image: Algorithm 10 from the paper (https://i.sstatic.net/ZWC9o.png)]
Previously I thought that the maximum…
Robert Wegner
- 133
- 5
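A minimal sketch of the point the question raises, assuming a PyTorch-style decoder block (not the paper's exact algorithm): the attention computation itself is length-agnostic, and the fixed maximum context length typically comes from the learned positional-embedding table.

```python
# Sketch: the attention math works for any sequence length, but the learned
# positional-embedding table `pos_emb` imposes the hard cap `max_len`.
import torch
import torch.nn as nn

class TinyDecoderBlockDemo(nn.Module):
    def __init__(self, vocab_size=100, d_model=32, max_len=16):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)   # the length cap lives here
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, ids):                              # ids: (batch, seq_len)
        seq_len = ids.size(1)
        pos = torch.arange(seq_len, device=ids.device)   # fails if seq_len > max_len
        x = self.tok_emb(ids) + self.pos_emb(pos)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=causal)    # length-agnostic operation
        return out

model = TinyDecoderBlockDemo(max_len=16)
print(model(torch.randint(0, 100, (1, 8))).shape)        # works: seq_len <= max_len
# model(torch.randint(0, 100, (1, 32)))                  # would raise: index out of range
```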
2
votes
1 answer
What does "use log probability to automatically increase the temperature until certain thresholds are hit" mean with OpenAI ASR with temperature=0
I read on https://platform.openai.com/docs/api-reference/audio/createTranscription#audio-createtranscription-temperature (mirror):
temperature. number. Optional. Defaults to 0. The sampling temperature, between 0 and 1. Higher values like 0.8 will…
Franck Dernoncourt
- 3,473
- 2
- 21
- 39
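A hedged sketch of how that documented fallback is commonly understood, based on the open-source openai/whisper implementation rather than OpenAI's server code: decoding starts at temperature 0 and is retried at progressively higher temperatures when quality heuristics (average log probability, compression ratio) fail. The audio filename and `decode_fn` below are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("speech.mp3", "rb") as f:           # hypothetical audio file
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        temperature=0,                         # start deterministic; server may fall back
    )
print(transcript.text)

# Conceptually (mirroring whisper's transcribe.py), the fallback loop looks like:
def transcribe_with_fallback(decode_fn, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                             logprob_threshold=-1.0, compression_ratio_threshold=2.4):
    for t in temperatures:
        result = decode_fn(temperature=t)      # decode_fn is a hypothetical decoder call
        if (result.avg_logprob > logprob_threshold
                and result.compression_ratio < compression_ratio_threshold):
            return result                      # good enough; stop escalating temperature
    return result                              # last attempt if every threshold failed
```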
2
votes
0 answers
Causal attention with left padding
I am trying to train a decoder-only transformer model. The dataset is left-padded to a fixed length so sequences of tokens can be batched. However, when I try to pass input through a multi-head attention layer, with both a key padding mask and…
xnsc
- 21
- 1
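A minimal sketch of the setup the question describes, assuming PyTorch's nn.MultiheadAttention. With left padding plus a causal mask, the earliest (padded) query positions can end up with every key masked out, which is a common source of NaNs in this combination.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 5, 16
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
x = torch.randn(batch, seq_len, d_model)

# True = masked. Sequence 0 has two left-pad tokens, sequence 1 has none.
key_padding_mask = torch.tensor([[True, True, False, False, False],
                                 [False, False, False, False, False]])
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

out, _ = attn(x, x, x, key_padding_mask=key_padding_mask, attn_mask=causal_mask)
print(out[0, :2])  # pad-position rows: attention over an empty key set, often NaN

# One common workaround: keep the computation as-is, then zero out (or exclude from
# the loss) the outputs at padded positions rather than expecting them to be defined.
out = out.masked_fill(key_padding_mask.unsqueeze(-1), 0.0)
```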
1
vote
1 answer
Is an autoencoder model encoder-only or encoder-decoder?
I'm writing up the different model architectures used in NLP, namely encoder-only, decoder-only, and encoder-decoder models, and have come across what seems to be a naming inconsistency. For decoder-only models it seems that they can be referred to as…
KurtMica
- 111
- 3
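For context, a minimal autoencoder sketch: it contains both an encoder and a decoder module, but (unlike a transformer encoder-decoder) there is no cross-attention and no autoregressive generation, which is part of why the terminology varies across write-ups.

```python
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        # Encoder compresses the input into a latent vector; decoder reconstructs it.
        self.encoder = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(latent_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))  # reconstruct the input from the latent
```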
1
vote
1 answer
Masking in Decoder of Transformer
I understand that the masked multi-head attention block ensures that the generation of the token at time step t doesn't rely on subsequent tokens of the input. But the residual connection, which adds the input to the output of masked multi-head attention,…
SAGALPREET SINGH
- 147
- 1
- 10
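A minimal sketch of the point at issue, assuming a standard PyTorch attention layer: the residual connection adds each position's own input vector back to that same position's attention output, so it cannot leak information from later time steps even though the full input tensor is reused.

```python
import torch
import torch.nn as nn

seq_len, d_model = 4, 16
x = torch.randn(1, seq_len, d_model)
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
attn_out, _ = attn(x, x, x, attn_mask=causal)

y = x + attn_out          # residual: y[:, t] depends only on x[:, :t+1]
print(y.shape)            # (1, 4, 16); position t mixes x[t] with attention over <= t
```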
1
vote
1 answer
Transformer decoder. Causal masking during inference?
I understand how causal masking in the self-attention layer of the decoder works and why we use it during training. What I want to ask is: should we use causal masking during inference?
Consider a machine translation task where you need to…
pi-tau
- 995
- 6
- 12
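A sketch of greedy autoregressive decoding, assuming a hypothetical `model(ids)` that applies its own causal masking internally and returns logits of shape (batch, seq_len, vocab). Only the last position's logits are read at each step, so for next-token prediction the mask does not change the result; it is normally kept at inference anyway so per-position activations match training.

```python
import torch

def greedy_decode(model, input_ids, max_new_tokens=10, eos_id=2):
    # `model` and `eos_id` are placeholders for this sketch.
    ids = input_ids
    for _ in range(max_new_tokens):
        logits = model(ids)                      # (1, len(ids), vocab), causal inside
        next_id = logits[:, -1].argmax(dim=-1)   # only the final position is used
        ids = torch.cat([ids, next_id[:, None]], dim=1)
        if next_id.item() == eos_id:             # stop at end-of-sequence
            break
    return ids
```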
0
votes
1 answer
Autoregressive models (LLM) inference prediction
When predicting the next word in autoregressive models (LLMs), does the attention mechanism use queries from every word so far, or only from the previous word? For example, when predicting the word after the sentence "I love", does the attention mechanism take the query values for "I" and "love" and…
adithya
- 11
- 2
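A small numeric check of what the question asks, assuming single-head scaled dot-product attention: the output for the last position is identical whether queries are computed for every token ("I" and "love") or only for the most recent one, because each output row depends only on its own query and the keys/values seen so far.

```python
import torch

torch.manual_seed(0)
seq_len, d = 2, 8                         # e.g. the prompt "I love"
q = torch.randn(seq_len, d)
k = torch.randn(seq_len, d)
v = torch.randn(seq_len, d)

full = torch.softmax(q @ k.T / d**0.5, dim=-1) @ v            # queries for all tokens
last_only = torch.softmax(q[-1:] @ k.T / d**0.5, dim=-1) @ v  # query for "love" only

print(torch.allclose(full[-1:], last_only))   # True: same prediction for the next word
```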
0
votes
0 answers
Grayscale to RGB888 vs RGB332 to RGB888 in the same colorization training between two universes
Suppose there are two parallel universes that train deep learning models for color resolution.
The first universe uses a grayscale image as input with dimensions (640, 480, 1); the second universe uses an RGB332 image as input with the same dimensions…
Muhammad Ikhwan Perwira
- 800
- 3
- 10
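A sketch of the two input encodings being compared, assuming 8 bits per pixel in both universes: a single luminance byte versus an RGB332 byte packing 3+3+2 bits of color. Both inputs have shape (640, 480, 1); they simply discard different information from the RGB888 source.

```python
import numpy as np

rgb888 = np.random.randint(0, 256, (640, 480, 3), dtype=np.uint8)  # stand-in source image

# Universe 1: grayscale, one luminance byte per pixel (ITU-R BT.601 weights).
gray = (0.299 * rgb888[..., 0] + 0.587 * rgb888[..., 1] + 0.114 * rgb888[..., 2])
gray = gray.astype(np.uint8)[..., None]                             # (640, 480, 1)

# Universe 2: RGB332, 3 bits red, 3 bits green, 2 bits blue packed into one byte.
r, g, b = rgb888[..., 0] >> 5, rgb888[..., 1] >> 5, rgb888[..., 2] >> 6
rgb332 = ((r << 5) | (g << 2) | b).astype(np.uint8)[..., None]      # (640, 480, 1)

print(gray.shape, rgb332.shape)   # both (640, 480, 1): same shape, different content
```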
0
votes
0 answers
Low resolution color channel into high resolution color channel
There is super-resolution enhancement, but it concerns image dimension resolution: for example, a 128×128×3 image can be enhanced into a 2048×2048×3 HD image, where the color information is still 24-bit.
But is there a model that can decode low resolution…
Muhammad Ikhwan Perwira
- 800
- 3
- 10
-1
votes
1 answer
Does a decoder in a transformer model generate output embeddings like the following?
Encoder:
Input: [A, B, C, D] (word embeddings)
Output: [C1, C2, C3, C4] (contextual representations)
The encoder processes the input sequence [A, B, C, D] and generates contextual representations [C1, C2, C3, C4]. The specific calculations involved…
Steven
- 99
- 1
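A shape-level sketch of the flow the question describes, using PyTorch's nn.Transformer as a stand-in: the encoder turns the source embeddings [A, B, C, D] into contextual representations [C1, C2, C3, C4] (the "memory"), and the decoder produces one contextual output vector per target position by attending to that memory.

```python
import torch
import torch.nn as nn

d_model = 16
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=1, num_decoder_layers=1,
                       batch_first=True)

src = torch.randn(1, 4, d_model)   # embeddings of [A, B, C, D]
tgt = torch.randn(1, 3, d_model)   # embeddings of the target tokens generated so far

memory = model.encoder(src)        # (1, 4, d_model): the contextual [C1, C2, C3, C4]
out = model.decoder(tgt, memory)   # (1, 3, d_model): one output embedding per target token
print(memory.shape, out.shape)
```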