I have a very rough understanding of the attention/self-attention mechanism in transformer models. As I understand it, a set of word vectors is provided as the input/prompt to the encoder of the network, which produces "attention weights" for those word vectors based on positional encodings and some other learnable parameters (the key/query/value transforms). These are then "fed" to the decoder part of the network, which also considers the word vectors the decoder has produced so far and influences word selection by paying special attention to particular word combinations.
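For concreteness, my rough mental model of how the attention weights come out of the query/key transforms looks something like the sketch below (scaled dot-product self-attention in plain Python). The sizes, the random weights, and names such as `self_attention_weights` are made up for illustration, and positional encodings and the value transform are left out, so treat it as a sketch under those assumptions rather than how any real model is implemented.

```python
import math
import random


def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    # Multiply a (rows x cols) matrix by a vector of length cols.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def self_attention_weights(embeddings, w_query, w_key):
    # Project every token embedding into a query and a key vector.
    queries = [matvec(w_query, e) for e in embeddings]
    keys = [matvec(w_key, e) for e in embeddings]
    scale = math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        # One score per position in the sequence, then normalise with softmax.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights.append(softmax(scores))
    return weights


random.seed(0)
dim = 4  # hypothetical embedding size
tokens = ["the", "cat", "sat"]  # hypothetical prompt

# Random stand-ins for learned embeddings and learned query/key transforms.
embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in tokens]
w_query = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
w_key = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]

attn = self_attention_weights(embeddings, w_query, w_key)
for tok, row in zip(tokens, attn):
    print(tok, [round(w, 3) for w in row])
```

In this sketch each row of weights is a distribution over the positions in the input sequence (three weights for three tokens), not over the words of a vocabulary.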
However, LLMs clearly produce words in their output/response that do not occur anywhere in the "prompt" text. So they must be using these "attention weights" to consider words from a wider vocabulary, which could be quite large.
Is it the case that the decoder "considers" each possible word in its entire vocabulary when producing an output word? For example, I'm imagining an input layer to a NN with several thousand nodes (one per word vector in the dictionary), whose values are then "combined" through some operation with the attention weights (from the encoder and decoder "attention sections"). This would produce values that are very low for most word vectors (below the threshold of some activation function), but each word would still be "considered" to an extent. Or is only a subset of words considered in some way?
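To make the "every word is considered" option concrete, here is a minimal sketch of what I have in mind: the decoder's final hidden vector is scored against one weight row per vocabulary entry, and a softmax turns those scores into a probability for every word. The tiny vocabulary, the random weights, and the names `next_token_distribution` and `output_matrix` are all hypothetical, and I am assuming a plain linear projection followed by softmax rather than describing any particular model.

```python
import math
import random


def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(hidden, output_matrix):
    # One score (logit) per vocabulary entry: dot product of the hidden
    # vector with that entry's row of the output matrix.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in output_matrix]
    return softmax(logits)


random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog"]  # hypothetical vocabulary
hidden_dim = 4  # hypothetical hidden size

# Random stand-ins for a learned output projection and a decoder hidden state.
output_matrix = [[random.gauss(0, 1) for _ in range(hidden_dim)] for _ in vocab]
hidden = [random.gauss(0, 1) for _ in range(hidden_dim)]

probs = next_token_distribution(hidden, output_matrix)
best = max(range(len(vocab)), key=lambda i: probs[i])

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
print("most likely next word:", vocab[best])
```

Under this sketch every vocabulary entry receives a nonzero probability, even if most are tiny, which is the "each word is still considered to an extent" picture described above.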