For a more concrete discussion, let's focus on GPT-2, which is an auto-regressive model. I fully understand why we need masking during training, but I need clarification on why we need masking during inference. At inference time the model uses all available information to predict the next token, so masking should not be necessary. And if we do not need masking at inference time, I don't understand where the context-window restriction in LLMs comes from. The only place the context window appears when configuring the model's architecture is in defining the triangular masking matrix of size context_window × context_window. But at inference time this masking matrix could be dynamically replaced by a matrix of ones, since we don't need masking. By this argument, an LLM, aside from memory limitations, should be able to handle any context length. Please clarify where this argument falls short. In short, the reason I ask this question is twofold:
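To make the mask I am talking about concrete, here is a toy numpy sketch (the window size `T = 8` is made up). Note that the last row of the triangular mask is already all ones, which is part of why I don't see what masking adds when we only care about the last position:

```python
import numpy as np

T = 8  # hypothetical context window size

# Triangular causal mask of shape (T, T): position i may attend to positions j <= i.
mask = np.tril(np.ones((T, T), dtype=int))

first_row = mask[0]   # the first position attends only to itself
last_row = mask[-1]   # the last position attends to everything: a row of ones
```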
I see that in public GPT implementations this masking matrix is defined even at inference time. Say the input sequence has length ten and we only need to predict the 11th token: the probability distributions for the 2nd, 3rd, ..., 11th positions are all computed, but every one except the 11th is discarded. This process seems very wasteful.
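Here is a toy numpy sketch of the computation I am describing (all shapes and weights are made up; a single masked attention head followed by a vocabulary projection, standing in for the whole model). It produces a full (T, V) matrix of next-token logits, one row per position, even though only the last row is used to sample the 11th token:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, V = 10, 4, 6  # sequence length, hidden dim, vocab size (all hypothetical)

x = rng.normal(size=(T, d))           # token representations for the 10 input tokens
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))
Wout = rng.normal(size=(d, V))        # projection to vocabulary logits

# Causal self-attention: mask out scores where position i would see j > i.
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
h = weights @ v

logits = h @ Wout              # shape (T, V): one next-token distribution per position
next_token_logits = logits[-1]  # only this row matters when predicting the 11th token
```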
It is not clear to me why LLMs are constrained by a fixed context window at all.