
Imagine we have some sort of "next token predictor," whether a transformer architecture, an LSTM, or just an HMM (though the terminology I use here is less aligned with HMMs, I believe the question generalizes to all generative NLP).

We reverse the cost function: that is, we train to maximize error instead of minimizing it. A model whose error is neither maximized nor minimized would behave fairly boringly. However, a model that is maximizing error may still need to learn the patterns of syntax, and which words usually follow one another, in order to avoid them. I would expect that, in some abstract way, it might behave creatively, because it is trying to produce output that is not in the training data and is as far from it as possible. Ideally, it would even have to understand the user's query in order to avoid the words that would normally follow it.

This makes me think the output may be non-boring, although probably not practically useful.
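For concreteness, here is what I mean by "reversing" the cost function. This is a minimal PyTorch sketch; the model, data, and hyperparameters are placeholders, and the only change from an ordinary training loop is the sign flip on the loss:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy next-token predictor: embedding -> LSTM -> vocabulary logits."""
    def __init__(self, vocab_size=100, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.lstm(self.embed(tokens))
        return self.head(hidden)

vocab_size = 100
model = TinyLM(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (8, 16))   # stand-in training batch
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = model(inputs)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

optimizer.zero_grad()
# The reversal: descending on (-loss) is gradient *ascent* on the error.
(-loss).backward()
optimizer.step()
```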

BigMistake

3 Answers


You cannot really invert the loss, because the resulting maximization problem is most likely not well-defined.

Take linear regression with OLS: we know the loss function is quadratic with respect to the parameters (assuming a single covariate):

$$\mathcal{L}(\beta_0, \beta_1) = \sum_{i=1}^{n} \bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2$$

Now, if you try to maximize that function, you quickly see that it has no maximum, so you would just get ginormous predictions, and no pattern would be caught/learnt.
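You can check this numerically. A minimal numpy sketch with synthetic data (true slope 2, intercept dropped for simplicity); gradient ascent on the squared-error loss pushes the slope away from the truth without bound:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.1, size=100)   # true slope is 2

beta, lr = 0.0, 0.01
for step in range(201):
    grad = -2 * np.mean((y - beta * x) * x)     # dL/dbeta for L = MSE
    beta += lr * grad                           # gradient *ascent* on L
    if step % 50 == 0:
        print(step, round(beta, 2), round(np.mean((y - beta * x) ** 2), 2))
# beta drifts ever further from 2 and the loss grows without bound
```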

Alberto

In HMMs, a simple mechanism to reduce overfitting, and thereby generate variety in the system output, is tweaking the transition matrix A and/or the symbol-emission probability matrix B. Allowing some transitions beyond those observed in the training data, or emitting symbols in states where they were never actually seen, can introduce richer behavior.
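One simple way to do this tweaking is additive smoothing of the estimated matrices. A minimal numpy sketch (the matrices here are made up; zeros mark transitions/emissions never observed in training):

```python
import numpy as np

# Estimated transition matrix A (3 states) and emission matrix B (2 symbols).
A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.3, 0.0, 0.7]])
B = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])

def smooth(M, eps=0.05):
    """Additive smoothing: give every entry some mass, then renormalize rows."""
    M = M + eps
    return M / M.sum(axis=1, keepdims=True)

A_s, B_s = smooth(A), smooth(B)
print(A_s)  # every transition now has nonzero probability
```

Larger `eps` moves the model further from the training data and hence toward more varied (and less faithful) output.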


Going off of @Alberto's answer:

I agree that you would just get ginormous predictions, but I am not convinced that no pattern would be learnt: in your example, we would learn to avoid predictions near zero. For a more complex error landscape, this might be more interesting. We would not just get ginormous predictions, but the most ginormous predictions reachable: we would not ascend the landscape randomly, we would ascend via the gradient, along the steepest possible path. With many variables, that steepest path may be interesting or unique.
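A toy illustration of "learning a pattern in order to avoid it": suppose token 0 is always followed by token 1 in the corpus, and we maximize the cross-entropy of a softmax next-token predictor. This minimal numpy sketch (the corpus and learning rate are made up) shows the model driving the probability of the true continuation toward zero, which requires it to have identified that continuation in the first place:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

vocab = 4
logits = np.zeros(vocab)        # logits for p(next token | current token = 0)

lr = 0.5
for _ in range(100):
    p = softmax(logits)
    # Gradient of the cross-entropy -log p[1] wrt the logits is (p - onehot(1));
    # stepping along it *maximizes* the loss.
    grad = p.copy()
    grad[1] -= 1.0
    logits += lr * grad

print(softmax(logits).round(3))  # p(true next token) driven toward 0
```

The learnt distribution is the mirror image of the data: near-zero mass on the observed continuation, and the remaining mass spread over everything else.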

BigMistake