7

(Just to preface: I don't have a great understanding of LLMs and AI in general...)

My question is, when I pose a question to an LLM, will it present the fastest response that satisfies the parameters of the query, irrespective of whether it is capable of providing a better answer with more compute? ("Better" in this sense meaning that if you posed the question and both answers to the LLM, the LLM itself would acknowledge that the second answer was "better" by some metric.)

I would provide some examples, but I feel that might be counterproductive - I want to avoid focusing on a specific type of query. Also, there are clearly many questions whose answers involve no complexity - e.g. "what is the capital of Thailand?" - so more compute will not improve the answer in any meaningful way.

I guess another way of asking this question is, what parameters does an LLM use to decide when an answer is complete/sufficient?

Oktarine

5 Answers

15

An LLM isn’t holding back compute in the hope of improving an answer; it generates a response autoregressively, token by token, according to its learned probability distribution and the chosen decoding parameters, and it stops when those parameters indicate it’s done. Decoding parameters include the temperature, which controls how randomly the next token is drawn from the distribution (lower is more deterministic, higher is more creative); top-p (nucleus) and top-k sampling, which restrict the set of candidate tokens; and the maximum output length. Together these shape both the style and the length of the answer. This is why simple queries like “What is the capital of Thailand?” are answered quickly with a straightforward fact: most of the inferred next-token distributions are sharply peaked and insensitive to the decoding parameters, whereas the distributions for more complex or open-ended queries are flatter and more sensitive to them.
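As a rough illustration of how these decoding parameters are exposed in practice, here is a minimal sketch using the Hugging Face transformers API (the model name and parameter values are arbitrary choices for illustration, not a recommendation):

```python
# Minimal sketch of decoding parameters, assuming the Hugging Face
# transformers library; model name and values are arbitrary examples.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("What is the capital of Thailand?", return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,      # sample from the distribution instead of greedy argmax
    temperature=0.7,     # <1 sharpens the next-token distribution, >1 flattens it
    top_p=0.9,           # nucleus sampling: keep the smallest token set with cumulative prob >= 0.9
    top_k=50,            # also cap the candidate set at the 50 most likely tokens
    max_new_tokens=64,   # hard cap on output length
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```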

Having said that, there's another set of decoding parameters for beam search, including num_beams and length_penalty, which does explore multiple candidate token sequences and can improve quality at the cost of more compute. Default setups often use faster greedy or sampling-based decoding that follows a single path, prioritizing efficiency, but enabling beam search and a larger maximum output length might yield a more detailed and "better" answer at the expense of compute.
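For comparison, here is the same call with beam search turned on (continuing the sketch above; the num_beams and length_penalty values are just examples):

```python
# Beam search: keep several candidate sequences alive at each step,
# trading extra compute for (often) higher-scoring output.
# Reuses model, tokenizer and inputs from the sketch above.
outputs = model.generate(
    **inputs,
    do_sample=False,
    num_beams=4,          # explore 4 candidate sequences in parallel
    length_penalty=1.2,   # >1 mildly favours longer completions
    max_new_tokens=128,   # allow a longer answer than before
    early_stopping=True,  # stop once all beams have finished
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```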

Because the model isn’t comparing multiple full answers by default, it doesn’t internally compute that one answer is “better” than another given more compute. Of course if you were to run the generation process multiple times with different random seeds or decoding parameters, you might get answers that you’d later judge as “better” in some respects, but the LLM isn’t actively trying to optimize or reflect on the generation quality beyond following its learned distribution.
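If you want to see that point concretely, you can sample the same prompt a few times with different seeds and judge the outputs yourself (again continuing the sketch above; set_seed is the transformers reproducibility helper):

```python
from transformers import set_seed

# Same prompt, same model, different random seeds: the model never compares
# these candidates itself, but you (or a separate reward model) could.
for seed in (0, 1, 2):
    set_seed(seed)
    out = model.generate(**inputs, do_sample=True, temperature=0.9, max_new_tokens=64)
    print(f"--- seed {seed} ---")
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```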

cinch
9

LLMs have a context length and a token limit. The context length is a limit on the input; the token limit is a limit on the output.

The token limit is something you can easily enforce programmatically: whenever the output is longer, you return an error message. However, you don't want to compute for a very long time and then tell the user "oh, the output is too long, please try again". Therefore, you want to train the LLM to give answers that fit into the token limit (probably via reinforcement learning).
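A minimal sketch of the "enforce it programmatically" part (the 1024-token limit and the error wording are made-up examples, not any particular provider's behaviour):

```python
from transformers import AutoTokenizer

TOKEN_LIMIT = 1024  # made-up example value
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def check_output(text: str) -> str:
    """Return the model output unchanged, or an error message if it is too long."""
    n_tokens = len(tokenizer.encode(text))
    if n_tokens > TOKEN_LIMIT:
        return f"Error: output is {n_tokens} tokens, limit is {TOKEN_LIMIT}."
    return text
```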

You also want the LLM to be able to hold a conversation with the user. Therefore, the token limit should be smaller than the context length, so that an average conversation of ~10 follow-up questions still fits into the context length.

So: while there's no technical need for short answers, there is a business requirement and a usability need for not-too-long answers.
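As a back-of-the-envelope illustration of that budget (all numbers are made up, purely to show the relationship between the two limits):

```python
# Toy arithmetic: if ~10 question/answer turns must fit into the context
# window, each answer has to stay well below the window size.
context_length = 8192        # hypothetical context window, in tokens
followup_turns = 10
avg_question_tokens = 100    # assumed average user message length
answer_budget = context_length // followup_turns - avg_question_tokens
print(answer_budget)         # 719 tokens per answer in this toy example
```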

Thomas Weller
5

An LLM just predicts the next token in the sequence. At some point, the best next token is the one that ends the message.
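A minimal sketch of that "predict tokens until ending the message wins" loop, using greedy argmax with transformers (the model choice and the 50-token safety cap are arbitrary):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The capital of Thailand is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(50):                      # arbitrary safety cap on length
        logits = model(ids).logits[0, -1]    # distribution over the next token
        next_id = int(torch.argmax(logits))  # greedy: pick the single most likely token
        if next_id == tokenizer.eos_token_id:
            break                            # "end the message" was the best next token
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=-1)

print(tokenizer.decode(ids[0]))
```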

What may seem like larger models "thinking" more is just a larger model representing the underlying dynamics of the dataset it was trained on better.

Features like GPT-o1 "think longer" by doing the same thing, but intentionally breaking the problem down into sub-steps as they go. At the end of the day, though, the model is still just picking words as it goes along.

foreverska
2

The amount of compute used (at least for the decoding step) depends on the prompt or context length, which, in most implementations, includes the entire conversation, since it is fed back in with each query. Some difficult prompts also trigger reasoning models like *seek to produce a lot of hidden "thinking" output (presented as a clickable thought bubble), which adds to the overall generation time. That is why they recommend starting a new conversation whenever possible.

As for deciding when an answer is complete, a hard token-generation limit may be set. This setting is usually hidden from the user, but it is available in the playground. You can also stop the model from generating by pressing the stop button, or by telling it how many tokens to produce if compute is an issue. Certain implementations limit the input context length below spec, such as for the free plan, and will warn if it is exceeded.

I recommend downloading a large language model, e.g. from huggingface, and loading it up using llama.cpp [parameters] [some model name]. You can then see and feel the lag as each token is generated.
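If you'd rather poke at this from Python, the llama-cpp-python bindings wrap the same engine; a hedged sketch (the .gguf file name is a placeholder for whatever model you actually downloaded):

```python
# Sketch using the llama-cpp-python bindings around llama.cpp.
from llama_cpp import Llama

llm = Llama(model_path="some-model.Q4_K_M.gguf", n_ctx=2048)  # placeholder file name
out = llm("Q: What is the capital of Thailand? A:", max_tokens=32, stop=["Q:"])
print(out["choices"][0]["text"])
```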

Models < 2B parameters can generally work without a special GPU or CUDA. Heavily quantized models up to 8B parameters in size can work on as little as 4GB of VRAM. Free advertising: many such small, quantized models can be found on hellork's huggingface page.

hellork
2

if you posed the question and both answers to the LLM, the LLM itself would acknowledge that the second answer was "better"

Actually, it would be a fun experiment for someone who has the access and time to play with a full model, or who can query model logprobs for a provided text.

  1. Ask an LLM some difficult question Q where you know the answer.
  2. Calculate logprob of the initial LLM answer A (given the question).
  3. Suggest the actually correct answer B to the LLM, but ask the LLM to rewrite it as if it were its own answer to the question; let's call this answer B'.
  4. Calculate the logprob of the answer B' (given the question).

Is B' superior to A (in your evaluation) and also logprob(B'|Q) > logprob(A|Q)? Then the greedy text generation, beam search or whatever method the LLM uses to generate the text is suboptimal, and the LLM is actually "smarter" than it may seem from its first response.
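Here is a hedged sketch of steps 2 and 4, i.e. scoring a provided answer under the model with transformers; answer_logprob is just a name made up for this sketch, and any causal LM with accessible logits would do:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_logprob(question: str, answer: str) -> float:
    """Sum of log P(each answer token | question and preceding answer tokens)."""
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    a_ids = tokenizer(answer, return_tensors="pt").input_ids
    ids = torch.cat([q_ids, a_ids], dim=-1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    total = 0.0
    # The logits at position i predict token i+1, so each answer token is
    # scored by the distribution at the position just before it.
    for i in range(a_ids.shape[1]):
        pos = q_ids.shape[1] + i - 1
        total += log_probs[0, pos, a_ids[0, i]].item()
    return total

# Higher means "more likely under the model given the question":
# answer_logprob("Q: ...?", " A ...") vs. answer_logprob("Q: ...?", " B' ...")
```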

Valentas