
In LLMs today, temperature is implemented by scaling the logits that feed the softmax function at the end of the neural network.

In physics, temperature increases the motion of atoms, as in Brownian motion, where every particle undergoes small random fluctuations.

I wonder whether this has been tried in neural networks: instead of scaling at the softmax, every parameter could be slightly randomized in order to produce a different output. I understand that this is probably not practical in large LLMs like ChatGPT, because it would require billions of random numbers and billions of parameter modifications. Also, for quantized parameters, the random changes would either be too large or not affect the quantized value at all.
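For concreteness, here is a rough sketch of what I mean (hypothetical code, not an existing feature; `sigma` and the Hugging-Face-style `model(input_ids).logits` call are just assumptions for illustration):

```python
import torch

def sample_with_weight_noise(model, input_ids, sigma=0.01):
    """Hypothetical 'Brownian' temperature: jiggle every parameter slightly,
    run one forward pass, then restore the original weights."""
    originals = [p.detach().clone() for p in model.parameters()]
    with torch.no_grad():
        for p in model.parameters():
            p.add_(sigma * torch.randn_like(p))       # billions of random numbers for an LLM
        logits = model(input_ids).logits               # assumes a Hugging-Face-style causal LM
        next_token = logits[:, -1, :].argmax(dim=-1)   # greedy pick; the randomness comes from the weights
        for p, orig in zip(model.parameters(), originals):
            p.copy_(orig)                              # undo the perturbation
    return next_token
```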

Has this been done before? Was temperature implemented as Brownian motion, e.g. before we switched to softmax?

Thomas Weller

1 Answer


Perturbing the network's parameters in a Brownian-motion-like fashion to reflect "temperature" is not how temperature is typically implemented in LLMs. Instead, for non-distilled LLMs, temperature is applied directly to the output logits via a scaling factor before the softmax, as an inference-time adjustment. This approach is computationally efficient and ties neatly into the interpretation of the softmax as a Boltzmann distribution, in which temperature regulates the randomness of the sampling process. Because temperature scaling only affects the final probability distribution, it leaves the core computations intact and has proven to be a robust and effective way to balance determinism and randomness in the generated text.

Writing $b = e^{\beta}$ or $b = e^{-\beta}$ yields the expressions:
$$\sigma(\mathbf{z})_i = \frac{e^{\beta z_i}}{\sum_{j=1}^{K} e^{\beta z_j}} \quad\text{or}\quad \sigma(\mathbf{z})_i = \frac{e^{-\beta z_i}}{\sum_{j=1}^{K} e^{-\beta z_j}} \quad\text{for } i = 1, \dotsc, K.$$
A value proportional to the reciprocal of $\beta$ is sometimes referred to as the temperature: $\beta = 1/kT$, where $k$ is typically $1$ or the Boltzmann constant and $T$ is the temperature. A higher temperature results in a more uniform output distribution (i.e. with higher entropy; it is "more random"), while a lower temperature results in a sharper output distribution, with one value dominating.
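For illustration, here is a minimal sketch of how this looks at sampling time, assuming PyTorch; dividing the logits by $T$ is the same as multiplying by $\beta = 1/T$ above:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0):
    """Temperature sampling: scale the logits by 1/T before the softmax,
    then draw the next token from the resulting distribution."""
    if temperature == 0:                              # T -> 0 recovers greedy (argmax) decoding
        return logits.argmax(dim=-1)
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

# Higher T flattens the distribution, lower T sharpens it:
logits = torch.tensor([2.0, 1.0, 0.5])
print(F.softmax(logits / 0.5, dim=-1))  # sharper, one token dominates
print(F.softmax(logits / 2.0, dim=-1))  # flatter, closer to uniform
```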

There are methods such as NoisyNets in RL or Bayes by Backprop in Bayesian neural networks that inject noise into the weights to encourage exploration or to represent uncertainty during training, but these are used outside LLMs. For inference in LLMs, controlled randomness at the output level via temperature scaling is usually sufficient and preferable; weight-noise injection during LLM training would be computationally prohibitive and would alter the cross-entropy loss landscape by "softening" the targets, which might have unintended effects on convergence or calibration. That said, in knowledge distillation, a teacher model's output probabilities are often softened with a temperature parameter during the training stage to provide a richer training signal for the student.
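As a sketch of that last point (assuming PyTorch; `T=2.0` is just an illustrative value), the soft-target part of a Hinton-style distillation loss looks like this:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target distillation loss: both teacher and student logits are
    softened with temperature T, and the student is trained to match the
    teacher's softened distribution."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the softened distributions; T**2 rescales the
    # gradient magnitude back to the same order as the hard-label loss
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (T ** 2)
```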

cinch