Example: Suppose a character-level language model that takes three input letters and predicts the next one, trained on a dataset containing three instances of the sequence aei: two followed by o and one followed by u. That is, the dataset is:
| Input | Output |
|---|---|
| aei | o |
| aei | u |
| aei | o |
In this case, the ideal probability distribution over the next character given the input aei is $\frac{2}{3}$ for o, $\frac{1}{3}$ for u, and zero for all other letters. In other words, when the model is given aei as input, the ideal softmax of its logits should match this empirical distribution.
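This empirical distribution can be computed directly from the dataset; a minimal sketch (the variable names here are illustrative, not from the original post):

```python
from collections import Counter

# The toy dataset of (input, next-character) pairs from the example above.
pairs = [("aei", "o"), ("aei", "u"), ("aei", "o")]

# Count how often each next character follows the context "aei".
counts = Counter(target for ctx, target in pairs if ctx == "aei")
total = sum(counts.values())

# Empirical next-character distribution: 'o' -> 2/3, 'u' -> 1/3.
dist = {ch: n / total for ch, n in counts.items()}
```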
Following this reasoning, the objective is to optimize the model's output for a given input to match the distribution of occurrences in the dataset.
If this reasoning is correct, then the minimum achievable cross-entropy loss is:
$$ L = \frac{- \log\left(\frac{2}{3}\right) - \log\left(\frac{1}{3}\right) - \log\left(\frac{2}{3}\right)}{3} \approx 0.637 $$
Thus, $\sim 0.637$ is the smallest loss we can get with this dataset.
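The average cross-entropy over the three examples can be checked numerically, assuming the model outputs the empirical distribution $p(o) = 2/3$, $p(u) = 1/3$ for the context aei:

```python
import math

# Targets for the three occurrences of "aei" in the dataset.
targets = ["o", "u", "o"]

# Empirical (optimal) predicted distribution for the context "aei".
p = {"o": 2 / 3, "u": 1 / 3}

# Mean negative log-likelihood, i.e. the cross-entropy loss.
loss = -sum(math.log(p[t]) for t in targets) / len(targets)
print(round(loss, 4))  # 0.6365
```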
Is my reasoning correct?