Example: Suppose a character-level language model that takes three input letters and predicts the next one, trained on a dataset containing three instances of the sequence aei: two followed by o and one followed by u. That is, the dataset is:
| Input | Output |
|---|---|
| aei | o |
| aei | u |
| aei | o |
In this case, the ideal probability distribution over the next character given the input aei is $\frac{2}{3}$ for o, $\frac{1}{3}$ for u, and zero for all other letters. In other words, when the model is given aei as input, the ideal softmax of its logits should match this empirical distribution.
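This empirical distribution can be computed directly from the dataset; a minimal sketch (the variable names here are illustrative, not from the original post):

```python
from collections import Counter

# The toy dataset of (input, next-character) pairs from the example above.
pairs = [("aei", "o"), ("aei", "u"), ("aei", "o")]

# Count how often each next character follows the context "aei".
counts = Counter(target for ctx, target in pairs if ctx == "aei")
total = sum(counts.values())

# Empirical next-character distribution: 'o' -> 2/3, 'u' -> 1/3.
dist = {ch: n / total for ch, n in counts.items()}
```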
Following this reasoning, the objective is to optimize the model's output for a given input to match the distribution of occurrences in the dataset.
If this reasoning is correct, then the minimum achievable cross-entropy loss is:
$$ L = \frac{- \log\left(\frac{2}{3}\right) - \log\left(\frac{1}{3}\right) - \log\left(\frac{2}{3}\right)}{3} \approx 0.637 $$
Thus, $\sim 0.637$ is the smallest loss we can get with this dataset.
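The average cross-entropy over the three examples can be checked numerically, assuming the model outputs the empirical distribution $p(o) = 2/3$, $p(u) = 1/3$ for the context aei:

```python
import math

# Targets for the three occurrences of "aei" in the dataset.
targets = ["o", "u", "o"]

# Empirical (optimal) predicted distribution for the context "aei".
p = {"o": 2 / 3, "u": 1 / 3}

# Mean negative log-likelihood, i.e. the cross-entropy loss.
loss = -sum(math.log(p[t]) for t in targets) / len(targets)
print(round(loss, 4))  # 0.6365
```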
Is my reasoning correct?