
Section 1.2.1.6 (Maximum likelihood estimation) of Probabilistic Machine Learning: An Introduction by Kevin P. Murphy says the following:

When fitting probabilistic models, it is common to use the negative log probability as our loss function: $$\mathcal{l}(y, f(\mathbf{x}; \mathbf{\theta})) = -\log p(y \mid f(\mathbf{x} ; \mathbf{\theta})) \tag{1.13}$$ The reasons for this are explained in Section 5.1.6.1, but the intuition is that a good model (with low loss) is one that assigns a high probability to the true output $y$ for each corresponding input $\mathbf{x}$.

How does "a good model (with low loss) is one that assigns a high probability to the true output $y$ for each corresponding input $\mathbf{x}$" relate to $\mathcal{l}(y, f(\mathbf{x}; \mathbf{\theta})) = -\log p(y \mid f(\mathbf{x} ; \mathbf{\theta}))$?
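For concreteness, here is how $-\log p$ behaves numerically as the probability assigned to the true output varies (a minimal sketch with illustrative values, not from the book):

```python
import numpy as np

# Illustrative values only: the per-example loss -log p(y | f(x; theta))
# is small when the model puts high probability on the true output y,
# and grows without bound as that probability shrinks toward 0.
for p_true in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p(y|x) = {p_true:4.2f}  ->  -log p = {-np.log(p_true):.4f}")
```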

The Pointer

2 Answers


The loss function you are minimising for classification is the cross-entropy between the modelled probability $p(y \mid f(x, \theta))$ and the true probability $p_{\text{gt}}(y \mid x)$: $$ \text{CrossEntropy}(p, p_{\text{gt}}) = -\sum_c p_{\text{gt}}(c \mid x) \log p(c \mid f(x, \theta)) $$ However, for observed points you already know $p_{\text{gt}}(y \mid x)$: it is 1 for the class you are observing and 0 everywhere else. In other words, if the class you observe is $c$: $$p_{\text{gt}}(y \mid x) = \delta(y = c)$$ So the cross-entropy reduces to: $$ \text{CrossEntropy}(p, p_{\text{gt}}) = -\sum_y \delta(y = c) \log p(y \mid f(x, \theta)) = -\log p(c \mid f(x, \theta)) $$ which is exactly the formula you mention in your question (except for notation: your $y$ is replaced by $c$ in my example).
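A minimal sketch of this reduction, assuming a 3-class problem with made-up probabilities (nothing here is from the book):

```python
import numpy as np

# With a one-hot ground-truth distribution, the full cross-entropy sum over
# classes collapses to the negative log-probability of the observed class.
def cross_entropy(p_gt, p_model):
    """-sum_c p_gt(c|x) * log p(c | f(x, theta))."""
    return -np.sum(p_gt * np.log(p_model))

p_model = np.array([0.1, 0.7, 0.2])   # model's predicted class probabilities
c = 1                                  # observed (true) class
p_gt = np.zeros_like(p_model)
p_gt[c] = 1.0                          # delta(y = c): one-hot ground truth

print(cross_entropy(p_gt, p_model))    # 0.3566...
print(-np.log(p_model[c]))             # identical: -log p(c | f(x, theta))
```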

As you said, a good model is one with a low loss. Since we have found that the loss for a single example with observed class $c$ is $-\log p(c \mid f(x, \theta))$, the loss over the training examples $(x_i, c_i)$ is basically: $$ \sum_i -\log p(c_i \mid f(x_i, \theta)) $$ It is easy to see that a model that fits the training set perfectly will have: \begin{align*} \sum_i -\log p(c_i \mid f(x_i, \theta)) &= 0 \\ \forall i, \; -\log p(c_i \mid f(x_i, \theta)) &= 0 \\ \forall i, \; p(c_i \mid f(x_i, \theta)) &= 1 \\ \end{align*} Therefore, a good model (with low loss) is one that assigns a high probability to the true output $y$ for each corresponding input $x$.
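A quick sketch of the perfect-fit argument with toy per-example probabilities (illustrative only):

```python
import numpy as np

# If the model assigns probability 1 to the true class of every training
# example, every term -log p(c_i | f(x_i, theta)) is 0, so the summed loss
# is 0; any probability below 1 contributes positive loss.
perfect_probs = np.array([1.0, 1.0, 1.0])     # p(c_i | f(x_i, theta)) per example
imperfect_probs = np.array([0.9, 0.8, 0.99])

print(np.sum(-np.log(perfect_probs)))    # 0.0
print(np.sum(-np.log(imperfect_probs)))  # > 0
```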

vl_knd

Note that your context here is supervised classification, and your LHS is not the usual empirical risk over the whole training set but the loss between the true target label for a given input and the learned model's output for that same input. Since you already understand the relation between log-probabilities and the softmax function, it is easy to see that the larger the conditional probability $p(y \mid f(\mathbf{x} ; \mathbf{\theta}))$ on the RHS of your equation, the smaller the loss should be, and the negative log-probability fits the bill intuitively.
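A small sketch of that monotonic relationship, assuming a softmax output layer and made-up logits (illustrative, not from the book):

```python
import numpy as np

# The larger the softmax probability assigned to the true class y,
# the smaller the negative log-probability loss.
def softmax(z):
    z = z - np.max(z)              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

y = 2                              # true class index
for logits in ([0.5, 0.2, 3.0], [0.5, 0.2, 1.0], [0.5, 0.2, -1.0]):
    p = softmax(np.array(logits))
    print(f"p(y|x) = {p[y]:.3f}  ->  loss = {-np.log(p[y]):.3f}")
```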

Of course, as your reference asserts, the KL divergence from a later chapter formalizes the notion of loss function above: a Dirac delta "one-hot" distribution $p(Y \mid x) = \mathbb{I}(Y = c)$ represents the degenerate distribution of the observed true target label $c$, and $q(Y \mid x)$ is the learned model's output distribution for the same input. Minimizing the loss thus means minimizing the KL divergence, which leads to minimizing the cross entropy of $p$ and $q$; that cross entropy is shown on page 179 to have exactly the form of the negative log-probability of $c$ (i.e., of $Y \mid x$).
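A quick sketch of the KL-divergence view with a one-hot $p$ and an illustrative $q$ (toy numbers, not from the book):

```python
import numpy as np

# With a degenerate one-hot p(Y|x) = I(Y = c), the entropy of p is 0, so
# KL(p || q) equals the cross entropy of p and q, which is just -log q(c|x).
def kl_divergence(p, q):
    mask = p > 0                   # terms with p = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

c = 0
p = np.array([1.0, 0.0, 0.0])      # degenerate "one-hot" distribution at c
q = np.array([0.6, 0.3, 0.1])      # model's predictive distribution

print(kl_divergence(p, q))         # 0.5108...
print(-np.log(q[c]))               # identical: negative log-probability of c
```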

cinch