Section 1.2.1.5 ("Uncertainty") of *Probabilistic Machine Learning: An Introduction* by Kevin P. Murphy says the following:
We can capture our uncertainty using the following conditional probability distribution: $$p(y = c \mid \mathbf{x}; \boldsymbol{\theta}) = f_c(\mathbf{x}; \boldsymbol{\theta}) \tag{1.7}$$ where $f: \mathcal{X} \to [0, 1]^C$ maps inputs to a probability distribution over the $C$ possible output labels. Since $f_c(\mathbf{x}; \boldsymbol{\theta})$ returns the probability of class label $c$, we require $0 \le f_c \le 1$ for each $c$, and $\sum_{c = 1}^C f_c = 1$. To avoid this restriction, it is common to instead require the model to return unnormalized log-probabilities. We can then convert these to probabilities using the softmax function, which is defined as follows: $$\text{softmax}(\mathbf{a}) \triangleq \left[ \dfrac{e^{a_1}}{\sum_{c^\prime = 1}^C e^{a_{c^\prime}}}, \dots, \dfrac{e^{a_C}}{\sum_{c^\prime = 1}^C e^{a_{c^\prime}}} \right] \tag{1.8}$$ This maps $\mathbb{R}^C$ to $[0, 1]^C$, and satisfies the constraints that $0 \le \text{softmax}(\mathbf{a})_c \le 1$ and $\sum_{c = 1}^C \text{softmax}(\mathbf{a})_c = 1$.
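To make Eq. (1.8) concrete, here is a minimal NumPy sketch of the softmax function; the example logits and the max-subtraction stability trick are my own additions, not from the book:

```python
import numpy as np

def softmax(a):
    """Map unnormalized log-probabilities in R^C to a probability
    vector in [0, 1]^C, as in Eq. (1.8). Subtracting the max is a
    standard numerical-stability trick; the shift cancels in the
    ratio, so the result matches the textbook definition."""
    a = np.asarray(a, dtype=float)
    z = np.exp(a - a.max())
    return z / z.sum()

# The model's raw outputs can be any real numbers...
logits = np.array([2.0, -1.0, 0.5])
probs = softmax(logits)

print(probs)        # approximately [0.786, 0.039, 0.175]
print(probs.sum())  # 1.0 (up to floating-point rounding)
```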
But why would we want to "avoid this restriction"? $f_c$ is a probability, so the restrictions $0 \le f_c \le 1$ for each $c$ and $\sum_{c = 1}^C f_c = 1$ are necessary by definition. Also, why would we instead require the model to return unnormalized log-probabilities, only to then use the softmax function to convert them back into probabilities?