Section 1.2.1.5 ("Uncertainty") of *Probabilistic Machine Learning: An Introduction* by Kevin P. Murphy says the following:
We can capture our uncertainty using the following conditional probability distribution: $$p(y = c \mid \mathbf{x}; \boldsymbol{\theta}) = f_c(\mathbf{x}; \boldsymbol{\theta}) \tag{1.7}$$ where $f: \mathcal{X} \to [0, 1]^C$ maps inputs to a probability distribution over the $C$ possible output labels. Since $f_c(\mathbf{x}; \boldsymbol{\theta})$ returns the probability of class label $c$, we require $0 \le f_c \le 1$ for each $c$, and $\sum_{c = 1}^C f_c = 1$. To avoid this restriction, it is common to instead require the model to return unnormalized log-probabilities. We can then convert these to probabilities using the softmax function, which is defined as follows: $$\text{softmax}(\mathbf{a}) \triangleq \left[ \dfrac{e^{a_1}}{\sum_{c^\prime = 1}^C e^{a_{c^\prime}}}, \dots, \dfrac{e^{a_C}}{\sum_{c^\prime = 1}^C e^{a_{c^\prime}}} \right] \tag{1.8}$$ This maps $\mathbb{R}^C$ to $[0, 1]^C$, and satisfies the constraints that $0 \le \text{softmax}(\mathbf{a})_c \le 1$ and $\sum_{c = 1}^C \text{softmax}(\mathbf{a})_c = 1$.
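To make Eq. (1.8) concrete, here is a minimal NumPy sketch of the softmax function; the example logits and the max-subtraction stability trick are my own additions, not from the book:

```python
import numpy as np

def softmax(a):
    """Map unnormalized log-probabilities in R^C to a probability
    vector in [0, 1]^C, as in Eq. (1.8). Subtracting the max is a
    standard numerical-stability trick; the shift cancels in the
    ratio, so the result matches the textbook definition."""
    a = np.asarray(a, dtype=float)
    z = np.exp(a - a.max())
    return z / z.sum()

# The model's raw outputs can be any real numbers...
logits = np.array([2.0, -1.0, 0.5])
probs = softmax(logits)

print(probs)        # approximately [0.786, 0.039, 0.175]
print(probs.sum())  # 1.0 (up to floating-point rounding)
```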
But why would we want to "avoid this restriction"? $f_c$ is a probability, so the restrictions $0 \le f_c \le 1$ for each $c$ and $\sum_{c = 1}^C f_c = 1$ are necessary by definition. Also, why would we instead require the model to return unnormalized log-probabilities, only to then use the softmax function to convert them back into probabilities?