For questions related to the softmax function, which is a function that takes as input a vector of K real numbers and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. The softmax is often used as the activation function of the output layer of a neural network.
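A minimal NumPy sketch of that definition (the function name and inputs are illustrative):

import numpy as np

def softmax(z):
    # Subtracting the max does not change the output: softmax is invariant
    # to adding a constant to every input, and it keeps exp() from overflowing.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
p = softmax(z)
print(p)        # probabilities proportional to exp(z)
print(p.sum())  # 1.0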
Questions tagged [softmax]
42 questions
22
votes
3 answers
Are softmax outputs of classifiers true probabilities?
BACKGROUND: The softmax function is the most common choice for an activation function for the last dense layer of a multiclass neural network classifier. The outputs of the softmax function have mathematical properties of probabilities and are--in…
Snehal Patel
- 1,037
- 1
- 4
- 27
7
votes
2 answers
Why do the TensorFlow docs discourage using softmax as activation for the last layer?
The beginner Colab example for TensorFlow states:
Note: It is possible to bake this tf.nn.softmax in as the activation function for the last layer of the network. While this can make the model output more directly interpretable, this approach is…
galah92
- 173
- 5
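One likely reason behind that note (a sketch of the numerics, not the docs' own explanation): a loss computed from raw logits can use the log-sum-exp trick, while taking the log of an already-softmaxed output can produce nan/-inf for extreme logits.

import numpy as np

z = np.array([1000.0, 0.0, -1000.0])  # extreme logits

# Naive: softmax first, then log -- exp() overflows, yielding nan and -inf.
p = np.exp(z) / np.exp(z).sum()
print(np.log(p))

# Stable: log-softmax straight from the logits via log-sum-exp.
m = np.max(z)
log_p = z - (m + np.log(np.exp(z - m).sum()))
print(log_p)  # finite: [0., -1000., -2000.]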
6
votes
1 answer
Which paper introduced the term "softmax"?
Nowadays, the softmax function is widely used in deep learning and, specifically, classification with neural networks. However, the origins of this term and function are almost never mentioned anywhere. So, which paper introduced this term?
nbro
- 42,615
- 12
- 119
- 217
5
votes
2 answers
What is the advantage of using cross entropy loss & softmax?
I am trying to do the standard MNIST dataset image recognition test with a standard feed forward NN, but my network failed pretty badly. Now I have debugged it quite a lot and found & fixed some errors, but I had a few more ideas. For one, I am…
Ben
- 455
- 3
- 11
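One frequently cited advantage (a sketch assuming one-hot targets): paired with softmax, the cross-entropy gradient with respect to the logits collapses to softmax(z) - y, which is cheap and well-behaved. A finite-difference check:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))  # y is one-hot

z = np.array([0.5, -1.2, 2.0])
y = np.array([0.0, 0.0, 1.0])

analytic = softmax(z) - y  # the closed-form gradient

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (cross_entropy(z + dz, y) - cross_entropy(z - dz, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True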
4
votes
1 answer
Why are policy gradient methods more effective in high-dimensional action spaces?
David Silver argues, in his Reinforcement Learning course, that policy-based reinforcement learning (RL) is more effective than value-based RL in high-dimensional action spaces. He points out that the implicit policy (e.g., $\epsilon$-greedy) in…
Saucy Goat
- 153
- 5
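A rough numeric illustration of Silver's point (my own sketch, not from the course): a greedy or $\epsilon$-greedy policy needs a max over every action, which grows combinatorially, while a factorized softmax policy only samples each action dimension.

import numpy as np

n_dims, n_choices = 6, 10
print(n_choices ** n_dims)  # 1,000,000 joint actions an argmax must scan

rng = np.random.default_rng(0)
logits = rng.normal(size=(n_dims, n_choices))  # stand-in for network outputs
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
action = [rng.choice(n_choices, p=p) for p in probs]
print(action)  # one sampled choice per dimension: O(dims * choices) work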
2
votes
1 answer
Why do we use the softmax instead of no activation function?
Why do we use the softmax activation function on the last layer?
Suppose $i$ is the index with the highest value (in the case where we don't use softmax at all). If we use softmax and take the $i$th value, it would still be the highest value because $e$ is…
dato nefaridze
- 882
- 10
- 22
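The asker's observation checks out directly: $e^x$ is strictly increasing, so softmax never changes the argmax. Its value lies elsewhere, in producing a distribution that losses like cross-entropy are defined on. A two-line check:

import numpy as np

z = np.array([0.3, 2.5, -1.0, 2.4])
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
print(np.argmax(z), np.argmax(p))  # same index (1): argmax is preserved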
2
votes
1 answer
Why are there two versions of softmax cross entropy? Which one to use in what situation?
I have seen 2 forms of softmax cross-entropy loss and am confused by the two. Which one is the right one?
For example in this Quora answer, there are 2 answers:
$L(\mathbf{w})=\frac{1}{N} \sum_{n=1}^{N} H\left(p_{n}, q_{n}\right)=-\frac{1}{N}…
Herbert
- 123
- 4
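For reference, a sketch of the averaged form quoted above, with $p_n$ the target distribution and $q_n$ the softmax output for sample $n$ (variable names are mine):

import numpy as np

def batch_cross_entropy(P, Q):
    # L = (1/N) * sum_n H(p_n, q_n), where H(p, q) = -sum_c p_c * log(q_c)
    return -np.mean(np.sum(P * np.log(Q), axis=1))

P = np.array([[0.0, 1.0], [1.0, 0.0]])  # one-hot targets, shape (N, C)
Q = np.array([[0.2, 0.8], [0.7, 0.3]])  # predicted distributions
print(batch_cross_entropy(P, Q))  # (1/2) * (-log 0.8 - log 0.7)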
2
votes
3 answers
What do the authors of this paper mean by the bias term in this picture of a neural network implementation?
I am reading a paper implementing a deep deterministic policy gradient algorithm for portfolio management. My question is about a specific neural network implementation they depict in this picture (paper, picture is on page 14).
The first three…
Mike
- 141
- 4
1
vote
1 answer
The scoring function of the policy
I was reading the book when I saw the formula to optimize $\theta$:
$$
\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta) \\
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T…
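The truncated expectation is the usual policy-gradient form; a minimal REINFORCE-style sketch with a softmax policy (illustrative only, not the book's code):

import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)  # one logit per action in a trivial one-state problem
alpha = 0.1

probs = np.exp(theta - theta.max()) / np.exp(theta - theta.max()).sum()
a = rng.choice(3, p=probs)
G = 1.0  # stand-in return for the sampled trajectory

# d/dtheta log softmax(theta)[a] = onehot(a) - probs
grad_log_pi = -probs
grad_log_pi[a] += 1.0
theta += alpha * grad_log_pi * G  # single-sample estimate of the expectation
print(theta)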
1
vote
3 answers
Why do softmax/sigmoid use base e instead of 2?
Performing -ln(ε) in NumPy returns relatively small values like this:
import numpy as np

print(-np.log(np.finfo(np.float32).eps))
print(-np.log(np.finfo(np.float64).eps))
Output:
15.942385
36.04365338911715
Compared to -log2(ε), which has a greater range compared to…
Muhammad Ikhwan Perwira
- 800
- 3
- 10
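Worth noting when comparing those ranges: logarithms in different bases differ only by a constant factor, $\log_2 x = \ln x / \ln 2$, so the choice of base rescales the values but changes nothing qualitatively:

import numpy as np

eps32 = np.finfo(np.float32).eps     # 2**-23
print(-np.log2(eps32))               # 23.0
print(-np.log(eps32) / np.log(2))    # identical, by change of base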
1
vote
1 answer
What is it called when a model asks for validation?
As we know, a classifier's output is just a bunch of probabilities, commonly coming from logits or a softmax output. Performing $\arg\max$ to get the class the model favors most discards some information, such as the distribution of…
Muhammad Ikhwan Perwira
- 800
- 3
- 10
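One way to keep that information, whatever the technique is called (an illustrative sketch, not an answer to the question): use the full softmax distribution, e.g. its entropy or top probability, to decide when to abstain.

import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))  # small constant avoids log(0)

confident = np.array([0.95, 0.03, 0.02])
uncertain = np.array([0.40, 0.35, 0.25])

for p in (confident, uncertain):
    label = np.argmax(p) if p.max() > 0.9 else None  # None = ask for validation
    print(entropy(p), label)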
1
vote
2 answers
One Softmax or two separate logistic regressions for the task of classifying pictures as a/b and c/d
Simply put, question 11 in chapter 4 of Aurélien Géron's book "Hands-on Machine Learning" asks:
Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two logistic regression classifiers or one…
Dimitri
- 33
- 6
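The crux, sketched with made-up numbers: outdoor/indoor and daytime/nighttime are two independent binary attributes, so two sigmoid (logistic) outputs model them directly, whereas one softmax forces a single pick among the four combinations.

import numpy as np

attr_logits = np.array([1.2, -0.4])       # [outdoor, daytime] scores
print(1 / (1 + np.exp(-attr_logits)))     # two independent probabilities

joint_logits = np.array([0.5, 1.0, -0.2, 0.1])  # scores for the 4 combinations
p = np.exp(joint_logits) / np.exp(joint_logits).sum()
print(p, p.sum())                         # one distribution over 4 classes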
1
vote
1 answer
Since $f_c$ returns the probability of class label $c$, we require $0 \le f_c \le 1$ for each $c$, and $\sum_{c = 1}^C f_c = 1$. Why avoid this?
Section 1.2.1.5 (Uncertainty) of Probabilistic Machine Learning: An Introduction by Kevin P. Murphy says the following:
We can capture our uncertainty using the following conditional probability distribution:
$$p(y = c \mid \mathbf{x};…
The Pointer
- 611
- 5
- 22
1
vote
1 answer
Is Softmax Necessary as the Activation Function for Self-Attention Mechanisms?
I’m curious about the mathematical reasoning behind the use of the softmax function as the activation function in self-attention mechanisms within neural networks. Specifically, I’m interested in understanding if there is a theoretical basis that…
Kasia
- 303
- 2
- 9
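For context, the slot softmax fills in scaled dot-product attention, $\mathrm{softmax}(QK^\top/\sqrt{d_k})V$, as a plain NumPy sketch:

import numpy as np

rng = np.random.default_rng(0)
n, d_k = 4, 8                                # sequence length, key dimension
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d_k)              # (n, n) similarity scores
scores -= scores.max(axis=1, keepdims=True)  # stabilize the exponentials
W = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(W.sum(axis=1))                         # each row is a distribution: all 1.0

out = W @ V                                  # convex combination of the values
print(out.shape)                             # (4, 8)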
1
vote
1 answer
Dealing with noise in models with softmax output
I have a device with an accelerometer and gyroscope (6-axis). The device sends live raw telemetry data to the model: 40 samples per input, 6 values per sample (accelerometer xyz, gyroscope xyz). The model predicts between 12 different labels of…
Sterling Duchess
- 113
- 3
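One common mitigation for that kind of jitter (a generic sketch, not specific to this device): average the softmax outputs over a short window before taking the argmax.

import numpy as np

rng = np.random.default_rng(0)
n_classes, window = 12, 5

# Stand-in for five consecutive noisy softmax outputs from the model
preds = rng.dirichlet(np.ones(n_classes), size=window)

smoothed = preds.mean(axis=0)  # average distribution over the window
print(np.argmax(smoothed))     # steadier than a per-frame argmax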