I really like the way Michael Nielsen describes backpropagation in his book. I won't explain every detail of these formulas here; for that, it's much better to read his chapter on backpropagation.
The formulas are the following:
\begin{align*}
(1) \quad & \delta^L = \nabla_a C \odot \sigma'(z^L) \\
(2) \quad & \delta^l = ((w^{l+1})^T\delta^{l+1}) \odot \sigma'(z^l)\\
(3) \quad & \frac{\partial C}{\partial b^l_j}= \delta^l_j \\
(4) \quad & \frac{\partial C}{\partial w^l_{jk}}= a_k^{l-1} \delta^l_j \\
\end{align*}
Based on these formulas, the derivative of the cost with respect to a bias $b^l_j$ is simply $\delta^l_j$, the error of that neuron, where $\delta^l_j$ is defined as the derivative of the cost with respect to the neuron's weighted input $z^l_j$. So to answer your question: no, it doesn't use the sum of all the deltas of the layer. That would be a constraint we wouldn't want for the network, because it would mean every bias in a layer gets the same gradient and therefore the same update. We want all the parameters to be able to change independently of each other, which is why each bias only uses its own neuron's delta.
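To make that concrete, here is a minimal backward pass in the spirit of Nielsen's Python code (the variable names and the tiny 2-3-1 network are my own illustration, not his exact implementation). It computes a separate $\delta^l_j$ for each neuron and uses it directly as that neuron's bias gradient, following equations (1)-(4):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(weights, biases, x, y):
    """Return (grad_b, grad_w) for a single example (x, y)."""
    # Forward pass: store all weighted inputs z^l and activations a^l.
    activation, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)

    grad_b = [np.zeros(b.shape) for b in biases]
    grad_w = [np.zeros(w.shape) for w in weights]

    # (1) delta^L = nabla_a C * sigma'(z^L), elementwise;
    # for a quadratic cost, nabla_a C = a^L - y.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b[-1] = delta                                # (3): one delta per bias
    grad_w[-1] = np.dot(delta, activations[-2].T)     # (4): a_k^{l-1} * delta_j^l

    # (2) propagate the error backwards through the hidden layers.
    for l in range(2, len(weights) + 1):
        delta = np.dot(weights[-l + 1].T, delta) * sigmoid_prime(zs[-l])
        grad_b[-l] = delta                            # each bias gets its own neuron's delta
        grad_w[-l] = np.dot(delta, activations[-l - 1].T)
    return grad_b, grad_w

# Tiny 2-3-1 network with random parameters, just to show the shapes line up.
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]
grad_b, grad_w = backprop(weights, biases, x=np.array([[0.5], [0.2]]), y=np.array([[1.0]]))
```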
Where a sum of gradients does appear is when performing gradient descent over multiple data points. Because the cost function is typically a sum (or mean) of the per-example errors, the gradient we need is the gradient of that summation.
Let's say we are using the mean squared error cost function:
$$
C = \frac{1}{M}\sum_{i = 1}^{M} (y'_i - y_i)^2
$$
When differentiating with respect to one output $y'_j$ of the last layer we get:
$$
\frac{\partial C}{\partial y'_j} = \frac{\partial}{\partial y'_j} \left( \frac{1}{M} \sum_{i = 1}^{M} (y'_i - y_i)^2 \right)
$$
But because we can exchange the derivative of a summation with the summation of the derivatives (differentiation is a linear operation), we can calculate the derivative of each term first and then do the summation:
$$
\frac{\partial C}{\partial y'_j} = \frac{1}{M}\sum_{i = 1}^{M} \frac{\partial}{\partial y'_j} (y'_i - y_i)^2
$$
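If it helps to see that swap concretely, here is a quick numeric sanity check (the numbers and names are made up purely for illustration): differentiating the averaged cost as a whole, via finite differences, matches averaging the per-term derivatives.

```python
import numpy as np

rng = np.random.default_rng(1)
y_pred = rng.standard_normal(5)   # the y'_i
y_true = rng.standard_normal(5)   # the y_i
M = len(y_pred)
eps = 1e-6

def cost(pred):
    # C = (1/M) sum_i (y'_i - y_i)^2
    return np.mean((pred - y_true) ** 2)

# Finite-difference gradient of the whole cost C with respect to each y'_j.
numeric_grad = np.zeros(M)
for j in range(M):
    bumped = y_pred.copy()
    bumped[j] += eps
    numeric_grad[j] = (cost(bumped) - cost(y_pred)) / eps

# Swapped form: each term (y'_i - y_i)^2 only depends on y'_i, so the sum of
# per-term derivatives collapses to (2/M)(y'_j - y_j).
analytic_grad = (2.0 / M) * (y_pred - y_true)

assert np.allclose(numeric_grad, analytic_grad, atol=1e-4)
```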
If you want another resource to see how backpropagation can be implemented, you can read the following code from the same author I mentioned before. It implements exactly what I just described: the gradients with respect to all the weights and biases are computed per data point first, and then, for the SGD update, those gradients are summed over all the data points in the mini-batch.
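Here is a sketch of that update pattern (my own paraphrase of the idea, not the author's code, reusing the `backprop` function from the sketch above): compute the gradient of every example in the mini-batch, sum them, then take one gradient-descent step.

```python
def update_mini_batch(weights, biases, mini_batch, eta):
    """One SGD step from a list of (x, y) pairs, with learning rate eta."""
    sum_b = [np.zeros(b.shape) for b in biases]
    sum_w = [np.zeros(w.shape) for w in weights]
    for x, y in mini_batch:
        grad_b, grad_w = backprop(weights, biases, x, y)    # per-example gradients
        sum_b = [sb + gb for sb, gb in zip(sum_b, grad_b)]  # summed over the batch
        sum_w = [sw + gw for sw, gw in zip(sum_w, grad_w)]
    # Step with the averaged gradient (dividing by the batch size gives the mean).
    new_w = [w - (eta / len(mini_batch)) * sw for w, sw in zip(weights, sum_w)]
    new_b = [b - (eta / len(mini_batch)) * sb for b, sb in zip(biases, sum_b)]
    return new_w, new_b
```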