For questions about mini-batch (or batch) gradient descent, a variant of gradient descent in which each parameter update is computed from a batch of typically more than one input-label pair.
Questions tagged [mini-batch-gradient-descent]
24 questions
12
votes
2 answers
What exactly is averaged when doing batch gradient descent?
I have a question about how the averaging works when doing mini-batch gradient descent.
I think I now understand the general gradient descent algorithm, but only for online learning. When doing mini-batch gradient descent, do I have to:
forward…
Ben
- 455
- 3
- 11
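A minimal sketch of what the averaging usually amounts to (a linear model with MSE loss is assumed here purely for illustration; it is not from the question): you forward the whole mini-batch, and the averaged per-sample gradients equal the gradient of the mean loss, giving one parameter update per batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))   # mini-batch of 32 samples, 5 features
y = rng.normal(size=(32,))     # targets
w = np.zeros(5)
lr = 0.1

pred = X @ w                   # forward pass for the whole batch
err = pred - y                 # per-sample errors, shape (32,)
grad = X.T @ err / len(X)      # average of the per-sample gradients, shape (5,)
w -= lr * grad                 # a single parameter update for the batch
```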
10
votes
2 answers
Is neural network training done one-by-one?
I'm trying to learn neural networks by watching this series of videos and implementing a simple neural network in Python.
Here's one of the things I'm wondering about: I'm training the neural network on sample data, and I've got 1,000 samples. The…
Ram Rachum
- 260
- 1
- 11
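For contrast, here is a hedged sketch (a toy linear model, not the video's network) of the two ways the 1,000 samples could be consumed: one update per sample (online/stochastic) versus one update per mini-batch.

```python
import numpy as np

rng = np.random.default_rng(1)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=(1000,))
lr = 0.01

# (a) one-by-one: one parameter update per sample
w = np.zeros(5)
for xi, yi in zip(X, y):
    w -= lr * xi * (xi @ w - yi)                   # gradient for a single sample

# (b) mini-batches: one parameter update per batch of 50 samples
w = np.zeros(5)
for start in range(0, len(X), 50):
    Xb, yb = X[start:start + 50], y[start:start + 50]
    w -= lr * Xb.T @ (Xb @ w - yb) / len(Xb)       # averaged gradient, one update
```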
10
votes
1 answer
Is back-propagation applied for each data point or for a batch of data points?
I am new to deep learning and trying to understand the concept of back-propagation. I have a question about when back-propagation is applied. Assume that I have a training data set of 1000 images of handwritten letters.
Is back-propagation…
Maanu
- 245
- 1
- 2
- 7
3
votes
1 answer
When using experience replay, do we update the parameters for all samples of the mini-batch or for each sample in the mini-batch separately?
I've been reading Google's DeepMind Atari paper and I'm trying to understand how to implement experience replay.
Do we update the parameters $\theta$ of function $Q$ once for all the samples of the minibatch, or do we do that for each sample of the…
user491626
- 241
- 1
- 5
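A hedged sketch of the usual DQN-style answer (names like `q_net` and `replay_buffer` are assumptions, not from the paper, and the Nature version would use a separate target network): the TD loss is averaged over the sampled mini-batch and the parameters $\theta$ are updated once for all samples, not once per sample.

```python
import random
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99

# replay_buffer: list of (state, action, reward, next_state, done) tuples
replay_buffer = [(torch.randn(4), random.randrange(2), random.random(),
                  torch.randn(4), False) for _ in range(1000)]

batch = random.sample(replay_buffer, 32)
states = torch.stack([b[0] for b in batch])
actions = torch.tensor([b[1] for b in batch])
rewards = torch.tensor([b[2] for b in batch], dtype=torch.float32)
next_states = torch.stack([b[3] for b in batch])
dones = torch.tensor([b[4] for b in batch], dtype=torch.float32)

with torch.no_grad():                 # TD targets computed without tracking gradients
    targets = rewards + gamma * (1 - dones) * q_net(next_states).max(dim=1).values
q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_values, targets)   # averaged over the mini-batch

optimizer.zero_grad()
loss.backward()
optimizer.step()                      # ONE update of theta for all 32 samples
```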
3
votes
3 answers
What is the difference between batch and mini-batch gradient descent?
I am learning deep learning from Andrew Ng's tutorial Mini-batch Gradient Descent.
Can anyone explain the similarities and dissimilarities between batch GD and mini-batch GD?
DRV
- 1,843
- 3
- 15
- 20
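A minimal sketch of the distinction (a toy linear model, not Andrew Ng's code): batch gradient descent uses all samples for every update, so there is one update per epoch; mini-batch gradient descent uses a small subset per update, so it makes many more (noisier) updates per pass over the data.

```python
import numpy as np

rng = np.random.default_rng(3)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=(1000,))
lr = 0.05

# batch gradient descent: one update per epoch, gradient over all 1000 samples
w = np.zeros(5)
for epoch in range(100):
    w -= lr * X.T @ (X @ w - y) / len(X)

# mini-batch gradient descent: one update per batch of 64, many updates per epoch
w = np.zeros(5)
for epoch in range(100):
    for start in range(0, len(X), 64):
        Xb, yb = X[start:start + 64], y[start:start + 64]
        w -= lr * Xb.T @ (Xb @ w - yb) / len(Xb)
```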
2
votes
2 answers
What's the rationale behind mini-batch gradient descent?
I am reading a book that states
As the mini-batch size increases, the gradient computed is closer to the 'true' gradient
So, I assume that they are saying that mini-batch training only focuses on decreasing the cost function in a certain 'plane',…
ngc1300
- 133
- 5
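A quick numerical check of the book's claim (a sketch, not a proof, with an arbitrary linear model): the mini-batch gradient is an average of per-sample gradients, so its spread around the full-batch ("true") gradient shrinks as the batch size grows.

```python
import numpy as np

rng = np.random.default_rng(2)
X, y = rng.normal(size=(10000, 5)), rng.normal(size=(10000,))
w = rng.normal(size=5)

per_sample = X * (X @ w - y)[:, None]       # per-sample gradients, shape (N, 5)
true_grad = per_sample.mean(axis=0)         # full-batch ("true") gradient

for m in (1, 10, 100, 1000):
    idx = rng.choice(len(X), size=(500, m))  # 500 random mini-batches of size m
    est = per_sample[idx].mean(axis=1)       # mini-batch gradient estimates
    rmse = np.sqrt(((est - true_grad) ** 2).sum(axis=1)).mean()
    print(f"batch size {m:4d}: mean distance to true gradient = {rmse:.3f}")
```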
2
votes
1 answer
Is it possible to use Mini-Batches with Adam optimization?
Is it possible/advisable to use mini-batch-like accumulation with Adam optimization?
How would that work?
Do I accumulate the loss function for each sample in the batch and then run Adam, or should I divide the loss by the number of samples in the batch…
CoffeDeveloper
- 291
- 1
- 7
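A hedged sketch of the common pattern (plain PyTorch, not tied to any particular framework recipe): accumulate gradients over several small batches, scaling each loss by 1/accum_steps so the accumulated gradient matches one large averaged batch, then take a single Adam step.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
accum_steps = 4

optimizer.zero_grad()
for step in range(accum_steps):
    x, y = torch.randn(8, 10), torch.randn(8, 1)   # one small batch
    loss = loss_fn(model(x), y) / accum_steps      # divide; don't just sum the losses
    loss.backward()                                # gradients accumulate in .grad
optimizer.step()                                   # one Adam update for all 4 batches
optimizer.zero_grad()
```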
2
votes
1 answer
Why is it called "batch" gradient descent if it consumes the full dataset before calculating the gradient?
While training a neural network, we can follow three methods: batch gradient descent, mini-batch gradient descent and stochastic gradient descent.
For this question, assume that your dataset has $n$ training samples and we divided it into $k$…
hanugm
- 4,102
- 3
- 29
- 63
2
votes
1 answer
When is the loss calculated, and when does the back-propagation take place?
I have read different articles and keep getting confused about this point. I'm not sure whether the literature is giving mixed information or I'm interpreting it incorrectly.
So, from reading articles, my (loose) understanding of the following terms is as…
Hazzaldo
- 309
- 3
- 9
1
vote
1 answer
How to normalize the gradient value with respect to the batch size?
A = (m x n) - input
B = (n x k) - weight
output = A @ B = (m x k)
outputloss = (m x k)
doutput/dB = A.T @ outputloss = (n x m) @ (m x k) = (n x k)
So, as we can see, m (the batch size) is the inner dimension and it drops out in the product. But the value of the gradient is…
Тима
- 39
- 4
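A sketch following the shapes in the question (A: m x n input, B: n x k weight; the MSE loss here is an assumption for illustration). Whether you divide dB by the batch size m depends on how the loss is defined: if the loss is a mean over the batch, the 1/m factor is already inside the output gradient; if it is a sum, you scale the accumulated gradient by 1/m yourself.

```python
import numpy as np

m, n, k = 32, 5, 3
A = np.random.randn(m, n)          # input
B = np.random.randn(n, k)          # weight
target = np.random.randn(m, k)

output = A @ B                      # (m, k)
# mean-squared-error averaged over the batch:
output_grad = 2 * (output - target) / m   # dLoss/dOutput; the 1/m lives here
dB = A.T @ output_grad                    # (n, k): m is summed out, already scaled
```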
1
vote
1 answer
Why use gradient accumulation?
I know that gradient accumulation is (1) a way to reduce memory usage while still training with a large effective batch size, and (2) a way to reduce the noise of the gradient compared to SGD, thus smoothing the training process.
However, I wonder what…
Cyrus
- 111
- 2
1
vote
1 answer
What is the order of execution of steps in back-propagation algorithm in a neural network?
I am a machine learning newbie. I am trying to understand the back-propagation algorithm. I have a training dataset of 60 instances/records.
What is the correct order of the process? This one?
Forward pass of the first instance. Calculate the…
gokul
- 53
- 5
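A sketch of the usual per-mini-batch ordering for a 60-instance dataset (the model, batch size, and learning rate are assumptions, not from the question): forward pass, then loss, then backward pass, then one parameter update, repeated batch by batch and epoch by epoch.

```python
import numpy as np

X, y = np.random.randn(60, 4), np.random.randn(60, 1)
W = np.random.randn(4, 1)
lr, batch_size = 0.01, 10

for epoch in range(5):
    for start in range(0, len(X), batch_size):
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        pred = Xb @ W                              # 1) forward pass for the batch
        loss = ((pred - yb) ** 2).mean()           # 2) compute the (averaged) loss
        grad = 2 * Xb.T @ (pred - yb) / len(Xb)    # 3) backward pass (gradient of the loss)
        W -= lr * grad                             # 4) update the parameters once per batch
```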
1
vote
0 answers
Why would one prefer the gradient of the sum rather than the sum of the gradients?
When gradients are aggregated over mini-batches, I sometimes see formulations like this, e.g., in the "Deep Learning" book by Goodfellow et al.
$$\mathbf{g} = \frac{1}{m} \nabla_{\mathbf{w}} \left( \sum\limits_{i=1}^{m} L \left( f \left(…
Eddie C
- 11
- 1
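A small check (a sketch, not from the book, with an arbitrary per-sample loss standing in for $L(f(\cdot))$): by linearity of the gradient, the gradient of the sum equals the sum of the per-sample gradients, so the two formulations differ only in how the computation is grouped, not in the result.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
xs = [torch.randn(3) for _ in range(4)]

def per_sample_loss(x, w):
    return (x @ w) ** 2                 # stand-in for L(f(x; w))

# gradient of the (averaged) sum
loss_sum = sum(per_sample_loss(x, w) for x in xs) / len(xs)
g1 = torch.autograd.grad(loss_sum, w)[0]

# (averaged) sum of the per-sample gradients
g2 = sum(torch.autograd.grad(per_sample_loss(x, w), w)[0] for x in xs) / len(xs)

print(torch.allclose(g1, g2))           # True
```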
1
vote
1 answer
Is it possible to use stochastic gradient descent at the beginning, then switch to batch gradient descent with only a few training examples?
Batch gradient descent is extremely slow for large datasets, but it can find the lowest possible value for the cost function. Stochastic gradient descent is relatively fast, but it kind of finds the general area where convergence happens and it kind…
Adith Raghav
- 121
- 3
1
vote
2 answers
When would it make sense to perform a gradient descent step for each term of a loss function with multiple terms?
I am training a neural network using a mini-batch gradient descent algorithm.
Now, consider the following loss function, which is composed of 2 terms.
$$L = L_{\text{MSE}} + L_{\text{regularization}} \label{1}\tag{1}$$
As far as I understand,…
hanugm
- 4,102
- 3
- 29
- 63
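A hedged sketch contrasting the two options for $L = L_{\text{MSE}} + L_{\text{regularization}}$ (an illustration with an assumed toy model, not a recommendation from the question or its answers): the usual choice is one step on the combined loss, since the gradients of the terms simply add; taking a separate step per term is not equivalent, because the second step sees parameters already moved by the first.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(16, 10), torch.randn(16, 1)

def l_mse():
    return nn.functional.mse_loss(model(x), y)

def l_reg():
    return 1e-4 * sum(p.pow(2).sum() for p in model.parameters())

# Usual choice: one step on the combined loss (term gradients simply add).
opt.zero_grad()
(l_mse() + l_reg()).backward()
opt.step()

# Alternative: a separate step per term; the second step uses parameters
# already moved by the first, so this is not equivalent to the combined step.
opt.zero_grad()
l_mse().backward()
opt.step()
opt.zero_grad()
l_reg().backward()
opt.step()
```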