
I am reading a book that states:

As the mini-batch size increases, the gradient computed is closer to the 'true' gradient

So, I assume that they are saying that mini-batch training only focuses on decreasing the cost function in a certain 'plane', sacrificing accuracy for speed. Is that correct?
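For concreteness, here is a small numpy sketch I put together (the dataset, the squared-error loss, and all the numbers are a made-up toy example, not from the book) that seems to illustrate the statement: gradients averaged over larger mini-batches land closer to the gradient computed over the whole dataset.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0, -1.0]) + rng.normal(scale=0.1, size=10_000)
    w = np.zeros(5)  # an arbitrary point in parameter space

    def grad(Xb, yb, w):
        """Gradient of the mean squared error at w over the batch (Xb, yb)."""
        return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

    full_grad = grad(X, y, w)  # the 'true' gradient over all 10,000 examples

    for batch_size in (1, 10, 100, 1000):
        # average distance to the full gradient over 100 random mini-batches
        dists = []
        for _ in range(100):
            idx = rng.choice(len(y), size=batch_size, replace=False)
            dists.append(np.linalg.norm(grad(X[idx], y[idx], w) - full_grad))
        print(f"batch size {batch_size:>4}: mean distance to full gradient = {np.mean(dists):.3f}")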

ngc1300

2 Answers


The basic idea behind mini-batch training is rooted in the exploration / exploitation tradeoff in local search and optimization algorithms.

You can view training an ANN as a local search through the space of possible parameter values. The most common search method is to move all the parameters in the direction that reduces the error the most (gradient descent).

However, the error surface over an ANN's parameter space is usually not smooth: there are many shallow local optima. Following the gradient computed over the entire dataset will usually cause the search to become trapped in one of these optima, preventing convergence to a good solution.

Stochastic gradient descent addresses this problem in much the same way as older algorithms like simulated annealing: you can escape from a shallow local optimum because, with high probability, you will eventually pick a sequence of single-example updates that "bubbles" you out of it. The problem is that you will also tend to waste a lot of time moving in the wrong direction.

Mini-batch training sits between these two extremes. Basically, you average the gradient across enough examples that you still have a useful global error signal, but not so many that you get trapped in a shallow local optimum for long.
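To make that concrete, here is a minimal numpy sketch (the linear model, function names, and hyperparameters are my own, purely for illustration) of a mini-batch update. Setting batch_size to 1 recovers stochastic gradient descent, while setting it to the size of the dataset recovers full-batch gradient descent.

    import numpy as np

    def minibatch_gd(X, y, w, lr=0.05, batch_size=32, epochs=20, seed=0):
        """Mini-batch gradient descent on mean squared error for a linear model."""
        rng = np.random.default_rng(seed)
        n = len(y)
        for _ in range(epochs):
            order = rng.permutation(n)                 # reshuffle every epoch
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                Xb, yb = X[idx], y[idx]
                # Gradient averaged over the mini-batch: a noisy estimate of the
                # full gradient, with less noise as batch_size grows.
                g = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
                w = w - lr * g
        return w

    # Toy usage: batch_size=1 is stochastic GD, batch_size=len(y) is full-batch GD.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 3))
    y = X @ np.array([2.0, -1.0, 0.5])
    print(minibatch_gd(X, y, np.zeros(3), batch_size=32))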

Recent research by Masters and Luschi suggests that, in fact, most of the time you'd want to use smaller batch sizes than is common practice. If you tune the learning rate carefully enough, you can use a large batch size to finish training faster, but the difficulty of picking the right learning rate grows with the batch size.

John Doucette

It's like you have a class of 1000 children, and you, as the teacher, want all of them to learn something at the same time. This is difficult because they are not all the same: they have different adaptability and reasoning strengths. So you can choose between alternative strategies for the same task.

1) Take one child at a time and train them. This is a good approach, but it will take a long time. Here your batch size is 1.

2) Take a group of 10 children and train them together. This can be a good compromise between time and learning: in a smaller group, you can still handle the naughty ones. Here your batch size is 10.

3) Take all 1000 children and teach them at once. It will take very little time, but you will not be able to give proper attention to the mischievous ones. Here your batch size is 1000.

It is the same with machine learning: choose a reasonable batch size and tune the weights accordingly. I hope this analogy clears up your doubt.
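As a rough sketch of the same idea in code (my own toy numbers, just mirroring the analogy above, not a recommendation):

    # With 1000 training examples, the batch size decides how many weight updates
    # you make in one pass over the data: many noisy updates, a moderate number,
    # or a single smooth one.
    n_examples = 1000
    for batch_size in (1, 10, 1000):   # one child, a group of 10, the whole class
        updates_per_epoch = n_examples // batch_size
        print(f"batch size {batch_size:>4}: {updates_per_epoch:>4} updates per epoch")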

Patel Sunil