
In Deep Learning by Goodfellow et al., I came across the following line in the chapter on Stochastic Gradient Descent (p. 287):

The main question is how to set $\epsilon_0$. If it is too large, the learning curve will show violent oscillations, with the cost function often increasing significantly.

I'm slightly confused why the loss function would increase at all. My understanding of gradient descent is that given parameters $\theta$ and a loss function $\ell (\vec{\theta})$, the gradient update is performed as follows:

$$\vec{\theta}_{t+1} = \vec{\theta}_{t} - \epsilon \nabla_{\vec{\theta}}\ell (\vec{\theta}_{t})$$

The loss function is guaranteed to monotonically decrease because the parameters are updated in the negative direction of the gradient. I would assume the same holds for SGD, but clearly it doesn't. With a high learning rate $\epsilon$, how would the loss function increase in its value? Is my interpretation incorrect, or does SGD have different theoretical guarantees than vanilla gradient descent?
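
For concreteness, here is a minimal Python sketch of the plain gradient-descent loop I am picturing (the quadratic loss, the step size, and the starting point are placeholder choices I picked for illustration, not from the book):

```python
# Minimal sketch of the vanilla gradient-descent update described above.
# The loss, step size, and starting point are illustrative choices only.

def loss(theta):
    return theta ** 2          # example loss: l(theta) = theta^2


def grad(theta):
    return 2 * theta           # its gradient: dl/dtheta = 2*theta


epsilon = 0.1                  # learning rate
theta = 3.0                    # initial parameter theta_0

for t in range(10):
    theta -= epsilon * grad(theta)   # theta_{t+1} = theta_t - eps * grad(theta_t)
    print(f"step {t + 1}: theta = {theta:.4f}, loss = {loss(theta):.4f}")
```

With this small step size the printed loss decreases at every step, which matches the intuition above.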

1 Answer


This happens because the loss is not guaranteed to decrease monotonically just because the parameters are updated in the negative gradient direction. For example:

Let $L(\theta) = \theta^2$ and $\theta_0 = 3$.

Let the subscript $n$ in $\theta_n$ denote the iteration number.

Then $\nabla_{\theta}L(\theta_0) = 2*\theta_0 = 2*3 = 6$.

For the loss to decrease in this case, $\epsilon < 1$ needs to hold, so let's look at what happens when it doesn't.

If $\epsilon = 1$, then the loss gets stuck at 9, as the update gives $\theta_1 = \theta_0 - 1*\nabla_{\theta}L(\theta_0) = 3 - 6 = -3$

$\theta_2 = \theta_1 - 1*\nabla_{\theta}L(\theta_1) = -3 - (-6) = 3$

$\theta_3 = \theta_2 - 1*\nabla_{\theta}L(\theta_2) = 3 - 6 = -3$

$L(\theta_1) = L(\theta_2) = L(\theta_3) = 9$, as $L(-3) = (-3)^2 = 9$ and $L(3) = 3^2 = 9$.

So if $\epsilon = 1$, then $\theta$ will oscillate back and forth between $3$ and $-3$, which gives a loss that is stuck at $9$.
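
A quick numerical check (a minimal Python sketch using the same $L(\theta) = \theta^2$ and $\theta_0 = 3$) confirms this:

```python
# Sketch: gradient descent on L(theta) = theta^2 with epsilon = 1.
# theta flips sign every step, so the loss never moves from 9.

def grad(theta):
    return 2 * theta


theta = 3.0
epsilon = 1.0
for n in range(1, 5):
    theta -= epsilon * grad(theta)
    print(f"theta_{n} = {theta}, L(theta_{n}) = {theta ** 2}")
# Output: theta alternates -3.0, 3.0, -3.0, 3.0 with L = 9.0 each time.
```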

Now let's look at what happens if $\epsilon > 1$, for example $\epsilon = 2$.

Note that $\nabla_{\theta}L(\theta) = 2*\theta$, as $L(\theta) = \theta^2$.

$\theta_1 = \theta_0 - 2*\nabla_{\theta}L(\theta_0) = 3 -2*\nabla_{\theta}L(\theta_0) = 3 - 2 * (2*3) = 3 - 12 = -9$

$\theta_2 = \theta_1 - 2*\nabla_{\theta}L(\theta_1) = -9 - 2*\nabla_{\theta}L(\theta_1) = -9 - 2*(2*(-9)) = -9 + 36 = 27$

$\theta_3 = \theta_2 - 2*\nabla_{\theta}L(\theta_2) = 27 - 2 *\nabla_{\theta}L(\theta_2) = 27 -2 * (2*27) = 27 -108 = -81 $

In turn, the loss increases after each update:

$L(\theta_1) = (-9)^2 = 81$

$L(\theta_2) = 27^2 = 729$

$L(\theta_3) = (-81)^2 = 6561$
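
Putting the cases side by side, the sketch below (again assuming $L(\theta) = \theta^2$ and $\theta_0 = 3$) runs four updates for $\epsilon = 0.1$, $1$, and $2$: the loss decreases, stays flat, and grows without bound, respectively, which is exactly the kind of behaviour the quoted passage warns about when $\epsilon$ is too large.

```python
# Sketch: the same quadratic example for three learning rates.
#   epsilon < 1 -> loss decreases each step,
#   epsilon = 1 -> theta oscillates between +3 and -3, loss stays at 9,
#   epsilon > 1 -> loss grows every step (divergence).

def grad(theta):
    return 2 * theta


for epsilon in (0.1, 1.0, 2.0):
    theta = 3.0
    losses = []
    for _ in range(4):
        theta -= epsilon * grad(theta)
        losses.append(round(theta ** 2, 4))
    print(f"epsilon = {epsilon}: losses = {losses}")
# epsilon = 0.1: losses = [5.76, 3.6864, 2.3593, 1.5099]   (decreasing)
# epsilon = 1.0: losses = [9.0, 9.0, 9.0, 9.0]             (flat)
# epsilon = 2.0: losses = [81.0, 729.0, 6561.0, 59049.0]   (increasing)
```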
