
I am currently following course 1 of Andrew Ng's Machine Learning Specialization. I understand that we need a convex cost function to guarantee reaching the global minimum of the loss. But that means the gradient will shrink and approach 0 on its own, as long as we choose a convex function.

Then what is the need for the learning rate?

My intuition tells me it's an additional measure to give us more control over training the model, but I want a proper answer and evidence to solidify my understanding.

nbro

3 Answers


In short, there are two major reasons:

  1. The optimization landscape in parameter space is non-convex even with a convex loss function (e.g., MSE). Therefore, you need to take small update steps (i.e., the gradient scaled by the learning rate) to find a suitable local minimum and avoid divergence.

  2. The gradient is estimated on a batch of samples, which does not represent the full "population" of the data, so to speak. Even with batch gradient descent, which uses the full dataset at each step, the resulting gradient does not point in the direction of the true gradient of the underlying data distribution. So you need to introduce a step size (i.e., the learning rate); see the sketch after this list. Moreover, at least in principle, it is possible to correct the gradient direction by including second-order information (e.g., the Hessian of the loss w.r.t. the parameters), although this is usually infeasible to compute.
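To make the second point concrete, here is a minimal sketch of a single mini-batch gradient step for linear regression with an MSE loss (all function and variable names are illustrative, not from any particular library or course): the gradient is only an estimate computed on a batch, and the learning rate keeps each update small.

```python
import numpy as np

# Illustrative sketch: one mini-batch gradient step for linear regression
# with an MSE loss. All names are made up for this example.
def sgd_step(w, X_batch, y_batch, learning_rate):
    """Update the weights w using the gradient estimated on one mini-batch."""
    preds = X_batch @ w
    # Gradient of the MSE loss w.r.t. w, estimated on this batch only:
    # in general it does not point exactly along the true gradient.
    grad = 2 * X_batch.T @ (preds - y_batch) / len(y_batch)
    # The learning rate scales the step, so a noisy gradient estimate
    # cannot move the parameters too far in a single update.
    return w - learning_rate * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w = np.zeros(3)
for i in range(0, 100, 10):  # mini-batches of 10 samples each
    w = sgd_step(w, X[i:i+10], y[i:i+10], learning_rate=0.1)
```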

Luca Anzalone

Indeed, your intuition about the learning rate giving you additional control in training the model is right. The precise term for this somewhat vague "additional control", even in a convex loss landscape, is hyperparameter (or tuning parameter): a value that specifies details of the learning process, in contrast to the parameters that determine the model itself. Examples include the learning rate, momentum (as in Adam or Nesterov), batch size, and regularization strength. Specifically, the linked reference clarifies the necessity of the learning rate hyperparameter as follows.

Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns"... In setting a learning rate, there is a trade-off between the rate of convergence and overshooting. While the descent direction is usually determined from the gradient of the loss function, the learning rate determines how big a step is taken in that direction... In order to achieve faster convergence, prevent oscillations and getting stuck in undesirable local minima the learning rate is often varied during training either in accordance to a learning rate schedule or by using an adaptive learning rate.

Usually a fixed learning rate is set very small to ensure convergence. In practice, however, a learning rate schedule with decay, often combined with momentum, is used to speed up convergence and prevent oscillations, as discussed in my answer to another post; a minimal sketch of such a schedule is shown below.
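As a concrete illustration, here is a minimal sketch of a simple exponential-decay schedule (the initial rate and decay factor are arbitrary example values, not recommendations):

```python
# Illustrative sketch of an exponential-decay learning rate schedule.
# The initial rate and decay factor are arbitrary example values.
def lr_schedule(step, initial_lr=0.1, decay=0.95):
    """Return the learning rate to use at a given training step."""
    return initial_lr * decay ** step

for step in (0, 10, 100):
    print(step, lr_schedule(step))
# 0   0.1
# 10  ~0.0599
# 100 ~0.00059
```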

cinch

Your intuition is correct, and the short answer is: the learning rate is necessary in order to control the size of the update!

You probably already know that gradient descent (GD) can be derived from the [Newton-Raphson algorithm](https://en.wikipedia.org/wiki/Newton%27s_method "or Newton's method"). In machine learning applications, the learning rate in GD plays the role of the inverse of the second-order derivative, which encodes the curvature information. To be more explicit, let's consider two convex functions (which, if you know your math, have positive second derivatives), $f_{1}(x) = 2x^{2}$ and $f_{2}(x) = \frac{x^{2}}{2}$. Their first and second derivatives are: $(f'_{1}(x) = 4x, f''_{1}(x) = 4)$ and $(f'_{2}(x) = x, f''_{2}(x) = 1)$.

Now, if we consider the Newton-Raphson algorithm, then we see that:

$x_{k+1} = x_{k} - \frac{f'_{1}(x_{k})}{f''_{1}(x_{k})} = x_{k} - \frac{4x_{k}}{4} = 0 \quad\text{and}\quad x_{k+1} = x_{k} - \frac{f'_{2}(x_{k})}{f''_{2}(x_{k})} = x_{k} - \frac{x_{k}}{1} = 0, \;\forall x_{k}\in\mathbb{R}.$

We can see that the second-order derivative rescales the gradient so that, $\forall x_{k}\in\mathbb{R}$, the update takes you directly to the minimum. (This works in one step because the functions are quadratic; when the function is not exactly quadratic, the algorithm takes more iterations, but still converges to the minimizing solution.)

Hence, in the case of $f_{1}$, the scaling factor is $\frac{1}{f''_{1}(x_{k})} = \frac{1}{4} = \alpha^{*}_{1}$, and for $f_{2}$ it is $\frac{1}{f''_{2}(x_{k})} = 1 = \alpha^{*}_{2}$. These represent the optimal learning rate values $\alpha^{*}_{1}$ and $\alpha^{*}_{2}$ for each function.
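This is easy to check numerically; here is a minimal sketch for the two quadratics above (the helper function is mine, written just for this illustration):

```python
# For a quadratic, one gradient step scaled by 1/f''(x) lands exactly at the
# minimum x = 0, no matter where you start.
def newton_like_step(x, grad, curvature):
    """One gradient descent step with learning rate 1/curvature."""
    return x - grad(x) / curvature

f1_grad, f1_curv = (lambda x: 4 * x), 4.0  # f1(x) = 2x^2
f2_grad, f2_curv = (lambda x: x), 1.0      # f2(x) = x^2 / 2

for x0 in (-3.0, 0.7, 10.0):
    print(newton_like_step(x0, f1_grad, f1_curv),  # always 0.0
          newton_like_step(x0, f2_grad, f2_curv))  # always 0.0
```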

But, what if instead of these values we used values that are smaller or larger? Well, let's see:

  • Let $\alpha_{1} = \frac{1}{6} < \alpha^{*}_{1}$. Then: $x_{k+1} = x_{k} - \frac{4x_{k}}{6} = x_{k} - \frac{2x_{k}}{3} = \frac{x_{k}}{3} \neq 0$. However, since this holds at every time step $k$, the update rule gives you a geometric progression (e.g., if we start with $x_{0} = 1$, then $x_{1} = \frac{1}{3}$, $x_{2} = \frac{1}{9}$, $x_{3} = \frac{1}{27}$, $\cdots$, $x_{k} = (\frac{1}{3})^k$; if $k$ is large enough, the solution will be very close to 0, i.e. $k\to\infty \implies x_{k}\to 0$). Hence, quite intuitively, a smaller learning rate leads to slower convergence.
  • Let $\alpha_{1} = \frac{1}{2} > \alpha^{*}_{1}$. Then: $x_{k+1} = x_{k} - \frac{4x_{k}}{2} = x_{k} - 2x_{k} = -x_{k} \neq 0$. As can be seen, this never converges to the optimal solution, 0! It oscillates back and forth between $x_{0}$ and $-x_{0}$ forever. Worse, if we chose $\alpha_{1} = \alpha^{*}_{2} = 1 > \alpha^{*}_{1}$, then $x_{k+1} = x_{k} - 4x_{k} = -3x_{k} \neq 0$, which gives a geometric progression with ratio $r=-3$, which is divergent. Hence, a larger learning rate leads to oscillation or even divergence.
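All of these cases can be traced in a few lines of code; here is a minimal sketch (the iterates function is mine), exploiting the fact that for $f_{1}$ each step multiplies $x_{k}$ by the common ratio $1 - 4\alpha$:

```python
# Gradient descent on f1(x) = 2x^2 (so f1'(x) = 4x): the iterates form a
# geometric progression x_{k+1} = (1 - 4*alpha) * x_k.
def iterates(alpha, x0=1.0, steps=5):
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - alpha * 4 * xs[-1])
    return xs

print(iterates(1/6))  # ratio  1/3: converges slowly towards 0
print(iterates(1/4))  # ratio    0: optimal rate, reaches 0 in one step
print(iterates(1/2))  # ratio   -1: oscillates between 1 and -1 forever
print(iterates(1.0))  # ratio   -3: diverges
```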

In summary, the learning rate scales the gradient in order to control the amount of change in the optimized parameters at every step, and it plays the same role as the inverse of the second-order derivative, i.e. it is an estimate of the function's curvature (shape) information. If properly estimated, the algorithm converges quickly. If underestimated, convergence is slow (but guaranteed, after a potentially large number of updates). If overestimated, the algorithm overshoots and diverges, never settling down to a specific value.

P.S.: Chris Cundy mentions the Lipschitz coefficient in his comment; in general, you want your learning rate to satisfy $0 < \alpha < \frac{1}{L}$, where $L$ is the Lipschitz coefficient. This is also related to the second-order derivative information and is valid for functions with Lipschitz continuous gradients (this connects to L-smoothness). Indeed, if $f(x)$ is twice differentiable and there exists $L < \infty$ such that its Hessian matrix (second-order derivative) has a bounded spectral norm, i.e. $\|f''(x)\| \leq L$ for all $x$, then $f(x)$ has a Lipschitz continuous gradient with Lipschitz constant $L$.

I recommend checking Theorem 3.4 in [1]: https://gowerrobert.github.io/pdf/M2_statistique_optimisation/grad_conv.pdf

Ahnel