Your intuition is correct, and the short answer is: the learning rate is needed to control the size of the update!
You probably already know that gradient descent (GD) can be viewed as a simplified form of the [Newton-Raphson algorithm](https://en.wikipedia.org/wiki/Newton%27s_method "or Newton's method").
In Machine Learning applications, the learning rate in GD plays the same role as the inverse of the second-order derivative, which encodes the curvature information. To be more explicit, let's consider two convex functions (which, if you know your math, have non-negative second derivatives; here they are strictly positive), $f_{1}(x) = 2x^{2}$ and $f_{2}(x) = \frac{x^{2}}{2}$.
We can see that their gradients and second-order derivatives are: $(f'_{1}(x) = 4x,\; f''_{1}(x) = 4)$ and $(f'_{2}(x) = x,\; f''_{2}(x) = 1)$.
Now, if we consider the Newton-Raphson algorithm, then we see that:
$x_{k+1} = x_{k} - \frac{f'_{1}(x_{k})}{f''_{1}(x_{k})} = x_{k} - \frac{4x_{k}}{4} = 0$ for $f_{1}$, and $x_{k+1} = x_{k} - \frac{f'_{2}(x_{k})}{f''_{2}(x_{k})} = x_{k} - \frac{x_{k}}{1} = 0$ for $f_{2}$, $\;\forall x_{k}\in\mathbb{R}$.
We can see that dividing by the second-order derivative rescales the gradient so that, $\forall x_{k}\in\mathbb{R}$, the update takes you directly to the minimum (this works in a single step because the functions are quadratic; when the function is not exactly quadratic, the algorithm takes more iterations, but it still converges to the minimizer in a similar fashion).
Hence, in the case of $f_{1}$ the scaling factor is $\frac{1}{f''_{1}(x_{k})} = \frac{1}{4} = \alpha^{*}_{1}$, and for $f_{2}$ it is $\frac{1}{f''_{2}(x_{k})} = 1 = \alpha^{*}_{2}$. These are the optimal learning rates $\alpha^{*}_{1}$ and $\alpha^{*}_{2}$ for each function.
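To make this concrete, here is a minimal numerical sketch (my own illustration, not part of the original argument): one GD step with these optimal learning rates is exactly the Newton-Raphson step and lands on the minimizer $x = 0$ immediately.

```python
# One gradient-descent step x_{k+1} = x_k - alpha * f'(x_k). With the optimal
# learning rate alpha = 1 / f''(x) this is exactly the Newton-Raphson step.

def gd_step(x, grad, alpha):
    return x - alpha * grad(x)

f1_grad = lambda x: 4 * x  # f1(x) = 2x^2,     f1''(x) = 4
f2_grad = lambda x: 1 * x  # f2(x) = x^2 / 2,  f2''(x) = 1

x0 = 5.0  # arbitrary starting point
print(gd_step(x0, f1_grad, alpha=1/4))  # 0.0 -> minimum reached in one step
print(gd_step(x0, f2_grad, alpha=1.0))  # 0.0 -> minimum reached in one step
```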
But what if, instead of these values, we used a learning rate that is smaller or larger? Well, let's see (all cases are also illustrated numerically right after the list):
- Let $\alpha_{1} = \frac{1}{6} < \alpha^{*}_{1}$, then:
$x_{k+1} = x_{k} - \frac{4x_{k}}{6} = x_{k} - \frac{2x_{k}}{3} = \frac{x_{k}}{3} \neq 0$. However, since this is the case at every time step $k$, the update rule produces a geometric progression (e.g. if we start with $x_{0} = 1$, then $x_{1} = \frac{1}{3}$, $x_{2} = \frac{1}{9}$, $x_{3} = \frac{1}{27}$, $\cdots$, $x_{k} = \left(\frac{1}{3}\right)^{k}$, so as $k\to\infty$, $x_{k}\to 0$). Hence, quite intuitively, a smaller learning rate leads to slow convergence.
- Let $\alpha_{1} = \frac{1}{2} > \alpha^{*}_{1}$, then:
$x_{k+1} = x_{k} - \frac{4x_{k}}{2} = x_{k} - 2x_{k} = -x_{k} \neq 0$. As can be seen, this does not converge to the optimal solution $0$ at all! The iterates oscillate back and forth between $x_{0}$ and $-x_{0}$ forever. And if we choose an even larger rate, $\alpha_{1} = \alpha^{*}_{2} = 1 > \alpha^{*}_{1}$, then $x_{k+1} = x_{k} - 4x_{k} = -3x_{k} \neq 0$, which is a geometric progression with ratio $r = -3$ and therefore diverges. Hence, a larger learning rate leads to oscillation or outright divergence.
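Here is a small sketch (again just an illustration, not from the original answer) that runs plain GD on $f_{1}$ with the four learning rates discussed above and prints the iterates:

```python
def gd(x0, grad, alpha, steps):
    """Iterate x_{k+1} = x_k - alpha * grad(x_k) and return all iterates."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - alpha * grad(xs[-1]))
    return xs

f1_grad = lambda x: 4 * x  # gradient of f1(x) = 2x^2

for alpha in (1/4, 1/6, 1/2, 1.0):
    xs = gd(x0=1.0, grad=f1_grad, alpha=alpha, steps=4)
    print(f"alpha = {alpha:.3f}: {[round(x, 3) for x in xs]}")

# alpha = 0.250: [1.0, 0.0, 0.0, 0.0, 0.0]          -> optimal, one step
# alpha = 0.167: [1.0, 0.333, 0.111, 0.037, 0.012]  -> slow geometric decay (ratio 1/3)
# alpha = 0.500: [1.0, -1.0, 1.0, -1.0, 1.0]        -> oscillates, never converges
# alpha = 1.000: [1.0, -3.0, 9.0, -27.0, 81.0]      -> diverges (ratio -3)
```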
In summary, the learning rate scales the gradient in order to control how much the optimized parameters change at every step, and it plays the same role as the inverse of the second-order derivative, i.e. it encodes an estimate of the function's curvature or shape. If the learning rate is estimated properly, the algorithm converges quickly. If it is underestimated, convergence is slow (but still guaranteed, after a potentially large number of updates). If it is overestimated, the algorithm overshoots and diverges, never settling down to a specific value.
P.S.: Chris Cundy mentions the Lipschitz constant in his comment, and in general you want your learning rate to satisfy $0 < \alpha < \frac{1}{L}$, where $L$ is the Lipschitz constant of the gradient. This is again second-order (curvature) information and applies to functions with Lipschitz-continuous gradients (so-called $L$-smooth functions). Indeed, if $f(x)$ is twice differentiable and there exists $L < \infty$ such that its Hessian matrix (second-order derivative) has a bounded spectral norm, i.e. $||f''(x)|| \leq L$, then $f(x)$ has a Lipschitz-continuous gradient with Lipschitz constant $L$.
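To illustrate that connection (my own sketch with an arbitrarily chosen quadratic, not taken from the linked notes): for $f(x) = \frac{1}{2}x^{\top}Ax$ the Hessian is the constant matrix $A$, so $L$ is just its spectral norm, and a step size of $\frac{1}{L}$ converges.

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T A x, whose Hessian is the constant matrix A
# (an arbitrary example, chosen only for illustration).
A = np.array([[4.0, 0.0],
              [0.0, 1.0]])

L = np.linalg.norm(A, 2)  # spectral norm of the Hessian = largest eigenvalue = 4.0
alpha = 1.0 / L           # learning rate satisfying 0 < alpha <= 1/L

x = np.array([1.0, 1.0])
for _ in range(50):
    x = x - alpha * (A @ x)  # gradient of f is A x
print(x)  # very close to the minimizer [0, 0]
```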
I recommend checking Theorem 3.4 in [1]: https://gowerrobert.github.io/pdf/M2_statistique_optimisation/grad_conv.pdf