In almost every example I've seen, the learning rate of an optimisation method is less than $1$, but I've never found an explanation of why. At the same time, there are cases where a learning rate greater than $1$ is beneficial, such as super-convergence.
Why is the learning rate generally less than $1$? Specifically, when updating a parameter, why is the gradient generally multiplied by a factor whose absolute value is less than $1$?
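
To make the question concrete, here is a minimal sketch of the kind of update I mean, assuming plain gradient descent. The toy objective $f(x) = x^2/2$, the function name `gradient_descent`, and the particular rates are my own illustrative choices, not taken from any library or paper.

```python
def gradient_descent(grad, x0, lr, steps):
    """Repeatedly apply the update x <- x - lr * grad(x) and return the final x."""
    x = x0
    for _ in range(steps):
        # The gradient is scaled by the learning rate before being subtracted.
        x = x - lr * grad(x)
    return x


def grad(x):
    # Gradient of the toy objective f(x) = x**2 / 2 (an assumption for illustration).
    return x


# A typical "small" learning rate and one larger than 1, same starting point.
small = gradient_descent(grad, x0=1.0, lr=0.1, steps=50)
large = gradient_descent(grad, x0=1.0, lr=1.5, steps=50)

print(f"lr=0.1 -> x = {small:.3e}")
print(f"lr=1.5 -> x = {large:.3e}")
```

On this particular toy problem both runs end up close to the minimum at $0$, so a rate above $1$ does not obviously break anything here. I don't know how far that carries over to other objectives, which is part of what I'm asking.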
 
    