Reading about learning rates, it occurred to me that cyclical learning rates (CLR) could be interesting.
It seems one could argue either way:
- that the cycles would kick you off a good minimum, and that saddle points are already handled by momentum-based optimisers, so cyclic LRs aren't needed;
- that the periodic, abrupt changes or 'jumps' in the learning rate could help when a model has been stuck for a while (see the sketch after this list).
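For concreteness, here is a minimal sketch of the triangular policy described in the 2015 paper, with illustrative `base_lr`, `max_lr` and `step_size` values (these particular numbers are just placeholders): the rate ramps linearly between a lower and an upper bound over each cycle, rather than jumping randomly.

```python
import math

def triangular_lr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=200):
    """Triangular cyclical LR: ramps linearly from base_lr to max_lr and
    back down over each cycle of 2 * step_size iterations."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)
```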
I'm aware of the 2015 paper on it (Smith, "Cyclical Learning Rates for Training Neural Networks") and of the implementations: one built into PyTorch, and one in TensorFlow that appears abandoned.
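For reference, a minimal usage sketch of PyTorch's built-in scheduler, `torch.optim.lr_scheduler.CyclicLR`; the model, data, and hyperparameters below are just placeholders:

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1)  # placeholder model
opt = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
sched = optim.lr_scheduler.CyclicLR(
    opt,
    base_lr=1e-4,       # lower bound of the cycle
    max_lr=1e-2,        # upper bound of the cycle
    step_size_up=200,   # batches to climb from base_lr to max_lr
    mode="triangular",  # plain triangular wave, no amplitude decay
)

for step in range(1000):  # dummy training loop with random data
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()  # advance the learning rate once per batch
```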
My questions are:
- Are there studies showing problems with this approach, or validating it more broadly? Or equivalently: are there known caveats for specific architectures?
- Are there large models that have shown substantial improvements with it?