The Magic of Learning Rate: Unlocking Super Convergence
There's some magic on learning rate that you played around with. Yeah, interesting. Yeah. This is all work that came from a guy called Leslie Smith, a researcher who, like us, cares a lot about just the practicalities of training neural networks quickly and accurately, which I think is what everybody should care about but almost nobody does. He discovered something very interesting, which he calls super convergence.
Super convergence is the observation that certain networks, with certain settings of hyperparameters, could suddenly be trained 10 times faster by using a 10 times higher learning rate. Now, no one published that paper, because it's not an area of active research in the academic world. No academics recognized this as important, and also deep learning in academia is not considered an experimental science. In physics, you could say, "I just saw a subatomic particle do something which the theory doesn't explain," and you could publish that without an explanation.
And then, in the next 60 years, people can try to work out how to explain it. We don't allow this in the deep learning world, so it's literally impossible for Leslie to publish a paper that says, "I've just seen something amazing happen: this thing trained ten times faster than it should have, and I don't know why." And so the reviewers were like, "We can't publish that, because you don't know why." So, anyway, that's important to pause on, because there are so many discoveries that would need to start like that. Every other scientific field I know of works that way.
I don't know why ours is uniquely disinterested in publishing unexplained experimental results, but there it is. So, it wasn't published. Having said that, I read a lot more unpublished papers than published papers, because that's where you find the interesting insights. So, I absolutely read this paper, and I was just like, this is astonishingly mind-blowing and weird and awesome, and why isn't everybody talking only about this? Because if you can train these things ten times faster, they also generalize better: you're doing fewer epochs, which means you look at the data less, and you get better accuracy.
So, I've been kind of studying that ever since, and eventually Leslie figured out a lot of how to get this done, and we added minor tweaks. A big part of the trick is starting at a very low learning rate and very gradually increasing it. So as you're training your model, you take very small steps at the start and gradually make them bigger and bigger, until eventually you're taking much bigger steps than anybody thought was possible. There are a few other little tricks to make it work.
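The "start very low, very gradually increase" idea can be sketched as a schedule function. This is a minimal illustration, not the exact schedule used in the work discussed: the function name `warmup_then_anneal`, the default values, the cosine shape, and the decay phase after the peak are all assumptions (the conversation only describes the gradual increase).

```python
import math


def warmup_then_anneal(step, total_steps, lr_start=1e-4, lr_max=1e-1,
                       lr_end=1e-5, warmup_frac=0.3):
    """Return the learning rate for a 0-based `step` out of `total_steps`.

    Phase 1 (warm-up): rise smoothly from lr_start to lr_max.
    Phase 2 (assumed, not from the conversation): decay from lr_max to lr_end.
    """
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        t = step / warmup_steps
        lo, hi = lr_start, lr_max
    else:
        t = (step - warmup_steps) / max(1, total_steps - 1 - warmup_steps)
        lo, hi = lr_max, lr_end
    # Cosine interpolation: equals `lo` at t=0 and `hi` at t=1.
    return hi + (lo - hi) * (1 + math.cos(math.pi * t)) / 2


if __name__ == "__main__":
    schedule = [warmup_then_anneal(s, 100) for s in range(100)]
    # Print a few points to see small steps first, big steps later.
    for s in (0, 10, 20, 30, 60, 99):
        print(s, f"{schedule[s]:.6f}")
```

In a training loop, the value returned for the current step would be assigned to the optimizer's learning rate before each update.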
Basically, we can reliably get super convergence, and so for the DAWNBench thing we were using just much higher learning rates than people expected to work.
What do you think the future of... I mean, it makes so much sense for learning rate to be that critical hyperparameter that you vary. What do you think the future of learning rate magic looks like?
Well, there's been a lot of great work in the last 12 months in this area, and people are increasingly realizing that we just have no idea, really, how optimizers work. The combination of weight decay, which is how we regularize optimizers, and the learning rate, and then other things like the epsilon we use in the Adam optimizer: they all work together in weird ways.
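To see where those three knobs meet, here is a sketch of a single Adam-style update on one scalar parameter. It assumes the decoupled form of weight decay (as in AdamW); the conversation does not say which variant is meant, and the function name `adam_step` and the default values are illustrative only.

```python
import math


def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.01):
    """One Adam update with decoupled weight decay (assumed variant).

    `state` is a dict holding the step count "t" and the running
    averages "m" (of gradients) and "v" (of squared gradients).
    """
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised running averages.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # Weight decay is scaled by the learning rate, so the two interact.
    param = param - lr * weight_decay * param
    # eps sets a floor on the denominator; when gradients are tiny,
    # it decides how large the step can get.
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


def fresh_state():
    return {"t": 0, "m": 0.0, "v": 0.0}


if __name__ == "__main__":
    # Same tiny gradient, two different eps values, very different steps.
    for eps in (1e-8, 1e-12):
        new = adam_step(1.0, 1e-10, fresh_state(), lr=0.1, eps=eps,
                        weight_decay=0.0)
        print(f"eps={eps:g}: step size = {1.0 - new:.6f}")
```

With a gradient far smaller than `eps`, the update shrinks by roughly the ratio of the two, which is one concrete way these settings can interact unexpectedly.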
Another thing we've done a lot of work on is research into how different parts of the model should be trained at different rates, in different ways. So, we do something we call discriminative learning rates, which is really important, particularly for transfer learning. So, really, I think in the last 12 months a lot of people have realized that all this stuff is important. There's been a lot of great work coming out, and we're starting to see algorithms which have very, very few dials, if any, that you have to touch.
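A rough sketch of what "different parts of the model trained at different rates" might look like in code. The helper names `discriminative_lrs` and `sgd_step`, the geometric spacing, and the choice to give earlier layer groups the smaller rates are assumptions made for illustration; the conversation does not spell out the exact scheme.

```python
def discriminative_lrs(n_groups, lr_min, lr_max):
    """Geometrically spaced learning rates, one per layer group.

    Assumption: group 0 is the earliest (most general) part of a
    pretrained model and gets the smallest rate; the last group
    (the new head) gets the largest.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    if n_groups == 1:
        return [lr_max]
    ratio = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * ratio ** i for i in range(n_groups)]


def sgd_step(groups, grads, lrs):
    """Plain SGD update where each group of weights uses its own rate."""
    return [
        [w - lr * g for w, g in zip(weights, group_grads)]
        for weights, group_grads, lr in zip(groups, grads, lrs)
    ]


if __name__ == "__main__":
    lrs = discriminative_lrs(3, lr_min=1e-5, lr_max=1e-3)
    weights = [[1.0, 1.0], [1.0], [1.0, 1.0, 1.0]]  # three toy layer groups
    grads = [[1.0, 1.0], [1.0], [1.0, 1.0, 1.0]]
    print(lrs)
    print(sgd_step(weights, grads, lrs))
```

For transfer learning, the intuition behind this sketch is that pretrained early layers need only gentle adjustment, while freshly initialised layers need to move much further.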
So, I think what's going to happen is that the idea of a learning rate, well, it almost already has disappeared in the latest research. Instead, we know enough about how to interpret the gradients, and the change of gradients we see, to know how to set every parameter.
Can't wait.