Learning the Optimizer by Luke Metz

We apologize for errors in the captions below.
If you find any egregious, please submit a pull request.

"Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves" was a paper from September 2020 by Luke Metz and co. It is another step towards replacing hand-designed features with learned functions, this time the optimizer. This has been a three-year journey for Luke; listen to him describe what he’s learned along the way and where the pain points are, from research to engineering.

Cinjon Resnick
[00:00:00]
Explain this paper to me like I'm your brother, starting with why. Why was it so important to you?
Luke Metz
[00:00:05]
So over the last, what, 10 years or so, it's been increasingly clear that machine learning techniques are pretty powerful. We've seen time and time again that the models created with these techniques can outperform humans on a large number of tasks. But what's interesting is these machine learning techniques are still designed by humans. And this is a bottleneck because we're sort of limited by our creativity in what we can design in these algorithms. So a motivating factor for my research for the last couple of years has been trying to use machine learning techniques to improve machine learning. And I believe doing this, we'll start to construct a new type of algorithm and sort of enable huge potential in the types of things we can do. Not only will this hopefully enable faster algorithms, like faster than we can design by hand or better in some way, but I also hope that it will make them easier to use, because right now to use a lot of these methods, you need a PhD, three years of [00:01:00] experience. The knowledge required is almost folkloric. And finally, what's particularly exciting to me is we can start to use these things to solve problems that we don't really have good algorithms for now. So some examples of that might be working with unlabeled data. So if I'm presented with a stream of video, there's no real thing I want to do there, but humans get information out of that. And then eventually they can use that information to perform some task.
Cinjon Resnick
[00:01:30]
That's interesting that you're positing your motivation here as being something where you can learn from a lot of data in a way that humans do.
Luke Metz
[00:01:38]
Or maybe. I think there's some analogy between using learning to learn how we learn things and how humans do it, because this is exactly what evolution has sort of created in us. Evolution is this giant process that has created us, which are learning machines, but evolution itself is an optimization procedure. Like, we were optimized to become who we are. So I hope [00:02:00] similar types of behavior will emerge.
Cinjon Resnick
[00:02:02]
How did you pursue this motivation?
Luke Metz
[00:02:05]
For the last while, our main goal has been to sort of see how feasible these types of things are, see what types of assumptions we need, and sort of where our current techniques fail, I guess, in the limit. It feels like this type of system has to be possible because we have examples of it. Humans are an example of such a system.
Cinjon Resnick
[00:02:25]
When you say such a system, what do you mean? You mean a system that is able to teach itself how to learn? Is that what you're saying?
Luke Metz
[00:02:32]
A system that was formed from some optimization procedure, in this case evolution, that does learning on new tasks, different tasks. It can even create learning algorithms to then learn on tasks. Yeah, just a lot more flexibility than, say, "I want this model to tell me that's a cat."
Cinjon Resnick
[00:02:53]
So then what exactly are you doing in this paper?
Luke Metz
[00:02:56]
So in this paper we're focusing on one very particular domain, [00:03:00] and that's optimization algorithms. Now when you're training these machine learning systems, oftentimes there's some model of some kind that makes predictions. And what we want to do is tune that model in such a way that will improve performance on something that you care about. So for example, if we're making a detector for cats versus dogs, we'll have some model, we'll give it a bunch of images of cats and dogs. We'll use that data and we'll improve the model, and we'll hopefully get a thing that predicts, if this is a cat, that this is a cat, and so on. So the algorithms you use to do that, to actually train the model, are often called optimizers. And what we're interested in is not just any old optimizer, but a general purpose optimizer. Because as I said before, a lot of these techniques require a lot of domain-specific knowledge. So when picking an optimizer, you have to be versed in a large number of techniques to pick the right one for the right task. And in addition, each one of these has settings; for example, there are settings that [00:04:00] control how fast you update the model, and other things. And you have to set those correctly as well. So it's sort of a lot of work.
Cinjon Resnick
[00:04:07]
What happens if you update the model too fast?
Luke Metz
[00:04:09]
If you update the model too fast, the model becomes unstable, I guess, is a way to put it. You sort of might change the model too much in one direction, and then you'll have to correct it, going too far the other direction, and then you have to correct it again, going too far the other direction. And then, so to speak, the model will explode and you won't actually end up learning anything. Make sense?
Cinjon Resnick
[00:04:28]
Yeah. And I'm guessing there's also a too slow.
Luke Metz
[00:04:30]
Yeah. If it's too slow, you'll just wait forever. These things are expensive, so that forever is like money. You're sort of wasting a lot of money. Yeah. Okay. In this work, we're focusing on creating these general purpose optimizers. And the general purpose thing here means we want these things to work on a wide variety of problems without tuning any of these settings. Now where we differ from past work is, instead of actually just designing this thing by hand, like fiddling around with equations and trying to get things right, we're going to learn the thing. [00:05:00] And what this entails is three pieces. So first we need some sort of scaffolding in how we're going to set up this optimizer.
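The "too fast" and "too slow" behavior described above can be seen on a toy problem. This is a minimal sketch that was not part of the interview: it assumes plain gradient descent on the one-dimensional function f(x) = x**2, and the three learning rates are picked purely for illustration.

```python
def run_gradient_descent(learning_rate, num_steps=20, start=1.0):
    """Minimize f(x) = x * x with plain gradient descent; return every iterate."""
    x = start
    trajectory = [x]
    for _ in range(num_steps):
        gradient = 2.0 * x                  # derivative of x * x
        x = x - learning_rate * gradient    # the update the learning rate controls
        trajectory.append(x)
    return trajectory


# Illustrative settings (assumptions, not values from the paper).
too_fast = run_gradient_descent(learning_rate=1.1)     # overshoots, flips sign, grows
too_slow = run_gradient_descent(learning_rate=0.001)   # barely moves in 20 steps
reasonable = run_gradient_descent(learning_rate=0.4)   # shrinks quickly towards 0

print("too fast, first iterates:", too_fast[:4])
print("too fast, final |x|:     ", abs(too_fast[-1]))
print("too slow, final |x|:     ", abs(too_slow[-1]))
print("reasonable, final |x|:   ", abs(reasonable[-1]))
```

With the large step, each update overshoots the minimum and lands further away on the other side, which is the back-and-forth correction described in the answer above. With the tiny step, the iterate is still close to where it started after all 20 steps.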
Cinjon Resnick
[00:05:07]
When you say learn the thing, you're saying you're going to learn the algorithm that then trains the actual model you care about.
Luke Metz
[00:05:14]
Perfect. Yep.
Cinjon Resnick
[00:05:16]
Yep.
Luke Metz
[00:05:18]
So the first thing we need to figure out, or we need to specify, is what thing we're actually learning. So like, what's the actual form of this optimizer. We draw inspiration from machine learning again and use, yes, the same types of models that we're actually training. So this is just, yeah. Yeah. The next piece that we need is some distribution of tasks, or some distribution of problems, we want this thing to do.
Cinjon Resnick
[00:05:41]
The thing here is the model that you care
Luke Metz
[00:05:43]
Yes. Yeah. We want the optimizer to do well on it, so we need to make an optimizer that works well on some large distribution of tasks. This is actually where a lot of the work for this effort went, because we train this thing on a sort of wide diversity of tasks that might capture the types of problems that you might want to use this thing [00:06:00] eventually on. Finally, we need some way to actually optimize this optimizer. So we need some way to find the good settings of the optimizer so that after we've found those good settings, we can apply this optimizer to some other tasks that you might care about. So one of the main findings of this work is that scale
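The three pieces listed above (a form for the optimizer, a distribution of tasks, and a way to optimize the optimizer) can be put together in a toy sketch. Everything here is an assumption made for illustration and none of it comes from the paper: a two-number linear update rule stands in for the neural network mentioned in the interview, random one-dimensional quadratics stand in for the task distribution, and simple random-search hill climbing stands in for the outer training procedure. All function names are hypothetical.

```python
import random


def sample_task(rng):
    """Piece 2, the task distribution: a random 1-D quadratic and a start point."""
    curvature = rng.uniform(0.5, 2.0)
    start = rng.uniform(-2.0, 2.0)
    return curvature, start


def inner_train(theta, task, num_steps=10):
    """Piece 1, the optimizer's form: train one task with the learned update rule.

    theta = [scale_grad, scale_momentum] are the optimizer's own parameters.
    Returns the task's final loss after num_steps updates.
    """
    scale_grad, scale_momentum = theta
    curvature, x = task
    momentum = 0.0
    for _ in range(num_steps):
        grad = curvature * x                 # d/dx of 0.5 * curvature * x * x
        momentum = 0.9 * momentum + grad     # running sum of recent gradients
        x = x - (scale_grad * grad + scale_momentum * momentum)
    return 0.5 * curvature * x * x


def meta_loss(theta, tasks):
    """Average final loss across tasks: how good the optimizer is overall."""
    return sum(inner_train(theta, task) for task in tasks) / len(tasks)


def meta_train(num_outer_steps=200, num_tasks=32, seed=0):
    """Piece 3, optimizing the optimizer (hill climbing as a stand-in)."""
    rng = random.Random(seed)
    tasks = [sample_task(rng) for _ in range(num_tasks)]
    theta = [0.0, 0.0]                       # untrained optimizer: does nothing
    initial = best = meta_loss(theta, tasks)
    for _ in range(num_outer_steps):
        candidate = [p + rng.gauss(0.0, 0.05) for p in theta]
        loss = meta_loss(candidate, tasks)
        if loss < best:                      # keep only changes that help
            theta, best = candidate, loss
    return theta, initial, best


theta, loss_before, loss_after = meta_train()
print("learned optimizer parameters:", theta)
print("average task loss before meta-training:", loss_before)
print("average task loss after meta-training: ", loss_after)
```

The untrained optimizer starts at all zeros, so it leaves every task exactly where it began. The outer loop then searches for parameters that make the inner training runs end at a lower loss on average.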
Cinjon Resnick
[00:06:17]
I just wanna understand something. So you've trained an optimizer to do well on training another model. And to train that optimizer, you had to use an already existing optimizer. Is that how that
Luke Metz
[00:06:29]
Yes, that is exactly how that works. Yeah.
Cinjon Resnick
[00:06:31]
So, why not just use the first optimizer? Sorry, why not just use the one that you use to train the other one?
Luke Metz
[00:06:38]
So we actually can, and we have done so. The problem is that right at the beginning of training, that optimizer isn't good. The optimizer might not even optimize; it might just do nothing. It might optimize in the wrong direction. Yeah, but once our optimizer is trained, once we have one of these optimizers, we can use it to optimize training another one of itself. [00:07:00]
Cinjon Resnick
[00:07:00]
A big advantage here is that the optimizer you learn does a lot better at the beginning of training than the usual optimizer you would otherwise use.
Luke Metz
[00:07:11]
The hope is that the optimizer we learn is better throughout all of training. It's also good early in training. The comment about early in training is with regard to when you initialize the learned optimizer. So you're trying to train an optimizer and you haven't trained it at all. It's just going to be some random thing. Unless you have me, as the human, putting in a lot of information about what optimization is, it's unlikely that that optimizer will work. Something we found in this work that was sort of surprising to us, but really shouldn't be given all the recent results in this space, is that scale really matters. So we found that we really needed a large number of tasks to train our optimizer on, in this case on the order of 6,000, and we also needed a lot of compute over a long period of time. [00:08:00] So these things took about 30,000 CPU cores for about a month, which in terms of training these models is, I would say, four times longer than most people train these things.
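To put the compute figure above in one unit, here is a back-of-the-envelope calculation. The core count is the approximate figure from the interview; treating "about a month" as 30 days of continuous use is an assumption.

```python
cpu_cores = 30_000      # "about 30,000 CPU cores" (approximate, from the interview)
days = 30               # assumption: "about a month" taken as 30 days

core_hours = cpu_cores * days * 24
core_years = core_hours / (365 * 24)

print(f"{core_hours:,} core-hours, roughly {core_years:,.0f} core-years")
```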
Cinjon Resnick
[00:08:11]
How much better is this new optimizer versus what people are usually doing? And did you try it on the unlabeled video sequence data you described earlier?
Luke Metz
[00:08:20]
Yeah, so the optimizers we create work really well on the distribution of tasks they were trained on. So on those 6,000 tasks, or tasks similar to those tasks, they perform very well. This kind of makes sense. This is exactly what we're trying to target. Like, this is the same problem. Because we're training on such a large diversity of tasks, these optimizers do learn things that are more general purpose, and they can be applied to other problems. Are they better than the state of the art of what people use to train other types of problems? No. They are general purpose, though. So we have some experiments in our paper, for example, where we take [00:09:00] larger, more realistic models that are used to classify images. And for those models, there's been a whole community targeting those and figuring out how to make those models train quickly and well. Whereas our thing does work right out of the box. You don't really need to think about it. You just sort of apply it to these problems and then it performs pretty well. But it doesn't match the sort of hyper-tuning that the community has done.
Cinjon Resnick
[00:09:24]
Do you think you know how to fix that? Do you think you know what the next version could look like and should look like?
Luke Metz
[00:09:29]
Yeah. Yeah. I think that where we are now, we've sort of seen over the last couple of years dramatic increases in performance on these things. Like, these optimizers are becoming better and better at a rate that's very fast. And it feels like only a matter of time before we start to have optimizers that are actually useful everywhere. Right now, for the reasons I kind of described, they're not quite ready for prime time.
Cinjon Resnick
[00:09:54]
About learned optimizers, right?
Luke Metz
[00:09:55]
Yeah. Learned optimizers. It's to the point now that I can give you this optimizer [00:10:00] and you can apply it to pretty much any task and it will optimize, which is already well more than existing methods have done, but it's not yet at the point of beating human ingenuity on these tasks. But I think we are going to get there. I think that the key ingredients are more scale, so training these things on more tasks, tasks that are closer to the types of tasks that people might care about, as well as more iterations on getting the actual form of the optimizer correct. So right now the optimizers are relatively simple. They don't have as much knowledge as humans have about these problems. They only make use of relatively simple signals, but adding more signals into that will hopefully increase the power of these optimizers.
Cinjon Resnick
[00:10:45]
Was there some part of this where you just really feel like, "Oh, that was an interesting thing here that I should share, that I would want you to know"?
Luke Metz
[00:10:52]
Yeah. So the most exciting thing for me, I kind of alluded to this earlier, is the fact that these optimizers can start to be used to optimize themselves. [00:11:00] This is analogous to self-hosting compilers. So in the programming languages world, there was a time when people were writing all their compilers in, like, basically binary, and then eventually people shifted to writing the compilers in the language that the compiler actually compiles. So you write your C compiler in C, and this is great. This lets you make a lot faster progress, much more sanely, and we're starting to see a similar thing here. So we actually have experiments in this work showing that our optimizer can optimize a new optimizer faster early in training. And once we have this, we sort of have this tool that'll help us make better tools faster and faster. And my hope is that this can be used to sort of bootstrap itself eventually: to create the next version you use the past version, and so on.
Cinjon Resnick
[00:11:47]
In another direction, was there anything that was especially hard during this research?
Luke Metz
[00:11:52]
There's a lot of things we don't really understand very well. I've spent the last three years pretty much exclusively on this, trying to [00:12:00] understand things, and we've made a lot of progress, but there's still a lot there. How this plays out is a lot of instability during training. So I might initialize an optimizer and train it, and it might perform well, it might not perform well. I could initialize 10 different versions of it and sort of get a wide variation of different things. So that's one key thing, because things like that make experimentation quite hard. The other thing that's difficult is more practical. These systems are a lot more complicated than your traditional machine learning problem. In a traditional machine learning problem, you might have one model that you're training with one optimizer, whereas in these things there's thousands of models running in parallel on thousands of machines, each with a learned optimizer. And then there's some other optimizer on top of that. So the systems side of things, and the work needed there, is considerable.
Cinjon Resnick
[00:12:54]
Do you have an example of what that considerable amount of work is?
Luke Metz
[00:12:57]
We had to rethink how we want to do [00:13:00] distributed training altogether. So there's some existing methods that are used that are based on basically having large clumps of machines that all run in lockstep. This is sort of known as synchronized training. But the problem with this setup is that each task might run at a different rate. If I do lockstep training, or synchronized training, I'm going to be wasting a lot of compute. So you need to sort of run these things async. Well, people have been doing async training, but that also doesn't work here because some of these tasks take considerably longer. And also, in doing full async training, you'd be only updating each time you train on one problem, and in practice, this isn't enough signal. You need to train on sort of a large batch of problems.
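The wasted compute in lockstep training can be estimated with a small calculation. This sketch assumes one machine per task and that every synchronized round lasts as long as its slowest task; the task durations are hypothetical numbers chosen for illustration, not measurements from the paper.

```python
def lockstep_utilization(task_durations):
    """Fraction of machine time doing useful work when all tasks run in lockstep.

    Assumes one machine per task, and that every machine waits until the
    slowest task in the round has finished.
    """
    round_length = max(task_durations)
    useful_time = sum(task_durations)
    available_time = round_length * len(task_durations)
    return useful_time / available_time


durations = [1.0, 2.0, 3.0, 10.0]    # hypothetical seconds per task
utilization = lockstep_utilization(durations)
print(f"utilization: {utilization:.0%}, wasted: {1 - utilization:.0%}")
```

In this made-up case, the three faster machines spend most of each round idle while they wait for the slowest task, which is the waste described above.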
Cinjon Resnick
[00:13:46]
You need a variety.
Luke Metz
[00:13:47]
Yeah, a variety. Yeah. So what we do is we sort of have this async batch training, which is like, we train on these thousands of workers and we have this big list, and every time something's finished, we populate this list, [00:14:00] both with the thing we need to use to train from the task, like information from that task, as well as how new or how fresh that information is. If we get information about how to change our optimizer from a task way in the past, that's not very useful to us. So we ended up actually throwing that away. So now we need the infrastructure to run these tasks, populate this list, every now and again grab big chunks of this list, and then perform updates on our model. Does that make sense?
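The list-based scheme described above might look roughly like the following single-process sketch. The interview only says that finished tasks are added to a list along with how fresh they are, that results that are too old are thrown away, and that updates are made from big chunks of the list. The class name, the staleness rule (measured in optimizer versions), the batch size, and the scalar parameter are all assumptions made for illustration.

```python
from collections import namedtuple

# gradient: the update signal a worker computed from one task.
# version: which version of the optimizer the worker started from.
Result = namedtuple("Result", ["gradient", "version"])


class AsyncBatchTrainer:
    """Hypothetical sketch of batched asynchronous updates with a freshness check."""

    def __init__(self, batch_size, max_staleness, learning_rate):
        self.batch_size = batch_size
        self.max_staleness = max_staleness
        self.learning_rate = learning_rate
        self.theta = 0.0        # the parameter being trained (a scalar here)
        self.version = 0        # incremented on every update
        self.buffer = []        # the "big list" of finished results
        self.dropped = 0        # how many stale results were thrown away

    def submit(self, result):
        """Called whenever any worker finishes a task, in any order."""
        self.buffer.append(result)
        self._maybe_update()

    def _maybe_update(self):
        # Throw away results computed from a version that is too old.
        fresh = [r for r in self.buffer
                 if self.version - r.version <= self.max_staleness]
        self.dropped += len(self.buffer) - len(fresh)
        self.buffer = fresh
        # Only update once a whole batch of fresh results is available.
        if len(self.buffer) >= self.batch_size:
            batch = self.buffer[:self.batch_size]
            self.buffer = self.buffer[self.batch_size:]
            mean_gradient = sum(r.gradient for r in batch) / len(batch)
            self.theta -= self.learning_rate * mean_gradient
            self.version += 1


trainer = AsyncBatchTrainer(batch_size=2, max_staleness=1, learning_rate=0.1)
trainer.submit(Result(gradient=1.0, version=0))
trainer.submit(Result(gradient=3.0, version=0))   # batch full: first update
trainer.submit(Result(gradient=2.0, version=0))   # one version old: kept
trainer.submit(Result(gradient=4.0, version=1))   # batch full: second update
trainer.submit(Result(gradient=5.0, version=0))   # two versions old: dropped
print(trainer.theta, trainer.version, trainer.dropped)
```

Workers never wait for each other, which avoids the lockstep problem, but each update still averages over several tasks. The staleness check is what discards results that arrive from a task that started too far in the past.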
Cinjon Resnick
[00:14:29]
Yeah. That makes sense. You sort of have a freshness to the update, but you still need a variety in the things that go into that update, into the learning. And because of that tension, sometimes you have to throw away some of the information that you've held on to in order to update this thing in a way that actually works well for the entire model.
Luke Metz
[00:14:46]
One more difficulty that I found particularly interesting in this work is just trying to get a better sense of what's going on in the system at all. So in machine learning, people often monitor certain things about models [00:15:00] being trained, for example. But now there's not one model being trained. There's like huge numbers of models being trained in very different settings. So throughout this work we invested a lot into monitoring this cluster of tasks, and this alone is a pretty large-scale effort. In our system, we're monitoring something on the order of 10,000 pieces of information a second. And monitoring these, looking at these, was critical to the success of this because there's so many moving pieces. Knowing exactly what's going on here and being able to spot, "Oh, that is incorrect, let me go fix that," or, "Oh, that signal suggests a bug someplace," was crucial.
Cinjon Resnick
[00:15:40]
That's probably only going to become a bigger job as the scale you talked about earlier increases.
Luke Metz
[00:15:45]
Yeah.
Cinjon Resnick
[00:15:46]
So if I try to recap some of what you said here: in the paper you're describing, this learned optimizers paper, you trained a new optimizer that works across a wide distribution of models, and the way you trained it is by taking [00:16:00] an old optimizer that was human-designed in order to train the new optimizer to work on this distribution of tasks. And going forward, you can now use this optimizer to work really well across all parts of training. And I think you definitely have a prognosis for what the future is going to bring in terms of making these things even better and better, but it appears even today, you can create an optimizer that is good enough for that distribution of tasks.
Luke Metz
[00:16:26]
Yes.
Cinjon Resnick
[00:16:27]
Luke, thanks so much for your time.
Luke Metz
[00:16:29]
Yeah. Thank you for having me.