Written by Cinjon Resnick
on August 03, 2021

T5 by Colin Raffel

We apologize for errors in the captions below.
If you find any egregious, please submit a pull request.

T5 by Colin Raffel et al. is an important work in the NLP literature. The idea behind it was to perform a gigantic study of a wide array of methods and scientifically assess what worked. They then combined those working methods into a single model called … T5. It performs well on every common NLP task, from summarization to translation to question answering.

Cinjon Resnick

[00:00:00]

Why did you spend your time pursuing this?

Colin Raffel

[00:00:03]

Most of the work that I do has the goal of trying to make machine learning algorithms as broadly useful and applicable to as many people as possible. One of the ways that I think we can make machine learning more useful is by making it so that people have to spend as little time and money as possible actually deploying these algorithms. And some of the things that people spend time and money on are collecting labeled data, so going out and collecting data and then paying a human to label it, or just modifying the algorithm until it works, so tweaking the details of the algorithm that you're using. And many people who want to use machine learning don't have much labeled data. It's usually expensive to obtain. When we started on this line of work, there was a technique called transfer learning that was one way to make it so that you needed fewer labels in order to use these machine learning models. This [00:01:00] method was showing a lot of promise. But there was so much excitement around transfer learning that the science ended up being a little sloppy and hard to keep track of. And as I said, the idea behind this work and behind a lot of the work I do is to try to take these techniques and make them very useful.

Cinjon Resnick

[00:01:26]

So your motivation then is: how can we make the models that people have already built much more applicable to what people want to do? And is that how transfer learning comes in?

Colin Raffel

[00:01:38]

Yeah, exactly.

Cinjon Resnick

[00:01:39]

Do you want to just briefly explain how this works?

Colin Raffel

[00:01:41]

Yeah. So as I said, one of the expensive pieces of using machine learning can be collecting labeled data. And what transfer learning allows you to do is first train your model on other data: you show your model lots of data that is not labeled, or that maybe comes from a [00:02:00] different problem where data is much more plentiful, and it gives the model a general idea of the kind of problem you're trying to solve. So then when you ask it to solve your problem, where you've labeled much less data, it already has a decent idea of how to do it. And so it takes less time for it to figure out what to do.

Cinjon Resnick

[00:02:18]

So this was the guiding motivation for the directions you were thinking about. How were you then trying to tackle and help transfer learning?

Colin Raffel

[00:02:27]

So as I said, there was a lot of excitement around transfer learning, and there were tons of papers coming out describing different ideas of how to do it. And so the first thing that we did was just a big study of all of this work. And it's more than a literature survey. It was what we would call an empirical survey. So we actually took everyone's ideas and we implemented them ourselves. We coded them up again, and we tried them all out and compared them. But when you're doing this kind of comparison, it's really important that you're comparing everything in exactly the [00:03:00] same experimental setting. You don't want to benefit one method over the other because of how you've designed your experiments. So we were... Yeah, exactly. And I don't mean that in a disparaging way. I just mean that when lots of people are working on the same thing at the same time, there's not a lot of agreement on what the appropriate experimental practices are. And so people are just trying things out in different settings. The idea here was that if we did this big study and we tried all of these different methods out, we could figure out which things actually worked. And then we could combine all of the good stuff and explore how far we can push the current tools in the field.

Cinjon Resnick

[00:03:40]

So in a sense, what you were trying to do is work out the best practices for understanding transfer learning, and then use that codification to test all the things that came before and then figure out, okay, if this test shows A is better than B, then let's use A in the next approach, and [00:04:00] so forth. Is that right?

Colin Raffel

[00:04:01]

Yeah. And if there are a bunch of things that work well, like maybe technique A works, technique B doesn't, technique C does, then we want to combine all the stuff that does work together and push the field as far forward as we can.

Cinjon Resnick

[00:04:16]

Cool. Yeah. Next, do you want to dive into the specifics of what exactly this paper is doing?

Colin Raffel

[00:04:21]

Yeah. So the first step was to design this experimental setting where we could try everything out in exactly the same format, and the way that we did that was by proposing a way of treating basically every text problem in exactly the same format. So to give a couple of examples of the kind of text problems we were targeting, you might imagine that you want to translate an English sentence to French. So in machine translation we might feed the English text into the model and we might train it to produce the French text. So that's pretty natural. But what we did is we made it so that every text problem had that [00:05:00] format where you feed some text in and you get some text out. So for example, if I have a movie review and I want to know if the movie review is positive or negative, then I feed the text of the movie review into the model and I train the model to output the word "positive" or the word "negative". So this is still a format where you feed text in and get text out. Yeah, this is a way of designing an experimental setting where you could try tons of different natural language processing or text problems. And you also could apply these different ideas and compare them in a fair way. The other nice thing about this is that every one of these tasks, like machine translation or sentiment analysis (figuring out whether a movie review is positive or negative), uses exactly the same methodology. So you don't have to change your model or your training procedure or any experimental details when you change between these different tasks. Again, as I mentioned earlier, my goal is to make these machine learning [00:06:00] algorithms as useful to as many people as possible. And if you have to tweak fewer things when applying the approach to a new problem, then I think it tends to be easier to use and more applicable to more people.
So anyways, once we had this framework, we compared tons and tons of different approaches that had been proposed in the last year. And we just ran lots and lots of experiments. Again, the hope was that by doing this all in the same experimental setting, we could fairly compare things. And then we basically figured out which things were most helpful and trained some machine learning models that were much bigger than what people had trained in the past. It turns out that when you make these models bigger (and when I say bigger, I mean they use more computation and more storage on disk and so on), they just seem to work quite a bit better. By taking the best of the field and scaling it up, making it as big as we could, we got some really good results on [00:07:00] benchmark tasks that people care about.
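
As a rough illustration of the text-to-text format described in this answer, the sketch below casts a translation example and two sentiment examples into the same (input text, target text) shape. The class name, the helper functions, and the task-prefix strings are assumptions made for this illustration, not details taken from the interview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextToTextExample:
    """One training example: text fed to the model and the text it should produce."""
    input_text: str
    target_text: str


def translation_example(english: str, french: str) -> TextToTextExample:
    # Hypothetical task prefix that tells the model which task to perform.
    return TextToTextExample(f"translate English to French: {english}", french)


def sentiment_example(review: str, is_positive: bool) -> TextToTextExample:
    # The label is produced as a literal word rather than a class index.
    label = "positive" if is_positive else "negative"
    return TextToTextExample(f"sentiment: {review}", label)


examples = [
    translation_example("That is good.", "C'est bon."),
    sentiment_example("A moving, beautifully shot film.", True),
    sentiment_example("Two hours I will never get back.", False),
]

# Every task now has the same shape, so one model and one training loop can consume all of them.
for ex in examples:
    print(f"{ex.input_text!r} -> {ex.target_text!r}")
```

Because every example is just a pair of strings, switching tasks only changes the data, not the model or the training procedure.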

Cinjon Resnick

[00:07:02]

Before we get to those benchmarks: when you say that just making them bigger helps a lot, was that one of the aspects of the codification, or was that orthogonal?

Colin Raffel

[00:07:11]

That feels like the final step of this paper, in the sense that we first want to see where the field stands and then see what happens when we push the field forward; we explore the limits, so to speak. And it feels like an independent step. We could have taken any set of approaches and scaled them up, but we really wanted to make sure we were doing things as well as we possibly could before we spent a bunch of computation testing out these ideas at scale.

Cinjon Resnick

[00:07:38]

What were the things that stood out when you did that large scale comparison?

Colin Raffel

[00:07:43]

Some things that turned out to be helpful were different ways of training the model on unlabeled data. As I mentioned before, one of the nice things about transfer learning is that it allows you to make use of a collection of data that's not your [00:08:00] one small labeled dataset that you actually care about. And if you want to use unlabeled data, data that you've just scraped off the internet and haven't had anyone sit down and hand-label, you need to figure out how you're going to train the model on that data. One of the things we found was confirming a belief held in the field that an effective training strategy for text data is to take a bunch of text, randomly remove words, and train the model to fill in the missing words. So you can imagine that if you learned how to do this task effectively, you might learn things like the meanings of different words. You might learn grammar and sentence structure, and you might even learn some world knowledge too.

Cinjon Resnick

[00:08:41]

What do we call this method?

Colin Raffel

[00:08:43]

So that's what people call masked language modeling, and the masking refers to masking out certain words.
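
The fill-in-the-blank objective described above can be sketched in a few lines. This is a simplified per-word version written for illustration; the function name, the `<mask>` token, and the default masking rate are assumptions, and the exact objective used in the paper may differ in its details.

```python
import random


def mask_words(text, mask_rate=0.15, seed=0, mask_token="<mask>"):
    """Randomly hide words; return the masked text and a {position: original word} map."""
    rng = random.Random(seed)  # seeded so the same call always masks the same words
    masked, targets = [], {}
    for position, word in enumerate(text.split()):
        if rng.random() < mask_rate:
            masked.append(mask_token)   # the model sees the mask token here...
            targets[position] = word    # ...and is trained to predict this word
        else:
            masked.append(word)
    return " ".join(masked), targets


text = "the quick brown fox jumps over the lazy dog"
masked_text, targets = mask_words(text, mask_rate=0.3, seed=1)
print("model input :", masked_text)
print("to predict  :", targets)
```

No human labeling is needed: the training targets come from the text itself, which is why this works on data scraped from the internet.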

Cinjon Resnick

[00:08:50]

And so that masking was one of the things that you found was an exceptionally useful property of a machine learning model, and you wanted to include that in the final big model that [00:09:00] you trained. Is that right?

Colin Raffel

[00:09:00]

Yeah, exactly. And in particular, it's just an effective way to take unlabeled data and train a model so that it learns to do useful stuff before you train it on the tasks that you actually care about, because people probably aren't actually going to use these models to fill in blanks. It's not a super common application.

Cinjon Resnick

[00:09:19]

So then you got to the benchmarks. Were there any challenges with these benchmarks? Did you feel like the benchmarks themselves were the right things to test?

Colin Raffel

[00:09:29]

Yeah, that's an interesting question. We tried to apply our model and our ideas to as diverse a set of benchmarks and tasks as we could. So we had some problems that were like sentiment analysis, where you're given some text and you want to classify it. We had some problems that were like machine translation, where you take a sentence in one language and output the sentence in another language. And finally, we also had a bunch of tasks that were like question answering tasks, where you feed the model some text, and then you ask the model a [00:10:00] question about the text and it has to output the answer. And I would say that among the benchmarks we chose, some of them were more useful than others simply because the performance on those benchmarks was already pretty good before we came along. People usually measure how well we're doing on these benchmarks according to how close the model's performance is to a human's. So if it's getting the same accuracy as a human, then we say we wouldn't really expect the model to do much better than that, because we're training it on human-labeled data. But there was one benchmark where that wasn't true, which was actually a set of tasks that was designed to be very hard for current machine learning algorithms, but easy for humans. And one of the exciting results that we got was that we closed the gap between the performance of machine learning algorithms and humans on that benchmark. A favorite example is this task called the Winograd Schema Challenge, which is a task where the model has [00:11:00] to disambiguate a pronoun. What that means is, if I have a sentence like "The city government did not give the protesters a permit because they feared violence," the question is: who feared violence? What does the word "they" refer to? Does it refer to the city government or the protesters? And you and I, because we're people and we have common sense, know that the government is probably the one who's fearing violence.
Now, machine learning models don't always have the same kind of common sense. So in order to solve this problem effectively, they have to have actually a pretty deep understanding of how the world works. When we started out on this work, there was actually a huge gap between how well machine learning models were doing on this pronoun resolution task and how humans would do. Humans can get basically a hundred percent accuracy on this task, but models were performing at chance. Throughout the course of the time that we were working on this task, the machine learning approaches were making more and more progress. And our model was one of the first models that actually got very close to human performance on that [00:12:00] task.
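
To make "performing at chance" concrete, here is a minimal sketch of how a pronoun-resolution example might be represented and scored. With two candidate referents, guessing at random is right about half the time. The class, the field names, and the example wording (reconstructed from the interview) are assumptions for illustration, not the benchmark's actual data format.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PronounExample:
    sentence: str
    pronoun: str
    candidates: tuple  # possible referents of the pronoun
    answer: str        # the referent a human would pick


def accuracy(examples, predict):
    """Fraction of examples where predict(example) returns the correct referent."""
    return sum(predict(ex) == ex.answer for ex in examples) / len(examples)


example = PronounExample(
    # Wording reconstructed from the interview; treat it as illustrative.
    sentence="The city government did not give the protesters a permit because they feared violence.",
    pronoun="they",
    candidates=("the city government", "the protesters"),
    answer="the city government",
)

dataset = [example] * 10_000  # repeated only to estimate the guessing rate
rng = random.Random(0)
chance = accuracy(dataset, lambda ex: rng.choice(ex.candidates))  # a model with no common sense
oracle = accuracy(dataset, lambda ex: ex.answer)                  # a predictor that is always right
print(f"random guessing: {chance:.3f}, always correct: {oracle:.3f}")
```

The gap between those two numbers is the gap between models and humans that Colin describes at the start of the work.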

Cinjon Resnick

[00:12:00]

What is human performance on this?

Colin Raffel

[00:12:01]

It's basically a hundred percent. Assuming that you're not using really ambiguous pronouns, which do exist, you can get really high accuracy on some variants of this benchmark as a human.

Cinjon Resnick

[00:12:13]

If I was to try and summarize what you've told us so far, it's that there have been a lot of models in NLP across a wide range of tasks, whether it's machine translation, whether it's summarization, whether it's sentiment analysis, whether it's this Winograd Schema Challenge you're describing. And you looked at the field and thought, you know what? People aren't really codifying the tasks, for one, as well as the strategies for how to train these models. So let's figure out a schema such that we can train all these models in a way that's reliable and empirically sound, and then take from that the best versions of it, scale that version up, and then see what we get. And I'm guessing you did really well, is that right?

Colin Raffel

[00:12:51]

We were definitely happy with the results, and I think the community has now been able to use our results and do exciting [00:13:00] things with our models. The main caveat to the usefulness of our work is that, as I mentioned, in order to get these good results, we trained really big models, and these models are very computationally expensive. They require very powerful computers with lots of memory in order to run them or to train them. And so while many people have been able to make use of our results, it can still be a little expensive. And as I mentioned early on, my goal is to make this stuff as useful to as many people as possible. And if certain people can't use our results because they require a lot of computation, then that kind of puts an asterisk on it for me, and it also points to an interesting direction for future work: trying to do this stuff more efficiently.

Cinjon Resnick

[00:13:44]

This has been super useful. Is there anything else that you would want to add so that the audience can understand?

Colin Raffel

[00:13:50]

The only thing that I'd like to add is just that when I do research like this, I like to share everything. So we published a paper, we released all of the code that we used in our [00:14:00] paper, and we released our models. And we're trying to make it easy for people to use. So even if you're not a machine learning expert and you want to try these ideas out, then I'd encourage you to poke around with it. The model's called T5, and we have a little tutorial online on how to use it. And if you're curious, I'd encourage you to check it out.

Cinjon Resnick

[00:14:20]

I actually do have one more question for you though, which is, what would you say is the major contribution for this?

Colin Raffel

[00:14:25]

It's two things. The first contribution is just to take a step back and see how well what we have works. As a field, we've proposed lots of new methods. How well does all of this stuff work? That's contribution one. And contribution two is: if we take the stuff that works and scale it up, how much better does it work? And the answer is really well. And that's exciting.
