Presentation: The Unreasonable Effectiveness of Zero Shot Learning


Article originally posted on InfoQ.

Transcript

Meertens: My name is Roland Meertens.

Let’s get started by talking about machine learning projects in general. In my career, I often went through the following phases when setting up a machine learning project. I would first determine the requirements of what my program or neural network has to do. I would then go out and collect and label data. Then I would train the machine learning model, and only then would I deploy my model and get feedback from my boss, other teams, and clients. Unfortunately, I noticed that the first time I defined my requirements, I would often miss things or get things wrong, and if you’re working in a startup, requirements change quite often. The collecting and labeling data step is very expensive and takes a long time. That means it takes a long time before you get any feedback, and by then your requirements have probably changed. That’s a really expensive process.

Background

Recently, I found a way to radically speed up machine learning projects development using a technique called zero-shot learning. Right now, you might be wondering, who are you actually? What is your stake in getting us to annotate less data? I’m a product manager at Annotell, which is a data annotation and analytics company for self-driving cars. Although it might sound pretty ironic that someone who has a stake in data annotation is going to tell you about doing less data annotation, I think it’s important to make sure that people have a good understanding of what problem they’re trying to solve before they actually go at it. That way, they can build high quality datasets, which are necessary, especially for self-driving car projects.

Example Use Case 1: Lyric Language

If we go back to the problem at hand, I think the big underlying question is, can we start by determining requirements and deploying our application at the same time? Wouldn’t that be dreamy? Wouldn’t it be fantastic if we could do that? To assess this question, I thought of two projects to try during this talk which might benefit from something like zero-shot learning. The first idea I had was, let’s say that we have a music startup or something, and we want to determine the language a lyric or a song is written in. Let’s say that you have the lyric, “I’m in love with the shape of you,” that’s English. “Alors on danse,” that would of course be a French song. “Neunundneunzig Luftballons” would be a German song. We basically have a function which takes some text as input and gives the language as output.

Example Use Case 2: Speed up Self-Checkout Fruit Scan

The second example I thought of was, let’s speed up self-checkout fruit scanning. If you go to a supermarket, you often have to weigh products yourself and indicate what fruit or vegetable it is, and it always takes me a long time to find them. That’s why I went to the supermarket, bought some fruits and vegetables, and tried to build a scanning application which can recognize these fruits and vegetables. It actually took me a bit of time to do this, because when I was buying them, the person behind the desk couldn’t find the code for a carrot. I thought that made the problem extra relevant. Maybe take a moment right now to think about how you would normally approach these problems. What data would you collect? How much time would it take, especially in the case of these examples, if you were really working in a startup? What changes in requirements can you expect in the future? Do you think that new languages will be added? Do you think there are seasonal fruits and vegetables we didn’t even consider right now? What kinds of collisions would you expect between the different fruits and vegetables that would be hard for an AI to get right? Just reflect on that for a bit.

Foundation Models

Foundation models are models which are trained on a broad scale of data and can be adapted to a wide range of downstream tasks. Last year, Stanford founded the Center for Research on Foundation Models. These models are too big to train by yourself, so you have to download a pre-trained model or use an API. Basically, you just download a model and maybe treat it as a black box. That is what we will do during this talk. There are new research fields trying to see what we can do with these models and how we can use them. You might already know a couple of these models. One example is GPT-3, which can finish your natural language query. OpenAI Codex can finish your code if you have an invite to GitHub Copilot. There’s OpenAI CLIP, which operates on images and text.

Generative Pre-trained Transformer 3 (GPT-3)

Let’s start with the first model we’re going to look at, GPT-3, which is a Generative Pre-trained Transformer. It’s basically an autocomplete algorithm. If you give it a sequence of tokens, a sequence of letters, or a sequence of text, it will try to predict the next token or the next text. It is pre-trained in a generative, unsupervised manner on 570 gigabytes of internet data. What OpenAI did is they just scraped the internet for webpages, books, and Wikipedia. I think Wikipedia raw text is only 10 gigabytes. The model has seen a lot of text, has seen a lot of things. It’s also a massive model with 175 billion parameters, so it won’t even fit on your GPU anymore. It can only be used through the OpenAI API. Just to give you an example of what this autocompletion is: I think it’s common, if you’re learning machine learning, that people teach you some LSTMs and let you make a Shakespeare generator or something. You could have a text like, “What doesn’t kill us makes us str.” Then if you send it to the OpenAI API, it could return something like “onger,” or “uggle,” or “ive,” which are all nice ways to finish this sentence: “What doesn’t kill us makes us stronger,” or “struggle,” or “strive.”
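To make that concrete, here is a minimal sketch of what such a completion call looked like with the openai Python package around the time of this talk; the engine name and sampling settings are illustrative assumptions, and the current API has since changed shape.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder, set your own key

# Ask the completion endpoint to finish the partial sentence from the talk.
response = openai.Completion.create(
    engine="davinci",        # assumption: any completion engine of that era
    prompt="What doesn't kill us makes us str",
    max_tokens=5,            # we only need the end of the word
    temperature=0.7,
    n=3,                     # request three alternative completions
)

for choice in response["choices"]:
    print(choice["text"])    # e.g. "onger", "uggle", "ive"
```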

Examples of Generative Text

Some examples of generative text which you could generate with the OpenAI API, or with any autocomplete algorithm at all, would be, “The quick brown fox jumps over the lazy dog.” Or, “Shall I compare thee to a,” could become “summer’s day,” if you’re a Shakespeare fan. It could also be anything else, like “Shall I compare thee to a nice spring evening.” All kinds of things are possible. What people are also doing with these autocomplete algorithms is, for example, giving it the text, “How are you doing? Answer:” That way GPT-3 can autocomplete with, “I am fine.” This is the way that people talk to these algorithms. They try to see how GPT-3 is feeling by just seeing how it autocompletes sentences like this. You can write poetry. If you just input a poem name, “Shadows on the Way,” then it’s possible for GPT-3 to generate something like, “I must have shadows on the way. If I am to walk, I must have each step taken slowly and alone to have it ready made.” Beautiful poetry can be generated with these autocomplete algorithms. Now we get to some more interesting examples. Let’s say that you write, “the capital of France is,” and then you let the API autocomplete it; there’s hopefully a high chance that it says Paris, because it has read this text on the internet, or has seen similar text. If you say “20 plus 22,” hopefully it will autocomplete as 42, because it has learned this or seen text like this.

That brings me to the point of GPT-3: it’s not just an interesting thing to play with. You can actually try to use it as a generic everything-in, everything-out API, with of course the asterisk that there are no guarantees at all on correct performance. Everything which can be represented as a text-completion task for which general knowledge is required can be given to GPT-3. In my experience, it can handle everything which a bored high-schooler would do on a test. It will autocomplete things in a roughly reasonable fashion. Of course, it’s not perfect. It can look lazy. It can look a little bit sloppy.

Two Examples GPT-3 Can Handle

These are two examples of something which GPT-3 can handle if you want to see it as an API. You could let it autocomplete the text: “This is a program which translates English text to French. English: hello. French:” and then GPT-3 will probably return “bonjour.” It’s a translation API now, if you phrase it this way. This is actually a zero-shot example: no training data was given to this specific program to make it do this task. The example on the right half is, “This is a program which improves English grammar. Poor English input: I eated the purple berries. Good English output: I ate the purple berries. Poor English input: The patient was died. Good English output:” and hopefully you get, “The patient died.” This is an example of a one-shot learning task. You give this program, this autocomplete algorithm, one example of a text which it has to autocomplete, and then it has to complete the next things.

Solving Our Lyrics Use Case in a Zero-Shot Fashion

If we go back to solving our lyrics use case in a zero-shot fashion, you can just say, “This is a program which determines the language song lyrics are written in. Lyric:” and then you give the lyric. Then “Language:” and it has to autocomplete the language. I wrote a simple function using the OpenAI API; it’s really a small function if you use that API. If you give it the text, “I’m in love with the shape of you,” it will return English. If you give it, “Alors on dance,” even if you make a spelling mistake, it will return French. If you say, “Neunundneunzig Luftballons,” it will return German. Isn’t it great? We had zero examples and it still does the thing. We have this small Python function which returns the correct answer. I think this is brilliant. This could really speed up our product development if we were building this as a product.
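The function itself was shown on a slide rather than in the transcript; the following is a hedged reconstruction of what such a zero-shot prompt function might look like with the same era of the OpenAI completion API. The prompt wording, engine name, and settings are assumptions, not the exact code from the talk.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def detect_language(lyric: str) -> str:
    # Zero-shot prompt: describe the task, give the lyric, and let the model
    # autocomplete the language name.
    prompt = (
        "This is a program which determines the language song lyrics are written in.\n"
        f"Lyric: {lyric}\n"
        "Language:"
    )
    response = openai.Completion.create(
        engine="davinci",     # assumption: any completion engine works here
        prompt=prompt,
        max_tokens=3,
        temperature=0.0,      # deterministic output for a classification-style task
        stop="\n",
    )
    return response["choices"][0]["text"].strip()

print(detect_language("I'm in love with the shape of you"))  # hopefully: English
```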

GPT-3 Pro Tips

I want to give you some pro tips, because I’ve been playing around with this a lot over the last spring. In case you only have a few answer options, like in this case, where we only have English, French, or German, it makes sense to tell that to GPT-3 in the first line. You can say, “This is a program which determines if text is English, French, or German.” It doesn’t necessarily just repeat things it saw in the text before, but it does seem to prime these words to come up more quickly. The second tip is that language models are few-shot learners. That’s the title of the paper OpenAI used to announce GPT-3 to the world. It basically means that if you give it some examples, the autocompletion will work better. It’s a good primer for your query. Try to include an example for each of the possible classes, or in this case each of the languages, before you give it a new lyric; then it’ll hopefully latch on to that. The last tip, which I frequently use, is to allow the API to return multiple autocomplete options, like with “What doesn’t kill us makes us str,” where I just let it return three things. Then you can use a heuristic to select the best option. You can have it return three answers for what language some text is in. Then, because I know it’s either French, English, or German, I just check whether it returned any of these, and use that; there is a small sketch of that pattern below. That’s it for GPT-3. You can sign up on the OpenAI website for the beta, and hopefully get access if you have a good idea.
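As a sketch of that last tip, here is one way to request several completions and pick the first one that matches a known label. The helper below is illustrative, not the code used in the talk, and again assumes the older completion-style API.

```python
import openai

KNOWN_LANGUAGES = {"English", "French", "German"}

def detect_language_robust(lyric: str) -> str:
    # Tip 1: state the allowed answers in the first line of the prompt.
    prompt = (
        "This is a program which determines if song lyrics are English, French, or German.\n"
        f"Lyric: {lyric}\n"
        "Language:"
    )
    response = openai.Completion.create(
        engine="davinci",   # assumption
        prompt=prompt,
        max_tokens=3,
        temperature=0.7,
        n=3,                # Tip 3: ask for three candidate completions
        stop="\n",
    )
    # Heuristic: take the first completion that is one of the expected labels.
    for choice in response["choices"]:
        candidate = choice["text"].strip()
        if candidate in KNOWN_LANGUAGES:
            return candidate
    # Fall back to the first completion if nothing matched.
    return response["choices"][0]["text"].strip()
```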

Zero-Shot Image Classification

That brings me to the second task which we wanted to tackle during this talk, zero-shot image classification. OpenAI also has something for that. They have OpenAI CLIP, which stands for Contrastive Language-Image Pre-training. What this model does is bring together text and image embeddings. It generates an embedding for each text and it generates an embedding for each image, and these embeddings are aligned to each other. The way this model was trained is that, for example, you have a set of images, like an image of a cute puppy, and you have a set of texts, like “Pepper the Aussie Pup.” The model is trained so that the distance between the embedding of the picture of this puppy and the embedding of the text “Pepper the Aussie Pup” is really small. It’s trained on 400 million image-text pairs which were scraped from the internet. You can imagine that someone did indeed put an image of a puppy on the internet and wrote under it, “This is Pepper the Aussie Pup.”
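Very roughly, the training objective pulls matching image-text pairs together and pushes mismatched pairs apart. The following toy sketch of a symmetric contrastive loss follows the pseudocode in the CLIP paper; it is not the actual training code, and the temperature value is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_features, text_features, temperature=0.07):
    # image_features, text_features: (batch, dim); row i of each comes from the same pair.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Cosine similarity between every image and every text in the batch.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    # The matching pair should win both along the image axis and the text axis.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```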

Using CLIP for Classification

The brilliant thing is that because we have an embedding for text and an embedding for images, and these are aligned to each other, we can use this for zero-shot prediction on images. If you have a picture of a puppy, and you know that it’s either a plane, a car, a dog, or a bird, you can just get the embeddings for “a photo of a plane,” “a photo of a car,” “a photo of a dog,” and “a photo of a bird,” and then see which of these text embeddings is closest to the embedding of the image. Hopefully, that will be the “a photo of a dog” query.
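Here is a minimal sketch of that zero-shot prediction using the openai/clip package from GitHub; the model variant, the prompt strings, and the image file name are assumptions for illustration.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # model variant is a choice

labels = ["a photo of a plane", "a photo of a car", "a photo of a dog", "a photo of a bird"]
text_tokens = clip.tokenize(labels).to(device)
image = preprocess(Image.open("puppy.jpg")).unsqueeze(0).to(device)  # hypothetical file

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    # Normalize so that the dot product is cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).softmax(dim=-1)

print(labels[similarity.argmax().item()])  # hopefully "a photo of a dog"
```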

Solving Our Scanning Application Using Zero-Shot Learning

Now let’s go back to solving our scanning application using zero-shot learning. What we can do is we can get an embedding for each of the labels we have like the carrot, potato, a bunch of grapes, and an empty table, in this case, if we want to know if there’s actually something on the table. Then we can get an embedding for a picture we took of our vegetable or our fruit or our empty table. We can determine the distance between the embedding of our picture and the embeddings of labels, and pick the one closest for our classification. Here you have a video of results. The kiwi is recognized reasonably well, but clashes a bit with the onion and the potato. The carrot is perfectly recognized. You can see that the bell pepper, perfect classification. Orange, really good. The onion is also perfectly recognized. A bunch of grapes, yes, that gets detected. Also, see that as soon as there’s nothing on the table, the empty table gets a high classification. Potato also nicely classified.

Here we get to some problems: this is a San Marzano tomato, which apparently looks like a bell pepper. The normal tomatoes do get recognized as vine tomatoes. My bunch of tiny cherry tomatoes gets recognized as a vine tomato. It’s not perfect, but at least it can get you started. You can even do multiple tasks at once; you don’t need to stick to only fruits and vegetables. In this video, you can show it a mobile phone. You can show it a mug. Then, for a person, you can maybe even see: is it a man or a woman? Is it a happy person or a sad person? Again, it doesn’t work perfectly. You can see that if I’m a happy man it works. Otherwise, I apparently get tagged as a mug. Also, here again, the code is really simple.

Zero-Shot Classification Example in Code

What I find so beautiful here is that you basically write down your labeling specification, and your labeling specification is instantly used to see how your classification is doing. If you notice that something is wrong here, or if you deploy this, you immediately get feedback on whether your labeling specification is good, or, if things get confused, what the likely clashes are. Even the code for doing inference is really simple. You can download the model from the OpenAI GitHub repository.
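The slide with the code is not reproduced in the transcript, so here is a hedged reconstruction of how the label list doubles as the labeling specification; the label wording, model variant, and file name are choices of mine, not taken from the talk.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# The label list *is* the labeling specification: adding a seasonal vegetable
# means adding one line here, nothing else.
labels = [
    "a photo of a carrot",
    "a photo of a potato",
    "a photo of a bunch of grapes",
    "a photo of an empty table",
]
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(labels).to(device))
    text_features /= text_features.norm(dim=-1, keepdim=True)

def classify(image_path: str):
    """Return (best_label, per-label probabilities) for one checkout photo."""
    with torch.no_grad():
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        image_features = model.encode_image(image)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)[0]
    return labels[probs.argmax().item()], probs.tolist()

print(classify("checkout_frame.jpg"))  # hypothetical file name
```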

Conclusion

We started this presentation by asking ourselves, can we shorten this process for machine learning projects a bit? If you use zero-shot learning, and you have natural language processing or an image classification task, you can basically start by determining the requirements together with deploying your model and getting feedback from your boss, other teams, colleagues. Then you can collect and label the data which you suspect will be a problem, like in this case, there will likely be a clash between the kiwi and the potato, or the onion. Then you can train a machine learning model with more confidence and keep deploying that. If you want to try these examples for yourself, I uploaded this code for this presentation to github.com/rmeertens/unreasonable-effectiveness-zero-shot. Then you can play around with it yourself.

Questions and Answers

Lazzeri: If you had to talk to junior data scientists who want to get started with these types of models, especially with the OpenAI API, what’s the best way to get started? I know that you shared the GitHub repo. I was already looking at that. That is, in my opinion, a great way to get started for machine learning researchers and data scientists who are just joining the industry. Do you have any additional suggestions for data scientists who are early in their career and want to start, for example, with the OpenAI API?

Meertens: The reason I’m giving this talk now is that, in spring, I played around a lot with the API. If you get accepted, you don’t have to write Python code to interact with it; it gives you an interactive textbox. You’re not allowed to show it in presentations, because they don’t like that, since it can give you all kinds of recommendations. That’s how I would get started: just play around with that. What I did frequently is talk with friends and say, do you think this would be possible? Do you think that would be possible? We tried out all kinds of ideas: would it be possible to have it write song lyrics, or rap lyrics? Would it be possible to have a conversation between a child and a robot, for example? It’s really interesting that you can take all kinds of problems which are normally maybe difficult in industry, and just try to see what GPT-3 would do with them. That’s how you learn a bit about its capabilities. That’s how to get started: this page of OpenAI. Even if you have no good idea yet of what to do with it, just make an account and try to get an invite to the API, because it’s really fun to play with.

Lazzeri: This is a great suggestion, and I agree with you, especially for machine learning scientists, data scientists, the best way to learn is by doing that. Go to the resources that Roland just shared with us, GitHub repos, and go through that material, try to replicate the demos with your own data. That is honestly the best way to learn.

Can this model recognize fish, meat, or poultry?

Meertens: Probably. What you see in the demo video I made of these fruits and vegetables is not something I cherry picked. I just bought these things at the supermarket. This is the only video I made, and you see that it works really well on some things and not on others. It really depends a bit on how specific you get, because I have the feeling that everything which is really general, or very frequent on the internet, for example pictures of mobile phones, is recognized perfectly. As soon as you get into very specific things, like the different types of tomatoes I had, there are not a lot of people making pictures of very specific varieties of tomatoes, so the model will probably perform less well on that. If Ton has very clear pictures of the fish, or some other meats, or some poultry, he will probably have a good time. If he’s really wondering, what chicken is this, and it’s not something there are many Wikipedia pages about, he will probably have a bad time. Going back a bit to your talk, I think you can at least get started by deploying it. Then you can already set up your machine learning pipeline, because I think we already established that collecting your data, training everything, and setting up the training pipeline takes a long time. That’s hard work.

Lazzeri: Have you tried fine-tuning CLIP for your specific problem, like for distinguishing specific types of tomatoes?

Meertens: Something which I maybe didn’t address very well is the difference between these kinds of foundation models, where you are not doing any post-training or fine-tuning, and pre-trained models as we have known them in machine learning for a couple of years, where you take something which is trained on ImageNet and fine-tune it. I didn’t try any fine-tuning with CLIP; that’s something which is more suitable for a pre-trained model. One thing which I do think would be very useful is that if you try CLIP, like in the visualization here, you can at least get started by collecting all this data. You can already see whether the model is very confident in its prediction or not. Then you can label all the things where the model is not very confident, and use that as your dataset. That way, with less data, you can hopefully build something even better by using CLIP early on.
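As an illustration of that idea, here is a small sketch that routes low-confidence predictions to annotation; it reuses the classify helper from the earlier CLIP sketch, and the threshold is an arbitrary assumption.

```python
# Assumes the `classify` helper from the earlier CLIP sketch, which returns
# the best label and the per-label softmax probabilities.
CONFIDENCE_THRESHOLD = 0.6  # arbitrary cut-off, tune for your data

def select_for_labeling(image_paths):
    """Split unlabeled images into confident auto-labels and cases for human annotation."""
    auto_labeled, needs_labeling = [], []
    for path in image_paths:
        label, probs = classify(path)
        if max(probs) >= CONFIDENCE_THRESHOLD:
            auto_labeled.append((path, label))
        else:
            needs_labeling.append(path)
    return auto_labeled, needs_labeling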

Lazzeri: There was a data scientist who was using the GPT-3 model. At that time, that data scientist was mentioning to me that it wasn’t free as yet. I know that right now it’s open source, so it’s free, it’s accessible. What was your experience with this open source approach? I’m correct, right, that there is no cost associated with using the GPT-3 model?

Meertens: I think that it is still closed source, but the big problem is that you can’t run inference on it yourself, because it doesn’t fit in one computer. It’s really hard to set it up for yourself. There are some open source alternatives, where people are training similar-sized models on open data. Right now, I think that if you make an account on the OpenAI website, you get some free credit you can use. I think I paid 50 cents making this talk through their API, just by trying out lots of things. I think it can become expensive if you rely on it. On the other hand, it’s cheaper than making your own thing. I don’t know if you’re more interested in the open source aspects or the cost aspects?

Lazzeri: I think this person was asking specifically for the costs associated. That’s great to be also aware that there are some costs associated to it. Again, if you are building a solution, you just need to be aware of these costs.

Meertens: Maybe about the cost, something else which I didn’t really address is that there are different models of GPT-3. You have this really big one, and then you have some tinier ones. The tinier ones are cheaper to run inference with, but of course have worse performance. For some tasks I noticed that, for example for this language task, you can use a model with fewer parameters, because even the smaller model has probably learned the differences between languages. If you are, for example, building a chat application, the higher the quality of chat you want, the better your model has to be, so you have to figure out what fits best for your use case.

Lazzeri: What are some of the additional applications that you have been seeing in the industry, not just based on your experience, but you have been seeing in the industry? These applications are based on the GPT-3 model.

Meertens: There are a couple of things which I personally played around with. One is chatbots. In the past, I worked at some point on a robot which could assist elderly people and also have some conversations with them. Previously, all these conversations were scripted. Now with GPT-3, if someone tries to go off the beaten path, you can actually give a response, which is amazing. Because in the past, robots would always reply, “I don’t know what you mean, please say that again.” Now, if you see that someone is going off the beaten path, you can give a reply and then try to get them back.

The other thing I tried is language learning. I like learning languages, and I tried learning Swedish. I always hate talking to people because I feel really insecure about my ability to speak Swedish. I tried to use GPT-3 to, first of all, generate interesting personas to talk with. It just generates new text every time, so you always have to learn some new words, because you don’t understand them. The other thing I tried was getting the grammar corrected, so that whenever I would input a sentence to this bot, it would try to correct it. That doesn’t work all the time; some grammatical cases, like in German, it doesn’t get correct. But at least you get feedback if you’re a novice Swedish learner like me. Those are things which you couldn’t really do in the past. I spent a lot of time on this, and friends of mine got PhDs trying to get this working, and now with this API, you just have it in one afternoon. That’s something I’m personally really excited about.

Lazzeri: This is a great example also based on your experience and on what you like to do in terms of learning new languages. I was reading about the Facebook Messenger application, and I know that it is based on the GPT-3 model. Facebook Messenger is just one of the many. It’s interesting to see that sometimes we don’t know about some models, but we are actually interacting with them. This is the reality. That’s why I was looking forward to your presentation, because it’s always great to learn about application APIs, because basically we probably don’t know the technical parts of them, but we are already interacting and leveraging them in our everyday life.

Based on your experience in the industry, how did you get started with machine learning, and what was your experience in the industry? Did you join your company just after your studies or you were at other companies before? Were you in different roles?

Meertens: I think I’m in quite a unique position, because in the Netherlands you can study artificial intelligence, so I started studying artificial intelligence and machine learning before it was cool. I lucked into that, I think. If you’re getting started now, especially with machine learning, I would recommend doing Kaggle competitions. At least for me, I always like to have a concrete problem, and then you can try to solve it. You can also look at solutions and see what other people are doing. That’s something I would do. Actually, I think I wrote an article on InfoQ about how to get hired as a machine learning engineer. There, I recommend a lot of sources: papers, talks, and books which you can read, or things you can practice if you want to become a machine learning engineer. I’ll post that in the QCon Slack.
