Neural Architecture Search and Google’s New AutoML Zero with Quoc Le - #366

The Power of Puns: A Breakthrough in Scaling Up NLP Models

Puns are an unlikely entry point into natural language processing (NLP). Most people enjoy making and hearing them, but few have stopped to consider what a model's ability to produce one reveals about how well it handles language. That changed when researchers working on large conversational models noticed something surprising while trying to evaluate their system.

The observation came out of work on a large conversational model (discussed at length in the episode transcript below) trained on an enormous corpus of public social-media conversations, with the goal of producing human-like responses in open-ended dialogue. As the team tuned the model, one of the hardest problems turned out to be measuring its performance at all.

One of the key difficulties in evaluating such models is that there is no simple metric for "understands human language." A conventional classifier can be scored against held-out labels, but an open-domain chatbot has to be judged on whole conversations, which in practice means asking human raters to read transcripts for every model variant. Alongside that expensive human evaluation, the team tracked a purely local objective function called perplexity, which measures how well the model predicts the next word in a sequence.
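As a rough illustration (not code from the work discussed here), perplexity is the exponential of the average negative log-likelihood the model assigns to each observed next token. A minimal sketch, assuming a hypothetical `model.next_token_logprobs(context)` method:

```python
import math

def perplexity(model, token_ids):
    """Perplexity = exp(mean negative log-likelihood of each next token).

    `model.next_token_logprobs(context)` is a hypothetical method returning
    a mapping from candidate token ids to log-probabilities; it stands in
    for whatever language model is being evaluated.
    """
    total_nll = 0.0
    for i in range(1, len(token_ids)):
        context, target = token_ids[:i], token_ids[i]
        logprobs = model.next_token_logprobs(context)
        total_nll += -logprobs[target]                 # surprise at the true next token
    return math.exp(total_nll / (len(token_ids) - 1))  # lower is better
```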

Interestingly, when the researchers plotted the two against each other, the model's perplexity was strongly correlated with human judgments of how human-like its conversations were: variants with lower perplexity were consistently rated as more human-like. The correlation suggested that the local objective of predicting the next word was also capturing a good deal of the global objective of understanding and producing language as a whole.
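A minimal sketch of the kind of check behind that claim, with made-up numbers standing in for per-variant perplexities and averaged human ratings:

```python
from scipy.stats import pearsonr

# Hypothetical values: one entry per model variant, not real measurements.
perplexities   = [18.2, 14.5, 12.1, 10.4, 9.7]   # lower = better next-word prediction
human_likeness = [0.42, 0.51, 0.60, 0.68, 0.72]  # fraction of responses rated human-like

r, p = pearsonr(perplexities, human_likeness)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
# A strongly negative r is the signature of the reported effect:
# as perplexity falls, human-likeness ratings rise.
```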

As the team continued working with the model, the link to humor became hard to ignore. One model produced a pun that appears nowhere in its training data (roughly, "horses go to Hayward, cows go to Harvard"). That seems paradoxical at first: jokes work because the next word is a surprise, yet the model is trained to predict the next word. The resolution offered in the episode is that a good model is not blind to the surprise but learns to predict it, and that ability is central to generating puns and other forms of humor.

This has practical implications for NLP. If a cheap, automatic metric tracks the expensive human judgment, researchers can iterate much faster on models that are meant to hold coherent conversations and, occasionally, land a joke, rather than building and hand-evaluating separate systems for fluency and for creativity.

None of this guarantees a new generation of conversational systems overnight, but it points toward models that are better equipped to understand and generate human-like language, with obvious applications ranging from customer-service chatbots to translation tools.

In conclusion, the observed correlation between perplexity and human judgments of conversational quality is a genuinely useful result: it gives the field an automatic proxy for a property that previously required human raters, and it hints that fluency and a degree of creativity may come from the same underlying objective rather than needing separate machinery.

Scaling Up: A New Era for NLP

The recent progress in scaling up NLP models has opened new possibilities for the field. As Le recounts in the interview, he spent roughly two years trying to build a convincing chatbot with earlier sequence-to-sequence methods; what finally worked was, in large part, simply making the model far larger and training it on far more conversational data.

One key insight from that scaled-up work is the relationship described above: optimizing for perplexity, which only asks the model to predict the next word, turns out to capture aspects of language that also matter for open-ended conversation and even for humor.

This changes how such systems can be built. Rather than training separate models for conversation and for humor, a single model trained on one objective appears to pick up both, and its jokes are not simply retrieved: the pun quoted earlier has no counterpart in the training data.

It is worth being precise about the two kinds of objective involved. Perplexity is a local objective function: it scores the model one position at a time, on how well it predicts the next word given the words before it. Human-like performance is a global objective function: it is a judgment about whole responses and whole conversations. The empirical finding is that the first tracks the second far better than one might expect.
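For concreteness, the "local" objective is just next-token cross-entropy, whose exponential is perplexity. A minimal PyTorch-style training step might look like the sketch below; `model`, `optimizer`, and `batch` are placeholders, not the system from the episode:

```python
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    """One optimization step of the local objective: predict each next token.

    `batch` is a LongTensor of token ids with shape (batch, seq_len), and
    `model(inputs)` is assumed to return logits of shape
    (batch, seq_len - 1, vocab_size). Both are illustrative placeholders.
    """
    inputs, targets = batch[:, :-1], batch[:, 1:]            # shift by one position
    logits = model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()                                       # exp(loss) is the batch perplexity
```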

Exactly why a word-level objective should track a conversation-level judgment, and how far the correlation holds as models and tasks change, remains an open question and an active area of research, with real consequences for how NLP systems are evaluated and optimized.

The Future of NLP: New Techniques and Technologies

The most immediate payoff is methodological. If perplexity is a usable proxy for human judgment, it becomes much cheaper to compare model variants, run ablations, and search over designs, because not every candidate needs its own round of human evaluation.

Conversational modeling is the clearest example so far: the chatbot work discussed in this episode used perplexity as its main automatic metric while relying on human raters to validate it, and the same next-word objective already underpins modern neural machine translation.

Another ongoing direction is simply driving perplexity lower, through better architectures, better training recipes, and more data. The episode's broader theme, automated architecture search, is one systematic way to explore that design space, as sketched below.
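The transcript later describes the evolutionary variant of that search in terms of mutation and selection. A toy sketch of such a loop, with a made-up architecture encoding and a placeholder `train_and_evaluate` function (neither is from the papers discussed), might look like this:

```python
import copy
import random

# Toy architecture encoding: one choice per layer. Purely illustrative.
LAYER_CHOICES = ["conv3x3", "conv5x5", "maxpool", "identity"]

def random_architecture(num_layers=6):
    return [random.choice(LAYER_CHOICES) for _ in range(num_layers)]

def mutate(arch):
    """Change one layer choice, as in the mutation step described in the episode."""
    child = copy.deepcopy(arch)
    child[random.randrange(len(child))] = random.choice(LAYER_CHOICES)
    return child

def evolve(train_and_evaluate, population_size=20, steps=200):
    """Simple mutation-and-selection loop. `train_and_evaluate(arch)` is a
    placeholder that trains a small model and returns validation accuracy."""
    population = [(a, train_and_evaluate(a))
                  for a in (random_architecture() for _ in range(population_size))]
    for _ in range(steps):
        parent, _ = max(random.sample(population, 3), key=lambda p: p[1])  # tournament
        child = mutate(parent)
        population.append((child, train_and_evaluate(child)))
        population.pop(0)                      # drop the oldest member (aging evolution)
    return max(population, key=lambda p: p[1])
```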

The applications follow naturally: customer-service chatbots, translation tools, and other systems that have to sound natural to be useful all stand to benefit from models that are evaluated, and improved, in this way.

In conclusion, scaling up conversational models, and discovering that a simple word-level metric tracks human judgments of the results, marks a practical step forward rather than an endpoint. How far the approach generalizes remains to be seen, but it gives researchers a faster loop for building systems that understand and generate language more like people do, and the full conversation with Quoc Le below goes into the details.

"WEBVTTKind: captionsLanguage: enwelcome to the tunnel AI podcast I'm your host Sam Cherrington all right everyone I'm here with way is a research scientist at Google Kwok welcome to the 20th August all right hey it's great to have you on the show I've followed your research for your work for quite some time and I'm looking forward to digging into some of the new things that you're working on but before we do that I'd love to have you share a little bit about your background and how you got started working in machine learning okay so I was born in Vietnam and I did my undergrad in Australia and in my second year of my undergrad I started a summer project doing machine learning with Alex mola back in Australia and back then I was playing about with koto methods and then I did my PhD at Stanford on you know a lot of deep learning back in the day when deep learning was not very cool and that's around 2007 and around 2011 I did a summer internship at Google and that was when Google brain project was co-founded so when I was there and that was a Andrew Young and Jeff Dean and Greg Guerrero was there and I was the intern so we started a quest well that sounds very cool yeah and then I did some of the you know scaling up neural networks with the Google brain folks and then you know and then after two years I did some work on machine translation with the idea and audio video is now a bit my mentor Ilya is now at open air and we developed some of the end to end translation methods and and then around 2016 I started looking into more like you know auto email and you know architecture search yeah and more recently I look into more like together with outer may also look into semi-supervised learning and and so on awesome awesome now you mentioned early on doing work with Alex mola was he was this before he was at Carnegie Mellon or was he visiting in Australia he was a professor in Australia oh yeah yeah I I went to a university in a small area in the capital city but Australia Canberra Emre he was a camera and he was a professor there doing research so I thought you know I had I have been long very interested in AI and machine learning and I took before that I took a class in data mining and so on and I thought you know it's a little bit boring but machine learning the ability to to actually learn it's actually super fascinating so I contacted him and he he was doin like Kudo methods machine learning and we we worked together for maybe a few years yeah before he went to he went to America and then CMU and okay okay so a lot of your recent work has been focused on this idea of you know automating machine learning and neural architecture serves to allow machines to find the best deep learning architectures and like talk a little bit about how you arrived at working in that area and what some of the motivations were for getting started digging into that problem yeah yeah yeah so I've been a long interested in this idea of self-improvement machine should be say improving itself machine learning right and even and when I started doing coder methods with Alex I always asked him you know how the code the code or bandwidth and so on how some of the high program it does include all methods decided and you know apparently they decided by using things like cross-validation and so on and then when I work on koto methods sorry neural networks my hope is to make the hyper parameters go away right but that's how is the opposite so if now if you look at a convolutional neural networks is it has a lot of hyper 
parameters right like how many number and how many layers you want it to be and how many channels you want it to be and what are the sort of higher size of high parameters and so on a cooler with and so on and so measurement all the training parameters yeah yeah learning Dre and as researchers develop more and more techniques for neural nets there's more decisions that you have to make I feel like this is like a problem that you know can be helped by a little bit of automation so in so I I observe a lot of my colleagues at Google when the designer brew networks and I asked them about the principles of designing your neural networks and you know you started out having some really solid suppose like let us skip connection so the gradient can flow through the network and so on but as you tune the network harder and harder you no longer have the principal it's around you know trial and error right you you try this a little bit and it seems a little bit better so you try you try that more so I think that that is something that may be ready for automation so even during my grad school I already talked about trying this but I thought you know maybe we didn't have enough compute because training and that already takes took me days so when I saw that needle you can train your net in 30 minutes or something like that you know for all sefa I thought oh maybe this is the right time to try this so that's when I started doing this neural architecture search in 2016 mmm it's interesting that you know even with all of the compute resources of Google you had to wait until the time was compressed enough in order to be able to tackle the problem yeah yeah that's how that to get really good results you want the network should be really big and that will take a lot of time to Train yeah and it's it's it's funny coming from me that we have so much resources at Google but training neural nets still I've taken a long time yeah and so maybe talk about the first steps in in that area did you jump right into neural architecture search or was that the you know an end stage or end result of this work oh well you know I I work on some of the related ideas on and off since 2012 like playing around with how to do better high parameter tuning for new networks and none of that is really published because I didn't have good results you know I didn't have another computer and so on so so I tried it on and off hmm over the time you know every year I would set some time to try this idea for a few months and you know and it didn't work very well because like a pro q and so on and then throughout 2016 I met Barrett's off with my colleague now at Google and he's very talented so we say oh let's let's try this idea of using like a reinforcement learning to generate a network like a little layer in a network for for a CFR model see if a model is already at that time you could say that you know train in a few depends on how where you want it to be but you know from 30 minutes to a few hours and that seems like about the right amount of time to get this going and my prediction is that you have to train maybe either of between from 1,000 to 10,000 Sbarro's and I need a back of our envelope calculation and I thought oh this might be the right time to do it but you know I have tried this some of these related ideas in you know much before that so you're doing a c4 which is an image recognition or object detection and images and so you're doing you've got this when you say you're doing the when you say you're doing tens of thousands of 
models is that at part of the optimization process that you're describing here you're expecting they need to do 10,000 in order to optimize the hyper parameters yeah so so in this process you have a controller which is also a machine learning model right and every time it makes an update that's basically it has to it has to start a try train in one model to convergent one see far more to convergence and it will take the signal from the convergence of the CNS if our model you know maybe the CFM model will get 70% so that 70% will be used as a signal to make one update for the controller one update right so typically machine learning models take a long time to train so it requires requires you know ten thousands of updates so that's basically the number of models but we had to try were your initial attempts at this doing or you know how do you distinguish between like doing your hyper parameter optimization and kind of architecture search guess there's a varying degree of complexity in trying to come up with these new architectures yeah and your more recent work and this is like using evolutionary algorithms and the like to do this can you maybe talk through kind of the progression of complexity that you won through yeah so the first project that we did was architecture search for like a CFR model and that's already was already very expensive back when we did it and we didn't choose hyper parameters you only in other words we didn't choose high parameters like learning rate or ydk or drop out we just focus on architecture or hyper parameters basically you know number players what kind of player do you use in what stage in the network and that's already took us like almost a week for every time we try this it takes like a week on a few hundred GPUs so after that we moved to emission at and because imagenet the network is bigger we decided to use this idea called transfer learning which is basically find a module that works well on CFR 10 and transfer to a Mishnah because searching directly on a mission that would be very difficult this is very expensive now in parallel with that we also try a lot of methods in not only in reinforcement learning and but also in evolution and we also observe the evolution it goes as well or even better than reinforcement learning so we slowly adopt more reinforce our evolutionary methods and then so some of the first in any intuition for why evolution works better than reinforcement learning oh I see so evolution here is very flexible and very easy to implement right for example in evolution you just need to decide mutation and crossover mutation meaning you have a network and then you just changes a little bit and across over minitek two networks and make them so implementing evolution is actually quite easily now in contrast reinforcement learning because we're not experts in reinforcement learning so we we try a lot of reinforcement learning methods like we started out with reinforce and then we tried it something like PPO and so on and yeah Pio recently and they're good but you know it but they also require a fair bit about opportunity to get working very well so and on the other hand evolution seems to be very flexible in terms of implementation and it's also one thing that evolution does quite well is that it's easy to diversify the models so you can just try to in reinforcement learning one thing it happens it will zoom in into a particular area area in the Subspace and find and optimize the particular model whereas in evolution it would diversify the 
population over time it's easy to control the diversification process to get better models so we we use more evolution methods now and then feed-forward okay so in the second project we already found one network that was kind of state of the art in computer vision on par or slightly better on on these on the state of the art of image net so that was super excited because we didn't think that it was possible and then after that we realized that this transfer learning has a problem that you know you transfer the cell and maybe the part of things that you want or sefa and the kind of cells as you want for image net that's quite different so we started searching on image net directly basically you search a model on internet directly but that became super expensive you know our back of envelope calculation would take a few months for this is the finish so we realized that maybe one idea that we can have is such on small scale such a smaller model or Chifa let's say instead of searching the biggest model possible that you could find search for a to a small model right like the Train only like five epochs in you know hours or something like that right so that's mom and after that we after we found a good model we figure out a way to scale it up to be science so basically make it new with larger image or make it deeper or make it wider that's good in a in a learning way or scaling it up yes scale it up in a learning way okay with scaling up in a learning way so the second stage scale it up basically what we did was you know develop like a will learn the way that we should be scaling up hmm right and it looks like it works very well and that became something called efficient that now it's been used quite a bit in various places at Google and or in academia and the smallest network that we found turns out to be super helpful for mobile devices so people because the network's small networks seem to be quite fast now for mobile devices that became something like open at v3 yeah at Google and after that we say you know okay now that we can get state-of-the-art on imagenet but the problem is that a lot of muted blocks that we used very much you know building blocks that free describe pre described by human experts let's say we have to make use of the rail you layer design but design a human expert or we made use of a convolutional net convolution layer designed by a human expert or bottom layer is that my human experts so we say can we can we design everything from scratch basically assume that we know a number library like numpy is this basically just you know matrix vector multiplication and bunch of non-linearity can you use a number library to evolve the concept of machine learning and that became something like auto email zero which basically yeah that's that's automated and that's basically the thought process behind it in autumn l00 we didn't get se off the art yet and but what's exciting about it is it generates a program from just matrix vector multiplication and to do machine learning which is super excited and I hope that using this method with a discover fundamental new fundamental building blocks for for machine learning yeah folks haven't if anyone listening hasn't taken a look at any of the blog posts or the paper for autumn l0 really interesting there's a there's one particular diagram that I've seen a couple different versions of it but it kind of walks through the process that this algorithm takes to learn a model and shows the various steps that that it introduces and as well as the 
program that it that it outputs and is super interesting you know talk a little bit about this idea of you know having this model work by evolving a program where did that come from so you know I was just gonna say you know a lot of what we're doing is all software it's all programs but this in particular is like arithmetic arithmetic operations that in a very kind of simple way define all the steps that are used to evolve these algorithms yes yes so if you think about what computer sorry machine learning experts are doing nowadays is that basically they look at a computer program you know they lose tens of right look at tensorflow or package and they have a bunch of players and then they figure out develop to write a computer program to write a new program to do machine learning let's say you want to do forecasting or something like that right basically you look at how people use lsdm and then you put together a computer program to do a forecasting now the act of writing that program is still now not learnt right so basically it's basically human expert knowledge and changing that program can affect the quality of the model or greatly no stepping back a little bit is that that program the the autumn the the program you just put assume a lot of knowledge about machine learning right the fact that the concept of gradient you know back propagation is assumed during this process right because the different automated differentiation is built in into tensor flow and Python so we say that maybe what we can we do is a step back a little bit from Python and and there's a flow and started from numpy and using numpy can you put together a small computer program that can do you know see far classification or something like that yeah and in the setup that we have we have three functions so one function is set up meaning that you know it's like DNA that you start with right and then there's a predict function meaning that whatever you have learned you have to use it to for survival right so you have to make you given a given situation you have to make some prediction and then the third function is the learn function you know it it has to learn so that the predict function is better over time so there's only is this template have only three function setup predict and learn and auto ml 0 has to see you in the rest of the program the rest of the instruction within these building blocks but it's starting those three functions being totally empty yeah it started out with these three functions totally empty so at the beginning it started you know amazing right at the beginning it will do nothing so most of the programs will be garbage so you have no signal at all hmm you have no signal at all so by some random luck right it will find some kind of dot product linear kind of layer that's somehow there's more than better than rather just slightly a little bit slightly better than random right this is basically the predict function there's a little bit better than random and then it will basically slowly put together one more one more layer to become like a neural net and then it will invent the concept of gradient so all the time it will come it started from a very small program that this new linear to going through many many steps to eventually do a new network and I reference the this diagram if I'm understanding the diagram correctly you're identifying all these points where the algorithm evolves these techniques that you know humans do now like it eventually figures out how to do SGD it eventually figures 
out like Ray Lou and other things and it figures out techniques like gradient normalization and somebody am I reading that correctly that this is all stuff that has figured out yeah yeah yeah so basically if you put it in the whole sequence you know the first step it will fire find like something like linear model it find lost clipping it will find learning rate and it will find Rey Lu you know and that normalizing the input norm the gradient and then you know doing having my linear interaction things like that so things that you know like over the time in the last maybe 30 years of neural networks evolution just so that I understand there's no there's no kind of priors there's no like recipe book or techniques that are given to the model the algorithm at all is figuring out all of this from nothing you know just so-so the caveat is that autoimmune zero has access to 64 functions from Lumpa okay that's there some bias here so these 64 function from numpy because the way that linear algebra works is very in favor of neural nets right so it will tend to develop things like nurullah because Disney algebra yeah but that's the only thing that product there for it to use eventually it's gonna try and use it on some stuff yeah but it would be hard for it to find something like trees because you know numpy doesn't have the functions that cut up suitable for trees hmm but it's numpy you can argue that number is like a library they're very suitable for neuron that's right so it will if all things that eventually look like a neural net now what's surprising is that it the whole process of developing models that started from linear and then put in development learning rate and normalizing the gradient and things like that it looks very much like the evolution the our own evolution process of developing machine learning morals so and so do you have you have you has this allows you to see future techniques that we may learn to apply yes but we haven't found anything extremely novel in the sense that like we never we haven't seen it before but we have found something that we haven't looked into more closely so in particular it found this by linear layer that normally if you do a neural net you would take X multiplied by some W and then you apply some gnarly non-linearity mmm now what we found during autumn Auto ml0 is that if we fill X W X so and then apply some knowledge non-linearity so that basically what it say is a some kind of my linear interaction right and apparently this concept has been developed by other colleagues at Google but we never knew we didn't know it before our project so we are in the process of trying out this layer on larger problems but it looks like that layer is actually quite promising hmm yeah so the other thing is the concept of a gradient normalization so basically you take the grain and then you normalize it before you make the update so this is not not something new so people done this before but it's not very popular and I think maybe one thing is that we are also trying this on bigger networks now ourselves but if you can think about this process this it will lead the process of discovering either new idea or discovered old ideas but we actually overlook because we have so many ideas in machine learning that tend to overlook them so but some of the recent data it looks promising some of the ideas that it found do you think there's an opportunity to use a technique similar to what you did earlier with C far where you apply Auto ml 0 in a small way and then scale it up to 
bigger problems or networks yeah I asked you we still thinking about how to do it effectively but basically basically you right so the problems that we did in Auto mf0 is like an it's not even see far it's a down scale sefa it's a small version of sefa so and then you know the concept like gradient normalization or things like you know mount my linear interaction it's something that it found and then we can take some of these layers and transfer into big problems now unfortunately the problem in sefa it's so down scale that you we cannot present it like an image so it is only one D and in 1d you cannot find things like convolutional neural networks so that's a that's a limitation but in the next step we try to make it have like make it see an image rather than just a 1d image now that's that's one aspect which is basically how to scale this it more effectively which we did it before the other thing that we did is to zoom in into a particular aspect of the neural net F can you do better so related to this is the paper that we publish you know a couple days ago on evolving a better activation and normalization layer so basically in in a neural net people use this layer compassion um and relu a lot right this is interest net if you use resonant you have flash Naumann radio and then there's a skip connection and we say let's use this into a single layer and search for a new mathematical operation to replace this bash norman trailer so we accept the rest of the network but we search for a mathematical operation from from to define a new layer and it seems like to find a very good layer as a replacement and it works better better than fashion element radio and is the motivation they're primarily network performance or is it computational or what what is driving you to focus on those particular layers so bash Nam and drill ooh what did the bash nombre lou is a good thing about it it is allows you to train with very big match size right but when it when it's very small outside it doesn't work very well so it's a layer that you know Google would like but academia and more most data scientists could not like because this you know you don't have a big computer to train with a big batch hmm so this is a replacement called group norm and drill ooh which is very good but it works very well on small batch size but on the big bad size it's still a bit worse and then then then bash norm so so it's still still confusing were too many people what layers to use even at Google so developing a new layer is that first of all bash no value is if it can be a good replacement for bash non-renewed that means the layer can be used by both internal researchers and external researchers well and the second thing is ma chambre du helps training it stabilize the training a lot it's I speed up the training so we use it a lot in our work so who's really want to improve on that aspect it plays a huge role there's a there's a paper where does it say that you just train only the bottom layers and don't train the convolutional nets and you still can get a good performance you know it's not great but it's good performance it means that you know the the these layers play a very good important role and as I'm talking about training those layers only those layers from scratch or fine-tuning only those layers trained those are from scratch and so is the idea that you're applying techniques like what you've done it auto ml0 to finding these new layers or is that a totally separate a pro it's a it's a highly related and I would say 
you know it's it's like a replication of this idea of zero yeah yeah cool you so you've also been working on semi-supervised or self supervised learning recently yes can you describe some of that work and how it relates to this stuff yeah sure so I've been working on this automatic machine learning and automated architecture design and so on and I gave talks right about my this new development and a lot of people came to me and complained they say you know you automate the design of the neural networks but I have more data than you so I bid you so so I say oh that's a that's a good point so I thought about ok ultimate machine learning is cool but can you automate the labeling process can you automate labeling right because most people would prefer to have more data because more it is it's very it's very important right I don't mean that architecture is not important but having more data is also very important mm-hmm so the question is can you automate the process of labeling later right so today you don't have anything today you basically get some elaborate and you give it to some human experts and then they labeled the data for you annotate that it for you so techniques like active learning that can help ya the best labels the best data to label which yes that's right yeah active learning was speed up that process by selectively choose the the example to annotate the data mm-hmm now so one idea that we had is the concept of pseudo labels so faked labels can you take your model and generate and evaluate on the unlabeled set and now you have weekly label data our observation is this right like can you take your own model generate labels on a new set up and label data and then put it back assuming that they are correct labels and train the new model on that well actually I tried this idea many many years ago to and it didn't work and the problem is that if you take the model and generate the new new label data the new Whitley label data some of them are accurate you know a three would get a tree like that label of three but sometimes a tree would get a label of five and this error would propagate propagate into the new training and it would hurt the new training and the new training would not get better result than the old training right because it's a the confirmation bias going on mm-hmm now you so I I did have I did not know any way to fix that problem so so I thought the concept of self labeling is a is too good to be true but recently we realized that this a way to overcome this process is when you train the new training you inject a lot of noise into the new student so that's so you have a teacher the generate labels on a liberator you have a combined set of true labor and Whitley labor data or pseudo labor data and you train new student on this new combined set when you train the student make sure that you insert a lot of notes hmm so the a lot of noise is a student will we still don't know this how it happened yet but the noise in the students seems to have this process that it will overcome the confirmation bias probably because it will make the student more robust right you will not trust the labels all that much exactly right so it alone not to trust the label that much because it has to cope with so much noise and amazingly it actually outperform the teacher so the noise that we use is basically things like you know drop out and drop certain parts of the model they documentation and super aggressively Bundesrat documentation and noise and eventually it would do better than the 
student and you can just iterate the process right once you have a better student you label new data and then you put back so we we been doing this and it seems to work very well and how what's your what's your performance metric or your benchmark yeah so we worked on this deficit core image net and so I think when we work on this data set the day of the art was like 82% and then using architecture search we've pushed it into 85 percent eighty five point four something like that and then using this auto label process you get to eighty eight point four so three percent improvement so and keep in mind that 1% improvement on imagenet at that high range is very difficult yeah and I'm talking about top one accuracy okay and so is this are you is the setup here that you are you starting with your standard kind of 70 percent training and 30 percent test ratio or does that matter in this oh so you're having your your student label the that 30 percent like these ratios come into play into inside okay so we just follow the conventional imagenet set up so emission that has us already have a split up you know maybe 1.2 million example of training and a hundred thousand for validation and for for our label data you would use a different corpus so at Google we have this data that data set called jft that has about 300 million images mm-hmm I said we operating other data beyond an image net that labels using a model trained on image net yes she don't have any labels for I mean either you don't have any labels for your external data set but you're just labeling it based on image net some of those are going to be wrong so you introduced the noise and it all seems oh yeah yeah so basically yeah so here's something that we find really surprising there's one experiment in the paper that people did not check but it's super exciting is that we can propagate back images that so you know images has a thousand categories right right you know some flowers some dogs and some cats and so on now we we propagate back images that don't have that don't look like any like any categories in 1000 AD agree so could be like some kind of very strange anymore right and the model would suggest like Isis that imagenet yeah it doesn't belong to any yeah any categories and it still helps this vice is saying that you know this image is not any of these category and this put a lot of low low percentage know probably the on a lot is category from propagate back still okay like that that's not consistent with consistent maybe with the idea that the images are even if they're not you know helping tune your classifier layers at the end they're helping the network learn textures and you know low-level features better yeah so I think it what it really happening is that it just tried to learn information about natural images right and the fact that basically you give it a label so the teacher give it a label and then you when you do data documentation you ship the image a little bit right and then you say that the label is the same so you know the probably the table looks very similar so the model has to work really hard to type and sister in the production so doesn't and natural image in general is this very similar in similar ways right so it learns to be consistent in in terms of labeling for new images and maybe that's the reason why it's very helpful even though the images might not have anything to do with with your data set mm-hmm and you know I'm curious in kind of articulating that intuition for what's happening is that based on 
you know all of your experience working with these kind of networks so did you perform specific experiments to try to understand you know what the effect might be and you know what were those experiments oh okay so network introspection or any kinds of things and just curious you know what are the kinds of things you've done in order to deepen your understanding of why this is working are you mean this particular experiment or in general this particular experiment okay so we try to lower the threshold right so when you take the your model and then label new a couple of 300 million images we filter out we have a chance to suit our low confidence images right so things that you know have very low prediction low probability of having to be in any one of the class so we can keep you know only 10,000 images or we can keep 32 I'm sorry 1 million images or 30 million images or you know 300 million images Oh any anywhere in between right so we vary the special and it seems like actually the treasure can be quite really low mm-hmm right yeah so it could be you know in the 130 million or something like that and still okay and then we visualize the image that are actually very low accuracy and a low of confident and then we visualize them and see what they look like and they don't apparently they don't look like anything like an image net so that's what we found why the fact that they have full is very surprising but why they have full we still don't know so is this a hypothesis we don't we don't know yeah yeah yeah I which is basically our current what is trying to analyze wise careful okay and what do you expect your direction to be and analyzing that we probably gotta look into the hidden state of the neural network and you see we see you know with this a low confident where is it is it trying to make the hidden state more consistent to each other which is basically the same phenomenon that a lot of people do in contrastive learning in sales supervised learning is that they also have two images a plate the Augmented image and they try to make sure that the prediction is a set and if you have two images of unrelated images you make sure the prediction is different right so I think this is what happened in here so we were gonna visualize some of the hidden state of the neural network with and without these low confident example and see you know what happened during training okay that's that's one direction then we think we'll be doing okay okay yeah cool you were also an author on Mina about that work yeah so many years ago I worked on something called sequence sequence learning you know end-to-end neural networks through NLP and that's used for translation and I spent like two years after that try to be like a chat bot to chat with me because I always fascinated you have an agent that can talk to me you know elegent open all your emails and you know might you or were you trying to were you trying to have a communication with the chat but are we trying to replace yourself quote-unquote with the chat bot by having other people be able to talk to your both you know just talking to computers is it's fascinating yeah yeah and so I failed and then we I have white and then we ended up meeting this person called at Google and a very great engineer at Google or Daniel and he he's saying how about let's see work together to make this better and we did a lot of work on scaling up some of the models that we build we collected a huge amount of social media data even people talking on the internet and then we 
trained them huge model like a model I you know maybe a hundred times because I what I trained back in the day and it can do mounted on conversations started doing mounted on conversation very well and one of the magic moment that I really think that surely magic is that it actually invented like a joke it invented a joke so people talk it's it's it made this pun that you know horses go to Hayward so cows go to have what cows go to Harvard and horses go to Hayward mm-hmm but it's very fascinating remember it's been a while since I've looked at that one but I remember it was the you know when you described it as inventing this joke it wasn't anywhere in the training data yes you could find right yeah the only instance of mentioning hey what is doesn't have very any anyway there's only one instance I've mentioned in the world haver in the trainee later and we look at that context and it has nothing that looks like we like horses go to hybrid at all and so what do you think what's happening there how did that work and that's unlike what we see in kind of conventional language models like even the big ones they're picking up stuff that they've seen before generally right okay so I I can't tell you my version my version is following my version it's a following we still don't know right that's what I say it's a magic moment I think first of all a lot of social media jokes about puns mm-hmm a lot of you know like we find puns kind of funny so and Sony yeah and a pony and so in the training data we train with by pair at coding meaning that with well and we break it down into pieces yeah so isn't a ver it's not a world is bright to us hey but right at Harvard and etc so it must have learned the concept of punt and he would understand that you know Harvard and Harvard somehow is kind of related so it made up this bun so two things right learning the concept of puns and learning how to put together like a new concept and yeah there's a lot of new things like it make a jokes like it makes jokes about why chicken cross the roads and things like that I don't remember the particular jokes I could find it and send it to you know it make all kind of new jokes that we never found in the training data and it's truly fascinating and so where do you you know that line of work where do you see that going well what were the key you know was this was this kind of simply you know quote unquote an instance of scaling up the training or were there some new techniques or novel things that were developed you know in the model that you can see applying elsewhere I see well first of all it basically tells a lot of people that scaling up is very effective way to be you know NLP models right like I spent two years fell in Lou in that chat bar and then suddenly someone came along and found a solution and the solution is kind of simple which is basically just make the model a lot larger that we also found a lot of new new interesting insights is that during the process of muting the bar we had a lot of difficulty how to measure the performance of the bar so basically we whatever we want to measure is how human-like it is and we want to be about that have a conversation like a human life so we to measure human light we always want to ask human to look at a conversation that we have with about because we knew there many models of the box right many many models of the box and every model we have to look at a use human expert to look at their accommodations and hey I'm human like this is so new render process what is that we 
observe is that as the perplexity of the model so you know as you train the model better so the objective function is called perplexity it's like a very local objective this is basically how how well you predict the next word and we've noticed that this objective function correlates very well with human judgment of the human likeness of the bot and and then we did a plot of perplexity which is the local objective function you know technically predicting the next word and human likeness and we saw like a strong correlation so that's basically another contribution for NLP is that actually this local objective function which is just predicting the next word if you do a good job at it turns out is also a more global objective function which is human likeness how how I can behave like a human so local mimicking is global mimicking mmm-hmm what's interesting about that is that you know if you're kind of predicting the next word what makes jokes and puns work is that the next word is a surprise yeah so how do you get something that's good at jokes but also is optimizing on perplexity yeah that's that's a reason why I say it's very surprising so this uh let me try to say it again so there must be a global objective function that we are optimizing we like making fun so that we can get engagement right making fun saying something meaningful yeah right that's the global right that's something that's not just optimizing the next word but what I'm just claiming is that the local objective function which is actually just predicting the next word it somehow correlates to the global right and I mean I guess I could say that what perplexity is doing in the case of the joke is predicting surprise it's not like blind to the surprise but it's predicting the surprise yeah you could say that yeah yeah huh interesting cool well thank you so much Quoc for taking the time to share with us what you're up to and and kind of walk us through these you know recent works here is really really interesting stuff yeah thank you so much it's a pleasure to talk to you same-same awesome awesome thank you all right everyone that's our show for today to learn more about today's guest or the topics mentioned in this interview visit twimlai.com of course if you like what you hear on the podcast please subscribe rate and review the show on your favorite podcast thanks so much for listening and catch you next time
Andrew Young and Jeff Dean and Greg Guerrero was there and I was the intern so we started a quest well that sounds very cool yeah and then I did some of the you know scaling up neural networks with the Google brain folks and then you know and then after two years I did some work on machine translation with the idea and audio video is now a bit my mentor Ilya is now at open air and we developed some of the end to end translation methods and and then around 2016 I started looking into more like you know auto email and you know architecture search yeah and more recently I look into more like together with outer may also look into semi-supervised learning and and so on awesome awesome now you mentioned early on doing work with Alex mola was he was this before he was at Carnegie Mellon or was he visiting in Australia he was a professor in Australia oh yeah yeah I I went to a university in a small area in the capital city but Australia Canberra Emre he was a camera and he was a professor there doing research so I thought you know I had I have been long very interested in AI and machine learning and I took before that I took a class in data mining and so on and I thought you know it's a little bit boring but machine learning the ability to to actually learn it's actually super fascinating so I contacted him and he he was doin like Kudo methods machine learning and we we worked together for maybe a few years yeah before he went to he went to America and then CMU and okay okay so a lot of your recent work has been focused on this idea of you know automating machine learning and neural architecture serves to allow machines to find the best deep learning architectures and like talk a little bit about how you arrived at working in that area and what some of the motivations were for getting started digging into that problem yeah yeah yeah so I've been a long interested in this idea of self-improvement machine should be say improving itself machine learning right and even and when I started doing coder methods with Alex I always asked him you know how the code the code or bandwidth and so on how some of the high program it does include all methods decided and you know apparently they decided by using things like cross-validation and so on and then when I work on koto methods sorry neural networks my hope is to make the hyper parameters go away right but that's how is the opposite so if now if you look at a convolutional neural networks is it has a lot of hyper parameters right like how many number and how many layers you want it to be and how many channels you want it to be and what are the sort of higher size of high parameters and so on a cooler with and so on and so measurement all the training parameters yeah yeah learning Dre and as researchers develop more and more techniques for neural nets there's more decisions that you have to make I feel like this is like a problem that you know can be helped by a little bit of automation so in so I I observe a lot of my colleagues at Google when the designer brew networks and I asked them about the principles of designing your neural networks and you know you started out having some really solid suppose like let us skip connection so the gradient can flow through the network and so on but as you tune the network harder and harder you no longer have the principal it's around you know trial and error right you you try this a little bit and it seems a little bit better so you try you try that more so I think that that is something that may be ready for 
automation so even during my grad school I already talked about trying this but I thought you know maybe we didn't have enough compute because training and that already takes took me days so when I saw that needle you can train your net in 30 minutes or something like that you know for all sefa I thought oh maybe this is the right time to try this so that's when I started doing this neural architecture search in 2016 mmm it's interesting that you know even with all of the compute resources of Google you had to wait until the time was compressed enough in order to be able to tackle the problem yeah yeah that's how that to get really good results you want the network should be really big and that will take a lot of time to Train yeah and it's it's it's funny coming from me that we have so much resources at Google but training neural nets still I've taken a long time yeah and so maybe talk about the first steps in in that area did you jump right into neural architecture search or was that the you know an end stage or end result of this work oh well you know I I work on some of the related ideas on and off since 2012 like playing around with how to do better high parameter tuning for new networks and none of that is really published because I didn't have good results you know I didn't have another computer and so on so so I tried it on and off hmm over the time you know every year I would set some time to try this idea for a few months and you know and it didn't work very well because like a pro q and so on and then throughout 2016 I met Barrett's off with my colleague now at Google and he's very talented so we say oh let's let's try this idea of using like a reinforcement learning to generate a network like a little layer in a network for for a CFR model see if a model is already at that time you could say that you know train in a few depends on how where you want it to be but you know from 30 minutes to a few hours and that seems like about the right amount of time to get this going and my prediction is that you have to train maybe either of between from 1,000 to 10,000 Sbarro's and I need a back of our envelope calculation and I thought oh this might be the right time to do it but you know I have tried this some of these related ideas in you know much before that so you're doing a c4 which is an image recognition or object detection and images and so you're doing you've got this when you say you're doing the when you say you're doing tens of thousands of models is that at part of the optimization process that you're describing here you're expecting they need to do 10,000 in order to optimize the hyper parameters yeah so so in this process you have a controller which is also a machine learning model right and every time it makes an update that's basically it has to it has to start a try train in one model to convergent one see far more to convergence and it will take the signal from the convergence of the CNS if our model you know maybe the CFM model will get 70% so that 70% will be used as a signal to make one update for the controller one update right so typically machine learning models take a long time to train so it requires requires you know ten thousands of updates so that's basically the number of models but we had to try were your initial attempts at this doing or you know how do you distinguish between like doing your hyper parameter optimization and kind of architecture search guess there's a varying degree of complexity in trying to come up with these new architectures yeah and 
your more recent work and this is like using evolutionary algorithms and the like to do this can you maybe talk through kind of the progression of complexity that you won through yeah so the first project that we did was architecture search for like a CFR model and that's already was already very expensive back when we did it and we didn't choose hyper parameters you only in other words we didn't choose high parameters like learning rate or ydk or drop out we just focus on architecture or hyper parameters basically you know number players what kind of player do you use in what stage in the network and that's already took us like almost a week for every time we try this it takes like a week on a few hundred GPUs so after that we moved to emission at and because imagenet the network is bigger we decided to use this idea called transfer learning which is basically find a module that works well on CFR 10 and transfer to a Mishnah because searching directly on a mission that would be very difficult this is very expensive now in parallel with that we also try a lot of methods in not only in reinforcement learning and but also in evolution and we also observe the evolution it goes as well or even better than reinforcement learning so we slowly adopt more reinforce our evolutionary methods and then so some of the first in any intuition for why evolution works better than reinforcement learning oh I see so evolution here is very flexible and very easy to implement right for example in evolution you just need to decide mutation and crossover mutation meaning you have a network and then you just changes a little bit and across over minitek two networks and make them so implementing evolution is actually quite easily now in contrast reinforcement learning because we're not experts in reinforcement learning so we we try a lot of reinforcement learning methods like we started out with reinforce and then we tried it something like PPO and so on and yeah Pio recently and they're good but you know it but they also require a fair bit about opportunity to get working very well so and on the other hand evolution seems to be very flexible in terms of implementation and it's also one thing that evolution does quite well is that it's easy to diversify the models so you can just try to in reinforcement learning one thing it happens it will zoom in into a particular area area in the Subspace and find and optimize the particular model whereas in evolution it would diversify the population over time it's easy to control the diversification process to get better models so we we use more evolution methods now and then feed-forward okay so in the second project we already found one network that was kind of state of the art in computer vision on par or slightly better on on these on the state of the art of image net so that was super excited because we didn't think that it was possible and then after that we realized that this transfer learning has a problem that you know you transfer the cell and maybe the part of things that you want or sefa and the kind of cells as you want for image net that's quite different so we started searching on image net directly basically you search a model on internet directly but that became super expensive you know our back of envelope calculation would take a few months for this is the finish so we realized that maybe one idea that we can have is such on small scale such a smaller model or Chifa let's say instead of searching the biggest model possible that you could find search for a 
Fast-forward: in the second project we found a network that was state of the art in computer vision — on par with, or slightly better than, the state of the art on ImageNet. That was super exciting, because we didn't think it was possible. But after that we realized that this transfer learning has a problem: the kind of cell you want for CIFAR and the kind of cell you want for ImageNet are quite different. So we started searching on ImageNet directly, but that became super expensive — our back-of-the-envelope calculation said it would take a few months to finish. So we had one idea: search at a small scale. Search a smaller model, on CIFAR say — instead of searching for the biggest model possible, search for a small model and train it for only about five epochs, in hours or so. Then, after we found a good model, we figured out a way to scale it up in size: give it a larger image, or make it deeper, or make it wider.

And that scaling up is done in a learned way?

Yes, we scale it up in a learned way. In the second stage, what we did was basically learn the way that we should be scaling up, and it works very well. That became something called EfficientNet, which is now used quite a bit in various places at Google and in academia. And the smallest network that we found turned out to be super helpful for mobile devices, because small networks tend to be quite fast on mobile — that became something like MobileNetV3 at Google.
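The "learned scaling" idea is what the EfficientNet paper calls compound scaling: fix one coefficient per dimension once, then grow depth, width, and input resolution together with a single exponent. A small illustrative sketch — the alpha/beta/gamma values are the ones reported in that paper, while the base numbers here are made up:

```python
# Compound scaling in the spirit of EfficientNet: the per-dimension coefficients are
# fixed once (the paper picks them with a small grid search under the constraint
# alpha * beta**2 * gamma**2 ~= 2), then a single exponent phi scales everything.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution multipliers

def scale(base_depth, base_width, base_resolution, phi):
    return (round(base_depth * ALPHA ** phi),        # more layers
            round(base_width * BETA ** phi),          # more channels
            round(base_resolution * GAMMA ** phi))    # larger input image

for phi in range(5):
    print(f"phi={phi}:", scale(base_depth=18, base_width=32, base_resolution=224, phi=phi))
```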
After that we said, okay, now we can get state of the art on ImageNet, but the problem is that a lot of the building blocks we used were pre-described by human experts: we had to make use of the ReLU layer designed by a human expert, the convolution layer designed by a human expert, the batch norm layer designed by human experts. So we asked: can we design everything from scratch? Basically, assume we have a numerical library like NumPy — which is essentially just matrix-vector multiplication and a bunch of nonlinearities — can you use that library to evolve the concept of machine learning? That became something called AutoML-Zero. With AutoML-Zero we didn't get state of the art yet, but what's exciting is that it generates a program, from just matrix-vector multiplication, to do machine learning, which is super exciting, and I hope that using this method we will discover new fundamental building blocks for machine learning.

If anyone listening hasn't taken a look at the blog posts or the paper for AutoML-Zero, it's really interesting. There's one particular diagram — I've seen a couple of different versions of it — that walks through the process the algorithm takes to learn a model, shows the various steps it introduces, as well as the program it outputs. It's super interesting. Talk a little bit about this idea of having the model work by evolving a program — where did that come from? I was just going to say, a lot of what we do is all software, all programs, but this in particular is arithmetic operations that, in a very simple way, define all the steps used to evolve these algorithms.

Yes. If you think about what machine learning experts are doing nowadays: they look at a computer program — they use TensorFlow or some package with a bunch of layers — and then they figure out how to write a new program to do machine learning. Say you want to do forecasting: you look at how people use LSTMs, and then you put together a computer program to do forecasting. Now, the act of writing that program is still not learned — it's basically human expert knowledge, and changing that program can affect the quality of the model greatly. Stepping back a little bit, that program also assumes a lot of knowledge about machine learning: the concepts of gradients and backpropagation are taken for granted, because automatic differentiation is built into TensorFlow and Python. So we said: maybe what we can do is step back from Python and TensorFlow and start from NumPy, and ask — using NumPy, can you put together a small computer program that can do, say, CIFAR classification?

In the setup that we have, there are three functions. One function is setup — that's like the DNA you start with. Then there's a predict function — whatever you have learned, you have to use it for survival: given a situation, you have to make some prediction. And the third function is the learn function — it has to learn so that the predict function gets better over time. So the template has only three functions — setup, predict, and learn — and AutoML-Zero has to fill in the rest of the program, the instructions within these building blocks.

But it's starting with those three functions being totally empty?

Yes, it starts with these three functions totally empty. So at the beginning — amazingly — it will do nothing. Most of the programs will be garbage, so you have no signal at all.

No signal at all.

Then, by some random luck, it will find some kind of dot product — a linear kind of layer — that is somehow just slightly better than random. That's basically the predict function, a little bit better than random. Then it slowly puts together one more layer to become something like a neural net, and then it invents the concept of gradients. So it starts from a very small program that does something linear and, going through many, many steps, eventually arrives at a neural network.
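To show what that template looks like, here is a toy version in Python/NumPy: a candidate "program" is three instruction lists — setup, predict, learn — operating on a small shared memory. Evolution would mutate those lists; here the instructions are hand-written to implement a linear model trained by SGD, simply to illustrate the kind of program the search can discover. The instruction names and memory layout are made up for this sketch.

```python
import numpy as np

DIM = 4
mem = {"w": np.zeros(DIM), "x": np.zeros(DIM), "pred": 0.0, "y": 0.0}

setup   = [("assign_const", "lr", 0.01)]               # the "DNA you start with"
predict = [("dot", "pred", "w", "x")]                  # pred = w . x
learn   = [("sub", "err", "y", "pred"),                # err  = y - pred
           ("scale_vec", "grad", "err", "x"),          # grad = err * x
           ("scale_vec", "step", "lr", "grad"),        # step = lr * grad
           ("add_vec", "w", "w", "step")]              # w    = w + step

def run(instructions):
    # A tiny interpreter over the shared memory; evolution would mutate the lists above.
    for op, out, *args in instructions:
        if op == "assign_const": mem[out] = args[0]
        elif op == "dot":        mem[out] = float(np.dot(mem[args[0]], mem[args[1]]))
        elif op == "sub":        mem[out] = mem[args[0]] - mem[args[1]]
        elif op == "scale_vec":  mem[out] = mem[args[0]] * mem[args[1]]
        elif op == "add_vec":    mem[out] = mem[args[0]] + mem[args[1]]

true_w = np.array([1.0, -2.0, 0.5, 3.0])
run(setup)
for _ in range(2000):                       # one "lifetime" of the candidate program
    mem["x"] = np.random.randn(DIM)
    mem["y"] = float(true_w @ mem["x"])
    run(predict)
    run(learn)
print("recovered weights:", np.round(mem["w"], 2))
```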
I referenced this diagram — if I'm understanding it correctly, you're identifying all these points where the algorithm evolves techniques that humans use today: it eventually figures out how to do SGD, it eventually figures out things like ReLU, and it figures out techniques like gradient normalization and so on. Am I reading that correctly — that this is all stuff it figured out?

Yeah. If you lay out the whole sequence: in the first steps it will find something like a linear model, it will find loss clipping, it will find learning rates, it will find ReLU, normalizing the input, normalizing the gradient, and then having bilinear interactions — things like that. Things that were developed over maybe the last thirty years of neural network research.

Just so I understand — there are no priors, no recipe book of techniques given to the model. The algorithm is figuring all of this out from nothing?

The caveat is that AutoML-Zero has access to 64 functions from NumPy, so there is some bias there. With those 64 NumPy functions, the way the linear algebra works is very much in favor of neural nets, so it will tend to develop things like neural nets because of the linear algebra. But that's the only thing: those operations are there for it to use, and eventually it's going to try to use them on something. It would be hard for it to find something like trees, because NumPy doesn't have functions that are suitable for trees. You can argue that NumPy is a library that is very well suited to neural nets, so it will evolve things that eventually look like a neural net. Now, what's surprising is that the whole process of developing models — starting from a linear model, then developing learning rates, normalizing the gradient, and so on — looks very much like our own evolution of developing machine learning models.

And has this allowed you to see future techniques that we may learn to apply?

Yes — although we haven't found anything extremely novel in the sense that it has never been seen before, we have found things that we hadn't looked into closely. In particular, it found this bilinear layer. Normally in a neural net you take x, multiply it by some W, and then apply a nonlinearity. What we found during AutoML-Zero is a form like x·W·x, with a nonlinearity applied after — basically some kind of bilinear interaction. Apparently this concept had been developed by other colleagues at Google, but we didn't know about it before our project. We are in the process of trying this layer on larger problems, and it looks quite promising. The other thing is the concept of gradient normalization: you take the gradient and normalize it before you make the update. That's not new — people have done it before — but it's not very popular, and we are also trying it on bigger networks ourselves. If you think about this process, it leads to discovering either new ideas or old ideas that we overlooked — we have so many ideas in machine learning that we tend to overlook some of them — and some of the recent results for the ideas it found look promising.

Do you think there's an opportunity to use a technique similar to what you did earlier with CIFAR, where you apply AutoML-Zero in a small way and then scale it up to bigger problems or networks?

Yes — we're still thinking about how to do that effectively. The problems we used in AutoML-Zero — it's not even CIFAR, it's a down-scaled CIFAR, a small version of CIFAR. Concepts like gradient normalization or bilinear interactions are things it found there, and we can take some of those layers and transfer them into big problems. Unfortunately, the problem is so down-scaled that we cannot present it as an image — it's only one-dimensional — and in 1-D you cannot find things like convolutional neural networks, so that's a limitation. In the next step we will try to make it see an image rather than just a 1-D input. So that's one aspect: how to scale this up more effectively. The other thing we did is to zoom in on a particular aspect of the neural net and ask: can you do better there?
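For concreteness, here is how those two ideas might look written naively in NumPy. Whether the bilinear form is applied elementwise or as a full quadratic form isn't specified here, so treat the exact expression as an assumption; the gradient-normalization step is the straightforward reading of "normalize the gradient before the update".

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W = rng.standard_normal((8, 8)) / np.sqrt(8)

# Standard layer: nonlinearity(W @ x).
standard = np.maximum(W @ x, 0.0)

# Bilinear-style variant in the spirit of what the search surfaced: the input
# enters twice (x interacting with W @ x) before the nonlinearity. The exact
# form used in follow-up work may differ; this only shows the shape of the idea.
bilinear = np.maximum(x * (W @ x), 0.0)

# Gradient normalization: rescale the gradient to unit norm before each update.
def normalized_sgd_step(w, grad, lr=0.1, eps=1e-8):
    return w - lr * grad / (np.linalg.norm(grad) + eps)

print(standard.shape, bilinear.shape,
      normalized_sgd_step(np.ones(3), np.array([3.0, 4.0, 0.0])))
```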
Related to this is the paper we published a couple of days ago on evolving a better activation and normalization layer. In a neural net, people use the combination of batch norm and ReLU a lot — in a ResNet you have batch norm, ReLU, and then a skip connection. We said: let's treat that as a single layer and search for a new mathematical operation to replace batch norm plus ReLU. So we keep the rest of the network fixed but search for a mathematical expression that defines a new layer, and the search seems to find a very good layer as a replacement — it works better than batch norm plus ReLU.

Is the motivation there primarily network performance, or is it computational? What is driving you to focus on those particular layers?

So, batch norm and ReLU: the good thing about them is that they allow you to train with a very big batch size, but with a very small batch size they don't work very well. So it's a layer that Google would like, but academia and most data scientists might not, because you may not have a big computer to train with a big batch. There is a replacement, group norm plus ReLU, which is very good — it works very well at small batch sizes — but at big batch sizes it's still a bit worse than batch norm. So it's still confusing to many people, even at Google, which layers to use. Developing a new layer matters because, first, if it can be a good replacement for batch norm plus ReLU, the layer can be used by both internal and external researchers. And second, batch norm plus ReLU helps training — it stabilizes training a lot and speeds it up — so we use it a lot in our work, and we really want to improve on that aspect; it plays a huge role. There's a paper showing that if you train only the batch norm layers, and don't train the convolutional layers, you can still get good performance — not great, but good — which means these layers play a very important role.

Are you talking about training only those layers from scratch, or fine-tuning only those layers?

Training those layers from scratch.

And is the idea that you're applying techniques like what you've done with AutoML-Zero to find these new layers, or is that a totally separate approach?

It's highly related — I would say it's like an application of the AutoML-Zero idea.
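The two hand-designed baselines being compared here — batch norm followed by ReLU, and group norm followed by ReLU — can be written out directly; the evolved layer searches over expressions built from primitives like these statistics and nonlinearities. A minimal sketch for an (N, C, H, W) activation tensor:

```python
import numpy as np

def batchnorm_relu(x, eps=1e-5):
    mean = x.mean(axis=(0, 2, 3), keepdims=True)     # statistics across the batch
    var = x.var(axis=(0, 2, 3), keepdims=True)       # -> wants a large batch size
    return np.maximum((x - mean) / np.sqrt(var + eps), 0.0)

def groupnorm_relu(x, groups=4, eps=1e-5):
    n, c, h, w = x.shape
    g = x.reshape(n, groups, c // groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)     # statistics per example
    var = g.var(axis=(2, 3, 4), keepdims=True)       # -> batch-size independent
    return np.maximum(((g - mean) / np.sqrt(var + eps)).reshape(x.shape), 0.0)

x = np.random.randn(2, 8, 4, 4)
print(batchnorm_relu(x).shape, groupnorm_relu(x).shape)
```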
Cool. You've also been working on semi-supervised, or self-supervised, learning recently. Can you describe some of that work and how it relates to this?

Sure. I've been working on automated machine learning and automated architecture design, and when I gave talks about these new developments, a lot of people came to me and complained. They said: you automate the design of the neural networks, but I have more data than you, so I beat you. And I said, that's a good point. Automated machine learning is cool, but can you automate the labeling process? Can you automate labeling? Most people would prefer to have more data, because more data is very important — I don't mean that architecture isn't important, but having more data is also very important. So the question is: can you automate the process of labeling data? Today you basically get some unlabeled data, you give it to human experts, and they label it — annotate it — for you.

And techniques like active learning can help you pick the best data to label.

That's right — active learning speeds up that process by selectively choosing which examples to annotate. Now, one idea we had is the concept of pseudo labels — fake labels. Can you take your model, run it on the unlabeled set, and thereby get weakly labeled data? Our observation is this: take your own model, generate labels on a new set of unlabeled data, put those back in as if they were correct labels, and train a new model on that. Actually, I tried this idea many years ago and it didn't work. The problem is that if you take the model and generate new weakly labeled data, some of the labels are accurate — an image of a three gets the label three — but sometimes a three gets the label five, and that error propagates into the new training and hurts it, so the new model doesn't get a better result than the old one. There's a confirmation bias going on. I did not know any way to fix that problem, so I thought the concept of self-labeling was too good to be true. But recently we realized that the way to overcome this is to inject a lot of noise into the new student when you train it. So you have a teacher that generates labels on unlabeled data, you have a combined set of truly labeled data and weakly labeled — pseudo-labeled — data, and you train a new student on this combined set; and when you train the student, you make sure to insert a lot of noise. We still don't fully understand how this happens, but the noise in the student seems to overcome the confirmation bias.

Probably because it makes the student more robust, right? It won't trust the labels all that much.

Exactly — it learns not to trust the labels too much, because it has to cope with so much noise. And, amazingly, the student actually outperforms the teacher. The noise we use is basically things like dropout, dropping certain parts of the model, and data augmentation, applied quite aggressively. Eventually the student does better than the teacher, and then you can just iterate the process: once you have a better student, you use it to label new data and put that back in. We've been doing this, and it seems to work very well.
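A very small sketch of that teacher-student loop, using logistic regression on synthetic data so it runs in seconds. The task, the form of noise (input dropout plus Gaussian "augmentation"), and the single iteration are all stand-ins; the point is only the sequence — train a teacher on labeled data, pseudo-label the unlabeled pool, then train a noised student on the combined set — not the particular numbers it prints.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):                                 # toy binary task
    X = rng.standard_normal((n, 10))
    y = (X @ np.linspace(-1.0, 1.0, 10) > 0).astype(float)
    return X, y

def train_logreg(X, y, epochs=300, lr=0.1, input_dropout=0.0):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        Xn = X * (rng.random(X.shape) > input_dropout)   # noise injected into the student
        p = 1.0 / (1.0 + np.exp(-(Xn @ w)))
        w += lr * Xn.T @ (y - p) / len(y)
    return w

X_lab, y_lab = make_data(200)                     # small labeled set
X_unlab, _ = make_data(5000)                      # large unlabeled pool

teacher = train_logreg(X_lab, y_lab)              # 1) train the teacher on labeled data only

pseudo = (X_unlab @ teacher > 0).astype(float)    # 2) teacher pseudo-labels the unlabeled pool
X_comb = np.vstack([X_lab,
                    X_unlab + 0.3 * rng.standard_normal(X_unlab.shape)])  # "augmentation" noise
y_comb = np.concatenate([y_lab, pseudo])          # 3) combine true and pseudo labels

student = train_logreg(X_comb, y_comb, input_dropout=0.2)  # 4) train a *noised* student

X_test, y_test = make_data(2000)
for name, w in [("teacher", teacher), ("student", student)]:
    print(name, float(((X_test @ w > 0).astype(float) == y_test).mean()))
# 5) in the real recipe, the student now becomes the teacher and the loop repeats
```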
What's your performance metric, or your benchmark?

We worked on ImageNet. When we started on this, the state of the art was around 82%; using architecture search we pushed it to about 85.4%; and then using this auto-labeling process we got to 88.4% — about a three percent improvement. Keep in mind that a 1% improvement on ImageNet at that high range is very difficult. And I'm talking about top-1 accuracy.

And is the setup here that you're starting with a standard split — say 70% training and 30% test — or does that matter? Is your student labeling that held-out portion? Do those ratios come into play?

We just follow the conventional ImageNet setup. ImageNet already has a split — roughly 1.2 million examples for training and on the order of a hundred thousand for validation. For the unlabeled data we use a different corpus: at Google we have a dataset called JFT, which has about 300 million images.

So you're taking data beyond ImageNet — data that you have no labels for — and labeling it using a model trained on ImageNet. Some of those labels are going to be wrong, so you've introduced the noise, and it all seems to work.

Yes. And here's something we found really surprising — there's one experiment in the paper that people did not pay much attention to, but it's super exciting: we can propagate back images that don't look like any ImageNet category. ImageNet has a thousand categories — some flowers, some dogs, some cats, and so on. We propagated back images that don't look like any of those thousand categories — it could be some very strange animal — so the model essentially says the image doesn't belong to any category, putting low probability on every class. Propagating that back still helps.

That's consistent, maybe, with the idea that even if those images aren't helping tune your classifier layers at the end, they're helping the network learn textures and low-level features better.

I think what's really happening is that the model is trying to learn information about natural images. The teacher gives an image a label, and when you do data augmentation you shift the image a little bit and then say the label should stay the same — or at least look very similar — so the model has to work really hard to be consistent in its predictions. Natural images in general are similar in similar ways, so it learns to be consistent in how it labels new images, and maybe that's why this is very helpful even when the images have nothing to do with your dataset.

I'm curious — in articulating that intuition for what's happening, is that based on your experience working with these kinds of networks, or did you perform specific experiments to try to understand the effect? Network introspection or anything like that — what kinds of things have you done to deepen your understanding of why this works?

Do you mean this particular experiment, or in general?

This particular experiment.

Okay. So we varied the threshold. When you take your model and label the 300 million images, we filter out low-confidence images — things that have a very low predicted probability of belonging to any one of the classes. We can keep only, say, 10,000 images, or 1 million, or 30 million, or all 300 million — anywhere in between. So we varied that threshold, and it seems like the threshold can actually be quite low.
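A sketch of that confidence filter — keep a pseudo-labeled image only if the teacher's top predicted probability clears a threshold. The function name and the fake teacher outputs below are illustrative; in the real pipeline the probabilities come from the ImageNet-trained teacher run over JFT.

```python
import numpy as np

def filter_pseudo_labels(probs, threshold):
    """probs: (num_images, num_classes) teacher softmax outputs."""
    confidence = probs.max(axis=1)        # top-1 probability per image
    labels = probs.argmax(axis=1)
    keep = confidence >= threshold
    return labels[keep], keep.nonzero()[0]

# Fake teacher outputs over 1,000 classes, just to exercise the filter.
probs = np.random.dirichlet(alpha=np.ones(1000) * 0.05, size=10_000)
for t in (0.9, 0.5, 0.1):
    labels, idx = filter_pseudo_labels(probs, threshold=t)
    print(f"threshold {t}: kept {len(idx)} of {len(probs)} images")
```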
It can go quite low — you can include 130 million images or so and it's still okay. Then we visualized the images with very low confidence to see what they look like, and apparently they don't look like anything in ImageNet. That's what we found. The fact that they are helpful is very surprising — and why they are helpful, we still don't know.

So this is a hypothesis —

We don't know yet. That's basically our current work: trying to analyze why it's helpful.

And what do you expect your direction to be in analyzing that?

We'll probably look into the hidden states of the neural network and see whether, with these low-confidence examples, it's trying to make the hidden states more consistent with each other — which is basically the same phenomenon a lot of people use in contrastive learning, in self-supervised learning: you take two augmented versions of an image and make sure the predictions are the same, and for two unrelated images you make sure the predictions are different. I think that's what is happening here. So we're going to visualize some of the hidden states of the network with and without these low-confidence examples and see what happens during training. That's one direction we think we'll pursue.

Okay, cool. You were also an author on Meena — talk about that work.

Many years ago I worked on something called sequence-to-sequence learning — end-to-end neural networks for NLP — and that was used for translation. I spent about two years after that trying to build a chatbot that could chat with me, because I've always been fascinated by the idea of an agent that can talk to me.

Were you trying to have a conversation with the chatbot, or were you trying to, quote-unquote, replace yourself with the chatbot by having other people talk to your bot?

Just talking to computers is fascinating. Anyway, I failed. Then we ended up meeting this person at Google — a very great engineer, Daniel — and he said, how about we work together to make this better? We did a lot of work on scaling up some of the models we had built. We collected a huge amount of social media data — people talking on the internet — and then we trained a huge model, maybe a hundred times bigger than what I had trained back in the day, and it started doing multi-turn conversations very well. And one of the moments I really think was truly magic is that it actually invented a joke. It made this pun that cows go to Harvard, and horses go to "Hayvard."

It's very fascinating. It's been a while since I looked at that one, but I remember that when you described it as inventing this joke, it wasn't anywhere in the training data that you could find, right?

Right. There is only one instance of the word "Hayvard" anywhere in the training data, and we looked at that context — it has nothing that looks like "horses go to Hayvard" at all.

So what do you think is happening there? How did that work? That's unlike what we see in conventional language models — even the big ones are generally picking up things they've seen before.
Okay — I can tell you my version; it's the following. We still don't know for sure; that's why I call it a magic moment. First of all, there is a lot of joking about puns on social media — people find puns kind of funny. And in the training data we use byte pair encoding, meaning words are broken down into pieces — so "Hayvard" is not a single word to the model; it's broken into pieces, like "hay" and the rest, and "Harvard" similarly. It must have learned the concept of a pun, and it understood that "Hayvard" and "Harvard" are somehow related, so it made up this pun. So, two things: learning the concept of puns, and learning how to put together a new concept. And there are a lot of new things like this — it makes jokes about why the chicken crossed the road and things like that. I don't remember the particular jokes, but I could find them and send them to you. It makes all kinds of new jokes that we never found in the training data, and it's truly fascinating.

So where do you see that line of work going? What were the keys — was this simply, quote-unquote, an instance of scaling up the training, or were there new techniques or novel things developed in the model that you can see applying elsewhere?

Well, first of all, it tells a lot of people that scaling up is a very effective way to build NLP models. I spent two years and failed on that chatbot, and then suddenly someone came along and found a solution, and the solution is kind of simple: just make the model a lot larger. We also found a lot of new, interesting insights. During the process of building the bot, we had a lot of difficulty measuring its performance. What we want to measure is how human-like it is — we want the bot to have conversations like a human. To measure human-likeness we would ask humans to look at conversations with the bot — we had many versions of the bot, and for every version we had to have human experts look at its conversations and rate how human-like they were. During that process, here is what we observed: the objective function we train with is called perplexity — a very local objective, basically how well you predict the next word — and we noticed that this objective correlates very well with human judgments of the human-likeness of the bot. We made a plot of perplexity, this local objective of predicting the next word, against human-likeness, and we saw a really strong correlation. So that is another contribution for NLP: this local objective function — just predicting the next word — if you do a good job at it, turns out to also track a more global objective, which is human-likeness, how well the model can behave like a human. Local mimicking is global mimicking.
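For reference, perplexity as described here is just the exponentiated average negative log-likelihood the model assigns to each actual next token. A minimal illustration with made-up probabilities:

```python
import numpy as np

def perplexity(next_token_probs):
    """next_token_probs: probability the model assigned to each actual next token."""
    nll = -np.log(np.asarray(next_token_probs))
    return float(np.exp(nll.mean()))   # lower is better

print(perplexity([0.2, 0.05, 0.5, 0.1]))   # a weaker model
print(perplexity([0.6, 0.4, 0.7, 0.5]))    # a stronger model -> lower perplexity
```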
What's interesting about that is that if you're predicting the next word — what makes jokes and puns work is that the next word is a surprise. So how do you get something that's good at jokes but is also optimizing for perplexity?

That's the reason I say it's very surprising. Let me try to say it again: there must be a more global objective function that we're optimizing — we like making jokes so that we get engagement; making jokes, saying something meaningful — that's global, that's something beyond just optimizing the next word. But what I'm claiming is that the local objective function, which is just predicting the next word, somehow correlates with that global one.

And I guess I could say that what perplexity is doing, in the case of the joke, is predicting surprise — it's not blind to the surprise; it's predicting it.

Yeah, you could say that.

Interesting. Cool — well, thank you so much, Quoc, for taking the time to share what you're up to and walk us through these recent works. Really interesting stuff.

Thank you so much. It's a pleasure to talk to you.

Awesome. All right, everyone, that's our show for today. To learn more about today's guest or the topics mentioned in this interview, visit twimlai.com. Of course, if you like what you hear on the podcast, please subscribe, rate, and review the show on your favorite podcast platform. Thanks so much for listening, and catch you next time.