Rethinking Model Size - Train Large, Then Compress with Joseph Gonzalez - #378

### Article: Insights into Efficient Training and Explainability in AI Models: A Conversation with Joseph Gonzalez

---

#### Introduction

In a recent discussion, Joseph Gonzalez ("Joey"), an assistant professor in the EECS department at UC Berkeley, shared insights into his work on efficient training strategies for large language models and the importance of explainability in AI. His research challenges conventional wisdom about model sizing and offers innovative approaches to making AI models more interpretable and practical for real-world applications.

---

#### The Trade-offs in Model Training

Joey’s team ran experiments that produced a counterintuitive result: larger models, trained with careful attention to batch size and hardware utilization, can actually converge faster in wall-clock time. The team had initially assumed that shrinking the model would make training cheaper and faster; instead, increasing model size while tuning the batch size got them to a target accuracy in less time.

One key insight from their work is that larger models are not inherently inefficient if properly managed: a wider model exposes more parallelism and makes better use of GPU cores that would otherwise sit idle, so its per-step cost grows sub-linearly with its parameter count. For instance, making a model six to seven times larger than standard configurations allowed the team to reach a target accuracy faster. This matters for researchers who have limited compute but still want to experiment with pre-training.
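To make that trade-off concrete, here is a rough timing sketch (not the paper's experimental setup) comparing the cost per optimization step of a standard-size versus a wider, deeper RoBERTa-style configuration on a single GPU. The configurations, batch size, and sequence length are illustrative assumptions; the sketch assumes PyTorch, the Hugging Face `transformers` library, and a CUDA device.

```python
# Hypothetical sketch: per-step wall-clock cost of a "standard" vs. a wider,
# deeper RoBERTa-style model on one GPU. Configs are illustrative only.
import time
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

def sec_per_step(config, batch_size=8, seq_len=128, steps=5):
    model = RobertaForMaskedLM(config).cuda().train()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    ids = torch.randint(0, config.vocab_size, (batch_size, seq_len)).cuda()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        loss = model(input_ids=ids, labels=ids).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps

small = RobertaConfig(num_hidden_layers=12, hidden_size=768,
                      num_attention_heads=12, intermediate_size=3072)
large = RobertaConfig(num_hidden_layers=24, hidden_size=1536,
                      num_attention_heads=24, intermediate_size=6144)

print("sec/step, small model:", sec_per_step(small))
print("sec/step, large model:", sec_per_step(large))
# If the GPU was underutilized by the small model, the large model's per-step
# cost grows far less than its parameter count, and because it reduces loss
# faster per sample it can hit a target perplexity sooner in wall-clock time.
```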

---

#### Balancing Training and Inference Costs

While increasing model size during training can speed up convergence, Joey emphasized the importance of also considering inference costs. Larger models require more computational power during inference, which can be costly in production environments. To address this, his team explored techniques like weight pruning and quantization to compress trained models without significant accuracy loss.

Their results showed that compressed versions of the larger models could end up smaller than conventionally sized models while retaining higher accuracy. This approach not only reduces inference costs but also makes the models easier to deploy in resource-constrained environments.
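As a hedged illustration of those two compression steps, the sketch below applies off-the-shelf magnitude pruning and dynamic int8 quantization from PyTorch to an already fine-tuned BERT-style model. The checkpoint name and the 40% sparsity level are placeholders, not the settings used in the paper.

```python
# Minimal compression sketch: unstructured magnitude pruning followed by
# dynamic quantization, using stock PyTorch utilities.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# 1) Weight pruning: zero out the smallest 40% of weights (by magnitude)
#    in every linear layer, producing unstructured sparsity.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.4)
        prune.remove(module, "weight")  # make the pruning permanent

# 2) Dynamic quantization: store linear-layer weights in int8 and quantize
#    activations on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

torch.save(quantized.state_dict(), "compressed_model.pt")
```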

---

#### The Future of Pre-training and Domain-Specific Models

Joey discussed the potential for pre-trained models to be adapted to specific domains by fine-tuning them on domain-specific data. He advised researchers to start with fine-tuning for their specific task before diving into full-scale pre-training: fine-tuning is far cheaper and often sufficient, and pre-training from scratch is worth considering only when large amounts of domain data are available.

He also highlighted the importance of leveraging large amounts of domain-specific data. For example, organizations with access to vast amounts of proprietary data could benefit from training models specifically tailored to their needs. This approach not only improves model performance but also makes AI more relevant to real-world applications.
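A minimal fine-tuning sketch along these lines, using the Hugging Face `Trainer`: start from a public pre-trained checkpoint and adapt it to a labeled classification task before considering any pre-training of your own. The dataset (IMDB as a stand-in for proprietary domain data), label count, and hyperparameters are illustrative assumptions, not recommendations from the conversation.

```python
# Sketch: fine-tune a public pre-trained checkpoint on a domain task.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

raw = load_dataset("imdb")  # stand-in for your own domain-specific corpus

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

data = raw.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=2,
                           per_device_train_batch_size=16,
                           learning_rate=2e-5),
    train_dataset=data["train"],
    eval_dataset=data["test"],
)
trainer.train()
```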

---

#### Explaining AI Decisions: A Critical Need

A recurring theme in Joey’s work is the need for explainability in AI systems. He explained that while models like BERT are powerful, their opacity can be a barrier to widespread adoption. His team is exploring ways to make these models more interpretable by connecting their decisions back to the data they were trained on.

One innovative approach combines decision trees with neural networks. By placing a decision tree on top of a pre-trained backbone like ResNet-101, his team has created models that are both accurate and interpretable. For example, when shown an image of a zebra, a class the model was never trained on, the tree routes the input to a node near “horse,” showing that the hierarchy captures semantically meaningful structure rather than arbitrary decision boundaries.
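The sketch below illustrates the routing idea only: a ResNet-101 backbone produces an embedding, and a tiny, hand-built tree with randomly initialized node vectors greedily routes that embedding to a leaf. The class hierarchy and node vectors here are made up for illustration; the actual neural-backed decision tree work induces the hierarchy from the trained network's weights and fine-tunes the backbone with a tree-supervision loss.

```python
# Conceptual sketch of routing a backbone embedding through a decision tree.
# The tree and its node vectors are toy placeholders, not the real method.
import torch
import torchvision.models as models

# ResNet-101 backbone; DEFAULT downloads ImageNet weights (torchvision >= 0.13).
backbone = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # expose the 2048-d penultimate embedding
backbone.eval()

class Node:
    def __init__(self, vector=None, children=None, label=None):
        self.vector = vector          # direction representing this subtree
        self.children = children or []
        self.label = label            # set on leaves only

# Hypothetical 4-class hierarchy: (horse, zebra) vs. (cat, dog).
d = 2048
leaves = {name: Node(vector=torch.randn(d), label=name)
          for name in ["horse", "zebra", "cat", "dog"]}
equids = Node(vector=torch.randn(d), children=[leaves["horse"], leaves["zebra"]])
pets = Node(vector=torch.randn(d), children=[leaves["cat"], leaves["dog"]])
root = Node(children=[equids, pets])

def route(embedding, node):
    """Greedily descend the tree, picking the child whose vector has the
    largest inner product with the image embedding."""
    while node.children:
        scores = torch.stack([child.vector @ embedding for child in node.children])
        node = node.children[int(scores.argmax())]
    return node.label

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)   # stand-in for a real input image
    emb = backbone(image).squeeze(0)
    print(route(emb, root))
```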

---

#### The Role of Decision Trees in Making AI More Transparent

Joey’s work on neural-backed decision trees demonstrates that interpretability does not have to come at the expense of performance. By fine-tuning the neural network’s embeddings to align with the decision tree’s structure, his team achieves accuracy competitive with the original model; each split is still computed by the network, but the path through the tree exposes the high-level decision process.

This approach also allows for correction mechanisms. For instance, if a model misclassifies an image (e.g., labeling a zebra as a horse), the decision tree can be adjusted to improve accuracy without retraining the entire model. This capability is particularly valuable for deploying AI systems in critical domains like healthcare or autonomous vehicles.

---

#### The Broader Implications of Efficient Training and Explainability

Joey’s research underscores two critical challenges in AI: efficiency and transparency. His work shows that larger models are not always less efficient if trained properly, and that interpretability can be achieved without sacrificing performance.

Looking ahead, Joey is excited about the potential for non-parametric approaches to model design. Instead of cramming all knowledge into model weights, he envisions systems that reference external knowledge bases dynamically. This approach could reduce the need for massive pre-trained models while enabling more flexible and context-aware AI systems.
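A toy sketch of that retrieval-based idea: embed a query, look up nearest neighbors in a small external store of facts, and hand the retrieved passages (with their provenance) to a downstream model. The encoder name and the knowledge-base contents are assumptions for illustration, not part of Joey's system.

```python
# Illustrative retrieval sketch: knowledge lives outside the model weights
# and can be grown, corrected, or inspected without retraining.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

knowledge_base = [
    "Zebras are African equids known for their black-and-white stripes.",
    "ResNet-101 is a 101-layer convolutional neural network.",
    "Quantization stores model weights at reduced numeric precision.",
]
kb_vectors = encoder.encode(knowledge_base, normalize_embeddings=True)

def retrieve(query, k=2):
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = kb_vectors @ q                     # cosine similarity
    top = np.argsort(-scores)[:k]
    return [(knowledge_base[i], float(scores[i])) for i in top]

for fact, score in retrieve("Which animal has stripes?"):
    print(f"{score:.2f}  {fact}")
# The retrieved passages can be fed to the model as context, updated as the
# knowledge base grows, and pointed to as an explanation for a prediction.
```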

---

#### Conclusion

Joey’s insights into efficient training strategies and the importance of explainability offer valuable lessons for researchers and practitioners in AI. By challenging conventional assumptions about model size, optimizing hardware utilization, and prioritizing transparency, his work paves the way for more practical and ethical AI applications.

As Joey and his team continue to explore new directions in machine learning, one thing is clear: the future of AI lies not just in advancing technology but also in making it accessible, transparent, and accountable.

"WEBVTTKind: captionsLanguage: enwelcome to the 1200 i podcast I'm your host Sam Charrington hey what's up everyone happy Memorial Day to those of you in the States although we might not be able to celebrate holidays like we once would I encourage you to find a way to enjoy yourself this weekend connect with family and friends and enjoy some good food and fun as best as you can I am super excited to be hosting tomorrow's live panel discussion on advancing your data science career during the pandemic this is going to be a great one featuring an amazing line-up and I encourage you to join us you can check out to malaya comm /d s careers for more information and to register i also want to give you a heads up regarding my upcoming webinar with algorithm eeeh CTO kenny daniel the hot topic at last year's twil makan unconference was a discussion on whether you should build or buy an ml or data science platform well we'll be tackling this topic head-on in our upcoming session we'll discuss what goes into building a machine learning management platform how to make the business case for ml ops at your company and how to evaluate off-the-shelf machine learning management solutions be sure to mark your calendar for 10:00 a.m. Pacific on June 9th and visit 12a Icom slash algorithm eeeh to register that's Himalayan comm /a LG o RI th m.i.a all right enjoy the show and enjoy the holiday all right everyone I am on the line with Joey Gonzales Joey is an assistant professor at UC Berkeley in the EECS Department Joey welcome to the twilight cast thank you for having me I'm really looking forward to diving into this this conversation and in particular talking about ML systems and your recent paper on train large then compress but before we do that please share a little bit about your background and how you came to work in ml nai yeah excellent so my story is a bit funny I started my PhD at Carnegie Mellon with an interest in actually flipping helicopters because that was a trendy thing to do back in 2006 a while backflipping helicopter was being helicopter by MUP and solo or fly them and then flip them I'm actually a colleague of mine Peter Beal now at Berkeley when he was you know finishing up his his thesis work he was looking at how to do interesting control for helicopters I thought that was really cool and and I at CMU I was you know I went to my thesis advisor you know you've worked on controllers well I'm kind of interested in flipping helicopters I think that's that's really neat research and you know I didn't know that wasn't thanks well it was and it actually was some of the pioneering work to what we see today in reinforcement learning but what's kind of cool about this story is my advisor at that time being a you know a real machine learning researcher I was like you know what you know flipping helicopters that's that's that's exciting but there's something more important like we can actually help the world with sensors we can build sensor networks to monitor fires and we can use kind of principled machine learning techniques I should add that when I was looking at the flippin helicopter so like you know what we should flip them with neural networks and the other thing my advisor said which was good advice at the time was a neural networks aren't really serious research we use more statistical methods graphical models things that have formal foundations that we can reason about and write it kind of detailed analysis and understand what our models are doing and that was good advice and so I went 
down this path of how to build Gaussian processes Bayesian nonparametric methods to reason about link quality and sensor networks and and in that process of doing that I kind of stumbled into a problem I was writing a lot of MATLAB code to compute big matrix inverses and then approximations that to make it run faster and one of the things I enjoyed doing in the process of you know exploring these more efficient MATLAB programs was trying to make them more parallel and I think my advisor occluding is a good advisor he's like you know what maybe you enjoy that more so maybe instead of focusing on the knob metrics in the sense networks let's start to think about how to make machine learning more efficient and and in particular at that point in time hadoop was taking off and smoothly you know what MapReduce that's gonna change machine learning and we were thinking well we're working on graphs and they just don't fit the the MapReduce pattern and the kinds of computation we were doing just it wasn't it didn't actually fit with the the technology that people were building so we started to explore a different design of system so design of systems for computation on graphs which took us down the design of of graph processing systems system that I ended up writing is kind of the end of my thesis was a graph lab for doing very large analysis of graphs and so by the time I finished my PhD I was actually writing systems papers not machine learning papers and the field was changing very very rapidly to this around 2012 and if anyone's been following the history of machine learning around 2012 everyone started to realize maybe actually the neural nets for a good idea the deep learning these ideas actually really bethe dated back to 1980s they're actually really starting to work and they were changing the field of machine learning and graphs were also taking off so we built actually a company around the systems that I was developing as a graduate student it was graph lab that evolved into a company for building tools for data scientists to do interesting machine learning at scale that was ultimately acquired by Apple and around that time I also joined the UC Berkeley em pleb as a postdoc and there was you know a chance to come out to California and it was a really exciting opportunity to do research in a different system a system called spark which eventually became apache spark and there we started developed the graph processing foundation for the apache spark system and again as i started to explore more and more into the the field i learned more about research and data systems and transaction processing and how those connect back to machine learning and so after finishing my postdoc I came to Berkeley in fact I chose not to follow the much more lucrative path of the startup I was going to ask about that yeah I made a terrible financial decision but I'm happy because I have a chance to work with students I'm a little less happy because I'm not as wealthy as one could have been but now I am teaching do research at the intersection of machine learning and systems and so we have a pretty broad agenda around how to build better technologies for delivering models to manage the machine learning lifecycle not just training but prediction how to prioritize training experiments on the cloud to use serverless computing to make machine learning more cost-effective and easier to deploy we have a big agenda around autonomous driving building the actual platform that supports autonomous driving not necessarily the 
models but how they are connected together to make a reliable car and we have work in natural language processing and computer vision and one of those papers one I'm hoping to talk a bit about today which is our work on making Bert models easier to train and it too has a kind of funny story how he came to - actually a realization that what we were thinking was entirely wrong and that's what that paper talks a bit about well let's we'll get to that funny story in a second there's so much interesting stuff that you just mentioned it's there there at least three or four interesting podcasts in here I'd love to dig into some of the stuff you're doing with serverless at some point and how that intersects with ml and is something I've looked at a little bit as well but Before we jump into even more that I'm curious your co-founder at GraphLab and and tariq carlos gastrin was one of my very first guests on this show when we'll talk number seven in fact and I'm curious how you came to know and found the company with Carlos yeah Carlos is awesome so he was my okay when I get to CMU Carlos was the guy who said let's not flip helicopters let's do something that you know could make an impact in the world and he was a great advisor like he pushed me down the right path in my PhD the thing that reflected I was interested in and he's one of the pioneers in the modern field of machine learning in systems yeah he did go to Apple he did saw him recently at at nerves most recently in Vancouver it seems to be really having a good time there yeah he's had a chance to have a lot of impact doing really cool stuff you kind of laid out this broad space of research it sounds very broad actually yeah you know tied together by systems I'm curious how you kind of you know is it rationalized by hey you've got a bunch of you know students and you're letting them flip helicopters and the way that they want to flip helicopters more so than you were you know in Cirrus yeah or it alright so is it's challenging as faculty to decide what is your research agenda I mean one likes to imagine you sit there and go here the three two things I want to study usually not one cuz you have to have enough for a couple students to build you know their thesis around the reality is it students pull you and and I actually I think sort of like artists it's hard to compel people to follow the research agenda that you ultimately want my adviser did a great job it's not about telling you what to do it's about showing you the exciting opportunities you can explore and so with my students I I've pushed them in directions to think about how we make model more models more efficient to not just train but to serve how we support new emerging applications of machine learning that might require innovation both in the system and and the modeling techniques and actually what's kind of neat about the field of systems and machine learning is again when I started it wasn't really a thing in fact some of my colleagues at CMU like you're just hacking you're not actually doing research you're not proving anything fundamental about machine learning you're writing software a little bit of that was true we were definitely writing a lot of software we were trying to prove some stuff too but I think the impact might have actually been more on the software side and one of the funny things about the broader field of systems and machine learning is that it actually has been kind of the undercurrent of a lot of the recent progress in AI when we look at this revolution 
in deep learning we can go back to the 2012 the Alex paper that's actually not the beginning it goes way back the 1980s in fact the techniques are from the 1980s the architectures the models even the algorithms that we're using are from 1980s if you actually read the Alex net paper more than half the paper is devoted to how they got it to run on a GPU how they got it to run on a very large image data set and some of the optimizations they made to the training process to make it run at scale so it is the movement to scale that really helped launch the revolution that we are on today and now there's the other factor which I think people overlook and it's sort of when I was doing my PhD we were writing the Fortran of machine learning we were writing MATLAB code to implement algorithms and debugging gradient procedures and that's absurd today it's just too easy so a graduate student can pick up PI torch tensorflow and the x net one of these packages and very easily architect a new model and train it on TP use GPUs harder they barely understand and get it to run at scale on datasets that they don't have to collect so that is an enormous jump forward and and if you look really carefully and a little bit depressingly that models didn't change that radically the algorithms didn't change that radically what changed was it became a lot easier we developed the languages the tools to make machine learning practical and that really boiled down to getting the right abstractions and maybe if you roll all the way back when Alex and I came out they didn't quite have that but right after Alex net came out Theano started to really take off Caffe at berkeley started to take off and it became so much easier to build that next model in the next model and so on and today we're stuck in like a flood of archive papers because basically anyone can download one of these packages and start building state-of-the-art machine learning models there's some learning that you go in the process but the fundamental principles are define your objective define your decision process and then tune your decision process optimize it for that objective that's it and the undercurrent that drive all of this has been a lot of the innovation in the system's front not necessarily the machine learning and so my research is trying to find those right abstractions and especially as we look at new frontiers not just training models but how we deliver them how we support them in autonomous driving and how we adjust the architectures of the models themselves to make them more efficient in these new kinds of applications it when I first started doing these interviews one of my favorite questions was looking to explore kind of the the way folks came up with new models and you know trying to find the the kind of science behind it and I think that the takeaways were a lot of it was you know the the answer was like graduate student dissent like we would just throw a graduate student at this and you know they tweak something that pre-existed you know but there wasn't necessarily kind of a you know hard science behind how to come up with a new model architecture but so we've seen a lot of you know innovation like in around you know you know Burt and the kinds of transformer models that we're seeing here it has that has you know would your answer to that be kind of similar has it you know it changed a lot or how do you think of you know beyond kind of that high level process you just lay it out how do you think of the the process for coming up with 
these new types of architectures yeah so that's been a struggle for me so remember I start with this religion this Bayesian philosophy of model development they have these principles of priors and likelihoods that gave us at least the the basic foundations of what to think about when building a model that's all that's not gone but that's you know effectively got for a lot of the machine learning world and so we're left low coming back you though actually right like the cool little modeling is is on the rise rank so I should say it's not gone and it's very important to note that that a lot of the world actually runs on these more traditional techniques the research community where we're writing these new papers for language modeling or speech or driving where there's very specific cases that have been kind of dominated by deep learning techniques but the Bayesian methods are still you know fully alive and medicine and in even traditional advertising algorithms but with that in mind so when I start to look at the deep learning work how do I find those principles and actually they actually exist there they're a little smaller and and sadly we start to embody the the model with personality like the models trying to do X which is sad because that's not how we liked it you know formally think of things but the these building blocks convolution recurrence attention each of these becomes tools that we can use to architect our models when students go how deep should we make it well we try to go a little deeper they start to look at when how it affects convergence rates variation and batch size and its relation to batch norm so we have little rules of thumb and unfortunately there's no great like my PhD students like two or three years to get up to speed with the rules of thumb in the area that they're working in and once they have that I hope they teach the next PhD students and so on because it's hard to really grasp what those are it's more like it comes from experience working with these models and going out so like the transformer in this particular piece of work that we've been exploring like how to make it more efficient really should we make it deeper to make it wider who knows so we start to measure each of these things and that's one of the jokes that we make is it the machine learnings become more like a biological science it's driven by laboratory experiments by using compute to understand better the models that were building as opposed to the more principled approach you might have had in the past we tried to to frame it in some probabilistic architecture you mentioned that there was a story behind the work that led to train large then compress yeah so I'm happy to go into that story sure sure yeah so so the story behind that the Train large and compress work it starts in the following so we've been doing a lot of work in how to make computer vision models more efficient in particular not for training but for inference and so we have these skip net architectures ways of making things more dynamic and some of my colleagues go you know what maybe we should start thinking about language models they seem to be eating up a lot of cycles the the transformer the burp model that's become a kind of a foundation for reasoning about text and context well so that model is pretty expensive to run and so we said all right maybe we'll explore what we can do in the context of making these burp models more efficient now I should say a lot of people are studying to think about that because bird is you 
know incredibly expensive to run on text feeds and text is a pretty large you know body of data that we might want to process yeah I'll mention that for folks that want to dig into that particular point I did an interview with Emma Struble who has you know in fair amount of detail kind of characterized both the cost and kind of environmental impact of training some of these large-scale NLP models and it is crazy it's it's crazy actually the the co2 narrative was one of things that got me especially like I was maybe we don't touch language that's there's plenty of people thinking about it and then I saw him in his papers like wow and you're trying to make autonomous cars so that you know a little bit more environmentally friendly when it comes to driving when I could go fix it you know funding problem right in my field and so yeah so we look at these language models and go how can we make these better and the first step to doing that is we got to understand them so we need to run some experiments and and my students go oh we're gonna need a lot of machines like I can't afford a lot of machines so if I look at Google at Facebook they can throw a lot of compute and trying to understand something and that's actually one of their their tools that we don't really have access to we've actually started collaboration of Google so we could get access to TP use but we can't do it at the scale that they're going to run their experiments so we had to get lean so how can we rapidly iterate on variations and architectures we want to look at different types of attention different architecture sizes understand the effects of these hyper parameters and so my students go out here's what we'll do we'll make the models really small and we run a training run every day with different configurations and we'll get a good picture of what's going on and they did that and so they made the model really small because that would in theory make it really fast to train but they also really small in terms of number matters or yes so they made the model smaller in terms of both the height the number of layers and the width the number of these hidden hidden layers hidden parameters inside of each of the attention heads so basically I tried to make the model so it would it would train faster because it had less to compute less to update this is more of a classic way of thinking how I would approach the problem to if it's too big make the problem smaller it should go faster right it's less to do all right so they did that and it was working but one of them was like I what if we make a look bigger just you know to get a point of comparison and they applied the point of comparison on top of the you know the the smart model zero training and and the point of comparison seemed to go down pretty quickly and I said well let's put it in time and you put in time and actually the bigger model the point of comparison was actually getting to a better accuracy quicker than the smaller models that we were supposed to be running because they were faster to train and then we started wondering mmm maybe it's the other way around maybe we had this backwards all along that if we want to go faster we have to make the problem bigger which is really counterintuitive but it actually turns out to be a really neat intersection between how these models converge and how they can take advantage of parallel resources in the GPUs and TP use to get good scaling as you make them bigger and that sort of forms the foundation of this work that sort of went 
against what we thought would be the case and actually presented each way to approach training these expensive models is the idea related to the kind of the rate of change of of kind of accuracy for these models and you know taking advantage of the idea that the larger models learn quicker but the you know I guess that the area under that learning curve is proportional to your compute cost and you cannot kind of optimize that yeah so there's a bunch of trade-offs let me I try to walk through them because they were counterintuitive to me at first too so the first trade-off to think about is actually let's talk about compute so as I make my model bigger I'm going to compute more so that's more work and it should run slower but the neat thing is that when I make these models wider at least I actually expose more parallelism and if we look at the execution these models it's a little surprising we've optimized these GPUs and TP use to have a substantial amount of compute often for computer vision models and so now we have an opportunity when we run a transformer if we don't crank the bat's batch size up incredibly high we actually have a fair amount of leftover compute that we can use so making the model bigger doesn't necessarily translate to a linear increase in runtime so we can afford to make the models bigger without linearly increasing execution parallelism and runtime don't correlate to cost because you're just running more compute at the same time yeah so this is looking at its own carbon for that matter yes so all right this is you're getting to the interesting stuff so so first let's in the paper we actually tried to control for this exact I was like ah no ha no second you're just going to increase the cost of compute so we looked at one GPU and what happens is you're not using all the cores on the one GPU when we were looking at the smaller model so as we make the models bigger for a fixed batch size we can get an increase in the utilization of the GPU and right now it's not easy to turn off those cores and you're also paying a fixed overhead to power the the actual box that the cores are living in plus cooling so trying to power throttle individual cores on GPU is generally not a great idea especially if we can get better utilization of the cores that we have now you could say I should have as we did so then we look at more GPUs we're going to burn more resources as we turn on more GPUs but the hope is that we can get to a solution quicker and if end if those GPUs are already attached to our box which they often hire there's usually some incentive to go ahead and try to use those as efficiently as possible and so that brings us to the second question which is if I make my model bigger is it really improving in efficiency which is what we'd like to think of as the improvement in our perplexity a reduction in perplexity as we train as we'd like to reduce our error as quickly as possible in in wall clock time because we have to pay for the power of the building and so on so we want to train as fast as possible in time the simpler way to look at that first is how is it housing per plexi or air decreasing is a function of the amount of data that we touch and so there are two knobs there so now we're getting into the weeds but there's the batch size which determines how much data we look at per step of our algorithm the more data we look at the better of an estimate of the direction that minimizes the loss which in principle should give us faster convergence it also increases GPU 
utilization so we can use that as another mechanism to get better utilization out of each of our piece of hardware but it also has diminishing returns so as we increase the batch size our our speed at which were able to converge as a function of samples we look at doesn't necessarily increase linearly and one of the other sort of side effects of this which if you work in computer vision you're like oh no there's a problem that as we increase the batch size there's some risk of overfitting and this is a fact that shows up more in computer vision models where where it's somewhat data poor we're in an NLP it seems at this point at least that we have opportunities to overfit more before we actually are properly overfitting so there's this question of the generalization gap the gap between how well your models fitting the training data and the test data and in NLP tasks were not at a point where we're that generalization cap is disappearing which means that we can increase the batch size quite a bit more without overfitting but it also means we can crease the model size quite a bit more and so this paper then does is tries to play they compare this trade-off between model size and batch size to find the best combination and one of the neat results we find is it actually cranking up the model size and a batch size to a certain extent as well kind of gives us the best outcome it gets us to a a model that's more sample efficient the more samples it sees the faster reduces the the test air and it also lets us better utilize our hardware so we're actually getting a gain from parallelism and those two forces interact to give us a faster in terms of lock lock time reduction in the test perplexity or the air metric that we care about in terms of this generalization gap and the differences between what you see in computer vision and what you see in NLP tasks is that related to the way the problems are formulated in terms of supervised versus self supervised semi-supervised and the kind of availability of data and labels and that kind of thing absolutely so you're hitting a key point so in computer vision we are largely still focused on supervised tasks we need labeled data sets which are big but they're not as big as we want them to be whereas in NLP we can go to really large unlabeled data sets because we're essentially predicting missing words so we've created this self supervised task and that means that we have so much more data we can support bigger models and bigger batch sizes without having this this generalization gap disappear we're a late sorry without eliminating or causing our training err to go to zero and our test data to you know dominate so so there is this this opening created by this self supervised training that we were able to take advantage of now in our experiments we test both the self supervised training tasks as well as the downstream translation or classification tasks it would be applied to actual language modeling you know supervised training tasks but that's typically done as a small fine-tuning step on top of this extensive pre training which is where all the the co2 is going to pre train these models and then in your description of the Train large it sounded a little bit like you're ultimately saying you know fully utilize whatever box you're training on but there's a lot more nuance there I am yeah elaborate on on that so this has been a big question in data center design generally as a systems person like should I turn off my GPU should I turn off my CPUs or should I 
try to utilize them better and the general consensus when we think about data centers is that we really do want to try to maximize our utilization part because we bought the data center it's expensive we should use it and we have to keep it cool we have to staff it there's a lot of other factors that go into play and so we want to be able to use that hardware as much as possible and as efficiently as possible and part of the reason we might want to use it as efficiently as possible you think of things like serverless if I'm not using the hardware I can put something else there and we're creating markets for filling in the excess capacity so the idea that I would turn off a GPU is sort of silly I should always have something running now the question is can I make that thing running on the GPU as efficient as possible and so in our work we're focusing on trying to maximize that efficiency in my lab for example students are competing for the GPU so I think the one GPU experiments definitely easier to run because they're not fighting for the entire box and so the other GPS are being used for other experiments and then when we go to you know H GPUs we're gonna again use the whole box so the general consensus are at least the thought process today when we think about the data centers to really maximize our utilization and not try to power throttle or limit the performance of each of the course now it could be in the future and new kinds of hardware might change that trade-off but the underlying economics would sort of suggest that if you bought the device you should really try to find ways to maximize its usage and given machine learning has an infinite supply of things that we'd like to Train it's not hard to imagine that I can always fill my excess capacity with more training is the paper fundamentally an economics paper in the sense of you know you're trying to maximize utilization and those kind of things or do you also get two results that talk about performance given a set of constraints like your traditional computer science e kinds of papers yeah so it's it's funny we hadn't gone down the economic throughout so it's a funny I mean a very loosely yeah so we we are actually very much thinking about the economics of computing when we look at server list that is going to fundamentally change your economics computing in a way that I think will make things more efficient more cost effective and actually easier so it's to win for everyone and we actually have an upcoming paper on this a hot cloud about you know the economics of server this are going to generally be favorable for everyone assuming we get some of the system's problems you know ironed out this paper was really our students says in the you know the first effort to really make progress in the Berkeley training space to find mechanisms that we and academia can use to go fast and part of that is finding better ways to be more efficient about training it allows us to run experiments more quickly and so we can now innovate on Bert and one of things are actually looking at is trying to make these models more nonparametric so they can leverage other data sources one of the side consequences of this paper is sort of a you know if you're out there thinking about I should that's really cool I want to play to play that but hey wait a second you made the model 4 or 5 X bigger 6 7 X bigger that's expensive for inference what am I going to do about that and in fact when we got this result that was like my first conclusion yeah we went on the 
training front but we just made inference which is actually the more expensive problem worse by 7 10 X and if you think about it training should only be a small part of the use of these models inference is where it really the cost should be and it is when you look at a practical application the time we might train it but we're gonna run that model 24/7 at every single tweet every single web page that we encounter that's a lot of inference and v100 so you're doing about a thousand when bat should optimize the thousand sentences per second which just sounds good but then you think of the amount of text in the web that's a lot of expensive GPU hardware so making the models smaller after training was one of the questions that we had to solve and so so the second half this paper comes back it was wait a second so even the models bigger to train faster but now we need a way to squeeze them down and maybe actually the bigger insight which is also maybe a little less counterintuitive is that the bigger models we could actually make smaller more efficiently and actually with with less of a degradation in accuracy so we make a we train a really large model and then we chop it up so we both explored weight pruning so eliminating weights making the model more sparse and quantization reducing the bit precision of each of the the weights and so we able to take our much larger models and then apply these these compression techniques to make them smaller and the effect of that is we can make the model actually smaller than the small models while retaining higher accuracy and so that's something that we're still able to use the compression techniques off-the-shelf where did you have to adapt them to this kind of model or the specifics of the way you train them yeah so getting close to the deadline realizing our models are now 10x bigger really great good news one of the students Shen who is working on this project had just finished a work on quantization and so we're like right Jen can we use your quantization techniques like I don't know maybe and so he started playing around and it turns out that you shouldn't hurt really well and so we looked at the standard quantization and standard pruning so we tried not to innovate extensively in each of these pieces more of an exploration how they interact with this kind of counterintuitive hypothesis that bigger models might actually be better both for training and it turns out for inference as well if we compress it got it got it so not that you have any of these numbers like right at your fingertips but can you give us a sense for when you say train large like what large means in this case and how that compares to what Google might do typically yeah so I think we were looking at like 6 X 7 X bigger than was normally was normally published I'm guessing Google actually goes much larger still and they might already be benefitting from these ideas and what what order of magnitude is that in terms of number of you know servers or CPUs of GPUs or so we were at 8 GPUs we actually all surround experiments on a TPU v3 TPU as well remember the exact sizes have a paper in front of me if I can find it many tabs yes so I think we're at we were up to like 20 hidden layers or so our 24 went up twenty four layers and we tried hidden sizes of like the order of 15:36 so 1536 hidden units for each of the layers so we tried a pretty reasonable space we built off off the roberto work which is actually if people haven't looked at it's kind of neat sort of revisiting what bert did and 
in some ways but really had the right answer just this broader experimentation of the the trade off space makes a big difference so we built off of that and tried different variations on the sizes described in that paper yeah the kind of rough magnitudes that i remember reading about and I don't remember if this was you know bird or elm or some of the different variations or Road but there was you know on the order of like 64 you know tens of GPUs for you know yak or more know so we went we used to tip you cluster for our big experiment so we actually tried to reproduce the Roberta setup so our comparisons are compared to these standard baselines so we had to use a TP cluster for several weeks it's expensive to run the full experiment for the baselines those need is by making the model bigger we could get comparable results quicker yeah and so we've we've kind of characterized bigger now characterized quicker what is what did that mean in practice so we used an i guess at about a hundred thousand seconds I'll be able to get fairly competitive accuracy should see if the final number so it sort of depends also on on which tasks we also accounted for the downstream fine-tuning that you'd need to do as well I don't actually member atop my head well we'll leave it as an exercise to the listener to pull up the paper so you've increased the size of the model and the number of resources the CPU research if you rather resources that the model is running on and in turn decreased the the training time of the model and the aggregate compute costs right did you need to do anything special to accomplish that or was the paper primarily observing the fact that you know that you know in aggregate you keep the the preferable approaches to increase the the model size so we did small changes to where we place the batch normalization we did pre normalization but it's like negligible changes in the underlying architecture most of us really exploring this this trade-off space ok these different parameters so in some sense it's a first step towards a bigger agenda it wasn't intended to be like a ground changing work but what's kind of neat is it does really make us at least we think how we approach training of these models well it's you know it's important stuff like you I think a lot of people will think well you know if Berkeley's worrying about the cost of training these models what what hope is there for you know my lab in you know an on Berkeley institution you know that has such close ties to Silicon Valley and relatively awash in resources and so if you can figure out how to make training these models more efficient than that's potentially a huge impact for a lot of people there there's a an important subnet of here which is that training the pre training isn't something that everyone needs to do and so Google has done a great job of offering these pre trained models so this really expensive part isn't something that every you know group in the world has to address and that's a good thing but if we want to innovate on that pre training process if you want to do research in it or we want to in fact the data suggests that adding more data that specialized to your domain can improve the quality model so if we want to be able to push pre-training research forward we do need to find ways to make it more efficient and I should say we started out with thinking oh we're gonna invent a new Bert and discovered in the process and maybe we don't necessarily need a new Bert yet that may be approaches to how we do the 
training how we choose our hyper parameters can make a really big difference your comment just prompted a thought to what degree has the kind of the I don't know theoretical trade-off space around pre-training versus fine-tuning been explored so that you know if I know that I have a unique domain you know and some you know corpus of documents or data available to me you know is era is there any kind of concrete research I can look to to help me understand you know if I should be pre trained if I should be pre training from scratch versus fine tuning and or do I just need to try everything and see what works the try everything is is not terrible advice but here's what I would tell my students so pre training is expensive so maybe start with fine-tuning understand what what is your prediction task and this is what you know the the practical world will do to take your your your prediction task whether or not a translation or sentiment tagging or maybe it's like which call center which call person should this this you know message be forwarded to focus on on fine-tuning for that task first there's a little bit of art and choosing learning rights and stuff to get your fine-tuning to work so go through that process understand how well you can do by fine-tuning to your domain and then if you have and you might you know billions of call records from the past you think you could really better improve the underlying representation on you could then try to go back to this mass language model training the the pre training process and then the work that we've done and you know other work that's it's that's going on around us can help to make that process more efficient so that you can in a matter of weeks or a week in our case on take your v100 box and really make substantial progress towards pre-trained model that's now pre trained on on your domain as well hmm and so where do you see this line of research leading yeah so I asked my students this every be able to develop a new bird so now we have the tools to start testing pre-training what should we do next one of the things that I'm kind of excited about is well the realization that to cram a lot of knowledge in the weights of our model and making the models bigger certainly helps with that another way to deal of knowledge is do not put it in the model at all I actually have to look up stuff most of time when I want to remember facts I'm terrible remembering facts so I used the internet I have a neural net in my head it doesn't have to memorize the internet because I have the internet so having access to a knowledge base can make a big difference in how much we need to encode in our model make him model perhaps smaller the ability to in synthesize decisions or to apply logic on top of knowledge bases seems like a really big opportunity for language modeling for for an LP broadly and maybe even for these basic representations like Bert and so we've been looking at and starting to look at if some of their groups actually got some early published work on how to bring in a non parametric or semi-parametric lens on on these models so that we can reference into a large knowledge base in in the construction of our embeddings themselves and that has you know the advantage of maybe being more efficient allowing us to grow the knowledge base without having to retrain the model you could get more data our model just gets better without having to adjust the model itself and maybe even giving us some explained ability so when we go to make a prediction about 
like how we embedded the sentence or how we represent this you know this decision for which called or out to we can now actually point back at historical data and say here's the data we use in our embedding to reason about that and you know that's terrible data I don't want that in my data set or that actually that that makes sense and and so that that connection to the data could actually also help with explained ability so that's sort of the vision that I that that my students and I are pretty excited about right now does that pull from or kind of lead you to like memory based networks or like information retrieval types of problems or yeah yes so memory nets I are all of these these kind of let's say I are more classic memory nets also increase anywhere classic so those are our tools one things we're looking at right now is something as simple as like can we use embeddings from other pieces of text and simple text similarity to recall those embeddings and and there's some other work exploring this now ultimately things like memory nets or pointers to point into a knowledge base and attend to an entire corpus would be really exciting and we're just starting to explore this so there's a lot more to do it does pushes in the direction of IR imagine a neural net that can Google stuff for you and to answer questions so it's certainly there's a well-studied area that we'll have to build on you touched on explained ability in there as a possible benefit here I'm curious about you know maybe elaborating on that a little bit more and then also you've been doing some work on explaining ability as applied to reinforcement learning maybe a little if you can share a little bit about that work as well yeah so I'm the co-director of the UC Berkeley rise lab which stands for a real time working on that intelligent working on that secure so you have an interesting agenda on security which is another time and then the e and there were for a long time like what is the e really what should it mean and we were initially thinking execution we're gonna execute things securely but that actually has some bad connotations so maybe there's another agenda which actually came out of our exploration in lineage how we we tracked relationship between data and the model development process explainable decisions would be a good thing to be able to support and so we had an agenda around how to do we have actually ongoing agenda on how to do explainable machine learning by connecting our training process to our data but what's actually pretty exciting is pulling in some of the recent work and explainable AI actually my thesis advisor Carlos has an exciting work line which provides a black box mechanism for explaining models decisions so my students have been also exploring this edge tremendous it hanging power this is you know one of the things we talked about three plus years ago and that went that early 12 episode and lime comes up all the time still yeah so I run the risk of another tangent but the world of explained ability is is it's kind of rich and it is created by this need to make sense of models that no longer make sense and so this idea that I can inspect my model and go hack that I like how that makes decisions that's gone or at least you know to a first approximation so we're left with justify the decision you made let's go back and at least connect it to the data to even the input so so my group had started looking at that and one of the things that we started to ask is why can't we have some interpretability 
and so one of the agendas that I'm exploring that's actually not in the language to me but more in the vision domain is how to apply decision trees to connect decision trees back with our deep learning models so that we can get the accuracy we want but we can also go and interrogate our mom and go well it's going to call us a cat but in order to do that it has to first identify that's an animal and it has to cluster it in animals with with legs and then you know with fur and then it gets us to cat so there is an opportunity to actually understand what the model is going to do at the high level each of these decisions is governed by a neural net so understanding that is sort of off the table but at least we now have an idea of the decision process the model we'll go through to make a decision so this is our recent work on neural back decision trees and when we look at language it's been an interesting question of what what would an explanation look like in the language domain so there are techniques like grad cam that have been pretty popular envision that would give us you know highlighting parts of an image that say you know this is the part of the image you're attending to we could do that and there are explorations of that in the language domain but one of the neat things going back to our very beginning narrative is can i connect my decisions back to data in many cases that is sort of the ideal explanation it's like here's the data that informed my decision about what you've given me now and so so that explanation is what we're exploring one of the hopes and doing that is you can not only connect it but you can even fix it so one of the kind of ideal outcomes of an explainable model is when it gets it wrong you go that's wrong here's what is wrong about it and that extra signal could be way more valuable than just some more labeled data and so that's our hope and maybe being able to correct our knowledgebase if we are referencing data so that we don't use that reference data in the future that would be one mechanism in the case of decision tree changing the paths so you know cats can't be attached to the thing that they're you know they're underwater that doesn't make sense so I want to move my my cat class somewhere else in my tree so the opportunity to intervene in a model is something that I make about when we look at explanations do you draw a hard line between explanations and interpretability just when you're speaking about them casually hey I do a little bit because at least classically to me I was again background being more in the traditional machine learning we really cared a lot about interpreting models and that meant that I could look at the individual pieces the model and start to reason about conditional probabilities what they would say about the the priors that I'm trying to impose on my data so interpretability to me means something that is sort of independent of the kind of intrinsic ly yeah its intrinsic to the model whereas an explanation could also be called a justification sort of looks retroactive Lee here's a decision I made provide an explanation if we look at humans humans are not interpretive why you can't look at their brain well most people can't look at the brain and go okay yeah I know what you're going to do but they provide meaningful explanations and that's maybe all we can hope for and we learn to work with that my hope with with the work and explain ability it works wise to sort of provide a little bit more of the interpret ability and I think 
that's important one because it you know if I'm going to deploy a model I'd like to be able to in general understand how it's going to behave not just on a you know candidate piece of data the other lens that we're bringing is be able to adjust the model when I get it when it gets it wrong I want to be able to correct it so this work with the decision trees is the set up such that the decision tree is an intrinsic part of like when you refer to the model there is the model superset that includes the decision tree or is the decision tree a kind of a surrogate model that's used when you're asking explain ability kinds of questions no so the challenge here was to make the decision tree the model and so what we're doing and we don't do really crazy complicated things so in this set up a standard ResNet 101 to find an embedding of our input and then we're using the decision tree as a decision tree on top of that embedding and so that's allowing us to route our decision process based on that embedding from ResNet 101 so there's a part of the model that I can make no sense out there deep interpretable or even explained to a component of that little piece but there is now structure and how decisions are made right so this ResNet is basically kind of learning this space of relationships between the things that it's seeing at least in a computer vision sense and then the decision tree is on top making decisions about what is what based on the space that the ResNet has yeah so it is a funny recipe and a lot of vision now let's take a res net like architecture as a backbone and its role as to embodying the things you know it wants to but its role is to extract pixel information to extract texture shape things image attributes that would be then used to make a decision and it places them in a fairly high dimensional space and the the decision tree is is constructed in a way that tries to use that that space to make decisions now that actually alone doesn't work so you need to take that decision tree and fine-tune the neural net the res net backbone so that it's compatible with the decision tree we build so it's a you know a small twist that's needed but that small twist now allows us to get competitive accuracy to the original model were you still using the model but now have an interpretor path where like one of the fun examples is if we give it a picture of zebra it's the class we've never seen before it'll route it down to near the horse but then it doesn't know what the you know classifieds and then something in that one of the horse categories and so it does try to extract some structure to the classes that is semantically meaningful but also a picture in the image domain meaningful so things that look similar yeah can you provide a kind of a quick intuition for why a fix is needed in a neural network domain as opposed to just throwing the existing decision tree against the existing embeddings and what the kind of intuition of that fixes yeah so the simple answers we tried throwing these simplified into the decision tree and it didn't work the deeper answer is that it's the decision tree wasn't optically the neural that wasn't optimized to give feature have some coherent structure that we can build this this class hierarchy on top of and so adding this extra decision tree loss we can actually force our decision tree to cluster things using semantically similar structure like we want horses to be nearby dogs and farther like share less common ancestors to fish so so we can impose some structure in 
Welcome to the TWIML AI Podcast. I'm your host, Sam Charrington.

Hey, what's up everyone? Happy Memorial Day to those of you in the States. Although we might not be able to celebrate holidays like we once would, I encourage you to find a way to enjoy yourself this weekend, connect with family and friends, and enjoy some good food and fun as best you can. I am super excited to be hosting tomorrow's live panel discussion on advancing your data science career during the pandemic. This is going to be a great one featuring an amazing line-up, and I encourage you to join us; you can check out twimlai.com/dscareers for more information and to register. I also want to give you a heads-up regarding my upcoming webinar with Algorithmia CTO Kenny Daniel. The hot topic at last year's TWIMLcon unconference was a discussion on whether you should build or buy an ML or data science platform. We'll be tackling this topic head-on in our upcoming session: we'll discuss what goes into building a machine learning management platform, how to make the business case for MLOps at your company, and how to evaluate off-the-shelf machine learning management solutions.
Be sure to mark your calendar for 10:00 a.m. Pacific on June 9th and visit twimlai.com/algorithmia to register. All right, enjoy the show, and enjoy the holiday.

All right everyone, I am on the line with Joey Gonzalez. Joey is an assistant professor at UC Berkeley in the EECS department. Joey, welcome to the TWIML AI Podcast.

Thank you for having me.

I'm really looking forward to diving into this conversation, and in particular talking about ML systems and your recent paper on train large, then compress. But before we do that, please share a little bit about your background and how you came to work in ML and AI.

Excellent. So my story is a bit funny. I started my PhD at Carnegie Mellon with an interest in, actually, flipping helicopters, because that was a trendy thing to do back in 2006: have a learned controller fly a helicopter autonomously and then flip it. A colleague of mine, Pieter Abbeel, now at Berkeley, was looking at how to do interesting control for helicopters as he was finishing his thesis work, and I thought that was really cool. So at CMU I went to my thesis advisor and said, you've worked on controllers, and I'm interested in flipping helicopters; I think that's really neat research. He wasn't convinced, even though it was neat research, and it actually became some of the pioneering work behind what we see today in reinforcement learning. But what's kind of cool about this story is that my advisor, being a real machine learning researcher, said, you know, flipping helicopters is exciting, but there's something more important: we can actually help the world with sensors, we can build sensor networks to monitor fires, and we can use principled machine learning techniques. I should add that when I was looking at the flipping helicopters I said we should flip them with neural networks, and the other thing my advisor said, which was good advice at the time, was that neural networks aren't really serious research; we use statistical methods, graphical models, things with formal foundations that we can reason about, write detailed analyses of, and use to understand what our models are doing. That was good advice, and so I went down the path of building Gaussian processes and Bayesian nonparametric methods to reason about link quality in sensor networks. In the process of doing that I stumbled into a problem: I was writing a lot of MATLAB code to compute big matrix inverses, and then approximations to make them run faster, and one of the things I enjoyed in exploring these more efficient MATLAB programs was trying to make them more parallel. My advisor, being a good advisor, said, maybe you enjoy that more, so instead of focusing on the nonparametrics and the sensor networks, let's start to think about how to make machine learning more efficient. At that point in time Hadoop was taking off, and suddenly it was, you know what, MapReduce is going to change machine learning. We were thinking, well, we're working on graphs, and they just don't fit the MapReduce pattern; the kinds of computation we were doing didn't fit the technology that people were building. So we started to explore a different design of systems, systems for computation on graphs, which took us down the path of designing graph processing systems.
The system I ended up writing at the end of my thesis was GraphLab, for doing very large-scale analysis of graphs, and so by the time I finished my PhD I was actually writing systems papers, not machine learning papers. And the field was changing very rapidly; this is around 2012, and if anyone's been following the history of machine learning, around 2012 everyone started to realize that maybe neural nets were a good idea after all. The deep learning ideas actually date back to the 1980s, but they were really starting to work, and they were changing the field of machine learning; graphs were also taking off. So we built a company around the systems I was developing as a graduate student. It was GraphLab, which evolved into a company building tools for data scientists to do interesting machine learning at scale, and it was ultimately acquired by Apple. Around that time I also joined the UC Berkeley AMPLab as a postdoc. It was a chance to come out to California and a really exciting opportunity to do research on a different system, a system called Spark, which eventually became Apache Spark, and there we developed the graph processing foundation for Apache Spark. As I explored the field more, I learned more about research in data systems and transaction processing and how those connect back to machine learning. So after finishing my postdoc I came to Berkeley; in fact, I chose not to follow the much more lucrative path of the startup.

I was going to ask about that.

Yeah, I made a terrible financial decision, but I'm happy because I have a chance to work with students; I'm a little less happy because I'm not as wealthy as one could have been. Now I teach and do research at the intersection of machine learning and systems, and we have a pretty broad agenda: how to build better technologies for delivering models and managing the machine learning lifecycle, not just training but prediction; how to prioritize training experiments on the cloud; how to use serverless computing to make machine learning more cost-effective and easier to deploy. We have a big agenda around autonomous driving, building the actual platform that supports autonomous driving, not necessarily the models but how they are connected together to make a reliable car. And we have work in natural language processing and computer vision. One of those papers, which I'm hoping to talk a bit about today, is our work on making BERT models easier to train, and it has a kind of funny story about how we came to the realization that what we were thinking was entirely wrong; that's what the paper talks about.

Well, we'll get to that funny story in a second. There's so much interesting stuff that you just mentioned; there are at least three or four interesting podcasts in here. I'd love to dig into some of the stuff you're doing with serverless at some point and how that intersects with ML, which is something I've looked at a little bit as well. But before we jump into even more of that, I'm curious: your co-founder at GraphLab, later Turi, Carlos Guestrin, was one of my very first guests on this show, TWIML Talk number seven in fact, and I'm curious how you came to know and found the company with Carlos.

Carlos is awesome. When I got to CMU, Carlos was the guy who said let's not flip helicopters, let's do something that could make an impact in the world, and he was a great advisor.
He pushed me down the right path in my PhD toward the things I was interested in, and he's one of the pioneers of the modern field of machine learning and systems.

Yeah, he did go to Apple. I saw him recently, at NeurIPS most recently, in Vancouver; he seems to be really having a good time there.

Yeah, he's had a chance to have a lot of impact doing really cool stuff.

You laid out this broad space of research, and it sounds very broad, tied together by systems. I'm curious how you rationalize it: is it that you've got a bunch of students and you're letting them flip helicopters the way they want to flip helicopters, more so than you were able to?

All right, so it's challenging as faculty to decide what your research agenda is. One likes to imagine you sit there and go, here are the two or three things I want to study; usually not one, because you have to have enough for a couple of students to build their theses around. The reality is that students pull you, and I think, sort of like artists, it's hard to compel people to follow the research agenda that you ultimately want. My advisor did a great job of this: it's not about telling you what to do, it's about showing you the exciting opportunities you can explore. So with my students I've pushed them in directions to think about how we make models more efficient, not just to train but to serve, and how we support new emerging applications of machine learning that might require innovation in both the systems and the modeling techniques. What's kind of neat about the field of systems and machine learning is that when I started, it wasn't really a thing; in fact, some of my colleagues at CMU said, you're just hacking, you're not actually doing research, you're not proving anything fundamental about machine learning, you're writing software. A little bit of that was true: we were definitely writing a lot of software, and we were trying to prove some things too, but I think the impact might actually have been more on the software side. One of the funny things about the broader field of systems and machine learning is that it has been the undercurrent of a lot of the recent progress in AI. When we look at this revolution in deep learning, we can go back to 2012 and the AlexNet paper, and that's actually not the beginning; it goes way back to the 1980s. The techniques are from the 1980s: the architectures, the models, even the algorithms we're using. If you actually read the AlexNet paper, more than half of it is devoted to how they got it to run on a GPU, how they got it to run on a very large image dataset, and the optimizations they made to the training process to make it run at scale. So it was the movement to scale that really helped launch the revolution we are on today. Then there's the other factor, which I think people overlook. When I was doing my PhD we were writing the Fortran of machine learning: MATLAB code to implement algorithms, debugging gradient procedures. That's absurd today; it's just too easy. A graduate student can pick up PyTorch, TensorFlow, or MXNet, very easily architect a new model, train it on TPUs and GPUs, hardware they barely understand, and get it to run at scale on datasets they didn't have to collect. That is an enormous jump forward.
And if you look really carefully, and a little bit depressingly, the models didn't change that radically, and the algorithms didn't change that radically; what changed is that it became a lot easier. We developed the languages and the tools to make machine learning practical, and that really boiled down to getting the right abstractions. If you roll all the way back to when AlexNet came out, they didn't quite have that, but right after AlexNet, Theano started to really take off, Caffe at Berkeley started to take off, and it became so much easier to build the next model, and the next, and so on. Today we're stuck in a flood of arXiv papers because basically anyone can download one of these packages and start building state-of-the-art machine learning models. There's some learning that goes on in the process, but the fundamental principles are: define your objective, define your decision process, and then tune your decision process, optimize it, for that objective. That's it. And the undercurrent that drives all of this has been innovation on the systems front, not necessarily the machine learning. So my research is trying to find those right abstractions, especially as we look at new frontiers: not just training models, but how we deliver them, how we support them in autonomous driving, and how we adjust the architectures of the models themselves to make them more efficient in these new kinds of applications.

When I first started doing these interviews, one of my favorite questions was trying to explore the way folks came up with new models and to find the science behind it, and I think the takeaway was that a lot of it was, you know, graduate student descent: we would just throw a graduate student at this and they would tweak something that pre-existed, but there wasn't necessarily a hard science behind how to come up with a new model architecture. We've since seen a lot of innovation around BERT and the kinds of transformer models we're seeing now. Would your answer be similar? Has it changed a lot? Beyond the high-level process you just laid out, how do you think about the process for coming up with these new types of architectures?

That's been a struggle for me. Remember, I started with this religion, this Bayesian philosophy of model development, where principles of priors and likelihoods gave us at least the basic foundations of what to think about when building a model. That's not all gone, but it's effectively gone for a lot of the machine learning world.

It is coming back, though, right? Probabilistic modeling is on the rise again.

So I should say it's not gone, and it's important to note that a lot of the world actually runs on these more traditional techniques. It's the research community, writing new papers for language modeling or speech or driving, very specific cases, that has been dominated by deep learning techniques; Bayesian methods are still fully alive in medicine and even in traditional advertising algorithms. But with that in mind, when I look at the deep learning work, how do I find those principles? They do exist, they're just a little smaller, and sadly we start to embody the model with personality, saying things like "the model is trying to do X."
That's a bit sad, because it's not how we'd like to formally think about things, but these building blocks, convolution, recurrence, attention, each become tools that we can use to architect our models. When students ask how deep we should make it, well, we try to go a little deeper; they start to look at how depth affects convergence rates, at variation in batch size and its relation to batch norm. So we have little rules of thumb, and unfortunately there's no great way to transfer them: it takes my PhD students two or three years to get up to speed on the rules of thumb in the area they're working in, and once they have that, I hope they teach the next PhD students, and so on, because it's hard to really grasp what those rules are; they come from experience working with these models. So for the Transformer, in this particular piece of work, we've been exploring how to make it more efficient: should we make it deeper, should we make it wider? Who knows. So we start to measure each of these things, and one of the jokes we make is that machine learning has become more like a biological science: it's driven by laboratory experiments, by using compute to better understand the models we're building, as opposed to the more principled approach of the past, where we tried to frame everything in some probabilistic architecture.

You mentioned that there was a story behind the work that led to Train Large, Then Compress.

Yeah, I'm happy to go into that story. It starts like this. We had been doing a lot of work on how to make computer vision models more efficient, in particular not for training but for inference, so we have these SkipNet architectures, ways of making things more dynamic. And some of my colleagues said, you know, maybe we should start thinking about language models; they seem to be eating up a lot of cycles. The Transformer, the BERT model, has become a kind of foundation for reasoning about text and context, and that model is pretty expensive to run. So we said, all right, maybe we'll explore what we can do to make these BERT models more efficient. I should say a lot of people are starting to think about that, because BERT is incredibly expensive to run on text feeds, and text is a pretty large body of data that we might want to process.

Yeah. I'll mention, for folks who want to dig into that particular point, that I did an interview with Emma Strubell, who has characterized in fair detail both the cost and the environmental impact of training some of these large-scale NLP models, and it is crazy.

It's crazy, actually. The CO2 narrative was one of the things that got me. I was thinking maybe we don't touch language, there are plenty of people thinking about it, and then I saw that paper and thought, wow, you're trying to make autonomous cars a little bit more environmentally friendly when it comes to driving, when I could go address a problem right in my own field. So we looked at these language models and asked how we could make them better, and the first step to doing that is understanding them, so we needed to run some experiments. My students said, we're going to need a lot of machines, and I can't afford a lot of machines. If you look at Google or Facebook, they can throw a lot of compute at trying to understand something, and that's one of the tools we don't really have access to.
We've actually started a collaboration with Google so we could get access to TPUs, but we can't do it at the scale at which they run their experiments. So we had to get lean: how can we rapidly iterate on variations and architectures? We wanted to look at different types of attention and different architecture sizes and understand the effects of these hyperparameters. So my students said, here's what we'll do: we'll make the models really small, we'll run a training run every day with different configurations, and we'll get a good picture of what's going on. And they did that; they made the model really small because that would, in theory, make it really fast to train.

Really small in terms of the number of parameters?

Yes, they made the model smaller in terms of both the height, the number of layers, and the width, the number of hidden units inside each of the attention blocks. Basically they tried to make the model so it would train faster because it had less to compute and less to update. That's the classic way of thinking, and how I would have approached the problem too: if it's too big, make the problem smaller and it should go faster, right? So they did that, and it was working, but one of them said, what if we make one bigger, just to get a point of comparison. They plotted that point of comparison on top of the small models' training curves, and its loss seemed to go down pretty quickly. I said, well, let's plot it against time. We plotted it against wall-clock time, and the bigger model, the point of comparison, was actually reaching better accuracy more quickly than the smaller models we were supposed to be running because they were faster to train. And then we started wondering: maybe it's the other way around, maybe we had this backwards all along, and if we want to go faster we have to make the problem bigger. That's really counterintuitive, but it turns out to be a neat intersection between how these models converge and how they can take advantage of parallel resources on GPUs and TPUs to get good scaling as you make them bigger. That forms the foundation of this work: it went against what we thought would be the case and presented a cheaper way to approach training these expensive models.

Is the idea related to the rate of change of accuracy for these models, taking advantage of the fact that larger models learn quicker, while the area under that learning curve is proportional to your compute cost, and then optimizing that?

There are a bunch of trade-offs; let me try to walk through them, because they were counterintuitive to me at first too. The first trade-off to think about is compute. As I make my model bigger, I'm going to compute more, so that's more work and it should run slower. But the neat thing is that when I make these models wider, at least, I actually expose more parallelism. If we look at the execution of these models, it's a little surprising: these GPUs and TPUs have a substantial amount of compute, often sized for computer vision models, so when we run a Transformer, if we don't crank the batch size up incredibly high, we actually have a fair amount of leftover compute that we can use.
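To make that sub-linear scaling concrete, here is a minimal timing sketch, not the authors' code: it times one training step of a single Transformer encoder layer at several widths on whatever device is available. The batch size, sequence length, and widths are illustrative assumptions; on a GPU with spare capacity, step time typically grows much more slowly than the parameter count.

```python
# Minimal sketch (not the authors' code): time a training step of one
# Transformer encoder layer at different widths. While the GPU still has
# idle compute, step time tends to grow sub-linearly with width.
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
batch, seq_len = 32, 128  # illustrative sizes, not taken from the paper

for width in (256, 512, 1024, 2048):
    layer = nn.TransformerEncoderLayer(
        d_model=width, nhead=8, dim_feedforward=4 * width, batch_first=True
    ).to(device)
    opt = torch.optim.Adam(layer.parameters())
    x = torch.randn(batch, seq_len, width, device=device)

    for _ in range(3):  # warm-up steps before timing
        layer(x).sum().backward()
        opt.step()
        opt.zero_grad()
    if device == "cuda":
        torch.cuda.synchronize()

    start = time.time()
    for _ in range(10):
        layer(x).sum().backward()
        opt.step()
        opt.zero_grad()
    if device == "cuda":
        torch.cuda.synchronize()
    print(f"d_model={width}: {(time.time() - start) / 10 * 1e3:.1f} ms/step")
```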
So making the model bigger doesn't necessarily translate to a linear increase in runtime, and we can afford to make the models bigger without linearly increasing execution time.

But parallelism and runtime don't correlate to cost, because you're just running more compute at the same time; the same goes for the carbon footprint, for that matter.

Yes, and now you're getting to the interesting stuff. First, in the paper we actually tried to control for exactly that objection, the "aha, you're just going to increase the cost of compute." We looked at a single GPU, and what happens is that you're not using all the cores on that one GPU when you run the smaller model. As we make the models bigger, for a fixed batch size, we get an increase in the utilization of the GPU. Right now it's not easy to turn off those cores, and you're also paying a fixed overhead to power the box the cores live in, plus cooling, so trying to power-throttle individual cores on a GPU is generally not a great idea, especially if we can get better utilization of the cores we have. Then, as you suggest, we looked at more GPUs. We're going to burn more resources as we turn on more GPUs, but the hope is that we get to a solution quicker, and if those GPUs are already attached to our box, which they often are, there's usually some incentive to try to use them as efficiently as possible.

That brings us to the second question: if I make my model bigger, is it really improving in efficiency? What we care about is the improvement in perplexity, a reduction in perplexity as we train; we'd like to reduce our error as quickly as possible in wall-clock time, because we have to pay for the power, the building, and so on. The simpler thing to look at first is how the perplexity or error decreases as a function of the amount of data we touch. There are two knobs there, and now we're getting into the weeds. There's the batch size, which determines how much data we look at per step of the algorithm: the more data we look at, the better our estimate of the direction that minimizes the loss, which in principle should give faster convergence. It also increases GPU utilization, so it's another mechanism for getting better utilization out of each piece of hardware. But it has diminishing returns: as we increase the batch size, the speed at which we converge as a function of the samples we look at doesn't necessarily keep improving. And one of the side effects, which anyone in computer vision will recognize, is that as we increase the batch size there's some risk of overfitting. That shows up more in computer vision, where we're somewhat data-poor, whereas in NLP it seems, at this point at least, that we have a lot more headroom before we actually start overfitting. So there's the question of the generalization gap, the gap between how well your model fits the training data versus the test data, and on NLP tasks we're not at a point where that generalization gap is disappearing, which means we can increase the batch size quite a bit more without overfitting, and it also means we can increase the model size quite a bit more. So what the paper does is compare this trade-off between model size and batch size to find the best combination.
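The experimental recipe that implies can be sketched as a small grid sweep: for each (width, batch size) configuration, train until a target loss and record the wall-clock time. The toy task, sizes, and thresholds below are my own stand-ins, not the paper's setup; the point is only the shape of the measurement, time-to-target rather than loss-per-step.

```python
# Sketch of the kind of sweep described here (toy task and sizes are my own,
# not the paper's): for each (width, batch size) configuration, train a tiny
# Transformer language model and record wall-clock time to reach a target loss.
import time
import torch
import torch.nn as nn

VOCAB, SEQ, TARGET_LOSS, MAX_STEPS = 64, 32, 0.5, 2000
device = "cuda" if torch.cuda.is_available() else "cpu"


def make_batch(batch_size):
    # Synthetic rule: next token = current token + 1 (mod VOCAB), so the
    # model only needs to learn a simple local pattern.
    start = torch.randint(0, VOCAB, (batch_size, 1))
    seq = (start + torch.arange(SEQ + 1)) % VOCAB
    return seq[:, :-1].to(device), seq[:, 1:].to(device)


def time_to_target(width, batch_size, seed=0):
    torch.manual_seed(seed)
    embed = nn.Embedding(VOCAB, width).to(device)
    block = nn.TransformerEncoderLayer(
        d_model=width, nhead=4, dim_feedforward=4 * width, batch_first=True
    ).to(device)
    head = nn.Linear(width, VOCAB).to(device)
    params = list(embed.parameters()) + list(block.parameters()) + list(head.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    causal = torch.triu(torch.full((SEQ, SEQ), float("-inf")), diagonal=1).to(device)
    loss_fn = nn.CrossEntropyLoss()

    start = time.time()
    for step in range(MAX_STEPS):
        x, y = make_batch(batch_size)
        logits = head(block(embed(x), src_mask=causal))
        loss = loss_fn(logits.reshape(-1, VOCAB), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        if loss.item() < TARGET_LOSS:
            break
    return step + 1, time.time() - start


for width in (64, 256):          # "small" vs "large", at toy scale
    for batch_size in (32, 256):
        steps, secs = time_to_target(width, batch_size)
        print(f"width={width:4d} batch={batch_size:4d}: {steps:5d} steps, {secs:.1f}s")
```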
One of the neat results we find is that cranking up the model size, and the batch size to a certain extent as well, gives the best outcome. It gets us a model that's more sample-efficient, so the more samples it sees, the faster it reduces the test error, and it also lets us better utilize our hardware, so we're actually getting a gain from parallelism. Those two forces interact to give us a faster reduction, in wall-clock time, in the test perplexity or whatever error metric we care about.

In terms of this generalization gap and the differences between what you see in computer vision and what you see in NLP tasks, is that related to the way the problems are formulated, supervised versus self-supervised or semi-supervised, and the availability of data and labels?

Absolutely, you're hitting a key point. In computer vision we are largely still focused on supervised tasks; we need labeled datasets, which are big but not as big as we'd like. In NLP we can go to really large unlabeled datasets, because we're essentially predicting missing words; we've created a self-supervised task. That means we have so much more data, and we can support bigger models and bigger batch sizes without the generalization gap disappearing, sorry, without eliminating it, or causing our training error to go to zero and our test error to dominate. So there is this opening created by self-supervised training that we were able to take advantage of. Now, in our experiments we test both the self-supervised training task and the downstream translation or classification tasks it would be applied to, the actual supervised tasks, but that part is typically done as a small fine-tuning step on top of the extensive pre-training, which is where all the CO2 is going.

In your description of train large, it sounded a little bit like you're ultimately saying, fully utilize whatever box you're training on, but there's a lot more nuance there. Can you elaborate on that?

This has been a big question in data center design generally, as a systems person: should I turn off my GPUs, should I turn off my CPUs, or should I try to utilize them better? The general consensus when we think about data centers is that we really do want to maximize utilization, partly because we bought the data center, it's expensive, and we should use it; we also have to keep it cool and staff it, and there are a lot of other factors in play. So we want to use that hardware as much and as efficiently as possible. Part of the reason is that, if you think of things like serverless, when I'm not using the hardware I can put something else there, and we're creating markets for filling in the excess capacity. So the idea that I would turn off a GPU is sort of silly; I should always have something running. The question is whether I can make the thing running on the GPU as efficient as possible, and in our work we're focusing on maximizing that efficiency. In my lab, for example, students are competing for GPUs, so the one-GPU experiments are definitely easier to run, because they're not fighting for the entire box and the other GPUs are being used for other experiments, and then when we go to eight GPUs we use the whole box.
So the general consensus, or at least the thought process today when we think about data centers, is to really maximize utilization and not try to power-throttle or limit the performance of each of the cores. It could be that in the future new kinds of hardware will change that trade-off, but the underlying economics suggest that if you bought the device, you should find ways to maximize its usage, and given that machine learning has an infinite supply of things we'd like to train, it's not hard to imagine always filling the excess capacity with more training.

Is the paper fundamentally an economics paper, in the sense that you're trying to maximize utilization and that kind of thing, or do you also get to results about performance given a set of constraints, like your traditional computer-science-y kinds of papers?

It's funny, we hadn't gone down the economics route, or only very loosely. We are very much thinking about the economics of computing when we look at serverless; that is going to fundamentally change the economics of computing in a way that I think will make things more efficient, more cost-effective, and actually easier, so it's a win for everyone, and we have an upcoming paper at HotCloud about how the economics of serverless should be generally favorable for everyone, assuming we get some of the systems problems ironed out. This paper was really our students' first effort to make progress in the BERT training space: to find mechanisms that we in academia can use to go fast. Part of that is finding better ways to be more efficient about training; it allows us to run experiments more quickly so we can innovate on BERT, and one of the things we're looking at is trying to make these models more non-parametric so they can leverage other data sources.

One of the side consequences of this paper is that if you're out there thinking, that's really cool, I want to apply that, well, wait a second: you made the model four or five, six or seven times bigger, and that's expensive for inference. What are you going to do about that? In fact, when we got this result, that was my first conclusion: yes, we won on the training front, but we just made inference, which is actually the more expensive problem, worse by seven to ten times. And if you think about it, training should only be a small part of the use of these models; inference is where the cost really should be, and it is when you look at a practical application. We might train a model once, but we're going to run it 24/7 on every single tweet, every single web page we encounter, and that's a lot of inference. On a V100 you can do about a thousand sentences per second when batch-optimized, which sounds good, but then you think of the amount of text on the web: that's a lot of expensive GPU hardware. So making the models smaller after training was one of the questions we had to solve. The second half of the paper comes back and says, wait a second, we made the models bigger to train faster, but now we need a way to squeeze them down. And maybe the bigger insight, which is also maybe a little less counterintuitive, is that the bigger models can actually be made smaller more efficiently, with less degradation in accuracy. We train a really large model and then we chop it up: we explored both weight pruning, eliminating weights to make the model more sparse, and quantization, reducing the bit precision of each of the weights. So we're able to take our much larger models and apply these compression techniques, and the effect is that the compressed large model can end up smaller than the small models while retaining higher accuracy.
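As a rough illustration of those two off-the-shelf steps, here is a hedged sketch using standard PyTorch utilities: magnitude pruning, then dynamic int8 quantization of the linear layers. The tiny stand-in network, the 60% sparsity level, and the layer choices are illustrative assumptions, not the pipeline or hyperparameters from the paper.

```python
# Hedged sketch of off-the-shelf compression, not the authors' exact pipeline:
# magnitude-based weight pruning followed by dynamic int8 quantization.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for a trained model; a real run would load the trained Transformer.
model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))

# 1) Prune the smallest-magnitude weights in every Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)  # 60% sparsity, example value
        prune.remove(module, "weight")  # bake the pruning mask into the weights

# 2) Quantize the Linear weights to int8 for cheaper inference on CPU.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

print(quantized(torch.randn(1, 768)).shape)
```

The observation in the conversation is that starting from a larger trained model leaves more room for this kind of compression before accuracy degrades.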
Were you able to use the compression techniques off the shelf, or did you have to adapt them to this kind of model or to the specifics of the way you trained it?

We were getting close to the deadline, realizing our models were now ten times bigger, which was great news. One of the students, Sheng, who was working on this project, had just finished a piece of work on quantization, so we asked, Sheng, can we use your quantization techniques? I don't know, maybe. He started playing around, and it turned out to work really well. So we used standard quantization and standard pruning; we tried not to innovate extensively on those pieces. It was more an exploration of how they interact with this counterintuitive hypothesis that bigger models might actually be better, both for training and, it turns out, for inference as well if we compress them.

Got it. Not that you have any of these numbers right at your fingertips, but can you give us a sense, when you say train large, of what large means in this case and how that compares to what Google might do typically?

I think we were looking at something like six or seven times bigger than what was normally published. I'm guessing Google actually goes much larger still, and they might already be benefiting from these ideas.

And what order of magnitude is that in terms of the number of servers or CPUs or GPUs?

We were at eight GPUs, and we also ran experiments on a TPU v3. I don't remember the exact sizes; I'd have to have the paper in front of me, if I can find it among many tabs. I think we went up to 24 layers, and we tried hidden sizes on the order of 1536 hidden units per layer, so a pretty reasonable space. We built off the RoBERTa work, which, if people haven't looked at it, is kind of neat: it revisits what BERT did, and in some ways BERT really had the right answer; just broader experimentation over the trade-off space makes a big difference. So we built off of that and tried different variations on the sizes described in that paper.

The rough magnitudes I remember reading about, and I don't remember whether this was BERT or ELMo or one of the other variations or RoBERTa, were on the order of tens of GPUs, 64 GPUs or more.

So we used a TPU cluster for our big experiments. We actually tried to reproduce the RoBERTa setup so that our comparisons are against those standard baselines, and we had to use a TPU cluster for several weeks; it's expensive to run the full experiment for the baselines. The nice thing is that by making the model bigger we could get comparable results quicker.
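For concreteness, the scale described above (24 layers, hidden size 1536, built off a RoBERTa-style setup) can be written down as a Hugging Face config. The attention-head count and the remaining defaults below are my own guesses for illustration, not the paper's exact configuration.

```python
# Roughly the scale described above, expressed as a RoBERTa-style config.
# Head count and the other defaults are illustrative assumptions.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    num_hidden_layers=24,
    hidden_size=1536,
    num_attention_heads=24,      # assumed; must divide hidden_size
    intermediate_size=4 * 1536,  # conventional 4x feed-forward width
)
model = RobertaForMaskedLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```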
So we've characterized bigger; now characterize quicker. What did that mean in practice?

We used, I guess, about a hundred thousand seconds to get to fairly competitive accuracy. I'd have to check the final numbers; it depends on which task, and we also accounted for the downstream fine-tuning that you'd need to do. I don't remember it off the top of my head.

Well, we'll leave it as an exercise to the listener to pull up the paper. So you've increased the size of the model and the resources, the GPU resources rather, that the model is running on, and in turn decreased the training time of the model and the aggregate compute cost. Did you need to do anything special to accomplish that, or was the paper primarily observing that, in aggregate, the preferable approach is to increase the model size?

We made small changes to where we place the batch normalization; we did pre-normalization, but those are negligible changes in the underlying architecture. Most of it was really exploring the trade-off space across these different parameters. So in some sense it's a first step towards a bigger agenda; it wasn't intended to be a ground-changing piece of work, but what's kind of neat is that it does make us rethink how we approach training these models.

It's important stuff. I think a lot of people will think, well, if Berkeley is worrying about the cost of training these models, what hope is there for my lab at a non-Berkeley institution, one without such close ties to Silicon Valley, not relatively awash in resources? So if you can figure out how to make training these models more efficient, that's potentially a huge impact for a lot of people.

There's an important subtlety here, which is that the pre-training isn't something everyone needs to do. Google has done a great job of offering these pre-trained models, so the really expensive part isn't something every group in the world has to repeat, and that's a good thing. But if we want to innovate on that pre-training process, if we want to do research on it; and in fact the data suggests that adding more data specialized to your domain can improve the quality of the model; so if we want to be able to push pre-training research forward, we do need to find ways to make it more efficient. And I should say we started out thinking we were going to invent a new BERT, and discovered in the process that maybe we don't need a new BERT yet; maybe approaches to how we do the training and how we choose our hyperparameters can make a really big difference.

Your comment just prompted a thought: to what degree has the trade-off space around pre-training versus fine-tuning been explored? If I know that I have a unique domain and some corpus of documents or data available to me, is there concrete research I can look to to help me understand whether I should be pre-training from scratch versus fine-tuning, or do I just need to try everything and see what works?

Try everything is not terrible advice, but here's what I would tell my students. Pre-training is expensive, so maybe start with fine-tuning. Understand what your prediction task is, which is what the practical world will do: take your prediction task, whether it's translation or sentiment tagging or deciding which call-center person a message should be forwarded to, and focus on fine-tuning for that task first. There's a little bit of art in choosing learning rates and such to get your fine-tuning to work, so go through that process and understand how well you can do by fine-tuning to your domain. Then, if you have, and you might, billions of call records from the past, and you think you could really improve the underlying representation, you could go back to the masked language model training, the pre-training process. The work that we've done, and other work going on around us, can help make that process more efficient, so that in a matter of weeks, or a week in our case, you can take your V100 box and make substantial progress toward a model that's now pre-trained on your domain as well.
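A minimal sketch of that fine-tune-first advice, using the Hugging Face Transformers library: the model name, the toy call-routing texts, the label set, and the handful of optimization steps are all illustrative assumptions rather than anything from the conversation. A real run would iterate over a proper DataLoader and a validation set.

```python
# Sketch of "fine-tune on your task first" with Hugging Face Transformers.
# Model name, toy texts/labels, and step count are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Tiny stand-in for a labeled, domain-specific task (e.g., routing messages).
texts = ["please reset my password", "I want to cancel my subscription"]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few illustrative steps; a real run loops over a DataLoader
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Only if fine-tuning tops out and you have lots of unlabeled domain text would
# you move on to continued masked-language-model pre-training (for example with
# AutoModelForMaskedLM and the same kind of loop over raw domain text).
```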
And so where do you see this line of research leading?

I ask my students this all the time: now that we have the tools to start experimenting with pre-training, what should we do next? One of the things I'm excited about comes from the realization that we cram a lot of knowledge into the weights of our model, and making the models bigger certainly helps with that. Another way to deal with knowledge is to not put it in the model at all. I have to look things up most of the time when I want to remember facts; I'm terrible at remembering facts, so I use the internet. I have a neural net in my head, and it doesn't have to memorize the internet, because I have the internet. Having access to a knowledge base can make a big difference in how much we need to encode in the model, and it can make the model smaller. The ability to synthesize decisions, or to apply logic on top of knowledge bases, seems like a really big opportunity for language modeling, for NLP broadly, and maybe even for these basic representations like BERT. So we've started looking, and some other groups actually have early published work here, at how to bring a non-parametric or semi-parametric lens to these models, so that we can reference into a large knowledge base in the construction of the embeddings themselves. That has the advantage of maybe being more efficient and of allowing us to grow the knowledge base without having to retrain the model: you get more data and your model just gets better without adjusting the model itself. It might even give us some explainability: when we go to make a prediction, about how we embedded a sentence or how we represent the decision about which call to route where, we can point back at historical data and say, here's the data we used in our embedding to reason about that. And you can respond, that's terrible data, I don't want it in my data set, or, actually, that makes sense. That connection to the data could also help with explainability. So that's the vision my students and I are pretty excited about right now.

Does that pull from, or lead you to, memory-based networks or information-retrieval types of problems?

Yes. Memory nets, IR, all of these; IR is the more classic one, and memory nets are getting fairly classic at this point too. Those are our tools. One thing we're looking at right now is something as simple as whether we can use embeddings from other pieces of text, and simple text similarity, to recall those embeddings, and there's some other work exploring this now. Ultimately, things like memory nets, or pointers into a knowledge base that let you attend over an entire corpus, would be really exciting, and we're just starting to explore this, so there's a lot more to do.
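The simple text-similarity version of that idea can be sketched in a few lines; the sentence-transformers model name and the toy corpus below are my own stand-ins for whatever domain data and encoder you actually have.

```python
# Rough sketch of "look it up instead of memorizing it": embed a small corpus,
# retrieve the nearest stored passage for a query, and surface it as the data
# that supports the prediction. Model name and corpus are illustrative.
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Customer asked to reset a forgotten password.",
    "Customer disputed a duplicate charge on their card.",
    "Customer requested cancellation before renewal.",
]
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

query = "I was billed twice this month"
query_emb = encoder.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, corpus_emb)[0]
best = int(scores.argmax())
print(f"Supporting example: {corpus[best]!r} (score {float(scores[best]):.2f})")
```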
It does push in the direction of IR: imagine a neural net that can Google things for you to answer questions. So there's certainly a well-studied area that we'll have to build on.

You touched on explainability in there as a possible benefit. I'm curious to have you elaborate on that a little more, and you've also been doing some work on explainability as applied to reinforcement learning; maybe you can share a little bit about that work as well.

So I'm the co-director of the UC Berkeley RISE Lab. The R stands for real-time, and we're working on that; the I is intelligent, working on that; the S is secure, and we have an interesting agenda on security, which is a conversation for another time. Then there's the E, and for a long time the question was, what is the E really, what should it mean? We were initially thinking execution, we're going to execute things securely, but that has some bad connotations. So maybe there's another agenda, which actually came out of our exploration of lineage, how we track the relationship between data and the model development process: explainable. Explainable decisions would be a good thing to be able to support, and so we have an ongoing agenda around how to do explainable machine learning by connecting our training process to our data. What's also pretty exciting is pulling in some of the recent work in explainable AI; my thesis advisor Carlos has an exciting line of work, LIME, which provides a black-box mechanism for explaining model decisions, and my students have been exploring that as well.

It's had tremendous staying power; it's one of the things we talked about three-plus years ago in that early episode, and LIME still comes up all the time.

Yeah. At the risk of another tangent, the world of explainability is rich, and it was created by the need to make sense of models that no longer make sense. The idea that I can inspect my model and go, ah, I like how that makes decisions, is gone, at least to a first approximation. So we're left with: justify the decision you made; let's go back and at least connect it to the data, or even to the input. My group started looking at that, and one of the things we started to ask is, why can't we have some interpretability too? One of the agendas I'm exploring, actually not in the language domain but in the vision domain, is how to connect decision trees back to our deep learning models, so that we can get the accuracy we want but can also interrogate the model: it's going to call this a cat, but to do that it first has to identify that it's an animal, then cluster it with animals with legs, then with fur, and then it gets to cat. So there's an opportunity to understand, at a high level, what the model is going to do. Each of those decisions is governed by a neural net, so understanding that part is off the table, but at least we now have an idea of the decision process the model will go through. This is our recent work on neural-backed decision trees. When we look at language, it's an interesting question what an explanation would even look like. There are techniques like Grad-CAM that have been pretty popular in vision, which highlight the parts of an image the model is attending to; we could do that, and there are explorations of that in the language domain.
But one of the neat things, going back to our opening narrative, is: can I connect my decisions back to data? In many cases that is the ideal explanation: here's the data that informed my decision about what you've just given me. That kind of explanation is what we're exploring. One of the hopes in doing that is that you can not only connect it, you can even fix it. One of the ideal outcomes of an explainable model is that when it gets something wrong, you can say, that's wrong, and here's what is wrong about it, and that extra signal could be far more valuable than just some more labeled data. That's our hope: if we're referencing data, being able to correct the knowledge base so that we don't use that reference data in the future would be one mechanism; in the case of a decision tree, changing the paths, saying cats can't be attached to the things that live underwater, that doesn't make sense, so I want to move my cat class somewhere else in my tree. That opportunity to intervene in a model is something I think about when we look at explanations.

Do you draw a hard line between explanations and interpretability when you're speaking about them casually?

I do, a little bit, because classically, my background being more in traditional machine learning, we cared a lot about interpreting models, and that meant I could look at the individual pieces of the model and reason about conditional probabilities and what they say about the priors I'm trying to impose on my data. So interpretability, to me, means something that is intrinsic to the model, whereas an explanation could also be called a justification: it looks, retroactively, at a decision that was made and provides an explanation. If we look at humans, humans are not interpretable; you can't look at their brain, well, most people can't look at a brain, and know what they're going to do, but they provide meaningful explanations, and maybe that's all we can hope for, and we learn to work with it. My hope with the explainability work is to also provide a little more of the interpretability, and I think that's important, because if I'm going to deploy a model, I'd like to understand in general how it's going to behave, not just on one candidate piece of data. The other lens we're bringing is being able to adjust the model: when it gets something wrong, I want to be able to correct it.

So in this work with the decision trees, is it set up such that the decision tree is an intrinsic part of the model, where there's a model superset that includes the decision tree, or is the decision tree a kind of surrogate model that's used when you're asking explainability kinds of questions?

No, the challenge here was to make the decision tree the model. We don't do anything really crazy or complicated: in this setup, a standard ResNet-101 finds an embedding of the input, and then we use a decision tree on top of that embedding, which lets us route the decision process based on that embedding from ResNet-101. So there's still a part of the model that I can't make sense of; the deep part isn't interpretable, or I can only explain a small component of it, but there is now structure in how decisions are made.
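Here is a toy sketch of the shape of that setup: a ResNet backbone produces an embedding, and a small hard binary tree routes that embedding to a class. It is not the neural-backed decision tree code itself; the class hierarchy is hand-built, the decision nodes are untrained, and, as discussed next, the real method also fine-tunes the backbone with a tree-supervision loss so the embedding space and the hierarchy agree.

```python
# Toy sketch of a decision tree over a ResNet embedding (not the authors'
# neural-backed decision tree implementation). Fine-tuning the backbone with
# a tree-supervision loss, described below, is omitted here.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet101()   # pass pretrained weights in a real setup
backbone.fc = nn.Identity()     # keep the 2048-d embedding instead of logits


class TreeNode:
    def __init__(self, decision=None, left=None, right=None, label=None):
        self.decision, self.left, self.right, self.label = decision, left, right, label


def route(embedding, node, path=()):
    """Follow the tree to a leaf; return the predicted label and the path taken."""
    if node.label is not None:
        return node.label, path
    go_right = node.decision(embedding).item() > 0
    branch = node.right if go_right else node.left
    return route(embedding, branch, path + ("right" if go_right else "left",))


# Hand-built 4-class toy hierarchy: {animals: {horse, dog}, vehicles: {car, truck}}.
# Each internal node decides left/right via a learned linear direction.
leaf = lambda name: TreeNode(label=name)
tree = TreeNode(
    decision=nn.Linear(2048, 1),
    left=TreeNode(decision=nn.Linear(2048, 1), left=leaf("horse"), right=leaf("dog")),
    right=TreeNode(decision=nn.Linear(2048, 1), left=leaf("car"), right=leaf("truck")),
)

with torch.no_grad():
    emb = backbone(torch.randn(1, 3, 224, 224))  # random stand-in for an image
    label, path = route(emb, tree)
print(label, path)
```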
Right, so the ResNet is basically learning this space of relationships between the things it's seeing, at least in a computer vision sense, and then the decision tree on top is making decisions about what is what, based on the space that the ResNet has learned.

Yeah, it is a funny recipe. A lot of vision work now takes a ResNet-like architecture as a backbone; its role is not to embody the decision itself but to extract pixel information, texture, shape, the image attributes that will then be used to make a decision, and it places them in a fairly high-dimensional space. The decision tree is constructed in a way that tries to use that space to make decisions. Now, that alone doesn't work, so you need to take that decision tree and fine-tune the neural net, the ResNet backbone, so that it's compatible with the decision tree we built. It's a small twist, but that small twist now allows us to get accuracy competitive with the original model, so you're still using the model, but you now have an interpretable path. One of the fun examples: if we give it a picture of a zebra, a class it has never seen before, it will route it down to near horse, but then it doesn't know what to classify it as, so it ends up in one of the horse categories. So it does try to extract structure over the classes that is semantically meaningful, and also meaningful in the image domain: things that look similar.

Can you give a quick intuition for why a fix is needed on the neural network side, as opposed to just throwing the existing decision tree against the existing embeddings, and what the intuition of that fix is?

The simple answer is that we tried throwing the existing embeddings into the decision tree and it didn't work. The deeper answer is that the neural net wasn't optimized to give its features the kind of coherent structure we can build this class hierarchy on top of. By adding this extra decision-tree loss, we can force the tree to cluster things using semantically similar structure: we want horses to be near dogs and to share fewer common ancestors with fish. So we can impose some structure on our tree and then force the neural net's embeddings to reflect that structure. That's why we need to adjust the neural net: to compensate, to be able to work in the context of the tree.

So you're changing your loss function, and the decision tree, to, I'm envisioning, spread out the embedding space or something like that, so that it can support something meaningful for the semantics of the decision tree. Cool. Awesome. Well, Joey, this has been wonderful, learning a little bit about what you're up to. We still never got very deep into serverless, so we're going to have to put a pin in that one for next time, but very cool stuff, and thanks for taking the time to share it with us.

Thank you, it's been fun.

Awesome, thanks. All right everyone, that's our show for today. For more information on today's show, visit twimlai.com/shows. As always, thanks so much for listening, and catch you next time.