Applied AI Research at AWS with Alex Smola - #487

The Importance of Causality in Machine Learning: A Conversation with Alex Smola

In this conversation, we dig into causality and its applications in machine learning with Alex Smola, Vice President and Distinguished Scientist at AWS AI. As the discussion unfolds, it becomes clear that causal reasoning has significant implications for the accuracy, reliability, and explainability of models.

Smola explains that causality deals with the directional relationships between variables, and he distinguishes two main flavors: the graphical-model, Judea Pearl-style approach, which reasons explicitly about interventions on a causal graph, and Granger causality, a more pragmatic notion for time series that asks whether the past of one variable improves the prediction of another. He recounts how Clive Granger's contemporaries dismissed the latter as "not real causality, just Granger causality," and the name stuck.
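
Granger's operational notion of causality, which Smola unpacks later in the conversation, can be made concrete in a few lines of code. The sketch below is an illustration only — the function name, the linear least-squares model, and the simulated data are my own choices, not anything from the episode. It compares a one-step-ahead predictor for a series x built from x's own history against one that also sees y's history; if the extra history shrinks the prediction error, y "Granger-causes" x in Granger's pragmatic sense.

```python
import numpy as np

def granger_score(x, y, lags=2):
    """Toy check of whether y helps predict x one step ahead.

    Compares the residual error of a linear autoregressive model for x built
    from x's own past against one that also uses y's past. A clearly positive
    score is informal evidence that y "Granger-causes" x. This is a sketch,
    not a statistical test.
    """
    T = len(x)
    # Lagged design matrices: one row per time step, one column per lag.
    own = np.column_stack([x[k:T - lags + k] for k in range(lags)])
    aug = np.column_stack([own] + [y[k:T - lags + k][:, None] for k in range(lags)])
    target = x[lags:]

    def rss(design):
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        return float(resid @ resid)

    return 1.0 - rss(aug) / rss(own)  # fraction of error removed by adding y

# Synthetic example: y leads x by one step, so y should Granger-cause x, not vice versa.
rng = np.random.default_rng(0)
y = rng.normal(size=500)
x = 0.8 * np.roll(y, 1) + 0.2 * rng.normal(size=500)
print(granger_score(x, y))  # clearly positive
print(granger_score(y, x))  # close to zero
```

In practice one would replace the raw error ratio with a proper hypothesis test (for example an F-test comparing the restricted and augmented regressions), and, as Smola notes, the conclusion hinges on which auxiliary context variables are included alongside the two series.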

The Conversation Continues

Smola also stresses the value of operational approaches. Full interventional analysis can be impractical for many applied problems, which is one reason he finds Granger causality attractive: it gives concrete strategies for establishing whether one variable carries causal information about another. His teams at Amazon apply such tools to questions like why a server is misbehaving or why a supply chain is not doing what it should, and some of this surfaces in products such as Lookout for Metrics, which aims to explain anomalies rather than merely flag them.

One key area of ongoing work is tooling that makes causal analysis usable without a team of specialist scientists; Smola hints that more is coming for Granger-style causality ("watch this space"), while cautioning that packaging these methods safely is subtle. He also notes that machine learning is usually the "spice" rather than the main dish: much of its value lies in making other services smart and adaptive rather than in being the product itself.

The Role of Causality in Machine Learning

Causality matters most when you need to explain why something changed, not just that it changed. Smola acknowledges that it can be tricky territory — the tools are subtle and require a reasonable amount of skill to use correctly — but argues that with well-designed tooling, teams can work backwards through a causal graph and identify which component of a system changed and how.

Smola also warns about pitfalls. Even widely used methods such as Shapley-value explanations (SHAP) are easy to get subtly wrong: he recounts that the SHAP code does the right thing, but the analysis in the original paper conditioned on the wrong variables when reasoning about interventions, which the Tübingen team pointed out in a follow-up. Getting interventional questions right depends on which variables you stratify over, and his work aims to package this kind of care into practical, user-friendly tooling.
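
The pitfall here is the difference between conditioning on an observation and conditioning on an intervention — it determines which variables you stratify over in a Shapley-style attribution. Smola's light-bulb example from the full conversation can be turned into a tiny simulation (a hedged illustration; the variable names and the Monte Carlo setup are mine): observing that the bulb is off is strong evidence the switch is off, but smashing the bulb, i.e. do(bulb := off), leaves the switch distribution untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Structural causal model: switch -> bulb (the switch causes the bulb's state).
switch = rng.integers(0, 2, size=n)   # exogenous coin flip: 0 = off, 1 = on
bulb = switch.copy()                  # the bulb is on exactly when the switch is on

# Observational conditioning: among samples where the bulb is off,
# the switch is essentially never on.
print(switch[bulb == 0].mean())       # ~0.0

# Interventional conditioning, do(bulb := off): forcing the bulb off severs the
# arrow *into* bulb but does nothing upstream, so the switch keeps its marginal.
print(switch.mean())                  # ~0.5
```

An attribution method that mixes up these two kinds of conditioning will happily conclude that smashing bulbs flips switches, which is exactly the sort of subtlety Smola says needs well-trained scientists to package safely.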

The Future of Causality Research

As the conversation winds down, it becomes clear that causality research is evolving quickly. Smola notes that there are still many open questions, particularly around building reliable models that capture complex causal relationships at the scale of real production systems.

One ongoing thread is the development of new tooling for causal analysis. Smola hints that new developments are on the horizon but emphasizes careful evaluation before release, since poorly packaged causal tools can lead users to conclusions that hurt them. He adds a related caution from the modeling side: don't overfit to your validation data — keep a genuinely held-out test and check it only at the end.

Conclusion

Wrapping up, it's clear that causality is an aspect of machine learning that rewards careful treatment. Practical tools and strategies for causal reasoning let researchers build models that better capture the complexities of real-world systems, and as the field evolves, caution and rigor in causal analysis remain essential.

Smola's work highlights the value of operational approaches in machine learning, emphasizing practical, user-friendly tools for understanding causality. Embracing that mindset helps researchers make the services around machine learning smarter and more adaptive — the "spice" that makes the main dish worth serving.

"WEBVTTKind: captionsLanguage: enall right everyone i am here with alex smalla alex is vice president and distinguished scientist at aws ai alex welcome to the twimla ai podcast hey thanks very much for having me here i'm really delighted to get the opportunity to talk to your listeners and i hope everybody gets something useful out of it i'm really looking forward to this uh conversation i mentioned to you when we were chatting earlier that i cornered you at a re invent like just after you joined aws and in uh you know perfect in typical aws fashion you had nothing to do with uh a pr journalist type person and so this is kind of a uh you know achievement unlocked moment for me it's been a long time coming i've been looking forward to the opportunity to chat with you so i am also excited we're going to cover a bunch of the cool stuff you're working on from a research and aws perspective and uh we'll touch on the event that you're uh heading up or participating in the aws ml summit uh towards the end but to get us started i would love to have you just share a little bit about your background and how you came to work in machine learning okay so i'm actually a physicist by training and there's the saying that physicists aren't good at anything but you can use them for everything and i think that's probably how i ended up with machine learning um the slightly longer story is that when it came to doing my master's thesis i looked around didn't find anything terribly exciting at my university and then i looked you know maybe i can go and do my masters somewhere else and this was at eighteen t actually young liquor was the department header at the time and vladimir putnik was my phd was my master's advisor and this was even at the time a really great opportunity so mind you this was in 1995. 
so i went to 18t and never looked back and so i've taught computer science classes but i've never really attended a proper computer science lecture in my life and sometimes we chose in quite embarrassing ways any particular examples of that um so for my phd defense i had to read up on what this p versus np thing was all about um because i was told alex if you don't know that they can fail you on your phd and so of course i did read up so i think by now i understand and also why it's really a useful and necessary concept the good news is it still makes life exciting for me because i'm still learning new things in computer science even decades later oh that's that's awesome that's awesome uh what's your role at amazon at aws so my role is to help us design and plan new algorithms new tools new services to share them with our customers to make sure we plan ahead strategically for instance through the lablets where we invest in longer term strategic research and then also how to teach this to well users say for instance the die from the deep learning project has 175 universities that teach from that now and this is essentially also a way for us to give back to the community to share with them and to help them use machine learning so in other words my role is to plan but also if something's broken okay blame me for it if it if it works credits to the team obviously um so that would be probably a good summary awesome awesome so i i'd love to dig into some of the research oriented things that you're working on uh and one of the areas that i found really interesting in our earlier chat was uh your kind of perspective and take on deep learning on graphs and some of the work that you were doing there so i'd love to kind of start there um so yeah well just to kind of contextualize you you gave this really you know we i've talked to a bunch of people about uh you know graph uh machine learning and deep learning on graphs and uh the examples that often come up are you know these health care use cases and of course social networks uh you gave this example you know to motivate the conversation around like learning uh page rank uh you know i'd love to to have you walk through that example i thought it was a great one so to some extent graphs are really everywhere because pretty much all the data that's for instance being stored in relational databases or maybe it's ultimately a graph it's just that it's instantiated it's stored as multiple tables and then when a key is being shared well you know that there's your age but let's start with something very simple like page rank and i think we can all agree that this is a really really impactful algorithm if for no other reason that you know this is what got google started and it was really the brilliance of you know um basically the co-founder so there are patients i gave green together with uh advisor motvani um i need to you know come up with an algorithm of a random surfer of moving from vertex one vertex to the next and then you know you essentially work out the distribution of you know where a surface would go and this is you know it was a stroke of genius to figure out that this is a great algorithm but stroke of genius is very rare to find on the other hand if you look at what the algorithm actually does it's very simple it basically looks at a particular vertex in the webpage graph and it updates one number namely the page rank based on what the naval vertices have as the information actually it's just all the incoming vertices and then it's 
propagated out but that's minor details and there are different flavors of that now wouldn't it be nice if we could learn a function that does similar things to page rank if i just knew that certain pages are more interesting than others so in this way i wouldn't need two brilliant phd student and a brilliant professor to come up with a function that i can just learn from data and learning on graphs is exactly that so in a nutshell it's really learning what one could consider vertex update functions where in my graph i learn how to update the state and the representation of that vertex based on its neighbors and based on a desired outcome so for instance if i want to detect fraud and i know that you know you're a good guy and i'm associated with you then you would help that corresponding information about trustworthiness is propagated and that i can learn this on the other hand if i'm up to no good and you're associated with me then maybe some of that may rub off onto you conversely so this is one of the applications where deep learning on graphs is meaningful but you can do so much more you can extract knowledge graphs you can reasonable knowledge graphs you can answer questions essentially pretty much you you should just go to kdd and take most of their database and you know structured data problems and you can recast them as graph problems and probably have quite some improvements out of it beyond pagerank are there problems that you've uh that you've you've either done this recasting with or you think are interesting ones to kind of further motivate the the idea of thinking around graphs sure so one rather recent paper that uh my team in shanghai uh did is a mapping between and it's basically an unsupervised extraction of knowledge graph from texts so usually generating knowledge graphs is really a costly process because you need humans to annotate things and for instance you then get an annotation like alex amazon works at right and somebody needs to build those extractors crawl into all the things now there's a fair amount of work that people have done with regard to cycle consistent training um one of the first ones to propose this was aliosha efros you may have seen examples of picks to picks or galloping zebras where or mapping satellite images into maps or you know sketching handbags and then having photos of handbags you may have seen such demos in the past and the idea is actually very simple what you say is i have two sets of modalities and they are paired in some ways and i want to make sure that as i translate from one to the other that what comes out of it looks like that desired modality and if i translate back i get something back that looks like the original so for instance if i want to translate from english into french well whatever comes out of it even if i don't have paired text should look and read like french and if i translate it back then it should look and read like english and furthermore it better be very close to the original so this is the site consistency and you can use the same idea and see what was the term cycle consistency because what you do is you go for instance from english to french and then back to english right or in our case what the team did is we went from text to knowledge graphs and then back to text and likewise from knowledge graphs to text to knowledge graphs and those two cycles need to be consistent in so far as you get meaningful knowledge graphs out of text or you can generate meaningful text out of knowledge graphs but there's there 
are a couple of wrinkles to it and they that actually made it quite challenging because i can translate things like alex amazon and works at in many many different ways so i can you know just say well alex works at amazon or amazon employs alex or many other different ways of text so there's not a one-to-one mapping right so unlike many other works you really need to take care of the full proper distribution of the mapping rather than just you know the pairing and that gives you then if you use appropriation a lot encoder good models that you can train and you basically get accuracies that are very close to fully labeled problems that's really exciting because generating knowledge graphs is costly but it's also prerequisite if you then afterwards want to reason in more structured ways if you want to be able for instance to edit a knowledge representation so before we jump into into that so you have these two domains text and knowledge graph and you're able to go from one to the other and back and we know that you know something like a variational auto encoder it's great from going from one domain to the other how do we is is the application of graph just that one of those domains is a graph or are we kind of using graph architecturally with the auto encoder to enable the the cyclic consistency okay so in this case one of the domains just so happens to be okay but you still need to then actually compute gradients right and the thing is just because it's a knowledge graph i mean you still have relations between the various entities and and you know so you basically need to estimate whether an it should occur or not so you need to reason over the graph per se as you generate the text got it right so that's why um you need deep learning for graphs for instance for this application i just want to pick something that is significantly not trivial because you know we've all done some you know reasoning on graphs for instance for fraud for recommendation personalization and all of that that's pretty much straightforward the other challenge is more how do we scale it up to very large problems and scale it up in a way where we actually still know what the model does and some of our conversations around graphs these uh concepts like symmetry and isomorphism and these properties that allow you to manipulate grass but still have them be fundamentally the same in some way come into play it seems like that would make the you know the cyclic aspect of what you're trying to do particularly challenging um i think there's the the aspect of symmetries arises from something quite different and i quite honestly i think it's been a little bit overplayed in graphs because all you have is you have permutation symmetry of the neighbors and so a very long time ago we wrote this paper on deep sets essentially this was motivated by okay well translation in variance well gives you a convolution and so you can derive conv kind of nets from you know locality and translation in variance over images or then you know if you have stationarity in a time series well you get autoregressive models with the appropriate state and so at some point i was wondering you know which other symmetries are there still out there that haven't been exploited yet and then you know the simplest symmetry that you can think of is really the permutation symmetry in other words if i have a set where the order of the elements within the set doesn't matter and that's what we got deep sets where you basically say well i have functions that operate on a 
set of elements in no particular order okay and well why does this relate to graphs because well if i have a vertex and it has some neighbors well these are just neighbors they're not in any particular order so any function that is defined on the vertex and its neighbors needs to satisfy this permutation symmetry right and so that then imposes certain functional forms that you know this vertex update function can only take now why would you care about it because i mean this sounds like you know some you know fairly fancy mathematical theory you know who cares about group theory um it's actually really useful because it means that your search space of what you need to design your function class at is much smaller it's just like you know why do you care about translation in variants well because the convnet is so much easier to design than a fully connected network right right it has a lot fewer parameters you can optimize very different chips in the same way with a graph if you have permutation symmetry it means that you can actually get away with a lot fewer parameters because now it means that you cannot learn specific functions for specific neighbors and that means your parameter space collapses now regardless of how many or which order the neighbors are so that simplifies life a lot yeah and then you can focus on other aspects in the implementation nice nice um you mentioned in our our chat uh you had a really interesting take on the relationship between language models and graphs uh i'd love for you to jump into that so this is the other reason why we are really looking at look at graphs so if you look at what's currently going on with language models um it seems to be that bigger is better and there is i think a very conspicuous arms race in terms of who can train the largest language model and that's controversial as well yeah so and i think there is a one good thing is that it's a little bit self-limiting because somebody has to pay for this and you know at some point even large companies will quite happily decide not to spend unreasonable amounts of money on training those models i i'm not so sure whether it's really controversial on the energy expenditure because ultimately it's not like you train this model and then you throw it away but you actually will go and use it a lot i mean it's essentially you can think of it like an infrastructure investment you build and train this model once and then you can go and deploy it in many different applications so i don't think it's really that much of an issue in terms of energy use unless of course you train hundreds of models but then from your finance department will tell you that this is a bad idea kind of back to your self-limiting point but what's more important i think is is that those models are essentially large opaque blobs and people have struggled to deal with really being able to edit and to manipulate and to update the knowledge that's stored in those models right so let me give you an example that's maybe not so controversial was was abraham lincoln a vampire hunter well most people will probably draw a blank like what is alex talking about well there is actually a be great or maybe secret movie hollywood movie in which abraham lincoln played a vampire hunter so now if you ask the model well was abraham lincoln the vampire hunter it may very well be that the answer will be yes because you just so happen to train on that data set right because there isn't really much curation now if you add to that the fact that people on the 
internet do not always write the pure unadulterated truth you are starting to get into a real problem of having to somehow curtail prune and reason over what those models produce that doesn't mean that you shouldn't train on a large amount of text also of questionable origin but it just means that when the model speaks you need to make sure that what it produces is reasonable and sound so basically you want to make sure that your decoder is well constrained and that the knowledge base that it reasons over is also at least curatable now one way of doing this is if you put more emphasis into a structured graph representation of knowledge rather than having everything in a giant 20 30 layer deep transformer model that stretches over maybe a billion per well a trillion parameters now and a trillion parameters requires at you know 16-bit precision around eight p4 servers um so that's probably not something that most customers would want to use instead of that you would want to have something that actually fits into a size that is economically feasible for customers we are in the business of helping our customers solve their problems right so our customers problems are our problems and our job is to make their job easier yeah so therefore if we you know give them you know big berta then that may not necessarily solve the problems that they have now you're you're making an equivalence to some degree or another between language models and graphs that they could ultimately solve the same problems or some of the same problems uh you know is this is this broadly accepted or you know or known to be true or is this you know speculative to some degree is there some theoretical work that needs to be done to demonstrate this where are we how grounded is that uh that conjecture or our proposal okay so there's early days i would say this cycle consistency training is an example where you know you now have you know a knowledge graph that can clearly produce text we know i think reasonably well how to reason over knowledge graphs i don't think it's going to be knowledge graphs all the way but at least you want to have some mechanism of being able to inject and edit knowledge that your language model tr you know reasons over that can be manipulated separately from just a very big fat transformer model now whether this is the ultimate solution i don't know i think there is going to be a lot of interesting work in terms of designing meaningful structures maybe how to sparsify things how to disentangle different representations so there's a lot of good work right now and that's actually what makes this field so exciting um we'll find out probably in a year or two whether this bet really pays off in the way that i hope it will and what what needs to what needs to happen or or you know what's the benchmark you know how do you know if the the bet is paying off is there yeah would you be applying a graph based uh or knowledge graph type of model to the same type of task as a language model to language modeling or are you looking for different things it's a supplement and to some extent you would want to use that in some of the products and services that are then being offered right so uh so if you can deliver a service that you know you might take a off-the-shelf language model with but instead a smaller kind of more compact knowledge graph that's your benchmark for success that might be one benchmark for success i mean there are other ways how you can supplement the language model with knowledge for instance a 
really nice paper that kyung yun-cho published a couple of years ago was in the context of search engine enhanced language models what they did is they basically issued queries in addition to whatever text was being produced and there thus able then to produce machine translation that was much higher fidelity because it was able to also use essentially translation memories so translation memories are what happens when you do machine translation and you have maybe some other reference documents or other reference translations around and you want to use those to enhance your translation model and so what you can do is you can basically then have your nmt system so neural machine translation system and you fold it with translation memories in order to get high accuracy and in the same way you can fold a language model with a graph and with possibly other things in order to enhance it you know to allow you to steer those models into directions that will make sense so i think it's a very exciting perspective where this is going um stay tuned for the next year or two nice uh you also spend uh quite a bit of time focused on uh automl and research in that domain uh can you share a little bit about what you're up to there right so automl i think is a really key component in lowering the bar to access for machine learning and it's a key component in multiple ways um so we all know the notion of technical debt right so it's basically you you know decide to live with something that's maybe not optimal and then you need to keep on paying a price for it later on and as you accumulate technical debt well at some point it'll slow you down so much that you can't really build anything new again now the good news is that all of this can be nicely automated by having an automl system that actually keeps on improving as science advances so that's what we're doing with autoglue and we just whenever somebody comes up with a new model well we add it to our inventory of models secondly we automatically adapt and perform you know all the model tuning the stacking and the bagging and all the pieces directly such that the model improves and it adjusts to new data as it comes in the last thing is and this is very different in what we're doing relative to pretty much every other auto mail system everybody else hunts for snowflakes they want to have that one single best shiny model whether it's a deep network or a boosted decision tree or whatever and maybe they tune over all the different models with hyper parameter tuning and so on but they basically want that snowflake instead what we do is we just throw them all in because it we found that a wide range of models very diverse models bagged and then often also stacked leads to much better accuracy it makes the models much more robust it gives you much better uncertainty estimates and here's the other thing we only fail if all of the base models fail because no matter what the implementation of the model is sometimes the code fails for bizarre reasons now if you have five different models at your disposal the rtml system will only fail if all five underlying models fail are you is are you referring to failing at search to find the model or are you suggesting an automl system that produces composite ensemble models that has a a lesser chance of failing in production when it sees some out of distribution data it's much more trivial than that it's just core dumping or not converging failing to produce results right yeah i wonder if you can be more concrete in this idea 
assertion about like the the snowflake what does that mean is the uh are you saying that that a lot of the energy is placed on kind of exotics like neural architecture search and things that are particularly complex and uh your approach is also you know looking at simple things or are you saying that the the auto ml systems tend to be tuned for one particular type of model and you think the the better approach is to to first focus on finding the right model like what what's the okay so there's a lot to to unpack here and to some extent it's i mean there are probably 10 reasonable rtml systems out there that you may want to use so i'm bound to not do everybody justice just because there's a wide range of what people do yeah but the classical syste setting goes as follows well you know you want automl to for instance you know adjust the learning rate whether you do early stopping you know the depth of the number of layers if you have a deep network or you know some other parameters i mean they're usually half a dozen or more parameters mean or for instance for a kernel method which kernel do you use which kernel width which which optimizer do you use basically there are a lot of different knobs that you could adjust and this is what people typically think about when they say automl that it'll give you you know that one single model back and in some there are perfectly legitimate reasons why you may want that um the there is a separate part namely math so neural architecture search and that's a very reasonable thing to do every once in a while but it's super costly yeah so basically you want to do that if you want to come up with that new computer vision you know backbone model and you do that once and then you use it in many many different applications because you want to amortize the high cost so the average user probably isn't going to do nas i mean it if if it's a really really important problem if it's an embedded solution where maybe you can reduce you know the cost for your chip by a significant amount you know the economics of it may very much make it worth it or let's say you want to deploy in a certain class of mobile phones with a certain processor and you want to optimize let's say for mediatek versus qualcomm and maybe you want to optimize for a specific version of the arduino then yeah for that it makes sense but in many other cases nas isn't so much what as an end user you may want to put your uh to invest your compute dollars in yeah instead you may be better off taking a convex combination of maybe five or six models and then you may want to go and stack them and then maybe you may even want to stack them combine them with non-deep models so for instance what we found with text and tabular is that those typical two tower models actually don't work so well okay it's a bit of context about two towers or multiple tower models one of the ideas is you go and you know take you know your tabular data you embed it in some way run it through maybe a couple of fully connected layers until you get some representation then you do the same thing let's say for the text and then in the end you just fuse everything together and you have maybe another layer it works okay but you can improve on it significantly and we've actually got the kdd submission in the pipeline so i don't know whether it'll go through or not i guess we'll find out but essentially what works a lot better is if you fuse a little bit earlier and if you then go and use other other models in a stacking manner on top of it so 
for instance you may very well then end up creating a frankenstein model that you know uses a birth embedding and you know some tablet embedding and then runs a decision tree on top of that and then on top of that it ends up stacking nearest neighbors right now most people will be quite horrified at the thought of building such a complex system because it takes forever to run but what you can do is you can then go and distill this back down to an architecture that you're much more comfortable with in order to get the speed but also the arc the accuracy of the original model meaning a la compression or a technique like that yeah so the difference though is that you now have you know some black box object which is you know your horribly designed very complicated rtml model and then you perform function approximation to whatever target architecture that you want that can be a deep network it could be a decision tree or whatever yeah and so now all you do is you basically have stimuli so that covariates being fit into both models and you then minimize the error between the two between the teacher and the student now there are a couple of tricks that you need to worry about because if you just you know go and you know train um on the data that you trained your original model on then well you're basically not going to do much better the reason is simply that uh well there is only so much information that was in the original data and all you then get is essentially like maybe slightly cleaner labels from what we have before what you can do instead is you because i mean this gives you the one over square root sample size rate of convergence and you can't there's no way around it this is this is math right so how do you so how do you cheat on the math well you just make more data and you make more data by sampling data that's similar to the one that was in your training set so you create a synthetic data or if you have additional data well that's of course perfect but otherwise you can essentially synthesize some data this allows you to cheat on the one over square root sample size bound but there's a price you pay for a bit of bias and then you go and design an effective gibbs sampler to make sure that your bias isn't too big and this gives you distillations that essentially lose next to no accuracy but are then you know twitters of magnitude faster i've got to imagine as you increase the complexity of your modeling step here you're also thinking from a research perspective around like what are the implications of that you touch on some of it but like you know what kind of guarantees do you have you know in terms of convergence are those kinds of things you're looking at yeah so there so in terms of model selection and guarantees i think we're actually in a really good situation now compared to where we were like maybe about 20 years ago so if i look back at my phd thesis maybe about 40 of the thesis was proving fairly advanced theorems in bannock spaces and working on covering numbers and metric entropy of spaces and essentially a lot of beautiful math and i was super proud at the time because our bounds were so much better than what everybody else people have before so before that it was basically vc dimension or then you know scale sensitive versions of this and what we had was really nicely you know data adaptive and all of that and we you know proved good spectral bounds and all of that and then we went on this was i was doing a postdoc at the time we we tried to use it for something as 
simple as a two sample test so two sample tests is basically i have two sets of data are those two drawn from the same distribution so this is what for instance in a gans in a generative adversarial network the adversary does it tries to distinguish two sets this was before guns and we used you know special methods for it anyway so finally we use that and we applied all this beautiful theory and the bound was particularly tight and we ran it and it failed to work at all in spite of the tight balance exactly so essentially we did the equivalent of trying to drive using our seat belt so if equations meet real world yeah so what turned out to be the case is that a lot of more empirical estimates were much much more usable and so in order to make those tests actually practically usable we had to give up on some of the mathematical purity and you know look at asymptotic statistics and other estimates in order to get something very accurate done and actually arthur gritten who was my partnering crime for a lot of this work i mean he's still working on these problems now and it's basically been a very fertile research agenda for the past 20 years so it's i think what i hear you saying is that um you're you're willing to sacrifice the traditional mathematical rigor for you know throwing data at the problem and kind of getting your comfort statistically um is that almost it's not it's not quite i think the the beauty of the situation now is that we have a lot more data and what you can do is you can derive guarantees that are a lot more data adaptive so this is the thing that i think has really qualitatively changed from what it was maybe 10 20 years ago that by now in order to get guarantees you're much more willing to you know so cross-validation is a simple example of that where you're setting some data aside and then you get actually reasonably good bounce you know at rate one over square at sample size between the error that the estimate on a validation set and what really happens in reality later of course this goes out of the window if i start cheating by tuning my model on the validation set right so this is a little bit like the kid who goes and you know tries to you know practice for his sat test and so he has you know all those existing sat tests available and he goes and studies exactly the things that are written in previous exams and while this is going to be somewhat useful it may not give him the full truth because he's at some point starting to overfit to the historical sat tests right so we know that the new tests are probably going to be similarly distributed but they're not going to be quite the same yeah yeah so probably the smart thing would be to do some of this but then to leave out you know one test that's very very recent and then just run it in the end and see whether you do well on that and if you do then okay you can probably sleep well hey by the way don't use this as actual advice for for your high school exam so don't blame me for it but okay maybe i i personally would do it but i didn't graduate it didn't go to high school in the united states i was in germany but basically you know don't overfit on your validation set or something like that but the good thing is we now have so much data right that you know a thousand or two thousand observations among friends isn't a big deal right whereas you know 10 20 years ago this was painful i mean there's still cases where this is so for medical data you know a thousand people having some rare disease is a terrible thing and 
these are cases where you're happy that maybe the data set is small because it means not so many people died but nonetheless we are now in a situation where we have a lot more data we have a lot more computation so we can use things like cross-validation we can use bagging we can use you know nested versions of validation approaches such that you get the nice conditional independence for the next level of the stacker to avoid overfitting in this case so you can do all of those things and they will help you to get much higher confidence estimates than what you could have done by just sitting down and proving a theorem on the general properties of you know the spectral embeddings of you know some algorithm got it got it got it anyway got it so we're actually living in a good time nice uh there's one more research topic that i want to dive into with you and that's some of your work on causality and causal modeling but actually before we do that i want to uh chat briefly about the ml summit which you've got coming up yeah yeah i'd love to hear maybe a little bit about the event and its origins and you know we'll talk a little bit about what you're excited about there yeah so this entire thing started actually as a well i don't know whether it's a crazy idea or an ambitious idea at reinvent in 2016 and we had two or three hours of like a mini event where we grabbed a couple of faculty friends and asked them to give talks right and i mean green event is the biggest conference i've ever been to okay i don't know maybe i've not been out very much but it seems to grow at an exponential rate um unfortunately last year was not in person uh but it's those it takes over a large part of the las vegas strip and you basically have nerds everywhere okay so it's an awesome event but what happened is so we we had this machine learning uh talk fest and people were sitting camping out outside the lecture theater for two three hours before the event trying to get in so it was clearly signaled that this was something that people would enjoy we weren't so sure because these were more technical talks and so the following year we did more and more and so until this year well we've decided okay to actually graduate this out as a separate event for instance i mean last year you if you went to reinvent i mean hasn't if you attended the live streams you would have seen that uh slimey shiva's brahmanian got so my manager got you know a full keynote so machine learning has clearly become a very important ingredient in building successful things that our customers want yeah and so we broke it out this year and this is the first attempt um okay maybe with the timing it didn't go quite so well because it's also the nurbs deadline this week but but actually i mean we tried to do a good job it's just that what happened is that nurip's moved that deadline by one week because of the things going on in india and otherwise we would have made sure that it doesn't clash but yeah so right now people have to choose between listening to this or polishing their papers is there any speakers that are really delightful so for instance andrewing is coming and uh i mean he has a busy schedule uh so coursera went public end of march and in addition to that i mean he is really been a great mentor for a lot of machine learning startups and overall companies he's done a great work at google then if i do and then beyond what he's now doing with deep learning so and obviously also you know great faculty in stanford um so he's a wonderful 
colleague and i think this is probably one of the highlights uh to have him um something else that i'm really excited about this is ryan tipsyrani's talk and this is probably the one talk that will affect most people the most uh because he's going to talk about essentially what his he and his team so ronyrose felt at cmu have done in the context of kovit 19 forecasting so this is very much a view from the trenches view uh presentation so ryan is a hardcore theorist who basically decided this was the thing that would you know help uh the most and essentially they pretty much became the clearinghouse for well coveted forecasts uh working with pretty much everybody um and supplying guidance data to the administration um they're having a lot an easier job now than previously but there are lots of challenges in how to get data so for instance what was the case last year is that a lot of hospitals would send emails to report um kovit 19 numbers because the powerstep b decided not to set up a database now now this is time consuming and annoying that's that's one thing and that's the minor issue the bigger issue is that it means that you have data that is not always accurate that needs to be rewritten so therefore you're predicting not based on all the accurate data that you should be having right now but based on the accurate data maybe up to about three four weeks ago and then the recent data is sort of kind of accurate ish right and doesn't make for the best predictions it it makes the statistical problem quite challenging and so what's so this is a really important talk because he's going to explain a lot of what happened essentially in the trenches right awesome that's probably why i'm most excited about that so we'll include we will include a link to that talk in the or not just the talk the entire event in the show notes page it's uh june 2nd and 3rd i think there are separate uh kind of schedules for europe and asia and folks in different time zones and we will link to those um but as promised i want to jump back into this research uh stuff which is great and specifically some of the work you're doing on causality um this has been a topic that you know it clearly has been around but over the past couple of years in the machine learning community it's just been on fire just in terms of popularity and and interest and i'm really interested in hearing a little bit about what you're doing in that space so there's an entire team actually in tubing in and that's being led by two excellent scientists one good friend of mine bernard chalkov so he's max blank director and besides all the other things that he's doing he also helps amazon on the causality research there and then the resident expert dominique godzing and obviously there's an entire team that stands behind that and they're using causal models to infer for instance you know why is my server not working or why is my supply chain not doing the things that it should be doing now um this has actually this has meaningful product impact so for instance if you're using lookout for matrix and then you're getting some of the causal tools for it when you want to understand not just why is there that that there's an anomaly but you want to have an explanation why something went wrong now it's actually quite interesting because a lot of their tools are quite fundamental or simple in terms of well they don't necessarily use lots of fancy deep networks to do the modeling but they really think very hard about you know what the underlying questions are 
in order to answer answer this so for instance you may want if you if you get data and you have some form of a dependency graph you may want to ask a question like you know if the data has changed why has it changed and now well what's what's fun there is you can actually if you have a direct graphical model so as a nice causal model you can actually go and then look at individual clicks and try to identify the one that has changed such that you don't just say hey look my entire world has changed but you can actually work backwards and say well this component has changed in here is how it has so this is the type of answers that you can get by using causality now there are a couple of different flavors of what you can use and i think most of us are used to the you know you know graphical model judea pearl style causal analysis there's actually a slightly more pragmatic one and this is called granger causality granger granger yeah so there's a funny story to that so clive granger and so he got the nobel prize in 2003 actually for his work so he at some point was asked to you know come up with some estimators for causality and so he went to norbert wiener physicist and had him explain a little bit what he thought about things essentially he came up with a model that goes more or less as follows let's say i have time series x and y index the time so x t and y t and then i have some other parameters wt so w is essentially auxiliary parameters and what you want to do is you want to find out whether x causes y or y causes x or at least whether they causally affect each other and so what you can ask is essentially for instance does x t affect y t plus one or does y t affect x t plus one so basically going forward in time so the temporal aspect is quite important there okay and so what you can do is you can basically try to predict x t plus one just using the auxiliary variables or you can go and try to predict it using the auxiliary variables and also yt right so and now if my prediction is better after i use this additional variable then i can say well you know there's some causal information in there now of course uh it's also heavily dependent on the auxiliary data namely wt and essentially the analysis and that's a little bit the achilles heel but you can do you can reason well about it is if you if i just throw lots of context at it and the context doesn't actually allow me to predict things then it really must be that other variable that was causal and so this is all done without the mechanics of interventions and a lot from the exactly exactly so so the idea is really if it allows me to predict then there must be a causal structure in it so this way i can have you know things like x causing y and y causing x and still being able to reason over it so it's a lot more pragmatic and operational um and the funny thing is that you know when granger went on to explain this to people they went like yeah that's not real causality it's just ranger causality that's that's literally what happened and so the name stuck i think it's a very nice operational approach because it gives you very concrete strategies of how do you establish whether some you know you know such dependency happens now do we care about it in amazon yes for many many cases i mean obviously you want to understand for instance within your supply chain why something happens or doesn't happen you want to understand when you look at you know various metrics you know whether and why something goes wrong you might want to use similar 
things overall for systems identification so there's a lot of really exciting work that happens there you also want to use it for instance for testing procedures um some of this as i said for instance look out for metrics is in an explicitly usable public service some of the other things happen more behind the scenes just because i mean machine learning is a spice right it's not the main dish in most cases machine learning is what makes the main dish tasty right but if you just use machine learning as the only thing that you're offering as a product then you're essentially only you know helping other people then use that somewhere else and you know in some cases that's perfectly reasonable but it's just as important to make a lot of other services smart and adaptive and that's really what machine learning can help and causality in particular is you know one of those tools that are quite subtle and you do need a reasonable amount of skill to understand exactly why and what what happens is that that skill uh as well as the kind of machinery that's been built up around the pearl style uh causality is it um is it equally established for grainger causality i imagine it benefits from the simplicity quite a bit um but even things like uh you know libraries like pyro for probabilistic programming does that apply equally well to granger or is it so you need slightly different tools um right i would say watch this space that's that's all i can say right now i think good things will happen uh this is about as specific as i can be without getting myself or the team in trouble but okay watch this space yes got it um basically there's still a bit of thinking that goes into you know how to make it very usable so i mean to give an example why this is you know a little bit dangerous territory or can be you may have heard of shop right the shamply value for explainability and so this is a great paper and great work um and so you know a lot of people use shop to explain you know why certain inputs cause a certain output now it turns out that actually the code is correct but the math in the paper isn't quite so this is actually funny because the approximation is the right thing to do and so the tubing and team actually wrote the follow-up saying hey your code is correct the math in your paper isn't then here's why but it has a lot to do with the fact that which variables you condition on when you look at an intervention or in other words if i have a light bulb and a switch and i observe that with the light bulbs on the switch is on and with the light bulbs off the switch is off right of course if i manipulate the light the switch the light bulb goes on but if i go and smash the light bulb the switch doesn't go off right and so if you look at interventions you need to be careful over which variables you stratify when you do the analysis and so in the actual shapley code it was done correctly but the initial analysis was improvable so this is exactly the level where you need reasonably well-trained teams of scientists to actually look at that because the average engineer will probably i mean it's it's hard to package it in a form that you don't end up with conclusions that may hurt you in the end i think what i hear you saying with this example is that the the shapley paper uh was based on kind of this the machinery of interventions and it's easy to get it wrong and so something simpler is better and that's why you're excited about this grainger causality and stuff to come yep i think that's a pretty good 
summary okay got it cool um well alex it has been amazing catching up with you uh and learning a little bit about some of the stuff that you're working on looking forward to tuning into you virtually at the ml summit and you know catching up in a year or two to talk about new causality tools and language models based on graphs and all kinds of cool stuff that we talked about here thanks thanks for having me and have a good day thanks thanks alex
Now, wouldn't it be nice if we could learn a function that does something similar to PageRank, if I just knew that certain pages are more interesting than others? That way I wouldn't need two brilliant PhD students and a brilliant professor to come up with a function; I could just learn it from data. And learning on graphs is exactly that. In a nutshell, it's really learning what one could consider vertex update functions, where in my graph I learn how to update the state and the representation of a vertex based on its neighbors and based on a desired outcome. For instance, if I want to detect fraud and I know that you're a good guy and I'm associated with you, then that helps the corresponding information about trustworthiness propagate, and I can learn this. On the other hand, if I'm up to no good and you're associated with me, then maybe some of that rubs off onto you, conversely. So this is one of the applications where deep learning on graphs is meaningful, but you can do so much more: you can extract knowledge graphs, you can reason over knowledge graphs, you can answer questions. Essentially, you could go to KDD, take most of their database and structured-data problems, recast them as graph problems, and probably get quite some improvement out of it.

Beyond PageRank, are there problems that you've either done this recasting with, or that you think are interesting ones to further motivate the idea of thinking around graphs?

Sure. One rather recent paper that my team in Shanghai did is basically an unsupervised extraction of knowledge graphs from text. Usually, generating knowledge graphs is a really costly process, because you need humans to annotate things. For instance, you get an annotation like (Alex, works at, Amazon), and somebody needs to build those extractors and crawl over all the things. Now, there's a fair amount of work that people have done with regard to cycle-consistent training. One of the first to propose this was Alyosha Efros; you may have seen examples of pix2pix or galloping zebras, or mapping satellite images into maps, or sketching handbags and then having photos of handbags. You may have seen such demos in the past. The idea is actually very simple. What you say is: I have two modalities that aren't necessarily paired, and I want to make sure that as I translate from one to the other, what comes out of it looks like the desired modality, and if I translate back, I get something that looks like the original. For instance, if I want to translate from English into French, whatever comes out of it, even if I don't have paired text, should look and read like French, and if I translate it back, then it should look and read like English, and furthermore it had better be very close to the original. So this is the cycle consistency, and you can use the same idea here.

What was the term? Cycle consistency?

Cycle consistency, because what you do is you go, for instance, from English to French and then back to English. Or in our case, what the team did is we went from text to knowledge graphs and then back to text, and likewise from knowledge graphs to text to knowledge graphs, and those two cycles need to be consistent, insofar as you get meaningful knowledge graphs out of text and you can generate meaningful text out of knowledge graphs.
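To show the shape of the objective being described, here is a deliberately tiny, rule-based stand-in in Python. The real system uses learned neural encoders and decoders and treats the many possible verbalisations of a triple as a distribution; the `text_to_graph` and `graph_to_text` functions below are hypothetical toys whose only purpose is to make the two reconstruction cycles explicit.

```python
# Toy stand-ins for the learned text <-> knowledge-graph models.
def text_to_graph(text):
    # Hypothetical extractor: "alex works_at amazon" -> ("alex", "works_at", "amazon")
    subj, rel, obj = text.split()
    return (subj, rel, obj)

def graph_to_text(triple):
    # Hypothetical verbaliser: just one of many possible surface forms.
    return " ".join(triple)

def mismatch(a, b):
    # Crude reconstruction error; a real system would use a proper likelihood.
    return 0.0 if a == b else 1.0

def cycle_consistency_loss(texts, graphs):
    # text -> graph -> text should reconstruct the original text ...
    text_cycle = sum(mismatch(t, graph_to_text(text_to_graph(t))) for t in texts)
    # ... and graph -> text -> graph should reconstruct the original graph.
    graph_cycle = sum(mismatch(g, text_to_graph(graph_to_text(g))) for g in graphs)
    return text_cycle + graph_cycle

print(cycle_consistency_loss(["alex works_at amazon"],
                             [("alex", "works_at", "amazon")]))  # 0.0 when both cycles close
```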
But there are a couple of wrinkles to it, and they actually made it quite challenging, because I can translate something like (Alex, works at, Amazon) in many, many different ways. I can just say "Alex works at Amazon," or "Amazon employs Alex," or many other variants of text, so there's not a one-to-one mapping. Unlike many other works, you really need to take care of the full, proper distribution of the mapping rather than just the pairing, and that gives you, if you use an appropriate variational autoencoder, good models that you can train, and you basically get accuracies that are very close to fully labeled problems. That's really exciting, because generating knowledge graphs is costly, but it's also a prerequisite if you afterwards want to reason in more structured ways, if you want to be able, for instance, to edit a knowledge representation.

Before we jump into that: you have these two domains, text and knowledge graph, and you're able to go from one to the other and back, and we know that something like a variational autoencoder is great for going from one domain to the other. Is the application of graphs here just that one of those domains happens to be a graph, or are we using graphs architecturally with the autoencoder to enable the cyclic consistency?

Okay, so in this case one of the domains just so happens to be a graph, but you still need to actually compute gradients, right? And the thing is, just because it's a knowledge graph, you still have relations between the various entities, so you basically need to estimate whether an edge should occur or not, so you need to reason over the graph per se as you generate the text.

Got it.

Right, so that's why you need deep learning for graphs, for instance, for this application. I just wanted to pick something that is significantly non-trivial, because we've all done some reasoning on graphs, for instance for fraud, recommendation, personalization, and all of that; that's pretty much straightforward. The other challenge is more: how do we scale it up to very large problems, and scale it up in a way where we still know what the model does?

In some of our conversations around graphs, concepts like symmetry and isomorphism come into play, properties that allow you to manipulate graphs while still having them be fundamentally the same in some way. It seems like that would make the cyclic aspect of what you're trying to do particularly challenging.

I think the aspect of symmetries arises from something quite different, and quite honestly, I think it's been a little bit overplayed in graphs, because all you have is permutation symmetry of the neighbors. A very long time ago we wrote this paper on Deep Sets. Essentially, it was motivated by the observation that translation invariance gives you a convolution, so you can derive convnets from locality and translation invariance over images; or, if you have stationarity in a time series, you get autoregressive models with the appropriate state. At some point I was wondering which other symmetries are still out there that haven't been exploited yet, and the simplest symmetry you can think of is really permutation symmetry. In other words, I have a set where the order of the elements within the set doesn't matter, and that's how we got Deep Sets, where you basically say: I have functions that operate on a set of elements in no particular order.
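A minimal sketch of the Deep Sets pattern: embed each element, pool with a symmetric operation, then read out. The weights below are random and untrained; the point is only that the output is invariant to the order of the set, which is the same property a vertex-update function needs with respect to a vertex's neighbours.

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(3, 8))   # per-element embedding (toy, untrained)
W_rho = rng.normal(size=(8, 1))   # readout applied after pooling

def deep_set(elements):
    h = np.maximum(elements @ W_phi, 0.0)    # phi: embed every element independently (ReLU layer)
    pooled = h.sum(axis=0)                   # symmetric pooling: order of the set cannot matter
    return np.maximum(pooled @ W_rho, 0.0)   # rho: transform the pooled representation

x = rng.normal(size=(5, 3))                  # a set of 5 elements in R^3
perm = rng.permutation(5)
print(np.allclose(deep_set(x), deep_set(x[perm])))  # True: permuting the set changes nothing
```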
Okay, and why does this relate to graphs? Well, if I have a vertex and it has some neighbors, these are just neighbors; they're not in any particular order. So any function that is defined on the vertex and its neighbors needs to satisfy this permutation symmetry, and that then imposes certain functional forms that this vertex update function can only take. Now, why would you care about that? It sounds like fairly fancy mathematical theory; who cares about group theory? It's actually really useful, because it means that the search space over which you need to design your function class is much smaller. It's just like asking why you care about translation invariance: because the convnet is so much easier to design than a fully connected network, right? It has a lot fewer parameters, and you can optimize it for very different chips. In the same way, with a graph, if you have permutation symmetry, you can actually get away with a lot fewer parameters, because you cannot learn specific functions for specific neighbors, and so your parameter space collapses, regardless of how many neighbors there are or which order they come in. That simplifies life a lot, and then you can focus on other aspects of the implementation.

Nice. You mentioned in our chat that you had a really interesting take on the relationship between language models and graphs. I'd love for you to jump into that.

So this is the other reason why we are really looking at graphs. If you look at what's currently going on with language models, it seems to be that bigger is better, and there is, I think, a very conspicuous arms race in terms of who can train the largest language model.

And that's controversial as well.

Yeah. One good thing is that it's a little bit self-limiting, because somebody has to pay for this, and at some point even large companies will quite happily decide not to spend unreasonable amounts of money on training those models. I'm not so sure it's really that controversial on the energy expenditure, because ultimately it's not like you train this model and then throw it away; you actually go and use it a lot. You can think of it like an infrastructure investment: you build and train the model once, and then you can deploy it in many different applications. So I don't think it's really that much of an issue in terms of energy use, unless of course you train hundreds of models, but then your finance department will tell you that this is a bad idea, which comes back to your self-limiting point.

What's more important, I think, is that those models are essentially large, opaque blobs, and people have struggled with really being able to edit, manipulate, and update the knowledge that's stored in them. Let me give you an example that's maybe not so controversial: was Abraham Lincoln a vampire hunter? Most people will probably draw a blank, like, what is Alex talking about? Well, there is actually a B-grade, or maybe C-grade, Hollywood movie in which Abraham Lincoln is portrayed as a vampire hunter. So now, if you ask the model whether Abraham Lincoln was a vampire hunter, it may very well be that the answer will be yes, because you just so happened to train on that data set, and there isn't really much curation.
Now, if you add to that the fact that people on the internet do not always write the pure, unadulterated truth, you are starting to get a real problem of having to somehow curtail, prune, and reason over what those models produce. That doesn't mean you shouldn't train on a large amount of text, also of questionable origin; it just means that when the model speaks, you need to make sure that what it produces is reasonable and sound. So basically you want to make sure that your decoder is well constrained and that the knowledge base it reasons over is at least curatable. One way of doing this is to put more emphasis on a structured graph representation of knowledge, rather than having everything in a giant 20- or 30-layer-deep transformer model that stretches over maybe a billion, well, a trillion parameters by now. And a trillion parameters at 16-bit precision requires around eight P4 servers, so that's probably not something most customers would want to use. Instead, you want something that fits into a size that is economically feasible for customers. We are in the business of helping our customers solve their problems; our customers' problems are our problems, and our job is to make their job easier. So if we give them a big BERT, that may not necessarily solve the problems that they have.

You're making an equivalence, to some degree or another, between language models and graphs, that they could ultimately solve the same problems, or some of the same problems. Is this broadly accepted or known to be true, or is it speculative to some degree? Is there theoretical work that needs to be done to demonstrate it? How grounded is that conjecture or proposal?

Okay, so these are early days, I would say. This cycle-consistency training is an example where you now have a knowledge graph that can clearly produce text, and we know, I think, reasonably well how to reason over knowledge graphs. I don't think it's going to be knowledge graphs all the way, but at least you want some mechanism for injecting and editing the knowledge that your language model reasons over, something that can be manipulated separately from just a very big, fat transformer model. Whether this is the ultimate solution, I don't know. I think there is going to be a lot of interesting work in terms of designing meaningful structures, maybe how to sparsify things, how to disentangle different representations. There's a lot of good work right now, and that's actually what makes this field so exciting. We'll find out, probably in a year or two, whether this bet really pays off in the way that I hope it will.

And what needs to happen, or what's the benchmark? How do you know if the bet is paying off? Would you be applying a graph-based or knowledge-graph type of model to the same kind of task as a language model, to language modeling, or are you looking for different things?

It's a supplement, and to some extent you would want to use it in some of the products and services that are then being offered.

So if you can deliver a service where you might otherwise take an off-the-shelf language model, but instead use a smaller, more compact knowledge graph, that's your benchmark for success?

That might be one benchmark for success. I mean, there are other ways you can supplement a language model with knowledge.
For instance, a really nice paper that Kyunghyun Cho published a couple of years ago was in the context of search-engine-enhanced language models. What they did is they basically issued queries in addition to whatever text was being produced, and they were thus able to produce machine translation of much higher fidelity, because the model was also able to use what are essentially translation memories. Translation memories are what you have when you do machine translation and there are maybe some other reference documents or reference translations around, and you want to use those to enhance your translation model. So what you can do is take your NMT system, your neural machine translation system, and fold in translation memories in order to get high accuracy. In the same way, you can fold a language model together with a graph, and possibly other things, in order to enhance it and to allow you to steer those models in directions that make sense. I think it's a very exciting perspective on where this is going; stay tuned for the next year or two.

Nice. You also spend quite a bit of time focused on AutoML and research in that domain. Can you share a little bit about what you're up to there?

Right, so AutoML, I think, is a really key component in lowering the bar of access to machine learning, and it's a key component in multiple ways. We all know the notion of technical debt: you decide to live with something that's maybe not optimal, and then you keep on paying a price for it later, and as you accumulate technical debt, at some point it slows you down so much that you can't really build anything new anymore. The good news is that a lot of this can be nicely automated by having an AutoML system that keeps on improving as the science advances, and that's what we're doing with AutoGluon. Whenever somebody comes up with a new model, we add it to our inventory of models. Secondly, we automatically adapt and perform all the model tuning, the stacking and the bagging and all the pieces, so that the model improves and adjusts to new data as it comes in. The last thing, and this is where what we're doing is very different from pretty much every other AutoML system: everybody else hunts for snowflakes. They want that one single best, shiny model, whether it's a deep network or a boosted decision tree or whatever, and maybe they tune over all the different models with hyperparameter tuning and so on, but they basically want that snowflake. Instead, we just throw them all in, because we found that a wide range of very diverse models, bagged and then often also stacked, leads to much better accuracy. It makes the models much more robust, it gives you much better uncertainty estimates, and here's the other thing: we only fail if all of the base models fail, because no matter what the implementation of a model is, sometimes the code fails for bizarre reasons. Now, if you have five different models at your disposal, the AutoML system only fails if all five underlying models fail.

Are you referring to failing at the search to find the model, or are you suggesting that an AutoML system that produces composite, ensemble models has a lesser chance of failing in production when it sees out-of-distribution data?

It's much more trivial than that: it's just core-dumping, or not converging, or failing to produce results.

Right, yeah.
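The stacking-of-diverse-models idea can be illustrated with scikit-learn. This is not AutoGluon itself, just a hand-rolled sketch of the principle: several deliberately different base learners combined by a meta-model trained on their out-of-fold predictions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Several deliberately different base models ...
base_models = [
    ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("boosting", GradientBoostingClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
]

# ... combined by a stacker that is fit on their out-of-fold predictions,
# so no single "snowflake" model has to carry the whole load.
ensemble = StackingClassifier(estimators=base_models,
                              final_estimator=LogisticRegression(max_iter=1000),
                              cv=5)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```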
I wonder if you can be more concrete about this snowflake idea. Are you saying that a lot of the energy is placed on exotics like neural architecture search and things that are particularly complex, and your approach also looks at simple things? Or are you saying that AutoML systems tend to be tuned for one particular type of model, and you think the better approach is to first focus on finding the right model?

Okay, so there's a lot to unpack here, and there are probably ten reasonable AutoML systems out there that you may want to use, so I'm bound not to do everybody justice, just because there's a wide range of what people do. But the classical setting goes as follows: you want AutoML to, for instance, adjust the learning rate, decide whether you do early stopping, pick the number of layers if you have a deep network, and some other parameters; there are usually half a dozen or more. Or, for a kernel method: which kernel do you use, which kernel width, which optimizer? Basically, there are a lot of different knobs you could adjust, and this is what people typically think about when they say AutoML, that it will give you that one single model back, and in some cases there are perfectly legitimate reasons why you may want that.

Then there's a separate part, namely NAS, neural architecture search, and that's a very reasonable thing to do every once in a while, but it's super costly. Basically, you do that when you want to come up with the next computer vision backbone model, and you do it once and then use it in many, many different applications, because you want to amortize the high cost. So the average user probably isn't going to do NAS. If it's a really, really important problem, if it's an embedded solution where maybe you can reduce the cost of your chip by a significant amount, the economics may very much make it worth it; or say you want to deploy on a certain class of mobile phones with a certain processor, and you want to optimize for MediaTek versus Qualcomm, or maybe for a specific version of the Arduino, then yes, it makes sense. But in many other cases, NAS isn't where you, as an end user, may want to invest your compute dollars.

Instead, you may be better off taking a convex combination of maybe five or six models, and then you may want to stack them, and maybe even combine them with non-deep models. For instance, what we found with text plus tabular data is that the typical two-tower models actually don't work so well. For a bit of context on two-tower, or multiple-tower, models: one of the ideas is that you take your tabular data, embed it in some way, and run it through maybe a couple of fully connected layers until you get some representation; then you do the same thing for the text; and in the end you just fuse everything together, with maybe another layer on top. It works okay, but you can improve on it significantly, and we actually have a KDD submission in the pipeline, so I don't know whether it will go through or not; I guess we'll find out. But essentially, what works a lot better is if you fuse a little bit earlier, and if you then go and use other models in a stacking manner on top of it.
So, for instance, you may very well end up creating a Frankenstein model that uses a BERT embedding and some tabular embedding, then runs a decision tree on top of that, and on top of that ends up stacking nearest neighbors. Now, most people will be quite horrified at the thought of building such a complex system, because it takes forever to run. But what you can do is then distill this back down to an architecture that you're much more comfortable with, in order to get the speed, but also the accuracy, of the original model.

Meaning a la compression, or a technique like that?

Yeah. The difference, though, is that you now have some black-box object, which is your horribly designed, very complicated AutoML model, and then you perform function approximation to whatever target architecture you want; that can be a deep network, it could be a decision tree, or whatever. So all you do is you have stimuli, the covariates, being fed into both models, and you then minimize the error between the two, between the teacher and the student. Now, there are a couple of tricks to worry about, because if you just train on the data that you trained your original model on, you're basically not going to do much better. The reason is simply that there is only so much information in the original data, and all you get is essentially slightly cleaner labels than what you had before, because this gives you the one-over-square-root-of-sample-size rate of convergence, and there's no way around it: this is math. So how do you cheat on the math? Well, you just make more data, and you make more data by sampling data that's similar to what was in your training set. So you create synthetic data, or, if you have additional data, that's of course perfect, but otherwise you can essentially synthesize some. This allows you to cheat on the one-over-square-root-of-sample-size bound, but there's a price: you pay with a bit of bias. Then you go and design an effective Gibbs sampler to make sure that the bias isn't too big, and this gives you distillations that essentially lose next to no accuracy but are orders of magnitude faster.
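Here is a sketch of that distillation recipe, with a boosted ensemble standing in for the complicated AutoML teacher and a shallow tree as the student. The Gaussian perturbation used to create extra covariates is a crude stand-in for the carefully designed sampler mentioned above; it only shows where the synthetic data enters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Teacher": stands in for the large, stacked, slow-to-run model.
teacher = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Synthetic covariates near the training distribution (a crude substitute
# for a proper Gibbs sampler), so the student sees more than n samples.
rng = np.random.default_rng(0)
X_synth = X_train + 0.1 * rng.normal(size=X_train.shape)
X_distill = np.vstack([X_train, X_synth])

# "Student": a much simpler model fit to the teacher's outputs, not to the labels.
student = DecisionTreeClassifier(max_depth=6, random_state=0)
student.fit(X_distill, teacher.predict(X_distill))

print("teacher accuracy:", accuracy_score(y_test, teacher.predict(X_test)))
print("student accuracy:", accuracy_score(y_test, student.predict(X_test)))
```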
I've got to imagine that as you increase the complexity of your modeling step here, you're also thinking, from a research perspective, about the implications of that. You touched on some of it, but what kind of guarantees do you have, in terms of convergence? Are those the kinds of things you're looking at?

Yeah. In terms of model selection and guarantees, I think we're actually in a really good situation now compared to where we were maybe twenty years ago. If I look back at my PhD thesis, maybe about forty percent of it was proving fairly advanced theorems in Banach spaces and working on covering numbers and the metric entropy of spaces, essentially a lot of beautiful math, and I was super proud at the time because our bounds were so much better than what everybody had before. Before that it was basically VC dimension, or scale-sensitive versions of it, and what we had was really nicely data-adaptive, and we proved good spectral bounds and all of that. Then, when I was doing a postdoc, we tried to use it for something as simple as a two-sample test. A two-sample test is basically: I have two sets of data; are they drawn from the same distribution? This is what, for instance, the adversary in a GAN, a generative adversarial network, does: it tries to distinguish two sets. This was before GANs, and we used special methods for it. Anyway, we applied all this beautiful theory, and the bound was particularly tight, and we ran it, and it failed to work at all, in spite of the tight bounds. Essentially, we did the equivalent of trying to drive using only our seat belt.

Where equations meet the real world.

Yeah. What turned out to be the case is that a lot of the more empirical estimates were much, much more usable, and so in order to make those tests practically usable, we had to give up on some of the mathematical purity and look at asymptotic statistics and other estimates in order to get something very accurate done. Arthur Gretton, who was my partner in crime for a lot of this work, is still working on these problems now, and it's basically been a very fertile research agenda for the past twenty years.

I think what I hear you saying is that you're willing to sacrifice the traditional mathematical rigor for throwing data at the problem and getting your comfort statistically. Is that almost it?

It's not quite that. I think the beauty of the situation now is that we have a lot more data, and what you can do is derive guarantees that are a lot more data-adaptive. This is the thing that has really qualitatively changed from what it was maybe ten or twenty years ago. Cross-validation is a simple example: you set some data aside and you actually get reasonably good bounds, at a rate of one over the square root of the sample size, between the error estimated on the validation set and what really happens later in reality. Of course, this goes out the window if I start cheating by tuning my model on the validation set. It's a little bit like the kid who practices for his SAT: he has all those existing SAT tests available, and he studies exactly the things that appeared in previous exams. While this is going to be somewhat useful, it may not give him the full truth, because at some point he starts overfitting to the historical SAT tests. We know that the new tests are probably going to be similarly distributed, but they're not going to be quite the same. So probably the smart thing would be to do some of this, but to leave out one test that's very, very recent, run it at the very end, and see whether you do well on that; if you do, then okay, you can probably sleep well. Hey, by the way, don't use this as actual advice for your high school exam, so don't blame me for it; maybe I personally would do it, but I didn't go to high school in the United States, I was in Germany. Basically: don't overfit on your validation set.
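The "very recent SAT test" advice, as a small scikit-learn sketch: tune as aggressively as you like with cross-validation on a development split, but report the number from a held-out set you touch exactly once. The dataset and the parameter grid are arbitrary illustrations.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Keep one "very recent SAT test" aside and never look at it while tuning.
X_dev, X_final, y_dev, y_final = train_test_split(X, y, test_size=0.2, random_state=0)

# Tune freely with cross-validation on the development data ...
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)
search.fit(X_dev, y_dev)

# ... then run exactly once on the untouched set for an honest estimate.
print("cv score (optimistically selected):", search.best_score_)
print("held-out score:", search.score(X_final, y_final))
```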
The good thing is we now have so much data that a thousand or two thousand observations, among friends, isn't a big deal, whereas ten or twenty years ago this was painful. There are still cases where it is: for medical data, a thousand people having some rare disease is a terrible thing, and those are cases where you're happy that the data set is small, because it means not so many people got sick. But nonetheless, we are now in a situation where we have a lot more data and a lot more computation, so we can use things like cross-validation, we can use bagging, we can use nested versions of validation approaches, so that you get the nice conditional independence for the next level of the stacker in order to avoid overfitting. You can do all of those things, and they will help you get much higher-confidence estimates than what you could have gotten by just sitting down and proving a theorem on the general properties of, say, the spectral embeddings of some algorithm.

Got it, got it.

Anyway, so we're actually living in a good time.

Nice. There's one more research topic that I want to dive into with you, and that's some of your work on causality and causal modeling. But before we do that, I want to chat briefly about the ML Summit, which you've got coming up. I'd love to hear a little bit about the event and its origins, and we'll talk about what you're excited about there.

Yeah, so this entire thing started as, well, I don't know whether it was a crazy idea or an ambitious idea, at re:Invent in 2016, where we had two or three hours of a mini event and grabbed a couple of faculty friends and asked them to give talks. And re:Invent is the biggest conference I've ever been to; okay, maybe I haven't been out very much, but it seems to grow at an exponential rate. Unfortunately, last year it was not in person, but it takes over a large part of the Las Vegas Strip, and you basically have nerds everywhere, so it's an awesome event. What happened is, we had this machine learning talk fest, and people were camping outside the lecture theater for two or three hours before the event, trying to get in. So it clearly signaled that this was something people would enjoy; we weren't so sure, because these were more technical talks. The following year we did more and more, until this year, when we decided to actually graduate it out into a separate event. For instance, last year, if you attended the re:Invent live streams, you would have seen that Swami Sivasubramanian, my manager, got a full keynote. So machine learning has clearly become a very important ingredient in building successful things that our customers want. And so we broke it out this year, and this is the first attempt. Okay, maybe the timing didn't work out quite so well, because it's also the NeurIPS deadline this week. We actually tried to do a good job; it's just that NeurIPS moved the deadline by one week because of what's going on in India, and otherwise we would have made sure it didn't clash. So right now people have to choose between listening to this and polishing their papers.

Are there any speakers you're particularly excited about?

For instance, Andrew Ng is coming, and he has a busy schedule: Coursera went public at the end of March, and in addition to that, he has really been a great mentor for a lot of machine learning startups and companies overall. He did great work at Google, then Baidu, and then there's what he's now doing with deeplearning.ai, and he's obviously also great faculty at Stanford. So he's a wonderful colleague, and I think this is probably one of the highlights, to have him.
Something else I'm really excited about is Ryan Tibshirani's talk, and this is probably the one talk that will affect most people the most, because he's going to talk about what he and his team, with Roni Rosenfeld at CMU, have done in the context of COVID-19 forecasting. This is very much a view-from-the-trenches presentation. Ryan is a hardcore theorist who basically decided that this was the thing that would help the most, and they pretty much became the clearinghouse for COVID forecasts, working with pretty much everybody and supplying guidance data to the administration. They're having a much easier job now than previously, but there are lots of challenges in how to get data. For instance, last year a lot of hospitals would send emails to report COVID-19 numbers, because the powers that be had decided not to set up a database. Now, that's time-consuming and annoying, but that's the minor issue. The bigger issue is that it means you have data that is not always accurate and that needs to be revised. Therefore, you're predicting not based on all the accurate data you should have right now, but based on accurate data from maybe three or four weeks ago, while the recent data is sort of accurate-ish, and that doesn't make for the best predictions. It makes the statistical problem quite challenging, and so this is a really important talk, because he's going to explain a lot of what happened, essentially, in the trenches.

Awesome.

That's probably the talk I'm most excited about.

We will include a link, not just to that talk but to the entire event, on the show notes page. It's June 2nd and 3rd; I think there are separate schedules for Europe, Asia, and folks in different time zones, and we will link to those. But as promised, I want to jump back into the research, and specifically some of the work you're doing on causality. This is a topic that has clearly been around, but over the past couple of years in the machine learning community it has just been on fire in terms of popularity and interest, and I'm really interested in hearing a little bit about what you're doing in that space.

So there's an entire team in Tübingen, led by two excellent scientists: a good friend of mine, Bernhard Schölkopf, who is a Max Planck director and, besides all the other things he's doing, also helps Amazon with causality research; and the resident expert, Dominik Janzing; and obviously there's an entire team that stands behind them. They're using causal models to infer, for instance, why is my server not working, or why is my supply chain not doing the things that it should be doing. This actually has meaningful product impact: for instance, if you're using Lookout for Metrics, you're getting some of the causal tools with it, when you want not just to know that there's an anomaly but to have an explanation of why something went wrong. It's actually quite interesting, because a lot of their tools are quite fundamental, or simple, in the sense that they don't necessarily use lots of fancy deep networks to do the modeling, but they really think very hard about what the underlying questions are in order to answer them.
So, for instance, if you get data and you have some form of a dependency graph, you may want to ask a question like: the data has changed, so why has it changed? And what's fun there is that if you have a directed graphical model, a nice causal model, you can actually go and look at individual cliques and try to identify the one that has changed, so that you don't just say, hey look, my entire world has changed, but you can work backwards and say, this component has changed, and here is how. This is the type of answer you can get by using causality. Now, there are a couple of different flavors of what you can use, and I think most of us are used to the graphical-model, Judea Pearl-style causal analysis. There's actually a slightly more pragmatic one, and this is called Granger causality.

Granger?

Granger, yeah. There's a funny story to that. Clive Granger got the Nobel Prize in 2003 for his work. At some point he was asked to come up with estimators for causality, and he went to Norbert Wiener, the physicist, and had him explain a little bit of what he thought about things, and essentially he came up with a model that goes more or less as follows. Let's say I have time series x and y, indexed by time, so x_t and y_t, and then I have some other parameters w_t, where w is essentially a set of auxiliary variables. What you want to find out is whether x causes y, or y causes x, or at least whether they causally affect each other. So what you can ask is, for instance: does x_t affect y_{t+1}, or does y_t affect x_{t+1}? Basically going forward in time, so the temporal aspect is quite important. What you can do is try to predict x_{t+1} using just the auxiliary variables, or try to predict it using the auxiliary variables and also y_t. Now, if my prediction is better after I use this additional variable, then I can say, well, there's some causal information in there. Of course, it's also heavily dependent on the auxiliary data, namely w_t, and that's a little bit the Achilles' heel of the analysis, but you can reason well about it: if I throw lots of context at it and the context doesn't actually allow me to predict things, then it really must be that other variable that was causal.

And this is all done without the mechanics of interventions and all of that.

Exactly, exactly. The idea is really: if it allows me to predict, then there must be a causal structure in it. This way I can have things like x causing y and y causing x and still be able to reason over it, so it's a lot more pragmatic and operational. And the funny thing is that when Granger went on to explain this to people, they said, "Yeah, that's not real causality, it's just Granger causality." That's literally what happened, and so the name stuck. I think it's a very nice operational approach, because it gives you very concrete strategies for establishing whether such a dependency exists. Now, do we care about it at Amazon? Yes, for many, many cases. Obviously you want to understand, for instance, within your supply chain, why something happens or doesn't happen; you want to understand, when you look at various metrics, whether and why something goes wrong.
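The operational test described here can be written down in a few lines: compare the one-step-ahead prediction error for x with and without y's history as a regressor. The synthetic series below, where y drives x by construction, is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500

# Synthetic series in which y genuinely drives x with one step of lag.
y = rng.normal(size=T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + 0.8 * y[t - 1] + 0.3 * rng.normal()

def one_step_rmse(target, lagged_features):
    """Least-squares one-step-ahead prediction error for the given regressors."""
    A = np.column_stack(lagged_features + [np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.sqrt(np.mean((target - A @ coef) ** 2))

target = x[1:]
err_without_y = one_step_rmse(target, [x[:-1]])          # predict x_{t+1} from x's own past
err_with_y = one_step_rmse(target, [x[:-1], y[:-1]])      # ... and again with y's past added

print(f"RMSE without y: {err_without_y:.3f}, with y: {err_with_y:.3f}")
# A clearly lower error once y's history is included is the operational signal
# that y "Granger-causes" x (conditional on whatever context w you include).
```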
You might want to use similar things for systems identification overall, so there's a lot of really exciting work happening there. You also want to use it, for instance, for testing procedures. Some of this, as I said, for instance Lookout for Metrics, is an explicitly usable public service; some of the other things happen more behind the scenes, just because machine learning is a spice, right? It's not the main dish. In most cases, machine learning is what makes the main dish tasty. If you use machine learning as the only thing you're offering as a product, then you're essentially only helping other people use it somewhere else, and in some cases that's perfectly reasonable, but it's just as important to make a lot of other services smart and adaptive, and that's really where machine learning can help. And causality in particular is one of those tools that are quite subtle, where you do need a reasonable amount of skill to understand exactly why and what happens.

Is that skill, as well as the machinery that's been built up around Pearl-style causality, equally established for Granger causality? I imagine it benefits from the simplicity quite a bit, but do things like libraries such as Pyro for probabilistic programming apply equally well to Granger, or do you need slightly different tools?

Right. I would say: watch this space. That's all I can say right now. I think good things will happen; this is about as specific as I can be without getting myself or the team in trouble.

Okay, watch this space. Got it.

Basically, there's still a bit of thinking that goes into how to make it very usable. To give an example of why this is, or can be, a little bit dangerous territory: you may have heard of SHAP, right, the Shapley value for explainability. This is a great paper and great work, and a lot of people use SHAP to explain why certain inputs cause a certain output. Now, it turns out that the code is actually correct, but the math in the paper isn't quite, and this is actually funny, because the approximation is the right thing to do. So the Tübingen team wrote a follow-up saying, hey, your code is correct, the math in your paper isn't, and here's why. It has a lot to do with which variables you condition on when you look at an intervention. In other words, if I have a light bulb and a switch, and I observe that when the light bulb is on the switch is on, and when the light bulb is off the switch is off: of course, if I manipulate the switch, the light bulb goes on, but if I go and smash the light bulb, the switch doesn't go off. So if you look at interventions, you need to be careful about which variables you stratify over when you do the analysis. In the actual Shapley code this was done correctly, but the initial analysis was improvable. This is exactly the level where you need reasonably well-trained teams of scientists to look at it, because for the average engineer it's hard to package this in a form where you don't end up with conclusions that may hurt you in the end.

I think what I hear you saying with this example is that the Shapley paper was based on the machinery of interventions, and it's easy to get that wrong, so something simpler is better, and that's why you're excited about Granger causality and the stuff to come.

Yep, I think that's a pretty good summary.
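The light-bulb example above can be simulated directly, which makes the distinction between conditioning on an observation and intervening on a variable visible in two printed numbers (the noise levels are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy causal model: the switch causes the bulb, with a little noise
# (faulty wiring on one side, stray glow on the other).
switch = rng.random(n) < 0.5
bulb = np.where(switch, rng.random(n) < 0.95, rng.random(n) < 0.05)

# Observing the effect tells you a lot about the cause ...
print("P(switch on | bulb observed on):", switch[bulb].mean())   # ~0.95

# ... but intervening on the effect tells you nothing about the cause.
# do(bulb := off), i.e. smash every bulb: the switch is generated exactly as before,
# so its distribution is untouched.
print("P(switch on | do(bulb off)):", switch.mean())              # ~0.50
```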
Okay, got it. Cool. Well, Alex, it has been amazing catching up with you and learning a little bit about some of the stuff you're working on. I'm looking forward to tuning in to you virtually at the ML Summit, and to catching up in a year or two to talk about new causality tools, language models based on graphs, and all the other cool stuff we talked about here.

Thanks, thanks for having me, and have a good day.

Thanks. Thanks, Alex. Wow.