**Remote Work and Communication: A Key to Productivity**
One of the challenges of working remotely is communication. When team members are in different time zones, scheduling web conferences is difficult. To overcome this, Jean-François Puget relies on asynchronous communication: writing down ideas and results on a shared platform, which lets him collaborate with his team without being in constant contact.
Working remotely has been beneficial for Jean-François, allowing him to avoid the relocation to Silicon Valley that many people in similar positions have had to make. He believes effective communication is crucial to productivity, particularly on a remote team whose members span time zones, and that asynchronous, written communication is what makes this workable.
**Conferences: A Clash of Data Science Events**
Two data science conferences, Radar (on the 22nd and 23rd of March) and NVIDIA's GTC, have overlapping dates. Jean-François Puget, a distinguished engineer at NVIDIA, attended both and shared what he learned from them. GTC is held twice a year, with sessions ranging from research to applied topics, including games and other applications where GPUs are used.
Jean-François noted that the conference is free for participants who register early, although some sessions are only available as replays. For those torn between the two events, he suggested a practical approach: register for both and watch whichever sessions appeal to you as recordings later. This lets participants stay current with the latest developments in data science and machine learning without being overwhelmed by conflicting schedules.
**The Excitement of Data Science and Machine Learning**
Jean-François Puget expressed his excitement about the rapid evolution of machine learning in recent years. He highlighted two areas in particular: the speedups achieved with RAPIDS, NVIDIA's suite of GPU-accelerated data science libraries, and the blurring of boundaries between statisticians and machine learning practitioners.
He noted that the shift toward Python as a common language among data scientists has helped break down barriers between subfields. He cited an example in which a colleague achieved impressive results with a statistical model, without training any deep learning model, demonstrating the potential for innovation when traditional approaches are combined with new techniques.
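That example is in the spirit of a classical statistical baseline: fitting trend and seasonality by ordinary least squares, with no deep learning involved. The sketch below is purely illustrative (synthetic data; not the colleague's actual model):

```python
import numpy as np

# Synthetic monthly series: linear trend + yearly seasonality + noise
rng = np.random.default_rng(0)
t = np.arange(60)                                  # five years of months
y = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, t.size)

# Classical approach: ordinary least squares on trend + Fourier features
X = np.column_stack([np.ones_like(t), t,
                     np.sin(2 * np.pi * t / 12),
                     np.cos(2 * np.pi * t / 12)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef

rmse = np.sqrt(np.mean((y - fitted) ** 2))
print(f"In-sample RMSE: {rmse:.2f}")               # should sit near the noise level
```

A model like this is transparent, cheap to retrain, and often a surprisingly strong baseline before reaching for a deep network.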
**Conclusion**
Jean-François Puget concluded by thanking Richie, the host of the podcast, for inviting him to share his insights on remote work and data science. He emphasized the importance of communication in achieving productivity while working remotely, particularly for teams whose members span time zones. By adopting asynchronous communication techniques, team leaders can facilitate collaboration and ensure that everyone is working toward a common goal.
He also encouraged listeners to stay current with the latest developments in data science and machine learning by attending conferences like Radar and GTC, or by following podcasts like DataFramed, which shares insights on all things data.
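The RAPIDS speedups mentioned above come largely from cuDF, whose API mirrors pandas closely enough that swapping the import is often all that is needed. A minimal sketch of the pattern (with a pandas fallback so it also runs without a GPU; the data is made up):

```python
try:
    import cudf as pd   # GPU dataframes from RAPIDS, if available
except ImportError:
    import pandas as pd  # near-identical API on CPU

df = pd.DataFrame({
    "user": [1, 1, 2, 2, 2],
    "spend": [10.0, 20.0, 5.0, 15.0, 10.0],
})

# Group-by aggregations like this are where the GPU version shines:
# groups are reduced in parallel rather than one at a time.
mean_spend = df.groupby("user")["spend"].mean()
print(mean_spend)
```

The same pipeline code then runs on CPU or GPU depending only on which library is installed.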
"WEBVTTKind: captionsLanguage: enyou're listening to data framed a podcast by datacamp in this show you'll hear all the latest trends and insights in data science whether you're just getting started in your data career or you're a data leader looking to scale data-driven decisions in your organization join us for in-depth discussions with data and analytics leaders at the Forefront of the data Revolution Let's Dive Right In foreign this is Richie today's guest is jean-francois Puget a distinguished engineer at Nvidia in this episode we've got two topics to cover that means we'll be talking about doing data science with the Nvidia stack meaning Computing with gpus but we'll also get into the software that accompanies them Sean Francois other claim to fame is that he's been in the top 10 of the kaggle machine learning competition leaderboard for the last few years so we'll delve into the world of competitive machine learning to see how to become a grand master hi Jean Francois thank you for joining me today just to begin with can you tell the audience a little bit about what you do and what it means to be a distinguished engineer people believe that to get more influence better compensation whatever we need to move to management while in some tech companies including Nvidia it was true at IBM my previous employer they let people grow as individual contributors so distinguish engineer means I'm a good individual contributor I manage a small team of individual contributors but it does not take me a real time so what I do mostly machine learning models be it as part of machine learning competitions and keger for instance to Showcase Nvidia stack and Nvidia as your brand maybe but also for internal projects or Nvidia Partners but my job is to build good possibly good machine learning models and can you give some examples of different projects you're working on so what are you building machine learning models about so for instance the last few competitions I did I can 
discuss openly care girl because it's public the internal projects I'm working on some very exciting ones but people will have to come to GTC soon to learn about this project so recent machine learning competition for instance was to predict some protein property from the protein sequence and for this we use models like Alpha fold and other modes that were quite hyped recently so it's a breakthrough in computational biology previously it was a natural language processing not the kind of chat GPT it was more text classification for specific topic for medical examination so it can vary a lot I also worked on diagnosing from medical images from Radio images or microscopic images it could be also Temple forecasting so time series forecasting sales forecasting or for instance Here We There is a competition sponsored by GoDaddy we have to predict basically the number of business websites or student GoDaddy by U.S County so that's time service focusing so you see it's very it varies a lot it's across industry so the only common piece and that's a bit surprising if you read the same mathematical technology machine learning can be used across a lot of use case and and Industry okay so it seems like you're doing a pretty broad range of machine learning projects just all over the place in terms of different Technologies in different industries that you work with so you mentioned kaggle I'd like to get into a bit more depth about that the top rank of competitive abilities I believe a kaggle so can you tell me how you got started using kaggle I started my professional life when I was student I was fascinated by yeah in particular machine learning so I did the PHD in machine learning it was a long time ago so before deep learning wave Etc I always was interested in Michigan after my PhD there was an early winter to adjust confidential so I moved to start doing something else mathematical optimization and like in 2008 also IBM acquired the company I was at and noticed I had a 
background in machine learning wanted to invest in Ai and data science so I was asked to do more than mathematical optimization and I was looking for a way to refresh my knowledge of machine learning so I started rereading papers academic papers but especially at that time it's less true now but Academia was a bit remote from practical use so I looked and I found kegert a site where people could compete and people were using whatever tools they could to get the best possible result and there was no preconception as long as something works it was used so I started the reading I got top Solutions and using it for my job at IBM and by looking at what tools people were using so I saw the emergence of Keras for instance of exubus which is now very popular but it started on kagger I witnessed deep learning frenzy there so it was useful for my job but after a while I say maybe I should try myself so I remember until my first competition I say you will see what you will see guys I'm a PhD in machine learning I would crush you all so I was doing quite well until the results and the hidden test set of private data set on again and I dropped from Top 100 to like 2000s rank so I say okay I need to learn my theoretical knowledge is not really practical so I started learning I enjoyed it and after a while I became one of top 10 on Kegel and Kegel Grand Master so I keep doing it even today that's a pretty impressive achievement being top 10 in the world out of I'm not sure how many is it hundreds of thousands of people who participate in these Chicago competitions Kegel has 15 million users not everybody enters competition so people who got a rank on competition I think that in the tens of thousands which means many more entered but got no results yeah it's quite a lot of people but that's a very impressive achievement and so I'd like to hear a little bit about the secret to your success so how have you managed to get to that high ranking position I would say it's a combination 
is to have a scientific approach I was trained as a scientist I was good at physics and math in France I achieved the best possible match result that's good I even got to physics olympiads representing France so I have a good scientific background and the scientific method is I could say in a nutshell is you checked your assumptions always so you have assumptions you design experiments such that the result will tell you if your assumption is right or wrong and I do I strongly believe machine learning is the science it's an experimental science like physics and like parts of physics and so I approach a competition with a scientific approach and everything is an experiment so for instance if I have an idea of a new data processing a new feature our new model architecture whatever I must have a Baseline something I know I trust and then I run an experiment with a new data processing with a new model architecture change whatever I run my experiment and then I look at the result an easy bit or Worse sometimes I dig a bit in the outputs to to understand where it is better it is worse and from there either I accept the change and it becomes my new Baseline Etc and for this you need to have a good what we call a good cross validation setting so the the bread and butter of practical machine learning is cross validation so can be careful cross-validation if it's time series it's temporal plus validation but the point is you split your training data you keep some of it to validate your model you train on the rest and when it is Trend you predict on the validation and you compare with the branches with the Target and the validation and K4 means you do this K time with K different splits so there are variants but really you don't evaluate your model on the training data that's most common era I see it's surprisingly common and that's something Tiger teaches you where to evaluate what that's right so the the real point is evaluate your assumptions evaluate whatever modification 
you make to your code so make one modification at a time that's also something I've seen people modify three things but it's better but maybe one is detrimental or the modified three things result is worse they will discussions maybe one of them was good but it's upset by the other so you have to be consistent to run experiment and if you run experiment correctly record the settings you can reproduce what you do that's also important I really like this idea that you should treat machine learning as an experimental science because I think quite often you find that people they learn about things like a b testing in a statistics class and then machine learning classes separate and they don't reply those ideas that will actually be doing experiments when I'm machine learning so I really like that idea and you mentioned that things like cross validation are really important so there there was a course I was recommending a lot Andrew in pools and Coursera but it's outdated now it's with Matlab but still he was teaching how to evaluate models and I saw people just forget what he says because they were using a different kind of models a deep learning model as you've taken this course yes why don't you use cross-validation well does it apply to deep learning yeah it's a methodology it does not depend on the type of model that is interesting once you switch from just regular machine learning to deep learning people forget all the stuff that they learned in the original machine learning models do you have any tips for how you go about winning at kaggle like what are the techniques or things that you use in order to get better predictions so if you have a good cross validation setting so you can rely on what you do and the next thing to avoid is overfitting even if you scratch validation if you use it a lot same splits over and over again you end up overfitting to it so you need to use to be conservative make sure you don't select something just because it was lucky so there 
is a tendency of prepared Kegel offers you publicly doubles so it's a trend that split fix one across the competition and people rely too much on this fixed Trend test split so they will overfit to the public test so I almost never use a public test on cable almost anything oh and let's say I use it as an additional for if I use a five four cross validation on my training data the public test is a either six four but no more and the second is to have no preconception and so quickly create a baseline within our model usually if it's tabular data a simple CNN if it's a computer vision just running a pre-trained Transformer if it's NFP very quickly have a complete pipeline where you can submit and create a solution and then you improve gradually and you have no preconception always wonder well in all I've said oh I have this parameter fixed why not try to vary it why not don't be shy I see people they ask in the Forum do you think this could work don't ask just try it and see what happens you will learn something always so it's really good performance is just from a solid use of a scientific method sometimes people have a great idea that nobody else has it happens it happened to me as well from time to time but that's less and less frequent because the level of people is increasing the knowledge for instance I did well in NLP competition because I started using prompts there were some papers coming with prompt engineering there was an NLP competition at the same time so I just did some prompt engineering before it came really known so that was a good Advantage but now I would say the key is to perform the right experiment and what does it mean it depends on the competition and you get some that's that's your knowledge the training we get from here I would say so practice don't be shy test your hypothesis and be conservative that's really interesting the idea that is very difficult to predict which theory is going to give you the best result so the only thing you can 
do is just try lots of things and see what is the data show or the results actually any good or not because it seems very different to a lot of Sciences but I'd like to talk a bit about the kaggle grand masters of Nvidia so this is your team of taggle competitors I presume so it's not just you it's a group of several of you who are competing together yeah well I call the kechimon so our CEO found the acronym and it sounds like Pokemon so it's not by chance so our cargo Grandmaster of Nvidia we are eight of us plus me so it's nine so it's not a big team now about 150 I believe competition Grandmaster so there are not many companies that a few companies having a grand master team as well so it depends on what you want to achieve but I do believe in small teams of very good people and they all do the same as me it's like people having a PhD so I will come back to this but they know how to work effectively otherwise they would not be when restaurants so having good work habits is key which means they don't need much management so I don't see myself really as a manager coordinator maybe but most of my job is individual contribution and they all do the same as me a mix of competitions and internal projects I want to come back to phds I believe the one thing people learn during a PhD is autonomy good PhD does not need to be told what to do every day and they know how to complete a complex project till the end and kegger competition are the same they are complex projects time box usually three months and to do where you need to complete your project on time so that's also something that is good about career brand Masters they work fast and they meet deadlines so I think a lot of people listening are going to be thinking that sounds like a cushy job being able to just participate in competitions while I'm at work and they're going to want to know how do I get this for myself so you talk about how you persuaded management to let you do this as your career when I started 
caggling at IBM maybe I was spending on average one or two hours of work every day on it which is already good but most of my tackle time was evenings weekends holidays it was a hobby a passion so it's like people going to casino for gambling I believe it's the same it's a legal drug except you don't lose money here and to become a grand master you need to spend a lot of time it's Fierce competition a lot of people they became Grand Master because they are students phds in machine learning usually and once our Grandmaster they get a job and they stop caggling because they don't have enough time later and when I was hired at Nvidia I remember I was seventh and kegger so I became good at Kegel before it became my work so I would say just invest time if you if you have those skills and the motivation become a grand master and then you will find jobs like mine at Nvidia or at few other companies so obviously I answered the Nvidia job ad this career Grandmaster as prereq but I see on LinkedIn function so we see is from time to time companies asking for kager and Masters and what does Nvidia get out of you being a grand master like what's the benefit to the rest of the company in a competition someone shared a notebook that accelerated pandas and a pipeline using polar and this and that and I looked at it I said let's see what we can do in GPU so I use Nvidia chocola data frames part of Rapids I used querman KN so I used Nvidia all Nvidia stack recorded our notebook and got I believe uh 17x speed up so as a result people know that if they use GPU they will get better performance if I had not done it people would say oh if pandas is too slow just use polar which is an interesting advice and yes portal is more efficient in general than pendants but kujif is even faster another thing we did there is another competition in medical imaging so dicom image it's a standard format in medical image and in the competition only people who are not using GPU to decode the images but 
Nvidia had a toolkit so some people on my team they tried the Nvidia toolkits so that it did not handle some of the formats they worked with Nvidia product team and last month's released a notebook with an early access of the new toolkit and as a result images can be decoded on GPU in kagger and same the speed up is at least 10x I believe it's fast it's more than that so we showcase Nvidia tooling that's really interesting and honestly I'd love to talk more about the Nvidia tooling so of course gpus encaps your Baseline of business so can you just tell me a bit about what sort of machine learning problems are particularly suitable for a GPU I would say so if the plan name is the way to go think of GPU that's your first advice so if you're in computer vision so image classification or object location video tagging whatever NLP since broad paper since Transformers to cover it's again deep learning with some pre-processing called The Fast Forward transform it's unable to copy top Vision models so for these three class of data which people call unstructured data usually using an accelerator and especially GPU is a way to go compared to CPU for tabular data so say you have sales you have passages and you need to predict the future I'd say depend on the size of data sometimes you have to probably have a hundred location they have five years monthly data so it's like 60 times 30 data point 60 times 100 data point six thousand diffuse xgboost for instance you may not need GPU that's fine so small data I use whatever you want but for tabular data for instance recent competition it was a recommender system we had 18 million user 1.8 million products and 100 Millions interactions running so doing data processing and modeling we take the Boost or something else using a Rapids on GPU with the speed up is enormous it's 50 or more so again if we go back to what I was saying the key is to perform experiments quickly and effectively so we if you as soon as you can accelerate with 
GPU you will run more experiments So within a day you will test more hypothesis and you will make progress much more effectively so it seems like most of these examples you gave where the GPU is going to be faster these are examples where the code can be easily parallelized so you're doing multiple things independently is that correct well GPU does the the Frameworks do it for you so for instance let's say event for data processing if you use pandas and you want to compute I don't know one colon has uh the aggregate but you do a group by functions you want the mean of the spending by user in pan as it you would Group by users and you compute the mean but it will iterate through the users one by one so let's see control with qdf and GPU it will be run and parallel for you so you don't need to write a parallel code the code is paralyzed so this way you can get 100 times speed up just because it will process hundreds of users at a time in parallel so that's how you get the speed up for deep learning the bulk of competition is metrics multiplication and so on notifications and then GPU are designed to do this actually in Thailand so they change the memory and do the multiply of two parts of the Matrix in one cycle the CPU they do have some parallel but GPU are massively valid so when you can use massive parallel GPU is a great idea exhibus is the same most of the computations can be paralleled on GPU so you just select GPU haste it's one parameter in exit boost and your code runs on GPU using the GPU panel design but you don't need to change your code Network that sounds like a pretty useful thing it's like not having to write completely different code when you're switching to gpus so the Nvidia software stack for doing all this data science and gpus so that's Rapids and can you just tell us a little bit about what you can do with Rapids and who is for so I'd say Rapids is fairly comprehensive the motivation was to get a GPU accelerated version of pandas and psychifier 
sure you have a package called qdf with a data frames UDF which is similar to pandas except the data processing is done in GPU but the API is really similar to pandas to the point that now when I have a panda Squad that is too slow the first thing I do is import kdf as PD and then I run my code and most often runs as is and we are working with a rapid steam to to read user case where behavior is different and for psychic learn the Rapids equivalent is called who ml so good machine learning so not every algorithm is implemented yet but a lot and the API is really similar to the point that qml documentation refers to psychic learn documentation so that Google is really that it's easy to translate pipeline that uses pandas and circuit learn into a pipeline that use PDF and equipment and then over the recently many other packages have been added to Rapid like cusign I have less experience with this so they are a bit more specialized but then without tooling so that's always the same idea is to see what is needed to move pipeline from CPU to GPU for deep learning we there is no package no framework from Nvidia because tensorflow by torch and others did the work correctly so we support these Frameworks there is a backend called code DNN that this framework use but users don't need to worry about it so personally I use pytorch I know it uses School DNN under the hood but I just use by torch so for that reason given the Deep learning framework were already on GPU Rapids itself is not dealing with deep learning but we know and it's part of the feedback we gave okay government Masters that often it is useful to combine deep learning with other machine learning models so work has been done and recently UDF team has released a way to share memory between PDF and pytorch so you can prepare your data with UDF and when it's ready issues by pytorch without memory copy and it's all on GPU so the food pipeline is MVP all right so it seems like qdf is perhaps the most interesting 
part of Rapids for data scientists and machine learning scientists so it's a high performance pandas alternative but there are about a dozen of these different high performance panel turns around so how does qdf compare to things like Vex and modin and koalas and all the others well some of these are distributed so they get speed up by Distributing computation because for people who listen to us especially data scientists some may not know yet they soon learn it as soon as you use Python that python is minus threads because of something called the global interlock Gill so python is mono thread which means that if you want to use parallelism in Python either you call say a c or C bus library that does it for you like or you implement multiprocessing or you distribute across machines so there are some distributed data frames and you could have mentioned spark as well our preferred way at Nvidia is called desk so that's why the distributed computing system a bit similar to spark but it's more python friendly I would say and there is a desk for those who want to distribute and one reason to distribute is when your data is too big to fit in one machine memory and GPU memory is increasing but it's limited still so that's good f is a way to distribute data processing across multi-gpu and then when it comes to benchmarking as I said each time I tried PDF it was faster than anything else because really GPU are so so powerful so massively valid that it's really hard to complete the only thing that would limit application is the fact that the GPU memory is limited so for this looking at this PDF usually but that's I would say if if you can fit in the GPU memory in the memory of your gpus using this PDF it's hard to beat so it seems like Q UDF is a pretty high performance thing and maybe worth checking out if your pandas code is running too slow but I want to Circle back now to talking about your kaggle competitions and how it relates to more standard Machinery work in a 
business context so do you find there's a difference between competitive data science and machine learning at work yeah sure and IT addresses some valid criticism of cat girl which is not at work you maybe not just a data scientist but a company the organization using machine learning must cover a full life cycle that starts with framing a problem as machine learning Gathering data for it since most of the applicable machine learning is supervised learning you need to annotate this data to get trending data then you have data curation modeling model evaluation and once models are evaluated properly you put them in some production system or behind the dashboard or whatever you connect it to an e-commerce site for recommendation whatever your use case is and then you need to monitor all the model in production detect if performance is going down which may mean you need to retrain because something has changed in your environment there is a full life cycle and Kegel does not cover all of it when you in a Kegel competition you have curated data you have annotated data you have a metric so the problem is already defined for you and once you've trained the model you submit predictions to kaggle or you give your prediction code but it's applied to test data and that's it so you don't deploy you don't need to worry about Downstream so okay girl is only part of the machine learning pipeline but for this part it teaches you the right methodology which is what I explained before experiment base Etc so I would say category is great to learn about modeling and model evaluation but it's good only for this to someone who never worked on real life and onion caguard is not your full-fledged data scientist people need to get the experience in okay how do I even apply machine learning to this business problem where do I get the data to working with people how do I annotate it how do I get labels and downstream as well so Downstream is more understood I would say there is this mln 
engineer profession that has emerged that can personalize your model so we find more and more ml Engineers but I would say the Upstream part firming the problem as machine learning getting data reliably creating it Etc it's still a kind of art and will be underestimated at this point so that is interesting the idea that the competition only focuses on the sort of the middle part of the machine learning life cycle around making predictions but you don't get the start bit about frame the problem collecting the data and the end bit about how do I deploy this or how do I actually use the model so it seems like a big part of this is about not having to align your model with some kind of business goal so do you have any advice for people on how to do that like how do you make sure that the machine learning work you're doing is going to align with some kind of business objective that's a great question and actually when I'm asked to help a machine learning project if I'm not at the start I ask people suppose Imaging assume that your model is perfect it makes perfect predictions how do you use it and for instance if it's forecasting you can play back so assume you had perfect predictions how would have this impacted your business you know how to use a perfect prediction you predict exactly the target what would happen and not surprisingly for me but most of the time people have no clue so I say you need to design your business process including whatever so that it can consume the the output of of your model it's straightforward for instance I've seen once I used to be active on Twitter but I remember once saying they work at pharmaceutical company they don't say which one and they worked based on feedback about one medication produced by that company to predict when the medicine was most effective and they did a good job so with their machine learning model based on on a patient features the model could predict if it was worth using the medicine or not so it could be a 
good help for medical doctors when they presented the result to their management the project was shut down so I guess it's because the pressure to sell the medicine even when it's not effective so I'm not going to discuss the pharmaceutical industry incentives but I want to point that the people working on the machine learning project should have asked should have present should have asked the stakeholders what if we succeed how would we use a model is it worth doing and maybe someone would have said no we have no interest in doing this instead they spent one year a team so just check that you're doing something useful when you stop not when you're done so that leads to an interesting point about like how do you measure the success of machine learning projects I think like the kaggle ideal is machine learning works best when you have the best predictions but in real life that's not always the case so we can talk about what constitutes success for machine learning yeah so in cat girl most of the time what matters is how good a metric can become on the test data and this leads sometimes to complex Solutions with lots of models being ensembled and several stages and whole stacking and it's too complex to be used so Edgar is trying to limit the complexity but in short you want to balance the quality of the predictions with the cost of Maintenance the cost of implementation so you want you prefer to have one model that is a bit less performant and complex and somewhere you could get on kegger but which is simple to implement simpler to retrain you can maybe automate everything Etc so complexity of the model complexity of training the model is a key factor the other point is the metric is a proxy to the business problem so it's not because you get a good metric that you improve your business so let me give you an anecdote that I read I don't know if it's true or not maybe it's too good to be true but someone I know claimed to have worked on a support organization and did 
a customer churn project for a subscription company, a telco or TV or cable company or whatever. His model was predicting which customers were most likely to not renew their subscription. So what they did, and this is a classical example you see in many machine learning textbooks, they said, okay, let's run it on the customer base, and we'll have the support team in the call center phone the people most at risk to propose them an incentive, a rebate, or what have you. The problem is, many calls went like this: hello Mr. Customer, I'm from company X. Ah, great, I wanted to cancel my subscription, let's do it. So in fact they accelerated the cancellations, because they targeted the right people, but not with the right action. This is an extreme case, but it really highlights what I said: assume your model is good, how do you use it effectively to improve the business? The other thing is to measure, and if you don't know up front, you do A/B testing, as you mentioned before. Say you run the previous process for half of whatever it is you apply it to, your users, your machines, whatever, and for the other half you use the process with machine learning, and you monitor and see if there is a difference and in which direction. Hopefully the machine learning half works better, so you would use it more, but always keep a small fraction without machine learning, so that you can detect if at some point the machine learning system no longer works well, and this can happen if the underlying conditions are changing. So monitor what happens. I saw a presentation at an industry forum, someone right before me, and they were describing, I believe, a recommender system, and the results were not good, but they only discovered it like nine months after deployment, because they were not monitoring. As soon as they started monitoring, they noticed that the sales of promoted items were not increasing; in fact they had not included promotions in the training data, so the model was insensitive to promotions.
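The holdout idea described here, keeping a small fraction of users outside the machine learning process so degradation stays detectable, can be sketched with a deterministic split. This is a minimal illustration, not from the interview; the function name, the 5% control fraction, and the user IDs are all hypothetical.

```python
import hashlib

def assign_group(user_id: str, control_fraction: float = 0.05) -> str:
    """Deterministically assign a user to 'control' (no ML) or 'ml'.

    Hashing the ID makes the assignment stable across scoring runs,
    so a user never flips between groups while you compare outcomes.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "control" if bucket < control_fraction else "ml"

# Route each user, then later compare churn between the two groups.
groups = {uid: assign_group(uid) for uid in ["u1", "u2", "u3"]}
```

Comparing the business metric (renewals, sales) between the two groups, rather than only the model's offline metric, is exactly the check the churn story above was missing.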
It predicted that some products were popular for the wrong reason: they were popular because they were promoted, but the system did not have that data, so it invented a reason that would explain why the products were popular; it was overfitting. Once they noticed it, they retrained the model with past promotion data, and all of a sudden sales started to increase, but they had to monitor to see it in practice. So it's the same as I said: always check your assumptions. You assume you have a good model, and you have good reasons, you have done cross-validation and all of that; check that it is really good in practice. Okay, and those are two really great stories of machine learning disasters. The churn one in particular is really terrifying to me: at DataCamp we're primarily a subscription business, so customer churn is something that we live in terror of, and the idea that you could do a machine learning project and then make it worse is absolutely horrifying. So I'd like to talk about productivity a little. It seems like, particularly with your competitive machine learning background, you've got good at building models very quickly, so do you have any productivity tips for how to do machine learning faster? The key is to have a modular pipeline that is easy to maintain, to modify, to log. I'm used to logging things, to having something modular controlled by a configuration file, etc. I would say it's standard software development practice, but data scientists are not developers. That's something I really believe is true, and those who have no experience in software development have to learn it, unfortunately often the hard way; there are no programming schools for data scientists, really. People need to be able to version their code, to have a clear distinction between configuration and scripts, all sorts of things. The goal is to automate as much as possible, and then there are tools like Weights & Biases or Neptune AI that help you track experiments, for instance.
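A minimal sketch of the config-driven, logged workflow described above. The file names, the JSON format, and the `train_fn` hook are illustrative assumptions; in practice trackers like Weights & Biases or Neptune replace the hand-rolled log.

```python
import json
import time
from pathlib import Path

def run_experiment(config_path: str, train_fn) -> dict:
    """Run one experiment from a config file and append the result to a log.

    train_fn(config) -> validation score; testing a new idea then means
    editing the config and rerunning, with no manual bookkeeping.
    """
    config = json.loads(Path(config_path).read_text())
    record = {"config": config, "score": train_fn(config), "time": time.time()}
    with open("experiments.jsonl", "a") as log:   # one JSON line per run,
        log.write(json.dumps(record) + "\n")      # easy to sort and compare
    return record
```

Every run lands in the same append-only log, which is what makes experiments "easy to compare" the way the interview describes.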
So there is more and more tooling coming, but really the goal is to automate most things and focus on your ideas. You have a new idea, you should just write a bit of code or change some configuration, run it, get the results logged, easy to compare with other experiments. The key is really to remove the need for manual work, manual meaning typing, to get a new result. So I often start with notebooks, they are very good, but as soon as possible I move to Python scripts with configuration files, because it's faster to iterate. Okay, so removing any kind of manual step, trying to automate as much as possible, that seems like a great productivity tip. And do you have any advice for machine learning projects on a very short timescale? How do you do very fast projects that last just a couple of days, or maybe a week or two? So if it's only a few days, it's amazing what you can do in a few hours, really. If you only have a few hours, I would use an automated tool. For instance, I have tried one called AutoGluon, I think it's from Amazon, and it's quite impressive, but it's not the only one. So AutoML if you have a few hours. If you have weeks, then you can beat AutoML, because you can include additional data that is relevant, for instance, and you can include domain knowledge the system cannot derive, so you work more on the data, etc. So you can beat AutoML, but if you only have a few hours, I would run an automatic tool. And if it's tabular data, I would even start simpler: I would run linear regression or logistic regression, and if I have a bit more time, I would run XGBoost. If you only have a few hours, you can't do something complex, so use a simple model. Start simple, and if you have more time you can always make things more complex later on. I'd like to talk a bit about collaboration, since that's a big part of productivity. Do you have any tips on doing machine learning as a team? There are two cases: one is when we have to deliver common code, and
the other is when we have to deliver predictions. The second case is more for Kaggle, where you don't care about productionizing your model; if you only need to ship predictions, we collaborate by exchanging data, so datasets and predictions on these datasets. If it's common code, we have to use something like GitHub or GitLab and use software engineering techniques. For communication I often use Slack, because of the time differences in my team and in Kaggle teams. I have always worked with remote teams doing data science. On my team I have one person in Japan, one in Germany, two in France, one in Brazil, three in the U.S.; I hope I'm not forgetting someone, or they will be crushed. But you see, with that time difference it's hard to have everybody on a web conference. We do it, but not often, so we rely on asynchronous communication: commit, upload, download to a shared directory, and Slack. Other people can use other tools, but the point is it's asynchronous. We write our ideas and our results, the other person comes, reads, and responds. So it's like a remote development team, like an open source project; it's quite different from a dev organization where everybody is in the same office, as was common before the pandemic. After the pandemic more people are working remotely, but that's what I've been doing; that may be why I did not need to relocate to Silicon Valley, being able to work remotely. Yeah, it seems that communication is just a huge part of productivity, and having this idea of asynchronous communication where things are written down is incredibly important, particularly when you're in a remote team across different time zones. All right, before we wrap up I'd like to talk about conferences. We've got a bit of a clash, since both DataCamp and NVIDIA have data science conferences going on with partially overlapping dates: the DataCamp conference is called RADAR, that's on the 22nd and 23rd of March, and NVIDIA has its rival GTC conference with a few dates overlapping. So can you tell us a little bit about
what's going on at GTC? Yeah, so GTC is semi-annual, it runs once in spring and once in fall; it's the NVIDIA conference. We have keynotes by our CEO, where he usually announces new products and services, then you have tons of sessions, by industry, by use case, more or less technical, ranging from research to very applied, and we do have a couple of sessions from us, the KGMoN. So if you're using GPUs, attend GTC. To avoid the clash with concurrent events, sessions are always available in replay, and first of all GTC, the main conference, is free: you can register and watch when you're ready. For instance, being in France, I can't watch everything live, so I just use replays for what I'm interested in. It's really the way to get the latest news, and it's not just on data science; as people know, NVIDIA is big in gaming and other uses of GPUs, so whatever your interest, if you use GPUs, that's the conference to attend. That actually seems like a good, practical, diplomatic approach if you're stuck trying to decide which conference to go to: since they're both virtual, you can register for both and then watch whichever sessions appeal to you on the recordings later on. So to finish, what are you most excited about in data science and machine learning? The meta-excitement is that it's evolving so fast; you have to learn all the time, and I like it. I don't know what will be hot next year, I don't know what is doable. There is a frenzy about generative AI, etc., so I'm listening to that. What excites me is the progress. I spoke a lot about RAPIDS, but this year I used it more than before, and the speed-ups are pretty incredible; that's one thing. The other is that when I started there was a clear divide between statisticians, classical quote-unquote machine learning, and deep learning, and now these barriers are being removed, maybe because everybody has moved to using Python. It's great when we see an unexpected use of one technique in a place where it was not used before. For instance, a
colleague of mine won an image classification competition without training any deep learning model, by running SVM regression, support vector machine regression, on the predictions of pre-trained deep learning models. That's surprising; I'm surprised all the time, and that's what I love. That's a great answer, and I do think that having people pushed into different situations, like people moving to Python from a different language, or coming from statistics to machine learning, does throw up lots of interesting opportunities and innovation. All right, super, thank you for your time, lots of really great insights, and yeah, thank you for coming on the show, Jean-François. Thank you for inviting me, I enjoyed it. You've been listening to DataFramed, a podcast by DataCamp. Keep connected with us by subscribing to the show in your favorite podcast player. Please give us a rating, leave a comment, and share episodes you love; that helps us keep delivering insights into all things data. Thanks for listening, until next time.

You're listening to DataFramed, a podcast by DataCamp. In this show you'll hear all the latest trends and insights in data science. Whether you're just getting started in your data career or you're a data leader looking to scale data-driven decisions in your organization, join us for in-depth discussions with data and analytics leaders at the forefront of the data revolution. Let's dive right in. This is Richie. Today's guest is Jean-François Puget, a distinguished engineer at NVIDIA. In this episode we've got two topics to cover. That means we'll be talking about doing data science with the NVIDIA stack, meaning computing with GPUs, but we'll also get into the software that accompanies them. Jean-François's other claim to fame is that he's been in the top 10 of the Kaggle machine learning competition leaderboard for the last few years, so we'll delve into the world of competitive machine
learning to see how to become a Grandmaster. Hi Jean-François, thank you for joining me today. Just to begin with, can you tell the audience a little bit about what you do and what it means to be a distinguished engineer? People believe that to get more influence, better compensation, whatever, you need to move to management, while some tech companies, including NVIDIA, and it was true at IBM, my previous employer, let people grow as individual contributors. So distinguished engineer means I'm a good individual contributor. I manage a small team of individual contributors, but it does not take much of my time. What I do is mostly build machine learning models, be it as part of machine learning competitions, on Kaggle for instance, to showcase the NVIDIA stack and the NVIDIA brand maybe, but also for internal projects or NVIDIA partners. My job is to build good machine learning models. And can you give some examples of different projects you're working on? What are you building machine learning models about? So for instance the last few competitions I did; I can discuss Kaggle openly because it's public. The internal projects I'm working on, some are very exciting, but people will have to come to GTC soon to learn about those projects. A recent machine learning competition, for instance, was to predict some protein property from the protein sequence, and for this we used models like AlphaFold and other models that were quite hyped recently; it's a breakthrough in computational biology. Previously it was natural language processing, not the ChatGPT kind, it was more text classification on a specific topic, for medical examinations. So it can vary a lot. I also worked on diagnosing from medical images, from radiology images or microscopy images. It could also be temporal forecasting, so time series forecasting, sales forecasting; for instance right now there is a competition sponsored by GoDaddy where we have to predict basically the number of business websites hosted on GoDaddy by U.S. county. So
that's time series forecasting. You see, it varies a lot, it's across industries, and the only common piece, and that's a bit surprising, is that the same mathematical techniques, machine learning, can be used across a lot of use cases and industries. Okay, so it seems like you're doing a pretty broad range of machine learning projects, all over the place in terms of the different technologies and different industries that you work with. You mentioned Kaggle; I'd like to get into a bit more depth about that. The top rank of competitive machine learning, I believe, is Kaggle, so can you tell me how you got started using Kaggle? When I was a student I was fascinated by AI, in particular machine learning, so I did a PhD in machine learning. It was a long time ago, before the deep learning wave, etc. I was always interested in machine learning, but after my PhD there was an AI winter, so I moved to doing something else, mathematical optimization. Around 2008 IBM acquired the company I was at, noticed I had a background in machine learning, and wanted to invest in AI and data science, so I was asked to do more than mathematical optimization, and I was looking for a way to refresh my knowledge of machine learning. I started rereading academic papers, but especially at that time, it's less true now, academia was a bit remote from practical use. So I looked around and I found Kaggle, a site where people could compete, and people were using whatever tools they could to get the best possible result; there was no preconception, as long as something worked it was used. So I started reading the top solutions and using them for my job at IBM, and by looking at what tools people were using I saw the emergence of Keras, for instance, and of XGBoost, which is now very popular but started on Kaggle. I witnessed the deep learning frenzy there. So it was useful for my job, but after a while I said, maybe I should try it myself. I remember entering my first
competition saying, you will see, guys, I have a PhD in machine learning, I will crush you all. I was doing quite well until the results on the hidden test set, the private dataset, came in, and I dropped from the top 100 to around rank 2000. So I said, okay, I need to learn; my theoretical knowledge is not really practical. I started learning, I enjoyed it, and after a while I became one of the top 10 on Kaggle and a Kaggle Grandmaster, and I keep doing it even today. That's a pretty impressive achievement, being top 10 in the world out of, I'm not sure how many, is it hundreds of thousands of people who participate in these Kaggle competitions? Kaggle has 15 million users; not everybody enters competitions, but people who got a rank in a competition, I think that's in the tens of thousands, which means many more entered but got no results. Yeah, it's quite a lot of people, and that's a very impressive achievement. So I'd like to hear a little bit about the secret of your success: how have you managed to get to that high ranking position? I would say the foundation is to have a scientific approach. I was trained as a scientist; I was good at physics and math, in France I achieved the best possible math results, and I even went to the Physics Olympiad representing France, so I have a good scientific background. And the scientific method, I could say in a nutshell, is that you check your assumptions, always. You have assumptions, and you design experiments such that the result will tell you if your assumption is right or wrong. I strongly believe machine learning is a science, an experimental science, like physics. So I approach a competition with a scientific approach, and everything is an experiment. For instance, if I have an idea for a new data processing step, a new feature, a new model architecture, whatever, I must have a baseline, something I know and trust, and then I run an experiment with the new data processing, with the new model architecture change,
whatever it is. I run my experiment, and then I look at the result: is it better or worse? Sometimes I dig a bit into the outputs to understand where it is better or worse, and from there either I accept the change and it becomes my new baseline, etc. And for this you need to have what we call a good cross-validation setting. The bread and butter of practical machine learning is cross-validation. It can be K-fold cross-validation; if it's time series, it's temporal cross-validation; but the point is you split your training data, you keep some of it to validate your model, you train on the rest, and when it is trained you predict on the validation set and compare the predictions with the target on the validation set. K-fold means you do this K times with K different splits. There are variants, but really: you don't evaluate your model on the training data. That's the most common error, and I see it surprisingly often; that's something Kaggle teaches you, where to evaluate what. That's right, so the real point is: evaluate your assumptions, evaluate whatever modification you make to your code, and make one modification at a time. That's also something I've seen: people modify three things and it's better, but maybe one of the changes is detrimental; or they modify three things and the result is worse, so they discard it, but maybe one of them was good and it's offset by the others. So you have to be consistent in running experiments, and if you run experiments correctly and record the settings, you can reproduce what you did; that's also important. I really like this idea that you should treat machine learning as an experimental science, because I think quite often you find that people learn about things like A/B testing in a statistics class, and then the machine learning class is separate, and they don't apply those ideas, that you're actually doing experiments when you're doing machine learning. So I really like that idea. And you mentioned that things like cross-validation are really important.
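The K-fold procedure described here can be sketched from scratch. In practice scikit-learn's KFold, or a time-ordered split for temporal data, does this for you; this hand-rolled version is just to make the mechanics concrete.

```python
def kfold_indices(n: int, k: int = 5):
    """Yield (train_idx, valid_idx) pairs for K-fold cross-validation.

    Each fold is held out exactly once for validation while the model
    trains on the rest, so no row is ever scored on data it trained on.
    """
    indices = list(range(n))
    # Spread any remainder rows across the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size
```

For each of the K splits you train, predict on the held-out fold, and compare against the target, exactly the loop described above; for time series the splits must respect temporal order instead of being arbitrary.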
There was a course I was recommending a lot, Andrew Ng's on Coursera, but it's outdated now, it's with MATLAB. Still, he was teaching how to evaluate models, and I saw people just forget what he says because they were using a different kind of model, a deep learning model. You've taken this course? Yes. Then why don't you use cross-validation? Well, does it apply to deep learning? Yes: it's a methodology, it does not depend on the type of model. That is interesting, that once you switch from classical machine learning to deep learning, people forget all the stuff that they learned for the original machine learning models. Do you have any tips for how you go about winning at Kaggle, like what are the techniques or things that you use in order to get better predictions? So, if you have a good cross-validation setting, you can rely on what you do, and the next thing is to avoid overfitting. Even with good cross-validation, if you use it a lot, the same splits over and over again, you end up overfitting to it, so you need to be conservative; make sure you don't select something just because it was lucky. There is a related tendency: Kaggle offers you a public leaderboard, a train/test split that is fixed across the competition, and people rely too much on this fixed test split, so they overfit to the public test set. I almost never use the public test on Kaggle, or let's say I use it as one additional fold: if I use a five-fold cross-validation on my training data, the public test is a sixth fold, but no more. The second thing is to have no preconceptions, and to quickly create a baseline: a simple model usually if it's tabular data, a simple CNN if it's computer vision, a pre-trained Transformer if it's NLP. Very quickly have a complete pipeline where you can submit and create a solution, and then you improve gradually. And have no preconceptions: always wonder, in all I've done, oh, I have this parameter fixed, why not try to vary it? Don't be shy. I see people ask in the forum, do
you think this could work? Don't ask, just try it and see what happens; you will always learn something. So really, good performance comes from a solid use of the scientific method. Sometimes people have a great idea that nobody else has; it happens, it happened to me as well from time to time, but that's less and less frequent because the level of people, the knowledge, is increasing. For instance, I did well in an NLP competition because I started using prompts: there were some papers coming out on prompt engineering, and there was an NLP competition at the same time, so I just did some prompt engineering before it became really well known, and that was a good advantage. But now I would say the key is to perform the right experiments, and what that means depends on the competition; that's the knowledge, the training you get from experience, I would say. So practice, don't be shy, test your hypotheses, and be conservative. That's really interesting, the idea that it is very difficult to predict which theory is going to give you the best result, so the only thing you can do is just try lots of things and see what the data shows, whether the results are actually any good or not; it seems very different from a lot of sciences. I'd like to talk a bit about the Kaggle Grandmasters of NVIDIA. This is your team of Kaggle competitors, I presume, so it's not just you, it's a group of several of you who are competing together? Yeah, well, we're called the KGMoN; our CEO found the acronym, and it sounds like Pokémon, which is not by chance. So, the Kaggle Grandmasters of NVIDIA: we are eight of us plus me, so it's nine. It's not a big team; there are now about 150, I believe, competition Grandmasters in the world, and there are a few companies having a Grandmaster team as well. It depends on what you want to achieve, but I do believe in small teams of very good people, and they all do the same as me. It's like people having a PhD, and I will come back to this, but they know how to work effectively, otherwise they
would not be Grandmasters. Having good work habits is key, which means they don't need much management, so I don't see myself really as a manager, a coordinator maybe; most of my job is individual contribution, and they all do the same as me, a mix of competitions and internal projects. I want to come back to PhDs. I believe the one thing people learn during a PhD is autonomy: a good PhD does not need to be told what to do every day, and they know how to complete a complex project till the end. Kaggle competitions are the same: they are complex projects, time-boxed, usually three months, where you need to complete your project on time. So that's also something that is good about Kaggle Grandmasters: they work fast and they meet deadlines. I think a lot of people listening are going to be thinking, that sounds like a cushy job, being able to just participate in competitions while at work, and they're going to want to know, how do I get this for myself? So can you talk about how you persuaded management to let you do this as your career? When I started Kaggling at IBM, maybe I was spending on average one or two hours of work time every day on it, which is already a lot, but most of my Kaggle time was evenings, weekends, holidays; it was a hobby, a passion. It's like people going to a casino for gambling, I believe it's the same, it's a legal drug, except you don't lose money here. And to become a Grandmaster you need to spend a lot of time; it's fierce competition. A lot of people became Grandmasters because they were students, PhDs in machine learning usually, and once they are Grandmasters they get a job and they stop Kaggling because they don't have enough time anymore. When I was hired at NVIDIA, I remember I was seventh on Kaggle, so I became good at Kaggle before it became my work. So I would say, just invest time, if you have the skills and the motivation, become a Grandmaster, and then you will find jobs like mine at NVIDIA or at a few other companies. So obviously I answered the
NVIDIA job ad that listed Kaggle Grandmaster as a prerequisite, but I see on LinkedIn from time to time companies asking for Kaggle Grandmasters. And what does NVIDIA get out of you being a Grandmaster? Like, what's the benefit to the rest of the company? In a competition, someone shared a notebook that accelerated a pandas pipeline using Polars and this and that, and I looked at it and said, let's see what we can do on GPU. So I used cuDF, the NVIDIA GPU DataFrame library, part of RAPIDS, and I used cuML's KNN, so the whole NVIDIA stack, recoded the notebook, and got I believe a 17x speed-up. As a result, people know that if they use a GPU they will get better performance. If I had not done it, people would say, oh, if pandas is too slow, just use Polars, which is interesting advice, and yes, Polars is in general more efficient than pandas, but cuDF is even faster. Another thing we did: there was another competition in medical imaging, with DICOM images, a standard format in medical imaging, and in the competition people were not using GPUs to decode the images, but NVIDIA had a toolkit. Some people on my team tried the NVIDIA toolkit, and it did not handle some of the formats, so they worked with the NVIDIA product team, and last month they released a notebook with early access to the new toolkit. As a result, images can be decoded on GPU on Kaggle, and again the speed-up is at least 10x, I believe it's more than that. So we showcase NVIDIA tooling. That's really interesting, and honestly I'd love to talk more about the NVIDIA tooling. Of course GPUs are the core of your business, so can you just tell me a bit about what sort of machine learning problems are particularly suitable for a GPU? I would say, if deep learning is the way to go, think of GPUs; that's the first advice. So if you're in computer vision, image classification or object detection, video tagging, whatever; NLP, since the BERT paper, since Transformers took over, it's again deep learning; and speech or audio, with some pre-processing
called the fast Fourier transform, after which you can apply top vision models. So for these three classes of data, which people call unstructured data, using an accelerator, and especially a GPU, is the way to go compared to a CPU. For tabular data, say you have sales, you have past data, and you need to predict the future, I'd say it depends on the size of the data. Sometimes you have, say, a hundred locations with five years of monthly data, so it's 60 times 100 data points, six thousand; if you use XGBoost, for instance, you may not need a GPU, and that's fine. For small data, use whatever you want. But for tabular data at scale, for instance a recent competition was a recommender system where we had 18 million users, 1.8 million products, and 100 million interactions; doing the data processing and modeling, with XGBoost or something else, using RAPIDS on GPU, the speed-up is enormous, 50x or more. So again, going back to what I was saying, the key is to perform experiments quickly and effectively: as soon as you can accelerate with a GPU, you will run more experiments, so within a day you will test more hypotheses and you will make progress much more effectively. It seems like in most of these examples where the GPU is going to be faster, the code can be easily parallelized, so you're doing multiple things independently; is that correct? Well, the frameworks do it for you. For instance, for data processing, say you use pandas and you want to compute an aggregate of one column with a group-by: you want the mean spending by user. In pandas you would group by user and compute the mean, but it will iterate through the users one by one. The same code with cuDF on GPU will be run in parallel for you, so you don't need to write parallel code; the code is parallelized. This way you can get a 100x speed-up, just because it will process hundreds of users at a time in parallel.
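The group-by example he gives looks like this in code. The data here is made up, and on a GPU machine the only change needed for the cuDF version is the import, assuming, as the interview does, that the pandas API carries over for this operation.

```python
import pandas as pd   # on a GPU machine: "import cudf as pd" (RAPIDS)

spend = pd.DataFrame({
    "user": ["a", "a", "b", "b", "b"],
    "amount": [10.0, 20.0, 3.0, 6.0, 9.0],
})

# Mean spending per user: pandas works through the groups serially,
# while cuDF runs the same group-by across many users in parallel on GPU.
mean_spend = spend.groupby("user")["amount"].mean()
```

The call is identical on both backends, which is the point being made: the framework, not the user, decides how to parallelize it.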
So that's how you get the speed-up. For deep learning, the bulk of the computation is matrix multiplications, convolutions, and so on, and GPUs are designed to do this in hardware: they fetch from memory and do the multiplication of two parts of the matrix in one cycle. CPUs do have some parallelism, but GPUs are massively parallel, so when you can use massive parallelism, a GPU is a great idea. XGBoost is the same: most of the computations can be parallelized on GPU, so you just select gpu_hist, it's one parameter in XGBoost, and your code runs on GPU using the GPU parallel design, and you don't need to change your code. That sounds like a pretty useful thing, not having to write completely different code when you're switching to GPUs. So, the NVIDIA software stack for doing all this data science on GPUs, that's RAPIDS; can you just tell us a little bit about what you can do with RAPIDS and who it's for? I'd say RAPIDS is fairly comprehensive. The motivation was to get GPU-accelerated versions of pandas and scikit-learn. You have a package called cuDF with DataFrames; cuDF is similar to pandas, except the data processing is done on GPU, but the API is really similar to pandas, to the point that now, when I have pandas code that is too slow, the first thing I do is import cudf as pd, and then I run my code, and most often it runs as is; and we are working with the RAPIDS team to review use cases where the behavior is different. For scikit-learn, the RAPIDS equivalent is called cuML, CUDA machine learning. Not every algorithm is implemented yet, but a lot are, and the API is really similar, to the point that the cuML documentation refers to the scikit-learn documentation. The goal is really that it's easy to translate a pipeline that uses pandas and scikit-learn into a pipeline that uses cuDF and cuML. And then recently many other packages have been added to RAPIDS, like cuSignal; I have less experience with those, they are a bit more specialized.
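The "import cudf as pd" trick he mentions can be made explicit with a fallback, so the same script runs on CPU when RAPIDS is not installed. This pattern is an illustration, not from the interview, and the same idea applies to cuML versus scikit-learn.

```python
# Try the GPU DataFrame library first, fall back to pandas on CPU.
try:
    import cudf as xpd       # RAPIDS GPU DataFrames
    BACKEND = "gpu"
except ImportError:
    import pandas as xpd     # same API, CPU execution
    BACKEND = "cpu"

df = xpd.DataFrame({"x": [1, 2, 3], "y": [2.0, 4.0, 6.0]})
total = float(df["y"].sum())  # identical call on either backend
```

XGBoost follows the same one-switch philosophy mentioned above: flipping the tree method parameter to its GPU histogram variant moves training to the GPU without touching the rest of the code.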
But it's always the same idea: see what is needed to move a pipeline from CPU to GPU. For deep learning there is no package, no framework from NVIDIA, because TensorFlow, PyTorch, and others already did the work, so we support those frameworks. There is a backend called cuDNN that these frameworks use, but users don't need to worry about it. Personally I use PyTorch; I know it uses cuDNN under the hood, but I just use PyTorch. So for that reason, given that the deep learning frameworks were already on GPU, RAPIDS itself is not dealing with deep learning. But we know, and it's part of the feedback we Kaggle Grandmasters gave, that it is often useful to combine deep learning with other machine learning models, so work has been done, and recently the cuDF team released a way to share memory between cuDF and PyTorch: you can prepare your data with cuDF and, when it's ready, use it from PyTorch without a memory copy, and it's all on GPU, so the full pipeline is on GPU. All right, so it seems like cuDF is perhaps the most interesting part of RAPIDS for data scientists and machine learning scientists: a high-performance pandas alternative. But there are about a dozen different high-performance pandas alternatives around, so how does cuDF compare to things like Vaex and Modin and Koalas and all the others? Well, some of these are distributed, so they get speed-ups by distributing computation. For people who are listening, especially data scientists, some may not know it yet, but they soon learn it as they use Python: Python is single-threaded because of something called the global interpreter lock, the GIL. Python being mono-threaded means that if you want parallelism in Python, either you call, say, a C or C++ library that does it for you, or you implement multiprocessing, or you distribute across machines. So there are some distributed DataFrames, and you could have mentioned Spark as well. Our preferred way at NVIDIA is called Dask, and that's
Dask is a distributed computing system, a bit similar to Spark but more Python-friendly I would say, and there is Dask-cuDF for those who want to distribute. One reason to distribute is when your data is too big to fit in one machine's memory; GPU memory is increasing, but it's still limited, so Dask-cuDF is a way to distribute data processing across multiple GPUs. Then, when it comes to benchmarking, as I said, each time I tried cuDF it was faster than anything else, because GPUs are so powerful, so massively parallel, that it's really hard to compete. The only thing that limits applicability is the fact that GPU memory is limited, so watch out for that when using cuDF, but I would say that if you can fit in GPU memory, in the memory of your GPUs, cuDF is hard to beat. So it seems like cuDF is a pretty high performance thing, and maybe worth checking out if your pandas code is running too slow, but I want to circle back now to talking about your Kaggle competitions and how they relate to more standard machine learning work in a business context. Do you find there's a difference between competitive data science and machine learning at work? Yeah, sure, and it addresses some valid criticism of Kaggle, which is that at work you are maybe not just a data scientist: a company, an organization using machine learning, must cover a full life cycle. That starts with framing a problem as machine learning and gathering data for it; since most applicable machine learning is supervised learning, you need to annotate this data to get training data; then you have data curation, modeling, and model evaluation; and once models are evaluated properly, you put them in some production system, or behind a dashboard, or you connect it to an e-commerce site for recommendation, whatever your use case is. And then you need to monitor the model in production and detect if performance is going down, which may mean you need to retrain because something has changed in your environment.
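The monitoring step at the end of that life cycle can start very simply: track a rolling metric on live predictions and alert when it drops below a floor. A minimal sketch, where the window size, threshold, and outcomes are made-up values for illustration:

```python
from collections import deque

# Rolling accuracy monitor: alert when the model's recent hit rate
# drops below a floor, a minimal form of production monitoring.
class AccuracyMonitor:
    def __init__(self, window=4, floor=0.5):
        self.window = deque(maxlen=window)  # most recent outcomes only
        self.floor = floor

    def record(self, prediction, actual):
        self.window.append(prediction == actual)
        rate = sum(self.window) / len(self.window)
        return rate >= self.floor  # False means "raise an alert"

monitor = AccuracyMonitor(window=4, floor=0.5)
outcomes = [(1, 1), (0, 0), (1, 0), (1, 0), (0, 1), (1, 0)]
alerts = [monitor.record(p, a) for p, a in outcomes]
print(alerts)  # [True, True, True, True, False, False]
```

The last two entries flip to False as recent accuracy degrades, the signal that something in the environment may have changed and retraining is worth investigating.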
So there is a full life cycle, and Kaggle does not cover all of it. When you're in a Kaggle competition, you have curated data, you have annotated data, you have a metric, so the problem is already defined for you, and once you've trained the model you submit predictions to Kaggle, or you submit your prediction code and it's applied to test data, and that's it. You don't deploy, you don't need to worry about downstream. So Kaggle is only part of the machine learning pipeline, but for that part it teaches you the right methodology, which is what I explained before, being experiment-based, etc. I would say Kaggle is great to learn about modeling and model evaluation, but it's good only for this. Someone who has never worked on real-life problems and only on Kaggle is not a full-fledged data scientist; people need to get the experience of: how do I even apply machine learning to this business problem, where do I get the data, how do I work with people, how do I annotate the data, how do I get labels, and downstream as well. Downstream is better understood, I would say; there is this ML engineer profession that has emerged that can productionize your model, so we find more and more ML engineers, but I would say the upstream part, framing the problem as machine learning, getting data reliably, curating it, etc., is still a kind of art, and it is underestimated at this point. So that is interesting, the idea that the competition only focuses on sort of the middle part of the machine learning life cycle, around making predictions, but you don't get the start bit, about framing the problem and collecting the data, and the end bit, about how do I deploy this or how do I actually use the model. It seems like a big part of this is about not having to align your model with some kind of business goal. So do you have any advice for people on how to do that, like how do you make sure that the machine learning work you're doing is going to align with some kind of business objective? That's a great question, and actually when I'm asked to help
with a machine learning project, if I'm not there at the start, I ask people: suppose, imagine, assume that your model is perfect, that it makes perfect predictions; how do you use it? For instance, if it's forecasting, you can play it back: assume you had perfect predictions, how would this have impacted your business? You know how to use a perfect prediction, you predict exactly the target, so what would happen? And, not surprisingly for me, most of the time people have no clue. So I say: you need to design your business process so that it can consume the output of your model; it's not always straightforward. For instance, I used to be active on Twitter, and I remember someone once saying they worked at a pharmaceutical company, they didn't say which one, and they worked on feedback about one medication produced by that company, to predict when the medicine was most effective, and they did a good job. With their machine learning model, based on a patient's features, the model could predict whether it was worth using the medicine or not, so it could be a good help for medical doctors. When they presented the result to their management, the project was shut down; I guess because of the pressure to sell the medicine even when it's not effective. I'm not going to discuss the pharmaceutical industry's incentives, but I want to point out that the people working on the machine learning project should have asked the stakeholders up front: what if we succeed? How would we use the model? Is it worth doing? And maybe someone would have said no, we have no interest in doing this, instead of a team spending one year on it. So just check that you're doing something useful when you start, not when you're done. So that leads to an interesting point about how you measure the success of machine learning projects. I think the Kaggle ideal is that machine learning works best when you have the best predictions, but in real life that's not always the case, so can we talk about what constitutes success for a machine learning project?
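John's "assume the model is perfect" thought experiment can be made concrete with a quick back-test: replay history with perfect forecasts and check whether the business process could even exploit them. A sketch with a made-up demand series and a made-up cost model; the point is only the shape of the exercise, comparing the current policy against the perfect-prediction ceiling:

```python
# Back-test: if demand had been predicted perfectly, how much would
# stocking exactly that amount have saved versus a fixed-stock policy?
demand = [7, 12, 9, 15, 8]   # historical daily demand (made-up numbers)
FIXED_STOCK = 10             # the current naive policy
HOLDING_COST = 1             # cost per unsold unit left over
LOST_SALE_COST = 3           # cost per unit of unmet demand

def daily_cost(stock, d):
    unsold = max(stock - d, 0)
    unmet = max(d - stock, 0)
    return unsold * HOLDING_COST + unmet * LOST_SALE_COST

naive = sum(daily_cost(FIXED_STOCK, d) for d in demand)
perfect = sum(daily_cost(d, d) for d in demand)  # stock = perfect forecast

print(naive, perfect)  # 27 0
```

If the gap between the two numbers is small, even a perfect model would not move the business, which is exactly the "is it worth doing?" question to ask before spending a year on the project.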
Yeah, so in Kaggle, most of the time what matters is how good a metric can become on the test data, and this sometimes leads to complex solutions, with lots of models being ensembled, several stages, whole stacking pipelines, and it's too complex to be used. Kaggle is trying to limit that complexity, but in short, you want to balance the quality of the predictions against the cost of maintenance and the cost of implementation. You may prefer one model that is a bit less performant than the complex ensemble you could get on Kaggle, but which is simpler to implement, simpler to retrain, where you can maybe automate everything, etc. So the complexity of the model, and the complexity of training the model, is a key factor. The other point is that the metric is a proxy for the business problem, so it's not because you get a good metric that you improve your business. Let me give you an anecdote that I read; I don't know if it's true or not, maybe it's too good to be true, but someone claimed to have worked in a support organization on a customer churn problem for a subscription company, like a telco or TV or cable company or whatever. His model predicted which customers were most likely to not renew their subscription. What they did, and this is a classic example you see in many machine learning textbooks, is they said: okay, let's run it on the customer base, and we'll have the support team in the call center phone the people most at risk, to propose them an incentive, a rebate, or what have you. The problem is that many calls went like this: "Hello Mr. Customer, I'm from company X." "Ah, great, I wanted to cancel my subscription, let's do it." So in fact they accelerated the cancellations, because they targeted the right people, but not with the right answer. This is an extreme case, but it really highlights what I said: assume your model is good, how do you use it effectively to improve the business? The other thing is to measure, and if you don't know up front, you have to do A/B testing, as you
mentioned before: say you run the previous process for half of whatever thing you apply it to, your users, your machines, whatever, and for the other half you use the process with machine learning, and you monitor and see if there is a difference, and in which direction. Hopefully the half with machine learning works better, so you would use it more, but always keep a small fraction without machine learning, so that you can detect if at some point the machine learning system no longer works well. This can happen if the underlying conditions are changing, so monitor what happens. I've seen a presentation at an industry forum, by someone right before me, and they were describing, I believe, a recommender system, and the results were not good, but they only discovered it like nine months after deployment, because they were not monitoring. As soon as they started monitoring, they noticed that the sales of promoted items were not increasing. In fact, they had not included promotions in the training data, so the model was insensitive to promotions; it predicted that some products were popular for the wrong reason. They were popular because they were promoted, but the system did not have that data, so it invented a reason that would explain why the products were popular; it was overfitting. Once they noticed it, they retrained the model with past promotion data, and all of a sudden sales started to increase, but they had to monitor and see it in practice. So it's the same as I said: always check your assumptions. You assume you have a good model, and you have good reason to, you have done cross-validation and all of that; still, check that it is really good in practice. Okay, those are two really great stories of machine learning disasters. The one about churn in particular is really terrifying to me: at DataCamp we're primarily a subscription business, so customer churn is something that we live in terror of, and the idea that you could do a machine learning project and then make it worse is absolutely horrifying.
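The A/B-with-holdout practice John describes comes down to comparing outcome rates between the two arms and watching the lift over time. A toy sketch with made-up counts:

```python
# Compare the ML-driven arm against a small holdout kept without ML,
# a basic guard that the model is still helping in production.
def conversion_rate(conversions, visitors):
    return conversions / visitors

ml_arm = conversion_rate(130, 1000)    # 13.0% with the model (made-up)
holdout = conversion_rate(100, 1000)   # 10.0% without it (made-up)
lift = (ml_arm - holdout) / holdout

print(f"lift: {lift:.0%}")  # lift: 30%
# If the lift drifts toward zero (or goes negative), investigate:
# the model may have stopped working as conditions changed.
```

Keeping the holdout running permanently, rather than only during the initial test, is what turns this from a one-off experiment into the monitoring John recommends.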
So I'd like to talk about productivity a little. It seems like, particularly with your competitive machine learning background, you've gotten good at building models very quickly, so do you have any productivity tips for how to do machine learning faster? The key is to have a modular pipeline that is easy to maintain, to modify, to log. By now I'm used to logging things, to having something modular controlled by a configuration file, etc. I would say it's standard software development practice, but data scientists are not developers; that's something I really believe is true, and those who have no experience in software development have to learn it, unfortunately often the hard way, as there aren't really programming schools for data scientists. People need to be able to version their code, to have a clear distinction between configuration and the base scripts, all sorts of things, but the goal is to automate as much as possible. And then there are tools like Weights & Biases or Neptune.ai that help you track experiments, for instance; there is more and more tooling coming, but really the goal is to automate most things and focus on your ideas. You have a new idea: you should just write a bit of code or change some configuration, run it, and get the results logged, easy to compare with other experiments. So the key is really to remove the need to do manual work, manual meaning typing, to get a new result. I start with notebooks, they are very good, but as soon as possible I move to Python scripts with configuration files, because it's faster to rerun. Okay, so removing any kind of manual step, trying to automate as much as possible, that seems like a great productivity tip. And do you have any advice for when you have to do machine learning projects in a very short time? How do you do very fast projects that take just a couple of days, or maybe a week or two? So if it's only a few days, if the problem is clear, it's amazing what you can do even in a few hours.
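The configuration-driven workflow John describes can be tiny: a config dict (or JSON file), a run function, and an append-only log of results to compare experiments. A stdlib sketch with hypothetical names, where the "training" is a stand-in computation so the example is self-contained:

```python
import json

results_log = []  # in practice: a file, or a tracker like W&B / Neptune.ai

def run_experiment(config):
    # Stand-in for training: the "score" depends only on the config,
    # so every run is reproducible from its configuration alone.
    score = config["learning_rate"] * config["epochs"]
    results_log.append({"config": config, "score": score})
    return score

# Each new idea is just a new configuration, not new manual steps.
for lr in (0.1, 0.2):
    run_experiment({"learning_rate": lr, "epochs": 10})

best = max(results_log, key=lambda r: r["score"])
print(json.dumps(best))
```

Because every result is logged alongside the exact configuration that produced it, comparing experiments is a lookup rather than re-typing and re-running anything by hand.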
Really, if you only have a few hours, I would use an automated tool. I have tried one called AutoGluon, I think it's from Amazon, and it's quite impressive, but it's not the only one. So, AutoML: if you only have a few hours, I would go with an automated tool. If you have weeks, then you can beat AutoML, because you can include additional data that is relevant, for instance, and you can include domain knowledge that the system cannot derive, so you work more on the data, etc. So you can beat AutoML, but if you only have a few hours, I would run an automated tool, and if it's tabular data I would even start simpler: I would run linear regression or logistic regression, and if I had a bit more time I would run XGBoost. If you only have a few hours you can't do something complex, so use a simple model. Start simple, and if you have more time you can always make things more complex later on. I'd like to talk a bit about collaboration, since that's a big part of productivity. Do you have any tips on doing machine learning as a team? There are two cases: one is when we have to deliver common code, and the other is when we have to deliver predictions. The second case is more for Kaggle, where you don't care about productionizing your model; if you only need to ship predictions, we collaborate by exchanging data, so datasets and predictions on those datasets. If it's common code, we have to use something like GitHub or GitLab and use software engineering techniques. For communication I often use Slack, because of the time differences on my team; I have always worked with remote teams in data science. On my team I have one person in Japan, one in Germany, two in France, one in Brazil, three in the U.S., and I hope I'm not forgetting someone, they would be crushed. So you see, with that time difference it's hard to get everybody on a web conference; we do it, but not often, so we rely on asynchronous communication: we commit code, upload and download data in a shared directory, and use Slack. People can use other tools, but the point is that it's asynchronous.
We write down our ideas and our results, and the other person comes along, reads them, and responds. So it's like a remote development team, like an open source project; it's quite different from the kind of dev organization where everybody is in the same office, which is what I was used to before the pandemic. Since the pandemic, more people are working remotely, but that's what I've already been doing, and that may be why I did not need to relocate to Silicon Valley: being able to work remotely. Yeah, it just seems that communication is a huge part of productivity, and having this idea of asynchronous communication, where things are written down, is incredibly important, particularly when you're on a remote team across different time zones. All right, before we wrap up, I'd like to talk about conferences. We've got a bit of a clash, since both DataCamp and NVIDIA have data science conferences going on with partially overlapping dates: the DataCamp conference is called RADAR, on the 22nd and 23rd of March, and NVIDIA has its rival GTC conference with a few dates overlapping. So can you tell us a little bit about what's going on at GTC? Yeah, so GTC is semi-annual; it runs once in spring and once in fall, and it's the NVIDIA conference. There are keynotes where our CEO usually announces new products and services, and then you have tons of sessions, by industry, by use case, more or less technical, ranging from research to very applied, and we do have a couple of sessions from our team as well. So if you're using GPUs, attend GTC. To avoid the clash, the sessions are always available in replay, and GTC, the main conference, is free: you can register and watch when you're ready. For instance, being in France, I can't watch everything live, so I just use the replays for what I'm interested in. It's really how to get the latest news, and it's not just about data science; as people know, NVIDIA is big in gaming and other applications where users have GPUs, so whatever your interests, if you use GPUs, that's the conference to attend. That actually seems like a good sort of
practical, diplomatic approach: if you're stuck trying to decide which conference to go to, since they're both virtual, you can register for both and then watch whichever sessions appeal to you on the recordings later on. So, to finish up, what are you most excited about in data science and machine learning? The meta answer is that it's evolving so fast that you have to learn all the time, and I like that. I don't know what will be hot next year, I don't know what will be doable; there is a frenzy about generative AI, etc., so I'm watching that, though I'm not sure where it will go. What excites me is the progress. I spoke a lot about RAPIDS, but this year I used it more than before, and the speedups are pretty incredible, so that's one thing. The other is that when I started, there was a clear divide between statisticians, classical quote-unquote machine learning, and deep learning, and now those barriers are being removed, maybe because everybody has moved to using Python. It's great when we see an unexpected use of one technique in a place where it was not used before. For instance, a colleague of mine won an image classification competition without training any deep learning model: he ran support vector machine regression on the predictions of pretrained deep learning models. That's surprising, and I like to be surprised all the time; that's what I love. That's a great answer, and I do think that having people pushed into different situations, like people moving to Python from a different language, or coming from statistics to machine learning, does throw up lots of interesting opportunities and innovation. All right, super, thank you for your time, lots of really great insights, and yeah, thank you for coming on the show, John Francis. Thank you for inviting me, I enjoyed it. You've been listening to DataFramed, a podcast by DataCamp. Keep connected with us by subscribing to the show in your favorite podcast player.
Please give us a rating, leave a comment, and share the episodes you love; that helps us keep delivering insights into all things data. Thanks for listening, and until next time.