From Zero to Hero in Two Lines of Code with ClearML

**The ClearML Serving Engine**

When using the KFServing engine instead of the Triton bare-metal engine, another layer is needed, but this is still built into the ClearML Serving offering. So far we have looked at the data science persona and transitioned into DevOps; next comes the collaboration model.

**The Collaboration Model: A Central Hub for Data Science**

So, let's dive into the collaboration model. The idea is that ClearML is the hub of everything; you can think of it as what software version control is for CI/CD in traditional software development. It becomes the hub of everything in data science because, unlike traditional software development, where version control stores everything we need, data science requires connecting not only the code but also the data and the parameters of a specific model training or inference run. That is exactly where an experiment management solution acting as a hub comes into play.
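To make the hub idea concrete, here is a minimal sketch of how a single ClearML task can tie together code, parameters, and data. The project, task, and dataset names are placeholders for this example, not values from the interview.

```python
from clearml import Task, Dataset

# Register this run as a task: ClearML records the git reference, uncommitted
# changes, installed packages, and console output for this piece of code.
task = Task.init(project_name="examples", task_name="collab-hub-sketch")

# Connect hyperparameters: they become visible and editable in the UI and are
# stored alongside the code version, so a clone can override them later.
params = {"epochs": 6, "batch_size": 64, "lr": 1e-3}
params = task.connect(params)

# Connect data: fetch a versioned dataset by project/name (placeholders here),
# so the exact data used by this run is traceable from the same task.
data_dir = Dataset.get(
    dataset_project="examples", dataset_name="mnist-sample"
).get_local_copy()

print(f"Training for {params['epochs']} epochs on data in {data_dir}")
```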

**Everything is Visible to Everyone**

Everything is visible to everyone. Developers get an overview of what is being served and used in production. ML engineers can continuously monitor what is happening in current development and fold it into the next generation or next versions of the pipelines. And DevOps gets full visibility into the resources being used, both in development, which nowadays is quite resource-heavy (training models, testing them, and so on), and in production, in terms of the resources and utilization consumed by model inference.

**Visibility and Control**

With all these metrics and usage data available to everyone, you can create even better visibility for your own organization. This is possible through a full REST API and a Pythonic API, so you can build your own applications on top of the platform, in particular custom dashboards and automation that integrates with the system.
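As a hedged illustration of what such a dashboard might pull, the sketch below uses the Python API to list the tasks in a project and read their latest reported scalars. The project name is a placeholder, and `get_last_scalar_metrics` is used on the assumption that it returns a nested dict of the most recently reported values.

```python
from clearml import Task

# Pull the tasks in a given project (placeholder name) to feed a custom
# dashboard; the same query is available over the REST API.
tasks = Task.get_tasks(project_name="examples")

for t in tasks:
    # Latest scalar values reported by each task, assumed to come back as a
    # nested dict of {title: {series: {"last": value, ...}}}.
    metrics = t.get_last_scalar_metrics()
    print(t.id, t.name, t.get_status())
    for title, series in metrics.items():
        for series_name, values in series.items():
            print(f"  {title}/{series_name}: last={values.get('last')}")
```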

**REST API: A Key Integration Point**

Triggering work over the REST API is very easy: you take a specific job, clone it, and send it for execution, and the entire pipeline is triggered; you can monitor it the whole way through.
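The same clone-and-enqueue flow, sketched with the Python SDK; the project, task, and queue names are placeholders, and the REST API exposes equivalent calls.

```python
from clearml import Task

# Look up a previously executed task by project and name (placeholders).
template = Task.get_task(project_name="examples", task_name="keras-mnist")

# Clone it: the copy carries the code reference, packages, and parameters,
# and every field is editable before execution.
cloned = Task.clone(source_task=template, name="keras-mnist (retrain)")

# Override a hyperparameter on the clone, then push it to an execution queue;
# an agent listening on that queue picks it up and runs it.
cloned.set_parameter("Args/epochs", 6)
Task.enqueue(cloned, queue_name="default")
```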

**Python as a Base Level Requirement**

At the end of the day, yes; Python pretty much rules this field. That said, there is a full REST API you can connect to, so at the application level you can build your own applications on top of the platform, especially custom dashboards and automation that integrates with the system.

**Integrations: Beyond Python**

For example, triggering a pipeline from your own web application becomes very easy over the REST API. For a deeper, development-style integration, though, the interface today is mostly Python.
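As a rough sketch of triggering a run from a web application over REST, the snippet below exchanges API credentials for a token and then clones and enqueues a task. The endpoint names (`auth.login`, `tasks.clone`, `tasks.enqueue`) and payload fields follow my reading of the ClearML REST reference and should be treated as assumptions to verify against your server version; the IDs and keys are placeholders.

```python
import requests

# Placeholders: your api server URL and an access key/secret pair.
CLEARML_API = "https://api.clear.ml"
ACCESS_KEY, SECRET_KEY = "<access_key>", "<secret_key>"

# 1. Exchange credentials for a short-lived token (assumed auth.login flow).
token = requests.post(
    f"{CLEARML_API}/auth.login", auth=(ACCESS_KEY, SECRET_KEY)
).json()["data"]["token"]
headers = {"Authorization": f"Bearer {token}"}

# 2. Clone an existing template task by ID (assumed tasks.clone endpoint).
new_task = requests.post(
    f"{CLEARML_API}/tasks.clone",
    headers=headers,
    json={"task": "<template_task_id>", "new_task_name": "triggered from web app"},
).json()["data"]["id"]

# 3. Enqueue the clone; an agent listening on the queue executes it.
requests.post(
    f"{CLEARML_API}/tasks.enqueue",
    headers=headers,
    json={"task": new_task, "queue": "<queue_id>"},
)
```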

**Customer Logos and Use Cases**

We've got some customer logos here; are there any particular use cases or verticals where you're seeing traction or a particularly strong fit? I think the first adoption came from customers with deep learning needs, mostly because of the complexity of spinning up GPU instances, managing them, and introducing orchestration, scheduling, and priority on top of them.

**Broadening into Traditional Machine Learning**

But these days we see this broadening into more traditional machine learning scenarios. Even in machine learning you want the flexibility of running some workloads on GPU and some on CPU; things get more complicated and overflow the scope of traditional Jupyter notebooks, so you need more automation and pipelines. We see more of those teams coming to our platform looking for solutions that encapsulate the end to end, starting from storing the data itself (see the sketch below).
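A minimal sketch of such a mixed CPU/GPU pipeline built from existing tasks, assuming the PipelineController import path used by recent SDK versions (newer releases also expose it at the package root). Project, task, and queue names are placeholders.

```python
from clearml.automation.controller import PipelineController

# Define a two-step pipeline from previously registered tasks.
pipe = PipelineController(
    name="etl-and-train", project="examples", version="1.0.0"
)

# Data preparation can run on a CPU-only queue...
pipe.add_step(
    name="prepare_data",
    base_task_project="examples",
    base_task_name="prepare-dataset",
    execution_queue="cpu",
)

# ...while training is scheduled on a GPU queue, after the data step.
pipe.add_step(
    name="train_model",
    base_task_project="examples",
    base_task_name="keras-mnist",
    parents=["prepare_data"],
    parameter_override={"Args/epochs": 6},
    execution_queue="gpu",
)

# The controller itself runs as a lightweight task (by default on the
# "services" queue) and drives each step through the scheduler.
pipe.start()
```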

**Automation and Pipeline Management**

That's where the automation and pipeline management, recently introduced into the system, come in. It's usually Jupyter notebooks, with pipelines built on top of them, that become transparent in the system. Even with Jupyter notebooks you don't actually have to store the notebook itself: by introducing those two lines of code, the notebook is converted into code.
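For reference, the "two lines of code" are the package import and the task initialization; the project and task names below are placeholders.

```python
from clearml import Task

# Initializing a task registers the run (code, uncommitted changes, packages,
# arguments, outputs). In a notebook, the notebook itself is captured as code,
# so it does not have to live in your git repository.
task = Task.init(project_name="examples", task_name="notebook-experiment")
```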

**Time-Saving for Traditional Machine Learning Companies**

This really saves you the time of managing Jupyter notebooks inside your Git repository, which is a huge value for traditional machine learning companies.

"WEBVTTKind: captionsLanguage: enall right everyone i'm sam cherrington founder of twiml and for this solutions guide spotlight video we're going to dig into the clear ml product with moses goodman co-founder and cto of allegro ai welcome moses welcome sam how are you wonderful looking forward to digging in uh let's jump right in tell us about clear ml so clearml is essentially an open source platform that starts from the development process goes through orchestration scheduling data management deployment and recently even serving as well everything is open first and it really tries to make the machine learning development and production into something like ci cd for software development dig into the open source aspect of the product a bit more is it open core with an enterprise offering built around it or what is the model okay so essentially everything is open source so really side and server side so you can run your own server including everything that said we do have some offerings for enterprise solutions that are not part of the open source they are open core so if an enterprise company as a client they have full access to the to the code itself so they can build on top of it even uh if it's not part of the um open source that is publicly available and are your users primarily operating uh within their own data centers or in the cloud or are the solutions yeah so this is a very good question so actually we have a variety so we have uh users doing hybrid um cloud is probably uh the most popular uh due to cost reductions uh but we do have full cloud and full pre on-prem depending on the type of data usually that they're working on um and based on that that's how they set up their servers and also obviously compute etc and how opinionated is the solution with regards to integrations and frameworks and that kind of thing so actually uh none at all we try to integrate with all the frameworks um that are out there if we missed anything then usually the committee helps us uh integrate and the idea is that the integration itself is as auto magic as possible so the integration is only two lines of code you basically import the python package and you initialize it and it does everything that it can in the background to siphon all the information from the frameworks that you're using so that you do not need to manually connect and integrate your code that means that every piece of code that you have can be very quickly integrated with the system so you have full transparency and visibility into what's going on inside your different steps in the processing pipeline development etc because at the end that information is always accessible from anywhere and then later can be automated awesome so can you show us these two lines of code yeah so let's do that let me just uh see if i can share my um pycharm screen so we can do exactly that so i have here um can you see my screen let me just zoom in yes it might be easier so this is just a straightforward uh tensorflow keras example training on amnest obviously a toy example the only addition that we have here is this line of code obviously we also import the package so that's the second line um and the other is basically your traditional keras model and tensorboard so what's going to happen is i'll be running this uh code what it will be doing in the background is it will connect to the backend itself authenticate so the authentication file is stored locally and it will basically log everything that it's doing so starting from the code itself or the 
git repository that i'm using including the uncommitted changes because we're all messy while we're developing um hyper parameters that we're using so arc our parts etc automatically siphoned um all the logs that are reported so think tensorboard map that kind of stuff console outputs and also the models that are stored are automatically logged by the system so at least i know what i did so once the system is done i'll get a link to my code that is running let's see if we can open that link so this is probably done so switching back to the ui uh we can see the experiment that we just executed so we can see the full code itself so if this is um without running script without github then we'll get the full copy of the code same goes for jupiter notebooks if we have a git repository you'll get a link to the git internal repository including any uncommitted changes we also have a log of all the packages that we used automatically meaning in the current rich environment that i'm using on my machine um and also i have all the configurations meaning arguments that i'm using dfd finds general configuration files etc and obviously i have my outputs so a few points from the tensorboard and some 3d plots because this is what it was reporting and with that i have everything that i did logged into one place with as minimal as possible in terms of integration and effort on maintenance so everything that i do will automatically appear here in real time so i don't need to worry about it i can still use my tensorboard and my map lib and any other jupiter lab that i'm using the idea is that i'll always be able to trace it back share it with my colleagues and have better visibility into the development process itself and so all those values that you're surfacing there are automatically pulled in from the runtime environment if you want to specify a specific uh values can you go beyond those two lines of code yes um so you can add as many parameters as you like from anywhere in your code so you can add more configuration files for example uh this one is a custom one you can add that one because this is part of the setup that you're using so it'll be logged here um same goes for any other dictionary of parameters so for example if you're using jupyter notebooks and you don't have arguments uh you can add any configuration arguments or objects into the logging system itself and obviously you can log manually any other scalars or plots that you want instead of the automatic interface great great so that's a quick high-level overview of using the products let's maybe dig into the components sure so this is basically the first step so the first step is how do we create um a job in the system so we just demonstrated how any piece of code becomes a job in the system once we have that line here that means that we have everything that we need in order to replicate that job so we have the arguments we have the setup we have connection to the data and we have the python packages and also the even the docker image itself so the next step is how do we duplicate it and how do we make it a part of the orchestration itself and that actually connects to the different engines that we have in the system so we have the orchestration and scheduling system that allows us to take those jobs and put them into a scheduler on top of any hardware um that we like so it really separates between the scheduling process where we want to allocate a resource with priority etc to the resource itself meaning the actual hardware so we can do any 
hybrid uh cluster in terms of compute so i can have an on-prem machines connected to the cluster as well as a kubernetes cluster as well as bare metal on the cloud without me as the user having to worry about where my code will be executed because it's fully abstracted by the orchestration solution essentially putting it into execution cues and those execution could use control the priority and abstracting the hardware itself then on top of it we have the different applications that use that mechanism basically using the ability to automatically schedule things change them and send them for the next steps and that's uh think of it as the your pipelines and your data ops meaning automating processes that are more data oriented et cetera and we'll get into that maybe later but the idea is um using those abilities allows you to build your own pipelines and workflows on top of it these are just applications that a simple application that we provide but you can always build your own uh so the first look at the product was from the perspective of the data scientist someone who's setting up the jobs what does it look like from the perspective of the ml ops person ml engineer someone who needs to uh to scale this so if the data scientists are creating the jobs themselves then the mls is all about automation creating pipelines etc so what i'll demonstrate now is how a pipeline actually is executed in in real time and then obviously when everything that i do in the ui uh the pipeline does automatically for us so basically programmatically i can do these same things and have logic built into it so i can better control this entire flow so for example what i will be doing now is i'll take my fully executed code here and i will just clone it which means i'm creating another copy of the setup itself so now everything is editable right so i can go here and i can edit even the code itself i can even edit the um python packages if i missed any i can specify um a specific docker image or parameters if i need to in our example what i'll do is i'll change some of the arguments yeah let's uh let's change the number of epochs to six for example and once i'm happy with the new settings i can send it for execution so sending for a job for execution is essentially putting it into an execution queue where we have multiple execution cues they can be on-prem they can be can implicitly say something about the hardware itself for example like the type of gpu or a number of cpus or where that hardware is located specific cloud providers etc but it allows us to be totally agnostic in terms of automation and interface so in for example tomorrow i switch to another cloud provider from the user perspective and the automation it's basically totally transparent i still put it into the same execution queue and then another machine will just pick it up for me so what i'll do now is i just put it into the default uh execution queue so if i switch into the cluster management part i can see that i have a single machine connected to my cluster obviously if i usually we have more machines this is a demo account and i can already see that this machine picked my job it's listening to the default queue and if i go into the queue section i can see all my queues i can see different priorities different jobs waiting in queue and i can even reorder them so i can control the priority per queue or per uh set of queues so now it's installing everything inside a specific docker probably the default docker because we never actually specified any 
specific docker um so it's setting up the environment it's cloning our code or putting our code into that specific environment installing all the different packages that we needed and it will write back on the task itself the entire setup so we can always replicate it package by package originally we only had the package that we used now we have an entire copy so we have full represent reproducibility of this model because we want to have the ability to trace back every model which code base created it parameters data that was used etc so we can better debug and better understand how to create a better um a model in the in the next time so let me just model uh that's created in the training process get its own docker container that is persisted in the system that is a good question so in order to avoid the docker mess that comes with creating a docker per experiment which means that you end up with thousands very quickly uh what we do is we reuse them we call it base docker image which means think of it as the base one as uh the the one that you downloaded from docker hub from nvidia for example like bass drivers packages etc and then in runtime we only install the difference which means uh just putting the code into it maybe some other packages that we're missing or different versions and then we run it and that way we you still have the ability to have your own customer dockers but you you're we're not spending um storage on just um docker images that were only used once we also introduced a lot of caching so this process of setup is only about 15 seconds to set up a docker with everything because everything is actually cached and does the ml ops owner have the ability to set up multiple of these base uh docker images say per team or project or what have you correct so um per task you can specify the docker image that that specific job will get so for example in this one since i never specified one um it received the default one which in this specific setup is nvidia cuda 10.1 which the emelops or devops engineer can control but if i create another clone of it i can override it with any other docker image in the new version you can also specify some set up bash script that will be executed when the docker is first spent so you can have more control and you really try to avoid another copy of the docker with only a single app package that is missing and this is for example one of the features that we received from the community that actually demanded that because they said hey it's already there just give us that specific ability and that's exactly why we edit it so now i think it's almost done running so we can also see that it fully monitors the machine itself so we have full resource monitoring so gpu cpu everything that's going on updated live so once everything is done we'll we will be able to see um the models that are produced by uh the remote execution so if i click on the model here so i can see that use an input model so i can again trace back uh the initial weights that it was using but i can also use the output model later if i want to go into deployment so this is the output one so in a second it will be done and then we'll be able to actually download the model itself and all is done without the code itself being aware that the models that are stored locally are automatically uploaded so again i don't need to do a lot in order to move from the development stage into the automation into the ml ops stage and then vice versa right looks like the data scientist doesn't have to 
necessarily be aware of docker or maybe even git if i heard you correctly some of the correct git versioning is done in the background or can be done in the background as well correct so even the uh git versioning itself is done in the background for you so as long as you're working on git it will just know your it will automatically catch your uncommitted changes so if you forget to commit your changes it's fine we'll be able to restore the entire local environment on the remote machine so you can run with your own magic numbers and and whatnot that you changed locally because that worked best up until full production it's all fully stored no one can touch it and so this is as good as almost as good as the version control and get and now we can yeah please i was just going to ask about we've been talking a lot about docker does is there an assumption or requirement or compatibility with kubernetes as the underlying container orchestration system um yes of course there is but it's not a must so actually everything is built um in a very modular way so even dockers are not a must so basically you can start with which environments or con environments if you're not a docker fan or that does not work in your specific setup then you can move into dockers then you move into installing agents on bare metal machines and then you can connect that into a kubernetes cluster whereas the kubernetes is spinning the dockers for you on your cluster and then you can do even more complicated when you have multiple docker multiple kubernetes clusters managed on top of this system basically connects to the system that's a nice kind of path of increasing sophistication complexity exactly so we want to make sure that we cater for everyone beginners to the most advanced users so now what we can see that we have um the final model here ready for um serving with actually everything um in inside that specific model so the network itself is stored and we have configurations and we can add labels if you want and the entire model itself is actually stored in the system but can also be stored on any other storage solution which is important for the system to separate the storage from the management so everything that is done in the background is actually management not essentially storage and that will connect maybe later to the data aspects of the system or data management aspects of the system the system itself will never actually store the data itself it allows our users to use their current um object storage solutions and just have in introduce the missing ingredient which is kind of the control plane and management on top of what we already have so you've got a model url there so the system's created an endpoint is that intended primarily for test rapid prototyping or is that um ready for production right um so this is uh still the intermediate result where you use it internally and then the um other part is what we call the clear mill serving that actually spins a service on top of it listening to those pub models that are generated automatically and then what it does it's actually like a side car next to a serving engine for example like they tried inserted serving engine and what it does it listens in the background to change sets in the project itself so let's assume i publish this model here so once i publish it it will automatically detect that a new model or a new version was published and it will automatically uh reconfigure the trend serving engine fetch it to the trenton serving um docker itself so i 
don't need to worry about it once i spin the service it will constantly keep my triton server up and running it updated without me having to manually update it with the threatened api and do you have any kind of control over how that's done in terms of you know rolling deployment blue green canary all that kind of stuff correct so um you can control how many versions you store you can control which exactly which exact models you want to spin there and you also have full control over canary abx thing um etc when you're using the kf serving um uh engine instead of the triton bare metal engine so you need another layer uh but this is still built into the clear model serving offering great so we talked a little bit about the data science persona and and we transitioned into uh devops tell us um about the collaboration model right so the idea is that um clear mill is kind of the hub of everything um you can think of it as in a way what uh the version the software version control is for ci cd in traditional software development this becomes the hub of everything when it comes to data science because as opposed to traditional software development where the version control actually stores everything that we need when we think about data science we need to not only connect with the code but also with the data and the parameters of the specific model training inference etc and that's exactly where this kind of experiment management solution as a hub comes into play everything is visible to everyone so the developers actually have an overview of what's being served and used in production the ml engineers have the ability to constantly moderate what's going on in the current development and insert that into the next generation or next versions of the pipelines and the devops they have full visibility on on the resources that are used either in development itself because nowadays development is actually quite resource heavy training models testing them etc but also in production in terms of the resources and utilization that are used in model inference and and usage usage itself with all the different metrics and uh use cases and usage available to everyone with rest api and pythonic api so you can actually build your own dashboards on top of it and create even better visibility for your own organization okay we talked about the some of the broad um you know integrations are you assuming python as kind of a base level requirement um at the end kind of yeah i guess it kind of rules this entire field so there is a full rest api and you can can connect to it so on application level you can definitely build your own applications on top of it with rest api especially when it comes to creating your own dashboards and integrating with the system in terms of automation etc so for example things like triggering a pipeline from your own web application so rest api becomes very easy you basically have a specific job you clone it and you send it for execution and you have the entire pipeline triggered and you can always monitor it but if we think about more of a development um uh integration scenario then currently um it's mostly python basically so yeah got it got it uh so you've got some uh customer logos here are there any particular use cases or verticals that uh you you know either are seeing particular traction in or you're seeing a particularly strong fit in right um so i think that the first adoption uh was from customers with um needs in deep learning mostly deep learning because of the complexity of 
spinning gpu instances managing them introducing orchestration scheduling and priority on top of it but these days we actually uh see uh this um broadening into more traditional machine learning scenarios where even in machine learning you want to have that flexibility of some on gpu some on cpu things are getting more complicated meaning uh it becomes out it kind of overflows the scope of traditional jupiter notebooks and you have to have more automation and pipelines and and we see more of those coming into our platform looking for solutions that are really can encapsulate really the end-to-end from storing the data itself that we recently introduced into the automation and over there it's it's usually jupiter notebooks and pipelines built on top of them which are uh transparent in the system so even jupiter notebooks you don't actually have to store the notebook itself by the introducing those two lines of code the jubilee notebook is converted into code so it really saves you the time of managing the jupyter notebooks inside your git repository which uh is a huge value for for traditional machine learning companies great uh any final points you want to close us out on um i guess what's important is the the flexibility and the open source of the system so even if you find something that is missing for you you can always build it or talk to the community and ask if someone else built it you'll be surprised to see that how many people build their own things on top of it just because it's fully open source and that they can actually have that flexibility and i guess the other thing is the flexibility in terms of hardware management and the ability to actually integrate the on-prem uh resources with the cloud resources as a huge value for a lot of our users because that means that they can really reuse the office resources as well as kind of move into the cloud without paying a lot of um price or any price in terms of development because the system actually abstracted everything from code through data so it's just essentially just another instance running their code great great well moses thank you so much a pleasure ciaoall right everyone i'm sam cherrington founder of twiml and for this solutions guide spotlight video we're going to dig into the clear ml product with moses goodman co-founder and cto of allegro ai welcome moses welcome sam how are you wonderful looking forward to digging in uh let's jump right in tell us about clear ml so clearml is essentially an open source platform that starts from the development process goes through orchestration scheduling data management deployment and recently even serving as well everything is open first and it really tries to make the machine learning development and production into something like ci cd for software development dig into the open source aspect of the product a bit more is it open core with an enterprise offering built around it or what is the model okay so essentially everything is open source so really side and server side so you can run your own server including everything that said we do have some offerings for enterprise solutions that are not part of the open source they are open core so if an enterprise company as a client they have full access to the to the code itself so they can build on top of it even uh if it's not part of the um open source that is publicly available and are your users primarily operating uh within their own data centers or in the cloud or are the solutions yeah so this is a very good question so 
actually we have a variety so we have uh users doing hybrid um cloud is probably uh the most popular uh due to cost reductions uh but we do have full cloud and full pre on-prem depending on the type of data usually that they're working on um and based on that that's how they set up their servers and also obviously compute etc and how opinionated is the solution with regards to integrations and frameworks and that kind of thing so actually uh none at all we try to integrate with all the frameworks um that are out there if we missed anything then usually the committee helps us uh integrate and the idea is that the integration itself is as auto magic as possible so the integration is only two lines of code you basically import the python package and you initialize it and it does everything that it can in the background to siphon all the information from the frameworks that you're using so that you do not need to manually connect and integrate your code that means that every piece of code that you have can be very quickly integrated with the system so you have full transparency and visibility into what's going on inside your different steps in the processing pipeline development etc because at the end that information is always accessible from anywhere and then later can be automated awesome so can you show us these two lines of code yeah so let's do that let me just uh see if i can share my um pycharm screen so we can do exactly that so i have here um can you see my screen let me just zoom in yes it might be easier so this is just a straightforward uh tensorflow keras example training on amnest obviously a toy example the only addition that we have here is this line of code obviously we also import the package so that's the second line um and the other is basically your traditional keras model and tensorboard so what's going to happen is i'll be running this uh code what it will be doing in the background is it will connect to the backend itself authenticate so the authentication file is stored locally and it will basically log everything that it's doing so starting from the code itself or the git repository that i'm using including the uncommitted changes because we're all messy while we're developing um hyper parameters that we're using so arc our parts etc automatically siphoned um all the logs that are reported so think tensorboard map that kind of stuff console outputs and also the models that are stored are automatically logged by the system so at least i know what i did so once the system is done i'll get a link to my code that is running let's see if we can open that link so this is probably done so switching back to the ui uh we can see the experiment that we just executed so we can see the full code itself so if this is um without running script without github then we'll get the full copy of the code same goes for jupiter notebooks if we have a git repository you'll get a link to the git internal repository including any uncommitted changes we also have a log of all the packages that we used automatically meaning in the current rich environment that i'm using on my machine um and also i have all the configurations meaning arguments that i'm using dfd finds general configuration files etc and obviously i have my outputs so a few points from the tensorboard and some 3d plots because this is what it was reporting and with that i have everything that i did logged into one place with as minimal as possible in terms of integration and effort on maintenance so everything that i do will 
automatically appear here in real time so i don't need to worry about it i can still use my tensorboard and my map lib and any other jupiter lab that i'm using the idea is that i'll always be able to trace it back share it with my colleagues and have better visibility into the development process itself and so all those values that you're surfacing there are automatically pulled in from the runtime environment if you want to specify a specific uh values can you go beyond those two lines of code yes um so you can add as many parameters as you like from anywhere in your code so you can add more configuration files for example uh this one is a custom one you can add that one because this is part of the setup that you're using so it'll be logged here um same goes for any other dictionary of parameters so for example if you're using jupyter notebooks and you don't have arguments uh you can add any configuration arguments or objects into the logging system itself and obviously you can log manually any other scalars or plots that you want instead of the automatic interface great great so that's a quick high-level overview of using the products let's maybe dig into the components sure so this is basically the first step so the first step is how do we create um a job in the system so we just demonstrated how any piece of code becomes a job in the system once we have that line here that means that we have everything that we need in order to replicate that job so we have the arguments we have the setup we have connection to the data and we have the python packages and also the even the docker image itself so the next step is how do we duplicate it and how do we make it a part of the orchestration itself and that actually connects to the different engines that we have in the system so we have the orchestration and scheduling system that allows us to take those jobs and put them into a scheduler on top of any hardware um that we like so it really separates between the scheduling process where we want to allocate a resource with priority etc to the resource itself meaning the actual hardware so we can do any hybrid uh cluster in terms of compute so i can have an on-prem machines connected to the cluster as well as a kubernetes cluster as well as bare metal on the cloud without me as the user having to worry about where my code will be executed because it's fully abstracted by the orchestration solution essentially putting it into execution cues and those execution could use control the priority and abstracting the hardware itself then on top of it we have the different applications that use that mechanism basically using the ability to automatically schedule things change them and send them for the next steps and that's uh think of it as the your pipelines and your data ops meaning automating processes that are more data oriented et cetera and we'll get into that maybe later but the idea is um using those abilities allows you to build your own pipelines and workflows on top of it these are just applications that a simple application that we provide but you can always build your own uh so the first look at the product was from the perspective of the data scientist someone who's setting up the jobs what does it look like from the perspective of the ml ops person ml engineer someone who needs to uh to scale this so if the data scientists are creating the jobs themselves then the mls is all about automation creating pipelines etc so what i'll demonstrate now is how a pipeline actually is executed in in real 
time and then obviously when everything that i do in the ui uh the pipeline does automatically for us so basically programmatically i can do these same things and have logic built into it so i can better control this entire flow so for example what i will be doing now is i'll take my fully executed code here and i will just clone it which means i'm creating another copy of the setup itself so now everything is editable right so i can go here and i can edit even the code itself i can even edit the um python packages if i missed any i can specify um a specific docker image or parameters if i need to in our example what i'll do is i'll change some of the arguments yeah let's uh let's change the number of epochs to six for example and once i'm happy with the new settings i can send it for execution so sending for a job for execution is essentially putting it into an execution queue where we have multiple execution cues they can be on-prem they can be can implicitly say something about the hardware itself for example like the type of gpu or a number of cpus or where that hardware is located specific cloud providers etc but it allows us to be totally agnostic in terms of automation and interface so in for example tomorrow i switch to another cloud provider from the user perspective and the automation it's basically totally transparent i still put it into the same execution queue and then another machine will just pick it up for me so what i'll do now is i just put it into the default uh execution queue so if i switch into the cluster management part i can see that i have a single machine connected to my cluster obviously if i usually we have more machines this is a demo account and i can already see that this machine picked my job it's listening to the default queue and if i go into the queue section i can see all my queues i can see different priorities different jobs waiting in queue and i can even reorder them so i can control the priority per queue or per uh set of queues so now it's installing everything inside a specific docker probably the default docker because we never actually specified any specific docker um so it's setting up the environment it's cloning our code or putting our code into that specific environment installing all the different packages that we needed and it will write back on the task itself the entire setup so we can always replicate it package by package originally we only had the package that we used now we have an entire copy so we have full represent reproducibility of this model because we want to have the ability to trace back every model which code base created it parameters data that was used etc so we can better debug and better understand how to create a better um a model in the in the next time so let me just model uh that's created in the training process get its own docker container that is persisted in the system that is a good question so in order to avoid the docker mess that comes with creating a docker per experiment which means that you end up with thousands very quickly uh what we do is we reuse them we call it base docker image which means think of it as the base one as uh the the one that you downloaded from docker hub from nvidia for example like bass drivers packages etc and then in runtime we only install the difference which means uh just putting the code into it maybe some other packages that we're missing or different versions and then we run it and that way we you still have the ability to have your own customer dockers but you you're 
we're not spending um storage on just um docker images that were only used once we also introduced a lot of caching so this process of setup is only about 15 seconds to set up a docker with everything because everything is actually cached and does the ml ops owner have the ability to set up multiple of these base uh docker images say per team or project or what have you correct so um per task you can specify the docker image that that specific job will get so for example in this one since i never specified one um it received the default one which in this specific setup is nvidia cuda 10.1 which the emelops or devops engineer can control but if i create another clone of it i can override it with any other docker image in the new version you can also specify some set up bash script that will be executed when the docker is first spent so you can have more control and you really try to avoid another copy of the docker with only a single app package that is missing and this is for example one of the features that we received from the community that actually demanded that because they said hey it's already there just give us that specific ability and that's exactly why we edit it so now i think it's almost done running so we can also see that it fully monitors the machine itself so we have full resource monitoring so gpu cpu everything that's going on updated live so once everything is done we'll we will be able to see um the models that are produced by uh the remote execution so if i click on the model here so i can see that use an input model so i can again trace back uh the initial weights that it was using but i can also use the output model later if i want to go into deployment so this is the output one so in a second it will be done and then we'll be able to actually download the model itself and all is done without the code itself being aware that the models that are stored locally are automatically uploaded so again i don't need to do a lot in order to move from the development stage into the automation into the ml ops stage and then vice versa right looks like the data scientist doesn't have to necessarily be aware of docker or maybe even git if i heard you correctly some of the correct git versioning is done in the background or can be done in the background as well correct so even the uh git versioning itself is done in the background for you so as long as you're working on git it will just know your it will automatically catch your uncommitted changes so if you forget to commit your changes it's fine we'll be able to restore the entire local environment on the remote machine so you can run with your own magic numbers and and whatnot that you changed locally because that worked best up until full production it's all fully stored no one can touch it and so this is as good as almost as good as the version control and get and now we can yeah please i was just going to ask about we've been talking a lot about docker does is there an assumption or requirement or compatibility with kubernetes as the underlying container orchestration system um yes of course there is but it's not a must so actually everything is built um in a very modular way so even dockers are not a must so basically you can start with which environments or con environments if you're not a docker fan or that does not work in your specific setup then you can move into dockers then you move into installing agents on bare metal machines and then you can connect that into a kubernetes cluster whereas the kubernetes is spinning 
the dockers for you on your cluster and then you can do even more complicated when you have multiple docker multiple kubernetes clusters managed on top of this system basically connects to the system that's a nice kind of path of increasing sophistication complexity exactly so we want to make sure that we cater for everyone beginners to the most advanced users so now what we can see that we have um the final model here ready for um serving with actually everything um in inside that specific model so the network itself is stored and we have configurations and we can add labels if you want and the entire model itself is actually stored in the system but can also be stored on any other storage solution which is important for the system to separate the storage from the management so everything that is done in the background is actually management not essentially storage and that will connect maybe later to the data aspects of the system or data management aspects of the system the system itself will never actually store the data itself it allows our users to use their current um object storage solutions and just have in introduce the missing ingredient which is kind of the control plane and management on top of what we already have so you've got a model url there so the system's created an endpoint is that intended primarily for test rapid prototyping or is that um ready for production right um so this is uh still the intermediate result where you use it internally and then the um other part is what we call the clear mill serving that actually spins a service on top of it listening to those pub models that are generated automatically and then what it does it's actually like a side car next to a serving engine for example like they tried inserted serving engine and what it does it listens in the background to change sets in the project itself so let's assume i publish this model here so once i publish it it will automatically detect that a new model or a new version was published and it will automatically uh reconfigure the trend serving engine fetch it to the trenton serving um docker itself so i don't need to worry about it once i spin the service it will constantly keep my triton server up and running it updated without me having to manually update it with the threatened api and do you have any kind of control over how that's done in terms of you know rolling deployment blue green canary all that kind of stuff correct so um you can control how many versions you store you can control which exactly which exact models you want to spin there and you also have full control over canary abx thing um etc when you're using the kf serving um uh engine instead of the triton bare metal engine so you need another layer uh but this is still built into the clear model serving offering great so we talked a little bit about the data science persona and and we transitioned into uh devops tell us um about the collaboration model right so the idea is that um clear mill is kind of the hub of everything um you can think of it as in a way what uh the version the software version control is for ci cd in traditional software development this becomes the hub of everything when it comes to data science because as opposed to traditional software development where the version control actually stores everything that we need when we think about data science we need to not only connect with the code but also with the data and the parameters of the specific model training inference etc and that's exactly where this kind of 
experiment management solution as a hub comes into play everything is visible to everyone so the developers actually have an overview of what's being served and used in production the ml engineers have the ability to constantly moderate what's going on in the current development and insert that into the next generation or next versions of the pipelines and the devops they have full visibility on on the resources that are used either in development itself because nowadays development is actually quite resource heavy training models testing them etc but also in production in terms of the resources and utilization that are used in model inference and and usage usage itself with all the different metrics and uh use cases and usage available to everyone with rest api and pythonic api so you can actually build your own dashboards on top of it and create even better visibility for your own organization okay we talked about the some of the broad um you know integrations are you assuming python as kind of a base level requirement um at the end kind of yeah i guess it kind of rules this entire field so there is a full rest api and you can can connect to it so on application level you can definitely build your own applications on top of it with rest api especially when it comes to creating your own dashboards and integrating with the system in terms of automation etc so for example things like triggering a pipeline from your own web application so rest api becomes very easy you basically have a specific job you clone it and you send it for execution and you have the entire pipeline triggered and you can always monitor it but if we think about more of a development um uh integration scenario then currently um it's mostly python basically so yeah got it got it uh so you've got some uh customer logos here are there any particular use cases or verticals that uh you you know either are seeing particular traction in or you're seeing a particularly strong fit in right um so i think that the first adoption uh was from customers with um needs in deep learning mostly deep learning because of the complexity of spinning gpu instances managing them introducing orchestration scheduling and priority on top of it but these days we actually uh see uh this um broadening into more traditional machine learning scenarios where even in machine learning you want to have that flexibility of some on gpu some on cpu things are getting more complicated meaning uh it becomes out it kind of overflows the scope of traditional jupiter notebooks and you have to have more automation and pipelines and and we see more of those coming into our platform looking for solutions that are really can encapsulate really the end-to-end from storing the data itself that we recently introduced into the automation and over there it's it's usually jupiter notebooks and pipelines built on top of them which are uh transparent in the system so even jupiter notebooks you don't actually have to store the notebook itself by the introducing those two lines of code the jubilee notebook is converted into code so it really saves you the time of managing the jupyter notebooks inside your git repository which uh is a huge value for for traditional machine learning companies great uh any final points you want to close us out on um i guess what's important is the the flexibility and the open source of the system so even if you find something that is missing for you you can always build it or talk to the community and ask if someone else built it you'll be 
surprised to see that how many people build their own things on top of it just because it's fully open source and that they can actually have that flexibility and i guess the other thing is the flexibility in terms of hardware management and the ability to actually integrate the on-prem uh resources with the cloud resources as a huge value for a lot of our users because that means that they can really reuse the office resources as well as kind of move into the cloud without paying a lot of um price or any price in terms of development because the system actually abstracted everything from code through data so it's just essentially just another instance running their code great great well moses thank you so much a pleasure ciao\n"