TWiML & AI x Fast.ai Deep Learning v3 2018 Study Group – Session 2 – November 3, 2018

The Importance of Statistics in Deep Learning: A Discussion with Jeremy and Theresa

Statistics play a crucial role in deep learning, particularly the normalization statistics used when training models on new datasets. As discussed in the session, pre-trained models with transfer learning can deliver strong initial performance, but it's essential to consider their limitations and potential biases. For instance, if you're working with images that differ significantly from those used to train the model, such as satellite images, it may not be possible to leverage the pre-trained weights effectively.

One key consideration is the batch size used when computing normalization statistics from your own data. Rather than scanning the whole dataset, fastai estimates the statistics from a single batch, since the estimate does not need to be exact. One suggestion from the forums was to temporarily increase the batch size to the maximum your hardware allows, just for computing the statistics, to get a more accurate estimate; in practice a single moderately sized batch is usually sufficient, and the batch size can be restored afterwards for training.
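As a minimal sketch of this idea, here is how per-channel statistics could be estimated from a single batch; `loader` is an assumed PyTorch DataLoader, and everything here is illustrative rather than fastai's internal implementation:

    import torch

    # Minimal sketch: estimate per-channel normalization statistics from one
    # large batch. `loader` is an assumed PyTorch DataLoader yielding
    # (images, labels) with images shaped (batch, channels, height, width).
    def batch_stats(loader):
        images, _ = next(iter(loader))
        mean = images.mean(dim=(0, 2, 3))  # one value per channel
        std = images.std(dim=(0, 2, 3))
        return mean, std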

Another important aspect is the choice of model architecture. When training from scratch, it's essential to consider the number of layers and the complexity of the model. As came up in the discussion, even the first few layers of a pre-trained network, which learn generic features such as edges and gradients on a large dataset like ImageNet, can carry some performance over to very different data. Training everything from scratch, by contrast, can be computationally intensive and time-consuming.

In contrast, using pre-trained models and transfer learning can significantly reduce the training time and computational resources required. That said, one study group participant reported training a ResNet-34 from scratch to good performance in roughly an hour and five epochs, which is impressive given the size of the dataset. However, as was pointed out in the discussion, transfer learning assumes that the new dataset shares characteristics with ImageNet, which may not always be the case.

The Impact of Data Quality on Model Performance

One of the key challenges facing deep learning models is the quality and diversity of the data used for training. Pre-trained models can alleviate some of these issues, but it's essential to consider their potential biases and limitations. For instance, ImageNet may overrepresent certain demographics, such as white individuals, which could affect the performance of models applied to more diverse populations.

This issue is particularly relevant in competitions like Kaggle's Inclusive Images challenge (run by Google), where pre-trained weights are not allowed: participants train their models on images from one part of the world and are evaluated on images from others. The restriction exists precisely because ImageNet's built-in biases would blur the comparison, and it highlights the challenges deep learning models face when working with significantly different data.

The Role of Transfer Learning in Deep Learning

Transfer learning has become an essential tool in deep learning, particularly for tasks where large training datasets are unavailable. By leveraging pre-trained models, researchers can accelerate training and focus on fine-tuning for a specific task or dataset. As noted above, however, this approach assumes that the new dataset resembles ImageNet, which may not always be the case.

In contrast, starting from scratch can require significant computational resources and time; as was noted in the discussion, training a large model from random weights can take weeks, depending on the complexity of the task and the size of the dataset. However, this approach allows for more flexibility and adaptability in model design, particularly when working with significantly different data.

Principal Component Analysis: A Tool for Reducing Dimensions

One potential solution for datasets with many input dimensions is principal component analysis (PCA). This technique reduces the number of dimensions while retaining the essential structure of the data. In the session it was suggested, for example, for satellite imagery with four or five spectral bands (including infrared channels): PCA could compress those bands down to three channels so that pre-trained RGB models become applicable.
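As a rough illustration of the idea, here is a minimal scikit-learn sketch; the data is a random placeholder standing in for, say, per-pixel band values:

    import numpy as np
    from sklearn.decomposition import PCA

    # Placeholder data: 10,000 pixels with 5 features (e.g., spectral bands).
    X = np.random.rand(10_000, 5)

    pca = PCA(n_components=3)          # keep the 3 strongest components
    X_reduced = pca.fit_transform(X)   # shape: (10000, 3)

    # Fraction of the original variance each kept component explains.
    print(pca.explained_variance_ratio_)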

While PCA can be an effective dimensionality-reduction tool, it's essential to consider its limitations: as a linear method, it may not capture subtle, non-linear relationships between features, which could affect model performance. With careful application, however, PCA provides a valuable framework for understanding and working with high-dimensional datasets.

In conclusion, statistics play a critical role in deep learning, particularly when working with new or significantly different datasets. By considering the limitations and potential biases of pre-trained models, researchers can leverage transfer learning effectively while adapting to changing requirements and data characteristics. Techniques like PCA offer one practical route for bringing unusual, high-dimensional data within reach of existing models.

"WEBVTTKind: captionsLanguage: enall right so just to give you an introduction build this little criminal gang it up and win with the phosphate course so last week we had met and discussed the lesson one I will give a cap of the previous Vito and today we move to discuss lesson two along with three mini presentations I'll be talking about fast or a camera framework which is something that we form a virtual study group in she a virtual setting up a trip and Vijay will be talking about the learner apiary so to speak which is like fast histories actual done into the API and he'll be talking about all of the things that go on inside there and Christian had shed His image downloaded strip or doing the last Meetup and he'll be giving a mini presentation on it today so today's a second meter and if you aren't a member of the familiarize flag pleased to join using the link it's it's tearing those slides I will just say the link to the slides off the meter and the study groups happens 9:00 a.m. PST every Saturday you welcome to join anyone is welcome to join and will be rotating hosts so today I will be presenting and if you feel there any mistakes in the presentation so I am the person to be blamed today and we're trying this approach that we preferring to talk about some applications over the theory because that's what we observed during a last bit of that or talking about some mini projects for many presentation so if you want to present any idea or any project please reach out to Kai and it could be any any simple project a five-minute presentation would even good so you very welcome to present and this is a rough agenda for the meetup or vivir we meet every Saturday or even the Saturday even one cycle event we don't have the first not a lecture we will be doing a meet-up which is the eighth of the symbol so just to give you Z cap or we had this discussion of the lesson one during a previous meter Sebastian had presented presented by torch and Faso ti epi is overview and if you remember kya had done a wrench versus hammer real time image classification which is pretty cool Christian had shared his tool to scrape images from multiple sources Google Bing I'm sorry and I had briefly spoken about the north hot dog app like I am trying to build along with a few members from the ECR virtual study group so this is the overview that Sebastian has presented and today's agenda is I try to cover the second lecture first oh so that and after that I will move to do further discussions and the many presentations because last time we had we were slightly short on time any questions so far any questions on the previous meet about anything you want to discuss yes I just I do on the slack someone asked for they think to watch the the presentations and we were asked not to give that link out but my confusion just based on what you said is that if someone belongs to this like group that I should be able to do that all or no share the link to watch the classes that's what the guy was asking for but he would he did it black so he was already part of the group yeah like we couldn't boy so I need the code of conduct which is black we know allowed to sale youtube link outside of us total B 3 so that's why we don't suggest sharing the link outside I think the slack group puke anybody can join the slack but it doesn't mean that you are registered because you're in flag does it mean you have permission okay that's what I needed to know thank you is just another quick question so I can see that this is being recorded is 
the recording made available afterwards? For example, I missed the first one. Answer: yes, it will be on the YouTube channel.

Moving on, I'll give a brief overview of lesson two, because most of it was pretty easy-paced this time; feel free to interrupt with questions. Jeremy talked about downloading images from Google: with two lines of code you dump into your browser's inspector, you can collect image URLs and download them. He showed a teddy bear versus grizzly bear versus black bear classifier, and then a widget developed by the San Francisco study group for cleaning noisy data. It's a pretty cool tool: it surfaces the highest-loss or misclassified labels, and you can delete images that are actually mistakes, so it's definitely worth checking out. There was also a mention of combining a human expert with the model to improve accuracy: right now a human expert may still classify teddy bears versus grizzlies better than the model, and in real life, if I remember correctly, the lecture mentioned cancer prediction, where a domain expert had converted the data to images and then built a classifier on top. That's another way to improve a model's accuracy.

After that came how to pick the learning rate effectively. With too low a learning rate, your model converges very slowly: the error rate goes down, but very slowly. With too high a learning rate, your validation loss will probably shoot up to much higher values, in which case you'll want to reduce it. In my experience the right learning rate is found through experimentation, or by using the learning rate finder and its plot function inside fast.ai; it comes with practice, so I'd suggest building some basic toy models and getting a feel for it. Jeremy recommends 3e-3 as a good starting point; for the first couple of epochs he just used it without much thought, which is a little different from what he taught last time, but he mentioned that through the fast.ai experiments they found it a good starting value. I guess that's for pre-trained models; when training from scratch, creating the CNN with pretrained set to false, I used 0.02 for some reason and it worked well too, so this kind of range seems to hold. He also said that in a lot of cases it's not that important: if you're somewhere around 0.003, it's going to work for most cases.
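For reference, a minimal sketch of this lesson-two workflow with the fastai v1 API, roughly as shown in the lecture; the path and folder layout are placeholders:

    from fastai.vision import *

    # Build the data, train a pretrained ResNet-34, and pick a learning
    # rate; `path` is a placeholder pointing at the downloaded image folders.
    data = ImageDataBunch.from_folder(path, train='.', valid_pct=0.2,
                                      ds_tfms=get_transforms(), size=224
                                      ).normalize(imagenet_stats)
    learn = create_cnn(data, models.resnet34, metrics=error_rate)

    learn.lr_find()         # sweep learning rates over a short mock training run
    learn.recorder.plot()   # pick a value where the loss is still falling steeply
    learn.fit_one_cycle(4, max_lr=3e-3)   # 3e-3 is the suggested default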
As far as downloading images from Google goes: I did what Jeremy suggested in lesson two and it worked for me, but did anyone compare that to what we talked about last week in this study group? Is Jeremy's method better or worse? For those who weren't here last time: in terms of scraping images, I showed a Python package, I think it's called google-images-download; you'll find it if you google that name. It downloads images from Google search without having to open a browser, so you can run it on a remote machine. Jeremy showed a more manual way, where you scrape the links from the browser with a little JavaScript snippet, then load the links into a Python script and download the images. It's a bit more manual, but the advantage is that you open the browser and actually see the images first; in his teddy bear example you see a lot of teddy bears and know the image set is valid. If you do it completely remotely, you either have to scroll through the pictures afterwards or you don't know what's in the dataset, so for very specific searches that Google may not cover well, looking first probably gives you better images. When you're building the model you should experiment, and the package downloads the images into a separate folder per search term, which is also what you want, because separate folders let you split the dataset by simply passing folder arguments.

(Some audio and screen-sharing problems were sorted out here; the presenter switched off video to save bandwidth and re-shared the screen.)

Moving ahead: there was mention of Starlette and Flask as approaches for putting models into production behind a web framework, and there was a demo, if I remember correctly by Simon, who put an app into production during a hackathon; I checked it out, it's pretty cool, and I'd highly recommend it. The last part of the lecture was pretty slow-paced: a matrix multiplication refresher, and we built a stochastic gradient descent classifier from scratch. So that's my summary of lesson two.

Did anyone use the Starlette framework? I know Flask, and my colleagues use it for building web APIs, but I found Starlette very interesting, especially for data applications; I shared the link yesterday. It's very similar to Flask: just check the documentation, it won't take more than twenty minutes, it's very easy. If you're already familiar with Flask, is there some advantage to Starlette? I think you can do asynchronous stuff, so you don't have to block your program while waiting for a request to come back. Does anyone know the package Firefly? It basically wraps a RESTful endpoint over any Python function: if you have a predict function, you put a decorator on top and it gives you a predict endpoint. I came across it at PyData Germany and was wondering whether anyone uses it; it seems to be pretty new.
Before we move ahead with the mini presentations, a question: I'm relatively new to deep learning, and there was some confusion about the error rates for validation sets and training sets, and Jeremy said at one point he had gotten them the wrong way around. Can someone explain in two minutes which error rate needs to be lower, and why? Answer: the error rate on the training set should be lower than on the validation set, and that's perfectly acceptable; that's the gist of that discussion. The common misconception is that a lower error rate on the training data compared to validation means overfitting, which Jeremy said is not the case; the reason is simply that your model is more accustomed to the training data, so that's perfectly normal.

Let me add my two cents; maybe I'm right or wrong, but I got this idea from Andrew Ng about three years ago. The idea is that the training set and the validation set should show roughly the same behavior: if the error on the training set is decreasing, the validation error should be improving as well. If the training set has more errors than the validation set, that means you're underfitting: your model did not fit the training data well, even though it does better on validation. How is that possible? Sometimes you're using dropout or other regularization (this comes in lesson four or five later on): you intentionally introduce noise into the learning process in order to generalize better, because you don't want to overfit to those training images only. So sometimes you have an underfit, where the training error is larger than the validation error. The other case is when training is much better than validation, when the training error is much lower; then you have overfitting, because you have fit more than what is required on the training set. You might think that's good, but when you deploy the system it will see new images, and it can't recognize them as well as the training data. Ideally both should be going down together; when the training error keeps getting better and the validation error doesn't, that basically means the network is memorizing the training set and not generalizing.
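One way to watch for this in fastai v1, as a small sketch assuming a learner trained as above:

    # The Recorder attached to a fastai v1 learner keeps the loss history,
    # so you can look for the divergence described above: training loss
    # falling while validation loss climbs.
    learn.recorder.plot_losses()

    # The per-epoch metric values (e.g., error rate) are recorded too; a
    # steadily worsening validation error is the practical overfitting signal.
    print(learn.recorder.metrics)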
Jeremy also said that he tried it and it's pretty hard to get ResNet to overfit, so it's not that prone to overfitting. He specifically mentioned that to detect overfitting you watch your error rate, which is always computed on the validation set: if it keeps getting worse after each epoch, you're overfitting. Comparing training loss and validation loss doesn't necessarily tell you that, but a worsening error rate does, and it's pretty hard to achieve on ResNet. One more quick distinction: you can have cases where your training error is going down and your validation error isn't going down as fast but is still going down; in that scenario you're still at least partially generalizing, and your error is improving. The point Jeremy tried to make, or maybe I heard this somewhere else and I'm misrepresenting it, is that it's when the two start to diverge and move in different directions that you're overfitting, and that's when your error starts going up. It's not that every time the training error is lower than the validation error you have an overfit model. I have the same understanding: you can compare training and validation loss as you go along and worry about which is larger, but if the error rate is tending to decrease, maybe you don't need to worry too much. It's when your error rate ultimately starts increasing that there's a problem and you're no longer generalizing. This time Jeremy really presented the error rate as your key measure.

Moving ahead, I'll quickly talk about the mobile framework and then we'll move on to the other mini presentations. I mentioned this Not Hotdog app during our previous meetup, and, full credit, this isn't my work, but I want to highlight that we now have Not Hotdog working on an Android device. The GitHub repo link is in the slides. To give a brief overview: we exported a PyTorch model to ONNX and ran it on an Android phone with the Caffe2 backend. Facebook provides boilerplate Android app code for this, and using that, Cedric was able to run the retrained SqueezeNet model on his Android phone. If this interests you, I'd suggest checking it out; there's a video demo as well. Just to give you an idea of why we're doing this: instead of deploying to the cloud, why not build an app and try something on your phone? Some mobile phones are powerful enough to run basic models, and the Not Hotdog app is a proxy to check, or benchmark, how good the phone is. Question: what are your experiences with the ONNX side? I was interested in using it as well but never have; is it straightforward, or do you have to be careful? Answer: again, this isn't my work, but Cedric did mention it was pretty straightforward, because Facebook provides the app
boilerplate code to just import the model; the Caffe2 part is pretty dicey from what I could make out, because the Caffe2 binaries are the problem. Okay, I'll stop sharing my screen, and I suppose Vijay is ready with his presentation. While we switch, there were a couple of questions in the chat; one was on how to join the TWiML study group, and Sun-young can hopefully provide the link in Slack.

Are you guys able to see my desktop? Great. So basically, we all did lesson one, and when we ran learn = create_cnn(...) with the data and the model and then went on to fit, we realized we could create world-class models. I wanted to understand what goes on behind those three lines of code that makes this so accessible, so in the notebook I went ahead and did a double-click on the fastai library. What we see is this: learn = create_cnn(...) takes the parameters data, the architecture, which in this case is ResNet-34, and the error_rate metric. What it basically does is create a ClassificationLearner, which is itself an instance of Learner. We only used three inputs here, but it can take all of these: the data, the architecture, the cut (we'll see why that's there), whether the model is pretrained or not, and a parameter called lin_ftrs, which lets us define the inputs and outputs of the additional layers that get added (we'll see which layers those are). There's also a parameter for dropout, which is fixed at 0.5 right now, though we can change it; then it creates a custom head and a split function, to split the model we create and initialize the weights on the split that we do.

Question: back to the previous slide. The fast.ai documentation is great, much better than it was, but it lists all the arguments for a function without explaining what each argument means; it just says "ps: dropout," for instance. Is that explanation available somewhere, or is it still something to be improved? Answer: I had to really dig to find out some of these things; it was not straightforward, but as you make the connections you start getting a hang of what they are. They're not described well yet. That's what I thought as well, thank you.
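To make the parameters concrete, here is a sketch of the call with the extras spelled out; the values mirror the defaults just discussed, and the signature is as I understood it from the fastai v1 source at the time:

    from fastai.vision import *

    # Sketch of create_cnn with the parameters discussed above made explicit.
    learn = create_cnn(
        data,                  # an ImageDataBunch
        models.resnet34,       # architecture; weights download on first use
        pretrained=True,       # False = same layer structure, random weights
        lin_ftrs=[512],        # sizes of the extra linear layers in the head
        ps=0.5,                # dropout; split across the head as [0.25, 0.5]
        metrics=error_rate,
    )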
Question: this is with regard to the video, where Jeremy talked about the ResNet-34 architecture and said, "this is just the architecture, not the weights, we're downloading." What does that actually mean? If we say "use this ResNet-34 architecture," is it a different way of predicting? That's where I got lost. Answer: I'm not too sure on that either, but I'll tell you what I do understand. PyTorch supports the transfer learning part: it has support for around ten to twelve different state-of-the-art architectures, including ResNet, Inception, DarkNet, SqueezeNet, and so forth, and they come with their layers and their weights. If you run the code for the first time and don't have the ResNet-34 weights downloaded, my understanding is they get downloaded into a certain folder, provided pretrained is true. By default pretrained is set to true, so you get the weights along with the model; if you don't want them, you set it to false and train from scratch. So the argument just specifies the architecture, but doesn't always imply you actually use the weights; you can override this and train from scratch while keeping the same layer composition. With the weights, it's transfer learning, because you build on the pre-trained model.

There was an interesting question about whether you could use those weights in a commercial product; I think it was in lecture one or two, or maybe on the forums. Jeremy said it hasn't been tested that much yet, because you're using weights which are some sort of intellectual property, and the question is whether you can use them in a commercial product with closed code. Without the weights you just have a skeleton of an architecture: it took some intelligence to think of and create it, maybe even partially automatically, but with random weights I wouldn't expect it to do very well. If you take the weights, the fully trained model, you're taking advantage of the weeks or months of training and the millions of images that the people who created the weights worked with, and you can tweak the last few layers' weights to do transfer learning. If you want to ignore all that work because you have a very customized dataset, then just take the architecture, which may still be a good architecture for you. Question: is there some source we can go to for an understanding of who owns the weights or the model and whether it can be used commercially? Answer: I think ResNet-34 is on GitHub, so it will have a license there; let me check. It's an interesting question, because even a custom dataset is often not enough: you need a custom dataset with a lot of data, and in a lot of cases transfer learning is the only way to get a model at all. Look at ImageNet: there are something like 14 million pictures; who has a dataset like that? Wasn't it in one of your interviews, Sam, that some companies are reluctant to share their data or have it embedded in trained models, because competitors would then get their custom data? And in some of the conversations on differential privacy, folks were talking about attacks that extract images out of a trained model. So there are definitely issues there.
The licensing issue has come up, but I don't think I've talked to anyone who has a definitive perspective on it; as Jeremy said, it's untested at this point. When we refer to "the weights," what we mean is weights from models trained on ImageNet, and because ImageNet has such a wide scope in terms of images, it's pretty good for transferring to almost any new image task; it depends on how closely your dataset's images correspond to ImageNet. But if someone trains a ResNet-34 on ImageNet nowadays, the only cost is basically energy; there's no intellectual property involved anymore, anyone can do it with three lines of code, so I don't know if this is an IP topic at this point. From a legal standpoint in the United States, though, you have to protect your IP: if you allow someone to use it, it becomes public and you lose ownership. That's why some people who own data don't want to share it; if they do, they lose ownership. It's like the recipe for Coca-Cola: as long as they keep it within a small group of people who signed agreements not to disclose it, they own it; if that's broken, they lose it. I agree, but ImageNet is publicly available, so anyone training ResNet on ImageNet is really just spending money on energy. From the licensing perspective it also depends what license the images themselves are under: with certain licenses you can use the material but you have to keep the same license, so you can't make it closed source. It's a bigger question.

Sure, so going forward: this slide is about what the Learner class is. As we saw, learn = create_cnn(...) creates a ClassificationLearner, which is an instance of the Learner class. The Learner class takes all of the inputs I've detailed on this side, and has the methods on the other side, of which the ones circled in green are what we used in lesson one: we used lr_range for producing the slice, the fit method, the unfreeze method, the save method, and the load method, but there are others as well. I also want to draw your attention to the one circled in blue, the opt_func input: it's a callable defining the optimizer function, which by default is Adam. If you look at the lesson code, we never define a loss function or an optimizer; it's already in the code that if nothing is defined, the optimizer function is Adam. Moving forward, there are additional functions that get called along the way. The first is the creation of a "body": this is where we retain the architecture, in this case ResNet-34, minus some of its final layers.
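In plain torchvision terms, and this is a sketch rather than fastai's exact code, the body construction amounts to something like this:

    import torch.nn as nn
    from torchvision import models

    # Take a pretrained ResNet-34 and drop its final two modules (the
    # average pooling and the fully connected classifier), keeping the rest.
    resnet = models.resnet34(pretrained=True)
    body = nn.Sequential(*list(resnet.children())[:-2])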
Then the "head" is where we create custom layers that get stuck on top of this body, where we removed the layers; these new layers allow the model to fit the number of classes we want to predict for our dataset, which in lesson one is the Pets dataset with its various breeds of dogs and cats. There are also functions to determine the number of features, the highest feature count in the ResNet-34 architecture, and functions to decide where to split the model depending on the architecture: for ResNet there is a resnet split function, plus metadata that determines which layers to cut for each architecture, which is what the resnet meta provides. We'll see what all of these mean as we go through the functions. I also want you to look at this piece of code in the Learner class, a method called __post_init__: the line I've circled says that the learner's loss function is the loss function you define, or, if none is defined, the loss function of the data.

Let's see how this plays out in the notebook. I'm going to run the code here, which creates an ImageDataBunch and runs show_batch, and if you come down here, you can see there is a loss function already defined for the data, called cross-entropy, in other words the negative log likelihood loss. I had a hard time figuring out where this gets set when the data is created; I couldn't find it in the documentation, but in one of the checkpoints of the docs I found that there's a default loss function of cross-entropy. Exactly where it's set I'm still figuring out; in any case, it's the PyTorch function. The next set of things is basic PyTorch: the data holds a train dataset and a validation dataset, so you can inspect their x and y, see the number of images in each, and then the train and validation data loaders that pass batches through during training. Then we'd run the learner; I won't run fit_one_cycle because that will take time. Let's now go into the code.
Question: one quick thing, could you go back to the point you were making about the batch size? I was a little confused by those numbers, the 64 and then the 92 and 12. Answer: sure. The way PyTorch works, it has a Dataset and a DataLoader. The dataset is your x's and y's, your inputs and your targets, and the data loader sends the data out in batches; there's a data loader for the training set and one for the validation set. Here you see 92 for the training set, whereas the batch size we set was 64. The 92 is the length of the data loader, which is the number of batches, not the length of a particular batch: when you apply len() to data.train_dl, it gives you the number of batches in the entire data loader. What's the batch size in this case? We can calculate it: the training set has 5,873 images, and 5,873 divided by 64 gives 91 full batches plus a last batch holding the leftover images, since it doesn't divide evenly, for 92 batches in total. For validation the batch size is twice that: the validation set has 1,517 images, and dividing by the 12 batches shown gives about 126 per batch, which is twice 64.
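As a quick sketch of that arithmetic (the dataset sizes are the ones read off in the notebook):

    import math

    # len() of a DataLoader counts batches, not images.
    train_images, valid_images, bs = 5873, 1517, 64
    train_batches = math.ceil(train_images / bs)        # 92: 91 full + 1 partial
    valid_batches = math.ceil(valid_images / (2 * bs))  # 12: validation uses 2x bs
    print(train_batches, valid_batches)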
Okay, so let's go into the code itself. This is the code behind the creation of the classification learner. The first part obtains the metadata of the architecture; that's what the CNN config step does. Given the metadata of a model, it fetches what's registered for that particular model, in this case ResNet-34, and if nothing is available it picks up a default meta. These are dictionaries, and dictionaries in Python have a method that returns the value for a key if one exists, or a default value otherwise. If we run this code, you'll find the details registered for ResNet-34: a cut of -2, meaning the last two layers of the model are to be cut, and, for splitting, a function called resnet split. So we take the cut of -2 from the meta, define our model, and if I load this model with pretrained=True I get the complete model built for me; this is the complete structure of the ResNet-34 model, given layer by layer.

Moving forward: we get the layers of the model, and this line says give me everything but the last two layers, as a list. Then we take that and make a model out of it: nn.Sequential is the PyTorch way of building a model from a set of layers, so we pass the list and recreate the model without the last two layers; that result is the body. So now you have all of ResNet-34 minus the last two layers. Then we want the maximum number of features in this body, so we flatten it: the stacked modules are flattened into a plain list, and then we reverse the order. The network starts with 64-feature layers and ends at a final 512, so after reversing, the 512 comes first, and when we ask for the number of features we get 512, the largest feature count in the architecture. Then we multiply it by two; this is something I could not understand, but we multiply by two to get 1,024. My guess was that the layer just before the two we removed must have something to do with 1,024. Then we create a custom head, basically two more layers to accommodate our dataset: we have 37 classes, so we need to go from 1,024 inputs in the first layer, to 512, and then to 37 as the final output.
To give you the full picture: the layers we add start with an AdaptiveConcatPool2d layer and then a Flatten layer, and then we use the lin_ftrs parameter. We stack up three numbers: the feature count we got multiplied by two, then the 512 intermediate features, then the number of classes we're predicting for this dataset. The dropout value of 0.5 becomes a list with two values, 0.25 and 0.5, and we define two activations: ReLU for the first linear layer and None for the second, because that's where we predict the classes. So we take the AdaptiveConcatPool2d and Flatten layers, split the lin_ftrs into two linear blocks, 1,024 by 512 and 512 by 37: to the first we add a dropout of 0.25 and a ReLU activation; to the second we add a dropout of 0.5 and no activation. We make that into a layer group called the head, and then join the head and the body together to create the model for our learner. We wrap that in a ClassificationLearner, deriving all the properties of the Learner class, and then split it using the split function from the meta, the resnet split, which separates the model into groups, the last group being the new head. I don't fully understand the significance of the other split points, but the significance of the last group is this: take those layers and apply the weight initialization called the Kaiming initializer. You need to initialize the weights for these new layers, because even with pretrained=True, we removed the last two layers and added fresh ones, so they need weights. If you've seen version 1 of part 2 from last year, Jeremy spent some time explaining the Kaiming initializer and why it's specifically suited to ReLU activations; and remember, the 1,024-by-512 layer uses a dropout of 0.25 and a ReLU activation, so we apply the Kaiming initializer to get going. That's how we get the learner object, and then you use fit and the other methods from there. That's what the whole mini presentation is about.
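Here is a plain-PyTorch sketch of the head just described (simplified: fastai's actual head also inserts batch-norm layers). It also resolves the "multiply by two" puzzle above: AdaptiveConcatPool2d concatenates average and max pooling, so the body's 512 output features become 1,024 head inputs.

    import torch
    import torch.nn as nn

    class AdaptiveConcatPool2d(nn.Module):
        def forward(self, x):
            avg = nn.functional.adaptive_avg_pool2d(x, 1)
            mx = nn.functional.adaptive_max_pool2d(x, 1)
            return torch.cat([avg, mx], dim=1)  # doubles the channel count

    head = nn.Sequential(
        AdaptiveConcatPool2d(),
        nn.Flatten(),                          # (N, 1024, 1, 1) -> (N, 1024)
        nn.Dropout(0.25), nn.Linear(1024, 512), nn.ReLU(),
        nn.Dropout(0.5), nn.Linear(512, 37),   # 37 pet breeds, no activation
    )

    # The new layers carry no pretrained weights, so they are initialized;
    # fastai uses Kaiming initialization, which suits ReLU activations.
    for m in head.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight)

    # Combined with the body from the earlier sketch:
    # model = nn.Sequential(body, head)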
Comment: I think the shape of this head network isn't really explainable from a math standpoint; it's just how you build a fully connected network for classification. You take the features out of the ResNet part, the first layer is wide, and then it narrows down to your number of classes. Look at the VGG top network, among the first such designs: after the convolutional features it goes into fully connected layers that narrow down to the number of classes. I don't know if there's a good mathematical explanation, but that's what you see everywhere, and logically it's the best way to understand it. By the way, for this type of understanding, if we dive inside the library, shouldn't we learn PyTorch? I think that will serve us later, around lessons six and seven; I jumped into this without knowing much PyTorch. In part 2 of the course Jeremy is going to explain how he built this library, so that will probably dive into PyTorch; even in the last lesson of the previous version he did CIFAR on PyTorch from scratch. This was maybe a bit advanced for lesson two, which was more about basic SGD in PyTorch, but very interesting. The folks in the chat are asking whether you can share the slides for this presentation, because it's worth revisiting. Absolutely, I can share the slides and the notebook.

Question: is there any value in reviewing the version 2 material of the course? I wasn't around for it. Answers: I did review it, and it made things a bit easier, but in my experience you don't need version 2 for what you're doing now. The library is very different from last time, a complete rebuild, so the concepts may help, but you can't draw much from the code; the focus has also changed a little. So stick with this course, unless you have a project that can't wait: I'm in the middle of finalizing my PhD and need to customize CNNs now, so I can't wait until March or April; I have to do it with the old fastai first. One thing you could do: the notebooks showing how they created the new library are available, and they pretty much explain how and why they did it and what decisions they made, so if you get stuck, the notebooks are there; that's a kind of compromise. There are no videos, of course, but you can go through the notebooks and ask questions on the forums, because comparing the two versions would be confusing; if you want to jump the gun you can always ask on the advanced forum. The dev notebooks walk you through how the library was built, and those are going to be, I think, the foundation of the part 2 course.

Okay, let me stop sharing, and I think Christian can present quickly before the end of the meetup. Sure, I can quickly mention it; I don't really have a full presentation anyway. Let me grab the screen. Can you see that? Okay. I'll just mention the little library thing that I built to quickly assemble datasets for the course. It's not much, but it may be helpful to some, and I wrote a brief blog post about it. It's two scripts. The first one is a bit more involved: it wraps icrawler, a Python package that can crawl Google, Baidu, or Bing for images and other things. I took that and wrapped a file input with search terms around it. You can install it from GitHub; I haven't uploaded it to PyPI yet, but I might once it's a bit cleaner. You get two scripts: fcd, for fast-class download, and fcc, for fast-class clean. With the first one you provide a CSV file of two columns: the first column is the search term you would type into Google Images, and the second is any terms that should be excluded when building the class name. In my example these were guitar models: if I want to search for Gibson Les Paul guitars, I need the "guitar" keyword in the search, because Les Paul is a person and otherwise I get loads of pictures of people, but I want to exclude "guitar" from the class name that it generates.
Okay, so that's the second column, and the second row is Gibson SG; it goes off and pulls the images into individual folders. You can also specify the number of images; I defaulted to something like a thousand images per class, which can take quite some time, but you can make it smaller. Basically you just issue this command line: you can crawl all engines, or just Google, Baidu, or Bing, write into a guitars folder, and use these search terms, and off it goes. You can also specify that it should resize to a certain size already, for example 299, with another command flag. Once you have the images, it's a good idea to clean them, because I found that, especially with higher image counts, the further down the search results you go, the worse they get. Previously I just used the Finder on macOS to go through them, but I wrote a tiny tkinter GUI to do it; that's the fcc script. It gives you a window and you can quickly use keys to either grade images or mark them for deletion. For instance, for myself I classified four grades of quality: one was a good, full guitar picture that I want to use; two was just the body of the guitar, still enough to distinguish it; three and four I used for, say, just a headstock, or a shot from the back, or whatever. I could filter on that information later; you can also leave it out. You can mark images for deletion if they're junk by pressing a key, and when you're done you press X and it writes a report file and copies the files into a new folder structure with the ones marked for deletion taken out. It's nothing big, but it's pretty fast, because whenever you press a key it already shows you the next image. Question: so it's like the google-images-download shown last time, on steroids? Answer: I didn't really look at that one, but it's probably similar, and it's reasonably quick to go through. And you don't have to install all the Chrome stuff; that's why I went for this icrawler thing, because the dependencies are really minimal, and tkinter is included with Python; I thought about using Qt, but tkinter does fine. Question: why does it pull from Bing and Baidu as well? Answer: if you specify -c all it takes all three; -c google, -c bing, or -c baidu select one, and if you run the script name with --help it shows the options. Question: are you supporting this tool, on Slack? Answer: it's pretty rough, but it got my datasets done, and now that it's been mentioned I'll clean it up a bit more; happy to get input, especially on the GUI, because that was really kind of a hack, but it works. What I used the grade classes one to four for: I wanted to prepare different datasets, so I said, okay, maybe I take only the dataset with the very good pictures, then another with grades 1 and 2, and another with 1, 2, and 3, and see if that changes the results.
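To give a sense of what the script wraps, here is a minimal icrawler sketch; the folder and search term are illustrative:

    from icrawler.builtin import GoogleImageCrawler

    # Crawl Google Images for a search term, saving results into a folder,
    # with no browser needed; swap in BingImageCrawler or BaiduImageCrawler
    # for the other engines.
    crawler = GoogleImageCrawler(storage={'root_dir': 'guitars/gibson_les_paul'})
    crawler.crawl(keyword='Gibson Les Paul guitar', max_num=200)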
You can get a lot of images this way, though there is some download restriction; I don't know the exact limits, but I think it's something like 8,000 images maximum, otherwise they block your requests. For instance, I pulled 26 or 27 classes, and each got around 1,200 images or more; that's plenty, more than you can look through. For the first exercise I restricted it to 11 classes, because I got tired of pressing the buttons and checking everything. Let me know if it's helpful, and when I write some tests and make it a bit nicer I'll probably put it on PyPI. Comment: I used your tool last week; I found it pretty cool. Excellent. You can get it on GitHub, and for the moment you install it with pip install pointing at the GitHub path plus the egg fragment, and then you get the commands.

There was one more presentation planned, but we'll move it to next week because we're out of time. If you want to show something, anyone, just reach out to any one of us TWiML hosts; we'd really appreciate it. There's a lot of stuff in the chat that we didn't have a chance to address; I'll drop the chat transcript into Slack after we're done. Thank you, everyone.

Just one question, if you have a few minutes: there's an interesting thread I found in the forums; let me share my desktop. It says that if you have a dataset that is very different from ImageNet, say satellite images, X-rays, or other medical imaging, you have to handle the normalization differently. People were asking how to normalize the data in that case, and the answer is to call normalize() without anything inside the brackets, instead of the usual normalize(imagenet_stats): just leave it empty. An interesting point someone made is that this computes the statistics of your dataset, but not over the whole dataset, only over one batch, because it's not very important to get a very accurate statistic; it's enough to take one batch and calculate the statistics from that. Someone suggested that, only for calculating the statistics, you increase the batch size to the maximum you can; in the thread it was 2,000, just to get the statistics. Then I asked Jeremy: if you use your own statistics, your own mean and standard deviation, is it still okay to use a pre-trained model and transfer learning when the images are very different from ImageNet, or do you have to train from scratch? Jeremy said that if you're using a pre-trained model, you have to use the same statistics it was trained with. So if your data's statistics are significantly different from ImageNet's, you can't really combine custom statistics with transfer learning. I don't know exactly why, but it's an interesting thing to note: maybe in some cases, like satellite images, we need to not use transfer learning.
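In fastai v1 terms, the two options from the thread look roughly like this; `path` is a placeholder:

    from fastai.vision import *

    data = ImageDataBunch.from_folder(path, valid_pct=0.2,
                                      ds_tfms=get_transforms(), size=224)

    # Fine-tuning an ImageNet-pretrained model: normalize with the same
    # statistics the weights were trained with.
    data.normalize(imagenet_stats)

    # Training from scratch on very different data (satellite, medical):
    # no arguments means the stats are estimated from one batch of your data.
    # data.normalize()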
But wouldn't unfreezing compensate for that? Once you unfreeze, you can adjust the weights. I thought transfer learning is always better than nothing, because starting from scratch with random weights can take literally weeks; I would just use it, and with the unfreeze you can probably adjust for any difference. Well, there's a competition on Kaggle now, from Google, where the rules do not allow you to use pre-trained weights, and the reason is that ImageNet can have biases and all that stuff baked in; you don't really know what's in ImageNet, so if you decide to use it, you use it with all the biases built in. So there can be reasons to train from scratch. That's why I tried starting from scratch, and maybe I did something wrong, but it basically trained to very good performance in about five epochs; it didn't take long, with quite a lot of images, maybe an hour or so, and I used ResNet-34. Not the Planet competition; this is the Inclusive Images competition. The idea is that they give you pictures from different parts of the world in different stages: you train your model on set A, which comes from, say, the Americas, and then they give you another dataset that comes from elsewhere, and they want to see how your model does on the same kinds of pictures from a different part of the world. ImageNet would blur that a little, because it may already contain all those parts of the world, so they want that restriction. I think for that competition they have a bias-related reason: ImageNet has some biases, for example toward white people, and that's a reasonable concern. But what if I'm training on something very different from ImageNet where there's no such bias, like satellite images? I was thinking it's always better to start from something than from random weights: even the first one or two layers, no matter the colors, if there's any kind of gradient detector, even just at the very bottom of your model, that should still be better than white noise. Yes, though you'd need to retrain the entire thing quite a bit. Theresa: I was wondering the same thing, but maybe that's for another day: what happens if you have totally different data, like satellite images with four or five channels, carrying infrared and mid-infrared information instead of just RGB? Is there any way pre-trained models can help even though the inputs are totally different? That's something very interesting; I searched the forums and there aren't many discussions about it. You don't have to throw those channels out, though: maybe one thing you can do is principal component analysis, which helps you reduce the dimensions, so if you have more than three channels, perhaps you can bring them down to three and start from a pre-trained model. Anyway, we can
discuss things like that another time; I'd be interested. Okay, see you all next time. Bye!