TWiML & AI x Fast.ai Deep Learning v3 2018 Study Group – Session 3 – November 10, 2018

The Power of Mel Spectrograms: Unveiling the Secrets of Sound

As we delve into the world of audio processing, it's essential to explore the various tools and techniques used to analyze and understand sound. One such tool is the Mel spectrogram, a graphical representation of the frequency content of an audio signal over time. In this article, we'll embark on a journey to uncover the secrets of Mel spectrograms and discover how they can reveal insights into the human voice.

The Role of Mel Frequencies in Sound Perception

The mel scale is rooted in how people perceive pitch rather than in physical frequency. It is roughly linear below about 1 kHz and logarithmic above, so equal steps in mels sound like equal steps in pitch. As a result, low frequencies are spread over more mel units per hertz than high frequencies: the step from 100 Hz to 200 Hz covers far more of the mel scale than the step from 8,000 Hz to 8,100 Hz. A Mel spectrogram groups the energy of an audio signal into mel-spaced frequency bands and shows how the energy in each band changes over time, which lets us see how different frequencies contribute to the overall sound.
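The relationship between hertz and mels can be sketched in a few lines. This is a minimal illustration, assuming the commonly used HTK-style formula m = 2595 · log10(1 + f / 700); other variants of the mel scale exist, and the function names are chosen here for illustration only.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in hertz to mels (HTK-style formula, an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 100 Hz step covers fewer and fewer mels as frequency rises.
for low in (100, 1000, 8000):
    width = hz_to_mel(low + 100) - hz_to_mel(low)
    print(f"{low}-{low + 100} Hz spans {width:.1f} mel")
```

With this formula 1,000 Hz maps to roughly 1,000 mel, which is the usual anchor point of the scale.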

The Effect of Lip Movement on Sound Production

When we speak, our lips, tongue, and vocal cords work together to produce a wide range of sounds. The movement of our lips plays a significant role in shaping these sounds, with different lip positions and movements changing the frequency content of the voice. In a Mel spectrogram, this shows up as specific frequency bands becoming more or less active as the lips move. By analyzing these patterns, we can relate what is visible in the spectrogram to how a sound was articulated.

The Relationship Between Lip Movement and Activation

One of the most striking features of Mel spectrograms is the way they display activation levels for different frequencies over time. When we speak, specific lip movements change these activation levels, creating a complex pattern that reflects the nuances of human communication. In particular, transitions from one sound to another are often separated by brief periods of silence or low activation. These gaps typically come from the speech itself, for example the brief closure of the vocal tract in stop consonants such as p, t, and k, or short pauses between words.

The Transformation into Mel Spectrograms

A Mel spectrogram is a transformation of the original audio signal that captures three quantities: time, frequency, and amplitude. It is usually drawn as a two-dimensional image, with time on the horizontal axis, mel frequency on the vertical axis, and amplitude (typically in decibels) shown as color or brightness. Because the result is an image, it can be inspected by eye or passed to an image model, which makes the patterns of activation in speech easier to study.
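To make the layout of a Mel spectrogram concrete, the following is a small dependency-free sketch. It is not how audio libraries implement the transform: it assumes a naive DFT, a Hann window, triangular filters, and the HTK-style mel formula, and every function name and parameter value is chosen here for illustration.

```python
import cmath
import math


def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (slow, but dependency-free)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    top_mel = hz_to_mel(sample_rate / 2)
    # n_mels + 2 edge points; filter m spans edges m, m + 1, m + 2.
    edges = [mel_to_hz(top_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(n_mels):
        left, centre, right = edges[m], edges[m + 1], edges[m + 2]
        bank.append([
            max(0.0, min((f - left) / (centre - left), (right - f) / (right - centre)))
            for f in bin_hz
        ])
    return bank


def mel_spectrogram(signal, sample_rate, n_fft=64, hop=32, n_mels=8):
    """Return a 2D list indexed as [mel_band][time_frame]."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / n_fft) for t in range(n_fft)]  # Hann
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(n_fft)]
        power = power_spectrum(frame)
        frames.append([sum(w * p for w, p in zip(row, power)) for row in bank])
    return [list(band) for band in zip(*frames)]


# Demo: a 200 Hz tone followed by a 1500 Hz tone, sampled at 4 kHz.
sample_rate = 4000
signal = [math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(256)]
signal += [math.sin(2 * math.pi * 1500 * t / sample_rate) for t in range(256)]
spec = mel_spectrogram(signal, sample_rate)


def loudest_band(frame_index):
    """Index of the mel band with the most energy in the given time frame."""
    return max(range(len(spec)), key=lambda band: spec[band][frame_index])


print(len(spec), "bands x", len(spec[0]), "frames")
print("loudest band at start:", loudest_band(0), "at end:", loudest_band(-1))
```

The result is a grid of mel bands by time frames in which each cell holds an energy value, which is exactly what gets rendered as an image. In practice one would normally rely on an audio library rather than a hand-written DFT.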

The Potential Applications of Mel Spectrograms

Mel spectrograms have numerous potential applications in fields such as linguistics, psychology, and computer science. For instance, they can be used to analyze the acoustic characteristics of speech sounds, shed light on the cognitive processes underlying language processing, or even develop more sophisticated speech recognition systems. Moreover, by applying these techniques to other audio signals, researchers may uncover new insights into music perception, emotion regulation, and social interaction.

The MFCC Transformation: A Logarithmic Scale

Mel Frequency Cepstral Coefficients (MFCCs) are a more compact representation derived from the Mel spectrogram. They are computed by taking the logarithm of the mel-band energies and then applying a discrete cosine transform, usually keeping only the first dozen or so coefficients. The log step compresses the dynamic range, and the cosine transform summarizes the overall shape of the spectrum in a small number of values. MFCCs are therefore much lower-dimensional than the Mel spectrogram, but they discard fine spectral detail and are harder to read visually, so the choice between the two requires careful consideration.
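The two steps, log compression followed by a cosine transform, can be sketched for a single time frame as follows. This is an illustrative sketch only: it assumes an unnormalised DCT-II and a natural logarithm, whereas real toolkits differ in scaling, log base, and post-processing, and the function name and parameters are hypothetical.

```python
import math


def mfcc_from_mel_frame(mel_energies, n_coeffs=4, eps=1e-10):
    """Log-compress one frame of mel-band energies, then take an unnormalised DCT-II."""
    log_mel = [math.log(e + eps) for e in mel_energies]  # eps avoids log(0) on silence
    n = len(log_mel)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(log_mel))
        for k in range(n_coeffs)
    ]


# A flat spectrum has no shape, so only coefficient 0 (overall level) is non-zero.
flat = mfcc_from_mel_frame([math.e] * 8)
# A spectrum tilted towards the low bands shows up in coefficient 1.
tilted = mfcc_from_mel_frame([math.e ** (8 - i) for i in range(8)])
print([round(c, 3) for c in flat])
print([round(c, 3) for c in tilted])
```

Eight band energies are reduced to four coefficients here, which illustrates the dimensionality reduction described above.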

The Visualization of Silence: Uncovering the Secrets of Audio Files

When analyzing audio files, we often encounter periods of silence or inactivity that can be just as revealing as active speech. In a Mel spectrogram these stretches appear as regions of low energy across all frequency bands, which makes them easy to spot and shows where words and phrases begin and end. Silence is therefore worth treating as a component of audio analysis in its own right.

The Relationship Between Frequency Bands and Activation

Different frequency bands play distinct roles in how a sound is perceived. By following each band of a Mel spectrogram over time, we can identify which frequencies are active at which moments, which in turn gives insight into topics such as phonology and prosody.

The Effect of Word Structure on Activation

Word structure also shapes what we see: word length, syllable count, and stress pattern all affect the frequency content of the voice. In a Mel spectrogram, specific frequencies become more or less active depending on the structure of the spoken word, so word structure is worth taking into account in audio analysis.

The Zooming In and Out of Audio Signals

When working with audio files, we often need to zoom in or out to focus on specific regions of interest. Trimming or stripping the signal removes stretches of silence or noise, typically at the start and end of a recording, that do not contribute to the sound being analyzed. The remaining audio fills more of the spectrogram, so the relevant patterns are easier to see.
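Trimming can be sketched with a simple energy threshold. The version below is a minimal illustration, not the method used in the session: it assumes samples in the range -1 to 1, fixed-length frames, and an RMS threshold picked arbitrarily, and the function name and defaults are hypothetical. Audio libraries such as librosa offer a comparable helper (`librosa.effects.trim`).

```python
def trim_silence(samples, threshold=0.02, frame_len=256):
    """Drop leading and trailing frames whose RMS energy is below the threshold."""

    def rms(frame):
        return (sum(x * x for x in frame) / len(frame)) ** 0.5

    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    loud = [i for i, frame in enumerate(frames) if rms(frame) >= threshold]
    if not loud:
        return []  # the whole recording is below the threshold
    start = loud[0] * frame_len
    end = min(len(samples), (loud[-1] + 1) * frame_len)
    return samples[start:end]


# Demo: half a second of "sound" surrounded by silence.
recording = [0.0] * 512 + [0.5, -0.5] * 256 + [0.0] * 512
trimmed = trim_silence(recording)
print(len(recording), "samples before,", len(trimmed), "after trimming")
```

Only the outer silence is removed; quiet stretches inside the recording are kept, since pauses between sounds can carry information.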

The Power of Mel Spectrograms: Unveiling Insights into Human Communication

In conclusion, Mel spectrograms offer a powerful tool for analyzing and understanding human communication. By visualizing how energy is distributed across mel-frequency bands over time, we can study aspects of speech such as phonology and prosody. As we continue to explore the world of audio analysis, it is worth recognizing how much a Mel spectrogram can reveal about human communication.

"Okay, welcome to the study group session for the third lesson. This time the time frame was a little bit short: the session was only two or two and a half days ago, and I couldn't even watch the complete session this time because I had a lot to do. How is it for you? Did anyone else not watch the whole session because the time frame was too short, or did you all participate live? Oh, okay. I just watched it afterwards. Okay, the whole session? Yes, and I'd recommend skipping the first half hour, where they go back and forth.

Okay, so let's see what we will do today. I left the intro in from the last times, but I think for most of you this is not the first time, so I will not talk a lot about the study group itself. Then we do the recap of lesson three. As I said, I was not able to watch the whole lesson, but I have some things I want to talk about from lesson three. Again, feel free to jump in, unmute yourself, and talk about things you want to talk about from this lesson.

Just half an hour ago I got two more people who want to show something; before that we had just one presenter. I actually don't know the real name of at Havas ET. Is he here in the session? Yes, I'm here. Cool, awesome. So your name is Kyra, right? Yes. Hi. Okay, awesome. He will show something about how to visualize model architectures, which I think is pretty interesting. Kodiak will go on with the discussion he started earlier on Slack about analyzing spectrogram images. I found that very interesting because it transfers the time-based audio domain to the image recognition domain. And if we have time, Sonia is going to show something about the doodle challenge; he just said that he can show something for that. And maybe, if Christian is on the line, he can very quickly show how he managed to deploy his guitar classifier, because he was mentioned by Jeremy
in the course, and that's pretty cool. So congrats, Chris. By the way, is he on the line? Christian, are you there? Yeah. Sorry, just quickly from the chat: some people didn't watch the lesson, some people did watch it [unclear], and some would also like to hear the discussion about deployment, so Christian can show us his famous guitar classifier. I think the stuff he packed into one lesson pretty much tripled this time, and we just had two days for it, so let's see what we will talk about.

Again, on the study group: we are doing rotating hosts. Today I am hosting, and we will rotate. I think you will know most of these guys from Slack. On how we do this: we have the fast.ai Slack channel. Is there anyone not on Slack? If so, just post in the chat. I think pretty much all of you will be in the Slack channel. We are doing these weekly sessions like today, so if this is your first, just write in the chat and Makaha is going to point you in the right direction.

Just like last time, we want to do these mini presentations. Fortunately we got three now, at the last minute, for this session. You can reach out to me or any of the other hosts on Slack if you would like to show something, or even if you would like to talk about something from the course; just reach out and we can put it on the schedule for these sessions. Again, this time it's very hard, but we are trying to keep pace with the live course. And this time, as opposed to last time, we're going to focus on homework and projects. The schedule tells me that the next lesson is on Tuesday, so next time we'll have some more time until Saturday; this is actually the shortest time frame.

Okay, so now, I don't think we need to do the recap of the last session.
The video... I don't know if it's on YouTube now. Someone asked me a few days ago. If it's not online now, it will be in the next few days.

Okay, so let's start talking about lesson three. One cool thing I can show first: someone from the live session in San Francisco created this searchable video player, which I found pretty cool. I think the videos shown here are the edited ones, so lesson 3 is not online yet. What is pretty cool is that you can search the transcript of the lesson. So when Jeremy says something, let's say in lesson two, "gradient descent", maybe we find that. Yeah. And then we can click on it and it jumps to the right place. That's pretty cool if you want to search for specific things in the videos. It was mentioned that it uses the Google speech-to-text system, that it's not always that good, and that they are maybe building a speech-to-text model specifically geared towards the fast.ai videos. That would be pretty cool.

What if the phrase "gradient descent" is mentioned five times during a video? Does it find the first one, and can you skip from one to the next? Let's see what's happening here. I think it gets all of them. Yeah, I think this is the first one, and then the second. Okay, so this one is pretty cool. Let's go back.

The next thing that I found really interesting was this data block API he was talking about: how the Dataset and DataLoader classes from PyTorch are used to create the DataBunch in fastai. That was pretty interesting; I didn't know that it worked like this. So what do you think about this API? Yeah, I think the main thing was that it just gives you the flexibility that you wouldn't have got using, you know,
what he taught us in lesson 1. So this basically means you've got loads of options now. Yeah, there are these pretty cool notes here, and so what I was trying to do... [unclear] I can hear it. Oh, sorry.

Okay, I have a question; maybe you know how to do it. I trained a model, and I have the weights saved and the classes saved. Now I've deleted the training set, so I don't have it anymore, but I want to do inference on the test set. I couldn't create a DataBunch, for which you also need the training set and all the labels. So it seems I was not able to do inference on the test set without having a DataBunch created, and for that I need the training set. It was a huge dataset which I don't want to download again; it was like 500 gigabytes and it took several hours. I was hoping I would be able to do inference just having the weights of the trained model and the classes, which I have, but it didn't work. So I was thinking maybe you found a way or know how to do it, or whether it is possible at all.

So, to understand what you mean: you have the test set and you have the weights, you can load them into the model, and you can't do inference on the test set because it doesn't accept it without the training set? Yeah. The way I understand it, I need to create a Learner, and for that it needs the data, and I cannot create the data (at least that's my understanding) without training data and labels for the training set. So I'm a bit stuck. I was hoping that... it's like in all the web apps, where you predict for a single image. I would like to do the same, but I want to predict for a whole bunch of images. Well, I think you could probably just pass in an empty training set; I don't know if you've tried that. Yes, I've tried to put None instead of putting the folder or the training
folder, I put None, but it complained about it. I tried to put a random folder there, but then it complained that the training labels and the training data don't match. So I tried a couple of things. I thought there would be a way to do it without fooling the image data, but maybe there's a way, or maybe it's just not allowed. Yeah, the other option is you could just do it outside of fastai: load your model and pass the batches to it, because if you're just trying to do inference you don't really need all the other stuff in the library. You might not want to have to do that all yourself, but that's another option. Okay. Yeah, the model is just a PyTorch model and you can just use that.

But isn't it possible to just load weights into a model and do inference without...? That's what I was about to ask. How is it different from doing regular inference [unclear]? Because you need a different method: instead of single_from_classes you need something else. So yes, I've tried single_from_classes, passing the classes and an empty data folder, and I did try to run learn.TTA; it doesn't require is_test=True anymore, they just changed that. But it showed me an error. I forgot what the error was, but it was an error. So I tried a couple of different things, but so far I'm still trying [unclear].

Okay, maybe I've misunderstood you, but if you're using ResNet-34 or whatever, don't you just say it's a ResNet-34 model, load the weights, and get your prediction that way? Yeah, I was hoping I could do that, but so far I wasn't able to find a function or method in fastai to do it. Maybe it's there and I just don't know where to find it. And I posted my three lines of code in the chat. I don't know whether that's
different from what you need. You have single_from_classes, and then model.load_state_dict with torch.load. Okay, yeah, I've used slightly different code that I found on the documentation site [unclear]. Okay, maybe they're changing it. I'll try that one, thank you.

So this data block API and the ability to quickly change things is a really cool thing, at least that's what I found in this lesson. And then again, we had this in the last course too: progressive resizing. It's not that complex an idea, and it seems to work really well. Did anyone try that in your projects?

I have a question about it, though I think we kind of figured that it should work. With the ResNets we always used 224 or 299, and in the lesson, or at least in the chat, people were talking about 64, 128, 256, 512. Is that suboptimal, or does the model not care? Well, I think when you train your net on, let's say, 224 by 224 and then give it a picture that is 2,000 by 2,000, the filters will not match the features. A tire, for example, when you want to classify a car, is much bigger; it is stretched out over many more pixels, and then the filter will probably not find the car because it's too big. Yeah, I think I get the concept of the progressive stuff. I was just wondering whether I need a multiple of 224 or 299 instead of having different steps, whether the steps matter in any way. I kind of assumed you should have a multiple, but I wasn't sure, if the ResNet is trained on 224... The question is, why a multiple of 224, or a multiple of anything? Why do you think it should be a multiple of something? I thought the idea of progressive resizing is
always the next kind of increment. I don't know, it's just guesswork. Yeah, I think it's only that you have smaller images, so you can train the net very fast and get good initial weights, and then you go on and make the images bigger to get more of the details, and it gets better. There are also the computational limits, right? If we train bigger, a single GPU sometimes [unclear]. Maybe I'm being dense here, but could one, for instance, go to [unclear], 300, 350, 400? Yeah, you should be able to do that. The only limit, I think, is going to be how many times the network cuts your image in half with max pooling or stride 2. In terms of actual multiples, I don't think it has to be a multiple of 224 or any of those numbers. Okay, cool.

But I think the computation you need doesn't grow by a factor of two; it's quadratic, right? So when you have bigger pictures you need a lot more computational power. At least that's how it is for me: if I have a bigger batch size, maybe even a mid-sized one, with ResNet-50, it fails on a single GPU with an out-of-memory error. Yeah, makes sense.

Just one question on this. Typically with ResNets all of the images tend to be a certain size, so how does this play in with that, the fact that we're changing the sizes now because it seems to give a better model? My experience in the past was that you'd always use a certain size for ResNets, but now it looks like we're changing it. Does anyone have any thoughts on that? I think the idea behind progressive resizing... isn't it just that you want to get to a good model faster? You start with 64 by 64 images, which are a lot smaller, so there are not that many [unclear], right? Yeah, I think
you're right. Okay, thank you. So it's not actually about the model being able to abstract features at different sizes within the image? I thought you get the capability to detect larger or smaller objects as well. Is that a misconception? I think so; the idea was to experiment faster and get to a decent level. The bigger images will give you all the details, but you might not always have that much time and computational power, so with progressive resizing you can get there faster, and maybe better. Yeah, I wanted to try it out; I wanted to train a second batch of models for the app.

I think they used progressive resizing in the DAWNBench competition, which they won for very efficient training on ImageNet. I think the reason is that you can start very fast, get pretty good results pretty quickly, and then do transfer learning and go on from that. And I think someone broke that record a couple of days back, the DAWNBench record. Yes, Jeremy's record, someone broke that. Jeremy tweeted about it, and they've smashed it by something like two and a half times. That was the NVIDIA blog we were discussing previously. They basically described how they shaved off seconds here and there. There is a nice post about how they did it; they did a couple of interesting things in depth, but it was not using fastai, I think. I'm not sure, I didn't read it in detail. Okay, interesting.

Can I ask a very simple question? This is incredibly basic. I've had a bit of a brain freeze. I've seen a whole lot of this cycle of freeze, unfreeze, and then fit_one_cycle, and I'm a little bit confused, because I thought whenever you do a fit_one_cycle you're adding a fully connected head
with the relevant number of classes. In lesson three he did several of these, and I know it's doing some sort of training, but could someone explain what we're doing when we do these multiple freezes, unfreezes, and fit_one_cycle calls?

Can you all mute yourselves and only unmute when you are talking? There's some bad feedback noise here.

Okay, so as I understand it, applying a new head to the network is something you do only once, at the beginning of your training. You take the vanilla ResNet with a thousand classes, I think, cut off the head, and apply a new classifier head. Everything you do after that, all the retraining, is done with the same network; it's not changed anymore. So when you unfreeze and retrain, you always use the head you trained in the very first step for your dataset. So it's only when you do fit_one_cycle the first time that you're actually adding that new classifier? Yes, I'm pretty sure, because it's the classifier for your dataset. Okay. So how is it that when you do a couple more of those fit_one_cycle calls you're not changing the head anymore? He looped through this freeze, unfreeze, and fit_one_cycle, and that got me a little bit confused.

Well, in the very first step you take the features that are generated in the ResNet part, the convolutional part of the network. You get these features in your feature maps, and you use them to train the classifier for your specific dataset. Let's say you have 20 classes instead of a thousand. In the first step you train this classifier to use the features with a frozen ResNet, and you get a pretty decent classifier in the head. Then in the next steps you unfreeze the
whole system, so there are a lot of weights you can retrain, and you gradually feed your data through the network to fine-tune it. Okay, I just want clarification on that. I get that the first time you do fit_one_cycle you're basically fitting a head based on the relevant number of classes for your use case. Why is it that when you do a fit_one_cycle later in the code it's not doing exactly the same thing, fitting another head to the end of your network? I get why you do it the first time; I don't understand why the next time around it's not doing exactly the same thing.

Because the next time, you unfreeze the ResNet part of the network. The ResNet learned the features that are necessary to classify the ImageNet pictures, but when you unfreeze it you can get it to learn the features that are important for your dataset. So it's actually the ResNet part of the network that is changing when you unfreeze and train again.

Maybe what's confusing is this: I think the network is actually produced in create_cnn, and that is already where it chops off the head. I'm pretty sure the dropping of the old head of the network happens in create_cnn, because you can specify a custom head there. So that's the very initial stage of defining your model, and later on it's just freezing and unfreezing different parts of the model; you don't change the structure anymore. Yeah, and freezing and unfreezing really just means that you say: these weights can change, or these weights can't change. When you say they can change, you basically retrain the whole ResNet part too, because it probably learned some features from ImageNet that are important also
for your dataset. But when you allow the network to relearn, to learn new features that are specifically important for your dataset, you will get better results. Okay, great, thanks for that. I'll take a look at the chat messages too, and if I have a question I'll put it on Slack. Thanks. Yeah, sure.

Okay, so the next interesting thing was the U-Net architecture. I think I watched the whole part on segmentation, but I did not get further in this lesson. Did any of you do something with segmentation? Just ran the notebook, didn't really do more. Yeah, but I think it's a really interesting architecture, and as with so many other things in deep learning, it's kind of incredible that it works so well. Did anyone read the actual paper? Probably not, huh? Not enough time, right? I just looked at it. There also seems to be a nice video explaining how U-Net works, quite a short video from the authors. I came across it when there was the Kaggle salt competition, and pretty much everyone used U-Net. Yeah, I did this salt competition, and pretty much everyone was using U-Nets. What was interesting is that people used different bases: some people used ResNets, like in fastai, but some used DenseNets. You can plug pretty much any CNN into the architecture. That's an interesting part, because I think the original U-Net was actually just VGG, so it has evolved over time.

When I prepared for the first TWiML meetup, not the study group sessions but the meetup where I presented something on capsule networks, I had a look into SegCaps, which is a segmentation architecture based on capsule networks. At least from what people say, it's pretty promising, because segmentation should in theory work a lot better with capsule architectures than with convolutional architectures. So maybe in the future something capsule-based can be a successor
to U-Net.

What I also found very interesting was the training with reduced precision, and in particular one sentence Jeremy said: quite often in deep learning you generalize better when things are less accurate. That was pretty interesting, I think. Did any of you try this sort of training? I think you need a very new NVIDIA graphics card for that. I think the problem is you basically need one of the 2080s [unclear]. I have a 1080 Ti and I think it's not possible to use this. Okay. So for building your own box that might be really cool. I didn't really go into it, but something like a 2070 should be quite cheap, and they are available on the market, and then you could basically get away with less memory, because you can halve your memory use. So a 2070 has 8 gigs. Okay.

So if you do use FP16, can you run any type of network? Because you're basically not using the whole 32 bits. It should work, I think, with any CUDA code, but some parts of the process still need to be in 32-bit. I kind of remember that either the backpropagation or something else needs to retain the high precision. Okay, that's possible, I didn't look into that. Yeah, because otherwise why would we use 32-bit at all? I would say it's needed for something, but maybe not for deep learning, I'm not sure. What I know is that the FP32 support in the graphics cards isn't needed for graphics processing; there the lower precision is enough. I think they did that because those graphics cards and CUDA are very often used in numerical simulations, and accuracy is very important there because you accumulate error when you have lower precision. And I think since people know that for deep learning FP16 is enough, they kind of
changed the architecture in that direction. So basically this allows you to do a whole lot of these matrix calculations without storing as many bits for your numbers as with 32-bit? Yeah, I think so. Chris, you mentioned that it's not for every part, but I think most parts of the network, the numbers and the weights you store and the calculations you do, are in FP16. I think it was discussed somewhere in the forum; there's a 2070/2080 thread, and it was mentioned that you can't go 100% to the lower precision, but you can still potentially save quite a bit. It's still a bit shaky, though, not really super robust yet, and I think you also need the new CUDA, CUDA 10. Great, thank you.

Okay, so then let's go into our mini presentations. Jeremy suggested maybe trying out a regression example during the week before the next lecture. Part of the challenge there is having a labeled dataset. In fact, on a podcast I listened to recently, maybe it was Talking Machines, the guy said that when he goes to a conference he goes around the poster session to see what datasets people have created, because sometimes that can be more valuable than the methods people have come up with for that conference, since lots of people can then work with the dataset. The particular question I have is: do people have a good way, if I have 25 images, to find the pixel location of some feature that I point out on the image, so that I can create a labeled ground-truth dataset? For example, I want to find all triangles and I download 100 pictures; now I want to figure out the coordinates of those triangles. There are probably various tools and custom ways of coding that, but does anyone have an idea of how they might do it? Because that's potentially the homework you would do, a regression like the
location of the face example in the lesson. At some point somebody probably had to hover their cursor over the face, capture the pixel location, and mark that down somewhere, and we have to do the same thing if we want to do that regression homework, right?

It's the same problem as with bounding boxes, right? Object detection is basically also done with regression: the model needs to come up with coordinates, and that's the same problem. There are a couple of tools where you can annotate images, but I guess it's a lot of work.

So that's a lot more work; this is probably the minimum amount of work, where all you do is mark a point.

There's a Qt-based library which is used to annotate stuff; I can look it up. Maybe using PyQt we might also be able to achieve something like this.

I reckon there's so little time until the next session that few people will actually create their own regression data set, because I imagine that's quite a lot of work.

Well, maybe if we started out with this pre-trained model, ResNet-34, perhaps we only need maybe ten or twenty to get something working.

Yeah, it would be interesting to know how many we'll need, and whether it works with such a low number of images.

I have a question. In the lecture he mentioned people doing projects with video capture, and they're just using some sort of software to pick out video frames. Does anybody know what kind of software that is?

WebRTC, I would imagine. WebRTC, I think, is for grabbing frames when you have a video camera streaming; it'll grab them.

OK, thank you.

That's if you're talking about live input, not processing video files but actual camera input.

Oh, how about just video files that are pre-recorded?

I think OpenCV for that.

Yeah, OK, thank you.

And also keep in mind that Jeremy said in the lecture that you have to carefully design your training and validation set
and that you don't just mix those images up randomly, because then it can be very easy for the model to find pictures that are almost identical from one frame to the next in the training and validation sets. So always split by time: use images from the first part of the video for training, and images from the last twenty percent of the video for validation.

OK, good point, thank you.

OK, so Haider, you want to take over the screen and show the visualization?

Yeah. We tried to port fast.ai models to Android, but it wasn't an easy task. While debugging to find out what the problem was, I searched for a visualizer for the model, since the model that we generated differs by a few hundred bytes from the one in the Facebook repo's example for porting PyTorch to Android. So I found this one; it's called Netron. It supports the ONNX format, Keras, Core ML and Caffe, but for PyTorch it didn't work, since there is only experimental support for Caffe and PyTorch.

So we tried to make it work. I posted this in the forum: the workaround was these four or five lines of code. You have to install ONNX on your system. This is a library for a standardized format for porting models, and it's supported by all the major libraries: TensorFlow, MXNet, PyTorch, Keras. You can install it, but unfortunately I didn't get it to install easily; I had to compile ONNX, as seen in the example here.

On a Windows machine?

No, no, Ubuntu 18.

OK.

So this is the way to install the ONNX library, but this didn't work, so I had to compile it as shown in the steps here: you can also build and install ONNX locally from the source code. This works fine, without problems. After that we can make a dummy input, and input names and output names, and this one line of code will
export your model. This is your model, your fast.ai learn.model, and this is the dummy input that we put in. The names are just for representation, so that in your graph you see this name, "actual input 1". Then you assign the path of the model; this will be the file name. And that's it: you will find the file and open it in this application, which is available on Mac, Linux and Windows, so everywhere. What I got, for example for a ResNet, is this graph: this is the input, and going all the way down...

Oh, that's pretty nice.

Yeah, it's a pretty large model, going down from the input to the output. So this is the best way that I found for visualizing our model. If you have something else, please share.

No, that looks really good, super cool.

This is the link; I will share it with you.

And all the others that are posting stuff in the chat here, please also post it on Slack, because the chat is gone when the session ends. Just post it on Slack, then it's preserved. Very nice.

OK, so now we can talk with context; maybe he posted something. He's trying to classify images made from sounds, spectrograms. context, are you on the line? (If I said that correctly.) So he's at least in the chat, I see that. OK, so let me go through what he wrote in the chat.

He is trying to classify spectrograms. If you don't know what a spectrogram is, you have definitely seen one: it's a transformation of audio data, which has this waveform that you know, into the frequency domain, and you can plot the amplitude of each of the frequencies you have in your data. You can open the link in the chat; there is a picture of this one here. Oh yeah, let me share the screen again. Can you see my screen?

Yes.

OK, so that's what a spectrogram looks like here, and so this is
this whole picture is for one specific time frame, right? I mean, correct me if I'm wrong, I'm not an audio expert here. On this axis you have frequencies, I think the color is the amplitude, and this direction is the time. Or is it like that?

Time is on the x-axis, yeah.

Right: frequencies on y, and intensity. So this picture shows one specific time frame of a certain sound. And when you have different types of sounds and you want to classify them, each one has a pretty unique footprint in these amplitudes, and so you can use an image classifier to classify sound.

Why can't I see the chat when I'm in presenting mode? Does anyone know how to get the chat here?

We can read the chat to you.

OK. So his problem, I think, was that those pictures are not square, and the question is whether this is a problem. So, is it a problem that these pictures are not square?

My sense on that: I actually tried running a spectrogram analysis using a CNN last week in fast.ai, and it actually worked great. I used the Google challenge on Kaggle, maybe 10 or 11 months old, where there were a number of single-syllable commands like "stop" and "go", and then a bunch of other words that were not commands, and they basically wanted you to build a classifier that could figure out exactly which word was said. So I used that dataset, I turned everything into spectrograms, and I used basically all the standard things we've learned so far, although I did take out the transforms, and I got excellent classification accuracy. So I don't think the rectangle thing is an issue.

So you also did not use square images, but this kind of shape, like these pictures?

Yeah. I created the spectrograms using librosa, which I think saves normal-looking JPEG or PNG files that are not square, and there was no issue.

OK.

I mean, in the image that I see displayed on our shared screen right now, there isn't much differentiation except at the bottom there. However, in
all the JPEGs and all the spectrograms that I created, there was pretty clear activation on different frequencies across a wide range. And I think I used the standard 224 image size.

So you used librosa, or whatever this library is called, right?

That's correct.

And they are mel spectrograms? Because I remember seeing some of them, and they looked quite different from the ones in this blog post, more diverse, with a lot more action in the image.

Yeah, so I actually tried three different versions. There's a non-mel spectrogram, which is just a normal spectrogram transform; there was the mel spectrogram transform; and then there is a variant on the mel spectrogram, MFCC. I actually found that both the mel and the regular spectrogram produced good results.

So the shape of the spectrogram picture that you used is like this, rectangular, without problems? Should it be square, or is it not required to be square?

I don't think it's necessary for it to be square. I didn't even worry about that. Maybe I should have.

Maybe this is because of the test-time augmentation, the TTA, if you remember. I took version two of Jeremy's course, and we saw that his library does this augmentation by cropping squares in different places, maybe 10 or 15 times for each picture; it takes random squares from everywhere and then averages the predictions. There is test-time augmentation and training-time augmentation.

What Andrew said, I think, is that he didn't use transformations and augmentation, and that kind of makes sense, because it would distort the sounds you are trying to classify, right? When you change the picture and squeeze it or turn it around, then it would be a different sound.

It can do without any augmentation: you zero out all types of augmentation, and only the cropping is taken at random.
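The random-square-crop averaging described here can be sketched in plain Python. This is only an illustration under stated assumptions, not fast.ai's actual TTA code: the helper names (`random_square_boxes`, `crop`, `tta_predict`), the list-of-rows image, and the `predict_fn` callback are all hypothetical.

```python
import random


def random_square_boxes(width, height, n_crops, seed=0):
    """Return n_crops square boxes (left, top, right, bottom).

    The side of each square is the smaller image dimension, so for a wide
    spectrogram every crop keeps all frequencies but only part of the time axis.
    """
    rng = random.Random(seed)
    side = min(width, height)
    boxes = []
    for _ in range(n_crops):
        left = rng.randint(0, width - side)
        top = rng.randint(0, height - side)
        boxes.append((left, top, left + side, top + side))
    return boxes


def crop(image, box):
    """Crop a 2-D image stored as a list of rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def tta_predict(predict_fn, image, boxes):
    """Average the per-class scores returned by predict_fn over all crops."""
    scores = [predict_fn(crop(image, box)) for box in boxes]
    n_classes = len(scores[0])
    return [sum(s[c] for s in scores) / len(scores) for c in range(n_classes)]


if __name__ == "__main__":
    # Toy stand-ins: a 4 x 10 "spectrogram" and a fake two-class model.
    spectrogram = [[float(x) / 9 for x in range(10)] for _ in range(4)]

    def fake_model(patch):
        mean = sum(map(sum, patch)) / (len(patch) * len(patch[0]))
        return [mean, 1.0 - mean]

    boxes = random_square_boxes(width=10, height=4, n_crops=5)
    print(boxes)
    print(tta_predict(fake_model, spectrogram, boxes))
```

For a spectrogram that is much wider than it is tall, each square covers only a slice of the time axis, so a single crop may miss part of a spoken word.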
So I don't think you would even use cropping here, right? I mean, you would lose part of the... Andrew said that he's classifying words, right, Andrew? Words like "go" and "stop"?

That's right, yeah.

And when you crop the image, you would lose part of these, so you would definitely destroy the meaning of the word.

A question on that, Andrew: the longer the word, the wider your image, right? Or are all the images the same size?

Well, all the images are not exactly the same size, but I think Google was trying to make this relatively friendly: they're all about one second long, roughly. I do actually think I saw somebody posting earlier in the chat about a recent post that I also saw, about why CNNs may not be good for audio analysis. And I do wonder, if there are much longer snippets of audio, whether it starts making sense to think more about an RNN model where you have overlapping windows of time in a sequence, because I think with one large spectrogram for a bigger audio file it might be a little hard to capture all of the nuance and specificity. So I think these really short audio files lend themselves well to something like a CNN.

Yeah, I've seen a talk previously where they used recurrent...

I did notice mine might have frozen.

We can hear you.

You can hear me? OK. I was actually, to the point about augmentation, looking around on arXiv to see if anybody had thought much about how to do data augmentation for audio, and I was initially excited. I found this one paper from China where, and it kind of makes sense, I guess, the authors decided to magnify spectrogram frequencies in a way that should mimic things getting louder or softer, as a way of augmentation. That was pretty clever. But they announced that they had completely blown away the state of the art, from like 50% recognition on some sound classification task to like 99.9%, and I looked really
closely at the paper (again, this is just on arXiv), and it turns out that they had completely contaminated their test set with their training set; that's why they were getting such good accuracy. So I'm not actually sure whether their novel data augmentation system really works. I'd like to see somebody try that out with a little bit more rigor.

Just a really simple question on how you interpret this. On the screen that's being shared, above the word "frequency", for example, the orange bit at the top means that's a higher-frequency bit? How do you interpret that with all of the dark red, which is supposed to be the intensity of the sound? I thought the height represents how high it is, right, because that's the frequency?

So this is the frequency. When there's a blob up here, the frequency is higher, and when there's a blob down here, the frequency is lower.

OK.

And the bright parts mean that the amplitude of this specific frequency is high at this point in time. And then the whole x-axis here is time, so this is how an individual sound would look when you record it.

It seems to me, I can believe that for a one-second burst you can use a CNN on a spectrogram like this. Maybe for just one tenth of the picture that we're looking at, one of those little small structures, I can see how a CNN might be useful. But if you try to use it with a full sentence or ten seconds, I guess you would have the same problem that led Geoff Hinton to create capsule networks, right? A CNN can identify a face, but not because of where the eyes are in relation to the nose; they could be anywhere. Your mouth could be above your eyes and your nose could be where one of your eyes is supposed to be, and the network still calls it a face even though it
wouldn't look like a face to us. Similarly, here the time component, I guess, would just be lost completely, right? A CNN wouldn't really be able to handle that if you had more than a short burst of sound, probably. Basically, I mean, the time component is the x-axis, right?

Yeah.

Thanks.

Yeah, but you can only show one image at a time, and this is one fixed period of time, so it's not like a recurrent net, for example. So if it's important whether the large structure comes before the smaller structure on this plot, the CNN wouldn't be able to tell the difference, I think. Right?

Well, that's right, that's my understanding.

OK, can I show you the test-time augmentation notes from lesson two of last year? I can share the screen for just a few minutes.

Yeah. Guys, I kind of have to leave on time today because I need to go now, but I will just leave the meeting and you can go on with the discussion. So Haider, if you want to share the screen...

Yes, sure. All right, so do you see my screen?

Yeah.

So this is lesson two of last year. It says that in test-time augmentation we have these pictures, which are rectangular, as you see. When we do the validation set, all of the inputs to our model must be square. The reason is kind of a minor technical detail, but the GPU doesn't go very quickly if you have different dimensions for different images; it needs to be consistent for every part of the GPU. So it should be square. The way Jeremy made the fast.ai library is that, to make it square, we just pick out the square in the middle, as you can see below. So for this picture, for example, it takes only the middle square. So I don't understand how we can solve the problem with a spectrogram which is rectangular, if you take only the middle square. But for Andrew, the question is: did you use fast.ai for the Kaggle competition?

Sorry, I just got disconnected momentarily. I heard you ask me if I used fast.ai for the
experiment I was describing, is that right?

Yes, yes. Did you use fast.ai?

Yes, I used fast.ai.

So you didn't worry about this issue of making a square for the particular images?

Yeah, my sense is that what probably happened is this. If I remember correctly, when you pass in the image size while setting the parameters on the CNN, if you just pass one number, then it treats that as the number to create a square image. So if you just pass in 224 instead of 224 and some other number, it just makes your image 224 by 224. So I imagine it just slightly warped the shape of my image to smush it into that square. That's probably how it got a square image.

And how was your validation error?

It was very good, actually. I didn't submit to Kaggle per se. I think the best on Kaggle was about 91% accuracy; it was just a raw accuracy metric. And mine actually got 93, but I suspect the reason for that is that, similar to what Jeremy was talking about this week, I didn't split the train and validation sets so that each speaker was kept only in either train or validation. I suspect that if you say "stop" in the train set and your voice is also saying "go" in the valid set, there might be some data leakage that inflated my classification accuracy. But even so, it still seemed to work pretty well.

I see. Andrew, would you be able to share your code, if you've got it on GitHub? It would be interesting to take a look.

Yeah, sure. I've got to clean up my code, and I'd be happy to send it over. Actually, one thing I thought I might share, I know we're over time, but for any of you still curious about the translation from sound to spectrogram and exactly what that means, I do have a couple of images in my notebook that might clear
that up. I don't know if anybody would be curious to see that.

Yes, please.

OK, may I take the screen share? OK. Can you see my screen?

Yeah.

OK, so this is the word "bed". The very first image is something you've probably seen on any kind of audio analyzer before, just the normal sort of frequency activation. The one just below it is the non-mel spectrogram of somebody saying the word "bed". You can see that the darker colors on the spectrum are silence or low activation, and the brighter green is activation. So somebody's quiet, then they probably put their lips together right around 2.17, the main action is in the middle, and then it's quiet again.

The mel spectrogram is exactly the same thing; it looks like this. Actually, I took the log of it, because I read somewhere that that's what you're supposed to do. It's not that different; I mean, the colors are different, but the basic idea is the same.

The MFCC, which has a word in it that I don't actually know how to pronounce out loud, but you can look it up if you're curious, is just a transformation of the mel spectrogram, and I understand that in some circumstances it works better, although I'm still trying to learn exactly when.

But I thought maybe the most interesting thing for me, which really brought to life exactly what this spectrogram means: it's a little weird, because we have these three dimensions encoded. It's time on the x-axis, but each y-axis band is just one band of frequency. So I actually found it helpful to plot it in three dimensions here. It's exactly the same as what we saw above, but instead of there just being a color for activation, now it's represented as activation on a z-axis as well. So this is really what it's trying to describe, in terms of when and where across the frequency bands you're seeing sound get activated. That's all I wanted to
show.

Nice. Are all spectrograms mainly in the middle? I mean, as you show us in the spectrogram here, the word has been spoken in the middle. So maybe all of them are like this?

So I think that's really just a function of... there are two types of middles you can have. I'll go back up to here, because it's almost easier to show. One middle is the middle, if you will, of the y-axis, which would be the middle of the frequency spectrum, and for that I would say definitely not: the way that your mouth is perturbing the air will determine which frequencies are most active. On the x-axis, that's also not necessarily always going to be in the middle. It just so happens that on these recordings there's a little bit of silence in the beginning and silence at the end, and the word in the middle is really where the action happens. So you could imagine that if there is a lot of noise in the beginning, silence in the middle, and then noise at the end, you would see almost the opposite pattern here.

Yes, for the frequencies, OK, it's not in the middle. But for the crop that I showed you in fast.ai, it will take all of it, because this is the smaller dimension. For the width, as you see, if we take only a third of the time, only the square in the middle, then it will still represent all of the word spoken, because it is spoken in the middle only. So has all of the dataset been captured like that?

No, not exactly. I see what you're saying, and in fact that is one good hygiene practice that I encountered but didn't actually do on this audio: you can essentially perform the equivalent of trim or strip on strings on audio files, so that it just throws out all the stuff where nothing is really happening. So a lot of people do try to zoom in on where the action is. However, you could imagine that a two-syllable word like "zero",
for example, has a little bit of a pause as I transition from "ze" to "ro", compared to a word like "bed", which is just this plosive sound that then dies off quickly. So you would see almost a kind of bimodal activation in the middle for "zero", whereas "bed" would be more like this single chunk. Does that make sense?

Yes, yes, thank you.

Sure thing. OK guys, that's just what I wanted to share with you, but I'm going to hop off now. Thanks for your company, see you next week.

OK, so do we still have a host around?

I'm just on the jump out of the door. If you want to go on, I can leave it on, but we can also...

I don't know, I mean, I don't have to show this stuff. We can do it next time.

Exactly. All right, so next time Christian, and also I think Sonia has something to show, so we already have two mini presentations for next time.

Yeah, cool, awesome. Thank you guys. OK, bye-bye.

OK, yeah, welcome to the study group session for the third lesson. This time the time frame was a little bit short; the lesson was only, I don't know, two or two and a half days ago. I couldn't even watch the complete session this time, I had a lot to do. How is it for you? Did anyone else not watch the whole session because the time frame was too short, or did you all participate live?

I just watched it afterwards.

OK, through the whole session?

Yes, and I'd recommend skipping the first half-hour, as they go back and forth.

OK, so let's see what we will do today. I left in the intro from the last times, but I think most of you are not participating for the first time, so I will not talk a lot about the study group itself. Then we do the recap of lesson three. As I said, I was not able to watch the whole lesson, but I have some things I want to talk about from lesson three. Again, feel free to jump in, unmute yourself and talk about things you want to talk about from this session. Just half an hour ago I got two more that
show something; before that we had just one presenter. I actually don't know the real name of @Havas ET. Is he here in the session?

Yes, I'm here.

Cool, awesome. So your name is Kyra, right?

Yes, yes. Hi.

OK, awesome. So he will show something about how to visualize model architectures; that's pretty interesting, I think. With kodiak's we will go on with the discussion he started earlier on Slack about analyzing spectrogram images. I found that very interesting because it's kind of transferring the time-based audio domain to the image recognition domain. And if we have time, Sonia is going to show something about the doodle challenge; he just said that he can show something for that. And also, maybe, if Christian is on the line, he can just very quickly show how he managed to deploy his guitar classifier, because he was mentioned by Jeremy in the course, and that's pretty cool. So congrats, Chris. By the way, is he on the line? Christian, are you there?

Sorry, just quickly from the chat: some people didn't watch the lesson, some people did watch it. Some would like the more advanced topics, and would also like to hear the discussion about deployment, so Christian can show us his famous guitar classifier.

I think the stuff he packed into one lesson pretty much tripled this time, and we just had two days for this, so let's see what we will talk about.

OK. Again, we are doing rotating hosts for the study group. Today I am hosting it, and we will rotate it. I think you will know most of these guys from Slack. Again, on how we do the study group: we have the fast.ai Slack channel. Is there anyone not on Slack? Just post in the chat. I think pretty much all of you will be in the Slack channel. And we are doing these weekly sessions, like today. So if you are here for the first time, just write in the chat, and then Makaha is going to point you in the right direction. Yeah, just
like the last times, we want to do these mini presentations. Fortunately we got three now at the last minute for this session. You can just reach out to me or any of the other hosts on Slack if you would like to show something, or even if you would like to talk about something from the course; just reach out and then we can bring it up in the schedule for these sessions.

Again, this time it's very hard, but we are trying to keep pace with the live course; that makes sense. And this time, as opposed to last time, we're going to focus on homework and projects. The schedule tells me that the next fast.ai session is on Tuesday, so next time we'll have some more time until Saturday. So this is actually the shortest time frame, this time.

OK, so now, I don't think that we need to do the recap of the last session. I don't know if the video is on YouTube now; someone asked me a few days ago. But it will be online in the next days if it's not online now.

OK, so let's start talking about lesson three. One cool thing, maybe I can do that as the first point: someone from the live session in San Francisco created this searchable video player, which I found pretty cool. I think the videos that are shown here are the edited ones, so lesson 3 is not online yet. But what is pretty cool is that you can search the transcript of the lesson. So when Jeremy says something, let's say, what was in lesson two, I think, "gradient descent", maybe we find that. Yeah. And then we can click that, and it's jumping to the right place. That's pretty cool, if you want to search for specific things in the videos. They mention that it's the Google speech-to-text system and that it's not always that good, and that they are maybe doing a speech-to-text model that's specifically geared towards the fast.ai videos. That would be pretty
cool.

OK. What if the phrase "gradient descent" is mentioned five times during a video? Does it find the first one, and can you skip from one to the next?

So let's see what's happening here. I think it may get all of them. Yeah, I think this is the first one, and then the second, and...

OK, OK.

So this one is pretty cool. Let's go back. And then the next thing that I found really interesting was this data block API he was talking about: how the Dataset and DataLoader classes from PyTorch are used to create the DataBunch in fast.ai. That was pretty interesting; I didn't know that it worked like this. So what do you think about this API?

Yeah, I mean, I think the main thing around this was that it just gives you flexibility that you wouldn't have got using what he taught us in lesson 1. So this basically means you've got loads of options now.

Yeah, there are these pretty cool notes here, and so what I was trying to do...

Sorry. OK, sorry, I have a question; maybe you know how to do it. What I did: I trained the model, I have the weights saved, and I have the classes saved. Now I've deleted the training set, so I don't have it anymore. But when I now want to do inference on the test set, I couldn't create a DataBunch, for which you also need the training set and all the labels. So it seems like I was not able to do inference on the test set without having the DataBunch created, and for that I need the training set, which was a huge dataset that I don't want to download again. It was like 500 gigabytes, and it took several hours to do it. I was hoping I would be able to do inference just having the weights for the trained model, and even the classes I have, but it didn't work. So I was just thinking maybe you found a way or you know how to do it, or whether it is possible to do it.

So, to
understand what you mean: you have the test set and you have the weights, you can load them into the model, and you can't do inference on the test set because it doesn't accept it without the training set?

Yeah. The way I understand it, I need to create a learner, and for that it needs the data, and I cannot create the data, at least that's my understanding, without training data and labels for the training data. So I'm a bit stuck. I was hoping that you can... it's like in all the web apps, where you predict for a single image. I would like to do the same, but I want to predict for the whole bunch of images.

Well, I think you could probably just pass in an empty training set. I don't know if you've tried that.

Yes, I've tried. Instead of putting the training folder I put None, but it complained about it. I tried to put a random folder there, but then it complained that the training labels and the training data don't match. So I tried a couple of things. I thought there would be a way to do it without fooling the ImageDataBunch. Maybe there's a way; maybe it's just not allowed.

Yeah, I mean, the other option is that you could just do it outside of fast.ai: just load your model and pass the batches to it. Because if you're just trying to do inference, then you don't really need all the other stuff in the library. You might not want to have to do that all yourself, but that's another option.

OK, yeah. So the model is just a PyTorch model and you can just use that. But isn't it possible to just load the weights into a model and do inference without that?

That's what I was about to ask. I mean, how is it different from doing a regular inference?

Because you need a different method: instead of single_from_classes you need something
else. So yes, I've tried single_from_classes, putting in the classes and the empty data folder, and I did try to run learn.TTA, and it doesn't require is_test=True anymore; they just changed that. But it showed me an error. I forgot what the error was now, but it was an error. So I tried a couple of different things, but so far I'm still trying.

OK, maybe I've misunderstood you, but if you're using ResNet-34 or whatever, don't you just say it's a ResNet-34 model, and then you load the weights and you get your prediction that way?

Yeah, I was hoping I could do that, but so far I wasn't able to find a function or method in fast.ai to do it. Maybe it's there; I just don't know where to find it.

I posted my three lines of code in the chat. I don't know if that's different from what you need. You have single_from_classes, then model.load_state_dict, and then torch.load.

OK. Yeah, I've used slightly different code that I found on the documentation site for fast.ai. OK, maybe it's changing. OK, I'll try that one, thank you.

Yeah, so this data block API and the ability to quickly change things here, that's a really cool thing that I found in this lesson. And then again, and we had this in the last course too, there's progressive resizing, which is not that complex an idea and it seems to work really well. So did any of you try that in your projects?

I have a question about it, but I think we kind of figured that it should work. I was wondering: with the ResNets we always used 224 or 299, and I think in the lesson, or at least in the chat, people were talking about 64, 128, 256, 512. Is that suboptimal, or does the model not care?

Well, I think so. When you kind
of train your net on, let's say, 224 by 224, and then you give it a picture that's two thousand by two thousand, the filters will not match the features. A tire, for example, when you want to classify a car, is much bigger; it's stretched out over many more pixels, and then the filter probably does not find a car because it's too big.

Yeah, I think I get the concept of the progressive stuff. I was just wondering if I need a multiple of 224 or 299 instead of arbitrary steps, whether the steps matter in any way. I kind of assumed you should have a multiple, but I wasn't sure, if the ResNet is trained on 299.

Then the question would be: why a multiple of 299, or a multiple of anything? Why do you think it should be a multiple of something?

I thought the idea of progressive resizing is that it's always the next kind of increment. I don't know, it's just guesswork.

I think it's only that you have smaller images, so you can train the net very fast, and then you have good initialization weights. Then you go on and make the images bigger to get more of the details, and when you make them bigger it gets better.

It's also about computational limits, right, if we train on images bigger than a single GPU can sometimes handle. Maybe I'm just dense here, but could one, for instance, go 299, 300, 350, 400?

Yeah, you should be able to do that. The only limit, I think, is going to be how many times it cuts your thing in half with max pooling or stride equals two. But in terms of actual multiples, I don't think it has to be a multiple of 224 or any of those numbers.

Okay, cool.

But I think the computation you need is not a factor of two, it's quadratic, right? So when you have bigger pictures, you need a lot more computational power.

At least that's how it is for me: if I have a bigger batch, say with maybe a mid-sized
or ResNet-50 model, it fails on a single GPU; you get an out-of-memory error.

Yeah, makes sense.

Okay, just one question on this. I think typically with ResNet, all of the images tend to be a certain size. How does this play in with that, the fact that we're actually changing the sizes now? My experience in the past was that you'd always use a certain size for ResNets, but now it looks like we're changing it because it seems to provide a better model. Does anyone have any thoughts on that?

I think the idea behind progressive resizing is, isn't it just that you want to train your model faster? You want to get to a good model faster, so you start with 64 by 64 images, which are a lot smaller, so there is not that [unclear], right?

Yeah, I think you're right.

Okay, right, thank you. So it's not actually about the model being able to abstract features at different sizes within the image? I thought you got the capability that it can detect larger objects or smaller objects as well.

That's a misconception, I think. Well, the idea was to experiment faster, to get to a decent level. The bigger images will give you all the details, but you might not always have that much time and computational power. So with progressive resizing you can get there faster, and maybe better.

Yeah, I wanted to try it out. I wanted to train a second batch of models for the app.

I think they used progressive resizing in this, what is it, DAWNBench competition they won, for training very efficiently on ImageNet. And I think the reason behind that is that you can start very fast, you get pretty good results pretty fast, and then you can do transfer learning and go on from that.

I think someone broke that record a couple of days back.

The DAWNBench record?

Yes, yes, the ImageNet one, Jeremy's
record, someone broke that. Jeremy tweeted about this, and they smashed it by something like two and a half times.

That was the NVIDIA blog post we were discussing previously. They basically described how they shaved off seconds here and there. There is a nice post about how they did it. They did a couple of interesting things in depth, but it was not using fast.ai, I think. I'm not sure, I didn't read it in detail.

Okay, interesting.

Can I ask a very simple question? This is incredibly basic. I've had a bit of a brain freeze. I've seen a whole lot of this cycle of freeze, unfreeze, and then fit_one_cycle, and I'm a little bit confused, because I thought whenever you do a fit_one_cycle you're adding a fully connected head with the relevant number of classes. In lesson three he did several of these, and I know it's doing some sort of training, but could someone define or just explain what we're doing when we do these multiple freezes, unfreezes and fit_one_cycle calls?

Can you all mute yourselves, and unmute when you are talking? There's some bad feedback noise here.

Okay, so as I understand it, applying a new head to the network is something you do only once, at the beginning of your training. You take the vanilla ResNet, with a thousand classes I think, cut off the head, and apply a new classifier head on it. Everything you do after that, all the retraining, is done with the same network; it's not changed anymore. So when you unfreeze and retrain, you always use the head you trained in the very first step for your dataset.

So it's only when you do the fit_one_cycle the first time that you're actually adding that new classifier?

Yes, I'm pretty sure, because I mean
it's the classifier for your dataset.

Okay. So how is it that when you do a couple more of those fit_one_cycle calls, you're not changing the head anymore? He looped through this freeze, unfreeze and fit_one_cycle, and that got me a little bit confused.

Well, in the very first step you take the features that are generated in the ResNet part, the convolutional part of the network. You get these features in your feature maps, and you use them to train the classifier for your specific dataset. Let's say you have 20 classes instead of a thousand. In the first step you train this classifier to use the features with a frozen ResNet network, and you get a pretty decent classifier in the head. Then in the next steps you unfreeze the whole system, so there are a lot of weights you can retrain, and you gradually feed your data through the network to fine-tune it.

Okay, so I just want clarification on that. I get the fact that the first time you do the fit_one_cycle, you're basically fitting based on the relevant number of classes for your use case with that model. Why is it that when you do a fit_one_cycle later down in the code, it's not doing exactly the same thing, which is fitting another thing onto the end of your model? I get why you do it the first time. I don't understand why, when you do a fit_one_cycle the next time around, it's not doing exactly the same thing.

Because the next time, you unfreeze the ResNet part of the network. The ResNet learned the features that are necessary to classify the ImageNet pictures, but when you unfreeze it, you can bring it to learn the features that are important for your dataset. So it's actually the
ResNet part of the network that is changing when you unfreeze and then train again.

Maybe you're confusing that. I think the network is actually produced in create_cnn, and that's already where it chops off the head. I'm pretty sure the dropping of the old head of the network happens in create_cnn, because you can specify a custom head there. So that's the very initial stage of defining your model, and later on it's just freezing and unfreezing different parts of the model, but you don't change the structure anymore.

Yeah, and freezing and unfreezing really just means that you say: okay, those weights can change, or those weights can't change. When you say they can change, you basically retrain the whole ResNet part too. It probably learned some features from ImageNet that are also important for your dataset, but when you allow the network to relearn, to learn new features that are specifically important for your dataset, you will get better results.

Okay, great. Thanks for that. I'll take a look at the chat messages also, and if I have a question I'll put it on Slack. Thanks.

Yeah, sure. Okay, so the next interesting thing was the U-Net architecture. I watched the whole segmentation part, I think, but I did not go further in this lesson. Did any of you do something with segmentation?

I just ran the notebook, didn't really do more.

Yeah, but I think it's a really interesting architecture, and as with so many other things in deep learning, it's kind of incredible that it works so well. Did anyone read the actual paper? Probably not, huh? Not enough time, right?

I just looked at it. There also seems to be a nice video explaining how U-Net works, quite a short video from the authors. I came across it when there was the Kaggle salt competition, and pretty much everyone used that one.

Yeah, I did this salt competition
and yeah, pretty much everyone was using U-Nets. What was interesting, I guess, is that people used different bases. Some people used ResNets, like in fast.ai, but some people used DenseNets; you can plug pretty much any CNN into the architecture. That's kind of an interesting part, because I think the original U-Net was actually just VGG, so it has evolved over time.

Yeah. When I prepared for the first [unclear] meetup, so not the study group sessions but the meetup where I presented something on capsule networks, I had a look into SegCaps, which is a segmentation architecture based on capsule networks. At least people say it's pretty promising, because segmentation should in theory work a lot better with capsule net architectures than with convolutional architectures. So maybe in the future something capsule-based can be a successor to U-Net.

What I also found very interesting was the training with reduced precision, and particularly one sentence Jeremy said: that quite often in deep learning you will generalize better when you are less precise. That was pretty interesting, I think. Did any of you try this sort of training? I think you need a very new NVIDIA graphics card for that.

I think the problem is you basically need one of the 2080s.

Yes. I have a 1080 Ti, and I think it's not possible to use this.

Okay, so for building your own box that might be really cool. I don't know, I didn't really go into it, but something like a 2070 should be quite cheap, and they are available on the market. Then you could basically get away with less memory, because you can halve your memory use. So a 2070 with 8 gigs.

Okay. So if you do have FP16, can you run any type of network? Because you're basically not using the whole 32 bits. Does it work with any CUDA code?

I think some parts of the process still need to be in 32, and I kind of
remember that either the backpropagation or something else needs to retain the high resolution.

Okay, yeah, that's possible. I didn't look into that.

Because otherwise, why would we use 32 for the other version? I would say there is a need for it for something, but maybe not actually for deep learning. I'm not sure.

What I know is that this FP32 stuff in the graphics card, you don't need it for graphics processing; there the lower precision is enough. I think they did that because those graphics cards and the CUDA stuff are very often used in numerical simulations, and accuracy is very important there, because you accumulate error when you have lower precision. And since people know that FP16 is enough for this deep learning stuff, I think they changed the architecture in that direction.

So basically this allows you to do a whole lot of these matrix calculations without storing as much for the actual calculations, or your figures, as with the 32s?

Yeah, I think so. I don't know. So Chris, you mentioned that it's not for every part, but I think most parts of the network, the numbers and the weights you save there and the calculations you do, are in FP16.

I think it was discussed in the forum somewhere. There's a 2070/2080 thread, and I think it was mentioned that you can't go 100% to the lower resolution, but you can still potentially save quite a bit. It's still a bit shaky, though, so it's not really super robust yet. And I think you also need the new CUDA, CUDA 10 I think.

Great, thank you.

Okay, so then let's go into our mini presentations.

Yeah. Jeremy suggested maybe trying out a regression example during the week before the next lecture. Part of the challenge there is having a dataset, a labeled dataset. In fact, in some podcast I listened to recently, maybe it was Talking
Machines or something, the guy said that when he goes to a conference, he goes around the poster session to see what datasets people have created, because sometimes that can be more valuable than the methods people have come up with for that particular conference, since lots of people can then look at the dataset.

The particular question I have is: do people have a good way, if I have 25 images, to find the pixel location of some feature on the image that I point out, so that I can create a labeled ground-truth dataset? For example, I want to find all triangles, and I download 100 pictures. Now I want to figure out the coordinates of those triangles. There are probably various tools and custom ways of coding that, but does anyone have an idea of how they might do it? Because that's potentially the homework you would do, a regression like the location-of-the-face example. At some point somebody probably had to hover their cursor over the face, capture the pixel location, and mark it down somewhere, and we have to do the same thing if we want to do that regression homework.

Right, it's the same problem as with bounding boxes. Object detection is basically also done with regression: the model needs to come up with coordinates, and that's the problem. There are a couple of tools where you can annotate images, but I guess it's a lot of work.

That's a lot more work. This is probably the minimum amount of work, where all you do is [unclear].

There's a Qt-based library which is used to annotate stuff, I don't know, I can look it up. Maybe using [unclear] we might also be able to achieve something like this.

I reckon there's so little time till the next session that few people will actually create their own regression dataset, because I imagine that's quite a lot of work.

Well, maybe if we started out with this pre-trained object detection model
ResNet-34, perhaps we only need maybe ten or twenty to get something working.

Yeah, it would be interesting to know how many we'll need, and whether it works with such a low number of images.

I have a question. In the lecture he mentioned people doing projects with video capture, where they're just using some sort of software to pick out video frames. Does anybody know what kind of software that is?

WebRTC, I would imagine. I think that's for grabbing frames when you have a video camera streaming; it'll grab them.

Okay, thank you.

That's if you're talking about live camera input, not processing video files.

Oh. How about just video files that are pre-recorded?

I think OpenCV for that.

Okay, thank you.

And also keep in mind that Jeremy said in the lecture that you have to carefully design your training and validation sets, and that you shouldn't just mix those images up randomly, because then it can be very easy for the model to find pictures in the training and validation sets that are almost identical from one frame to the next. So always divide by time: use images from the first part of the video for training, and the last twenty percent of the video for validation.

Okay, good point, thank you.

Okay, so Haider, do you want to take over the screen and show the visualization?

Yeah. Stop sharing.

Yeah. We tried to port fast.ai models to Android, but it wasn't an easy task. While debugging what the problem was, I tried to search for a visualizer for the model, since the model we generated differs by a few bytes, a few hundred bytes, from the one in the Facebook repo for the PyTorch-to-Android example. So I found this one; it's called Netron. It supports the ONNX format, Keras, Core ML and Caffe, but for PyTorch it didn't work, since there is only experimental support for Caffe and PyTorch. So we tried to make it work, and I have this
in the forum. I posted it; what worked was doing these four or five lines of code. You have to install ONNX on your system. This is a library for porting models, a standardized format for models, and it's supported by all major libraries: TensorFlow, MXNet, PyTorch, Keras. You can install it, but unfortunately I didn't get it to install easily. I had to compile ONNX, as seen in the example here.

On a Windows machine?

No, no, Ubuntu 18.

Okay.

Yeah, so this is the way to install the ONNX library, but it didn't work, so I had to compile it as shown in the steps here: you can also build and install ONNX locally from the source code. That works fine, without problems. After that, we make a dummy input, plus input names and output names, and this one line of code will export your model. This is your model, your fast.ai learn.model, and this is the dummy input. The name we put in is just for representation, so that you see this name in your graph as the actual input. Then you assign the path of the model, which will be the file name, and that's it. You will find the file and open it in this application, which is available on Mac, Linux and Windows, so everywhere. What I got, for example for a ResNet-34, is this graph. This is the input, going all the way down, back here.

Oh, that's pretty nice.

Yeah, it's a pretty large model, going down from the input to the output. So this is the best way I found for visualizing our model. If you have something else, please share.

No, that looks really good. Looks super cool.

Yeah, just installing...

This is the link, I will share it with you.

And all the others that are posting stuff in the chat here, please also post it on Slack, because the chat is gone when the session ends. So just post it on Slack, then
it's preserved.

Yeah, very nice, very nice. Okay, so now we can talk with [unclear], maybe. He posted something: he's trying to classify images made from sounds, spectrograms. [unclear], are you on the line? I hope I said that correctly. Okay, he's at least in the chat, I see that. So let me go through what he wrote in the chat. He is trying to classify spectrograms. If you don't know what a spectrogram is, you have definitely seen one. It's a transformation of audio data, which has this waveform that you know, into the frequency domain, and you can plot the amplitude of each of the frequencies you have in your data. You can open the link in the chat; there is a picture of this one. Let me share the screen again. Can you see my screen?

Yes.

Okay, so that's what a spectrogram looks like. This whole picture is for one specific time frame, right? Correct me if I'm wrong, I'm not an audio expert. On this axis you have frequencies, I think the color is the amplitude, and this direction is the time? No?

Isn't it that time is on the x-axis?

Yeah, right. Frequency on y, and intensity as color.

Yeah. So this picture shows one specific time frame of a certain sound, and when you have different types of sounds that you want to classify, each one has a pretty unique footprint in these amplitudes. So you can use an image classifier to classify sound. Why can't I see the chat when I'm in presenting mode? Does anyone know how to get the chat here?

We can read the chat to you.

Okay. So his problem, I think, was that those pictures are not square, and the question is whether this is a problem. So is it a problem that these pictures are not square?

So, I actually tried running a spectrogram analysis with a CNN last week on fast.ai, and it worked, it actually worked great. I used the Google challenge that had been
on Kaggle for maybe 10 or 11 months, where there were a number of single-syllable commands like "stop" and "go", and then a bunch of other words that were not the correct ones, and they basically wanted you to build a classifier that could figure out exactly which word was said. So I used that dataset and turned everything into spectrograms. I used basically all the standard things we've learned so far with CNNs, although I did take out the transforms, and I got excellent classification accuracy. So I don't think the rectangle thing is an issue.

So you also did not use square images, but this kind of shape, like these pictures?

Yeah. I created the spectrograms using librosa, which I think saves normal-looking JPEG or PNG files that are not square, and there was no issue. In the image I see displayed on our shared screen right now, there's not much differentiation except at the bottom. In all the spectrograms I created, however, there was pretty clear activation at different frequencies across a wide range. And I think I used the standard 224 image sizing.

So you used librosa, or whatever this library is called, right?

That's correct.

And are they mel spectrograms? Because I remember seeing some of them, and they looked quite different from the ones in this blog post, more diverse, with a lot more action in the image.

Yeah, so I actually tried three different versions. There's a non-mel spectrogram, just a normal spectrogram transform, there's the mel spectrogram transform, and then there's a variant on the mel spectrogram, an MFCC. I actually found that the mel and the regular spectrogram both produced good results.

So the shape of the spectrogram picture that you used is like this? It's rectangular, without problems? Should it be square, or is it not required to be square?

I don't think it's necessary for it to be square. I didn't even worry about that. Now maybe I should have.

Maybe this is because of
the test time augmentation, the TTA, if you remember. I don't know, because I took version two of Jeremy's course. We saw that his library does an augmentation where it crops squares, but in different places, maybe 10 or 15 times for each picture; it takes random squares from everywhere and then averages the predictions. There is a test time augmentation and a training time augmentation.

So what Andrew said, I think, is that he didn't use transformations and augmentation, and that kind of makes sense, because it would distort the sounds you are trying to classify, right? When you change the picture and squeeze it or turn it around, it would be a different sound.

It can do without any augmentation. You zero out all types of augmentation; only the cropping is still taken at random.

I don't think you would even use cropping here, right? I mean, you would lose part of the signal. Andrew said that he's classifying words, right? Andrew, words like "go" and "stop"?

That's right, yeah.

And when you crop the image, you would lose part of it, so you would definitely destroy the meaning of the word. Any thoughts on that, Andrew? So the longer the word, the wider your image, right? Or are all the images the same size?

Well, the images are not all exactly the same size, but I think Google was trying to make this relatively friendly: they're all roughly one second long. I think somebody was posting earlier in the chat about a recent post that I also saw, about why CNNs may not be good for audio analysis. And I do wonder, if there are much longer snippets of audio, whether it starts making sense to think more about an RNN model, where you have overlapping windows of time in a sequence, because I think with one large spectrogram for a bigger audio file it might be a little hard to capture
all of the nuance and specificity. So I think these really short audio files lend themselves well to something like a CNN.

Yeah, I've seen a talk previously where they used a recurrent... I noticed mine might have frozen.

We can hear you.

You can hear me? Okay. On the point about augmentation: I was looking around on arXiv to see if anybody had thought much about how to do data augmentation for audio, and I was initially excited. I found this one paper from China which kind of makes sense, I guess. The authors decided to magnify spectrogram frequencies in a way that should mimic things getting louder or softer, as a way of augmentation. That was pretty clever. But they announced that they had completely blown away the state of the art, from something like 50% recognition on some classification task to 99.9, and I looked really closely at the paper. Again, this is just on arXiv. It turns out that they had completely contaminated their test set with their training set, and that's why they were getting such good accuracy. So I'm not actually sure whether their data augmentation system really works. I'd like to see somebody try that out with a little bit more rigor.

Just a really simple question on how you interpret this. On the screen that's being shared, above the word "frequency", for example, the orange bit at the top means that's a higher frequency bit. How do you interpret that together with the dark red, which is supposed to be the intensity of the sound? I thought the position represents how high it is, because that's the frequency, right?

So this is the frequency. When there's a blob right here, the frequency is higher, and when there's a blob right here, the frequency is lower.

Okay.

And so the bright parts mean
that the amplitude of this specific frequency is high at this point in time. And the whole x-axis here is time, so this would be the individual [unclear] when you record it.

It seems to me, I can believe that for a one-second burst you can use a CNN on a spectrogram like this. Maybe for just one tenth of the picture we're looking at, one of those little structures, I can see how a CNN might be useful. But if you try to use it with a full sentence, or ten seconds, I guess you would have the same problem that led Geoff Hinton to create capsule nets, right? A CNN can identify a face, but not because of where the eyes are in relation to the nose. They could be anywhere: your mouth could be above your eyes, and your nose could be where one of your eyes is supposed to be, and the network still calls it a face, even though it wouldn't look like a face to us. Similarly, the time component, I guess, would just be lost completely, right? A CNN wouldn't really be able to handle that if it had more than a short burst of sound, probably.

But the time component is the x-axis, right?

Yeah, but you can only show one image at a time, and this is one fixed period of time. It's not like a recurrent net, for example. So if it's important whether the large structure comes before the smaller structure on this plot, the CNN wouldn't be able to tell the difference, I think.

Well, that's right, that's my understanding.

Okay. Can I show you the test time augmentation notes from lesson two of last year? I can share the screen for just a few minutes.

Guys, I have to leave on time today, I need to go now. But I will just leave the meeting, and you can go on with the discussion. So Haider, you wanted to share the screen?

Yes, sure. All right, do you see my screen?

Yeah.

So this is lesson two of last year. It says that for our
test time augmentation we have these pictures, which are rectangular, as you see. When we do the validation set, all of the inputs to our model must be square. The reason is a kind of minor technical detail, but the GPU doesn't go very quickly if you have different dimensions for different images; it needs to be consistent for every part of the GPU. So they should be square. The way Jeremy made the fast.ai library, to make an image square we just pick out the square in the middle, as you can see below. For this picture, for example, it takes only the middle square. So I don't understand how this can work with a spectrogram, which is rectangular, if you take only the middle square. But Andrew, the question is: did you use fast.ai for the Kaggle competition?

Sorry, I just got disconnected momentarily. I heard you ask me if I used fast.ai for the experiment I was describing, is that right?

Yes, yes. Did you use fast.ai?

Yes, I used fast.ai.

So you didn't worry about this issue of making a square for the particular images?

Yeah, my sense is that what probably happened is this. If I remember correctly, when you pass in the image size while setting the parameters on the CNN, if you pass just one number, it treats that as the number to use to create a square image. So if you pass in 224, instead of 224 comma some other number, it just makes your image 224 by 224. So I imagine it slightly warped the shape of my image to smush it into that square. That's probably how it got a square image.

And how was your error, the validation error?

It was very good, actually. I didn't submit to Kaggle per se. I think the best on Kaggle was about 91 percent accuracy, it was just a raw accuracy metric, and mine actually got 93. But I suspect the reason for that is that I didn't split, similar to what
Jeremy was talking about this week, I didn't split the train and validation sets so that each speaker was kept only in either train or validation. I suspect that even if you say "stop" in the train set, if your voice is also saying "go" in the valid set, there might be some data leakage that inflated my classification accuracy. But even so, it still seemed to work pretty well.

I see. Andrew, would you be able to share your code, if you've got it on GitHub? It would be interesting to take a look.

Yeah, sure. I've got to clean up my code, and then I'd be happy to send it over. Actually, one thing I thought I might share, I know we're over time, but for any of you still curious about the translation from sound to spectrogram and exactly what that means, I do have a couple of images in my notebook that might clear that up. I don't know if anybody would be curious to see that.

Yes, please.

Okay, may I take the screen share? Can you see my screen?

Yeah.

Okay, so this is the word "bed". The very first image is something you've probably seen on any kind of audio analyzer before, just the normal sort of frequency activation. The one just below it is the non-mel spectrogram of somebody saying the word "bed". You can see that the darker colors on the spectrogram are silence or low activation, and the brighter green is activation. So somebody's quiet, then probably they put their lips together right around 2.17, the main action is in the middle, and then it's quiet again. The mel spectrogram is exactly the same thing; it looks like this. Actually, I took the log of it, because I read somewhere that that's what you're supposed to do. It's not that different. The colors are different, but the basic idea is the same. The MFCC, which has a word in it that I don't actually know how to pronounce out loud,
but you can look it up if you're curious. This is just a transformation of the Mel spectrogram, and I understand that in some circumstances it works better, although I'm still trying to learn exactly when.

But I thought maybe the most interesting thing for me, the thing that really brought to life exactly what this spectrogram means, is this. It's a little weird because we have these three dimensions encoded: it's time on the x-axis, but each y-axis band is just one band of frequency. So I actually found it helpful to plot it in three dimensions here. It's exactly the same as what we saw above, but instead of there just being a color for activation, the activation is now represented on a z-axis as well. So this is really what it's trying to describe, in terms of when and where across the frequency bands you're seeing sound get activated. That's all I wanted to show.

Nice. Are all spectrograms mainly in the middle? I mean, as you show us the spectrogram here, the word has been spoken in the middle. So maybe all of them are like this?

So I think that's really just a function of this: there are two types of middle you can have. I'll go back up to here, because it's almost easier to show. One middle is in the middle, if you will, of the y-axis, which would be the middle of the frequency spectrum. For that I would say definitely not: the way that your mouth is perturbing the air will determine what frequencies are most active. On the x-axis, that's also not necessarily always going to be in the middle. It just so happens that on these recordings there's a little bit of silence in the beginning and there's silence at the end, and the word in the middle is really where the action happens. So you could imagine that if there is a lot of noise in the beginning, silence in the middle, and then noise at the end, you would see almost the opposite pattern here.

Yes, actually, for the frequencies, okay, but
it's not in the middle. Because for the crop that I showed you in fast.ai, it will take it all, because this is the smallest dimension, the length. But for the width, as you see, if we take only one third of the time, only the square in the middle third, then it will represent the whole spoken word, because it is spoken in the middle only. So has all of the data set been captured like that?

No, not exactly. I see what you're saying, and in fact that is one sort of good hygiene practice that I encountered that I didn't actually do on this audio, which is that you can essentially perform the equivalent of trim or strip on strings on audio files, so that it just throws out all the stuff where nothing is really happening. So a lot of people do try to zoom in on where the action is. However, you could imagine that a two-syllable word like "zero", for example, has a little bit of a pause as I transition from "ze" to "ro", compared to a word like "bed", which is just this plosive sound that then dies off quickly. So you would see almost a kind of bimodal activation in the middle for "zero", whereas "bed" would be more like this single chunk. Does that make sense?

Yes, yes, thank you.

Sure thing.

Okay guys, that's just what I wanted to share with you, but I'm going to hop off now. Thanks for your company. See you next week.

Okay, so do we still have a host around? I'm just on the jump out of the door. So if you want to go on, I can leave it on, but we can also...

I don't know. I mean, I don't have to show this stuff. We can do it next time.

Exactly. All right, so next time Christian, and also I think Sonia has something to show, so we already have two mini presentations for next time.

Yeah, cool. Awesome. Thank you, guys.

Okay, bye-bye.
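
The silence-trimming practice mentioned in the discussion, the audio equivalent of strip on a string, can be sketched roughly as follows. This is a minimal NumPy illustration, not the code used in the session: the function name `trim_silence`, the frame length, and the relative energy threshold are all assumptions chosen for the example, and the clip is synthetic rather than a real recording.

```python
import numpy as np

def trim_silence(samples, frame_len=256, rel_threshold=0.1):
    """Drop leading and trailing frames whose RMS energy is low.

    A frame counts as active when its RMS is at least
    rel_threshold times the loudest frame's RMS (an assumed heuristic).
    """
    samples = np.asarray(samples, dtype=np.float64)
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        return samples  # too short to analyse; leave untouched
    # Split the signal into non-overlapping frames and measure each one.
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    if rms.max() == 0:
        return samples[:0]  # completely silent clip
    active = np.flatnonzero(rms >= rel_threshold * rms.max())
    start = active[0] * frame_len
    end = (active[-1] + 1) * frame_len
    return samples[start:end]

# Synthetic clip: silence, then a 440 Hz tone, then silence again,
# mimicking a word spoken in the middle of a recording.
sample_rate = 16000
t = np.arange(sample_rate // 2) / sample_rate
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
silence = np.zeros(sample_rate // 4)
clip = np.concatenate([silence, tone, silence])

trimmed = trim_silence(clip)
print(len(clip), len(trimmed))
```

Note that this only trims the edges. A pause inside a word, like the one between the two syllables of "zero", is kept, because everything between the first and last active frame is returned.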