Student Blog Post | Who is the Songster? Deep Learning Recognizes the Singers of Bollywood Songs

The Power of Transfer Learning: A Case Study on Audio Classification with VGG16

In this case study, we explore the application of transfer learning to audio classification, using VGG16 as a feature-extraction module. The researcher combines a CNN with an XGBoost model to classify songs by artist, leveraging transfer learning to extract features from short audio clips.

The researcher uses a dataset of song snippets in .wav format (mp3 files are converted first), each of which is turned into a spectrogram. She applies a convolutional neural network (CNN), the pre-trained VGG16, to these spectrograms to extract features. The resulting features are then fed into an XGBoost model, which outputs the probability that each song is sung by each artist.
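To make the preprocessing concrete, here is a minimal sketch of the snippet-and-spectrogram step. The original post's exact tooling is not specified; this version assumes librosa and matplotlib are available, and uses the 20-second clip length mentioned in the accompanying video:

```python
# Sketch only: split one song into 20-second clips and save one
# spectrogram image per clip. Assumes librosa + matplotlib.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

CLIP_SECONDS = 20  # the video mentions 20-second snippets

def song_to_spectrograms(path, out_prefix):
    y, sr = librosa.load(path, sr=22050)       # loads wav or mp3
    samples_per_clip = CLIP_SECONDS * sr
    n_clips = len(y) // samples_per_clip
    for i in range(n_clips):
        clip = y[i * samples_per_clip:(i + 1) * samples_per_clip]
        # Mel spectrogram: time on x, frequency on y, amplitude as colour
        S = librosa.feature.melspectrogram(y=clip, sr=sr)
        S_db = librosa.power_to_db(S, ref=np.max)
        fig, ax = plt.subplots()
        librosa.display.specshow(S_db, sr=sr, ax=ax)
        ax.set_axis_off()
        fig.savefig(f"{out_prefix}_clip{i}.png",
                    bbox_inches="tight", pad_inches=0)
        plt.close(fig)
```

Each saved image can then be treated like any other picture and handed to a CNN.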

The researcher uses transfer learning to leverage weights pre-trained on image data, allowing her to classify songs with limited computational resources. Rather than training the full network, she keeps VGG16 frozen as a feature extractor and trains only the comparatively lightweight XGBoost classifier on top, demonstrating the effectiveness of transfer learning in audio classification.
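Here is a sketch of what "VGG16 as a frozen feature extractor" can look like in tf.keras. The post names TensorFlow but does not show its code, so this is an illustrative version, not the author's implementation:

```python
# Sketch only: extract a fixed-length feature vector per spectrogram image.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

# include_top=False drops the classifier head; pooling="avg" yields one
# 512-dimensional vector per image. No training happens here: the
# ImageNet weights are used as-is.
extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")

def spectrogram_to_features(png_path):
    img = image.load_img(png_path, target_size=(224, 224))  # VGG16 input size
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x, verbose=0)[0]  # shape: (512,)
```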

One notable example is the song "Laser Mujhey", whose original singer is Armaan but whose predicted singer is Sonu Nigam. The probabilities show the model is not far off: 60% for Sonu Nigam, 20% for Armaan, and 20% for Arijit, so the correct singer is still among the top candidates.

This case study highlights the challenges of audio classification with limited resources and the value of transfer learning in overcoming them. The researcher's approach shows how features extracted from short audio clips, combined with a majority-style vote across the clips of a song, can yield accurate predictions, making it an accessible solution for applications with constrained computational resources (a sketch of the vote follows below).
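One simple way to realize the per-song vote is to average the clip-level probabilities. The helper below is an illustrative sketch under that assumption, not the author's exact code; it only presumes the classifier exposes scikit-learn-style predict_proba:

```python
# Sketch only: combine per-clip probabilities into one song-level ranking.
import numpy as np

def predict_singer(clip_features, model, artist_names):
    """clip_features: array of shape (n_clips, 512) for one song."""
    clip_probs = model.predict_proba(clip_features)  # (n_clips, n_artists)
    song_probs = clip_probs.mean(axis=0)             # average over clips
    ranked = np.argsort(song_probs)[::-1]            # most likely first
    return [(artist_names[i], float(song_probs[i])) for i in ranked]
```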

The benefits of this approach are evident in the success of Hina, a student who was passionate about Bollywood songs but initially knew nothing about spectrograms or audio classification techniques. With guidance from the course team, she learned the additional concepts and applied them to solve the problem. This case study illustrates the effectiveness of our course materials in giving students the skills and knowledge needed to tackle real-world problems.

The use of spectrograms is particularly noteworthy, as they have not been discussed in depth in our course material. Transfer learning with VGG16, by contrast, is covered in the course, and this example demonstrates that the technique carries over to audio data even though audio is not explicitly covered. The researcher's ability to adapt transfer learning to a new domain showcases the versatility of these techniques.

In conclusion, this case study presents a compelling demonstration of the power of transfer learning in audio classification, using VGG16 as a feature-extraction module. By leveraging pre-trained weights, the researcher is able to classify songs with limited computational resources. The example also highlights the importance of picking up adjacent techniques, such as spectrograms and Fourier transforms, to tackle complex problems.

The reference link provided in the original transcript can be accessed for further reading on this topic.

Recommendation:

For readers interested in exploring transfer learning and audio classification techniques, we recommend checking out Hina's blog post, which provides a detailed account of her experience and solutions. Additionally, our course materials cover topics such as Fourier transforms, spectrograms, and transfer learning in depth, providing a comprehensive foundation for tackling complex problems like this case study.

Technical Details:

The classification pipeline relies on the following components:

* VGG16 as a pre-trained feature extraction module

* CNN to extract features from spectrograms

* XGBoost model to classify songs

* Transfer learning to leverage weights pre-trained on image data

These techniques are applied using the following software and hardware:

* Computational resources: limited, with no access to large GPU clusters.

* Software: The researcher uses Python as the primary programming language, with libraries such as TensorFlow, scikit-learn, and XGBoost.

The use of these technologies allows for efficient processing of large audio datasets, enabling the development of accurate classification models.
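For completeness, here is a minimal sketch of training the classifier stage on the extracted features. The file names and hyperparameters are illustrative assumptions, not values from the original post:

```python
# Sketch only: train XGBoost on pre-extracted VGG16 features.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical file names: one row of VGG16 features per 20-second clip,
# with integer labels 0..9 for the ten singers.
X = np.load("clip_features.npy")   # shape (n_clips, 512)
y = np.load("clip_labels.npy")     # shape (n_clips,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# XGBClassifier selects a multi-class objective (multi:softprob)
# automatically, so predict_proba returns one probability per singer.
model = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)
print("clip-level accuracy:", accuracy_score(y_test, model.predict(X_test)))
```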

"WEBVTTKind: captionsLanguage: enthis is a very interesting case study done by one of the students named Hina Sharma and she this is a very interesting problem on music and songs especially Bollywood music which is extremely popular amongst Indians so if you look at this the problem is very interesting there is lot of research on speaker recognition there is a lot of research on if you give an audio snippet it will determine who the speaker is now she Hema wanted to extend this to Bollywood songs so what she has done here is she has taken up a bunch of Bollywood songs she has downloaded Bollywood songs from 10 artists that she really likes right she has taken would say 500 solo songs from her 10 favorite artists like our jet atif Arman shreya Sonu Nigam etcetera etcetera etcetera right so she has taken up all she has downloaded this data this audio data and now she is the past that he is trying to solve is as follows right the task that she is trying to solve is imagine if I'm given a song right can I predict she wants to predict the singer she wants to predict the singer from the song so given the audio by song I mean given the audio of the song given the audio of the song she wants to predict who the singer is now this is a very interesting experiment because there is an experiment on audio data or and because you have tens fingers you have ten solo singers in each of your songs it's a multi-class classification right it's a multi-class classification problem but the first question here is how do you convert audio into features because audio if you think about it is time series data right and audio has some very very interesting properties that actually he-man used as part of this case study right so she'll download all the data and then she converts all of the data pre-processing data obtains on the night it took her some time to actually download all these songs and create the data set itself then while working with her we suggest for her to use something called aspect programs right a spectrogram is is like is like a Fourier transform but oh but it has multiple okay let me show you an image a spectrogram again she explains what a spectrogram here is and things like that spectrograms are used extensively in all your processing it's it's it's it's one of the one of the trademark things and she learned she learnt it it's not very hard but it's because we have covered things like Fourier transforms in our course it's not very hard to understand what a spectrogram is and she got some simple code right on how to build spectrograms and this is how a spectrogram looks like right this is uh this is what a spectrogram looks like a spectrogram is basically an image representation in a nutshell it is an image representation of an audio snippet of an audio snippet so the way it works is this your x axis which is time right you have your Y axis which are frequencies right so at any time part at any time point what are all the frequencies that are there and this color coding happens by amplitude if you have more amplitude right you have yellow color if you have less amplitude you have blue color right this is this is this is what is called a spectrogram and given a shot given a let's say our twenty second audio snippet or something you can represent it using an image and this is the corresponding image to the given twenty second audio snippet right and once you have images we know that we can use convolutional neural networks right because CN NS work very well on images it will not be human images 
this is actually a computer-generated image of a spectrogram but still CN NS will work very very well even on this data so what strategy was given a big audio file right let's break it up into smaller snippets 20-second snippets each right from 20-second snippets each on each of these snippets let's convert this into an image and given this image now she can apply convolution neural network right and and we'll see follow it we'll see how she actually works she actually uses VG 16 on this dataset with some but she doesn't train the whole dataset she only trains she does transfer learning type of strategies it wherein she extract some of the features it's a feature extraction module right so we have discuss how to use transfer learning with VG 16 and stuff like in our course and she actually took the same idea here she used the transfer learning scheme you get to get features from the data and her overall structure right and her overall architecture looks like this given a song in dot wav format if you have mp3 even converted Godfather she takes small small clips or snippets right and on each of these snippets she computes a spectrogram and from each of the spectrogram she gets features using a CNN using BG 16 and she pumps these features into an XG boost model and then that is there is sort of like a majority wound that happens right because now you're taking multiple clips from the same audio song even if one of them is not sure the others will help you work it out but for example she gives an example of how this works she takes this famous song she takes five snippets here and each of the clips she converts them into into spectrograms for each of them she gets the convolutional neural network features and then she applies extra boost on top of them and at the end and then at the end what what this model gives you is it says okay that there is 80% probability that this song is sung by Kishore Kumar and there is 20% that it's sung by Kumar sang right very interesting piece of work right now she again DT goes into details about what are the results how did it work she gives some of the positive results and also some of the negative results it's important to understand the negative results for example there is this song called laser mujhey right and the original singer is actually Arman but the predicted singer is Sonu Nigam but if you look at the probabilities the probability that it gave solo Nigam is 60% and the second singer is Arman with a fourth with the 20% chance and RJ's with other 20% chance so the model is not really crazy it's predicting even the original singer is Arman it got slightly confused and probably Sonu Nigam and not Varma and also got confused that it could be urgent so here she calls out very clearly some of the some of the mistakes that it made for example here in this case the virginal saying there is more health but the predicted singer is Sonu Nigam but the probabilities are like Serena give me 60% more it is 40% and things like that is a very very interesting case study where she looks at both of them both these case studies and she looks at how to improve these models in the future etc she had very limited hardware resources so this was a pretty decent start when you don't have like huge computational resources lots of GPUs and things like that right and this block has garnered 386 likes or collapse on medium very interesting features right so there's a very good example where our students where our students are taking a problem that they're very passionate about 
in this case I believe Hina likes bollywood songs a lot and she said cannot predict the single cannot predict the single if you give me an audio snippet of course she didn't know anything she didn't know anything about spectrograms or anything like that when she got started we pointed her in the right directions we said typically for audio there's something called a spectrogram which will give you an image and on top of image you can apply CNN and if you don't have a lot of computational resources just to use VGA 16 and get the features and then pump them through an XD boost model all of that stuff right so we helped we gave her the pointers here and she ended up reading that about what spectrograms are learnt about what spectrums are and solved the whole problem on or Ramona this is what we do for all of her students we guide them in the right direction we give them pointers in the right direction to help them solve the problems that really care about so very very interesting they study I really like this case study because he not took a problem that she's really passionate about learn additional techniques we have not discussed remember do not discuss pictograms in detail in our course but we discussed about Fourier transforms and going from Fourier transform spread to spectrograms is not rocket science and she took concepts like transfer learning that we discussed in lot of detail in our course so even an audio data we found ways to leverage techniques that we've learned in the course to solve the problem fairly well so this is a very interesting case study I will provide a reference link to her blog in the description section of this videothis is a very interesting case study done by one of the students named Hina Sharma and she this is a very interesting problem on music and songs especially Bollywood music which is extremely popular amongst Indians so if you look at this the problem is very interesting there is lot of research on speaker recognition there is a lot of research on if you give an audio snippet it will determine who the speaker is now she Hema wanted to extend this to Bollywood songs so what she has done here is she has taken up a bunch of Bollywood songs she has downloaded Bollywood songs from 10 artists that she really likes right she has taken would say 500 solo songs from her 10 favorite artists like our jet atif Arman shreya Sonu Nigam etcetera etcetera etcetera right so she has taken up all she has downloaded this data this audio data and now she is the past that he is trying to solve is as follows right the task that she is trying to solve is imagine if I'm given a song right can I predict she wants to predict the singer she wants to predict the singer from the song so given the audio by song I mean given the audio of the song given the audio of the song she wants to predict who the singer is now this is a very interesting experiment because there is an experiment on audio data or and because you have tens fingers you have ten solo singers in each of your songs it's a multi-class classification right it's a multi-class classification problem but the first question here is how do you convert audio into features because audio if you think about it is time series data right and audio has some very very interesting properties that actually he-man used as part of this case study right so she'll download all the data and then she converts all of the data pre-processing data obtains on the night it took her some time to actually download all these songs and create the 
data set itself then while working with her we suggest for her to use something called aspect programs right a spectrogram is is like is like a Fourier transform but oh but it has multiple okay let me show you an image a spectrogram again she explains what a spectrogram here is and things like that spectrograms are used extensively in all your processing it's it's it's it's one of the one of the trademark things and she learned she learnt it it's not very hard but it's because we have covered things like Fourier transforms in our course it's not very hard to understand what a spectrogram is and she got some simple code right on how to build spectrograms and this is how a spectrogram looks like right this is uh this is what a spectrogram looks like a spectrogram is basically an image representation in a nutshell it is an image representation of an audio snippet of an audio snippet so the way it works is this your x axis which is time right you have your Y axis which are frequencies right so at any time part at any time point what are all the frequencies that are there and this color coding happens by amplitude if you have more amplitude right you have yellow color if you have less amplitude you have blue color right this is this is this is what is called a spectrogram and given a shot given a let's say our twenty second audio snippet or something you can represent it using an image and this is the corresponding image to the given twenty second audio snippet right and once you have images we know that we can use convolutional neural networks right because CN NS work very well on images it will not be human images this is actually a computer-generated image of a spectrogram but still CN NS will work very very well even on this data so what strategy was given a big audio file right let's break it up into smaller snippets 20-second snippets each right from 20-second snippets each on each of these snippets let's convert this into an image and given this image now she can apply convolution neural network right and and we'll see follow it we'll see how she actually works she actually uses VG 16 on this dataset with some but she doesn't train the whole dataset she only trains she does transfer learning type of strategies it wherein she extract some of the features it's a feature extraction module right so we have discuss how to use transfer learning with VG 16 and stuff like in our course and she actually took the same idea here she used the transfer learning scheme you get to get features from the data and her overall structure right and her overall architecture looks like this given a song in dot wav format if you have mp3 even converted Godfather she takes small small clips or snippets right and on each of these snippets she computes a spectrogram and from each of the spectrogram she gets features using a CNN using BG 16 and she pumps these features into an XG boost model and then that is there is sort of like a majority wound that happens right because now you're taking multiple clips from the same audio song even if one of them is not sure the others will help you work it out but for example she gives an example of how this works she takes this famous song she takes five snippets here and each of the clips she converts them into into spectrograms for each of them she gets the convolutional neural network features and then she applies extra boost on top of them and at the end and then at the end what what this model gives you is it says okay that there is 80% probability that this song is sung 
by Kishore Kumar and there is 20% that it's sung by Kumar sang right very interesting piece of work right now she again DT goes into details about what are the results how did it work she gives some of the positive results and also some of the negative results it's important to understand the negative results for example there is this song called laser mujhey right and the original singer is actually Arman but the predicted singer is Sonu Nigam but if you look at the probabilities the probability that it gave solo Nigam is 60% and the second singer is Arman with a fourth with the 20% chance and RJ's with other 20% chance so the model is not really crazy it's predicting even the original singer is Arman it got slightly confused and probably Sonu Nigam and not Varma and also got confused that it could be urgent so here she calls out very clearly some of the some of the mistakes that it made for example here in this case the virginal saying there is more health but the predicted singer is Sonu Nigam but the probabilities are like Serena give me 60% more it is 40% and things like that is a very very interesting case study where she looks at both of them both these case studies and she looks at how to improve these models in the future etc she had very limited hardware resources so this was a pretty decent start when you don't have like huge computational resources lots of GPUs and things like that right and this block has garnered 386 likes or collapse on medium very interesting features right so there's a very good example where our students where our students are taking a problem that they're very passionate about in this case I believe Hina likes bollywood songs a lot and she said cannot predict the single cannot predict the single if you give me an audio snippet of course she didn't know anything she didn't know anything about spectrograms or anything like that when she got started we pointed her in the right directions we said typically for audio there's something called a spectrogram which will give you an image and on top of image you can apply CNN and if you don't have a lot of computational resources just to use VGA 16 and get the features and then pump them through an XD boost model all of that stuff right so we helped we gave her the pointers here and she ended up reading that about what spectrograms are learnt about what spectrums are and solved the whole problem on or Ramona this is what we do for all of her students we guide them in the right direction we give them pointers in the right direction to help them solve the problems that really care about so very very interesting they study I really like this case study because he not took a problem that she's really passionate about learn additional techniques we have not discussed remember do not discuss pictograms in detail in our course but we discussed about Fourier transforms and going from Fourier transform spread to spectrograms is not rocket science and she took concepts like transfer learning that we discussed in lot of detail in our course so even an audio data we found ways to leverage techniques that we've learned in the course to solve the problem fairly well so this is a very interesting case study I will provide a reference link to her blog in the description section of this video\n"