Matching Networks for One Shot Learning @ TWiML Online Meetup - EMEA - 2 April 2019

**Introduction to One-Shot Learning**

One-shot learning has gained significant attention in recent years for its potential to change how we approach machine learning, particularly image classification. A one-shot learner acquires a new class from a single labeled example (or a handful of them) rather than the large datasets that conventional supervised learning requires. This article, based on a TWiML Online Meetup discussion of the Matching Networks paper (Vinyals et al., NIPS 2016, DeepMind), covers its key ideas, applications, and limitations.

**Parametric Perspective**

From a parametric perspective, one-shot learning can be viewed as a form of metric learning. The model's parameters are spent not on memorizing classes but on learning an embedding: a mapping of inputs into a space where examples of the same class lie close together and examples of different classes lie far apart. In image classification, this means the network learns to embed images so that a simple distance (Matching Networks use cosine similarity) captures class membership. The key insight is that a metric learned on the training classes transfers to unseen classes, so classifying a new class needs only a single labeled example per class rather than a large dataset. A minimal sketch follows.
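
As an illustration, the sketch below classifies a query by embedding it and picking the nearest labeled support example under cosine similarity. The `embed` function is an assumption standing in for any pretrained feature extractor (e.g., the penultimate layer of a VGG or Inception network, as in the paper); `one_shot_classify` and the argument names are hypothetical, not from the paper's code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def one_shot_classify(query_image, support_images, support_labels, embed):
    """Classify a query image as the label of its nearest support example
    in the learned embedding space (hard nearest neighbor).

    embed: assumed to map an image to a feature vector, e.g. the
    penultimate layer of a pretrained CNN.
    """
    q = embed(query_image)
    sims = [cosine_similarity(q, embed(x)) for x in support_images]
    return support_labels[int(np.argmax(sims))]
```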

**Non-Parametric Perspective**

From a non-parametric perspective, one-shot learning can be viewed as a form of k-nearest-neighbors (KNN) classification. No per-class parameters are fit; instead, a new, unseen query is classified by comparing it against a small labeled support set and propagating the labels of the most similar support examples. Matching Networks soften this into what Andrej Karpathy, in his notes on the paper, called a "differentiable nearest neighbor": rather than a hard vote among the k closest points, the prediction is a weighted combination of all support labels, with weights reflecting how similar each support example is to the query.
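
Concretely, the Matching Networks prediction rule, reconstructed here from the talk's description (notation may differ slightly from the paper): $f$ embeds the query $\hat{x}$, $g$ embeds each support example $x_i$, $c$ is cosine similarity, and $y_i$ is the one-hot label of $x_i$:

$$\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i, \qquad a(\hat{x}, x_i) = \frac{\exp\!\big(c(f(\hat{x}), g(x_i))\big)}{\sum_{j=1}^{k} \exp\!\big(c(f(\hat{x}), g(x_j))\big)}$$

Because the attention weights sum to one, $\hat{y}$ is a probability distribution over the support classes, and the whole pipeline remains differentiable end to end.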

**Training and Support Set**

Another important aspect of one-shot learning is the relationship between the training set and the support set. The support set is the small collection of labeled examples, one per class in the one-shot case, that the model is given at prediction time; the targets (the paper calls them the batch) are the unlabeled queries to be classified against it. Crucially, the label spaces used for training and testing are disjoint. The paper's second contribution is a training strategy that mirrors this: rather than conventional supervised training, the model is trained on episodes, each built from a sampled support set and a batch of queries, so that training conditions match test conditions. When the training and test class distributions differ sharply, for example training on all non-dog ImageNet classes and testing only on dog breeds, accuracy degrades.
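
A hedged sketch of this episodic sampling, assuming a `data_by_class` dictionary of training examples (all names here are illustrative; note the paper keeps support sizes fixed, between 5 and 25):

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, n_query=5):
    """Sample one training episode that mirrors test-time conditions:
    a small labeled support set plus a query batch, both drawn from
    the same randomly chosen n_way classes.

    data_by_class: dict mapping class label -> list of examples.
    """
    classes = random.sample(list(data_by_class), n_way)
    support, queries = [], []
    for label in classes:
        examples = random.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]   # labeled supports
        queries += [(x, label) for x in examples[k_shot:]]   # held-out queries
    return support, queries
```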

**Computational Complexity**

One limitation of this family of models is computational cost. The attention step compares each query against every support example, so prediction cost grows with the size of the support set, and the full context embeddings additionally run the entire support set through an LSTM. This can be mitigated with techniques such as pruning or subsampling the support set, but those in turn require careful tuning of hyperparameters.

**Sampling Strategies**

Several sampling strategies have been proposed to address this computational burden, including random sampling, stratified sampling, and importance sampling. Each has its own strengths and weaknesses, and the choice depends on the problem at hand; in the episodic setting it also governs how support sets and query batches are drawn for each training episode.

**Attention Mechanisms**

Attention mechanisms are central to Matching Networks. Rather than weighting every support example equally, the model computes an attention weight for each one: a softmax over the cosine similarities between the embedded query and each embedded support example, as in the equation above. The prediction is then the attention-weighted combination of the support labels. This is particularly useful when the support set contains irrelevant or noisy points, which simply receive low weight, and it has been shown to significantly improve one-shot accuracy. A sketch of the prediction step follows.
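
A minimal NumPy sketch of this prediction step, implementing the softmax-over-cosine-similarities kernel from the equation earlier (function names are illustrative; in the paper the embeddings would come from the full context embedding functions f and g):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_predict(query_emb, support_embs, support_onehots):
    """Predict a class distribution for one query as the
    attention-weighted combination of support labels.

    query_emb:       shape (d,), embedded query
    support_embs:    shape (k, d), embedded support examples
    support_onehots: shape (k, n_classes), one-hot support labels
    """
    norms = np.linalg.norm(support_embs, axis=1) * np.linalg.norm(query_emb)
    sims = support_embs @ query_emb / (norms + 1e-8)  # cosine similarities
    attn = softmax(sims)                              # attention weights
    return attn @ support_onehots                     # distribution over classes
```

The returned vector is a probability distribution over the support classes; taking its argmax gives the hard prediction.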

**Biological Plausibility**

The question of whether one-shot learning is biologically plausible sparked an interesting debate during the discussion. Some argue that children genuinely learn new concepts from single examples, while others propose alternative explanations, such as repeated exposure to training examples from their parents. One intuition raised in favor of the one-shot view: if you are told that a "red zebra" is simply a zebra with red stripes, you can recognize pictures of red zebras immediately, from a single description. The next section proposes an experiment to test the idea in a real-world scenario.

**Experiment**

To probe this, consider a simple experiment on a real-world dataset. Participants are shown images of animals (e.g., cats, dogs) along with a brief description of each animal's characteristics (e.g., size, color), and are then asked to identify the animal from a set of distractors. The hypothesis is that a one-shot learning model, given the same single example and description per class, would outperform traditional machine learning models on this task.

**Conclusion**

In conclusion, one-shot learning has gained significant attention for its potential to change how we approach machine learning, particularly image classification. From a parametric perspective it is a form of metric learning; from a non-parametric perspective it is a differentiable form of KNN classification, with predictions formed as weighted combinations of support labels. Its key challenges include computational cost as the support set grows, sensitivity to mismatch between the training and test class distributions, and open questions such as the effect of support-set ordering on the full context embeddings. Further research is needed to understand the strengths and weaknesses of the approach and to develop more effective models for real-world applications.

**References**

The talk cited a number of resources, including blog posts, papers, and code implementations:

* Vinyals et al., "Matching Networks for One Shot Learning", NIPS 2016 (the paper under discussion)

* Papers with Code entry for the paper, which links to several implementations

* TensorFlow implementation

* PyTorch implementation

* Keras implementation

These resources provide a wealth of information on one-shot learning and its applications, and can be used to further explore this topic.

**Other Comments**

Several other comments and questions came up during the discussion, including:

* "I think it's really hard to imagine that with a couple of like really low number of samples that you can actually achieve something"

* "But I think I kind of grasp the main features of that like to me it's always like really hard to understand"

These comments highlight the challenges and complexities of one-shot learning, but also demonstrate its potential for revolutionizing machine learning.

"WEBVTTKind: captionsLanguage: enwelcome to the seventh European and Asian and Afghan online Meetup I'm filling in for CAI and will co-host from now my name is Christian man I'm not sure people from the slack will probably know me and today we will hear about wanted learning really looking forward to that one and Sam will present that paper later so again apologies for the confusion with the starting time apparently we messed up with a European summer time and had a defect in the reminder there will be an update briefly about how we proceed from now on but for the moment I hope most got the note and could join so our schedule for today as with the previous meetups just the same introduction we will do a bit of community discussion and then go to the main topic the presentation and that's gonna be roughly like half an hour followed by a Q&A and then we will wrap it up so sure how Sam how you you want to handle questions should people wait till the end or you you happy to take questions in between yeah I'll be I'll prefer to take question in between okay cool so so we do that and I guess most of you know about the schedule that's a monthly meter basically and in May we already have the next one scheduled I did it that's gonna be about prototype based classifiers in the drift not quite sure what that's about looking forward to that and in general if you're happy to take one for the team and basically also want to present we encourage you to just reach out to us kayo me on the slack channel or by a LinkedIn or any other means and it's really really nice experience for yourself if you're starting out that's fine as well I don't have to be an expert on the topic of the paper just really try to make sense of it and what do you think about it and like I did it I did it as well it's really not super much effort just stick with a paper you like and try to present it to the audience and if not everything is clear that's fine as well we will discuss any open topics so if you're eager of doing that just reach out natural which slots are taking yet but Kai knows and trusts propose something and we'll find at eight I'd say maybe a brief update on how things are going for the study group the first day I study group party two is underway and we had the second session next one is this it's pretty intense I have to say but it's it's well rewarding so for people who missed out on this one you can look forward to it's really much more it's very different to part one and it's much more bottom-up and really into the details but it's pretty good you really enjoy it for handling this this meetup here I will I guess I will mute most of you or all of you and then if you want to ask a question or if you want to speak up just unmute yourself so we keep the noise down basically yeah and if there's any other inquiries or questions or your entrepot stuff just ping us on slack and you can also use the the chat within zoom I'll try to monitor what's going on there and otherwise just unmute yourself and ask our presenter directly okay so for the community it is Ivanov in its open floor whatever up your mind just unmute yourself and ask so basically any questions about ongoing courses what's happening at the moment topics you would like to hear more about so like anything use particular excited about what's happening could be what you're working on anything new that you just spotted and thing is like super cool project papers whatever comes to mind so whatever you discovered in the last couple of weeks you think is not 
newsworthy so anyone no one really I mean I I can maybe started off with with one topic that I think was really gaining traction was definitely a swift for tensorflow because that's also gonna be the second the the two last lessons in the fast ahead course will be about Swift for tens of flow and there's quite some interests very little documentation yet very little actual stuff out there but it seems to be like a lot of people are super excited and there's also some interest about learning Swift like I guess it's it's pretty new language and a lot of people maybe only know it from iOS development if at all so we also started a new slack topic channel on swift or switch for tensorflow so if you have any want to discuss any stuff like that you can post it in this like channel and there's also one faster I dedicated forum that's open to everyone hairbrained that's specifically about the plans for bringing fast a high to Swift for tender flow so it's very early days but it seems to be like Jeremy's really keen on it and there might be some development happening anyone anything from you guys hi guys can you get me yeah maybe you can speak up a bit okay can you hit me right now mm-hmm okay so thank you so much for their Meetup I've been coming around my name is Steven by the way from Nigeria and I've been coming to the meetup a lot and it's been amazing so on currently I've been working on something and I wanted to bring it to the meetup to be able to get imputes and probably maybe know how I can do it better so I'm working on a project and it's for two more images for pathologists right yeah so I called the cyber oncology it's basically a deep learning driven software that would help pathologists to make quicker and faster and more accurate diagnoses of different types of tumor not just making classification but also staging it and and also trying to recognize the objects in the to Marcus you know tumor images it's not just enough to say that um it's a carcinoma or it's a it's are a transitional cell carcinoma squamous cell carcinoma it's more like can you tell what what what's in the picture what in the chemo image that classifies as a customer or so it more like object detection right but I've been looking around for a beauty proof of concept it's working it can make about 80 to 90 percent missing a character specification boy I've not been able to see a research paper that focused on an application to two more images of different classes and being able to recognize objects and some a question is okay what I'm bringing in it some are there are there like methods to get that that I could use to for this kind of project should be sure that I get not just accuracy on the image and also to be able to organize different type of objects in the image I don't know if I'm making myself clear like what's the thing you you striking with different classification or okay okay okay let me a classical scenario would be that okay cuz I'm in medical school and I'm in final year in medical school so usually it takes about three days to biopsy an HMO image and then properly staged it takes about three days to do that and so like I am trying to reduce that work by being able to not just classify a tumor image I want to be able to stage it too and I want to be able to recognize the object in the tumor image now that's a little bit hard for me because a team or image almost all the time looks the same so I don't know my first question is are there any like research papers that anyone knows that I could use maybe 
that because I've tried searching I'm not seeing anything on two more images an object detection in a tomb or image like that I have a question what you mean by what do you mean by stage it when you say okay okay staging right okay so it ain't no in pathology when you when you biopsy and when you biopsy an image right you know you want to know whether it's stage 1 or stage 2 or stage 3 cancer all right yeah so staging it's the classification goes like this for instance you can say that is a lobular ductal carcinoma of stage 3 so and jerry' there are features in this biopsy on two more image that makes it possible for you to say that it's stage 3 or stage 1 or stage 2 so but I don't even I've not seen any I don't know okay I don't think I don't know how to go I don't know how to go about that because that's like what I want to solve that is my main angle for the problem just being able to classify it because that's like trivial but being able to but but but you do you have a labeled data set of stages like do you have a label data set of images with along the stage cancer and okay okay that is a that is a um like oh the new shift I'm making now but I am thinking that because chemo images they almost always look alike and I thought maybe if I could see a paper or a particular if anybody has a more proper guidance on how I'd go about this so that I don't like with my time on the wrong track yeah hi Steven this is Philip from Germany I've just heard your story I think it's it's very very interesting as to your question regarding this paper are you aware of the paper of and Rangga he did something similar I think to your project with radiological images I will look up the paper and send it in the chat to you I think that might be in a similar direction as yours ok ok I read up the work on units and like building a unit and your net to do that and taking all the objects in but I don't know it didn't really seem to fit but if I'd love to look at that thank you sir I will google it and send it in the chat to you ok so so it seems like the problem you're doing is Israel you're saying it's not just classification but yeah it's yeah it seems that it is I mean in the sense that you need to classify the type of cancer and the stage so you have to make two classification because that is when it makes more clinical interest in make more clinical sense because classifying a tumor that's fine but what stage is it is it is it at the point where you got to worry so much or not like because you know it's gone now if you just classify a team or image and tell someone that is a squamous cell carcinoma you're not really helping you're not really helping because yeah and and I'm trying to give pathologist a second opinion not just telling them that it is a low blood dr. 
carcinoma I want to tell them why I feel like it's that so I want to never the machine to be able to or the network to be able to tell the type of cancer and show the object that is guiding the classification on the stage huh I see it's a test that maybe I own slack I know just just join us like and maybe open topic there and then because I think medical applications are quite interesting but not sure how many people are experienced there maybe you find couple of more input so if you just join our selection and and I just go to the to the deep learning the appropriate general generally and trust type in what what you're playing there and we can clip in and then it's much easier and more focused I guess alright oh also in the fat in the fast day I group there are several doctors who are interested in imaging so if you cert you know if you search some of the forums and look for comments you'll find them okay yeah that's great that's great cuz I want to do that before I get out of medical school whoops there should be some people around in the in the forums and maybe even hours like that that did some have some experience with medical data so maybe you find someone okay so how do I channel mmm it's to Emily I got slack calm yeah why are the trim layer a homepage I have a suggestion for him but I don't know keep moving a bit maybe connect in our zoom chat or yeah you know it's like like in the zoom type you can interact right now if you want to point something out but I will have his question how's that come here who just asked the question about medical imaging Steven Steven right that's my name's Steven yeah can I drop my someone is asking if I can drop my contact info here can I do that yeah just just open the chat and and put in and okay all right thank you so much I'll do that Horace okay all right I think I don't know maybe maybe if someone has a shorty we can put it in otherwise I would ask you guys that we moved to the main presentation anyone else something concise and pressing that we want to discuss no okay then then I would say we start with the main presentation I will and my screen sharing and you for the Sam you should be able to share yours okay Sam can you grab the screen and enjoy your snot yeah coming up hey hello everyone and a very good evening yeah so I am so you can call me Sam I'm a PhD student at IIT has there was in in India so my research area is mainly like multilevel supervised classifications so this is I'll be talking about matching networks for one-shot learning this is kind of a very important paper in the area of one-shot learning so this this is this was published in 2016 nips nips by deep mind so first to start with the paper I will just give a little bit about the abstract like what is the paper about so basically they used one-shot learning with attention and memory I'll define what is I mean how that insulation used in how the memory is used and basically they actually this paper has two kind of contribution first one is a model architecture for one-shot learning and the other contribution is a they proposed a new kind of training strategy which goes very well with the ones of learning scenario so like uniform training and testing strategy and because of this like they are kind of trying to utilize both parametric and non-parametric learning approaches so both has some like trade-off so they are kind of trying to take the best of both worlds and so like the architecture can be summarized as such that it's kind of although the original authors didn't 
coined this term it the term was coined by hundred carpathian his notes about this paper so he call it called it differentiable nearest neighbor so basically differentiable nearest neighbor works like taking the advantage of both parametric and nonparametric models so using this matching network they improve the accuracy of imaginate so basically they didn't exactly run the whole code on the whole image net so they created an three subset of it one is called mini imaginate another is called dog image in it another is called random weights in it and using the this hold that the accuracy increased from 87 percent to 90 percent and there is another data set called Omniglot there also they improved the accuracy okay so basically let's start with supervised learning in traditional supervised learning taste levels are used during training so what do I mean by that dislike the label space is C same for training phase and predict surface so here and there in the training phase you have the labels airplane auto world WorldCat and here so you learn a classifier on these level space and you for a new sample you predict on these levels based on the airplane auto mobile market and gear but in case of one short learning that is not the case in case of one short learning you can in turning space you have the label space like dog frog horse sheep and truck so you learn a classifier on this level space but when predicting you will be predicting on the right image like right side image like airplane automobile Bart carrier if you see that there is these two sets are completely disjoint so what although these two sets are disjoint there is another set on the right image so that with labels with the same level which are which you are one two you want to predict so those are called supports so I will tell you I will discuss with the idea behind the support so so basic idea of one so planning is let's say I show you but basically how humans learn so like here you will be shown a single image of Gibran let's say and you can understand what zebras basically made of and you can see like you can categorize any other zebras how are non zebras so like this idea of just seeing only one in one example and learning from that that is the basic crux of one-shot learning so so basically in one-shot learning points of learning can like this is like if sick the domain of learning strategies so this is basically you could say like in wake a short learning so let's say you have a on the left side you have a training data set and on the right side you have the testing data set here also they're both both of the level space are disjoint so you have a training task which comprises of the training set data set and you have it a testing tasks which comprise of the testing data set now during the testing time you choose the slaves set up labels from the testing tasks it could be equal to the size of the testing staff also here the testing size is 5 but here for the for simplicity you are just taking 3 3 level so and for each level you are taking single examples so this is what is called as Quartet so these are the examples you get to see for each of the categories you have chosen but the query will be from these sub one of these levels set like from this layer in L Prime so this is the basic in this case actually if you see this is 3-way one-shot learning why 3-way because the label space is made up of three levels automobile cat and deer and for each of the levels there is only one sample that is like either one one car one cat and 
one gear so you'll have to predict for a new sample for two automobile cat of deer so this is called three or a one-shot learning it could be like you can take twenty levels and for each level five samples so then it will be called twenty way five short running so in this paper paper basically we will look into one sort running so the as I said previously the contributions are twofold one is a matching networks another is training strategy so I will start with the matching Network then I will go to the standing strategy so matching network is basically the idea behind matching Network is like like in case of a Wi-Fi dude like parametric models usually take so many samples to learn so for each of the classes parametric models actually learn the properties of each of the classes using lots of samples into their parameters but in case of nonparametric learning not all not all but in few of that non pattern non parametric learning one of them is K nearest neighbor in you don't need any learning as such you don't need any training so you can just project it to some space and find the nearest sample and you can just propagate the levant from the nearest sample to the to your taste sample and you can just give that as a you have predicted level right but the problem with this nonparametric KNN is that when you are projecting it to some space what kind of metric you are choosing to define the similarity how do you define the nearest so that actually matters a lot so here in this matching networks actually try to solve both of the problems so it's kind of a parametric model yeah is there any question okay yeah so it's it's basically a parametric model but the parametric model only works to define the metric so once the metric is defined on the how do one see your nearest neighbors how do you define the nearest neighbor is done then you don't need the parameter parameters anymore so let's see how how it is done so basically as so it works on two fundamental points one is called full context embedding and attention kernel so I will discuss I'll start my discussion with full context embedding so in full context embedding so basically here the truth based matching needs has two inputs one is called supports another is called targets so basically although here I will be calling them as targets in the paper it's not mentioned as targets it's like samples or something so supports are embedded in such so they're in the matching Network you are using a cosines distance here the distance is fixed but you are changing the embedding so that the similar Mary name mei-hua neighbors which are nearest so by D because of their embedding should come to a closure in a in their embedded space so here the idea is take each of the supports and embed them in such a way that it's all the other supports are also in encoded in this so for each support each sample in a support basically you are embedding the are all the other supports present in that support set so this is done by via a bi-directional LSD here if you see the first equation so G is missing X eyes are the supports you want to embed and X is basically element of s yes is the supports it so X each of the Exile is flashed through as G prime comes and so G prime concern is basically any pre-trained embedding you could you could take inception Nate or VG gen8 or any any other thing so you read a taste some embedding after passing through be unit or inception net and then you run it through the lsdm and when running it through the LST M you encode the X I with along 
with its supports so the new X I which you get is basically a combination of G G prime X I'll and all the other samples present in that particular support set so if you change the support set you will get a new embedding so for even like if you change the ordering of the support set you will get a new embedding so although the ordering part is not mentioned in the paper but like in in general in case of lsdm if you change the orderly will get a new embedding but according to the paper looks like they have like taken it into account that ordering is random so they haven't mentioned anything specific about the ordering we've been a particular support set so this is the way to invade the support set then coming into the this this this is how it looks like so all these let's say there is a data set which have comprised of only pictures of different types of bouts so let's say this all that pictures will be passing through a G prime function so that's like kind of be degenerate or some other network then it will pass through an LS here and this lsdm is connected this LST will connect all the other samples in that particular support set and I will get a embedding which is a kind of a combination of that particular image and all the other images in that particular support set for each of those samples so you are trying to capture the nearest neighbor me in the index so basically you are considering that whichever samples supports are present in the supports yet they are kind of nearest neighbor so you are trying to capture the nearest neighbor into the embedding itself so this is a quick question yeah so what what's the output of G prime here like is that the predicted class label or is that something earlier in the network like the features that it's learned yeah so D prime is basically the BG's in it let's say it's in it so you pass it through the BG unit you get the BDD that is already pre trained you have to have you'll have to imagine it's already pre can you pass it through the big internet and what every embedding you will get that is used as a input to this matching network okay okay make sense Thanks yeah okay so this is another part so this is the second function which is called in the paper it is mentioned as if so this function is basically used to embed the targets targets means for which you want to predict for you so one thing I should reiterate again so for the supports you have the level information you don't read the like it's not hidden but in case of targets the level information is not present so you will have you will embed the targets in such a way that the samples present in the support set that is also an encoded in that target so how do they do it is basically like this so they use by a bidirectional in the system here also with attention so not so here although the in the paper they mostly used by direction and resistant but it's not necessary that you can you should use by dates in an LS time you can use anything so like that's like kind of open in this area so but they define this LST I mean such a way that again if here is P F prime is basically again some visas unit or inception net to take the output of that particular network and pass it through an election where the hidden net hidden network is hidden excuse me yeah he didn't sell ours B is defined in such a way that basically you you define the hidden cell as a combination of just a minute yeah let me just check my notes here yeah so basically here you try to take the relevance of the some support sets samples 
in the support said and the use data elements as a attention so how do you get the relevance so basically relevant since calculated here in the third equation so here is the attention so attention is defined as softmax of hidden state multi multiplied with our samples in the supports it so when at the unis beginning very beginning this is like this equations and even after ki Terezin so at the very beginning where a hidden state is like h1 so the then the G of X I is not in embedded in it so you take the simple h1 and you take that G of X I multiply them and you will get a scalar and there that scalar we work as a attention value and that attention you multiply with each of the corresponding add attention you multiply with the corresponding supports example and you will get the embedding of some samples of support set with their corresponding relevant that relevancy used along with the hidden state in the next layer so you merge both of them and so from the next okay let me just show you the pictures it will be clearer so so this is how the network looks like so you get the supports from here from the G Prime and you and then pass it through the G and you will get the G of excise and you take the hidden state H in the beginning it will be H of one and you multiply H of one with each of the GI HDX eyes and you get some relevant score and you take the pin live in Spore and I take the weighted sum and you will get the whole relevant vector for the whole support set and you then you use that as a input to the next iteration of the LST m and in this next idea lesson HK basically you like basically in the in the same time it will be H 2 so in H 2 you get an elicit iam embedding where the relevance and the supports are also included included so f is basically a function of two things one is X prime X prime is basically the in test image and s because s is basically the support set so when you are taking this G input so that that is that can be considered as s so HK is basically you could say that like output for the F function so if you and HK minus 1 will be like previous level and initially when it is H 1 there will be no support included in it does that make sense okay yeah I need a bit to wrap my head around but I can't really pinpoint where I'm stuck but maybe maybe yeah it's a complicated topic maybe it gets busier when you progress yeah so okay let me just reiterate it a little bit so the bit complicated part is like the equation is in the paper also is given after Kate's tape so it's not from h1 to h2 like that so in the initial state you don't have any support information embedded in the ilist IAM but but you take the h1 multiplied with the all the supports you take the relevant score and then again use that relevant score as a attention and multiply those relevants who use that at and send on the supports again so basically in this equation where are K minus 1 is defined so this attention of H K minus 1 if you take like the first IDs and it will be H 1 multiplying it G of X excise so all the excise like cardinality of s like for all the supports you multiply them we will get a relevant score for each of the supports and multiply them along with there that that's a particular supports embedding ok and the size of the support set is like 3 in this case right because of three dogs and that's that's yes it's hard to imagine that it's enough to inform the model but maybe I mean it apparently works so yeah so you can actually it's not it certainly that you have to fix to three only you can 
decide what although it has in the paper it has a limit it's like the minimum support set they have used is 5 and the maximum they have used is 25 so like yeah so like they use to like total 5 types of classes and for each of the classes you I that take one sample each or five sample each okay yeah so then the this attention was used only for the admitting full context embedding on the f functions then there is a another attention which is actually used for all the samples so there is a picture in the below part in a mixed part so it will be clearer so it works like this so you predict based on you take the attention of X hat and X I so here X I comes from the supports and X hat is the target which were or which you want to predict and multiply the attention along with the supports levels representation so although it is not the mentioned in the paper but the labeling represent a semi is actually 100 encoding of the supports okay so in some sense you could think of it like this so a prediction is basically a linear combination of labels in the support sets so for each target you let us say first predicts first support belongs to plow like first class so it embedded 100 meeting will be like 1 0 0 and so they can be supported belongs to class number two so it's 100 meeting will will be like 0 1 0 and third one in our class number 3 so it can be like 0 1 0 0 1 so you take the attention attention here this is basically defined as the softmax function of cosine similarity of their embedding so here f of X hat is basically the embedding of accept and G of X X I is basically the embedding of X I and you take the cosine distance between them and that acts as a attention between the X hat and the excise so that cosine sim distance is basically this value 0.2 or 0.5 and 0.3 and you multiply it the support support set who are not embedding and you will get something like a vector which will be like because of the softmax function it will be like a probability distribution and you you can take the highest probability as the output that makes it much clearer yeah so basically they train this network using a cross entropy loss and one more thing says although the authors use cosine distance so you can use some other distance also not necessary you will have to use cosine distance only so this this is how the network looks like once you have the G and F so basically you get the support set excise and you take the X hat they pass it through F you get a f of X hat and s and you take this you pass them through again another attention this is the like model level attention so this attention will give you the similarity between the X hat which is a test sample along with their with the supports it has in that particular set so you will get a vector for each of the classes and then you multiply the 100 tests of each of the classes and whatever like maximum you can predict as the output so like this is the same architecture with the equations so here you are basically using cosine similarity then pass it through a softmax function that will give you a tension then use that and send function between the x hat and the x i's and when you have the similarity between X hat and all of them x i's then take that and multiplying with corresponding Y I basically you are hurt anybody and you will get a probability distribution over the level space and that will be your final output so this is the basic idea of matching networks so there are like even like you can use CNN and other yeah is there any question okay so so 
this is the second part of the contribution of the paper so they actually come up with a training strategy so this is kind of like a kind interestings like so in that initial part when I described one-shot learning I described how the testing phase works right prediction phase so they tried to replicate the test prediction phase he exactly same to the training phase and for some reason people did not do that before this paper so this is like kind of this is how it should be done but previously people who are using some other approach so like same as the one-shot learning approach you take it instead of the testing task you take the training task we choose a set of levels and you create the support set from that particular set of levels and you choose another set of backs and that here in the paper they are called as badge so that is basically the same thing which I called as a target so I whatever I mentioned as target is basically the batch in the paper so you from the label space level set L you create two batches one is for support set and other needs for the targets so in the support city you have the level information in that batch or targets you don't have the level information so this is kind of like exactly similar to the one horse one-shot learning prediction phase and like but before this paper nobody did that so they they actually tried to do it like a normal supervised learning approach and then applied one sort learning in the prediction phase only but in this paper they applied in one-shot learning strategy in the training phase also so this is the second contribution in their paper so this is the like datasets they have used and the results they basically tested on three data sets or Omniglot and imaginate so image note has basically three different subsets million state and imagine it Rand and Internet dog I'll describe in the letter what they mean and there is one language modeling of task that is called that is done on that pin treebank data set so um in the Omniglot data set so it is basically a combination of 16 123 different characters characters and each of the characters has 20 different each of the characters was written by 20 different person so each class has 20 samples with them so in this case so they basically used compare their network with the convolutional Siamese network and there is another network I haven't been through the related work section much in detail but like you can go look into the paper for that part so they solved that with for some weird reason actually without the fine tuning their accuracy is better in most of the cases and full context templating didn't help much in this cases and thus Linus Simon's network actually fire with fine taining they improve but in case of matching network fine with fine tearing like they kind of big tree still little rather than improving for some reason it's not actually mentioned why it happened this is for the Omniglot and this is for the image in it so imagine it has the three parts this is called a mini image rate so from the image that they created subset of imaginate with 80 classes for training and 20 classes for testing and they found that their model with fine-tuning and full context embedding is basically working a little better without so the last two rows are basically with full context embedding and the before second-to-last rows are without full context embedding so like it improve the accuracy like two to four percent and fine taining also improved energy so a full context embedding shown little 
effectiveness in this case but not much then there is a two other data sets one is called random ignite and another is called Dom image net so random unit is basically comprised of random classes 882 class randomly selected classes and for training and remaining classes 118 classes for testing but in case of dogs imagenet all the non dogs classes are selected for the training purpose and only dog classes like some such species of dogs are chosen for the testing purpose so tasting sample has only has browse nothing else but if you see in this case so the matching net is basically doesn't do better in case of one-shot learning in case of in the dog's data set so one of the like result like one of the reason that matching net is not working better than the inception classifier there they mentioned is like because the in the dog's image rate so testing only has dogs and training has except dogs everything so basically the distribution between training and testing is very differ so that is a kind of a challenge for matching it if the training and testing distribution are very different then matching it doesn't work well all the time so okay so this is the pin treebank so they kind of defined a new task for this operation so the task is something like this you have a set of sentences from each of the sentence one word is missing so that particular missing word is basically the level so here you have five sentences and each of the missing words is like this in the first sentence prominent the world prominent is missing in the second sentence the word series is missing in the third sentence the word dollar is missing in the fourth sentence towel and KO gives in and comprehensive and there is one query sentence in there is also one word missing in the query sentence but this missing word will be either of these five words so this is how the we defined the task in the language modeling case so in this case so basically they compared their result with the Oracle lsdm language model so here the Oracle Alice team basically means that it has seen all the words presents so it has everything like not what's one short like Oracle is basically the upper bound of the scenario so like it has seen all the words then predicted and in case of one sort it haven't seen those words and then tried to predict so matching it works pretty well it's like for one shortage like 32.4% accuracy for short sixty 36.1% accuracy and three source 38.2% accuracy so like if you increase the number of samples present the accuracy usually increases and in conclusion so the nonparametric stress structure gives the matching networks like the ability like to learn from single samples only you don't need like thousands of samples formatting networks to work it's basically to summarize the model it's like a trainable into end differentiable nearest-neighbor which metric landing capability so it it works very well in three different data sets and training a model they propose the training or a phase also like a one short way make makes the learning easier before this paper like for some reason researchers didn't use the same training and testing phase so like they use the traditional training procedure and then the testing phase that they use the one-shot learning way but they this paper change that one so this is few of the remarks from my side this is not mentioned in the paper so like this is basically in summary like they introduced the matching networks and so from the you can see these paper in few different perspectives so if you 
see from the parametric perspective you can say it like it's basically a paper of metric learning if you see from the nonparametric perspective it's basically a KNN where the level predictions are linear combinations of nearest neighbors in the support set and another part of this paper is training and support cyclists level distribution should be closed as we seen in the dog imagenet case if the training set and testing set supports are not very same distribution are very different it might not work well and one other thing which is not mentioned in the paper is that when you are passing the samples through the FC the ordering of the samples are not mentioned to have any significance but in usually it usually have a significance let's say if you in the F function if you give it if you want to embed the X hat but the ordering of the support is like x1 two XK 2 X n and there is another ordering that XK is in Fast and then X 1 and XK are swapped kind of and then the embedding you will get will not be similar to the first one so this part is not like taken into account in the paper so this could be like a study about this who do have been much interesting like how the ordering affects the accuracy effectiveness of the model and one of the limitations of the paper is kind of like sample size during the training and testing are fixed it's like minimum is 5 and maximum is 25 like for all the operations it's kind of fixed they haven't changed it so if the like it's not very suitable for like if the training set is gross like online then it might be a bit problematic to formulate the problem how do you create a batch and how do you create a level straight all these things and if you see into the details like when calculating calculating the attention so if the support set becomes very large it will become computationally expensive ok so this is all my observations about the paper and these are the kind of resources there are many blog post made by different researchers about this paper this is kind of a very well known paper and like and many more sense of the code is also available like you can go to the paper suite code and such for this paper and you will get like 5 to 6 different implementations including fighters and tensorflow and chaos and others and these are the references of the important differences of the wave and there are many other references included in the paper that will be all um thanks very much very talk and then well it's it's pretty heavy I think if you haven't been into the tension network and all then it's daunting but I think I kind of grasp the the main features of that like to me it's always like really hard to imagine that with a couple like really low number of samples that you can actually achieve something but this is that it's like a linear combination of like weighting of the scores that that kind of hit home with me that kind of makes sense to me so I think I'll definitely I haven't had the chance to really read it before but I think I'll definitely do that everything I mean any other comments we were almost of the hour but anyone thank you very much Sam I want a second Christian I also want to take in the paper now I think it was very interesting what your prison did say thank you very much for that I was thinking about the kind of if you think about it it's kind of intuitive the the one-shot learning approach I was thinking if if there is even a biological plausibility to it because if you think about it to how children learn they best basically do they or that 
is my question that I have do they really learn by one-shot learning or is it basically like repeated exact sort of training examples from their parents really and then they sort of conceptualize the concept of say a banana or something like that so to answer this okay let me give you a scenario okay so let's say I showed you a picture of his a bruh okay and then I told you that real bruh is a zebra with red stripes so you can easily recognize pictures of rivera right now I see yeah right that makes sense yeah yeah so this example is actually mentioned in one of the blog post so I just gave that example yeah any other questions comments no all in all if there's no further questions or comment I mean we can also like like discuss later on on on slack or in the coming days if there's like stuff coming up all for like if there's dedicated questions to one of our meetups we can like make a deeper discussion there as well and people I would say thanks again to Sam for this I guess hard work yeah so I would close it out for today and we will convene again at a fixed time and we'll make sure that it's clear to everyone which one starting of main those things are attending and see you next timewelcome to the seventh European and Asian and Afghan online Meetup I'm filling in for CAI and will co-host from now my name is Christian man I'm not sure people from the slack will probably know me and today we will hear about wanted learning really looking forward to that one and Sam will present that paper later so again apologies for the confusion with the starting time apparently we messed up with a European summer time and had a defect in the reminder there will be an update briefly about how we proceed from now on but for the moment I hope most got the note and could join so our schedule for today as with the previous meetups just the same introduction we will do a bit of community discussion and then go to the main topic the presentation and that's gonna be roughly like half an hour followed by a Q&A and then we will wrap it up so sure how Sam how you you want to handle questions should people wait till the end or you you happy to take questions in between yeah I'll be I'll prefer to take question in between okay cool so so we do that and I guess most of you know about the schedule that's a monthly meter basically and in May we already have the next one scheduled I did it that's gonna be about prototype based classifiers in the drift not quite sure what that's about looking forward to that and in general if you're happy to take one for the team and basically also want to present we encourage you to just reach out to us kayo me on the slack channel or by a LinkedIn or any other means and it's really really nice experience for yourself if you're starting out that's fine as well I don't have to be an expert on the topic of the paper just really try to make sense of it and what do you think about it and like I did it I did it as well it's really not super much effort just stick with a paper you like and try to present it to the audience and if not everything is clear that's fine as well we will discuss any open topics so if you're eager of doing that just reach out natural which slots are taking yet but Kai knows and trusts propose something and we'll find at eight I'd say maybe a brief update on how things are going for the study group the first day I study group party two is underway and we had the second session next one is this it's pretty intense I have to say but it's it's well rewarding so for 
people who missed out on this one you can look forward to it's really much more it's very different to part one and it's much more bottom-up and really into the details but it's pretty good you really enjoy it for handling this this meetup here I will I guess I will mute most of you or all of you and then if you want to ask a question or if you want to speak up just unmute yourself so we keep the noise down basically yeah and if there's any other inquiries or questions or your entrepot stuff just ping us on slack and you can also use the the chat within zoom I'll try to monitor what's going on there and otherwise just unmute yourself and ask our presenter directly okay so for the community it is Ivanov in its open floor whatever up your mind just unmute yourself and ask so basically any questions about ongoing courses what's happening at the moment topics you would like to hear more about so like anything use particular excited about what's happening could be what you're working on anything new that you just spotted and thing is like super cool project papers whatever comes to mind so whatever you discovered in the last couple of weeks you think is not newsworthy so anyone no one really I mean I I can maybe started off with with one topic that I think was really gaining traction was definitely a swift for tensorflow because that's also gonna be the second the the two last lessons in the fast ahead course will be about Swift for tens of flow and there's quite some interests very little documentation yet very little actual stuff out there but it seems to be like a lot of people are super excited and there's also some interest about learning Swift like I guess it's it's pretty new language and a lot of people maybe only know it from iOS development if at all so we also started a new slack topic channel on swift or switch for tensorflow so if you have any want to discuss any stuff like that you can post it in this like channel and there's also one faster I dedicated forum that's open to everyone hairbrained that's specifically about the plans for bringing fast a high to Swift for tender flow so it's very early days but it seems to be like Jeremy's really keen on it and there might be some development happening anyone anything from you guys hi guys can you get me yeah maybe you can speak up a bit okay can you hit me right now mm-hmm okay so thank you so much for their Meetup I've been coming around my name is Steven by the way from Nigeria and I've been coming to the meetup a lot and it's been amazing so on currently I've been working on something and I wanted to bring it to the meetup to be able to get imputes and probably maybe know how I can do it better so I'm working on a project and it's for two more images for pathologists right yeah so I called the cyber oncology it's basically a deep learning driven software that would help pathologists to make quicker and faster and more accurate diagnoses of different types of tumor not just making classification but also staging it and and also trying to recognize the objects in the to Marcus you know tumor images it's not just enough to say that um it's a carcinoma or it's a it's are a transitional cell carcinoma squamous cell carcinoma it's more like can you tell what what what's in the picture what in the chemo image that classifies as a customer or so it more like object detection right but I've been looking around for a beauty proof of concept it's working it can make about 80 to 90 percent missing a character specification boy I've not been 
able to see a research paper that focused on an application to two more images of different classes and being able to recognize objects and some a question is okay what I'm bringing in it some are there are there like methods to get that that I could use to for this kind of project should be sure that I get not just accuracy on the image and also to be able to organize different type of objects in the image I don't know if I'm making myself clear like what's the thing you you striking with different classification or okay okay okay let me a classical scenario would be that okay cuz I'm in medical school and I'm in final year in medical school so usually it takes about three days to biopsy an HMO image and then properly staged it takes about three days to do that and so like I am trying to reduce that work by being able to not just classify a tumor image I want to be able to stage it too and I want to be able to recognize the object in the tumor image now that's a little bit hard for me because a team or image almost all the time looks the same so I don't know my first question is are there any like research papers that anyone knows that I could use maybe that because I've tried searching I'm not seeing anything on two more images an object detection in a tomb or image like that I have a question what you mean by what do you mean by stage it when you say okay okay staging right okay so it ain't no in pathology when you when you biopsy and when you biopsy an image right you know you want to know whether it's stage 1 or stage 2 or stage 3 cancer all right yeah so staging it's the classification goes like this for instance you can say that is a lobular ductal carcinoma of stage 3 so and jerry' there are features in this biopsy on two more image that makes it possible for you to say that it's stage 3 or stage 1 or stage 2 so but I don't even I've not seen any I don't know okay I don't think I don't know how to go I don't know how to go about that because that's like what I want to solve that is my main angle for the problem just being able to classify it because that's like trivial but being able to but but but you do you have a labeled data set of stages like do you have a label data set of images with along the stage cancer and okay okay that is a that is a um like oh the new shift I'm making now but I am thinking that because chemo images they almost always look alike and I thought maybe if I could see a paper or a particular if anybody has a more proper guidance on how I'd go about this so that I don't like with my time on the wrong track yeah hi Steven this is Philip from Germany I've just heard your story I think it's it's very very interesting as to your question regarding this paper are you aware of the paper of and Rangga he did something similar I think to your project with radiological images I will look up the paper and send it in the chat to you I think that might be in a similar direction as yours ok ok I read up the work on units and like building a unit and your net to do that and taking all the objects in but I don't know it didn't really seem to fit but if I'd love to look at that thank you sir I will google it and send it in the chat to you ok so so it seems like the problem you're doing is Israel you're saying it's not just classification but yeah it's yeah it seems that it is I mean in the sense that you need to classify the type of cancer and the stage so you have to make two classification because that is when it makes more clinical interest in make more clinical sense 
Classifying a tumor is fine, but what stage is it at? Is it at the point where you have to worry so much, or not? If you just classify a tumor image and tell someone it's a squamous cell carcinoma, you're not really helping. And I'm trying to give pathologists a second opinion, not just telling them that it is a lobular ductal carcinoma; I want to tell them why the model thinks it's that. So I want the network to be able to tell the type of cancer and to show the object that is guiding the classification and the stage.

I see. My suggestion would be: just join us on Slack and open a topic there, because I think medical applications are quite interesting, but I'm not sure how many people here are experienced in that area. Maybe you'll find a couple more inputs if you join our Slack, go to the deep learning channel, and type in what you're planning; then we can chip in, and it's much easier and more focused, I guess.

Also, in the fast.ai group there are several doctors who are interested in imaging, so if you search some of the forums and look for comments you'll find them.

That's great, because I want to do this before I get out of medical school.

There should be some people around in the forums, maybe even here, that have some experience with medical data, so maybe you'll find someone.

Okay, so how do I find the channel?

It's twimlai.slack.com, and you can also get there via the TWiML AI homepage.

I have a suggestion for him: maybe connect in our Zoom chat; in the Zoom chat you can interact right now if you want to point something out.

Someone is asking if I can drop my contact info here; can I do that? It's Steven, the one who just asked the question about medical imaging.

Yeah, just open the chat and put it in.

All right, thank you so much, I'll do that.

Okay, I think, unless someone has something short we can fit in, I would ask that we move to the main presentation. Anyone else, something concise and pressing to discuss? No? Okay, then I'd say we start with the main presentation. I will end my screen sharing, and then, Sam, you should be able to share yours. Sam, can you grab the screen?

Yeah, coming up. Hello everyone, and a very good evening. You can call me Sam; I'm a PhD student at IIT Hyderabad in India, and my research area is mainly multi-label supervised classification. I'll be talking about "Matching Networks for One-Shot Learning". This is a very important paper in the area of one-shot learning, published at NIPS 2016 by DeepMind. To start with the paper, I'll give a little bit of the abstract, what the paper is about. Basically, they do one-shot learning with attention and memory; I'll define how the attention and the memory are used. The paper has two kinds of contribution: the first is a model architecture for one-shot learning, and the other is a new kind of training strategy.
The training strategy goes very well with the one-shot learning scenario: a uniform training and testing strategy. Because of this, they are trying to utilize both parametric and non-parametric learning approaches; both have trade-offs, and they are trying to take the best of both worlds. The architecture can be summarized, although the original authors didn't coin this term (it was coined by Andrej Karpathy in his notes about this paper), as a "differentiable nearest neighbor": it takes the advantages of both parametric and non-parametric models. Using this matching network, they improved one-shot accuracy on ImageNet. They didn't exactly run everything on the whole ImageNet; they created three subsets of it, one called miniImageNet, another called dogsImageNet, and another called randImageNet, and the one-shot accuracy increased from 87.6 percent to 93.2 percent. There is another dataset called Omniglot, where they also improved the accuracy.

Okay, so let's start with supervised learning. In traditional supervised learning, the label space is the same for the training phase and the prediction phase. In the training phase you have the labels airplane, automobile, bird, cat, and deer; you learn a classifier on this label space, and for a new sample you predict over those same labels.

In one-shot learning that is not the case. In the training phase you have a label space like dog, frog, horse, ship, and truck; you learn a classifier on that label space, but at prediction time you will be predicting over labels like airplane, automobile, bird, cat, and deer. These two sets are completely disjoint. But although they are disjoint, at prediction time there is another set with the same labels you want to predict over: those are called the supports, and I'll discuss the idea behind the support set in a moment.

The basic idea of one-shot learning is how humans learn. You can be shown a single image of a zebra, understand what a zebra is basically made of, and then categorize any other zebras versus non-zebras. This idea of seeing only one example and learning from it is the basic crux of one-shot learning.

So, concretely: say on the left side you have a training dataset and on the right side a testing dataset, and the two label spaces are disjoint. You have a training task built from the training set and a testing task built from the testing set. At testing time you choose a set of labels from the testing task; it could be as large as the full testing label set, but for simplicity here you take just three labels, and for each label you take a single example. That is what is called the support set: the examples you get to see for each of the categories you have chosen. The query will then come from one of the labels in this chosen label set L'. In this case it is 3-way one-shot learning: 3-way because the label space is made up of three labels (automobile, cat, and deer), and one-shot because for each label there is only one sample, one car, one cat, and one deer. You then have to predict, for a new sample, among automobile, cat, and deer. It could equally be twenty labels with five samples each, which would be called twenty-way five-shot learning. In this paper we will mostly look into one-shot learning.
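To make the episode structure concrete, here is a minimal Python sketch of how such an N-way K-shot episode might be assembled; `sample_episode` and the toy dataset are hypothetical names for illustration, not code from the paper:

```python
import random

def sample_episode(dataset, n_way=3, k_shot=1, n_query=1):
    """Assemble one N-way K-shot episode from a dict mapping
    label -> list of examples. Support and query items share the
    episode's label set L but are disjoint examples."""
    labels = random.sample(sorted(dataset), n_way)        # the episode's label set
    support, query = [], []
    for lbl in labels:
        items = random.sample(dataset[lbl], k_shot + n_query)
        support += [(x, lbl) for x in items[:k_shot]]     # labelled support set
        query += [(x, lbl) for x in items[k_shot:]]       # targets to predict
    return support, query

# Example: a toy dataset with five classes, sampled as a 3-way 1-shot episode.
toy = {c: [f"{c}_{i}" for i in range(20)] for c in ["dog", "frog", "horse", "ship", "truck"]}
support, query = sample_episode(toy, n_way=3, k_shot=1)
```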
As I said previously, the contributions are twofold: one is the matching network itself, the other is the training strategy. I will start with the matching network and then go to the training strategy.

The idea behind the matching network is this. Parametric models usually need many samples to learn: for each class, a parametric model absorbs the properties of that class into its parameters using lots of samples. In non-parametric learning (not all of it, but in some methods, one of them being k-nearest neighbors) you don't need any training as such: you just project into some space, find the nearest sample, and propagate the label from the nearest sample to your test sample as the predicted label. The problem with non-parametric KNN is: when you project into some space, what metric do you choose to define similarity? How do you define "nearest"? That actually matters a lot. Matching networks try to solve both problems at once. It's a parametric model, but the parametric part only serves to define the metric; once the metric, i.e. how you find your nearest neighbors, is learned, you don't need the parameters any more.

Let's see how it is done. It rests on two fundamental pieces: the full context embedding and the attention kernel. I'll start with the full context embedding. The matching network takes two kinds of inputs: one is called the supports, and the other I will call the targets (in the paper they aren't called targets, just samples or a batch). The supports are embedded in a special way. The matching network uses a cosine distance, so the distance itself is fixed, but you change the embedding so that similar samples, the ones that should be nearest neighbors, come closer together in the embedded space. The idea is to take each support and embed it such that all the other supports are also encoded in it: for each sample in the support set, you embed into it all the other supports present in that support set. This is done via a bidirectional LSTM. In the first equation, g embeds the supports: each x_i, an element of the support set S, is first passed through a g' function, where g' is any pre-trained embedding; you could take an Inception net or a VGG net or anything else. So you get some embedding from VGG or Inception, then you run it through the LSTM, and in doing so you encode x_i along with its supports. The new embedding of x_i is a combination of g'(x_i) and all the other samples present in that particular support set.
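As a rough PyTorch sketch of this conditioning step, following the bidirectional-LSTM-with-skip-connection recipe described here (the class name is illustrative, and summing the two LSTM directions with g'(x_i) is one common reading of the paper's appendix, not a verified reimplementation):

```python
import torch
import torch.nn as nn

class FullContextEmbeddingG(nn.Module):
    """Sketch of g(x_i, S): a bidirectional LSTM runs over the pre-embedded
    support set, and each output is added to the raw embedding g'(x_i) as a
    skip connection, so every support embedding sees the whole support set."""
    def __init__(self, dim):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)

    def forward(self, g_prime):                    # g_prime: (n_support, dim)
        out, _ = self.lstm(g_prime.unsqueeze(0))   # (1, n_support, 2*dim)
        fwd, bwd = out.squeeze(0).chunk(2, dim=-1) # split the two directions
        return fwd + bwd + g_prime                 # g(x_i, S) = h_fwd + h_bwd + g'(x_i)

# 5 supports with 64-dim pre-trained features (placeholder values)
g = FullContextEmbeddingG(64)
support_emb = g(torch.randn(5, 64))
```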
So if you change the support set you get a new embedding; in fact, even if you only change the ordering of the support set you will get a new embedding. The ordering is not mentioned in the paper, but in general with an LSTM, if you change the ordering you get a different embedding. Reading the paper, it looks like they treated the ordering as random; they say nothing specific about the ordering within a particular support set. So that is the way the support set is embedded.

This is how it looks. Say there is a dataset comprising only pictures of different types of boats. All the pictures pass through the g' function, the VGG net or some other network, then through an LSTM, and this LSTM connects all the other samples in that particular support set, so for each sample you get an embedding which is a combination of that particular image and all the other images in the support set. You are trying to capture the nearest-neighbor structure in the embedding itself: you are treating whichever samples are present in the support set as, in a sense, nearest neighbors, and capturing that in the embedding.

A quick question: what's the output of g' here? Is that the predicted class label, or something earlier in the network, like the features it has learned?

g' is basically the VGG net, say. You pass the image through VGG, which is already pre-trained (you have to have it pre-trained on ImageNet), and whatever embedding you get out is used as the input to the matching network.

Okay, that makes sense, thanks.
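For that g' part, here is a minimal sketch of using a frozen pre-trained backbone as the feature extractor. VGG16 stands in for the "VGG net or Inception net" mentioned in the talk, and the weights enum assumes a recent torchvision; any backbone would do:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# g' is just a frozen pre-trained network used as a feature extractor.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
g_prime = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten())
g_prime.eval()
for p in g_prime.parameters():
    p.requires_grad_(False)        # frozen: g' itself is not trained here

with torch.no_grad():
    # 5 support images -> 5 feature vectors of size 512*7*7 = 25088
    feats = g_prime(torch.randn(5, 3, 224, 224))
```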
Okay, so this is the other part, the second function, which in the paper is called f. It is used to embed the targets, the samples you want to predict for. One thing I should reiterate: for the supports you have the label information; for the targets you don't. You embed the targets in such a way that the samples present in the support set are also encoded into the target embedding. They do this with an LSTM with attention; in the paper they mostly use this form, but it isn't mandatory, you could use something else, that's still open in this area. It works like this: f' is again some VGG net or Inception net; you take the output of that network and pass it through an LSTM whose hidden state is defined in a special way. Let me just check my notes here... yes: you take the relevance of the samples in the support set and use that relevance as an attention. How do you get the relevance? It's calculated in the third equation: the attention is a softmax of the hidden state multiplied with the samples of the support set. At the very beginning, when the hidden state is h_1, the g(x_i) are not yet embedded in it; you take h_1 and a g(x_i), multiply them, and you get a scalar, which acts as the attention value. You multiply each attention value with its corresponding support and sum, which gives you a read-out of the support set weighted by relevance. That read-out is then used along with the hidden state at the next step: you merge both of them, and from the next step onward the hidden state incorporates the supports.

Let me show the picture, it will be clearer. You get the supports from g' and then g, giving the g(x_i). You take the hidden state, h_1 at the beginning, multiply h_1 with each of the g(x_i) to get relevance scores, take the weighted sum to get the read-out vector for the whole support set, and use that as input to the next iteration of the LSTM. At the next step, h_2, you get an LSTM state in which the relevances and the supports are included. f is a function of two things: x', the test image, and S, the support set, which enters through this g input. h_K can be seen as the output of the f function; h_{k-1} is the previous step, and initially, at h_1, no support information is included yet. Does that make sense?

I need a bit to wrap my head around it; I can't really pinpoint where I'm stuck, but maybe it gets clearer as you progress. It's a complicated topic.

Yeah, let me reiterate a little. The complicated part is that in the paper the equation is given after K steps; it isn't written out from h_1 to h_2 and so on. In the initial state you don't have any support information embedded in the LSTM, but you take h_1, multiply it with all the supports, take the relevance scores, use those scores as attention over the supports, and feed the attended read-out back in. In the equation where r_{k-1} is defined, the attention a(h_{k-1}, g(x_i)), for the first iteration, is h_1 multiplied with each g(x_i), over all supports in S, giving a relevance score for each support, which is then multiplied with that particular support's embedding.
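Here is a simplified sketch of that read-attention LSTM. Note two deliberate simplifications over the paper's appendix: the paper concatenates the hidden state and the read-out as the recurrent state, whereas this sketch sums them, and the number of processing steps K is just fixed; the class name is hypothetical:

```python
import torch
import torch.nn as nn

class AttLSTMF(nn.Module):
    """Sketch of f(x_hat, S): K steps of an LSTM that repeatedly 'reads' the
    embedded support set with softmax attention, so the target embedding
    ends up conditioned on the supports."""
    def __init__(self, dim, k_steps=5):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)
        self.k_steps = k_steps

    def forward(self, f_prime, support_emb):       # f_prime: (dim,), support_emb: (n, dim)
        h = f_prime.new_zeros(f_prime.shape)
        r = f_prime.new_zeros(f_prime.shape)
        c = f_prime.new_zeros(1, f_prime.numel())
        x = f_prime.unsqueeze(0)                   # f'(x_hat) is the input at every step
        for _ in range(self.k_steps):
            h_hat, c = self.cell(x, ((h + r).unsqueeze(0), c))
            h = h_hat.squeeze(0) + f_prime         # skip connection back to f'(x_hat)
            a = torch.softmax(support_emb @ h, dim=0)  # relevance a(h_k, g(x_i))
            r = a @ support_emb                    # attention read-out r_k
        return h                                   # f(x_hat, S) = h_K

f = AttLSTMF(64)
target_emb = f(torch.randn(64), torch.randn(5, 64))
```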
Okay. And the size of the support set is 3 in this case, right? Because of the three dogs?

Yes.

It's hard to imagine that that's enough to inform the model, but apparently it works.

You don't actually have to fix it at three; you can decide. In the paper there are limits: the minimum support set they used is 5 and the maximum is 25, so they use, say, five classes and take one sample each or five samples each.

So that attention was used only for the full context embedding in the f function. There is another attention, which is used over all the samples; there is a picture below that makes it clearer. It works like this: you predict by taking the attention between x-hat and x_i, where x_i comes from the supports and x-hat is the target you want to predict for, and multiplying that attention with the supports' label representations. Although it isn't spelled out in the paper, the label representation is a one-hot encoding of the supports. So in some sense the prediction is a linear combination of the labels in the support set. Say the first support belongs to class one: its one-hot encoding is (1, 0, 0). The second support belongs to class two: (0, 1, 0). The third belongs to class three: (0, 0, 1). The attention here is defined as a softmax over cosine similarities of the embeddings: f(x-hat) is the embedding of x-hat, g(x_i) is the embedding of x_i, and the cosine similarity between them acts as the attention between x-hat and x_i. Those similarities are the values like 0.2, 0.5, and 0.3 in the figure; you multiply them with the support set's one-hot encodings and, because of the softmax, you get a probability distribution, and you can take the highest probability as the output.

They train this network with a cross-entropy loss. One more thing: although the authors use the cosine distance, you could use some other distance; cosine isn't mandatory. So this is how the full network looks once you have g and f: you take the support set x_i through g, you take x-hat through f to get f(x-hat, S), and you compute this model-level attention, which gives you the similarity between the test sample and the supports in that particular set. You get a value for each class, multiply with the one-hot encodings of the classes, and predict whichever is maximum. The same architecture with the equations: cosine similarity, passed through a softmax, gives the attention between x-hat and the x_i's; once you have the similarity between x-hat and all the x_i's, you multiply with the corresponding one-hot y_i, and you get a probability distribution over the label space, which is your final output.
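The prediction step itself is small enough to sketch directly; the function name and tensor shapes are illustrative (PyTorch assumed):

```python
import torch
import torch.nn.functional as F

def matching_predict(f_xhat, support_emb, support_onehot):
    """The final attention kernel: cosine similarity between the embedded
    target and each embedded support, a softmax over those similarities,
    and the prediction as that weighted (linear) combination of the
    supports' one-hot labels."""
    sims = F.cosine_similarity(f_xhat.unsqueeze(0), support_emb, dim=1)  # (n_support,)
    attn = torch.softmax(sims, dim=0)          # attention a(x_hat, x_i)
    return attn @ support_onehot               # probability distribution over labels

# 3 supports, one per class, 64-dim embeddings (placeholder values)
probs = matching_predict(torch.randn(64), torch.randn(3, 64), torch.eye(3))
pred = probs.argmax()                          # class with the highest weight
```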
So this is the basic idea of matching networks. You can also use CNNs and other encoders in there. Is there any question? Okay.

So this is the second part, the second contribution of the paper: they came up with a training strategy, which is quite interesting. In the initial part I described how the testing, the prediction phase of one-shot learning works. They replicate that prediction phase exactly in the training phase, and for some reason people did not do that before this paper. In hindsight this is how it should be done, but previously people used a different approach. So, same as the one-shot prediction setup: instead of the testing task you take the training task, you choose a set of labels, you create the support set from that particular set of labels, and you choose another set, which in the paper is called the batch; that is the same thing I have been calling the targets. So from the label set L you create two sets: one is the support set and the other is the targets. In the support set you have the label information; in the batch of targets you don't. This is exactly analogous to the one-shot prediction phase. Before this paper nobody did that: people trained with a normal supervised learning approach and applied the one-shot setup in the prediction phase only, whereas this paper applies the one-shot strategy in the training phase as well. That is the second contribution.
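Putting the pieces together, a hedged sketch of one such episodic training step, reusing the hypothetical `sample_episode` and `matching_predict` helpers from the earlier sketches; `model.embed_support` and `model.embed_target` stand in for the g and f stacks and are assumptions, not the paper's API:

```python
import torch

def train_step(model, optimizer, dataset, n_way=5, k_shot=1):
    """One episodic training step: the training batch is built with the
    same N-way K-shot protocol that is used at test time."""
    support, query = sample_episode(dataset, n_way, k_shot)
    lbl_to_idx = {l: i for i, l in enumerate(sorted({l for _, l in support}))}
    s_emb, s_onehot = model.embed_support(support)   # g(x_i, S) and one-hot y_i
    loss = torch.zeros(())
    for x_hat, label in query:
        f_emb = model.embed_target(x_hat, s_emb)     # f(x_hat, S)
        probs = matching_predict(f_emb, s_emb, s_onehot)
        loss = loss - torch.log(probs[lbl_to_idx[label]] + 1e-8)  # episode cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```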
Now the datasets and results. They tested on Omniglot; on ImageNet, where they used three different subsets, miniImageNet, randImageNet, and dogsImageNet, which I'll describe in a moment; and on one language modeling task on the Penn Treebank dataset.

The Omniglot dataset is a collection of 1,623 different characters, and each character was drawn by 20 different people, so each class has 20 samples. Here they compared their network with the convolutional Siamese network and one other baseline (I haven't gone through the related work section in much detail; you can look into the paper for that part). Somewhat surprisingly, without fine-tuning their accuracy is better in most cases, and the full context embedding didn't help much here. The Siamese network improved with fine-tuning, but the matching network with fine-tuning actually degraded slightly rather than improving, and the paper doesn't say why that happened.

Then ImageNet. miniImageNet is a subset of ImageNet with 80 classes for training and 20 classes for testing. They found that their model with fine-tuning and full context embedding works a little better: the last two rows of the table are with full context embedding, and the rows before are without; it improves accuracy by around two to four percent, with fine-tuning also helping. So the full context embedding shows some effectiveness here, but not much.

There are two other subsets. randImageNet consists of randomly selected classes: 882 classes for training and the remaining 118 for testing. In dogsImageNet, all the non-dog classes are selected for training and only the dog classes, various breeds of dogs, are used for testing, so the test samples contain only dogs. In this case the matching network actually does not do better at one-shot learning. One reason they mention for the matching network not beating the Inception classifier here is that in dogsImageNet the test set has only dogs and the training set has everything except dogs, so the distributions of training and testing are very different. That is a challenge for matching networks: if the training and testing distributions are very different, they don't always work well.

Then the Penn Treebank. They defined a new task for this: you have a set of sentences, and from each sentence one word is missing; that missing word is the label. Here you have five sentences with missing words: in the first sentence the word "prominent" is missing, in the second "series", in the third "dollar", and so on, and there is one query sentence, also with a missing word, which will be one of those five words. That is how the one-shot task is defined in the language modeling case. They compared their result with an oracle LSTM language model; the oracle LSTM has seen all the words, so it isn't one-shot, it's basically the upper bound of the scenario. Matching networks work pretty well: about 32.4% accuracy for one-shot, 36.1% for two-shot, and 38.2% for three-shot; as you increase the number of samples, the accuracy usually increases.

In conclusion: the non-parametric structure gives matching networks the ability to learn from single samples; you don't need thousands of samples for matching networks to work. To summarize the model, it's a trainable, end-to-end differentiable nearest neighbor with metric learning capability. It works very well on three different datasets, and training the model in the proposed episodic, one-shot way makes the learning easier. Before this paper, for some reason, researchers used the traditional training procedure and only applied the one-shot setup in the testing phase; this paper changed that.
Now a few remarks from my side; these are not in the paper. In summary, they introduced matching networks, and you can see the paper from a few different perspectives. From the parametric perspective, it's basically a metric learning paper. From the non-parametric perspective, it's basically a KNN where the label predictions are linear combinations of the nearest neighbors in the support set. Another takeaway is that the training and support set label distributions should be close: as we saw in the dogsImageNet case, if the training set and the test-time supports come from very different distributions, it might not work well.

One other thing not mentioned in the paper: when you pass the samples through the full context embedding, the ordering of the samples is not said to have any significance, but it usually does. In the f function, if you embed x-hat with the supports ordered x_1, ..., x_k, ..., x_n, and then with another ordering where, say, x_1 and x_k are swapped, the embedding you get will not be the same as the first one. This is not taken into account in the paper, and a study of how the ordering affects the accuracy and effectiveness of the model would have been quite interesting.
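That ordering remark is easy to check empirically: with any LSTM-based conditioning on the support set, permuting the inputs does more than permute the outputs. A self-contained toy check (randomly initialized weights, so the exact numbers are meaningless; only the order-sensitivity is the point):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(16, 16, bidirectional=True, batch_first=True)
supports = torch.randn(4, 16)                  # four toy 16-dim support embeddings

emb = lstm(supports.unsqueeze(0))[0].squeeze(0)            # original order
perm = torch.tensor([2, 0, 3, 1])
emb_perm = lstm(supports[perm].unsqueeze(0))[0].squeeze(0) # same supports, new order

# If the encoding were order-invariant, row i of emb_perm would equal
# row perm[i] of emb; in general it does not.
print(torch.allclose(emb_perm, emb[perm], atol=1e-5))      # expected: False
```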
One limitation of the paper is that the sample sizes during training and testing are fixed: the minimum is 5 and the maximum is 25 across all the experiments; they never varied it. So it isn't very suitable for a training set that grows, as in an online setting; it might be a bit problematic to formulate the problem there, how you create a batch and how you create a label set. And if you look into the details, when calculating the attention, if the support set becomes very large it becomes computationally expensive.

Okay, so those were my observations about the paper, and here are some resources. There are many blog posts by different researchers about this paper, since it is a very well-known paper, and the code is also available: you can go to Papers with Code, search for this paper, and you will find five or six different implementations, including PyTorch, TensorFlow, Keras, and others. These are the references to the important related papers, and there are many more references in the paper itself. That will be all.

Thanks very much for your talk. It's pretty heavy; if you haven't been into attention networks it's daunting, but I think I grasped the main features. To me it's always really hard to imagine that with a really low number of samples you can actually achieve something, but the fact that it's a linear combination, a weighting of the scores, that hit home with me, that makes sense to me. I haven't had the chance to really read the paper before, but I'll definitely do that now. Any other comments? We're almost at the hour. Anyone? Thank you very much, Sam.

I want to second Christian: I also want to dig into the paper now. I think what your presentation said was very interesting, so thank you very much for that. I was thinking that, if you think about it, the one-shot learning approach is kind of intuitive, and I was wondering whether there is even a biological plausibility to it. If you think about how children learn, do they really learn by one-shot learning, or is it basically repeated exposure to training examples from their parents, and then they conceptualize the concept of, say, a banana?

To answer that, let me give you a scenario. Say I showed you a picture of a zebra, and then I told you that a "red zebra" is a zebra with red stripes. You can now easily recognize pictures of red zebras, right?

I see, yeah, that makes sense.

This example is actually mentioned in one of the blog posts, so I just borrowed it.

Any other questions or comments? No? All in all, if there are no further questions or comments, we can also discuss later on Slack in the coming days if anything comes up, and if there are dedicated questions for one of our meetups we can have a deeper discussion there as well. So I would say thanks again to Sam for this, I guess, hard work. I'll close it out for today, and we will convene again at the usual time; we'll make sure it's clear to everyone which meetup they're attending. See you next time.