Machine Learning Methods - Computerphile

**The Challenges and Opportunities of Supervised Learning**

Supervised learning is a popular machine learning approach that has been widely used to solve various problems, including classification and regression tasks. However, there are some challenges associated with supervised learning. One of the major concerns is overfitting, which occurs when a model becomes too complex and starts to fit the noise in the training data rather than the underlying patterns. This can lead to poor performance on new, unseen data.
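The overfitting failure mode described above can be illustrated with a deliberately extreme toy model (a sketch with invented data, not a real diagnostic system): a lookup table that memorizes its training set is perfect on seen examples but useless on anything new, while a much cruder rule generalizes.

```python
# Toy illustration of overfitting (all data invented): a lookup table
# that memorizes the training set scores perfectly on seen examples but
# fails on anything new, while a cruder threshold rule generalizes.

train = [(1.0, "light"), (2.0, "light"), (8.0, "dark"), (9.0, "dark")]

# "Overfit" model: memorize every training example exactly.
memorized = {x: label for x, label in train}

def overfit_predict(x):
    # Perfect on the training data, lost on any unseen input.
    return memorized.get(x, "unknown")

def simple_predict(x, threshold=5.0):
    # A coarser rule that captures the underlying pattern instead.
    return "light" if x < threshold else "dark"

print(overfit_predict(1.0))  # -> light   (seen during training)
print(overfit_predict(1.5))  # -> unknown (never seen: the model breaks)
print(simple_predict(1.5))   # -> light   (the simple rule generalizes)
```

The lookup table is the limiting case of fitting the training data "too right": zero training error, but brittle on the first input it has not seen.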

Another challenge is the need for large amounts of labeled data, which can be time-consuming and expensive to collect. Additionally, supervised learning requires careful preprocessing of the data, including feature selection, normalization, and transformation, to ensure that the model receives high-quality input. If these steps are not done correctly, it can lead to poor performance.
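As one concrete preprocessing step, here is a minimal min-max normalization sketch in plain Python (the feature values are invented for illustration): rescaling every feature to [0, 1] keeps features with large numeric ranges from dominating the distance calculations that many models rely on.

```python
# Minimal min-max normalization sketch (plain Python, invented values):
# rescale each feature column to [0, 1] so that features with large
# numeric ranges do not dominate distance-based models.

def min_max_normalize(rows):
    """Rescale each column of a list-of-lists dataset to [0, 1]."""
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

# Hypothetical [age, blood value] pairs on very different scales.
data = [[25, 2], [50, 4], [75, 6]]
print(min_max_normalize(data))  # -> [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

Without this step, a distance measure over raw values would be driven almost entirely by the large-range feature, regardless of which feature is actually informative.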

**Practical Applications of Supervised Learning**

Supervised learning has been widely used in practical applications such as medical diagnosis and image classification. For example, given a large dataset of patient records with corresponding diagnoses, an algorithm can learn to support doctors in diagnosing diseases such as cancer, and its accuracy improves as more labeled data becomes available.

In another example, computer vision techniques are used to classify images into different categories, such as objects or scenes. These techniques often rely on supervised learning algorithms that can be trained on large datasets of labeled images. By applying these algorithms, researchers and developers have been able to build systems that can recognize and classify objects with high accuracy.

**A Practical Problem in Colon Cancer Diagnosis**

In a practical problem involving colon cancer diagnosis, doctors had spent years collecting data on about 500 patients: age, medical history, genetic values, blood values, and so on. The task was to categorize the patients into different levels of severity, from mild to severe. All 500 patients had been assigned a severity category, but diagnostic technology had changed over the years, so some newer labels were only available for the most recent 50 patients. This made it difficult to apply supervised learning algorithms directly.

The researchers decided to use semi-supervised learning, which combines elements of supervised and unsupervised learning. They started by applying an unsupervised algorithm to cluster the data based on its features, such as age, medical history, and genetic values. Then they used the labels that were available for some of the patients to assess the clustering: if items with the same label ended up in the same cluster, the solution could be judged a good one.

**Semi-Supervised Learning**

Semi-supervised learning combines elements of supervised and unsupervised learning. An initial clustering step groups the data by similarity; the few available labels are then used to evaluate the clusters and to assign a label to each one. This approach is useful when only a small fraction of the data is labeled, which is not enough to train a supervised model on its own.

In the case of colon cancer diagnosis, semi-supervised learning allowed the researchers to combine the benefits of both approaches. Clustering the data by its features helped them identify patterns and relationships between variables, and the labels available for some of the patients let them validate the clusters and assign diagnoses to them.
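The cluster-then-label idea above can be sketched in a few lines of plain Python. This is a toy one-dimensional version with invented measurements and labels, not the researchers' actual pipeline: a tiny k-means groups the points, and the few known labels are propagated to everything in the same cluster.

```python
# Toy semi-supervised sketch (invented data): cluster all points with a
# tiny 1-D k-means, then name each cluster using the few labeled
# examples that fall inside it.

def kmeans_1d(points, k, iters=20):
    # Spread the initial centroids across the sorted data.
    centroids = sorted(points)[::max(1, len(points) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]   # unlabeled measurements
known = {1.0: "mild", 8.0: "severe"}       # the few labeled examples

centroids, clusters = kmeans_1d(points, k=2)

# Propagate each known label to every point in the same cluster.
labels = {}
for cluster in clusters:
    names = [known[p] for p in cluster if p in known]
    if names:
        for p in cluster:
            labels[p] = names[0]
print(labels)
```

Two labels were enough to name both clusters, and every unlabeled point inherits the label of its cluster; if labeled points with different labels landed in the same cluster, that would flag a poor clustering, which is exactly the quality check described above.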

**Interactive Learning with Human in the Loop**

To further improve the performance of the semi-supervised learning algorithm, researchers are exploring the use of human-in-the-loop (HITL) techniques. In HITL, humans are involved in the decision-making process to correct or validate the output of the machine learning algorithm.

In the case of colon cancer diagnosis, HITL involves asking a medical expert to review the labeled clusters and provide feedback on their accuracy. This feedback can be used to update the model and improve its performance over time. By incorporating human judgment into the learning process, researchers believe that they can build more accurate and reliable models for complex tasks like disease diagnosis.
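One way such a feedback step might look in code (a hedged sketch: the centroids, the items, and the `expert_answers` stand-in are all invented): the system scores how ambiguous each unlabeled item is, asks the "expert" about the single most uncertain one, and feeds the answer back as a new label.

```python
# Hedged sketch of one human-in-the-loop step (all values invented):
# flag the item the model is least certain about and ask an expert
# (simulated here by a dict of answers) for its label.

def distances_to_centroids(x, centroids):
    return sorted(abs(x - c) for c in centroids)

def most_uncertain(points, centroids):
    # Uncertainty = small margin between the two nearest centroids:
    # an item halfway between clusters is the most ambiguous one.
    def margin(x):
        d = distances_to_centroids(x, centroids)
        return d[1] - d[0]
    return min(points, key=margin)

centroids = [1.0, 8.0]              # cluster centres from a prior step
unlabeled = [0.9, 4.4, 7.8]         # 4.4 sits between the two clusters
expert_answers = {4.4: "moderate"}  # stand-in for a real clinician

query = most_uncertain(unlabeled, centroids)
labels = {query: expert_answers[query]}  # feed the answer back in
print(query, labels[query])  # prints: 4.4 moderate
```

Spot-checking only the most ambiguous items keeps the expert's workload small while injecting knowledge, possibly tacit knowledge not present in the data at all, exactly where the model needs it most.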

**The Future of Semi-Supervised Learning**

As data sets become increasingly large and complex, semi-supervised learning is likely to play an increasingly important role in machine learning research and applications. By combining elements of supervised and unsupervised learning, researchers believe that they can build more accurate and reliable models for a wide range of tasks.

In particular, HITL techniques are seen as a promising way to improve semi-supervised learning further: interactive feedback arrives in real time, and it can capture tacit expert knowledge that may not be present in the data at all.

"WEBVTTKind: captionsLanguage: enWell today, I want to talk data miningwhich is what I'm really interested inand I want to explain a little bit about the inner workings of data mininga little bit of the sort of terms that you might have heard when you read - the first lecture or the first bookI want to talk about supervised learning,unsupervised learning, what exactly are these things, and thenI want to get on to something new semi-supervised learning and alsoWhat's the research at the moment in this area?It's called Machine learningThat's the sort of applied artificial machinelearning if you get a data you want to mine the data andBroadly there's kind of two categories of methods how this works, so if could pull up my prop. Yes, I've carefully preparedHere are some items of data that I have brought along the first method may be that I should explain is unsupervised learningBecause it perhaps the easier way, it's called unsupervised learningBecause we don't have any examples that are labeled, so it's an unlabeled learning yeahI guess the idea is a supervisor knows the answer and we don't have anybody who knows the answerSo we get the data to begin with and we don't really know anything about itWe know obviously the attributes. We know the values, but we don't know what categories are they let's say that's a problemSo unsupervised learning very often is just sorting off the dataso unsupervised learning very often is just sorting of the dataSo you get your first date item and you put it somewhereand then comes another data item and you basically go let's do colors is this similar or is this different andNow this is quite different. We put it there and then comes another date item. 
OhIt's this similar or is it differentit's a little bit similar to the yellow ones so we'll put it a little bit closer to the yellow one andThen comes another data item and noThis is obviously quite similar to the yellow one so we put it closer to here and then so over time you get all theseData items in and they might end up a bit likeSomething maybe a bit like thatSo what have I done? I've done a sorting of the dataand the approach I've done is something based on similarity measures these unsupervised methods they all use the similarity measure in this caseI've done kind of by color the other way these methods usually work is to actually start out by saying but how many groups would?You like your data to be in how many clusters would you like it to be in?So let's say you want them in three clustersWell, then maybe solution might look like this, it's clustered in by the color three clustersIf there would have been four clusters maybe the solution would have looked like thisAnd if there was maybe two clusters it might even looked like this, so you might ask okay?So so what's the data mining about the sorting of the data well?Once we sorted the data in this way. We can of course have a look at allSo what ended up together maybe these things have ended up together?And maybe now we can say oh, this is the light colors. 
This is the dark colors, and we certainly have two groupsI mean we wouldn't normally sort color cubesYou would sometimes saw patients and are they really illor are they very ill and you know that sort of thing we could sort about this now most of theUnsupervised method spoke exactly like I described to the worker by sorting it the differences that measure the difference between things so isit a statistical similarity is it aAlgebraic similarities that your metric measure you can imagine or so many ways you can measure the difference between thingsUnsupervised learning is sort of quite a simple way of doing itI mean, it's quite quick the algorithms, but it's not as powerful as other methodsWhat's the problem with it? One of the problems with it is actually quite straightforwardLet's say we end up with this solution. Well, is this a good solution, or it's not a good solutionIt's actually really hard. It's really hard to evaluate because we obviously don't know about the dataWe don't know so we're looking at itGoing which looks okay?But maybe not and then very often what happens actually if you look at the data from one wayIt looks like a good solution, but now I do my reveal we sort of turn the data a bitAnd you know suddenly we have another angle on the data and like actually now. 
It's a messThey're not really sorted variable at them or are they well often what happens?That's often what happens with unsupervised learning you sort them in one way, and they look quite goodBut then we look at the data differently and actually this hasn't quite workedAnd it's not so great the other downside withUnsupervised learning is the algorithms really only work when you tell them how many groups you want to data to be intwo groups, three groups, four groupsFor some problems you might notice maybe you have like I say ill patients and healthy patientsAnd you know there is two groups but very often actually how many groups you have is the whole question so you can't really usethese methods that well, if you want to know some technical terms Kmeans for example, it's a classic unsupervised methodThat's very popular. So if you can look it up, you'll learn a bit more about itnow...Second way of doing learning would be the supervised wayWe said unsupervised there must be that must be a supervised way. Here the differenceis that you have some data which has some answers attached to it already so you can learn from itFrom this data really learn from it and a classic way of doing it is well wellneural networks forms one of the best-known ones. How does that work, okay? 
Well?So have some date again, and this time let's say we want to do something a bit differentWe want to just sort them in light colors and dark colors for exampleAnd what happens is I get my data in and already somebody has labeled the data for methey said these are light colors, these are dark colors so we already know the answer for this dataWe don't know it for some other data, but we know it for thisThis is our training dataAnd now I'm going to do a new learning neural network the first data item comes in it goes hereThe next occurred item comes in and goes hereAnd I keep doing this and maybe I end up with something that looks like thisAnd now of course I can assess the quality of the solution and go... oh well algorithm, you've doneOkay, but you haven't doneit really well because these two should be over there this one should really be there fix the function a bit and do it againbackAnd we might end up like this. It's likeOkay, that was betterBut he's still got one wrong fix the function again and do it against this called back propagation neural NetworkAnd we'll do this againand of course if you do this long enough eventually the algorithm will learn the perfect function how to sort things andthen the idea is a new data item comes along andIt will go to the same function and because the function is now perfect it you will end upexactly the right place no problemand then ahand then no problemsoIt's supervised because we have labels and because labelswe can assess the quality and in neural networks it's the classic way of doing this and in general supervised learning is very powerful becauseAs long as we have enough data with enough labels, we can always learn the function, and then it should work really wellBut well there wouldn't be research if we're finished with itSo there's obviously a problem with this as well. The problem with this is that it can lead to overfittingWhat does overfitting mean?Means like tight jeans you know. No, not that. 
It means that you haveToo much emphasis on getting the function right you make it too right.So the function is absolutely perfect in fact it's so perfect, it's brittle it's it's it's just not good anymoreSo what happens is a new data item comes along one that you've not seen beforeI got oneAnd the unsupervised method wouldn't have a problem with this because it just goes by similarity and we'll goIt's kind of a light color you probably end up hereBut a supervised method has never seen this color before and the function goes like what do I do with this and itPfttttbreaks or it puts you just at a random place like maybe here so supervised learning is really goodBut if you overdo it, then you've overfitted and the problem is that you actually make the system worse again. You made it brittleThe other downside of supervised learning is you must actually have enough datawith labels which for some problems you have it's fine but for some problems, you don't really have it, soLet's talk about a practical problem that I was working on so I was working with doctors in a hospitalClinicians who look after colon cancer patients andthey took many years to collect the data of about 500 patients ofclassic medical data so we've got age, critical medical historywe've got genetic values, blood values, and so on and so on and so on andThey get diagnose the different categories of illness some more serious some less serious and the doctors wanted some help with this categorizationthe most serious cases and the least serious casesthey're quite clear, but it's just this whole group in the middleAnd I wanted to make sure can we split them a bit betterAnd so we were working with this with them, and so this is a classic problemAnd in that case there was 500 patients that were already categorized as in whatcategory of illness they were in so actually a supervised approach was really good because we could learn from those500 and build up a picture and as long as we're careful to not overdo it 
we'll be fineBut then what actually happened and this leads me on to what my research is at the momentWhat happened is.... notfor all the 500 patients did they have all the labels because some of the technology has been changing over the yearsSo there's more modern things now that I didn't have ten years agoso actually for the last 50 patients they had some additional labels that I didn't have for all the others andSo we were talking about what to do with thisAnd there's a method called semi-supervised learning which is kind of what the research is onWhy can we take the best of both worlds and maybe combine it a bit so what if you've got a few labels?It's not enough to learn perfectly, but maybe we can do something so what we've done is a semi-supervised methodAnd it's kind of a mixture of the twoYou get your data and let's just say we want to split them in light and dark colorsIt's basically our more serious patients and our less serious patientsAnd you might end up sorting the data something like that because it's an unsupervised approach first of all we don't know exactly how goodthis isBut then for some of the data items we have a label and we can look upWhat's the number on them and because for some of you have a label nowWe can say okay all the ones have the same label or with a similar label are there in the same group so suddenlyWe can assess the quality of thisSo we don't have a label for all of these, but have a label for some of them are they in the same groupYes, and then the same labels are in the same groupYeahThat looks like a good solution semi-supervised learning is probably the future because as data sets get bigger and bigger and biggerYou don't have labels anymore for everything because nobody has time to label everything and computers can't really label things very wellso you'll have the experts labeling a few things andsemi-supervised learning will be where this is going butThen the next step really would be to have it interactive that would 
be even betterSo that's kind of what we're working on right now. It's called a man in the loop or human in the loop learningwhereYou maybe have no labels at all or maybe just very few and then you do some sorting of the data and then we askedthe expert has the sorting gone well? Has it not gone well?Well, what about this one item, what would be the label you would give it and it sort of a bit interactiveAnd I think they'll be much better because then you can you know there is in real time and you can actually alsoLatest developments can come in tacit knowledge that you might not even have in the dataSo that's like spot checking? Yeah, exactly it's like spot checking it and but then putting that knowledge back into the algorithmSo the algorithm can learn from it again and it's a sort of reinforce a bitThat's a single-car. That's basically controlling the robot twice 864 processes. Which is more than arobot will usually get. Where are we going now? I'll show you the big machine. That's it.\n"