Fast AI NLP Study Group - Session #6

**Introduction to Regular Expressions**

Regular expressions, commonly referred to as regex, are a powerful tool used in natural language processing and string manipulation. Rachel explains that she builds a compiled pattern, which the session refers to as "regex frown", by calling regex compile (in Python, `re.compile`) on a pattern that matches either the yellow frown or the green frown. In other words, the pattern looks for either a yellow frown or a green frown in a given message, and a substitution can then be applied to replace the matched string.

**The Functionality of Regex**

Rachel further explains that she then calls the substitution method on this compiled pattern ("regex frown sub"). She takes the frown and substitutes a smile for it, replacing the matched string in the message. Compiling a regex expression turns a set of matching rules into a reusable object that can match and replace specific patterns in the input string. Rachel's approach to regex is straightforward, emphasizing the importance of testing and proofreading to ensure that the function works as intended.
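A minimal sketch of the compile-then-substitute idea described here, using Python's standard `re` module. The session does not show the exact pattern, so the names `re_frown` and `cheer_up` and the literal strings "yellow frown" and "green frown" are assumptions chosen purely for illustration:

```python
import re

# Hypothetical pattern: the alternation (|) matches either literal phrase.
# The real pattern used in the lesson is not shown in this document.
re_frown = re.compile(r"yellow frown|green frown")

def cheer_up(message: str) -> str:
    """Replace every matched frown in the message with the word 'smile'."""
    return re_frown.sub("smile", message)

print(cheer_up("I saw a yellow frown and then a green frown."))
# I saw a smile and then a smile.
```

Compiling once and reusing the compiled object keeps the pattern in one place, which makes it easier to test and adjust later.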

**Common Errors in Regex**

Rachel highlights two common types of errors that can occur when working with regex: type 1 errors (false positives), where a string is matched that wasn't intended to be matched, and type 2 errors (false negatives), where a string that was meant to be caught is missed. To avoid these errors, Rachel stresses the importance of testing and reviewing regex expressions before applying them in real-world scenarios. She encourages users to test their regex on a variety of inputs to ensure they don't inadvertently match or miss strings.
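One way to act on this advice is to keep two small lists, strings that should match and strings that should not, and check a pattern against both. The sketch below is only an illustration: the helper `audit`, the test strings, and the two patterns are all hypothetical and are not taken from the lesson.

```python
import re

def audit(pattern, should_match, should_not_match):
    """Return (unintended, missed): strings wrongly matched and strings wrongly skipped."""
    unintended = [s for s in should_not_match if pattern.search(s)]  # type 1 errors
    missed = [s for s in should_match if not pattern.search(s)]      # type 2 errors
    return unintended, missed

# Hypothetical test inputs for a pattern meant to find the word "cat".
should_match = ["cat", "the cat sat"]
should_not_match = ["concatenate", "dog"]

loose = re.compile(r"cat")       # no word boundaries, also matches inside words
strict = re.compile(r"\bcat\b")  # \b restricts the match to the whole word

print(audit(loose, should_match, should_not_match))   # (['concatenate'], [])
print(audit(strict, should_match, should_not_match))  # ([], [])
```

The loose pattern produces a type 1 error on "concatenate"; adding word boundaries removes it without introducing a type 2 error on these inputs.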

**Resources and Further Learning**

Rachel mentions a cheat sheet as a resource for further learning about regex. The cheat sheet appears to cover much of the material presented in this lesson, although Rachel notes that there's always more to learn when it comes to advanced features and techniques. She advises users to focus on mastering 90% of the regex functionality before moving on to more complex topics. This approach allows users to build a solid foundation in regex and then expand their knowledge as needed.

**Conclusion**

Rachel concludes by stating that this lesson covers approximately 90% of what you'll ever need to know about regex. She notes that there's always room for further learning, particularly when it comes to more advanced features and techniques. The lesson provides a comprehensive introduction to regex, covering the basics of function creation, testing, and error avoidance. With Rachel's guidance, users can develop a solid understanding of regex and apply it effectively in natural language processing applications.

**Future Lessons**

Rachel hints that future lessons will focus on deep learning for natural language processing. She mentions that this approach is going to get "really interesting" and promises that the next lesson will build upon the foundation established in this one. This suggests that the course will explore more advanced topics, potentially including machine learning algorithms and techniques specifically designed for natural language processing. As the course progresses, users can expect to learn about more sophisticated approaches to text analysis and manipulation.

**Conclusion of the Classical Approach**

Rachel concludes by stating that this brings us to the end of the classical approach to natural language processing. She notes that from now on, the course will shift its focus towards deep learning for natural language processing. Although the classical approach has been covered extensively in this lesson, Rachel emphasizes that it provides a solid foundation for understanding the underlying principles and concepts.

**Upcoming Special Event**

Rachel announces an upcoming special event, specifically mentioning February 8th as the date. She notes that there will be a class scheduled on this day, but she doesn't elaborate further. This suggests that the course may have some exciting developments or announcements planned for the near future.

**Transcript**

I thought we could try to make our way through the two notebooks and the two videos that correspond to them. There's one, let's see, it's this one, 3b, "more details", where Rachel reviews things that she's talked about in the previous videos. She talks about randomized SVD, which we didn't pay much attention to, so I don't think I'll do too much with it here. I think it's an interesting topic, but all we care about for this class is being able to use SVD, and we've already done that, using it to factor a matrix. So I'm going to skip over that section.

I wanted to go over this example that Rachel has picked out from a book called The Drunkard's Walk by Leonard Mlodinow. He's an interesting guy: he was a postdoc at Caltech and a student of Feynman's for a while. I'm not sure what happened then; he's not working as a physicist anymore, he's a writer now. Anyway, he wrote this interesting book, The Drunkard's Walk, and there's a neat example in it which Rachel has picked out to help us understand Bayes' theorem.

Mlodinow was tested for HIV, I think in 1989, and his doctor told him that the test was positive and that therefore he had a very high chance of having HIV. He thought about it and realized that was not true, and this is where we're going to understand why that's not true. What happens with the HIV test is that it produces a positive result, and we know that it has a false positive probability of one in a thousand; that is, one in a thousand people who don't have HIV will show up as having HIV. So the doctor said, well, that means there's a 999 in 1,000 chance that you have HIV. Mlodinow thought about that and realized it wasn't true, and this is his
reasoning. In a very brief sentence, this is how Mlodinow solves the problem. He says, let's prune the sample space to include only those who tested positive. So let's say we have 10,000 people. There's a previous paragraph that sets this up: imagine a sample space of 10,000 people in the general population who are heterosexual and non-intravenous-drug-using. I think they also limit the population to white males, because at that time it was mostly males who were being tested for HIV. In this sample, according to the CDC, the Center for Disease Control, about one out of 10,000 people will test positive because they have HIV, and the false positive rate is one in 1,000. That means that if 10,000 people are tested, you're going to end up with 11 positive results: one for the person who has HIV and 10 false alarms, for people who don't have HIV but whose test turned out positive. So one in 11 of the people who tested positive are really infected.

His doctor told him that the probability that the test was wrong was one in a thousand, that the probability that he was healthy was one in a thousand. But according to this way of looking at it, which is the right way of looking at it, only one in 11 people who test positive are really infected. So the chances are one in eleven that he has, not AIDS, but HIV, that is, that he's HIV infected, and the chances are ten in eleven that he's not HIV infected, if this makes any sense.

So the way that he solves the problem is he looks at the sample space, he prunes the space down to only the people who tested positive for HIV, and then he realizes that only one of these people really has HIV, one
in eleven of the people who tested positive really has it. So this is a simple way to break down the problem and get an answer, and notice that it's two paragraphs, right? And this summarizes it: 10,000 people total; 11 of them test positive; of the people who test positive, ten are actually not infected with HIV and one of them is. That's the table that summarizes the result.

And then here's the punchline: there was only a one in eleven chance, about nine percent, that the test was correct. And it turns out Mlodinow was actually HIV negative. He's still alive and he doesn't have any signs of the disease. So that ten in eleven chance that he wasn't infected with HIV was the conclusion from the test, and now we know that's actually the case, because he doesn't have it.

So you might say, okay, it's two paragraphs, we have to struggle a little to understand it, but it's pretty transparent after you think about it that that's the right way to think about the problem. Now I'm going to show you another way to do this problem, using Bayes' theorem. It's going to take a couple of pages, and you're going to wonder, why am I doing all this work, with all those nasty equations and everything, to get to the same result? The answer is that Bayes' theorem and the Bayesian method are so important that you're going to encounter them again and again, and they help you solve problems that are much more complicated than the simple one that we just solved using a
different method. So it's worth it to learn Bayes' theorem and, more importantly, how to use Bayes' theorem, because Bayes' theorem is how people infer the truth from evidence, and this is going to become more and more important in machine learning. Causality, establishing causes for something, is a really big topic, and that always involves using Bayes' theorem. I think 2019 is when it really started, but in 2020 there's going to be a lot of emphasis on identifying causes in machine learning models, or using machine learning models to identify causes.

Okay, so we can start with this. Here's a review of the facts. But first I wanted to show you Bayes' theorem again. We saw this last time, but it's worth seeing it again, because the more times you see it, the more likely it is that it will be branded in your mind, and that's what I'm trying to get at. First of all, everyone can see my screen, right? I'm not just talking?

No, you're good.

Okay, thank you. So consider two events, A and B. The probability that A and B occur together can be written in two ways. The mathematical way of saying "the probability of A and B occurring together" is P(A, B). That can be broken down as a product: the conditional probability that A occurs given that B occurs (that's what the bar means, "given that", so P(A | B) is the probability that A occurs given that we already know B has occurred), multiplied by the probability that B has occurred. And you can reason that it is logically true: if I know the probability of A given that B occurs, and I multiply that by the probability that B occurs, that should give me the probability that A and B both occur. And you can also write it the same way
in a reverse manner: you can switch A and B and write the product of the probability that B occurs given that A occurs times the probability that A occurs, and that'll give you the same result, the probability that A and B both occur. This thing here is called a conditional probability. If you haven't seen it before, you will see it again if you ever try to read machine learning papers. So I've broken that down: P(A) is the probability that A occurs, P(B) is the probability that B occurs, and these are the conditional probabilities, defined as I already said in words. And so Bayes' theorem is just a statement that the right-hand sides of these two equations are equal: P(A | B) P(B) equals P(B | A) P(A). Stated simply, that's what this means right here.

So you look at that and you say, well, what good is that? Why is that so important? I want to do something that I should have done before. Suppose we have a hypothesis about the world, which we could call H, and then we have some data, which we can call D. Let's let H and D take the place of A and B, and then we'll see a more interesting application of Bayes' theorem. So I'm going to say the probability of H given D, times the probability of D, is the probability of D given H... yes, did somebody have a question?

No, I was just talking to myself. As you were.

Okay, that's fine. So here's what it looks like. This is how you can think about it: a hypothesis is something you're trying to find evidence or support for, and then there's the data. A hypothesis could be some kind of model where you have some parameters. If you're looking at movie reviews, the hypothesis could be that the review is a positive review or a negative review. It's something you're trying to find out about the world. So generally the useful
form of Bayes' theorem is when you divide both sides by this P(D). But there's a better way to write that. Sorry, I'm doing LaTeX here. Now it's going to come out looking like a fraction. So there it is: P(H | D) = P(D | H) P(H) / P(D). That's the statement of Bayes' theorem in the way that you use it in machine learning, hypothesis testing, and so on.

The probability of H given D is the probability that your hypothesis is true given the data that you have in hand. That's always what we want to find out: we have some data, we've thought about a hypothesis or created a model, and we want to figure out if it's true. It can be written as the product of P(D | H), that is, the probability that one would have gotten this data if the hypothesis were true, times the probability that the hypothesis itself is true, divided by the probability of getting the data that we got. So these are the things that we work with. I should write this down: this thing here is called the posterior. Every one of these terms has a name, by the way, so that's what I'm trying to...

I have a question.

Yes, what is it?

What exactly does it mean to have the probability of the data?

In other words, it's the probability of seeing this kind of data. Let's see, how about weather, for example. Say you have a rainy day; then you can look in the whole history of meteorological records for your area and figure out what the probability is that you would see a rainy day. That's what that would be.

Okay, thank you.

Okay, let's see. This thing here is often referred to as the likelihood, the likelihood of the data; that can also be computed, as we'll see in this example. P(H) is the prior, let's see. So I
should have typed this up beforehand instead of making you wait, but I'm almost there.

It's okay. Just FYI, I think you overwrote something in the main equation; the prior got overwritten.

Yes, I see that, thank you. So that was the likelihood, and this should be P(D). Yes, thank you for spotting that. Okay, so I've got the prior, and we've got the denominator. Let's see, this is the prior, and then finally the denominator part is the evidence. So these are the pieces. Let's see what it looks like; I just need to put line breaks there. So these are the names for the terms: posterior, likelihood, prior, and evidence. We're going to talk about all these terms in the context of the problem that we're looking at now. This is the structure to keep in mind; it's what we're going to be applying to our problem here with the HIV test result.

So let's review what we know. This guy was diagnosed as HIV positive in 1989, at which time the false positive rate of the HIV test was one in a thousand. I checked, by the way, and it's still one in a thousand. According to the CDC, the Center for Disease Control, the incidence of HIV infection in the ordinary population is one in 10,000, I think in the ordinary population of white males. I think I put some reference to that at the bottom; we'll get to that. Anyway, here we go, these are the three things that we know.

We want to determine the chance that he was actually infected with HIV, that is, the probability that any person who received a positive HIV test result was actually infected with HIV. We'll proceed by translating our problem into the language of probabilities and conditional probabilities. So the probability that a person in the sample is or is not infected with HIV, these are the priors: the probability
of HIV is 1 in 10,000, which is 0.0001, and the probability of not having HIV is 0.9999. Those are the priors, the things that we already know about our population. The false positive rate is the conditional probability that a person who doesn't have HIV would test positive, and this is what we've called the likelihood. So the probability of a positive test result given that you're not HIV infected: you can still get a positive test result one in a thousand times. That's the false positive rate.

What we want to determine is the probability that a person who is given a positive test result was misdiagnosed as HIV infected. That would be the probability that the person isn't HIV infected given that the result is positive. Here the bar means "not"; I think you're familiar with that. This is what we call the posterior, and the posterior is always the prior probability updated with knowledge that you have, so it's an updated version of the prior.

Surveying what we've done so far, you can see that these pieces are ripe for the application of Bayes' theorem. These are two conditional probabilities where the two variables are reversed, and that's a sign that you can apply Bayes' theorem. So we can write down Bayes' theorem for this situation, just a straight application of the structure that I gave you before, and then we solve for the posterior, which is what we want to find out. This is what Bayes' theorem says about this problem. Our job is to make sure that we can compute all three of these pieces, and when we do, we can put them together and figure out the answer to our problem. We already know two of the three pieces on the right-hand side of the equation, namely the likelihood, the probability of a positive result
given that the person isn't HIV infected, which is 0.001, and the prior, the probability that the person is not HIV infected. We know that the probability that the person is infected is 1 in 10,000, so the probability that the person is not infected is 0.9999, that is, 1 minus 1 over 10,000.

So now all we have to do is compute the third piece, which is the probability of getting a positive result, that is, the probability of the data; the name we gave it is the evidence. We reason that if a person tests positive there are two possibilities: either the person was infected with HIV or they weren't. So we can calculate this thing as the sum of two terms: the probability that the person was positive and had HIV, plus the probability that the person was positive and didn't have HIV. That's these two terms.

And here we've hit a snag. If you review what we know, we know this guy here, we know this guy here, and we know this guy here. The only piece that we don't know is this one, the probability of a positive result given that the person has HIV. This is a really important number, by the way. It's a number that quantifies the test's effectiveness in detecting HIV, and you can imagine that this is the number that the drug companies have to make as close to one as possible. They can't make it exactly equal to one, but if it's less than one, that means that the test sometimes misses detecting the disease, and that's bad. That's called a false negative, a missed detection. But as I said, it stands to reason that the drug companies have worked hard to make this number close to one, so let's forge ahead with that assumption and see where it goes. We substitute everything we know into this equation. We don't know this number, but we'll say, okay, we think it's very close to one, so let's just put one for it. We do that and we end up with
this equation. Notice that, because of the way these numbers are, it doesn't much matter that we don't know this number exactly, because it gets multiplied by 0.0001. So whether it's 1 or 0.99 or 0.98, the other term dominates. Let's see, I didn't get the LaTeX right; there's a mistake there, I just saw. But it turns out that our approximation was justified because, like I said, the second term dominates. And actually I went to the internet and looked up the sensitivity of the HIV test, that is, the probability that it gets a detection if the person really is HIV positive, and that number is about 99%. So the actual known value is 0.99; we said it was 1, and it doesn't much matter, because the second term in this equation is much bigger.

I also wanted to mention that if you look carefully at the table in section 2, the table where we did it the easy way, they actually made the same assumption implicitly, but they didn't talk about it. In other words, the one person in the sample that had HIV actually tested positive. It won't always happen that way; it only happens 99% of the time that a person who's really HIV positive will get a positive test result. So now we have a... yes?

A quick question: what is the difference between the left-hand side, the probability that the test is positive, and the conditional probability of testing positive given HIV? That second part, testing positive given HIV, we've assumed is close to 1. How is that different from the probability that it's positive?

Okay, that's a really great question. So this is the probability that the test is positive in the population of people that do have HIV,
in other words, the probability that an HIV-infected person, given the test, will come up positive. That's what this is. Does that make sense?

So that's the probability based on the sample?

Yes, that's right, the sample of people who are HIV infected, which is a small subset of the whole population. This other probability is the probability that a test will come up positive if anybody in the entire population took it. And we did specify the population, white males who aren't drug users; that was our sample, as we said in the beginning. I like that you asked that question, because these terms look the same but they're very, very different. This number is going to be very large, very close to one, because the drug companies worked hard to make this HIV test right. This other number is going to be much less than one: the probability that a person in the general population is going to get a positive result is going to be way less than one. So that's a really good question; I'm glad you asked it.

Yeah, it's the difference between the sample and the population.

Yes, we have to weight it, that's right.

Okay, the last line. Now we have all the numbers, so we put them back together and compute the posterior, the probability that a person who received a positive test result really didn't have HIV. Putting together the three pieces that we just figured out, that probability turns out to be about 0.9, or 0.91. So the probability that Mlodinow was infected with HIV is one minus that, and it's 0.0909. So it is one in eleven, nine percent.

Okay, so we've got the same result, but here we got it by a much more principled, well, not more principled, but a more generally applicable way of thinking. You might think, why do I go to all this trouble to get a simple result that we could get by one paragraph of thinking? The answer is that the simple method is nice applied to a simple problem, but it's not so nice if you have a more complicated problem, whereas the Bayes method will be applicable to much more complicated problems. If you ever want to read machine learning papers, you'll see this gets used all the time, and increasingly, as the years go by, people are using Bayesian reasoning. So that's your introduction to Bayesian reasoning, and you should now be able to follow discussions that refer to Bayes' theorem and Bayesian reasoning. That's what I wanted to do with the first part of this lesson.

This is something that Rachel put together; it was in the original notebook. It's a quick review of the naive Bayes classifier, and now that you've seen our discussion of Bayes' theorem, this will be very easy for you to follow. Rachel also has a discussion about numerical stability, which is interesting, but I think you guys can go through it on your own. It's really interesting to understand how computers use numbers and how they're limited in their calculations, the real limitations that you have in terms of memory and time and so on. There's a really interesting set of numbers that she points to that everybody should know, and it's good to get familiar with these numbers. Rachel also put together naive Bayes in an Excel spreadsheet, which might be interesting for you to look at, but I think,
given that I did this whole thing in the last notebook, where I gave a pretty detailed discussion of the naive Bayes classifier, if you missed that last class you can go back to notebook number three. That was a really long notebook, but I did give a pretty complete discussion of the naive Bayes classifier applied to the IMDB movie reviews. There, the hypothesis that we were trying to test, the model of the world, was the question of whether a review composed of a given set of words is likely to be positive or negative. So that's that.

The main part of today I wanted to spend talking about regex, regular expressions. I'm not sure how to pronounce it: since it's short for "regular expression" I would prefer a hard g, but I've heard it said the other way, as you said. Regex is really a mini-language that is used to parse strings. If you've seen this, you're going to be familiar with everything I say; if you haven't, this will be a chance to get a solid introduction to it. I must say that I just learned this myself, going through Rachel's notebook and some of the references that she gave. I've seen regex many, many times over the years, but I never bothered to delve into it because it just looked like a bunch of gibberish to me. I want to show you that it's not; it's really a language. The first thing you do when you start to learn a language is learn some of the vocabulary, and then you learn how to put the vocabulary together. So that's what we're going to do. Regex has a whole bunch of things called metacharacters, and there's always a distinction between a metacharacter and a literal,
so let's get that out of the way first. A literal is basically a character in a string that you're trying to find or search for, like a period, an exclamation point, a digit, or a left or right parenthesis. These are the actual characters in the text string that you're searching for; that's what a literal is.

What's not a literal is a metacharacter. A metacharacter is a special character that is part of the vocabulary of this mini-language of regex. The backslash is a special metacharacter that says: make the following metacharacter into a literal. For a simple application, let's just scroll down the list here. This metacharacter, the period, means match one instance of any character; it's a wildcard match. But what if you actually want to match a period? Then you get confused, because the period is a metacharacter in this language. To overcome this problem, we put a backslash in front of that period, and that makes it into a literal period. It turns out that the dollar sign is the metacharacter for matching at the end of a string, so if I want to match a string that ends with a period, then `\.$` is a regular expression to do that. The backslash, the escape, in front of the period means treat that period as a regular literal period, not as a metacharacter, not as a vocabulary word in regex. Does that make sense?

Yes.

Okay. There's a bunch of other metacharacters in regular expressions. This is an interesting group of them, the ones with backslashes in front of them. `\d` means match
one instance of a digit; that's the d for digit. \D, with a capital D, means match one instance of a non-digit. \w means match one instance of an alphanumeric, that is, a through z and 0 through 9. And \W, capital W, means match one instance of a non-alphanumeric, which means that the character has to be a symbol, I think. What I thought at first was that non-alphanumeric should include spaces, like whitespace. I'm not sure if that's the case; I think that it doesn't include whitespace, but we can check that. So I think it means a symbol like a plus sign or a question mark or an exclamation point or a pound sign, one of those symbols. Then \s means match one instance of a whitespace character, and \S, capital S, means match one instance of a non-whitespace character. So these are important selectors for what you want to match.

Then, as I said, dot is a wildcard match; it means match one instance of any character. Question mark means match zero or one instance of the preceding character, so this is like an optional match. If I put 1?, I'm asking whether the string has a 1 in it, but it can return a positive result even if the string has no instances of 1 in it. So it's an optional match. Plus means match one or more instances of the preceding character, so that makes it a mandatory match. Star means match zero or more instances of the preceding character. So for example, if I put 2*, I'm asking for a string that contains zero or more 2s; but if I put 2+, then I'm asking that the string contains one or more 2s. I mean, I want to match one or more 2s. And then these are quantifiers.
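The vocabulary so far can be tried out directly. The following is a minimal sketch using Python's built-in `re` module (the package the session imports later); the sample strings and variable names are made up for illustration and do not come from the notebook. It also runs the check on `\W` that was left open above: in Python's `re`, `\W` does match whitespace, and `\w` also accepts the underscore in addition to letters and digits.

```python
import re

# Escaping: an unescaped dot is a wildcard; a backslash makes it a literal period.
wildcard_hit = re.search(r"a.c", "abc")        # '.' matches the 'b'
literal_miss = re.search(r"a\.c", "abc")       # no real period in "abc", so None
ends_with_period = re.search(r"\.$", "The end.")

# Shorthand character classes on a made-up sample string.
text = "Room 42, floor_3!"
digits = re.findall(r"\d", text)       # ['4', '2', '3']
non_word = re.findall(r"\W", text)     # [' ', ',', ' ', '!']  (spaces are included)
word_chars = re.findall(r"\w", "a_1")  # ['a', '_', '1']       (underscore counts)
spaces = re.findall(r"\s", text)       # [' ', ' ']

# Quantifiers apply to the character just before them.
optional_b = [bool(re.fullmatch(r"ab?c", s)) for s in ("ac", "abc", "abbc")]
one_or_more_b = [bool(re.fullmatch(r"ab+c", s)) for s in ("ac", "abc", "abbc")]
zero_or_more_b = [bool(re.fullmatch(r"ab*c", s)) for s in ("ac", "abc", "abbc")]

# "1?" succeeds even when there is no 1: it matches the empty string.
optional_one = re.search(r"1?", "abc")
print(repr(optional_one.group()))
```

The last line prints an empty string, which is the "positive result even if the string has no instances of 1" behaviour described above.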
The curly brackets, {3}, mean match exactly three instances of the preceding character. So if I put b and then these brackets, b{3}, then I want to match a string that has three b's in it. The other way you can use this is {m,n}, which means match a string where the preceding character occurs, sorry, I had this backwards, at least m times and at most n times, where m is less than n.

And then there are the square brackets, which mean: whatever is in the square brackets, match exactly one instance of it. So if I have [a-z], that means match exactly one alphabetic character, a through z. The caret is kind of special because it has two interpretations. When it occurs inside the square brackets, it means logical not, so [^a-z] means match exactly one instance of a character that is not an alphabetical character. The other use of the caret refers to the beginning of a string, so ^A means match A if it is the first character in the string. And it has a matching end character, the dollar sign: x$ refers to the character x at the end of the string, and b$ means match the character b if it occurs at the end of a string. If you want to match a period at the end of a string, then you have to use the escape, the backslash character, because period is a metacharacter and we're interpreting it as a literal here. We want to match a literal period at the end of the string, so we have to do this.

Sorry, yes? "Quick question. Right there, x dollar sign: I thought that was x asterisk, for some reason, in regex. It's been a
while for me." Yeah, the asterisk back here means match zero or more instances of the preceding character. "Okay, so it's kind of a wildcard, but it's optional; it could have zero." Yeah, you might be thinking of some other language; star has this interpretation of wildcard in some use cases. In other words, the asterisk is a wildcard in some languages, but in the regex language it's an optional match. "Okay, yeah, thank you."

So these are just the pieces of the language. You can't memorize them if you just look at them once, unless you have a super memory, but you'll pick them up as you see them. This is almost a hundred percent; there are other things in the regex language, but this is most of what you'll ever need.

And then the last thing I wanted to show you was the group table. Let's see a couple of examples; I think we got to this. z$ matches z if it's the last character in the string. \w$: remember what \w means, it means an alphanumeric. Sorry, I made a mistake here: \w is an alphanumeric, that's what we said up here, and down here I said it was a digit, so that's wrong. So \w matches an alphanumeric character; in other words, if the string ends with any alphanumeric character, then match it.

And then the round brackets have another meaning: they mean match a group of characters in the order they're given. Remember, with the square brackets it refers to one character, and it can be any one of the characters that you specified inside the square brackets. But the round brackets have a different meaning: they mean match a group of characters in that order. So (dog) means match a string that contains the characters d, o, g in that order.

And then this is a summary that I got from this reference. This is a really good
reference, by the way; we might want to visit it. Let's visit it: RegexOne. It's a very brief tutorial, and I want to go to this first lesson. Probably one or two hours yourself is enough to get through all the lessons. It just gives you all these little examples and takes you through all of the different vocabulary words.

For example, here it gives you a problem: you want a regex expression that will match all of these things. There are a lot of ways to do this. One way is to match any string that has an a at the beginning of it. Let's see, wait a minute, the beginning was caret, right? So match any string that begins with a. Maybe I have to put the caret on the other side: ^a. Yeah, and that matches all of these things.

There are lots of ways to do this. These are all alphanumeric characters, so I can say \w, and that will match the first a. But now I want to do this any number of times, so I can do something like this; I think this will work. No, that doesn't do it. Let's see. Any suggestions, if you want to? Yeah, you could. Right, right. So what I had was \w, and then I need a plus sign: \w+. That says match one or more instances of this thing, and this thing is any alphanumeric, so that works. The other thing that can work: well, dot only matches one instance, but I could go .+ and that'll match one or more. But notice that .* also works. Remember, dot is any character, and asterisk says zero or more, so that will also work. Why would star and plus both work? Because one of them says one or more and the other one says zero or more.
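The lesson solutions above, along with the brackets, anchors and groups from the vocabulary list, can be checked in a few lines. This is a sketch using Python's `re` module; the three sample strings are assumed stand-ins for the tutorial's strings, not copied from it, and the variable names are illustrative only.

```python
import re

# Assumed stand-ins for the lesson's strings (all start with "a", all alphanumeric).
samples = ["abcdefg", "abcde", "abc"]

# Each candidate pattern finds a match in every sample.
patterns = [r"^a", r"\w+", r".+", r".*"]
all_match = {p: all(re.search(p, s) for s in samples) for p in patterns}

# \w alone matches one character; \w+ extends over the whole run.
first_char = re.search(r"\w", "abcdefg").group()    # 'a'
whole_run = re.search(r"\w+", "abcdefg").group()    # 'abcdefg'

# Star and plus only differ when there is nothing to match at all.
plus_on_empty = re.fullmatch(r".+", "")   # None: needs at least one character
star_on_empty = re.fullmatch(r".*", "")   # matches the empty string

# Recap of the other notation from the vocabulary list.
exactly_three = re.search(r"b{3}", "abbbc").group()            # 'bbb'
range_ok = bool(re.fullmatch(r"b{2,4}", "bbb"))                # 3 is within 2..4
range_too_many = bool(re.fullmatch(r"b{2,4}", "bbbbb"))        # 5 is more than 4
not_lowercase = re.findall(r"[^a-z]", "ab3c!")                 # ['3', '!']
starts_with_A = [bool(re.search(r"^A", s)) for s in ("Apple", "an Apple")]
ends_with_b = [bool(re.search(r"b$", s)) for s in ("grab", "bag")]
dog_group = re.search(r"(dog)", "hotdogs").group(1)            # 'dog'
dog_in_god = re.search(r"(dog)", "god")                        # None: order matters
```

On non-empty strings like these, `.+` and `.*` behave the same, which is why both pass the lesson; the empty-string case is the one place where they give different answers.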
And both of these work because, remember, it's the "or more" that's doing the matching, whether it's one or more or zero or more. So that's lesson one.

Okay, so now that we've done lesson 1, let's do one more. Let's see, how do you go to the next one? Oh, I have to pass the test in order to move on. So \w+, that was our solution, and we move on to the next lesson. Now it wants us to write a pattern that matches all the digits in the strings below. So again, any character, plus, will work; this is the same solution. But there are other ways to do it, I'm sure. Yeah: if you try \d, it will pick out one digit. If you type \d+, it will match all the instances of digits, but it doesn't match anything that's not a digit. If you do capital \D+, it matches everything else. I guess it does include spaces: anything that's not a digit, including a blank space, I think gets matched. Ah, but it's consecutive instances, right? So you have this abc and then it stops, and it doesn't match the xyz, because that's not in a consecutive series of characters. Again, here it's going to match up to this apostrophe, or quote mark, whatever you call it, but then it stops and doesn't catch the last quote mark, because that's not in a consecutive series. Same thing here.

Anyway, I highly recommend doing these lessons. It doesn't take much time, and you'll have a good basic understanding of regex when you're done.

But back to my lesson. What Rachel started with is a common problem: match phone numbers. Given a text string, tell me whether it's a phone
number or not. She gave several tries at doing that. To make this work, by the way, you have to import the string package; I didn't think you'd need it, but I did have to do this, otherwise you find it doesn't work. So she defines a bunch of phone numbers, and she already knows the answers. This one is not a phone number; it's an address in San Francisco. And string.digits is just a way of matching digits.

She first starts with a function that returns true if the characters are digits or dashes or spaces or parentheses, because those are the common characters that you'll find in a phone number. But you can already see that there's a problem here, because phone numbers have to have a special ordering. You can't just put a jumble of characters that are digits and minus signs and parentheses and spaces; that won't work. And we see that. So she does this assert check. Remember, assert does nothing if the statement is true; otherwise it raises an AssertionError. So she's checking that these things are phone numbers, and asserting that this one is not a phone number, and all this works.

But then try this one, another possibility: just these four digits. See if that's a phone number. If you do the assert, you'll get an assertion error, because the simple function she created here won't catch that that's not a phone number. She knows it isn't going to work; she's just giving you this as an illustration. This little phone number checker doesn't work because it's not paying attention to the ordering. Not only that, a phone number has to have exactly ten digits in it. So this should fail. But it doesn't give an assertion error, because I think I ran the
other way; yeah, I ran the other definition earlier. So run that one, and that one, and now we get an assertion error, because the function that we defined doesn't check that there are exactly ten digits. So now she figures, okay, let's make the test nicer: we'll make sure that we have exactly ten digits, and try that. Now it all works. But then what about this string, a string that has ten digits in it, but organized this way? Is that going to be a phone number? Let's see. I just wrote that, so I didn't see the assertion errors, but you can do it this way, and it throws an assertion error. It says that it's not a phone number; it has exactly ten digits in it, and so it should be a phone number. Anyway, Rachel's just doing these examples to show you that we need something better than that, and that's why we need regex.

So regex can do that. One way to do it is with the square brackets. Remember, [0-9] means match exactly one instance of a digit between 0 and 9; every square bracket matches exactly one. Then the dash matches exactly one dash, then there's the second group of three digits, and there's the third group, which should have four digits. That will match a US phone number. But you don't have to do it that way; you can be smarter about it. You can do it this way, with \d's, and you can do it even smarter by using the exact quantifier: match exactly three digits, then a dash, then exactly three digits, then a dash, then exactly four digits. That will match a phone number in that format. And of course you can go one step further and say, well, I want to match phone numbers that use the parentheses to set off the area
code. You can figure out how to make a regular expression that will do that as well. What you ultimately want is to be able to match a phone number in any valid form. Some people specify phone numbers using periods as the separators instead of dashes, some people use parentheses around the area code, some people just use spaces in between, and some people might not even use any blank space; they might just give the string of ten digits. So if you wanted to write a real phone number classifier, you would have to take into account all those special cases, which might be fun to do.

A review of the quantifiers: remember, the question mark is zero or one instance, the star is zero or more instances, and the plus is one or more instances. This is the stuff we just went through, the little thing that I put in for the vocabulary, the dictionary of regex metacharacters, and now we're getting a little more familiar with them.

So now what Rachel wants to do is revisit tokenizers. Let's see if we can build our own tokenizer. I have to admit that I have not gone through this section yet, so let's just go through it together and see if we can understand it. The regex package is called re, so we first have to import re. Now, re works like this: you make a valid regex expression and enclose it between these quote marks, and then it will compile your regex expression. What I've selected here is the regex expression; the quote marks tell the regex package that we're giving it a regex expression, and re.compile means compile this thing so that we know what we're looking for. So let's see: this one, the punctuation regex, means find one
of any of these characters; I think that's all it is. "I thought the parentheses were to match ordered strings. Or are we telling it to match a string?" Yes, you're right. So we're telling it to match a string that has a bracket, a square bracket. An escape in front of a quote mark means a literal quote mark; an escape in front of an apostrophe means a literal apostrophe. But then why do we have to do that? And then this period. So I'm not exactly sure what this thing does.

Then the apostrophe regex, I think that's what this one means: re.compile with this. I'm not sure what it means when you add the r in front. Sometimes you add the r in front of the quote marks and sometimes you don't, and I'm not quite sure why. It looks like when you do add the r it means a literal string, because that's what we're doing here: we're matching a special sequence of characters. Here we're matching an n-apostrophe-t, and here we're matching an apostrophe-s. These are common ways that words can end in the English language that we're trying to look for. Then multiple spaces: we have a space with a star, where the comment says replace multiple spaces with just one space.

"If I can go back to the first one: basically, the parentheses are so that we can return the result. You can see that in the simple tokens part it returns what it finds. The square brackets then say to find any of the punctuation in there, so it's just kind of an 'or' of any of these." That's right, that's right. "And the backslash just means find it literally." Yes, okay, all these things. So this is great, thank you for that.
Okay, yeah. Thank you for that. So I'm wondering why we didn't put a backslash in front of the period, and another backslash in front of these parentheses, because they're metacharacters. Anyway, we can kind of see what we're doing here. The r seems to mean literal characters, or maybe r means replace. Oh, a couple of people in the chat said that the r tells it that it's a raw string. Okay, well, that's what I meant by literal. So that means look for this literal string, and it means we're dealing with the regex language when you don't use an r? Okay, great: r for raw, thank you for that.

Multiple spaces: so we want to find multiple spaces. "It finds the first space and then says the second space is optional." Thank you, that's right. So find a series of consecutive spaces: one or more consecutive spaces, because you have the first space, and then the second space has the star, which says zero or more. I think it might be better to just write a space with a plus after it; one space with a plus after it means one or more spaces, the same thing. So I think I would replace this. "But it says it's a raw string, and they want to replace multiple spaces, depending on what the input is." Oh, okay, so you're saying r means replace? "No, I'm saying the point of this is to replace multiple spaces, so maybe they want to search for multiple spaces; that's why they put the extra space there." "Sorry to speak over you: wouldn't the double space mean it's always looking for two spaces?" Well, yeah, but I think someone pointed out that the second space is qualified by this star, which means zero or more instances, so that would mean
search for one space followed by zero or more spaces. "Yes." And now I'm a little confused, because if r means raw string, then how come they're using a regex metacharacter inside of it? That's a little confusing to me. And if they're going to say that this is "replace multiple spaces with one", why can't they just... "Raw just means do not use the backslash as an escape character. So backslash double-quote would mean a backslash and a double quote; that's at least what I understood. Raw doesn't mean just a literal string; it means a backslash is taken literally as a backslash." Okay, so when you have something that's preceded by r, it means that whatever is in there is to be taken as raw characters, but including the metacharacters, right? "That's right." Okay.

Anyway, what I was thinking was: why shouldn't this be one space followed by a plus? Wouldn't that accomplish the same thing? Because a plus says match one or more instances of a space. So I think that these two are equivalent.

Then she has a function to define simple tokens of a string that you send it: sent equals re_punc.sub of something. re_punc is this compiled expression, and .sub means substitute. So this is a compiled regex expression; this is a method on that regex expression, which is called substitute; this is the string that you want to substitute for the expression; and this is the input string, sent. That's how I read this. Does everybody think that might be right? That's how I'm understanding it. Okay. sent means sentence, I think. So the first thing she does is she
substitutes any punctuation, that's what we agreed that this does, with space, \1, space. I'm not sure about the \1. "So \1 is the thing which was found in the first group." Okay, yeah, that's right, very good, thank you. You should obviously be teaching us. So yeah, this is the group. Whenever you tell it to find a group, you can refer to that group by \1. If you asked for two groups, then \2 would be the second group that you specified. So you can have expressions that have multiple sets of round brackets in them, and each set of round brackets returns an expression. And this \1 says: return the first group that I asked you to find in this compiled expression. So here we have a group that matches any punctuation, and what this does is surround any punctuation with blank characters. So any punctuation mark gets made into its own token.

Then the next thing is this apostrophe. This one says find an expression like didn't or don't, that kind of expression, and if you find it, that's the compiled expression that we have, substitute for it; in other words, surround it by blank spaces. And then for anything that ends with an apostrophe-s, separate off that apostrophe-s and make it into its own token. So that's what this is doing: it's going to take the word don't and make it into two tokens, do and n't; it splits off the ending n't and makes that into its own token. And this one splits off the apostrophe-s and makes that into its own token. How does it do that? It surrounds it by spaces, because that's how we're going
to end up interpreting a token: once we preprocess this text into its token groups, we're going to go pick off all the token groups. And then what does this one do? It takes all the multiple spaces and collapses them into one. So this one does exactly what you said: it takes all groups of multiple spaces and collapses them into one. And then it lowercases the whole string, splits it into tokens, and returns it. So that's what this little function does.

"Can I give one other comment? I don't know whether it's particularly relevant, but just to emphasize that these things are so subtle and tricky: with respect to the star and the plus that you had up above, the difference, I think, is that in the star case it'll also capture single spaces, and in your plus case it will only capture double or more spaces." Well, actually, what I did was this: the star version has two spaces; the first space is something that has to be there, and the second part is something that optionally can be there. So that's one or more spaces. "Right, so with the star you do capture the case where there is no space after the first space; it's going to capture the case where there's a single space. Got it." And this one does the same thing, right? Because I only put one space here, but it has to have at least one. "Oh, I think I see: in the star version, the star says zero or more of that second space, so the second space might be absent. That's exactly what I mean. With the plus, your second space is required." Oh wait, wait, no, because whatever precedes the plus is the thing you're looking for, and it says one or more. So that means one or more spaces. "Yeah, it'll still capture one." So these two are equivalent, I
claim. I claim these two are equivalent. "Yeah." Because, see, the thing about these special metacharacters is that some of them are quantifiers; they need something that occurs before them: the star, the plus, the question mark. And other ones need something to occur after them. "Yeah, and here what's being quantified is the space in front of it." Okay, yes. So there's a little bit of subtlety here.

But anyway, we're on our way to building a tokenizer. This little function seems to do what we need a tokenizer to do. You give it a sentence, and it lowercases the sentence, because we're not concerned with capitals; that's what this lower does. So sent.lower() converts the whole sentence to lowercase: wherever there's a capital letter, it switches it for its lowercase version. And then it splits. But before we get to this point, we've already separated out all the things that we want to be tokens by placing whitespace around them; that's what these lines are doing.

So this is our tokenizer. Give it a text string and let's see what it does. Now we're going to join into a single string the result of our simple tokenizer applied to the text, and see what it gives us. And it does exactly what we wanted it to do: it separates out the n't, it splits the apostrophe-s off from the word, and it splits each punctuation mark into its own token. Remember, these originally occurred as two dashes in a row, but it picked the dashes out and surrounded them by whitespace; that's exactly what we wanted. It picks out the apostrophe and surrounds it by whitespace. It picked out the
period. And notice also that it de-capitalized all the capital letters and made them lowercase. And it separated out this apostrophe, the question mark, and the quote mark. So this is what we expect our tokenizer to do, and it seems to be doing it right.

Another example. Now it's going to send it the same text and just do this part: surround every punctuation mark by spaces. So it does that, but it doesn't do the lowercasing; our tokenizer did, remember. And let's see, it did separate the apostrophe, but it didn't catch this n't. Okay, so this function alone doesn't catch the n't. So now we show that if we put the text through our apostrophe catcher, then it would... I forgot to run it. Thank you. And now, let's see, it didn't seem to catch the first quote mark. "But those are double quotes; maybe they're not defined?" Oh, okay, so the double quote should also be caught somewhere. Yeah, here it is; it should be caught here, in the punctuation. So that's weird. "The double quote in front is not part of the string; it's just the way it's displayed." Oh, okay, thank you for that. Yeah, so the double quote is not part of the string that we entered. Let's go back to check that. Yeah, the double quote merely tells you that it's a string. So this is the actual string we're looking at. There we go.

So this is illustrating the action of the whole tokenizer, showing you that it does exactly what you wanted it to do, and this is just showing you what the parts do. So that's fine. Now we can give it a whole bunch of sentences. Let's see: make tokens, which is a list that maps simple
tokens over the sentences, by applying the function simple_toks; that's what map does, it applies the function. So map takes two arguments: the first is a function, the operation to run, and the second is an input, and map applies this function to each item of this input. So that's exactly what we want it to do. Then we can look at our result. I want to make sure: is the tokenizer returning a text string? No, this thing is returning a list. Okay, so simple_toks is a function, sentences is a list of sentences, and this function is applied to each element in this list of sentences, so you get a list of lists. Here's this sentence, and we can check that it's doing the right thing: "All this happened, more or less." It separates out the period, it de-capitalizes the A, and it separates out all the words. So it's doing what we wanted it to do. And "The war parts, anyway, are pretty much true." It separated out all the tokens, so it's doing what it's supposed to do.

Okay, so now, remember, the next thing that our tokenizer has to do is convert all of your tokens to integers, and then we have to have a way of converting between the words and the integer IDs, the indexes. She uses this collections thing again; we've seen it before. We have PAD, which is a zero, and SOS; I'm not sure what SOS stands for. Anyway, it's a 1. We have a function called toks2ids. It takes a list of sentences as input. Let's see: the vocabulary count is collections.Counter, we've seen that, and then it goes through each element: for sent in sentences, for t in sent. So t is a token in the sentence, so it's going through the tokens for all the sentences. Okay, well, this
is just a function, and whatever it's given as input, it breaks it up into tokens. vocab: now it sorts them all. It sorts voc_cnt with key equal to voc_cnt.get and reverse equal to True, so it sorts in reverse; I'm not sure why it's doing that. Then it inserts: at PAD, which is zero, it inserts this, and at SOS it inserts this. Then w2id, this is a dictionary; w2id is converting words to IDs. It's going through and enumerating the elements in vocab. So it's basically sorting the vocabulary words in reverse order, and then you enumerate them. Remember what enumerate does: it goes through a list and gives you index zero and then that word, index one and then that word, and so on. So "for i, w in enumerate" goes through this list, and what we're interested in is these indexes, because the index is the numerical mapping that we're going to make between the token and the numbers. So all it's doing is taking the i and w that come out of enumerate and making a dictionary that has the word as the key and the index as the value. So this is how you can map a token to its value. And then, remember, we have to do the reverse; we have to have a way of mapping an index to a word. And here we have w2id of t, which gives you the index, for t in sent. So this is going through each sentence in your list of sentences and mapping the tokens to their numerical IDs. So it's producing our mapping of the tokens into a numerical space. Then it returns that numerical mapping, the vocabulary, the word-to-index dictionary, and the vocabulary count.

"My guess about the reverse part is that, since it's sorting by word count, without it it would sort the least frequent words first, and we want the most frequent words first. So that's why it's reversed; it goes from the largest to the smallest." Yes, thank
you for that, because I forgot to mention that it's looking at counts. The key is the count, so it makes sense to sort in reverse, because we want the most important words first. And that's what we ended up seeing when we did this before: the vocabulary list starts with the most frequently used tokens. So thank you, that's the reasoning behind that.

Now here's how you apply this function. You just get all the outputs out of it: they equal toks2ids of tokens. We've done this, so let's go back, and there are our IDs. It's converted this list of sentences, where each list item corresponds to a sentence, and mapped all of the tokens to their equivalent numerical IDs. This is what we mean when we talk about embedding these words. Well, the embedding comes later: it's when we convert a sentence into a vector whose dimension equals the total number of vocabulary words. Let's see, indexes. Yes, so these are the indexes of the words that got used.

Then we've collected all of our words together in this list: all the words that occurred in any of these sentences are collected into our vocabulary list. You can imagine that the more movie reviews you start with, the greater your vocabulary will get. It'll be an increasing curve, the number of unique words versus the number of reviews you've collected, but eventually it will level off. If you collect many, many reviews, you're not going to find any more new words for your dictionary. We saw that in our example of the IMDB sample, where we had something like six thousand and sixteen words. This is a much
smaller set of unique words; it can actually be displayed on almost one screen. So there, and there. Let's examine these things. This is what the top tokens look like. We have a padding token. We have an SOS token; I'm not sure what that is, start or end or something. Start of sentence? Very good, thank you, that makes sense. Then we have a period, we have a comma, we have "i". These all make sense as the most commonly used things. I would have thought that just "a" would occur more often, but maybe not in this set of reviews. Oh, there's "a"; it's just not as common as some of the other words. And we see the individual tokens that we encounter in sentences, and there they are. So that's the vocabulary.

Question: what could be another name for the vocabulary variable above? Not sure. I don't know what else. Nobody?

Here's our dictionary that maps a token to an ID, and you can see that it's doing its job: it's mapping the first token to 0, the second token to 1, the third token to 2, and so on. To refresh your memory, there's the first, second, and third. The last token is "names", and it got mapped to the largest number, 51, so counting from zero we have 52 entries in our dictionary here.

So that's pretty much regex and how it gets used. Then she reminds us that string has its own methods, called find, substitute, start. So you can call a method on a string. String dot find gives you a way to find: you give it a string and you tell it you want to find that string. So, for example, say text equals something, and then you could say "ah" find and see what it does. Let's call this something and see what we get. Oops: find takes at least one argument and none given. Oh, sorry, I need to give it the text. I'll just go back to what I had. I need the text. I need to
write this with the right syntax: text dot find, and what I want to find goes in here. And it tells us, I guess, the number of times it found that, or maybe the index of the last one? Let's see. There are three instances of "ah", and it says 2. I think that 2 refers to the index of the third instance. I'm not sure. It's the first? Oh, the first. Oh, the index, that's right: the index of the first instance of "ah". Zero, one, two. Exactly, thank you.

And find all, I think, finds all the instances. With an underscore, I think: find underscore all. Thank you. Maybe it's something else here, you guys know? Well, we can look again at what we have to work with here. We have a substitute method, we have a start method. Maybe what we need to do is this, which will tell us everything we can do with strings. So let's see: find all, with no underscore? Find all ought to work, so I'm wondering. OK, let's do that, find all with no underscore, and see what we get. String has no attribute find all. Huh.

Well, let's try something else. We can figure out what we have to work with by asking what the properties of this string are, and see what we can do to it. These are all the methods we can apply to it. We can split, we can strip, and we can find. Oh, and it has rfind, if you want to look from the right. That's great. And then lfind? OK, let's try rfind and lfind. We think rfind is going to tell us the index where this one starts. So let's figure out what we expect: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. We think we're going to get 10 back when we do rfind. Right, so we got 10. And then we can do lfind, and we think we should get back the same 2, right? It doesn't have an lfind. So find, that's just find. Oh, that's right, that's right. It
doesn't have a method lfind, because find already searches from the left-hand side. OK, great, thank you for that. So we've got the 2 back.

Anyway, there are other things we can do. We saw that we used a method strip, and I think we used a split method. Split splits on spaces, or on whatever you want. Here would be a more interesting application: suppose you have a comma-separated list of numbers, 12, 14, 22, 17, 31. You give it that list, and then you say text dot split, and you have to tell it what you want to split on. You want to split on commas, so let's see what happens if we tell it to split on commas. It did what we expected. This is how you process a comma-separated list, which you sometimes have to deal with. You could split on spaces, for example. If I split on spaces, I get a different result: I told it to split on spaces, so that means I forced it to include the commas in there.

Anyway, these are just ways of dealing with text. Python has its own string methods, but regex is much more powerful. String methods are easier to understand, and they express the intent more clearly. Regex has a much broader use case and is much more powerful. Regex is used not only in Python but in other languages. And this is something I didn't know: regex can be faster at scale. In other words, if you have lots and lots of text to process, regex is very quick, I guess because it's compiled. You actually compile your regex expression, as you saw above.

Oh, and this is interesting. Let's break down this example. There are Unicode characters, and I can figure out what these are by putting them
in a cell. So I'm going to put those characters in there, but now I'm going to say that it's markup, and then I'll be able to see... Oh, I can't do this. They just go through, they map directly; I thought I was going to see the code for them. Anyway, there are actual Unicode sequences that give you these funny characters, and I don't know what they are. What we've done here is create a message composed of these special Unicode symbols.

And this is kind of interesting. She makes a function called regex frown, which is regex compile of this sequence that is either the yellow frown or the green frown. That's what this means. Then she makes this thing, regex frown sub: you take the frown and substitute a smile for it, and you apply it to the message. Remember, the message was this sequence of characters. So this is compiling a regex function that looks for either a yellow frown or a green frown, and then this function says: look for the yellow or green frown and substitute a smile character for it in this given message. You have to give it a message. Then you do that, and the two frowns are replaced by yellow smiles, which is what you told it to do.

Then Rachel goes through this advice for when you use regex. There are two kinds of errors you can make. One is where you match a string that you didn't intend to match, and the other is where you miss a string that you were supposed to match. So you have to test your regex expressions. When you come up with a regex expression that you think is going to do the job, you should test it on all kinds of test cases to make sure you don't get any of these type 1 or type 2 errors. A lot of times you'll brainstorm a regex expression and say, oh, I think that's going
to work, and then you have to try it out and prove to yourself that it really does work by looking at different cases. Make sure that it doesn't match things you don't want to match, and that it doesn't miss things that you want to catch.

All right. These are some resources that she pointed to. A cheat sheet; I'll see what that has. Yeah, this is kind of convenient. It goes over much of what we did. There's more to regex than what we covered, but what we covered in this lesson is like 90% of what you'll ever need. When you learn a new language or something, the best thing to do is to learn the most important part of it, the stuff that's going to handle 90% of the cases, and then you can learn the rest. That's what we did today: we learned the stuff from regex that will be useful 90% of the time. Sometimes you'll have to learn more, and you'll have to go to something like this, a more comprehensive sheet, to find more subtle features of the language, because there are a lot more things you can do with it. You could look for hexadecimal digits or octal digits, blah blah blah. There's a lot of fancy stuff you can do with it. I guess you can even make if-then-else statements, which we haven't gone into.

But anyway, this is the kind of introduction to regex that I wish I had seen many, many years ago, so I wouldn't have just avoided it all this time, thinking that it was just a bunch of gibberish I didn't want to learn about. I'll speak for myself: I'm glad we did this together. Yeah, doing this alone, and even reviewing it, would be hard. Well, especially since we had G drous to help us, because he seems to really know this very, very well. So that was great, and thank you for that.

All right, well, I actually got done. I didn't think I was going to be able to finish all this stuff, but
I did. Does anybody want to talk about anything, or raise any issues or questions? It's pretty straightforward. Great.

So, next week. Is there some special day next week that I should know about? It's February 8th, right? February 8th, nothing special, so we'll have a class.

What was I going to say? OK, I think this brings us to the end of the classical approach to natural language processing. Maybe there's one more lesson, but pretty soon we're going to start seeing how deep learning is used for natural language processing, and that's going to get really interesting.

So if there are no issues that people want to talk about or things that you want to raise, then I'll see you next week. Thank you. Fantastic, thank you. Thank you all.
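
For reference, the token-to-ID function walked through in this session could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the notebook's actual code: the name toks2ids is written the way it is spoken in the session, the special-token strings and their positions (PAD at 0, SOS at 1) are assumed, and the two example sentences are made up.

```python
from collections import Counter

PAD, SOS = 0, 1  # assumed positions of the two special tokens

def toks2ids(sentences):
    """Map tokenized sentences to integer IDs, most frequent token first."""
    vocab_count = Counter(tok for sent in sentences for tok in sent)
    # Sort tokens by count, largest first; ties keep first-seen order.
    vocab = sorted(vocab_count, key=vocab_count.get, reverse=True)
    vocab.insert(PAD, "<PAD>")  # assumed padding token string
    vocab.insert(SOS, "<SOS>")  # assumed start-of-sentence token string
    # enumerate yields (index, word); flip it into a word -> index dictionary.
    w2id = {w: i for i, w in enumerate(vocab)}
    ids = [[w2id[tok] for tok in sent] for sent in sentences]
    return ids, vocab, w2id, vocab_count

# Made-up example input: two already-tokenized sentences.
sentences = [["i", "liked", "it", "."], ["i", "did", "not", "."]]
ids, vocab, w2id, vocab_count = toks2ids(sentences)

# The reverse mapping (index -> word) is just the vocab list itself.
id2w = dict(enumerate(vocab))
print(ids)
print(len(vocab), max(w2id.values()))
```

Because IDs start at zero, the number of vocabulary entries is always one more than the largest ID, which is worth keeping in mind when reading the size of the dictionary off its last entry.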

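The frown-to-smile substitution and the testing advice from this session can be illustrated with a short sketch. The emoji below are stand-ins, since the exact characters used in the session are not recoverable from the recording, and the names re_frown, cheer_up, and the test strings are hypothetical. The two lists check for the two kinds of errors mentioned above: strings that should have been changed but were missed, and strings that were changed but should not have been.

```python
import re

# Stand-in emoji: the exact characters shown in the session are not known.
FROWN_A = "\U0001F641"  # slightly frowning face
FROWN_B = "\U0001F61E"  # disappointed face
SMILE = "\U0001F642"    # slightly smiling face

# Compile once; the pattern matches either frown.
re_frown = re.compile(FROWN_A + "|" + FROWN_B)

def cheer_up(message):
    """Return message with every frown replaced by a smile."""
    return re_frown.sub(SMILE, message)

message = FROWN_A + " ok " + FROWN_B + " ok " + SMILE
fixed = cheer_up(message)

# Hypothetical test cases for the two kinds of errors.
should_change = ["bad day " + FROWN_A, FROWN_B + FROWN_B]
should_not_change = ["plain text", "already happy " + SMILE, ":("]

missed = [m for m in should_change if cheer_up(m) == m]
overmatched = [m for m in should_not_change if cheer_up(m) != m]
print(missed, overmatched)
```

Both lists come back empty for this pattern. A looser pattern, for example one that also matched the text ":(", would show up in overmatched, which is the kind of check the testing advice is asking for.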