18 - how do I search for a pattern in text regex mini project (Python tutorial for beginners 2019)

Creating an Email Address Text Scraper with Regular Expressions in Python

Hey there, guys! It's Aaron from Clever Programmer here again, and today I'm going to show you how to create a simple email address text scraper using regular expressions in Python. This is just a small project that I thought would be fun to do, but it can actually come in handy if you need to parse emails from a large dataset.

To get started, we're going to need a library called re, which is short for regular expressions. In Python, you bring it in by typing "import re". Now, let's take a quick look at what regular expressions are and how they work. Regular expressions are a way to describe patterns in strings and text so that you can match those patterns and then pick things out.

An email address follows a very specific pattern, right? There are some letters, numbers, and whatnot, followed by an at sign (@), then some more letters and numbers, then a period, and then some more letters and numbers. That's what makes an email address recognizable to humans. So, if we can pick apart those little rules, the structure or pattern of something like an email address, then we can describe that pattern with regular expressions.

Once we do that, we can run the regular expression over the entire text file and basically just pick out everything in there that matches the regular expression. In this case, that would be email addresses. So, let's get started with something simple. I'm going to need some text, so let's just type out a random string.

Literally, a random string. Okay, and next we're going to create a regular expression pattern. We'll call this variable "pattern" and set it by calling the re library's compile function.

The compile function takes a string that describes our regular expression pattern, creates an object from it, and we store that object in pattern. The first pattern I'm going to use is literally just this: a random string, the same text I typed above. I'm doing this because I just want to show you how regular expressions work. If this is my actual regular expression, what it searches for in the text is exactly this text.

So, since I typed exactly the same thing in both places, the pattern is going to match the entire text. When I run this pattern over the text variable, it's going to spit out a match, because the text satisfies what the pattern says. Let's do this real quick. I've got my pattern, and I believe I need to set "result" equal to pattern dot search, and pass it the text. Yeah, search.

What's happening here is that I want to search my text using this pattern, and any result we get back, I'm just going to stick in this result variable. Let's run this. I need one more print statement, actually. Bear with me, I'm doing this on the fly.
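Here is a minimal sketch of the steps so far. The exact sample text typed on screen is an assumption based on the narration; the variable names `text`, `pattern`, and `result` follow the ones mentioned above.

```python
import re

# Sample text to search (assumed wording, reconstructed from the narration).
text = "A random string"

# A literal pattern: it matches exactly these characters, in this order.
pattern = re.compile("A random string")

# search() scans the text and returns the first match, or None if there is none.
result = pattern.search(text)

# Printing the result shows a match object rather than a plain string.
print(result)
```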

Alright, print result. It will print out the result we get. Let's see what happens; I could be completely wrong. Sad truth of life, but hey, look: what we get back is called a match object. If you don't know what an object is, that's okay. If you do, it's just an object, and its printed form includes a field labeled match.

The match here is the string "a random string", because that's what the pattern found in the text. If you don't know what an object is, just forget about it. Basically, all you need to look for is the part that says "match" equals something. Whatever appears right next to match is what it found in the text, so that's pretty cool.
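If you want the matched text as an ordinary string instead of reading it off the printed object, the match object has methods for that. This is a small sketch that goes slightly beyond what is shown above: `group()` and `span()` are standard methods of Python's match objects, while the sample text and the names `matched_text`, `start`, and `end` are made up for illustration.

```python
import re

text = "A random string"  # assumed sample text
result = re.compile("random").search(text)

# search() returns None when nothing matches, so check before using the result.
if result is not None:
    matched_text = result.group()   # the text that matched: "random"
    start, end = result.span()      # where it was found: (2, 8)
    print(matched_text, start, end)
else:
    print("no match")
```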

Now, let's try something else because that's kind of boring. This is just a static kind of regular expression. You can also use what are called character classes, which specify sets of characters. For example, if I put A, B, and C between square brackets, the pattern looks for a single letter that can be A, B, or C. These are uppercase, and matching is case sensitive.
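As a rough sketch of a character class in action, assuming the sample text starts with a capital "A" as the narration suggests:

```python
import re

text = "A random string"  # assumed sample text

# [ABC] matches exactly one character that is "A", "B", or "C" (uppercase only).
pattern = re.compile("[ABC]")
result = pattern.search(text)
print(result.group())  # A

# Matching is case sensitive: an all-lowercase text has no A, B, or C to find.
no_match = pattern.search("a random string")
print(no_match)  # None
```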

So, this is going to look for a single letter that is either A, B or C, and the very first letter here is a capital A. Since "A" is one of A, B, or C, it's going to be the only match when we run it, instead of the entire string. Let's run it and, hey, look, it worked. The only match is "A".

You can put as many characters as you want between the brackets. I've never tested the limit, but if it gets too long it gets messy anyway, so it's nicer to keep things concise. I could put other letters in here, like a lowercase "r", "d", or "m". Let's run this and see what pops out. It should just be "r", and I am correct.

The reason it's just "r" is that the search function only looks for the very first match and then stops. Even though "d" and "m" also match, it found "r" first, so the result object was created with "r" as its match. The only way this changes is if I add a capital "A" to the brackets: since the capital "A" in the text comes before the "r", "A" becomes the match, and yup, that's exactly what I thought.

Another cool thing: instead of listing A, B, C, D, E, F, G one by one, you can put in a range, like A to C (written A-C), which is the same as A, B, C, or a-z, which is the entire lowercase alphabet. If I run this with a-z, "r" is the first lowercase letter, so "r" should be the only match again, and yup, just as expected. You can also chain ranges: adding A-Z after a-z matches all lowercase and uppercase letters. Since capital "A" comes first in the text and falls within that single-letter criterion, "A" is printed out. Right again.

I might be droning on, but now it gets more interesting. If we add a plus sign after the brackets, it says that whatever comes before it can appear one or more times. So instead of matching a single character, I can match a whole word. Right now the match is still just "A", because the first run of one or more letters is "A" on its own. If I delete the "A" so the text is just "random string", the pattern collects letters until it hits the space, which isn't a lowercase or uppercase letter. It gets those six lowercase letters as the very first match and stops. Let's run it, and yeah, "random" pops out.

I can also add the digits. Rather than typing 0, 1, 2, 3 and so on, I'll just put 0-9. Now the pattern finds one or more lowercase letters, uppercase letters, or digits, until it's no longer satisfied. So I can put some capital letters and numbers into the first word and the whole thing still matches, because every character is a lowercase letter, an uppercase letter, or a digit. Only when we reach the space does the one-or-more rule stop applying. Pretty cool, right?

Now that you have these pieces, let's think about what an email address actually looks like. Let's make up an address to test with: myname123@website.com looks like an email address to me. I'll put that in the text and add some more random text after it, because I'm lacking creativity today. We have one email address in all this text and we want to pick out the entire thing. If we ran the current pattern it would just return "random" again, because that's the first match, but we want the email address instead.

So how do we do that? Another thing you can do in a regular expression is use plain characters. If I make the pattern just an at sign, it looks for the first at sign in the text and returns it. Let's run it: yup, the match is "@". If I put "@website", it searches for exactly that and returns "@website". Since an email address has an at sign in it, the pattern is easy to build. Like before, we have lowercase letters, uppercase letters, and digits, one or more; that covers the "myname123" portion. That's followed by an at sign. After the at sign we again have as many lowercase letters, uppercase letters, or digits as we want, so we can literally copy and paste the same chunk.

What follows the website portion? A period. We can't just type a period, though, because certain punctuation has a special meaning in regular expressions and has to be escaped (an unescaped period matches almost any character). To detect an actual period we put a backslash before it, which is called escaping the period. The backslash and the period together are treated as a unit that means a literal period. The same goes for the plus sign: with a backslash in front, the pattern searches for a literal plus sign instead of applying "one or more" to the brackets. And to search for actual brackets, you'd put a backslash before each bracket.

After the period, what do we have? I'm not sure of every possible email address, but I don't think this part can have numbers, so I'll use only lowercase and uppercase letters, one or more of course. I think that's the entire regular expression: letters and digits, one or more, followed by an at sign, the same thing again, then a period, then some letters. Let's run this and see if myname123@website.com gets printed. And hey, what do you know, it actually worked. Stuff rarely works the first time, so things are going relatively flawlessly.

One thing I want to try is multiple email addresses in the same string. Let's add yourname888@company.net. It should pick out both of these, right? Let's hit run. Hmm, the match only has the first email address. If I delete the first address and run it again, it picks out yourname888@company.net. So the regular expression recognizes both addresses, but we only get one at a time.

The thing is, like I said earlier, the search function only looks for the first occurrence. If you want to find all of them, call findall instead. I really want to drill this home, because it's a mistake I've made before. If we do this now, it returns a nice clean Python list of both email addresses. So in literally five lines of code we have an email address text scraper.

Some other things we could add: I believe email addresses can contain periods, dashes, and underscores. That should be as simple as adding a period, a dash, and an underscore inside the brackets. Let's test it with something like your.name-8-8-8. Let's click run and, oh, something went wrong. Maybe we have to escape the dash. Let's try that. Yup, that fixed it, and now it's picking out the entire address just fine.

So inside brackets the dash is the character to watch: between two other characters it defines a range, so put a backslash before it (or place it first or last in the brackets) to treat it as a literal dash. A period or underscore inside brackets is already literal, and the at sign is never special; escaping those is harmless but not required. If you're ever getting weird errors while parsing strings like this, try escaping the special characters, or un-escaping them, and sometimes that fixes the problem. It's a weird, layered, convoluted kind of thing, but it's part of learning this crazy Aztec language. Yup, I just taught you guys how to speak Aztec.

But yeah, guys, it seems like we have a fully functioning email scraper now, with this added functionality. Five lines of code; it took a little while to get here, but that's pretty much it. It shows the power of regular expressions: you can detect all kinds of things with one line of weird symbols. I'll probably touch on this more in the future, since this was only a general introduction and there's a lot more you can do. For now, that is all, guys. Thanks for watching, stay tuned for more, and I'll see you next time. Goodbye.
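Putting the pieces together, the finished scraper might look roughly like the sketch below. The two email addresses are the ones read out in the video; the filler words in `text`, the second sample string `dotted`, and the names `basic` and `extended` are assumptions for illustration. The raw-string prefix (`r"..."`) is also my addition, a common way to keep backslashes in a pattern from being interpreted by Python first. Note that this pattern is a teaching simplification and does not cover every valid email address.

```python
import re

# Sample text containing two addresses (filler words are made up).
text = "random string myname123@website.com more random text yourname888@company.net"

# Letters/digits, an at sign, letters/digits, a literal period, then letters.
basic = re.compile(r"[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]+")

# search() stops at the first occurrence.
print(basic.search(text).group())  # myname123@website.com

# findall() returns every non-overlapping match as a list of strings.
print(basic.findall(text))  # ['myname123@website.com', 'yourname888@company.net']

# Extended version that also allows ".", "_" and "-" around the at sign.
# Inside brackets only the dash is escaped, since it would otherwise mean a range.
extended = re.compile(r"[a-zA-Z0-9._\-]+@[a-zA-Z0-9._\-]+\.[a-zA-Z]+")

dotted = "contact your.name-8-8-8@company.net today"
print(extended.findall(dotted))  # ['your.name-8-8-8@company.net']
```

Compiling the pattern once and reusing it, as the video does, keeps the search and findall calls short; calling `re.findall(pattern_string, text)` directly would also work for a one-off search.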