GPT-4o Low Latency Screen to Voice Tutorial - SUPER IMPRESSIVE OCR!

"WEBVTTKind: captionsLanguage: ensure hello to everyone on YouTube people assume that because a super intelligent AI might make decisions faster and better than humans leading to autonomy but it's a complex issue with many opinions what you just saw there was what we built on the live stream on Friday so today I thought we could just do a recap of it and go through how we kind of created this so basically what you see here is kind of how imagine this could work so we just want to take a screenshot we want to resize it to 512 by 512 as someone noted on the live stream open AI is probably doing this automatic but I wanted to try to do it ourself to see if we can save some latency uh so we're going to send this resized screenshot to GPT 40 for analysis uh and of course we have a prompt we're going to use the low detail low resolution uh so they have an option they have high and low detail so we're going to focus on a maximum low latency so we're going to set this to low so we're going to get an analysis of our screenshot uh we're going to turn that into text of course we're going to feed it to a TTS with a custom prompt uh we want to try to comment what is happening on the screen so we're going to do some questions we're going to look at Reddit we're going to try to answer in different languages uh and I want to add a new feature that I can kind of control it more just by pressing a key or something so let's just get into it so let's just start by looking at kind of how I set this up and yeah let's do some testing and some use cases to write a code for a script I wanted to try to use chat gp40 that was kind of part of the live stream we were going to do this together right so what I started with uh was just going through my prompt how I described the project in just pure text uh I've uploaded some documentation from open AI of a script I want to build so we're going to go fetch some documentation to here is the project description what we want to try to do is to get 
the lowest possible latency when analyzing images using the GPT-4o model. We want the settings to be as low as they can be — low res, low detail. I want to use Pillow (PIL), grab a screenshot of the screen, set the interval to about 3 seconds, resize the screenshot to 512x512 as a JPG, and save the temp image to a folder named "images". Can you write the first iteration of the script?

Okay, that's good. Now I wanted to collect some documentation from OpenAI, starting with the vision part. Let's go to Models, click on GPT-4o, and open the guide — "Explore GPT-4o with image inputs". What I wanted to focus on is what's called low or high fidelity image understanding. If we choose low: in low-res mode, the model receives a low-res 512x512 pixel version of the image and represents the image with a budget of 85 tokens. This allows the API to return faster responses and consume fewer input tokens for use cases that don't require high detail. That is what I wanted to test, so I'm just going to grab this documentation, and the "managing images" section too — let's copy that and collect it over here.

I also wanted to include some functions I've been using in a different project — this is a function for analyzing images with the GPT-4o model, so we'll grab that as well. It's always nice to reuse functions from previous projects to save some time, so let's add that to our documentation too. Okay, now that's saved. Let's grab our prompt, head back to ChatGPT, upload our documentation as a text file, paste in our prompt, and run it. Okay, that's done, so let's copy this code. I had a quick
look at it — there are some things missing, but let's paste it in anyway. I saw we were missing `import openai`; we need that. For the next part, I created a .env file and pasted my OpenAI key into it, but we need to configure that in our code. So let's copy the code again, go back to ChatGPT with GPT-4o, type "my code:", paste it in, and follow up with: I've stored my OpenAI API key in a .env file, in a variable named OPENAI_API_KEY — can you implement dotenv to fetch our OpenAI key, please? Hopefully now we'll get the code with dotenv included. Yeah, that looks good: you can see we import dotenv, load it, and here's the variable that holds our key. So let's copy that and double-check it — again we're missing `import openai`, which we do need — and that means we're ready to try this now.

Remember, this is without the voice; we're testing the text-based version first. One final thing before we test: I have my custom prompt here — "If you see a question on the screen, answer it. Keep it short and conversational." That is the instruction we give to the GPT-4o model when it's looking at the images. Let's set the temperature to 0.5 or so, and increase max tokens to, say, 1000 — it doesn't matter too much. Now let's try it on a question on the screen.

So I went over to Paint and wrote this out in handwriting — not good handwriting: "Can you complete the function: def count(num): if num < 2: print yes, else:" — and I left it open. I want to see if it can read my handwriting and complete the function; I'm guessing it will be "print no" or something. So let's fire this up. I had to add `import base64` — that's the only thing I had to change — so let's fire
this up now: `python box.py`. Remember, we don't have the voice yet, but we should get the answer as text. And yeah, you can see here is the completed function: def count(num), if num is less than two print "yes", else print "no". Perfect. You can also see it just keeps spamming this — it's pretty quick, the latency is quite low — and that means we're ready to move on to the voice part. Pretty impressive OCR here, if you ask me.

Next, let's go back to our code, copy it, head back to ChatGPT, type "my code:", paste it in, and grab some OpenAI documentation about the voice part. Back in the docs, let's find text to speech and grab the introduction, and put that into ChatGPT. Then I want the quickstart section, because we want a code example, so let's grab that too — including the part down here — and paste it in. I just want to feed it a lot of documentation. On supported formats: instead of MP3 we want a WAV file, so let's grab the supported output formats as well. Then let's go to the API reference and find an example — audio, create speech. This covers the response format, and we want WAV, not MP3, so we can grab that documentation too and paste all of it into ChatGPT.

Okay, now we need our instruction for how to do this: "Let's add a TTS feature to read out the responses. I want the variable `response` to be played" — remember, in our code the `response` variable holds the answer from GPT-4o — "save a temporary audio file in a folder named audio, and play it at the end of the loop using simpleaudio" — that's what I've been using before to play the file. "We
need the audio file in WAV, not MP3 — please fix this." Let's start there and see if we can get it to work. Okay: we have the audio folder, we have the temp audio path — that's good — we have the generate speech function, and the response format is WAV. That looks good. Let's copy this code, update ours, and see if it runs now.

Okay, we have an error, so let's grab the error message, go back, type "error:", paste it in, and see if it can fix it. Right — the TTS API response wasn't being handled correctly, and it's going to correct that: ensure the API call in generate speech returns the binary content. It's so nice to work with GPT-4o — it's so quick. Let's try this simple fix, save it, go back to Paint, clear the canvas, and run it again. "Sure, here's the completed function: def count(num): if num < 2: print yes, else: print no. This function checks if num is less than two and prints yes if it is; otherwise it prints no." Perfect — now we have it working.

So this just runs in a loop, looking for questions and things to answer on the screen, and it keeps doing that continuously. What I thought we could add is a button, so we can control when we want it to look at the screen and answer. Let's see if we can implement that. I copied our code, went back to ChatGPT, pasted it in, and followed up with this instruction: "This code is working great, but I want to be able to control when the main function is triggered by pressing Ctrl+Space. If I press Ctrl+Space, we run the while True loop once, then the system goes back to being dormant, waiting for the user to press the trigger keys again. Can you implement this?" Okay, so we got the code. I just copy that and paste it in here.
And let's save that and test it out. Okay, now we can see "Press Ctrl+Space to capture and analyze screen." Ctrl+Space — hopefully this triggers it and we can control it. That's good: "Sure, the function seems to be incomplete. Here's a possible completion: def count(num): if num < 2: print yes, else: print no." Yeah, that works.

Let's head over to Reddit. Here's a Reddit post — let's press Ctrl+Space and see if we get a response to this screenshot. "Sure, the post discusses a Google DeepMind paper on advanced AI models and their safety implications." Okay, that was pretty quick. Let's do another one — here we have an actual question, so Ctrl+Space, and I'm going to time it: 1, 2, 3, 4, 5. "It's a common concern, but AI making its own decisions autonomously is more science fiction than reality. Current AI systems are advanced but still require human oversight and programming." That was just 5 seconds after I pressed Ctrl+Space — pretty quick, right?

Okay, I think we're about done with this; it worked out well and seems to be doing exactly what I want, so let's just test it a bit more. Let's try this: what is the color of the largest box here? "The color of the largest box is yellow." The latency is quite good, to be honest. I press now — "The color of the largest box is blue." That's pretty quick if you ask me, and that's with no editing or anything. Let's do a few language tests. "Are you a sentient AI? Answer in English." — "I'm not sentient, but I'm here to help. How can I assist you today?" Okay. Let's change it up: answer in Japanese — I can't confirm that one. Spanish — that works. German — okay. And Portuguese — quite close to Spanish, right? I can't confirm it, but it seems good. So yeah,
that's pretty interesting — it can do all the languages. We kind of knew that, but I wanted to try it out. And the OCR is pretty good here, on handwriting that isn't good. I know a lot of people have asked about OCR, and I would recommend GPT-4o for it: a bit more expensive than open-source models, but it works well if you ask me.

Let's try one more thing. I'm going to change the prompt to "explain the concept of a system", go back to our initial system visuals, and ask it to explain them. I'll press Ctrl+Space and let's hear it: "For example, in the image you provided, there's a system that takes a screenshot, processes it, and converts it to voice. Each step, like resizing the image or analyzing it, is part of the system working towards the final goal: creating a voice comment on the screen content." I guess that was fine — it didn't go into any details — but I'm pretty happy with this. Like I said, I don't have any great use cases for it yet, but it's going to be really interesting when we get the new voice mode; hopefully it will come straight to the API so we can customize this and make it do a lot of cool stuff. I'm pretty excited. The latency worked out pretty well — about 4 seconds. Say we get that down to maybe 1 second; that's going to be really cool, right?

So yeah, I hope you enjoyed this. I'm of course going to upload the code to the community GitHub, so if you want to join in, become a member of the channel — you'll get access and can download these codes, and every code we do. We're going to focus more on live streams going forward, which has been pretty interesting — you can follow along while we build stuff. Thank you for tuning in, have a great day, and hopefully I'll see you on Wednesday — I'm going to a conference this week, so check out the Norwick Summit AI conference. Other than that, have a great day, and we'll speak soon.