100% Local Tiny AI Vision Language Model (1.6B) - Very Impressive!!

This setup pairs the moondream tiny vision language model (1.6 billion parameters) with the OpenVoice speech model, both running locally for offline use. Both projects are entirely open source, allowing developers to access and contribute to their codebases.
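For reference, loading moondream locally through Hugging Face transformers looks roughly like the sketch below. The model id and the `encode_image`/`answer_question` helpers follow the moondream README at the time; treat this as a sketch and double-check the repo before relying on it:

```python
def load_moondream(model_id: str = "vikhyatk/moondream1"):
    """Download/load the ~1.6B moondream model and its tokenizer."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer

def ask_about_image(model, tokenizer, image_path: str, question: str) -> str:
    """One vision question -> one text answer, fully offline once downloaded."""
    from PIL import Image
    encoded = model.encode_image(Image.open(image_path))
    return model.answer_question(encoded, question, tokenizer)
```

The first call fetches the weights; after that the model runs with no network connection, which is the offline property described above.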

To see this model in action, check the video in my library demonstrating its capabilities. The functionality is similar to what we are currently using, with the primary difference being its offline accessibility.

The first image showcases how the vision model can identify objects and locations within a given image. When the test runs, the image pops up on screen, allowing the user to ask questions about the identified elements.
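Conceptually, the loop behind this demo collects every answer from the vision model and hands the growing list to a local LLM for a running summary. A minimal sketch, assuming an LM Studio server with an OpenAI-compatible endpoint on localhost; the URL and all function names here are assumptions, not the video's actual code:

```python
import json
import urllib.request

def build_summary_prompt(answers: list[str]) -> str:
    """Join every answer so far so the LLM sees the full context."""
    joined = "\n".join(f"- {a}" for a in answers)
    return (
        "From the image descriptions below, write one running summary "
        "of the image:\n" + joined
    )

def ask_mistral(prompt: str,
                url: str = "http://localhost:1234/v1/chat/completions") -> str:
    """Send the accumulated descriptions to the local LM Studio server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def qa_loop(process_image, image_path: str) -> None:
    """The while-True loop from the video: ask, answer, summarize."""
    answers: list[str] = []
    while True:
        question = input("Ask a question about the image (or 'exit'): ")
        if question.strip().lower() == "exit":
            break
        answers.append(process_image(image_path, question))  # vision model call
        print("Summary so far:", ask_mistral(build_summary_prompt(answers)))
```

Appending every answer before each summary request is what lets the follow-up questions ("what is she wearing", "what color are her eyes") accumulate into one coherent description.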

For instance, when asked about the arrows in the image, the model correctly identified their destinations as Central Park and Kennedy Airport. This demonstrates the model's ability to accurately process visual data and provide relevant information.

The next test involves identifying the mood of a dog in an image and predicting the outcome of a hypothetical confrontation between the dog and a cat. Here, the model suggested that the dog would likely win due to its size and strength advantages over the cat.

The final test involves analyzing a text about Google's new video generation AI model, Lumiere. The model summarized the most important points of the text in one sentence, showcasing its capacity for processing and extracting key information from written content.

The video concluded by inviting viewers to explore the model further and access additional resources through the link in the description below.
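For the video tests shown in the transcript, each clip is sliced into roughly one frame per second before every frame is described, and clips with audio are first converted to MP3 and transcribed with faster-whisper. A minimal sketch of both steps, assuming OpenCV, MoviePy, and faster-whisper are installed (function names are mine, not from the video's code):

```python
import os

def frame_step(fps: float, seconds_between: float = 1.0) -> int:
    """How many frames to skip so we keep ~one frame per `seconds_between` s."""
    return max(1, round(fps * seconds_between))

def extract_frames(video_path: str, out_dir: str,
                   seconds_between: float = 1.0) -> int:
    """Save one frame per `seconds_between` seconds as JPEGs; return the count."""
    import cv2  # opencv-python
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    step = frame_step(cap.get(cv2.CAP_PROP_FPS) or 30.0, seconds_between)
    saved = index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:04d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

def transcribe(video_path: str, mp3_path: str = "audio.mp3") -> str:
    """Rip the audio track to MP3 with MoviePy, then run faster-whisper on it."""
    from moviepy.editor import VideoFileClip
    from faster_whisper import WhisperModel
    VideoFileClip(video_path).audio.write_audiofile(mp3_path)
    # The video uses the medium model on CUDA; use device="cpu" without a GPU.
    model = WhisperModel("medium", device="cuda")
    segments, _info = model.transcribe(mp3_path)
    return " ".join(seg.text.strip() for seg in segments)
```

With a 60 fps clip, `frame_step(60)` keeps every 60th frame (one per second); the "every other second" run later in the video corresponds to a step of 120.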

"WEBVTTKind: captionsLanguage: enwho is this Taylor Swift where in the world is this this is New York City specifically in the financial district what kind of field is this man an expert in this man is an expert in the field of computer vision and machine learning what two destinations does this arrow point to who is winning this fight what animal is inside the Red Square this was just a few of the things we are going to test out today so let's get started this video was sponsored by brilliant okay so this is just a very simplified sh of how this works we are using the moondream tin Vision 1.6 billion parameters model here so we can choose between kind of speech input through faster Whisperer we can just do regular text input like prompts uh we can also do video input where we can slice up to frames I'm going to show you that and we can choose what kind of description format we want we can get it back in just plain text or we can get it back in speech uh if you want to know more about how I do the speech to speech function you can just go back to one of my previous videos and you can can learn all about that but now let's take a quick look at the tiny Vision model before we do some more testing this is the moondream GitHub page go check it out I will leave a link in the description below you can see this is a 1.6 billion parameter model they have some benchmarks here pretty good results for being so small it's very quick that's what I liked about it so you can see here we have some examples you can go through how you can install it so definitely go check them out and learn something and speaking of learning have you ever wondered how the magic of large language model works or maybe you're keen on diving into the world of programming but you don't know where to start well brilliant.org the sponsor of today's video is your place to understand and master these skills brilliant makes learning engaging accessible especially in fields like computer science math and data 
science. What sets Brilliant apart is their interactive approach: you're not just passively consuming information, you're actively engaging and solving real-world problems, like in the course How LLMs Work, which takes you on an exciting journey to unravel the mysteries of LLMs; you get a chance to peek under the hood and really grasp how they function. And for those who are itching to get their hands dirty with coding, the Thinking in Code course is a perfect start: it's designed to build a robust foundation in computational problem solving, starting with the basics and moving on to more complex challenges. With Brilliant, learning is a breeze: you get interactive lessons, helpful hints, and the lessons stick with you long after you're done. To start your journey, head over to brilliant.org/allaboutai or just click the link in the description for a free 30-day trial, and the first 200 of you will get 20% off the annual plan. A big thanks to Brilliant for sponsoring this video.

Okay, so the first test I thought we could do was just plain text. We have a Mistral 7B model here running in LM Studio on a local server, so we can interact with Mistral 7B, and we have a process_image function that is actually using the moondream1 model, so we are loading images. We also have a function here that pops the image up so we can see the image we are talking about. We have an image path, and here is our input: ask a question about the image, or type "exit" to quit; this is just a while-True loop until we break it. And here is the Mistral 7B part: you can see we join all the answers into a list so Mistral 7B gets the full reference when we feed all the answers into its prompt. You will see this now when we test it, so yeah, let's just do
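Later, for the celebrity video, the per-frame descriptions get merged into one big description that Mistral 7B is asked to turn into a list of names. That prompt-assembly step might look like this sketch; the exact wording of the prompt in the video differs:

```python
def build_names_prompt(frame_descriptions: list[str]) -> str:
    """Join every frame description, then ask for the full list of names."""
    body = "\n".join(
        f"Frame {i}: {d}" for i, d in enumerate(frame_descriptions, start=1)
    )
    return (
        body
        + "\n\nFrom the video description above, write a full list of "
        + "the names of every person mentioned."
    )
```

Labeling each line with its frame number gives the LLM a cheap way to deduplicate people who appear in several consecutive frames.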
that. Okay, so let's run this in real time so you can see how fast this is. Here we're loading the vision model; I think we should get the image pop-up soon. Yeah, that's good. So this is Taylor Swift. "Who is this?" is our input prompt, and you can see it's pretty fast, and here we get a summary so far of the image descriptions. We can follow up with "What is she wearing?" (an input prompt again), so we get an appended description here: she's wearing a green dress, and you can see this is incorporated into our summary. We can follow up with "What color are her eyes?" Her eyes are blue; yeah, this is working quite well. "Does she have earrings?" Okay, I just put "earring", but yes: she has an earring in her left ear. So here we can see the summary: "I got this image description for you: Taylor Swift is seen wearing a stunning green dress that brings out her eyes; her eyes are blue." I think we are creating a great big summary of the image here as we append these answers, so pretty cool. I'm going to call it there for the text version and move on to the video part.

Okay, so for the video version we need ffmpeg and MoviePy (I've got a video on that too if you want to go back to one of my previous ones and check it out). We still have the Mistral 7B function, because we want to do some more work on the results, and we have the same process_image function as before with moondream1. We also want to convert MP4 to MP3, because for some videos we want to take the audio and process that too; I'm going to show you. But for the first video we are just going to extract the frames, with a frame interval of one frame per second: if the video is 60 frames per second, then we get one frame per second, right? We set our video path and so on; it should be pretty straightforward. And here you can see the instruction we are going to give to Mistral 7B: we just feed in the video description from all the frames. So basically what we do is send every frame through the vision model and we collect all the descriptions
and put them into one big description, and then we use Mistral 7B with the prompt "from the video description above, write a full list of names," because the video we are going to look at is just a bunch of celebrities, and we're going to try to identify them all. So let me show you the video. This is it; it should be pretty easy, because if we start over here you can see we have Cillian Murphy, Meryl Streep, and some other celebrities (I don't know all of these celebrities). This is Margot Robbie, so the model has their names, right: Julia Garner, Taylor Swift, and so on; there's a bunch of celebrities. If we go to our image folder now and pop that up, you can see that when we run this, all the frames are going to be put into it. So let's exit this and fire up the video version.

Okay, so I'm running this now. The first thing that's going to happen is we get some frames; you can see all the frames popping up in our folder here. We have pointed the vision model at this folder, and it's going to go through every single frame. You can see we captured 38 frames, and this is of course going to take a bit more time because we need to go through 38 images, so we're just going to let this run and then take a look at the final result. Wow, that was pretty quick; I wasn't expecting that. Okay, so let's take a look at the result: we have Meryl Streep, Leonardo DiCaprio, Kate Capshaw, Steven Spielberg, Gloria Campano (I don't think I saw her), Casanova, Margot Robbie, Julia Garner, Taylor Swift. So yeah, pretty good job; I think it captured everything, but it made some up: Gloria Campano... I guess we saw her, give me a second here. Yeah, but
audio so now I have kind of added in audio so now we kind of activated our faster Whisperer so we're going to pick the medium model this is running on cud course we changed up our prompt a bit we kind of added in audio description now we're going to from the video and audio description above write a detailed summary okay so this is the video we're going to pass through these five tanks are literally about to rain missiles upon this $500,000 and any money that doesn't get destroyed I'm giving to Blake okay so it's just a Mr Beast clip let's just run it okay so let's run this so the first thing is going to be the transcription yeah you can see we're using movie pie turning into MP3 file here we are using faster risper that was quite quick so here is yeah looks pretty good so this is our transcription and we saw all the frames pop up here I put this down to 100 and 20 now so we got like every other second and yeah now let's just wait for the the description okay so let's check it out uh let's do it a bit so we can take a read a bit here so in this video there are five tanks positioning fire missiles at $500,000 worth of money yeah that's good the speaker plans to give the money uh not destroyed to Blake yeah nice the recipents given 24 hours before the tanks to fire upon their money very good what can I say that is exactly what the clip is about so yeah very impressed to be honest with this small wish model here but of course I don't think it would have been so good if we didn't add the sound clip but yeah that's just another feature okay so the final test we are going to do is actually using speech to speech so I can talk into my mic I can get a response back from the open Voice model uh also for this I have a video so you can just go back in my library if you want to know how to set this up it it's completely open source it's fully local so you can be offline when you do this this is of course the same with what we are doing now so we're just going to skip the 
Python code here and go straight into the testing. The first image is this one: you can see I put a couple of arrows on it, so I'm going to ask what kind of places these arrows are pointing to and see if we can identify that. So let's run it; it should be pretty much the same, and we should get the image popping up here any time. Let's ask about the arrows: "What destinations is the arrow pointing to?" "The arrow is pointing to Central Park and Kennedy Airport." Yeah, that's good; that was the only thing I wanted to know. We could of course continue asking more questions about the image, but I just want to move on to the next part, so that's a good start.

Let's ask a bit about this image. I wanted to ask about the mood of the dog and who is likely to win the fight, so let's try this: "What is the mood of the dog in the image?" "The dog in the image appears to be angry or aggressive, as it is growling and showing its teeth while facing the cat." "Who is most likely to win this fight?" "The dog is most likely to win this fight, as it is a larger and stronger animal. Dogs typically have a higher bite force and are more agile, which gives them an advantage in physical confrontations. Cats, on the other hand, are smaller and have a more agile body structure, but their bite force is not as strong as dogs'. In this image, the dog is showing more aggressive and dominant behavior while the cat is trying..." Okay, that was a pretty good response, though I kind of think the cat would win; but who am I to say. A good response there, I think.

Okay, so next up we have this image. You can see I went into Paint, took this red square, and put it around the swing here, so let's see if we can identify this. Let's run it now; we should have pretty much the basic setup: the image is going to pop up and we can ask a question. "Describe all the objects inside the red square in the image." "The red square in the image represents a
portion of the lawn, likely a playground or a play area for children. There is a swing set in the area with two swings visible. Additionally, there are two cars parked in the vicinity, possibly belonging to the residents of the houses nearby. The houses are situated on the hillside overlooking the playground." Yeah, that was pretty good: it mentioned the swing, it mentioned the yard and the playground, and it mentioned the cars here too; that was a bit out of scope, but I'll accept it. Very good.

Okay, so the last test is going to be text. I found something on The Verge here about Google's new video generation AI, so we're going to ask if it can summarize this text. My expectations are quite low here, but hey, let's give it a shot. "Can you summarize the most important parts of the text in the image?" "The image shows a summary of a YouTube video about a new video generation AI model called Lumiere. The video explains how Lumiere uses a new diffusion process to create a base frame from a given prompt. The AI model then uses this base frame to generate a series of frames, which are then combined to create a seamless motion video. The video also includes a comparison of the number of frames generated by Lumiere to those generated by another AI model called Space-Time U-Net." "In one sentence, can you describe Lumiere?" "An AI that uses a new diffusion model called Space-Time U-Net that figures out where things are in a video and creates a base frame from the prompt." Okay, the final answer was pretty good; the rest was a bit meh, so I'm not 100% sure about text. It's clearly understanding something, but I have to test it more to find out how good it actually is.

So that was basically what I wanted to go through today. If you are interested in testing this out for yourself, just click on my link in the description below and become a member of the channel: I will invite you to the community GitHub, and you will get access to the community Discord and of
course all my previous scripts and the other things I have been working on. So don't forget to check out moondream, don't forget to check out brilliant.org, and yeah, have a great day; I'll see you on Wednesday.