Did Google fake their Gemini video?

What's up? I'm in New Orleans, attending NeurIPS this week. If you're here, come say hi; I'd love to meet you.

So, we have to talk about Gemini and the whole disaster around the fake video, I guess. We'll see about that. Google DeepMind has released a new model called Gemini. It is a large multimodal model. I guess it's no longer appropriate to call them LLMs now that they're not language-only anymore, but it is a transformer that consumes images, text, audio and so on, and outputs, I believe, images and text. It does it all, and it's big. How big? We have no idea.

Now, there has been a bit of controversy around a demo video that they released, which people later said was staged or faked, and they're outraged about it. Honestly, I personally am not that upset about it. I'm also not that impacted or surprised or anything like this. The fact that this is big news, I get it a little bit, but I also don't really get it. We'll dive into the video and the controversy behind it later, but I want to say: if you're looking to be upset about something, be much more upset about the rest of the Gemini release than about the demo video, which they made to put some nice music behind it and tell a nice story. The rest of what they did is much more upsetting, so let's look at that.

This website right here: "Welcome to the Gemini era." Wow, this is Gemini, it's multimodality, text, blah blah blah. Yes, this is a marketing page, but even in marketing pages I believe you should be somewhat honest, and this graphic right here is one of the things that I think is a lot more upsetting than the video. They compare GPT-4 to Gemini Ultra, which is the largest Gemini model. Look at that: Gemini Ultra is prompted with chain-of-thought prompting. It's a prompting technique that is relatively intensive but gives good results on very complex language understanding tasks, whereas GPT
4 is prompted with five-shot prompts. Naturally, a chain-of-thought prompt is going to deliver much better results than just a five-shot prompt on a model. Just displaying this graphic as it is, with the big numbers, like "ooh, Gemini 90, GPT-4 86.4, human expert 89.8", with a tiny font that says how they were achieved, is to me not very cool.

In fact, look at the technical report. Let's go to the part where the results are. You can find these numbers again: Gemini Ultra 90.04 and GPT-4 86.4, with chain-of-thought and five-shot respectively. Now here's the kicker: if you look at Gemini Ultra five-shot prompted, it's worse than GPT-4. If they, however, evaluate GPT-4 with chain-of-thought prompting, it is better than the five-shot-prompted model, but not as good as Gemini Ultra. The catch is that these chain-of-thought numbers for GPT-4 aren't verified numbers; they are just numbers that they got via the GPT-4 API, whereas the five-shot numbers are the ones reported by the GPT-4 team themselves.

So now you can see how this number came about. They found themselves in a bit of a tussle, right? There is a very important benchmark that they want to be really good at, and they have a method to be really good at it: this one prompting strategy with their model turns out to be really, really good, better than humans even. However, they don't have comparable numbers for that particular prompting strategy. What they do have comparable numbers for is this five-shot thing. Oh, wouldn't it be great if they were also better on that? But they're not; with this prompting method, GPT-4 is better. So what do they do? They can't come out with a model and say, "Well, we compared the numbers, and actually on this important one we're worse." I guess they could have put the 87 here. I get that they want to compare to reported numbers, so you can see the conundrum they found themselves in. It's still, I think, not very okay to represent it like that. I personally would have probably
put the 87 here. I guess that would have drawn some attention from the OpenAI folks. In any case, there's not too much more on the marketing page: a bunch of these numbers, and then a bunch of nice graphics and a bunch of "ooh, what can it do? Oh wow."

I want to go into the technical report a bit. This is what I find to be one of the most upsetting things: papers aren't papers anymore, they're now technical reports, because in a technical report you can write anything you want and have to disclose nothing. That, I think, is what's upsetting to me. I come from a time as an academic when we wrote papers and actually put inside the papers what we did and how we did it, so that others could reproduce it. That's just not the case anymore with these big releases. This technical report is essentially just another marketing piece. Why do I say that? Because they say next to nothing in it.

They open with this example of Gemini correcting a student who incorrectly solved a physics problem, which is neat. It's supposed to show that Gemini can unify language, images, handwriting and physics knowledge, and then even do some calculations at the end. That's certainly impressive, I guess, but it's an example. And honestly, 90% of the answer here on the right-hand side you can get by recognizing just one or two lines. As long as you recognize that the student wrote E = mgL, you can probably already guess what went wrong and what should go right, and then you have to get the numbers right. I'm not saying it's not impressive, but it's an example. They want to show, "Oh, look at all it can do."

The next thing they do is tell us, "Oh, our model comes in different sizes." Wow, that's interesting. The model comes in sizes Ultra, Pro and Nano. At least for the Nano, they say how big it is: 1.8 billion and 3.25 billion parameters. The Nanos are meant to be
run on device, and the 1.8 billion one, 4-bit quantized, would clock in, I would say, at just under a gigabyte of RAM, which is probably amenable to a lot of devices. It's going to be a cool future where these things run on device, and it's good that they actually tell us how big they are. We would find out anyway, because they will be deployed on devices and people are going to inspect them. However, for the Pro and the Ultra, they just say, well, the Pro is meant to be something, and the Ultra is meant to be bigger than the Pro. We have no idea. They could have just as well said, "We trained Gumble and Bumble." It's all the same. They don't share any facts about this.

The biggest kicker, the thing that actually upsets me, is the architecture diagram. They have done a great deal to go into the architecture of these models. The detail that they have put into these architecture diagrams, and the novelties and the innovation that can be deduced from them, are astounding. Are you ready for the architecture diagram of the century? Wow, there it is. There seem to be inputs. Those inputs get turned into what look to be tokens, which go into a transformer, and then there are outputs. This is amazing. I cannot stop looking at this. This is so bad. I guess what we can pull out is that there are image, text, audio and video inputs, and then image and text outputs. That's fine, but really? This?

Then they come to the training infrastructure, where they kind of detail how they went about training the thing and on what kind of chips. Sure, here and there they say something about what they did to solve the problems, but they're super careful never to say anything that would let you even estimate or guesstimate how big the models are or how much compute they put into them. They're extremely careful to say a lot of words without really saying a lot of things. They do go a little bit into, as I said, how they handled hardware failures and things like
this, but essentially they say nothing, and from here it just gets worse. They use a SentencePiece tokenizer. Oh, by the way, the models support a context of 32k tokens; they do say that. But other than that, the further down you go, the less they say, until we get to the evaluation, and the evaluation is just the numbers that I've shown you before. These are fairly extensive: they do evaluations on a lot of benchmarks. But then again, as soon as you could estimate something, it's relative sizes again and so on. Still, the evaluations, I have to say, are good. As you can see, there are lots of numbers. And yes, again, look at that. Wow. Label your axis! What's the unit?

In any case, they have some examples of how it can create matplotlib code and so on. It's pretty good. I'm honestly excited about these models. They're going to be cool to work with, cool to try, cool to build applications upon. They seem to be really capable. I just haven't arrived yet in the new world where it's just, "Well, we made something, and you may use it via our API, but we won't tell you anything about it." "Gemini is a further step towards our mission to solve intelligence, advance science and benefit humanity." You're not. You're doing the opposite of that. Well, not the opposite, but you're very actively trying not to do that.

All right, let's get into the video controversy. There's this video they've released: "We've been testing the capabilities of Gemini." Look how they phrase it here. Let's start.

"All right, testing Gemini, here we go. Tell me what you see." "I see you placing a piece of paper on the table. I see a squiggly line." "What about now?" "The contour lines are smooth and flowing, with no sharp angles or jagged edges. It looks like a bird to me." "Hm, what if I add this?" "The bird is swimming in the water. It has a long neck and beak. It is a duck." "Yes." "A duck is a type of
waterfowl in the family Anatidae." "Clue one: this country is the home of the kangaroo, the koala and the Great Barrier Reef." "Oh, that's easy." "Clue two: this country loves football and has won the most men's World Cups in football history." "Hm." "Find the paper ball under the cup." "I accept the challenge. The cup to the left." "I know what you're doing. You're playing rock paper scissors."

All right, this was probably the biggest controversy: the rock-paper-scissors one. They made this video, they put a nice voice on Gemini, and it's kind of interactive and so on. Then, in the blog post they released along with it, they reveal how the video was made. For example, they show a frame of the video and then they give a prompt. And already there are so many news articles: "Google faces controversy over edited Gemini AI demo video", "The video was fabricated".

Why do people say that? Because it turns out they just showed appropriate frames from the videos and gave prompts together with those frames. Then they took the output of Gemini and had someone say it, or had a text-to-speech voice say it. So the video isn't a live interaction with Gemini recognizing things from the video frames. And the prompts are sometimes quite helpful too. For example, this one: "What do you think I'm doing? Hint: it's a game," shown with these three frames. I get it, that's really different from what we just saw. What we just saw was a person just doing this, and then this, and then this, and then Gemini by itself said, "I know what you're doing. You're playing rock paper scissors." There is a big difference.

So people are quite upset that the prompts were quite specific, that only individual frames were shown, and that the video wasn't this live interaction. However, to me that's kind of the least worrying part. I don't know what you expected, but that's what I expected
when I saw this video. When I saw it, I was like, okay, they must be either snapshotting the frames or doing something like this, very clearly, or they just tried 50 billion times. That's the other thing. What do you expect? A continuous-video-feed interactive thing? That's just not the case nowadays. So I kind of expected that. I guess the regular journalist from CNBC didn't; they actually believed that this was an interactive system. I didn't, and therefore I wasn't that upset about it. I'm personally much more upset about all the other stuff, about the fact that they wrote a technical report and didn't say a single thing that's actually useful about these models.

So make up your own minds. Honestly, I think it's fine to do marketing for marketing's sake, even making these videos and so on. To me it's not that big of a deal. What matters is what kind of applications people build on top of these types of things, and honestly, something that takes frames and prompts is much more useful than something that is interactive video, "tell me what you see" kind of stuff.

All right, that was it from me for Gemini, or at least the controversy around it. If you find out how big Gemini Ultra is, please post it in a comment. Other than that, I will be at NeurIPS. I'll see you around, and bye-bye.
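
The back-of-the-envelope claim about the Nano models, that 1.8 billion parameters at 4 bits each come in at just under a gigabyte, can be checked with a few lines of arithmetic. This is a minimal sketch that counts only the weights (no activations, cache or runtime overhead); the function name and the dictionary labels are made up for illustration and do not come from the Gemini report.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = num_params * bits_per_param / 8  # 8 bits per byte
    return total_bytes / 1e9


# Parameter counts as stated in the video; the labels are hypothetical.
nano_sizes = {"nano_small": 1.8e9, "nano_large": 3.25e9}

for label, params in nano_sizes.items():
    # Assumes 4-bit quantization, as in the estimate from the video.
    print(f"{label}: {weight_memory_gb(params, 4):.3f} GB of weights at 4-bit")
```

Under these assumptions the smaller model needs about 0.9 GB for its weights, which matches the "just under a gigabyte" estimate, and the larger one about 1.6 GB.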