Google’s video showing off its new model Gemini’s capabilities was nothing short of amazing. Unfortunately, the truth about how good Gemini is and what it can do falls short of the marketing hype.
When we first watched the demo video showing Gemini interacting with the presenter in real time, we were blown away. We were so excited that we missed some key disclaimers at the beginning and accepted the video at face value.
The text in the first few seconds of the video says “We’ve been capturing footage to test it on a wide range of challenges, showing it a series of images, and asking it to reason about what it sees.”
What really happened behind the scenes is the cause of the criticism Google got and the ethical questions it raises.
Gemini was not watching a live video of the presenter drawing a duck or moving cups around. Nor was Gemini responding to the voice prompts you heard. The video was a stylized marketing presentation of a simpler truth.
In reality, Gemini was presented with still images and text prompts that were more detailed than the questions you hear the presenter asking.
A Google spokesperson confirmed that the words you hear spoken in the video are “real excerpts from the actual prompts used to produce the Gemini output that follows.”
So, detailed text prompts, still images, and text responses. What Google actually demonstrated was functionality that GPT-4 has had for months.
Google’s blog post shows the still images and text prompts that were actually used.
In the example with the cars, the presenter asks, “Based on their design, which of these would go faster?”
The actual prompt that was used was, “Which of these cars is more aerodynamic? The one on the left or the right? Explain why, using specific visual details.”
And when you recreate the experiment in Bard, which Gemini now powers, it doesn’t always get the answer right.
I really wanted to believe that Gemini could follow the ball as the three cups were moved around, but sadly that’s not true either.
Google’s blog post shows that a lot of prompting and explanation was required for the cup shuffling demo.
It’s still impressive that an AI model can do this, but it’s not what we were sold in the video.
Is that it, Google?
We’re only speculating here, but the demo most likely showed results Google obtained using Gemini Ultra, which still hasn’t been released.
So when Gemini Ultra is eventually released, it looks like it will be capable of what GPT-4 has been doing for months. The implications aren’t great.
Are we hitting a ceiling as far as AI capabilities are concerned? If the best AI minds are working at Google, then surely they’d be driving cutting-edge innovation.
Or was Google not only slow to enter the race but also struggling to keep up with the rest? The benchmark numbers Google proudly displayed show its yet-to-be-released model marginally beating GPT-4 on some tests. How will it fare against GPT-5?
Or maybe Google’s marketing department made a judgment error with its video, and Gemini Ultra will still be better than we think. Google says Gemini is truly multimodal and that it understands video, which would be a first for LLMs.
We’ve yet to see an LLM demonstrate video comprehension, but when we do, it will be worth getting excited about. Will it be Gemini Ultra or GPT-5 that shows us first?