Gemini 3.0 Pro – early tests

(twitter.com)

136 points | by ukuina 4 hours ago

79 comments

  • simonw 3 hours ago

    I've seen a bunch of tweets like this recently; as far as I can tell, they're all from people using https://aistudio.google.com/ who got served an A/B test.

    A few more in this genre:

    https://x.com/cannn064/status/1973818263168852146 - "Make a SVG of a PlayStation 4 controller"

    https://x.com/cannn064/status/1973415142302830878 "Create a single, self-contained HTML5 file that mimics a macOS Sonoma-style desktop: translucent menu bar with live clock, magnifying dock, draggable/resizable windows, and a dynamic wallpaper. No external assets; use inline SVG for icons."

    https://x.com/synthwavedd/status/1973405539708056022 "Write full HTML, CSS and Javascript for a very realistic page on Apple's website for the new iPhone 18"

    I've not seen it myself so I'm not sure how confident they are that it's Gemini 3.0.

    • ajcp 3 hours ago

      At this point until I see one run through the Pelican Benchmark I can't really take a new model seriously.

      • diggan 3 hours ago

        Unfortunately, as with every public benchmark, once it ends up in the training sets and/or the developers become aware of it, it stops being effective, and I think we've started to reach that point.

        The only thing I've found that gives me some sort of quantitative idea of how good a new model is is my own set of private benchmarks. It doesn't cover everything I want to use LLMs for, and only has 20-30 tests per "category", but at least I'm 99% sure it isn't in the training datasets.

        • simonw 3 hours ago

          I have a few "SVG of an X riding a Y" tests that I don't publish online which I run occasionally to see if a model is suspiciously better at drawing a pelican riding a bicycle than some other creature on some other form of transport.

          I would be so entertained if I found out an AI lab had wasted their time cheating on my dumb benchmark!

          • ajcp 3 hours ago

            > I would be so entertained if I found out an AI lab had wasted their time cheating on my dumb benchmark!

            Cue intro: "The gang wastes their time cheating on a dumb benchmark"

            • mcny 2 hours ago

              A shower thought I just had: there must be some AI training company somewhere that has ingested all of It's Always Sunny in Philadelphia, not just the text but all the video from every episode somehow...

          • diggan 2 hours ago

            > I would be so entertained if I found out an AI lab had wasted their time cheating on my dumb benchmark!

            I don't think it's necessarily "cheating"; it just happens as they're discovering and ingesting huge ranges of content. That's the problem with public content: it's bound to be included sooner or later, directly or indirectly.

            Nice to hear you have some sort of contingency though, and I'm looking forward to the inevitable blog post announcing the change to a different bird and vehicle :)

            • svachalek an hour ago

              The thing is most of the discussion about it is embarrassingly bad SVGs so training on them would actually hurt their performance.

          • reissbaker 2 hours ago

            I doubt they'd cheat that obviously... But "SVG of X" has become common enough that I suspect most frontier labs train on it, especially since the models are multimodal now anyway.

            Not that I mind; I want models to be good at generating SVG! Makes icons much simpler.

          • Imustaskforhelp 2 hours ago

            Please do let us know in a blog post if you ever find AI labs cheating on your benchmark.

            But now I am worried that, since you have shared that you do "SVG of an X riding a Y" tests, maybe these labs will try to cheat on the whole X-riding-Y space instead of hyper-focusing on the pelican.

            So now I suppose you might need to come up with an entirely new thing though :)

            • throwup238 2 hours ago

              There are so many X and Y combinations that I find it hard to believe they could realistically train for even a small fraction of them. Someone has to generate the graphics output for the training.

              A duck-billed platypus riding a unicycle? A man o' war riding a pyrosome? A chicken riding a Quetzalcoatlus? A tardigrade riding a surfboard?

              • gnatolf 2 hours ago

                You're assuming that given the collection of simonw's publicly available blog posts, the creativity of those combinations can't be narrowed down. Simply reverse engineer his brain this way and you'll get your Xs and Ys ;)

                • throwup238 an hour ago

                  I feel like that would overfit on various snakes like pythons.

              • fragmede 41 minutes ago

                If we accept ChatGPT telling me that there are approximately 200k common nouns in English, and we square that, we get 40 billion combinations. At one second per combination, that's ~1,200 years, but if we parallelize it on a supercomputer that can do 100,000 per second, it would only take about five days. Given that ChatGPT was trained on all of the Internet and every book written, I'm not sure that seems infeasible.
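
                A quick sketch of that back-of-the-envelope math (the 200k noun count is ChatGPT's estimate; the per-combination throughput figures are assumptions):

                  nouns = 200_000                                     # assumed count of common English nouns
                  combos = nouns ** 2                                 # X-riding-Y pairs: 40 billion
                  serial_years = combos / (60 * 60 * 24 * 365)        # ~1,268 years at one SVG per second
                  parallel_days = combos / 100_000 / (60 * 60 * 24)   # ~4.6 days at 100,000 SVGs per second
                  print(f"{combos:,} combos, {serial_years:,.0f} years serial, {parallel_days:.1f} days parallel")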

                • throwup238 21 minutes ago

                  It still can't satisfactorily draw a pelican on a bicycle because that's either not in the training data or the signal is too weak, so why would it be able to satisfactorily draw every random noun-riding-noun combination just because you threw a for loop at it?

                  The point is that in order to cheat on @simonw's benchmark across arbitrary combinations, they'd have to come up with an absurd number of human-crafted input-output training pairs with human-produced drawings. You can't just ask ChatGPT to generate every combination, because all it'll produce is garbage that gets a lot worse the further you get from a pelican riding a bicycle.

                  It might work at first for the pelican and a few other animal/transport combinations, but what does it even mean for a man o' war riding a pyrosome? I asked every model I have access to to generate an SVG for a "man o' war riding a pyrosome" and not a single one managed to draw anything resembling a pyrosome. Most couldn't even produce something resembling a man o' war, except as a generic ellipsoid-shaped jellyfish with a few tentacles.

                  Expand that to every weird noun-noun combination and it's just not practical to train even a tiny fraction of them.

          • fragmede 18 minutes ago

            But how would you know it's from what you would consider cheating as opposed to pelicans on bicycles existing in the latest training data? Obviously your blog gets fed into the training set for GPT-6, as well as everyone else talking about your test, so how would the comparison to a secret X riding a Y tell you if an AI lab is cheating as opposed to merely there being more examples in the training data?

            • simonw a minute ago

              Mainly because if they train on the pelican on bicycle SVGs from my blog they are going to get some very weird looking pelicans riding some terrible looking bicycles.

        • Workaccount2 10 minutes ago

          I honestly think people blow the effect of "being in the training set" out of proportion. The internet is riddled with examples of problem/solution posts that many models definitely trained on, but still get wrong.

          More important would be post-training, where the labs specifically train on the exact question. But it doesn't seem like this is happening for most amateur benchmarks, at least. All the models that are good at the pelican-on-a-bicycle test have been good at whatever else you throw at them as an SVG task.

        • latemedium 2 hours ago

          We need to know if big AI labs are explicitly training models to generate SVGs of pelicans on bicycles. I wouldn't put it past them. But it would be pretty wild if they did!

        • londons_explore 2 hours ago

          As soon as you use your private tests, all the AI companies vacuum up the input to use to train the next model.

          Obviously they're only getting the question and not a perfect answer, but with today's process of generating hundreds of potential answers and getting another model to choose the best/correct one for training, I don't think that matters.

        • ajcp 3 hours ago

          That's the move right there.

    • ceejayoz 3 hours ago

      > a very realistic page on Apple's website…

      Is this supposed to be a good example?

      It looks like something I'd put together, and you don't want me doing design work.

  • strongpigeon 3 hours ago

    Google's biggest problem, in my opinion (and I'm saying that as an ex-Googler), is that Google doesn't have a product culture. Google had the tech for something like ChatGPT for a long time, but couldn't come up with that product. Instead it had to rely on another company to show it the way, and then copy them and try to out-engineer them...

    I still think ultimately (and somewhat sadly) Google will win the AI race due to its engineering talent and the sheer amount of data it has (and Android integration potential).

    • thewebguyd 3 hours ago

      > is that Google doesn't have a product culture.

      This is evident in Android and the Pixel lineup, which could be my favorite phone if not for some of the most baffling and frustrating decisions, which lead to a very weirdly disjointed app experience (compared to something like iOS's first-party tools).

      Like removing location-based reminders from Google Tasks, for some reason? There's still no built-in Apple Shortcuts-like automation. Keep can still do location-based reminders, but it's a notes app, so which am I supposed to use, Google Tasks or Keep? Well, Gemini adds reminders to Google Tasks and not Keep, if I wanted to use Keep primarily.

      If they just spent some time polishing and integrating these tools, and added some of their ML magic, they'd blow Apple out of the park.

      All of Google's tech is cool and interesting from a tech standpoint, but it's not well integrated into a full consumer experience.

      • xooooogler 2 hours ago

        Google recently let go ALL -- EVERY SINGLE -- L3/L4/L5 UX Researcher

        https://www.thevoiceofuser.com/google-clouds-cuts-and-the-bi...

        Could it be argued that perhaps UX Research was not working at all? Or that their recommendations were not being incorporated? Or that things will get even worse now without them?

        • seemaze 2 hours ago

          Maybe Apple should follow suit... I jest, but I’m still processing the Liquid Glass debacle.

          • thewebguyd an hour ago

            At least it's uniform, unlike Material 3 Expressive, which might look different depending on the app, or not be implemented at all, or be only half implemented even in some of Google's own apps, much like every other Android redesign.

            I get Google can't force it on all the OEMs with their custom skins, but they can at least control their own PixelOS and their own apps.

            • layer8 8 minutes ago

              It’s not uniform at all. Some parts of the interface and of their apps get it, others don’t. Some parts look more glassy, some more frosty. It’s all over the place in terms of consistency. It’s also quite different between Apple’s various OSs, although allegedly the purpose was to unify their look.

    • byefruit 3 hours ago

      And even when it does copy other products, it seems to be doing a terrible job of them.

      Google's AI offering is a complete nightmare to use. Three different APIs, at least two different subscriptions, documentation that uses them interchangeably.

      For Gemini's API it's often much simpler to actually pay OpenRouter the 5% surcharge to BYOK than to deal with it all.

      I still can't use my Google AI Pro account with gemini-cli..

      • specproc 2 hours ago

        I had great fun this week with the batch API. A good morning lost trying to work out how to do a not particularly complex batch request via JSONL.

        The python library is not well documented, and has some pretty basic issues that need looking at. Terrible, unhelpful errors, and "oh, so this works if I put it in camel-case" sort of stuff.

      • gardnr 2 hours ago

        Then there's the billing dashboards...

        It's amazing how they can show useless data while completely obfuscating what matters.

        • ur-whale an hour ago

          Yeah, the whole billing death march is what ended up making me pick OpenAI as my main workhorse instead of GOOG.

          Not enough brain cycles to figure out a way to give Google money, whereas the OpenAI subscription was basically a no-brainer.

      • cshores 2 hours ago

        As of this week you can use gemini-cli with Google AI Pro

    • sho_hn 3 hours ago

      To be fair, according to OpenAI they started ChatGPT as a demo/experiment and were taken by surprise when it went viral.

      It may well be that they also didn't have a product culture as an organization, but were willing to experiment or let small teams do so.

      It's still a lesson, but maybe a different one.

      With organizational scale it becomes harder and harder to launch experiments under the brand. Red tape increases, outside scrutiny increases. Retaining the ability to do that is difficult.

      Google does experiment a fair bit (including in AI; e.g. NotebookLM and its podcast feature are, I think, a standout example of trying to see what sticks), but they also tend to hide their experiments in developer portals nowadays, which makes it difficult to get a signal from a general consumer audience.

      • ajcp 3 hours ago

        > With organizational scale it becomes harder and harder to launch experiments under the brand

        I feel like Google tried to solve for this with their `withgoogle.com` domain, and it just ends up being confusing or, worse still, frustrating when you see something awesome and then nothing ever comes of it.

      • strongpigeon 3 hours ago

        Google is definitely good at experimenting (and yeah, NotebookLM is really cool), which is a product of the bottom-up culture. The lack of a consistent story with regard to AI products, however, is a testament to the lack of product vision from the top.

        • ajcp 3 hours ago

          NotebookLM came out of Google Labs though, and in collaboration with outside stakeholders. I'm not sure I would call it a success of "bottom-up" culture, but a well realized idea from a dedicated incubator. That doesn't necessarily mean the rest of the company is so empowered or product oriented.

    • xnx 3 hours ago

      > Google doesn't have a product culture

      Fair criticism that it took someone else to make something of the tech that Google initially invented, but Google has been furiously experimenting across all their active products since Sundar's "code red" memo.

    • wmf 2 hours ago

      Didn't Google have Bard internally around the same time as ChatGPT?

      • blueg3 12 minutes ago

        Bard came out shortly after ChatGPT as a prototype of what would become Gemini-the-chatbot.

        There were other, less-available prototypes prior to that.

      • eternal_braid 2 hours ago

        Search for Meena from Google.

    • renewiltord 3 hours ago

      Well, they had an internal ethics team that told them that their technology was garbage. That can't help. The other guys' ethics teams are all like "Our stuff is too awesome for people to use. No one should have this kind of unbridled power. We must muzzle the beast before a tourist rides him" and Google's ethics team was like "our shit sucks lol this is just a Markov chain parrot doesn't do shit it's garbage".

      • Filligree 2 hours ago

        Which, to be fair—we're talking about the pre-GPT-3.5 era—it kind of was?

        • charcircuit 8 minutes ago

          Don't you remember all of the scaremongering around how unethical it would be to release a GPT-3 model publicly?

          Google personally reached out to someone trying to reproduce GPT-3 and convinced him to abandon his plan of releasing it to the public.

        • renewiltord 2 hours ago

          The unfortunate truth when you're on the cusp of a new technology: it isn't good yet. Keeping a team of guys around whose sole job it is to tell you your stuff sucks is probably not aligned with producing good stuff.

    • adventured 3 hours ago

      Along with its engineering talent and resource scale, I think their in-house chips are one of their core advantages. They can scale in a way that their peers are going to struggle to match, and at much lower cost. Nvidia's extreme margins are Google's opportunity.

    • killerstorm 2 hours ago

      ChatGPT-3.5 was more of a novelty than a product.

      It would be weird to release that as a serious company. They tried making a deliberately-wacky chatbot but it was not fun.

      Letting OpenAI release it first was the right move.

      • Imustaskforhelp 2 hours ago

        To me, OpenAI releasing ChatGPT 3 and ChatGPT 3.5 was the phenomenal leap of intelligence, and even then I appreciated ChatGPT 3 a lot, more so than I do now. It had its quirks, but it was such a good model, man.

        I remember building a really, dead simple SvelteKit website with ChatGPT 3. It was good, it was mind-blowing, and I was proud of it.

        The only interactivity was a button which would change from one color to another and then lead to a PDF.

        If I am going to be honest, the UI was genuinely good. It was great, though, and still gives me more nostalgia and good vibes than current models. Em-dashes weren't that common in ChatGPT 3, IIRC, but I have genuinely forgotten what it was like to talk to it.

    • londons_explore 2 hours ago

      > Android integration potential

      Nearly all the people that matter use iPhone... Yet Apple really hasn't had much success in the AI world, despite being in a position to win if their product were even vaguely passable.

  • robots0only 2 hours ago

    In all of these posts there is someone claiming Claude is the best, then somebody else claiming they have tried a bunch of times and for them Gemini is the best while others find GPT-5 is supreme. Obviously, all of these are subjective narrow experiences. My conclusion is that all frontier models are both good and bad with no clear winner and making good evals is really hard.

    • SkyPuncher 2 hours ago

      I'll be that person:

      * Gemini has the highest ceiling of all the models, but has consistently struggled with token-level accuracy. In other words, its conceptual thinking is well beyond other models, but it sometimes makes stupid errors when talking. This makes it hard to reliably use for tool calling or structured output. Gemini is also very hard to steer, so when it's wrong, it's really hard to correct.

      * Claude is extremely consistent and reliable. It's very, very good at the details - but will start to forget things if things get too complex. The good news is Claude is very steerable and will remember those details if you remind it.

      * GPT-5 seems to be completely random for me. It's so inconsistent that it's extremely hard to use.

      I tend to use Claude because I'm the most familiar with it and I'm confident that I can get good results out of it.

      • Workaccount2 5 minutes ago

        Gemini is also the best for staying on the ball (when it does) over long contexts.

        It's really the only model that can do large(er) codebase work.

      • artdigital 42 minutes ago

        I’d say GPT-5 is the best at following and remembering instructions. After an initial plan it can easily continue with said plan for the next 30-60 minutes without human intervention, and come back with a complete, working, finished feature/product.

        It’s honestly crazy how good it is, coming from Claude. I never thought I could already hand something a design doc and have it one-shot the entire thing with that level of accuracy. Even with Opus, I always need to either steer it, or fix the stuff it forgot by hand / have another phase afterwards to get it from 90% to 100%.

        Yes the Codex TUI sucks but the model with high reasoning is an absolute beast, and convinced me to switch from Claude Max to ChatGPT Pro

      • bcrosby95 an hour ago

        GPT-5 seems best at analyzing the codebase for me. It can pick up nuances and infer strategies Claude and Gemini seem to fail at.

      • Alex-Programs an hour ago

        Personally I prefer Gemini because I still use AI via chat windows, and it can do a good ~90k tokens before it starts getting stupid. I've yet to find an agent that's actually useful and doesn't constantly fuck up everywhere while burning money.

    • Keyframe an hour ago

      The answer is a classic programming one: it depends. There are definitely differences in strengths and weaknesses among them.

      I run Claude CLI as my primary and just ask it nicely to consult Gemini CLI (but not let it do any coding). It works surprisingly well. OpenAI just fell out of my view; I even cancelled my ChatGPT subscription. Gemini is leaping forward and it _feels like_ GPT-5 is a regression... I can't put my finger on it, tbh.

    • smoe an hour ago

      Capability wise, they seem close enough that I don’t bother re-evaluating them against each other all the time.

      One advantage Gemini had (or still has; I'm not sure about the other providers) was its large context window combined with the ability to use PDF documents. It probably saved me weeks of work on an integration with a government system: I could upload hundreds of pages of documentation and immediately start asking questions, generating rules, and troubleshooting payloads that were leading to generic, computer-says-no errors.

      No need to go through RAG shenanigans, and all of it within the free token allowance.
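
      A minimal sketch of that kind of no-RAG workflow, assuming the google-genai Python SDK (exact method names and the model ID depend on the SDK version; the file name and question are made up for illustration):

        from google import genai

        client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

        # Upload the whole PDF once; the large context window means no chunking or retrieval step.
        doc = client.files.upload(file="integration_spec.pdf")  # hypothetical document

        response = client.models.generate_content(
            model="gemini-2.5-pro",
            contents=[doc, "Which required fields are missing from this payload?"],  # hypothetical question
        )
        print(response.text)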

    • Robdel12 2 hours ago

      Yeah, my take is it’s sort of up to the person using the LLM and maybe how they match to that LLM. That’s my hunch as to why we hear wildly different takes on these LLMs working for people. Gemini can be the most productive model for some while others find it entirely unworkable.

      • jiggawatts 2 hours ago

        Not just personalities and preferences, but the purpose for which the AI is being used also affects the results. I primarily use AIs for complex troubleshooting along the lines of: "Here's a megabyte of logs, an IaC template, and a gibberish error code. What's the reason?" Right now, only Gemini Pro 2.5 has any chance of providing a useful output given those inputs, because its long-context attention is better than any other model's.

    • qaq an hour ago

      In my experience Gemini is good at writing specs, hit or miss at reviewing code, and not really usable for iterating on code. Codex is slow but can crack issues that Claude Code struggles with. So my workflow has been to use all three to iterate on specs, have Claude Code work on the implementation, and have Codex review Claude Code's work (sometimes having Gemini double-check it).

    • binary132 2 hours ago

      The fact that there is so much astroturf out there also makes it difficult to evaluate these claims

  • whywhywhywhy 23 minutes ago

    These influencer tests are so pointless and don't represent the reality of model use at all, given that things are constantly being downgraded once people actually use them.

    Not to mention every team will have the bouncing-balls-in-a-polygon demo in their dataset now.

  • vunderba 3 hours ago

    Outside of the aesthetics, the very first example in that Twitter post is "balls bouncing around a constrained rotating rigid physics environment", which has been trivially one-shottable since Claude Code was first announced.

    It was one of the first things I tried when Claude Code went GA:

    https://gondolaprime.pw/hex-balls

    • Synaesthesia 2 hours ago

      They have differing degrees of fidelity in the simulation; this one looks pretty good and it's got parameters. But yes, LLMs are really advanced now in what they can do. I was actually blown away during the Gemini 2.5 announcement by some of the demos people came up with.

  • maerch 3 hours ago

    I still have a bad taste in my mouth after all those GPT-5 hype articles that claimed the model was just one step away from AGI.

    • gardnr an hour ago

      TBF, they all believed that scaling reinforcement learning would achieve the next level. They had planned to "war-dial" reasoning "solutions" to generate synthetic datasets which achieved "success" on complex reasoning tasks. This only really produced incremental improvements at the cost of test-time compute.

      Now Grok is publicly boasting PhD level reasoning while Surge AI and Scale AI are focusing on high quality datasets curated by actual PhD humans.

      Surge AI is boasting $1B in revenue, and I am wondering how much of that was paid in X.ai stock: https://podcasts.apple.com/us/podcast/the-startup-powering-t...

      In my opinion the major advancements of 2025 have been more efficient models. They have made smaller models much, much better (including MoE models) but have failed to meaningfully push the SoTA on huge models; at least when looking at the USA companies.

      • svachalek an hour ago

        Same, Qwen3 Omni blows my mind for what a 30B-A3B model can do. I had a video chat with it and it correctly identified plant species I showed it.

  • ACCount37 3 hours ago

    I hope this is the one that unfucks the multi-turn instruction following.

    One of the biggest issues holding Gemini back, IMO, compared to the competitors.

    Many LLMs are still plagued by "it's easier to reset the conversation than to unfuck the conversation", but Gemini 2.5 is among the worst.

    • solarkraft 2 hours ago

      Gemini's loops are a real problem. Within a few minutes of using it in the CLI it happened to me ("I can verify that I fulfilled the user's request, I can verify that I fulfilled the user's request..."). It's telling that the CLI has detection for this.

      The other day I asked 2.5 Pro for suggestions. It would provide one, which I rejected with some reasoning. It would provide another, which I also rejected. Asked for more, it would then loop between the two, repeating the previous suggestions verbatim. It went on like this 3-4 times, even after being told to reflect on it, and despite being able to recite the rejection reasons.

  • nharada 2 hours ago

    Gemini has always been the leader in multimodal work like images and video. I expect this won't be any different, but I'm interested to see how it turns out.

  • renewiltord 3 hours ago

    Every three months there's some mind blowing hype around a Google product, lots of people talk about it, and then when I use it it's not nearly as good.

  • Oras 3 hours ago

    These tests mean nothing; I have yet to see a model that is better than Sonnet 4 for coding. I've tried many, and all of them are sub-par, even with a small code base.

    • nnevatie 3 hours ago

      Well, Codex with GPT-5 High beats Claude Sonnet 4.5. This is anecdotal, but I've used both extensively.

      • solarkraft 2 hours ago

        At what speed? At some point you’ll have to compare to Opus.

    • Bolwin 2 hours ago

      Well yeah, no surprise. You should try GLM 4.6.

  • esafak 3 hours ago

    We can't see the code and the challenge is pedestrian. Nothing to see here.