Gemini 3

(blog.google)

1735 points | by preek 4 months ago ago

402 comments

  • lairv 4 months ago

    Out of curiosity, I gave it the latest project euler problem published on 11/16/2025, very likely out of the training data

    Gemini thought for 5m10s before giving me a Python snippet that produced the correct answer. The leaderboard says the three fastest humans to solve this problem took 14min, 20min, and 1h14min respectively

    Even though I expect this sort of problem to be very much in the distribution of what the model has been RL-tuned on, it's wild that frontier models can now solve in minutes what would take me days

    • thomasahle 4 months ago

      I also used Gemini 3 Pro Preview. It finished in 271s = 4m31s.

      Sadly, the answer was wrong.

      It also returned 8 "sources", like stackexchange.com, youtube.com, mpmath.org, ncert.nic.in, and kangaroo.org.pk, even though I specifically told it not to use websearch.

      Still a useful tool though. It definitely gets the majority of the insights.

      Prompt: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

    • qsort 4 months ago

      To be fair a lot of the impressive Elo scores models get are simply due to the fact that they're faster: many serious competitive coders could get the same or better results given enough time.

      But seeing these results I'd be surprised if by the end of the decade we don't have something that is to these puzzles what Stockfish is to chess. Effectively ground truth and often coming up with solutions that would be absolutely ridiculous for a human to find within a reasonable time limit.

    • rbjorklin 4 months ago

      Your post made me curious to try a problem I have been coming back to ever since ChatGPT was first released: https://open.kattis.com/problems/low

      I have had no success using LLMs to solve this particular problem until trying Gemini 3 just now, despite solutions to it existing in the training data. It has been my personal litmus test for LLM programming capabilities, and a model finally passed.

    • sedatk 4 months ago

      Just to clarify the context for future readers: the latest problem at the moment is #970: https://projecteuler.net/problem=970

    • thomasahle 4 months ago

      I tried it with gpt-5.1 thinking, and it just searched and found a solution online :p

    • irthomasthomas 4 months ago

      Are you sure it did not retrieve the answer using websearch?

    • id 4 months ago

      gpt-5.1 gave me the correct answer after 2m 17s. That includes retrieving the Euler website. I didn't even have to run the Python script, it also did that.

    • j2kun 4 months ago

      Did it search the web?

    • jamilton 4 months ago

      Yeah, LLMs used to not be up to par for new Project Euler problems, but GPT-5 was able to do a few of the recent ones which I tried a few weeks ago.

    • bgwalter 4 months ago

      Does it matter if it is out of the training data? The models integrate web search quite well.

      What if they have an internal corpus of new and curated knowledge that is constantly updated by humans and accessed in a similar manner? It could be active even if web search is turned off.

      They would surely add the latest Euler problems with solutions in order to show off in benchmarks.

    • bumling 4 months ago

      I asked Grok to write a Python script to solve this and it did it in slightly under ten minutes, after one false start where I'd asked it using a mode that doesn't think deeply enough. Impressive.

    • blubber 4 months ago

      Is this a problem for which the (human) solution is well documented and known, and was learned during the training phase? Or is it a novel problem?

      I personally think anthropomorphizing LLMs is a bad idea.

    • NaomiLehman 4 months ago

      definitely uses a lot of tooling. From "thinking":

      > I'm now writing a Python script to automate the summation computation. I'm implementing a prime sieve and focusing on functions for Rm and Km calculation [...]

    • mistercheph 4 months ago

      If used through the chat interface, are these models not doing some RAG?

    • ivape 4 months ago

      So when does the developer admit defeat? Do we have a benchmark for that yet?

    • motbus3 4 months ago

      We need to wait and see. According to Google, they solved AI 10 years ago with Google Duo, yet somehow they keep smashing records despite shipping the worst coding tool until Gemini 2.5. Google's internal benchmarks are irrelevant.

    • panarky 4 months ago

      [flagged]

    • orly01 4 months ago

      Wow. Sounds pretty impressive.

    • lofaszvanitt 4 months ago

      The problem is these models are optimized to solve the benchmarks, not real world problems.

  • davidpolberger 4 months ago

    This is wild. I gave it some legacy XML describing a formula-driven calculator app, and it produced a working web app in under a minute:

    https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

    I spent years building a compiler that takes our custom XML format and generates an app for Android or Java Swing. Gemini pulled off the same feat in under a minute, with no explanation of the format. The XML is fairly self-explanatory, but still.

    I tried doing the same with Lovable, but the resulting app wouldn't work properly, and I burned through my credits fast while trying to nudge it into a usable state. This was on another level.

    • zarzavat 4 months ago

      This is exactly the kind of task that LLMs are good at.

      They are good at transforming one format to another. They are good at boilerplate.

      They are bad at deciding requirements by themselves. They are bad at original research, for example developing a new algorithm.

  • dwringer 4 months ago

    Well, I tried a variation of a prompt I was messing with in Flash 2.5 the other day in a thread about AI-coded analog clock faces. Gemini Pro 3 Preview gave me a result far beyond what I saw with Flash 2.5, and got it right in a single shot.[0] I can't say I'm not impressed, even though it's a pretty constrained example.

    > Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face. Please pay attention to the correct alignment of the numbers, hour markings, and hands on the face.

    [0] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

    • kjgkjhfkjf 4 months ago

      This is quite likely to be in the training data, since it's one of the projects in Wes Bos's free 30 days of Javascript course[0].

      [0] https://javascript30.com/

    • stalfie 4 months ago

      The subtle "wiggle" animation that the second hand makes after moving doesn't fire when it hits 12. Literally unwatchable.

    • kldg 4 months ago

      In defense of 2.5 (Pro, at least), it was able to generate for me a metric UNIX clock as a webpage, which I was amused by. It uses kiloseconds/megaseconds/etc.; there are 86.4 ks/day. The "seconds" hand goes around every 1000 seconds, which ticks over the "hour" hand. Instead of saying 4 am, you'd say it's 14.

      As a calendar or "date" system, we start at UNIX time's creation, so it's currently 1.76 gigaseconds AUNIX. You might use megaseconds as the "week" and gigaseconds more like an era, e.g. Queen Elizabeth III's reign persisting through the entire fourth gigasecond and into the fifth. The clock also displays teraseconds, though this is just a little purple speck atm. Of course, this can work off-Earth, where you would simply use 88.775 ks as the "day"; the "dates" a Martian and an Earthling share with each other would be interchangeable.
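      The scheme above is easy to compute; a minimal Python sketch (the function names are mine, not from the generated clock):

```python
import time

def metric_unix(ts=None):
    """Split a UNIX timestamp into giga/mega/kilo-second digits."""
    ts = int(time.time()) if ts is None else int(ts)
    giga, rem = divmod(ts, 1_000_000_000)
    mega, rem = divmod(rem, 1_000_000)
    kilo, sec = divmod(rem, 1_000)
    return giga, mega, kilo, sec

def kiloseconds_of_day(seconds_since_midnight):
    """A metric 'day' is 86.4 ks, so 4 am (14,400 s) reads as 14.4."""
    return seconds_since_midnight / 1000

print(metric_unix(1_763_424_000))    # → (1, 763, 424, 0), i.e. ~1.76 Gs AUNIX
print(kiloseconds_of_day(4 * 3600))  # → 14.4
```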

      I can't seem to get anyone interested in this very serious venture, though... I guess I'll have to wait until the 50th or so iteration of Figure, whenever it becomes useful, to be able to build a 20-foot-tall physical metric UNIX clock in my front yard.

    • xnx 4 months ago

      This is cool. Gemini 2.5 Pro was also capable of this. Gemini was able to recreate a famous piece of clock artwork in July: https://gemini.google.com/app/93087f373bd07ca2

      "Against the Run": https://www.youtube.com/watch?v=7xfvPqTDOXo

    • farazbabar 4 months ago

      https://ai.studio/apps/drive/1yAxMpwtD66vD5PdnOyISiTS2qFAyq1... <- This is very nice. I was able to make the seconds smooth in three iterations (it initially used SVG, which was jittery, but eventually got to this).

    • pmarreck 4 months ago

      https://ai.studio/apps/drive/1oGzK7yIEEHvfPqxBGbsue-wLQEhfTP...

      I made a few improvements... which all worked on the first try... except the ticking sound, which worked on the second try (the first try was too much like a "blip")

    • thegrim33 4 months ago

      "Allow access to Google Drive to load this Prompt."

      .... why? For what possible reason? No, I'm not going to give access to my privately stored file share in order to view a prompt someone has shared. Come on, Google.

    • malfist 4 months ago

      That is not the same prompt the other person was using. In particular, this one doesn't provide the time to set the clock to, which makes the challenge a lot simpler. This one also includes JavaScript.

      The prompt the other person was using is:

      ``` Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting. ```

      Which is much more difficult.

      For what it's worth, I supplied the same prompt as the OG clock challenge and it utterly failed, not only generating a terrible clock, but doing so with a fair bit of typescript: https://ai.studio/apps/drive/1c_7C5J5ZBg7VyMWpa175c_3i7NO7ry...

    • dyauspitr 4 months ago

      Having seen the page the other day this is pretty incredible. Does this have the same 2000 token limit as the other page?

    • skybrian 4 months ago

      It looks quite nice, though to nitpick, it has “quartz” and “design & engineering” for no reason.

    • pmarreck 4 months ago

      holy shit! This is actually a VERY NICE clock!

  • SXX 4 months ago

    Static Pelican is boring. First attempt:

    Generate SVG animation of following:

    1 - There is High fantasy mage tower with a top window a dome

    2 - Green goblin come in front of tower with a torch

    3 - Grumpy old mage with beard appear in a tower window in high purple hat

    4 - Mage sends fireball that burns goblin and all screen is covered in fire.

    Camera view must be from behind of goblin back so we basically look at tower in front of us:

    https://codepen.io/Runway/pen/WbwOXRO

  • prodigycorp 4 months ago

    I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark. In fact, gemini-2.5-pro gets a lot closer (but is still wrong).

    For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).

    This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start to make some extrapolation about society and what this means for others. Meanwhile, I'm left wondering how they're still getting this problem wrong.

    edit: I've gotten a lot of good feedback here. I think there are ways I can improve my benchmark.

    • WhitneyLand 4 months ago

      >>benchmarks are meaningless

      No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.

      >>my fairly basic python benchmark

      I suspect your definition of “basic” may not be consensus. Gpt-5 thinking is a strong model for basic coding and it’d be interesting to see a simple python task it reliably fails at.

    • dekhn 4 months ago

      Using a single custom benchmark as a metric seems pretty unreliable to me.

      Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.

    • thefourthchime 4 months ago

      I like to ask "Make a pacman game in a single html page". No model has ever gotten a decent game in one shot. My attempt with Gemini3 was no better than 2.5.

    • sosodev 4 months ago

      How can you be sure that your benchmark is meaningful and well designed?

      Is the only thing that prevents a benchmark from being meaningful publicity?

    • benterix 4 months ago

      > This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks.

      Yeah, I have my own set of tests, and the results are a bit unsettling in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't. This is especially true of Gemini 2.5 Pro, which was performing much better on the same tests several months ago than it does now.

    • ddalex 4 months ago

      I moved from using the model for Python coding to Golang coding and saw incredible speedups in arriving at a correct version of the code.

    • mring33621 4 months ago

      I agree that benchmarks are noise. I guess, if you're selling an LLM wrapper, you'd care, but as a happy chat end-user, I just like to ask a new model about random stuff that I'm working on. That helps me decide if I like it or not.

      I just chatted with gemini-3-pro-preview about an idea I had and I'm glad that I did. I will definitely come back to it.

      IMHO, the current batch of free, free-ish models are all perfectly adequate for my uses, which are mostly coding, troubleshooting and learning/research.

      This is an amazing time to be alive and the AI bubble doomers that are costing me some gains RN can F-Off!

    • t0mas88 4 months ago

      Google reports a lower score for Gemini 3 Pro on SWEBench than Claude Sonnet 4.5, which is comparing a top tier model with a smaller one. Very curious to see whether there will be an Opus 4.5 that does even better.

    • testartr 4 months ago

      And models are still pretty bad at playing tic-tac-toe; they can do it, but they think way too much.

      it's easy to focus on what they can't do

    • luckydata 4 months ago

      I'm dying to know what you're giving it that it's choking on. It would actually be really impressive if that's the case.

    • mupuff1234 4 months ago

      Could also just be rollout issues.

    • Rover222 4 months ago

      curious if you tried grok 4.1 too

    • Filligree 4 months ago

      What's the benchmark?

    • m00dy 4 months ago

      that's why everyone using AI for code should code in rust only.

  • simonw 4 months ago

    Here are my notes and pelican benchmark, including a new, harder benchmark because the old one was getting too easy: https://simonwillison.net/2025/Nov/18/gemini-3/

    • torginus 4 months ago

      Considering how important this benchmark has become to judging state-of-the-art AI models, I imagine each AI lab has a dedicated 'pelican guy', a highly accomplished and academically credentialed person who's working around the clock on training the model to make better and better SVG pelicans on bikes.

    • skylurk 4 months ago

      They've been training for months to draw that pelican, just for you to move the goalposts.

    • mtrovo 4 months ago

      It's interesting that you mentioned on a recent post that saturation on the pelican benchmark isn't a problem because it's easy to test for generalization. But now looking at your updated benchmark results, I'm not sure I agree. Have the main labs been climbing the Pelican on a bike hill in secret this whole time?

    • Thrymr 4 months ago

      Considering how many other "pelican riding a bicycle" comments there are in this thread, it would be surprising if this was not already incorporated in the training data. If not now, soon.

    • libraryofbabel 4 months ago

      I was interested (and slightly disappointed) to read that the knowledge cutoff for Gemini 3 is the same as for Gemini 2.5: January 2025. I wonder why they didn't train it on more recent data.

      Is it possible they use the same base pre-trained model and just fine-tuned and RL-ed it better (which, of course, is where all the secret sauce training magic is these days anyhow)? That would be odd, especially for a major version bump, but it's sort of what having the same training cutoff points to?

    • tkgally 4 months ago

      I updated my benchmark of 30 pelican-bicycle alternatives that I posted here a couple of weeks ago:

      https://gally.net/temp/20251107pelican-alternatives/index.ht...

      There seem to be one or two parsing errors. I'll fix those later.

  • ttul 4 months ago

    My favorite benchmark is to analyze a very long audio file recording of a management meeting and produce very good notes along with a transcript labeling all the speakers. 2.5 was decently good at generating the summary, but it was terrible at labeling speakers. 3.0 has so far absolutely nailed speaker labeling.

    • rfw300 4 months ago

      My audio experiment was much less successful — I uploaded a 90-minute podcast episode and asked it to produce a labeled transcript. Gemini 3:

      - Hallucinated at least three quotes (that I checked) resembling nothing said by any of the hosts

      - Produced timestamps that were almost entirely wrong. Language quoted from the end of the episode, for instance, was timestamped 35 minutes into the episode, rather than 85 minutes.

      - Almost all of what is transcribed is heavily paraphrased and abridged, in most cases without any indication.

      Understandable that Gemini can't cope with such a long audio recording yet, but I would've hoped for a more graceful, less hallucinatory failure mode. Unfortunately, it aligns with my impression of past Gemini models: they are impressively smart but fail in the most catastrophic ways.

    • satvikpendem 4 months ago

      I'd do the transcript and the summary parts separately. Dedicated audio models from vendors like ElevenLabs or Soniox use speaker detection models to produce an accurate speaker based transcript while I'm not necessarily sure that Google's models do so, maybe they just hallucinate the speakers instead.

    • iagooar 4 months ago

      What prompt do you use for that?

    • renegade-otter 4 months ago

      It's not even THAT hard. I am working on a side project that gets a podcast episode and then labels the speakers. It works.

    • valtism 4 months ago

      Parakeet TDT v3 would be really good at that

  • Workaccount2 4 months ago

    It still failed my image identification test ([a photoshopped picture of a dog with 5 legs]...please count the legs) that so far every other model has failed agonizingly, even failing when I tell them they are failing, and they tend to fight back at me.

    Gemini 3, however, while still failing, at least recognized the 5th leg, but thought the dog was... well endowed. The 5th leg is clearly a leg, despite being where you would expect the dog's member to be. I'll give it half credit for at least recognizing that something was there.

    Still though, there is a lot of work that needs to be done on getting these models to properly "see" images.

    • GuB-42 4 months ago

      > Gemini 3 however, while still failing, at least recognized the 5th leg, but thought the dog was...well endowed.

      I see that AI is reaching the level of a middle school boy...

    • recitedropper 4 months ago

      Perception seems to be one of the main constraints on LLMs that not much progress has been made on. Perhaps not surprising, given perception is something evolution has worked on since the inception of life itself. Likely much, much more expensive computationally than it receives credit for.

    • column 4 months ago

      "[a photoshopped picture of a dog with 5 legs]...please count the legs"

      Meanwhile, you could benchmark something actually useful. If you're about to say "but that means it won't work for my use case of identifying a person on a live feed" or whatever, then why don't you test that? I really don't understand the kick people get out of successfully tricking LLMs on non-productive tasks with no real-world application. Just like "how many r in strawberry?" and the "uh uh uh, it says two" gotcha. OK, but so what? What good is a benchmark that is so far from a real use case?

    • lukebechtel 4 months ago

      ah interesting. I wonder if this is a "safety guardrails blindspot" due to the placement.

  • ponyous 4 months ago

    Just generated a bunch of 3D CAD models using Gemini 3.0 to see how it compares in spatial understanding, and it's heaps better than anything currently out there, not only in intelligence but also in speed.

    Will run extended benchmarks later, let me know if you want to see actual data.

    • lfx 4 months ago

      Just hand-sketched what a 5-year-old would draw on paper: a house, trees, the sun. Then asked it to generate a 3D model with three.js.

      The results are amazing! 2.5 and 3 seem way, way ahead.

    • mindlessg 4 months ago

      I'm interested in seeing the data.

    • layer8 4 months ago

      Is observed speed meaningful for a model preview? Isn’t it likely to go down once usage goes up?

    • giancarlostoro 4 months ago

      I'm not familiar enough with CAD what type of format is it?

  • falcor84 4 months ago

    I love it that there's a "Read AI-generated summary" button on their post about their new AI.

    I can only expect that the next step is something like "Have your AI read our AI's auto-generated summary", and so forth until we are all the way at Douglas Adams's Electric Monk:

    > The Electric Monk was a labour-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for you, thus saving you the bother of washing them yourself; video recorders watched tedious television for you, thus saving you the bother of looking at it yourself. Electric Monks believed things for you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the world expected you to believe.

    - from "Dirk Gently's Holistic Detective Agency"

    • davedigerati 4 months ago

      Excellent reference. I tried to name an AI project at work "Electric Monk", but it was too 'controversial'.

      Had to change it to Electric Mentor...

    • mikepurvis 4 months ago

      SMBC had a pretty great take on this: https://www.smbc-comics.com/comic/summary

    • egeozcan 4 months ago

      I'm afraid they will finish "The Salmon of Doubt" with AI and sell it to future generations with a very small disclaimer stating it's inspired by Douglas Adams.

      The possibility was already a topic in the series "Mozart in the Jungle", where they made a robot that supposedly finished Mozart's Requiem.

    • wartywhoa23 4 months ago

      > I can only expect that the next step is something like "Have your AI read our AI's auto-generated summary"

      That's basically "The Washing Machine Tragedy" by Stanisław Lem in a nutshell.

    • xeonmc 4 months ago

      Now let’s hope that it will also save labour on resolving cloud infrastructure downtimes too.

    • tonyhart7 4 months ago

      After outsourcing the developer's job, we can outsource all the managers' jobs, leaving the CEO with agentic AI coders as servants.

  • tylervigen 4 months ago

    I am personally impressed by the continued improvement in ARC-AGI-2, where Gemini 3 got 31.1% (vs ChatGPT 5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs - many of the puzzles test the kind of thing that humans intuit because of millions of years of evolution, but these concepts do not necessarily appear in written form (or when they do, it's not clear how they connect to specific ARC puzzles).

    The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me.

    The ARC puzzles in question: https://arcprize.org/arc-agi/2/

    • stephc_int13 4 months ago

      What I would do, if I were in the position of a large company in this space, is have an internal team create an ARC replica covering very similar puzzles and use that as part of the training.

      Ultimately, most benchmarks can be gamed, and their real utility is thus short-lived.

      But I think it's also fair to use any means to beat it.

    • grantpitt 4 months ago

      Agreed, it also leads performance on arc-agi-1. Here's the leaderboard where you can toggle between arc-agi-1 and 2: https://arcprize.org/leaderboard

    • tylervigen 4 months ago

      This comment was moved from another thread. The original thread included a benchmark chart with ARC performance: https://blog.google/products/gemini/gemini-3/#gemini-3

    • HarHarVeryFunny 4 months ago

      There's a good chance Gemini 3 was trained on ARC-AGI problems, unless they state otherwise.

    • m3kw9 4 months ago

      That looks great, but what we all care about is how it translates to real-world problems like programming, where it isn't really excelling by 2x.

  • syspec 4 months ago

    I have "unlimited" access to both Gemini 2.5 Pro and Claude 4.5 Sonnet through work.

    From my experience, both are capable and can solve nearly all the same complex programming requests, but time and time again Gemini spits out reams and reams of code so over-engineered that, while it totally works, I would never want to have to interact with it.

    When looking at the code, you can't quite say why it looks "gross", but then you ask Claude to do the same task in the same repo (I use Cline; it's just a dropdown change) and the code also works, but there's a lot less of it and it has a more "elegant" feel to it.

    I know that isn't easy to capture in benchmarks, but I hope Gemini 3.0 has improved in this regard

    • plaidfuji 4 months ago

      I have the same experience with Gemini, that it’s incredibly accurate but puts in defensive code and error handling to a fault. It’s pretty easy to just tell it “go easy on the defensive code” / “give me the punchy version” and it cleans it up

    • poyu 4 months ago

          but I would never want to have to interact with
      
      That is its job security ;)

    • jmkni 4 months ago

      I can relate to this, it's doing exactly what I want, but it ain't pretty.

      It's fine though if you take the time to learn what it's doing and write a nicer version of it yourself

    • eitally 4 months ago

      I have had a similar experience vibe coding with Copilot (ChatGPT) in VSCode, against the Gemini API. I wanted to create a dad joke generator and then have it also create a comic-styled, 4-cel interpretation of the joke. Simple, right? I was able to easily get it to create the joke, but it repeatedly failed on the API call for the image generation. What started as perhaps 100 lines of total code in two files ended up being about 1500 LOC with an enormous built-in self-testing mechanism ... and it still didn't work.

  • coffeecoders 4 months ago

    Feels like the same consolidation cycle we saw with mobile apps and browsers is playing out here. The winners aren't necessarily those with the best models, but those who already control the surfaces where people live their digital lives.

    Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.

    Open models and startups can innovate, but the platforms can immediately put their AI in front of billions of users without asking anyone to change behavior (not even typing a new URL).

    • Workaccount2 4 months ago

      AI Overviews has arguably done more harm than good for them, because people assume it's Gemini, but it's really some ultra-lightweight model made to handle millions of queries a minute, and it has no shortage of stupid mistakes/hallucinations.

    • bitpush 4 months ago

      > Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.

      One of them isn't the same as the others (hint: it's Apple). The only thing Apple is doing with Maps is adding ads: https://www.macrumors.com/2025/10/26/apple-moving-ahead-with...

    • acoustics 4 months ago

      Microsoft hasn't been very quiet about it, at least in my experience. Every time I boot up Windows I get some kind of blurb about an AI feature.

    • int_19h 4 months ago

      Gemini genuinely has an edge over the others in its super-long context size, though. There are some tasks where this is the deal breaker, and others where you can get by with a smaller size, but the results just aren't as good.

    • ehsankia 4 months ago

      > The winners aren’t necessarily those with the best models

      Is there evidence that's true? That the other models are significantly better than the ones you named?

  • stevesimmons 4 months ago

    A nice Easter egg in the Gemini 3 docs [1]:

        If you are transferring a conversation trace from another model, ... to bypass strict validation in these specific scenarios, populate the field with this specific dummy string:
    
        "thoughtSignature": "context_engineering_is_the_way_to_go"
    
    [1] https://ai.google.dev/gemini-api/docs/gemini-3?thinking=high...
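    A hedged sketch of where that dummy value would sit when replaying a trace through generateContent (the exact part layout is per the linked docs; treat this as illustrative, not authoritative):

```python
# Hypothetical request body: replaying a conversation trace that originated
# from another model. The documented dummy string stands in for the real
# thought signature on the prior model turn so strict validation passes.
request_body = {
    "contents": [
        {"role": "user", "parts": [{"text": "What is 2 + 2?"}]},
        {
            "role": "model",
            "parts": [{
                "text": "2 + 2 = 4.",
                "thoughtSignature": "context_engineering_is_the_way_to_go",
            }],
        },
        {"role": "user", "parts": [{"text": "And times 3?"}]},
    ]
}
```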
    • bijant 4 months ago

      It's an artifact of the fact that they don't show you the reasoning output but need it for further messages, so they save each API conversation on their side and give you a reference number. That sucks from a GDPR compliance perspective, and also for transparent pricing: you have no way to control reasoning-trace length (which is billed at the much higher output rate) other than switching between low/high, and if the model decides to think longer, "low" can end up using more tokens than "high" on a prompt where the model decides not to think much. "Thinking budgets" are now "legacy", so while you can constrain output length, you cannot constrain cost.

      You also cannot optimize your prompts if some red herring makes the LLM get hung up on something irrelevant, only to realize this in later thinking steps. If that's caused by something in your system prompt, it will happen with every single request. Finding what makes the model go astray can be rather difficult with 15k-token system prompts or a multitude of MCP tools; you're basically blind while trying to optimize a black box. You can try different variations of parts of your system prompt or tool descriptions, but fewer resulting thinking tokens doesn't mean better if those reasoning steps were actually beneficial (if only in edge cases). That would be immediately apparent upon inspection, but it is hard or impossible to determine without access to the full chain of thought.

      For the uninitiated, the reasons OpenAI started replacing the CoT with summaries were (a) to prevent rapid distillation, which they suspected DeepSeek had used for R1, and (b) to prevent embarrassment when app users see the CoT and find parts of it objectionable, irrelevant, or absurd (reasoning steps that make sense for an LLM do not necessarily look like human reasoning). That's a tradeoff that is great for end users but terrible for developers.

      Since open-weights LLMs necessarily output their full reasoning traces, the potential to optimize prompts for specific tasks is much greater, and for certain applications that will certainly outweigh the performance delta to Google/OpenAI.

  • CMay 4 months ago

    I was sorting out the right way to handle a medical thing and Gemini 2.5 Pro was part of the way there, but it lacked some necessary information. Got the Gemini 3.0 release notification a few hours after I was looking into that, so I tried the same exact prompt and it nailed it. Great, useful, actionable information that surfaced actual issues to look out for and resolved some confusion. Helped work through the logic, norms, studies, standards, federal approvals and practices.

    Very good. Nice work! These things will definitely change lives.

    • JohnKemeny 4 months ago

      This ad was brought to you by DeepMind™. Changing people's lives.

  • __jl__ 4 months ago

    API pricing is up to $2/M for input and $12/M for output

    For comparison: Gemini 2.5 Pro was $1.25/M input and $10/M output; Gemini 1.5 Pro was $1.25/M input and $5/M output.

    • raincole 4 months ago

      Still cheaper than Sonnet 4.5: $3/M for input and $15/M for output.

    • jhack 4 months ago

      With this kind of pricing I wonder if it'll be available in Gemini CLI for free or if it'll stay at 2.5.

    • dktp 4 months ago

      It's interesting that grounding with search cost changed from

      * 1,500 RPD (free), then $35 / 1,000 grounded prompts

      to

      * 1,500 RPD (free), then (Coming soon) $14 / 1,000 search queries

      It looks like the pricing changed from per-prompt (previous models) to per-search (Gemini 3)
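      Whether the per-search model ends up cheaper depends entirely on how many queries a grounded prompt fires. A rough sketch using only the two quoted rates (all other billing factors, like the 1,500 free daily requests, are ignored here):

```python
# Compare the old per-prompt grounding rate ($35 / 1,000 grounded prompts)
# with the announced per-search rate ($14 / 1,000 search queries).
OLD_PER_PROMPT = 35 / 1000  # $0.035 per grounded prompt
NEW_PER_QUERY = 14 / 1000   # $0.014 per search query

def grounding_cost(prompts: int, queries_per_prompt: float) -> tuple[float, float]:
    """Return (old_cost, new_cost) in USD for a batch of grounded prompts."""
    old = prompts * OLD_PER_PROMPT
    new = prompts * queries_per_prompt * NEW_PER_QUERY
    return old, new

# Break-even is 0.035 / 0.014 = 2.5 searches per prompt: fewer queries than
# that and per-search billing is cheaper, more and it costs extra.
for q in (1, 2.5, 4):
    old, new = grounding_cost(1000, q)
    print(f"{q} queries/prompt: old ${old:.2f} vs new ${new:.2f}")
```

      So a prompt averaging a single search gets much cheaper, while agentic flows that fan out into many searches per prompt could end up paying more.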

    • fosterfriends 4 months ago

      Thrilled to see the cost is competitive with Anthropic.

    • hirako2000 4 months ago

      [flagged]

  • siva7 4 months ago

    I have my own private benchmarks for reasoning capabilities on complex problems, and I run SOTA models against them regularly (professional cases from law and medicine). Anthropic (Sonnet 4.5 Extended Thinking) and OpenAI (Pro models) get halfway decent results on many cases, while Gemini 2.5 Pro struggled (it was overconfident in its initial assumptions). So I ran these benchmarks against Gemini 3 Pro, and I'm not impressed. The reasoning is way more nuanced than their older model, but it still makes mistakes the other two SOTA competitors don't make. For example, in a law benchmark it forgets that certain principles don't apply in the country of the provided case. It seems very US-centric in its thinking, whereas the Anthropic and OpenAI pro models seem more aware of the assumed cultural context of the case. All in all, I don't think this new model is ahead of the other two main competitors, but it has a new nuanced touch and is certainly way better than Gemini 2.5 Pro (which says more about how bad that one actually was on complex problems).

    • MaxL93 4 months ago

      > It seems very US centric in its thinking

      I'm not surprised. I'm French and one thing I've consistently seen with Gemini is that it loves to use Title Case (Everything is Capitalized Except the Prepositions) even in French or other languages where there is no such thing. A 100% american thing getting applied to other languages by the sheer power of statistical correlation (and probably being overtrained on USA-centric data). At the very least it makes it easy to tell when someone is just copypasting LLM output into some other website.

  • meetpateltech 4 months ago
  • crawshaw 4 months ago

    Has anyone who is a regular Opus / GPT5-Codex-High / GPT5 Pro user given this model a workout? Each Google release is accompanied by a lot of devrel marketing that sounds impressive but whenever I put the hours into eval myself it comes up lacking. Would love to hear that it replaces another frontier model for someone who is not already bought into the Gemini ecosystem.

    • film42 4 months ago

      At this point I'm only using google models via Vertex AI for my apps. They have a weird QoS rate limit but in general Gemini has been consistently top tier for everything I've thrown at it.

      Anecdotal, but I've also not experienced any regression in Gemini quality where Claude/OpenAI might push iterative updates (or quantized variants for performance) that cause my test bench to fail more often.

    • mmaunder 4 months ago

      Yes. I am. It is spectacular in raw cognitive horsepower. Smarter than gpt5-codex-high but Gemini CLI is still buggy as hell. But yes, 3 has been a game changer for me today on hardcore Rust, CUDA and Math projects. Unbelievable what they’ve accomplished.

    • Szpadel 4 months ago

      I gave it a spin with instructions that worked great with gpt-5-codex (5.1 regressed a lot so I do not even compare to it).

      Code quality was fine for my very limited tests but I was disappointed with instruction following.

      I tried a few tricks, but I wasn't able to convince it to present a plan before starting implementation.

      I have instructions saying it should first do exploration (where it tries to discover what I want), then plan the implementation, and then write code, but it always jumps directly to code.

      This is a big issue for me, especially because gemini-cli lacks a plan mode like Claude Code's.

      For Codex, those instructions make a plan mode redundant.

    • Narciss 4 months ago

      I've been working with it, and so far it's been very impressive. Better than Opus in my feels, but I have to test more, it's super early days

  • bnchrch 4 months ago

    I've been so happy to see Google wake up.

    Many can point to a long history of killed products and soured opinions but you can't deny theyve been the great balancing force (often for good) in the industry.

    - Gmail vs Outlook

    - Drive vs Word

    - Android vs iOS

    - Worklife balance and high pay vs the low salary grind of before.

    Theyve done heaps for the industry. Im glad to see signs of life. Particularly in their P/E which was unjustly low for awhile.

    • digbybk 4 months ago

      Ironically, OpenAI was conceived as a way to balance Google's dominance in AI.

    • ThrowawayR2 4 months ago

      They've poisoned the internet with their monopoly on advertising, the air pollution of the online world, a transgression that far outweighs any good they might have done. Many of the negative social effects of being online come from the need to drive more screen time, more engagement, more clicks, and more ad impressions firehosed into the faces of users for sweet, sweet advertiser money. When Google finally defeats ad-blocking, yt-dlp, etc., remember this.

    • epolanski 4 months ago

      Outlook is much better than Gmail and so is the office suite.

      It's good there's competition in the space though.

    • redbell 4 months ago

      > Drive vs Word

      You mean Drive vs OneDrive or, maybe Docs vs Word?

    • rvz 4 months ago

      Google has always been there; it's just that many didn't realize DeepMind even existed. I said years ago that it needed to be put to commercial use [0], and Google AI != DeepMind.

      You are now seeing their valuation finally adjust to that fact, all thanks to DeepMind finally being put to use.

      [0] https://news.ycombinator.com/item?id=34713073

    • 63stack 4 months ago

      - Making money vs general computing

    • drewda 4 months ago

      For what it's worth, most of those examples are acquisitions. That's not a hit against Google in particular. That's the way all big tech co's grow. But it's not necessarily representative of "innovation."

    • IlikeKitties 4 months ago

      Something about bringing balance to the force not destroying it.

    • storus 4 months ago

      If you consider surveillance capitalism and dark pattern nudges a good thing, then sure. Gemini has the potential to obliterate their current business model completely so I wouldn't consider that "waking up".

    • qweiopqweiop 4 months ago

      Forgot to mention absolutely milking every ounce of their users attention with Youtube, plus forcing Shorts!

    • kevstev 4 months ago

      All those examples date back to the 2000s. Android has seen some significant improvements, but everything else has stagnated if not enshittified. Remember when Google told us never to worry about deleting anything? Then they started backing up my photos without asking, and now they constantly nag me to pay a monthly fee.

      They have done a lot, but most of it was in the "don't be evil" days, and those days are a fading memory.

    • samdoesnothing 4 months ago

      Seriously? Google is an incredibly evil company whose net contribution to society is probably only barely positive thanks to their original product (search). Since completely de-googling I've felt a lot better about myself.

    • stephc_int13 4 months ago

      Google is using the typical monopoly playbook, like most other large orgs, and the world would be a "better place" if they were kept in check.

      But at least this company is not run by a narcissistic sociopath.

  • tim333 4 months ago

    Hassabis interview on Gemini 3, with Hard Fork (nyt podcast), also Josh Woodward https://youtu.be/rq-2i1blAlU?t=428 Some points -

    Good at vibe coding 10:30 - step change where it's actually useful

    AGI still 5-10 years. Needs reasoning, memory, world models.

    Is it a bubble? - Partly 22:00

    What's fun to do with Gemini to show the relatives? Suggested taking a selfie with the app and having it edit. 24:00 (I tried and said make me younger. Worked pretty well.)

    Also interesting - apparently they are doing an agent to go through your email inbox and propose replies automatically 4:00. I could see that getting some use.

    • erikpukinskis 4 months ago

      > Needs reasoning, memory, world models.

      Is that all? So they just need to invent:

      1. Thought

      2. A mechanism for efficiently encoding and decoding arbitrary percepts

      3. A formal model of the world

      And then the existing large language models can handle the rest.

      Yep, 5 years and a hundred billion dollars or so should do the trick.

  • aliljet 4 months ago

    Why Gemini 3 isn't at the front of the pack on SWE-bench is really what I was hoping to understand here, especially for a blog post targeted at software developers...

    • Workaccount2 4 months ago

      It doesn't matter, the real benchmark is taking the community temperature on the model after a few weeks of usage.

    • svantana 4 months ago

      SWEBench-Verified is probably benchmaxxed at this stage. Claude isn't even the top performer, that honor goes to Doubao [1].

      Also, the confidence interval for a such a small dataset is about 3 percent points, so these differences could just be up to chance.

      [1] https://www.swebench.com/

    • cube2222 4 months ago

      Yeah, they mention a benchmark I'm seeing the first time (Terminal-Bench 2.0) and are supposedly leading in, while for some reason SWE Bench is down from Sonnet 4.5.

      Curious to see some third-party testing of this model. Currently it seems to primarily improve of "general non-coding and visual reasoning" primarily, based on the benchmarks.

    • pawelduda 4 months ago

      Why is this particular benchmark important?

    • spookie 4 months ago

      Does anyone trust benchmarks at this point? Genuine question. Isn't the scientific consensus that they are broken and poor evaluation tools?

    • ezekiel68 4 months ago

      I mean... it achieved 76.2% vs the leader (Claude Sonnet) at 77.2%.

      That's a "loss" I can deal with.

  • zone411 4 months ago

    Sets a new record on the Extended NYT Connections benchmark: 96.8 (https://github.com/lechmazur/nyt-connections/).

    Grok 4 is at 92.1, GPT-5 Pro at 83.9, Claude Opus 4.1 Thinking 16K at 58.8.

    Gemini 2.5 Pro scored 57.6, so this is a huge improvement.

  • mparis 4 months ago

    I've been playing with the Gemini CLI with the Gemini 3 Pro preview. First impressions are that it's still not really ready for prime time within existing complex codebases. It does not follow instructions.

    The pattern I keep seeing is that I ask it to iterate on a design document. It will, but then it will immediately jump into changing source files despite explicit asks to only update the plan. It may be a gemini CLI problem more than a model problem.

    Also, whoever at these labs is deciding to put ASCII boxes around their inputs needs to try using their own tool for a day.

    People copy and paste text in terminals. Someone at Gemini clearly thought about this as they have an annoying `ctrl-s` hotkey that you need to use for some unnecessary reason.. But they then also provide the stellar experience of copying "a line of text where you then get | random pipes | in the middle of your content".

    Codex figured this out. Claude took a while but eventually figured it out. Google, you should also figure it out.

    Despite model supremacy, the products still matter.

  • mccoyb 4 months ago

    I truly do not understand what plan to use so I can use this model for longer than ~2 minutes.

    Using Anthropic or OpenAI's models are incredibly straightforward -- pay us per month, here's the button you press, great.

    Where do I go for this for these Google models?

    • dktp 4 months ago

      Google actually changed it somewhat recently (3 months ago, give or take) and you can use Gemini CLI with the "regular" Google AI Pro subscription (~22eur/month). Before that, it required a separate subscription

      I can't find the announcement anymore, but you can see it under benefits here https://support.google.com/googleone/answer/14534406?hl=en

      The initial separate subscriptions were confusing at best. Current situation is pretty much same as Anthropic/OpenAI - straightforward

      Edit: changed ~1 month ago (https://old.reddit.com/r/Bard/comments/1npiv2o/google_ai_pro...)

    • mantenpanther 4 months ago

      I am paying for AI Ultra - no idea how to use it in the CLI. It says I don't have access. The Google admin/payment backend is pure evil. What a mess.

    • fschuett 4 months ago

      Update VSCode to the latest version and click the small "Chat" button at the top bar. GitHub gives you like $20 for free per month and I think they have a deal with the larger vendors because their pricing is insanely cheap. One week of vibe-coding costs me like $15, only downside to Copilot is that you can't work on multiple projects at the same time because of rate-limiting.

    • closewith 4 months ago

      Yeah, it truly is an outstandingly bad UX. To use Gemini CLI as a business user like I would Codex or Claude Code, how much and how do I pay?

    • ur-whale 4 months ago

      > I truly do not understand what plan to use so I can use this model for longer than ~2 minutes.

      I had the exact same experience and walked away to chatgpt.

      What a mess.

    • kachapopopow 4 months ago

      ai studio, you get a bunch of usage free if you want more you buy credits (google one subscriptions also give you some additional usage)

    • dboreham 4 months ago

      Also Google discontinues everything in short order, so personally I'm waiting until they haven't discontinued this for, say 6 months, before wasting time evaluating it.

  • golfer 4 months ago
    • tweakimp 4 months ago

      Every time I see a table like this, the numbers go up. Can someone explain what this actually means? Is it just an incremental improvement, with some tests solved slightly better, or is this a breakthrough where the model can do something all the others cannot?

    • HardCodedBias 4 months ago

      If you believe another thread, the benchmarks compare Gemini 3 (probably thinking) to GPT-5.1 without thinking.

      The same person claims that with thinking on, the gap narrows considerably.

      We'll probably have 3rd party benchmarks in a couple of days.

  • bityard 4 months ago

    > Whether you’re an experienced developer or a vibe coder

    I absolutely LOVE that Google themselves drew a sharp distinction here.

    • rafaquintanilha 4 months ago

      You realize this is copy to attract more people to the product, right?

  • svantana 4 months ago

    Grok got to hold the top spot of LMArena-text for all of ~24 hours, good for them [1]. With stylecontrol enabled, that is. Without stylecontrol, gemini held the fort.

    [1] https://lmarena.ai/leaderboard/text

    • inkysigma 4 months ago

      Is it just me or is that link broken because of the cloudflare outage?

      Edit: nvm it looks to be up for me again

    • dyauspitr 4 months ago

      Grok is heavily censored though

  • ogig 4 months ago

    I just gave it a short description of a small game I had an idea for. It was 7 sentences. It pretty much nailed a working prototype, using React, clean CSS, TypeScript, and state management. It even implemented a Gemini query using the API for strategic analysis given a game state. I'm more than impressed, I'm terrified. Seriously thinking of a career change.

    • wraptile 4 months ago

      I find it funny to find this almost exact same post in every new model release thread. Yet here we are - spending the same amount of time, if not more, finishing the rest of the owl.

    • WhyOhWhyQ 4 months ago

      I just spent 12 hours a day for a month and a half vibe coding with Claude (which has SWE benchmarks equal to Gemini 3's). I started out terrified, but eventually I realized these tools are still remarkably far away from replacing a real software engineer. For prototypes they're amazing, but when you're just straight vibe coding you get stuck in a hell where you don't want to, or can't efficiently, check what's going on under the hood, yet it's not really doing the thing you want.

      Basically these tools can get you to a 100k LOC project without much effort, but it's not going to be a serious product. A serious product still requires understanding.

    • osn9363739 4 months ago

      Can you share the code?

    • brcmthrowaway 4 months ago

      To what?

  • yomismoaqui 4 months ago

    From an initial testing of my personal benchmark it works better than Gemini 2.5 pro.

    My use case is using Gemini to help me test a card game I'm developing. The model simulates the board state and when the player has to do something it asks me what card to play, discard... etc. The game is similar to something like Magic the Gathering or Slay the Spire with card play inspired by Marvel Champions (you discard cards from your hand to pay the cost of a card and play it)

    The test is just feeding the model the game rules document (markdown) with a prompt asking it to simulate the game delegating the player decisions to me, nothing special here.

    It seems like it forgets rules less than Gemini 2.5 Pro using thinking budget to max. It's not perfect but it helps a lot to test little changes to the game, rewind to a previous turn changing a card on the fly, etc...

  • mpeg 4 months ago

    Well, it just found a bug in one shot that Gemini 2.5 and GPT5 failed to find in relatively long sessions. Claude 4.5 had found it but not one shot.

    Very subjective benchmark, but it feels like the new SOTA for hard tasks (at least for the next 5 minutes until someone else releases a new model)

  • sd9 4 months ago

    How long does it typically take after this to become available on https://gemini.google.com/app ?

    I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."

    • mpeg 4 months ago

      Allegedly it's already available in stealth mode if you choose the "canvas" tool and 2.5. I don't know how true that is, but it is indeed pumping out some really impressive one shot code

      Edit: Now that I have access to Gemini 3 preview, I've compared the results of the same one shot prompts on the gemini app's 2.5 canvas vs 3 AI studio and they're very similar. I think the rumor of a stealth launch might be true.

    • netdur 4 months ago

      On gemini.google.com, I see options labeled 'Fast' and 'Thinking.' The 'Thinking' option uses Gemini 3 Pro

    • magicalhippo 4 months ago

      > https://gemini.google.com/app

      How come I can't even see prices without logging in... they doing regional pricing?

    • csomar 4 months ago

      It's already available. I asked it "how smart are you really?" and it gave me the same ai garbage template that's now very common on blog posts: https://gist.githubusercontent.com/omarabid/a7e564f09401a64e...

    • Squarex 4 months ago

      Today, I guess. They weren't releasing preview models this time; it seems they want to synchronize the release.

    • Romario77 4 months ago

      It's available in cursor. Should be there pretty soon as well.

  • santhoshr 4 months ago

    Pelican riding a bicycle: https://pasteboard.co/CjJ7Xxftljzp.png

    • xnx 4 months ago

      2D SVG is old news. Next frontier is animated 3D. One shot shows there's still progress to be made: https://aistudio.google.com/apps/drive/1XA4HdqQK5ixqi1jD9uMg...

    • mohsen1 4 months ago

      Sometimes I think I should spend $50 on Upwork to get a real human artist to do it first, so we know what we're aiming for: what does a good pelican-riding-a-bicycle SVG actually look like?

    • robterrell 4 months ago

      At this point I'm surprised they haven't been training on thousands of professionally-created SVGs of pelicans on bicycles.

    • arresin 4 months ago

      It’s a good pelican. Not great but good.

  • markdog12 4 months ago

    I asked it to analyze my tennis serve. It was just dead wrong. For example, it said my elbow was bent. I had to show it a still image of full extension on contact, then it admitted, after reviewing again, it was wrong. Several more issues like this. It blamed it on video being difficult. Not very useful, despite the advertisements: https://x.com/sundarpichai/status/1990865172152660047

    • BoorishBears 4 months ago

      The default FPS it's analyzing video at is 1, and I'm not sure the max is anywhere near enough to catch a full speed tennis serve.

    • strange_quark 4 months ago

      I’ve never seen such a huge delta between advertised capabilities and real world experience. I’ve had a lot of very similar experiences to yours with these models where I will literally try verbatim something shown in an ad and get absolutely garbage results. Do these execs not use their own products? I don’t understand how they are even releasing this stuff.

  • bilekas 4 months ago

    > The Gemini app surpasses 650 million users per month, more than 70% of our Cloud customers use our AI, 13 million developers have built with our generative models, and that is just a snippet of the impact we’re seeing

    Not to be a negative nelly, but these numbers are definitely inflated due to Google literally pushing their AI into everything they can, much like M$. Can't even search google without getting an AI response. Surely you can't claim those numbers are legit.

    • lalitmaganti 4 months ago

      > Gemini app surpasses 650 million users per month

      Unless these numbers are just lies, I'm not sure how this is "pushing their AI into everything they can", especially on iOS, where every user went to the App Store and downloaded it. Admittedly, Gemini is preinstalled on Android these days, but it's still a choice users make to go there, rather than an existing product they happen to use otherwise.

      Now OTOH "AI overviews now have two billion users" can definitely be criticised in the way you suggest.

    • Yizahi 4 months ago

      This is the benefit of bundling; I've been forecasting it for a long time: the only companies to win the LLM race will be the megacorps bundling their offerings, and at most maybe OAI, thanks to sheer marketing dominance.

      For example, I don't pay for ChatGPT or Claude, even if they are better at certain tasks or in general. But I have a Google One cloud storage sub for my photos, and it apparently comes with Gemini Pro (thanks to someone on HN for pointing it out). So Gemini is my go-to LLM app/service. I suspect the same goes for many others.

    • joaogui1 4 months ago

      It says Gemini App, not AI Overviews, AI Mode, etc

    • alecco 4 months ago

      Yeah my business account was forced to pay for an AI. And I only used it for a couple of weeks when Gemini 2.5 was launched, until it got nerfed. So they are definitely counting me there even though I haven't used it in like 7 months. Well, I try it once every other month to see if it's still crap, and it always is.

      I hope Gemini 3 is not the same and it gives an affordable plan compared to OpenAI/Anthropic.

    • blinding-streak 4 months ago

      Gemini app != Google search.

      You're implying they're lying?

  • wohoef 4 months ago

    Curious to see it in action. Gemini 2.5 has already been very impressive as a study buddy for courses like set theory, information theory, and automata. Although I’m always a bit skeptical of these benchmarks. Seems quite unlikely that all of the questions remain out of their training data.

  • DanMcInerney 4 months ago

    A 50% increase over ChatGPT 5.1 on ARC-AGI2 is astonishing. If that's true and representative (a big if), it lends credence to this being the first of the very consistent agentically-inclined models because it's able to follow a deep tree of reasoning to solve problems accurately. I've been building agents for a while and thus far have had to add many many explicit instructions and hardcoded functions to help guide the agents in how to complete simple tasks to achieve 85-90% consistency.

    • machiaweliczny 4 months ago

      I think it's mostly due to improvements in vision; ARC-AGI-2 is very visual.

    • puttycat 4 months ago

      Where is this figure taken from?

  • srameshc 4 months ago

    I think I'm in an AI-fatigue phase. I'm past all the hype around models, tools, and agents, and back to a problem-and-solution approach: sometimes code gen with AI, sometimes thinking first and asking for a piece of code. But not offloading everything to AI, buying all the bs, and waiting for it to do magic with my codebase.

    • amelius 4 months ago

      Yeah, at this point I want to see the failure modes. Show me at least as many cases where it breaks. Otherwise, I'll assume it's an advertisement and I'll skip to the next headline. I'm not going to waste my time on it anymore.

    • jstummbillig 4 months ago

      I think it's fun to see what is not even considered magic anymore today.

    • Kiro 4 months ago

      I agree but if Gemini 3 is as good as people on HN said about the preview, then this is the wrong announcement to sleep on.

    • SchemaLoad 4 months ago

      My test for the state of AI is "Does Microsoft Teams still suck?", if it does still suck, then clearly the AIs were not capable of just fixing the bugs and we must not be there yet.

    • m3kw9 4 months ago

      It's not AI fatigue; it's that you need to shift modes and stop paying so much attention to the latest and greatest as they all leapfrog each other every month. Just stick to one and ride it through the ups and downs.

    • strangescript 4 months ago

      And by this time next year, this comment is going to look very silly

  • mil22 4 months ago

    It's available to be selected, but the quota does not seem to have been enabled just yet.

    "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."

    "You've reached your rate limit. Please try again later."

    Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled.

    • sarreph 4 months ago

      Looks to be available in Vertex.

      I reckon it's an API key thing... you can more explicitly select a "paid API key" in AI Studio now.

    • CjHuber 4 months ago

      For me it's up and running. I was doing some work in AI Studio when it was released and have already rerun a few prompts. Interesting also that you can now set the thinking level to low or high. I hope it does something; in 2.5, increasing the maximum thinking tokens never made it think more.

    • lousken 4 months ago

      I hope some users will switch from cerebras to free up those resources

    • r0fl 4 months ago

      Works for me.

    • misiti3780 4 months ago

      seeing the same issue.

  • nickandbro 4 months ago

    What we have all been waiting for:

    "Create me a SVG of a pelican riding on a bicycle"

    https://www.svgviewer.dev/s/FfhmhTK1

    • Thev00d00 4 months ago

      That is pretty impressive.

      So impressive it makes you wonder if someone noticed it being used as a benchmark prompt.

    • bitshiftfaced 4 months ago

      It hadn't occurred to me until now that the pelican could overcome the short legs issue by not sitting on the seat and instead put its legs inside the frame of the bike. That's probably closer to how a real pelican would ride a bike, even if it wasn't deliberate.

  • dudeinhawaii 4 months ago

    Gemini has been so far behind agentically it's comical. I'll be giving it a shot but it has a herculean task ahead of itself. It has to not only be "good enough" but a "quantum leap forward".

    That said, OpenAI was in the same place earlier in the year and very quickly became the top agentic platform with GPT-5-Codex.

    The AI crowd is surprisingly not sticky. Coders quickly move to whatever the best model is.

    Excited to see Gemini making a leap here.

    • catigula 4 months ago

      Claude is still a better agent for software professionals though it is less capable, so there isn't nothing to having the incumbent advantage.

    • ryandrake 4 months ago

      I don't even know what the fuck "agentic" is or why the hell I would want it all over my software. So tired of everything in the computing world today.

  • senfiaj 4 months ago

    Haven't used Gemini much, but when I did, it often refused to do certain things that ChatGPT did happily, probably because many things are heavily censored. Obviously, a huge company like Google is under much heavier regulatory scrutiny than OpenAI. Unfortunately this greatly reduces its usefulness in many situations, despite Google having more resources and computational power than OpenAI.

  • nighwatch 4 months ago

    I just tested the Gemini 3 preview as well, and its capabilities are honestly surprising. As an experiment I asked it to recreate a small slice of Zelda, nothing fancy, just a mock interface and a very rough combat scene. It managed to put together a pretty convincing UI using only SVG, and even wired up some simple interactions.

    It’s obviously nowhere near a real game, but the fact that it can structure and render something that coherent from a single prompt is kind of wild. Curious to see how far this generation can actually go once the tooling matures.

  • qustrolabe 4 months ago

    Of all the companies, Google provides the most generous free access so far. I bet this gives them plenty of data to train even better models.

  • primaprashant 4 months ago

    Created a summary of comments from this thread about 15 hours after it had been posted and had 814 comments with gemini-3-pro and gpt-5.1 using this script [1]:

    - gemini-3-pro summary: https://gist.github.com/primaprashant/948c5b0f89f1d5bc919f90...

    - gpt-5.1 summary: https://gist.github.com/primaprashant/3786f3833043d8dcccae4b...

    The summary from GPT 5.1 is significantly longer and more verbose than the one from Gemini 3 Pro (13,129 output tokens vs 3,776). The Gemini 3 summary seems more readable; however, the GPT 5.1 one has interesting insights that Gemini missed.

    Last time I did this comparison, at the time of the GPT 5 release [2], the summary from Gemini 2.5 Pro was much better and more readable than the GPT 5 one. This time the Gemini 3 summary is still very readable, while GPT 5.1 feels improved but not quite there yet.

    [1]: https://gist.github.com/primaprashant/f181ed685ae563fd06c49d...

    [2]: https://news.ycombinator.com/item?id=44835029

  • icyfox 4 months ago

    Pretty happy the under 200k token pricing is staying in the same ballpark as Gemini 2.5 Pro:

    Input: $1.25 -> $2.00 (1M tokens)

    Output: $10.00 -> $12.00

    Squeezes a bit more margin out of app layer companies, certainly, but there's a good chance that for tasks that really require a sota model it can be more than justified.

    • rudedogg 4 months ago

      Every recent release has bumped the pricing significantly. If I was building a product and my margins weren’t incredible I’d be concerned. The input price almost doubled with this one.

  • zone411 4 months ago

    Sets a new record on the Extended NYT Connections: 96.8. Gemini 2.5 Pro scored only 57.6. https://github.com/lechmazur/nyt-connections/

  • King-Aaron 4 months ago

    > it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month

    "Incredible"! When they insert it into literally every Google search without an option to disable it, how incredibly shocking that so many people use it.

  • icapybara 4 months ago

    Anyone know how Gemini CLI with this model compares to Codex and Claude Code?

  • dr_dshiv 4 months ago

    Make a pelican riding a bicycle in 3d: https://gemini.google.com/share/def18e3daa39

    Amazing and hilarious

  • recitedropper 4 months ago

    Who wants to bet they benchmaxxed ARC-AGI-2? Nothing in their release implies they found some sort of "secret sauce" that justifies the jump.

    Maybe they are keeping that itself secret, but more likely they probably just have had humans generate an enormous number of examples, and then synthetically build on that.

    No benchmark is safe, when this much money is on the line.

    • sosodev 4 months ago

      Here's some insight from Jeff Dean and Noam Shazeer's interview with Dwarkesh Patel https://youtu.be/v0gjI__RyCY&t=7390

      > When you think about divulging this information that has been helpful to your competitors, in retrospect is it like, "Yeah, we'd still do it," or would you be like, "Ah, we didn't realize how big a deal transformer was. We should have kept it indoors." How do you think about that?

      > Some things we think are super critical we might not publish. Some things we think are really interesting but important for improving our products; We'll get them out into our products and then make a decision.

    • HarHarVeryFunny 4 months ago

      I'd also be curious what kind of tools they are providing to get the jump from Pro to Deep Think (with tools) performance. ARC-AGI specialized tools?

    • horhay 4 months ago

      They ran the tests themselves only on semi-private evals. Basically the same caveat as when o3 supposedly beat ARC1

  • CephalopodMD 4 months ago

    What I'm getting from this thread is that people have their own private benchmarks. It's almost a cottage industry. Maybe someone should crowd source those benchmarks, keep them completely secret, and create a new public benchmark of people's private AGI tests. All they should release for a given model is the final average score.

  • bespokedevelopr 4 months ago

    Wow so the polymarket insider bet was true then..

    https://old.reddit.com/r/wallstreetbets/comments/1oz6gjp/new...

    • giarc 4 months ago

      These prediction markets are so ripe for abuse it's unbelievable. People need to realize there are real people on the other side of these bets. Brian Armstrong, CEO of Coinbase, intentionally altered the outcome of a bet by randomly stating "Bitcoin, Ethereum, blockchain, staking, Web3" at the end of an earnings call. These types of bets shouldn't be allowed.

    • fresh_broccoli 4 months ago

      In hindsight, one possible reason to bet on November 18 was the deprecation date of older models: https://www.reddit.com/r/singularity/comments/1oom1lq/google...

  • creddit 4 months ago

    Gemini 3 is crushing my personal evals for research purposes.

    I would cancel my ChatGPT sub immediately if Gemini had a desktop app and may still do so if it continues to impress my as much as it has so far and I will live without the desktop app.

    It's really, really, really good so far. Wow.

    Note that I haven't tried it for coding yet!

    • energy123 4 months ago

      I would personally settle for a web app that isn't slow. The difference in speed (latency, lag) between ChatGPT's fast web app and Gemini's slow web app is significant. AI Studio is slightly better than Gemini, but try pasting in 80k tokens and then typing some additional text and see what happens.

    • ethmarks 4 months ago

      Genuinely curious here: why is the desktop app so important?

      I completely understand the appeal of having local and offline applications, but the ChatGPT desktop app doesn't work without an internet connection anyways. Is it just the convenience? Why is a dedicated desktop app so much better than just opening a browser tab or even using a PWA?

      Also, have you looked into open-webui or Msty or other provider-agnostic LLM desktop apps? I personally use Msty with Gemini 2.5 Pro for complex tasks and Cerebras GLM 4.6 for fast tasks.

  • JacobiX 4 months ago

    Tested it on a bug that Claude and ChatGPT Pro struggled with, it nailed it, but only solved it partially (it was about matching data using a bipartite graph). Another task was optimizing a complex SQL script: the deep-thinking mode provided a genuinely nuanced approach using indexes and rewriting parts of the query. ChatGPT Pro had identified more or less the same issues. For frontend development, I think it’s obvious that it’s more powerful than Claude Code, at least in my tests, the UIs it produces are just better. For backend development, it’s good, but I noticed that in Java specifically, it often outputs code that doesn’t compile on the first try, unlike Claude.

    • skrebbel 4 months ago

      > it nailed it, but only solved it partially

      Hey either it nailed it or it didn't.

  • jpkw 4 months ago

    Hoping someone here may know the answer to this: do any of the current benchmarks penalize false answers in any meaningful way, beyond what a typical test does? (In a typical test, giving any answer at all beats saying "I don't know", since a guess at least has a chance of being correct, which in the real world is bad.) I want an LLM that tells me when it doesn't know something. A model that gives me an accurate response 90% of the time and an inaccurate one 10% of the time is less useful than one that gives an accurate answer 10% of the time and says "I don't know" the other 90%.

    • rocqua 4 months ago

      Those numbers are too good to expect. If 90% right / 10% wrong is the baseline, would you take as an improvement:

      - 80% right / 18% "I don't know" / 2% wrong
      - 50% / 48% / 2%
      - 10% / 90% / 0%
      - 80% / 15% / 5%

      The general point being that to reduce wrong answers you will need to accept some reduction in right answers, if you want the change to come only through trade-offs. Otherwise you are just saying "I'd like a better system", which is rather obvious.

      Personally I'd take like 70/27/3. Presuming the 70% of right answers aren't all the trivial questions.
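
      One way to make the trade-off above concrete is an exam-style scoring rule where abstaining scores zero and wrong answers carry a penalty; the reward/penalty weights below are arbitrary, purely for illustration:

```python
def expected_score(right, idk, wrong, reward=1.0, penalty=4.0):
    # Per-question expected score when a wrong answer costs `penalty`
    # and abstaining ("I don't know") scores zero.
    assert abs(right + idk + wrong - 1.0) < 1e-9
    return reward * right - penalty * wrong

baseline = expected_score(0.90, 0.00, 0.10)  # always answers: 0.9 - 0.4 = 0.50
cautious = expected_score(0.70, 0.27, 0.03)  # abstains when unsure: 0.7 - 0.12 = 0.58
```

      Under this (hypothetical) penalty the 70/27/3 model beats the 90/0/10 one, which is exactly the ranking the comment argues for.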

    • energy123 4 months ago

      OpenAI uses SimpleQA to assess hallucinations

  • eknkc 4 months ago

    Looks like it is already available in VS Code Copilot. Just tried a prompt that was not returning anything good with Sonnet 4.5. (I did not spend much time on it, but the prompt was already there on the chat screen, so I switched the model and sent it again.)

    Gemini 3 worked much better and I actually committed the changes it created. I don't mean it's revolutionary or anything, but it provided a nice summary of my request and created a decent, simple solution. Sonnet had created a bunch of overarching changes that I would not even bother reviewing. Seems nice. Will probably use it for 2 weeks until someone else releases a 1.0001x better model.

    • flyinglizard 4 months ago

      You were probably stuck at some local minimum, avoidable by simply switching to another model.

  • ponyous 4 months ago

    Can’t wait to test it out. Been running tons of benchmarks (1,000+ generations) for my AI-to-CAD-model project and noticed:

    - GPT-5 medium is the best

    - GPT-5.1 falls right between Gemini 2.5 Pro and GPT-5 but it’s quite a bit faster

    Really wonder how well Gemini 3 will perform

  • mrinterweb 4 months ago

    Hit the Gemini 3 quota on the second prompt in Antigravity even though I'm a Pro user. I highly doubt my prompt hit a context-window limit. Hopefully it is just first-day jitters of near-general availability.

  • GodelNumbering 4 months ago

    And of course they hiked the API prices

    Standard Context(≤ 200K tokens)

    Input $2.00 vs $1.25 (Gemini 3 pro input is 60% more expensive vs 2.5)

    Output $12.00 vs $10.00 (Gemini 3 pro output is 20% more expensive vs 2.5)

    Long Context(> 200K tokens)

    Input $4.00 vs $2.50 (same +60%)

    Output $18.00 vs $15.00 (same +20%)

    • panarky 4 months ago

      Claude Opus is $15 input, $75 output.

    • xnx 4 months ago

      If the model solves your needs in fewer prompts, it costs less.

    • CjHuber 4 months ago

      Is this the first time long context has had separate pricing? I hadn't encountered that before.

  • gertrunde 4 months ago

    "AI Overviews now have 2 billion users every month."

    "Users"? Or people that get presented with it and ignore it?

    • mNovak 4 months ago

      Maybe you ignore it, but Google has stated in the past that click-through rates with AI overviews are way down. To me, that implies the 'user' read the summary and got what they needed, such that they didn't feel the need to dig into a further site (ignoring whether that's a good thing or not).

      I'd be comfortable calling a 'user' anyone who clicked to expand the little summary. Not sure what else you'd call them.

    • singhrac 4 months ago

      They're a bit less bad than they used to be. I'm not exactly happy about what this means to incentives (and rewards) for doing research and writing good content, but sometimes I ask a dumb question out of curiosity and Google overview will give it to me (e.g. "what's in flower food?"). I don't need GPT 5.1 Thinking for that.

    • recitedropper 4 months ago

      "Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month."

      Cringe. To get to 2 billion a month they must be counting anyone who sees an AI overview as a user. They should just go ahead and claim the "most quickly adopted product in history" as well.

  • syedshahmir7214 4 months ago

    From the last few releases of these models, across all the companies, I have not observed much improvement in the responses. The claims and launches are a little overhyped.

  • jordanpg 4 months ago

    What is Gemini 3 under the hood? Is it still just a basic LLM based on transformers? Or are there all kinds of other ML technologies bolted on now? I feel like I've lost the plot.

    • meowface 4 months ago

      I am very ignorant in this field but I am pretty sure under the hood they are all still fundamentally built on the transformer architecture, or at least innovations on the original transformer architecture.

    • anilgulecha 4 months ago

      It's a mixture-of-experts model: basically N smaller expert networks behind a learned router, and at inference time only a small subset of them (the top-scoring experts for each token) is active. Each expert tends to end up tuned to, and good in, one area.
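
      A toy sketch of the routing idea (common MoE setups route each token to a few top-scoring experts rather than exactly one; the names and shapes here are illustrative, not Gemini's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, experts, router, top_k=2):
    # One router score per expert; only the top_k experts actually run.
    scores = router @ x
    top = np.argsort(scores)[-top_k:]
    gate = np.exp(scores[top] - scores[top].max())
    gate /= gate.sum()  # softmax over the selected experts only
    # Mix the selected experts' outputs by their gate weights.
    return sum(g * (experts[i] @ x) for g, i in zip(gate, top))

num_experts, d = 8, 16
experts = rng.standard_normal((num_experts, d, d))  # toy "expert" = one matrix each
router = rng.standard_normal((num_experts, d))
y = moe_forward(rng.standard_normal(d), experts, router)
```

      The sparsity is the point: parameter count scales with the number of experts, but per-token compute only scales with top_k.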

    • becquerel 4 months ago

      The industry is still seeing how far they can take transformers. We've yet to reach a dollar value where it stops being worth pumping money into them.

  • bluecalm 4 months ago

    I've asked it (Thinking, on Gemini 3) about the difference between the Plus and Pro plans. First it thought I was asking for a comparison between Gemini and ChatGPT, as it claimed there is no "Plus" plan for Gemini. After I insisted I am on this very plan right now, it apologized and told me it does in fact exist. Then it told me the difference is that I get access to newer models with the Pro subscription. That is despite Google's own plan-comparison page showing I get access to Gemini 3 on both plans.

    It also told me that on Plus I am most likely using "Flash" model. There is no "Flash" model in the dropdown to choose from. There is only "Fast" and "Thinking". It then told me "Fast" is just renamed Flash and it likely uses Gemini 2.5. On the product comparison page there is nothing about 2.5, it only mentions version 3 for both Plus and Pro plans. Of course on the dropdown menu it's impossible to see which model it is really using.

    How can a normal person understand their products when their own super advanced thinking/reasoning model that took months to train on world's most advanced hardware can't?

    It's amazing to me they don't see it as an epic failure in communication and marketing.

  • aerhardt 4 months ago

    Combining structured outputs with search is the API feature I was looking for. Honestly crazy that it wasn’t there to start with - I have a project that is mostly Gemini API but I’ve had to mix in GPT-5 just for this feature.

    I still use ChatGPT and Codex as a user but in the API project I’ve been working on Gemini 2.5 Pro absolutely crushed GPT-5 in the accuracy benchmarks I ran.

    As it stands Gemini is my de facto standard for API work and I’ll be following very closely the performance of 3.0 in coming weeks.

  • rubymamis 4 months ago

    I gave it the task to recreate StackView.qml to feel more native on iOS and it failed, like all other models...

    Prompt:

    Instead of the current StackView, I want you to implement a new StackView that will have a similar api with the differences that:

    1. It automatically handles swiping to the previous page/item. If not mirrored, it should detect swiping from the left edge; if mirrored, it should detect from the right edge. It's important that swiping be responsive, that is, that the previous item is visible under the current item while swiping, the same way it's handled in iOS applications. You should also add to the API the option for the swipe to be detected not just from the edge but from anywhere on the item, with the same behavior. If the swipe is released with x% of the current item out of view, then we should animate and move to the previous item. If it's a small percentage, we should animate the current page back to its place as if nothing happened.

    2. The current page transitions are horrible and look nothing like native iOS transitions. Please make the transitions feel the same.

  • BugsJustFindMe 4 months ago

    The Gemini AI Studio app builder (https://aistudio.google.com/apps) refuses to generate python files. I asked it for a website, frontend and python back end, and it only gave a front end. I asked again for a python backend and it just gives repeated server errors trying to write the python files. Pretty shit experience.

  • abixb 4 months ago

    Okay, Gemini 3.0 Pro has officially surpassed Claude 4.5 (and GPT-5.1) as the top ranked model based on my private evals (multimodal reasoning w/ images/audio files and solving complex Caesar/transposition ciphers, etc.).

    Claude 4.5 solved them as well (the Caesar/transposition ciphers), but Gemini 3.0 Pro's method and approach were a lot more elegant. Just my $0.02.
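
    For context, the Caesar half of such an eval is mechanical for a program; brute-forcing all 26 shifts takes a few lines (example ciphertext is mine, and picking the readable candidate, e.g. by letter frequency, is the part an LLM does "by eye"):

```python
def caesar_shifts(ciphertext):
    # Return all 26 candidate decryptions of a Caesar cipher,
    # shifting letters back and leaving other characters alone.
    out = []
    for shift in range(26):
        plain = []
        for ch in ciphertext:
            if ch.isalpha():
                base = ord('A') if ch.isupper() else ord('a')
                plain.append(chr((ord(ch) - base - shift) % 26 + base))
            else:
                plain.append(ch)
        out.append(''.join(plain))
    return out

candidates = caesar_shifts("Khoor, zruog!")  # shift 3 yields "Hello, world!"
```

    The interesting part of the eval is therefore not the arithmetic but whether the model recognizes the cipher family and searches systematically.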

  • WhyOhWhyQ 4 months ago

    Why doesn't this spell the death of OpenAI? Maybe someone with a better business sense can explain, but here's what I'm seeing:

    OpenAI is going for the consumer-grade AI market, as opposed to a company like Anthropic making a specialized developer tool. Google can inject their AI tool in front of everybody in the world, and already have with Google AI search. All of these models are just going to reach parity eventually, but Google is burning cash compared to OpenAI burning debt. It seems like for consumer-grade purposes, AI use will just be free sooner or later (DeepSeek is free, Google AI search is free, students can get Gemini Pro for free for a year already). So all I'm seeing that OpenAI has is Sora, which seems like a business loser though I don't really understand it, and also ChatGPT seems to own the market of people roleplaying with chat bots as companions (which doesn't really seem like a multi-trillion dollar business but I could be wrong).

    • nextworddev 4 months ago

      Yep. Except OpenAI is mainly burning LP money (saudis, softbank, pension funds)

  • davide_benato 4 months ago

    I would love to see how Gemini 3 can solve this particular problem. https://lig-membres.imag.fr/benyelloul/uherbert/index.html

    It used to be an algorithmic game for a Microsoft student competition that ran in the mid/late 2000s. The game invents a new, very simple, recursive language to move the robot (Herbert) on a board and catch all the dots while avoiding obstacles. Amazingly, this clone's executable still works today on Windows machines.

    The interesting thing is that there is virtually no training data for this problem, and the rules of the game and the language are pretty clear and fit into a prompt. The levels can be downloaded from that website and they are text based.

    What I noticed last time I tried is that none of the publicly available models could solve even the most simple problem. A reasonably decent programmer would solve the easiest problems in a very short amount of time.

  • iib 4 months ago

    As soon as I found out that this model had launched, I tried giving it a problem that I have been trying to code in Lean 4 (showing that quicksort preserves multiplicity). All the other frontier models I tried failed.

    I used the Pro version and it started out well (as they all did), but it couldn't prove it. The interesting part is that it typoed the name of a tactic, spelling it "abjel" instead of "abel", even though it correctly named the concept. I didn't expect the model to make this kind of error, because they all seem so good at programming lately, and none of the other models did, although they made some other naming errors.

    I am sure I can get it to solve the problem with good context engineering, but it's interesting to see how they struggle with lesser represented programming languages by themselves.

  • energy123 4 months ago

    With the $20/m subscription, do we get it on "Low" or "High" thinking level?

  • aliljet 4 months ago

    When will this be available in the cli?

  • alach11 4 months ago

    This is a really impressive release. It's probably the biggest lead we've seen from a model since the release of GPT-4. Seems likely that OpenAI rushed out GPT-5.1 to beat the Gemini 3 release, knowing that their model would underperform it.

  • misja111 4 months ago

    I asked Gemini to solve today's Countle puzzle (https://www.countle.org/). It got stuck while iterating randomly trying to find a solution. While I'm writing this it has been trying already for 5 minutes and the web page has become unresponsive.

    I also asked it for the best play in backgammon when the opponent rolls 6-1 (plays 13/7 8/7) and you roll 5-1. It starts out alright, mentioning a good move (13/8 6/5), but continues to hallucinate several alternative but illegal moves. I'm not too impressed.

  • deanc 4 months ago

    Antigravity seems to be a bit overwhelmed; unable to set up an account at the moment.

  • kanodiaayush 4 months ago

    I don't really understand the amount of ongoing negativity in the comments. This is not the first time a product has been nearly copied, and the experience for me is far superior to coding in a terminal. It comes with improvements, even if imperfect, and I'm excited for those! I've long wanted the ability to comment on code diffs instead of just writing things back down in chat. And I'm excited for the quality of Gemini 3.0 Pro, although I'm running into rate limits. I can already tell it's something I'm going to try out a lot!

    • rvnx 4 months ago

      It's not really good for real-life programming though; it invents a lot of imaginary things, cannot respect its own instructions, and forgets basic things (a variable is called "bananaDance", then it claims it is "bananadance", then later on "bananaDance" again).

      It is good at writing something from scratch (like spitting out its training set).

      Claude is still superior for programming and debugging. Gemini is better at daily life questions and creative writing.

  • catigula 4 months ago

    The problem with experiencing LLM releases nowadays is that it is no longer trivial to understand the differences in their vast intelligences, so it takes a while to really get a handle on what's even going on.

  • briga 4 months ago

    Every big new model release we see benchmarks like ARC and Humanity's Last Exam climbing higher and higher. My question is, how do we know that these benchmarks are not a part of the training set used for these models? It could easily have been trained to memorize the answers. Even if the datasets haven't been copy pasted directly, I'm sure it has leaked onto the internet to some extent.

    But I am looking forward to trying it out. I find Gemini to be great as handling large-context tasks, and Google's inference costs seem to be among the cheapest.

  • mark_l_watson 4 months ago

    I had a fantastic ‘first result’ with Gemini 3 but a few people on social media I respect didn’t. Key takeaway is to do your own testing with your use cases. I feel like I am now officially biased re: LLM infrastructure: I am retired, doing personal research and writing, and I decided months ago to drop OpenAI and Anthropic infrastructure and just use Google to get stuff done - except I still budget about two hours a week to experiment with local models and Chinese models’ APIs.

  • realty_geek 4 months ago

    I would like to try controlling my browser with this model. Any ideas how to do this? Ideally I would like something like OpenAI's Atlas or Perplexity's Comet, but powered by Gemini 3.

  • Der_Einzige 4 months ago

    When will they allow us to use modern LLM samplers like min_p, or even better samplers like top N sigma, or P-less decoding? They are provably SOTA and in some cases enable infinite temperature.

    Temperature continues to be gated to maximum of 0.2, and there's still the hidden top_k of 64 that you can't turn off.

    I love the google AI studio, but I hate it too for not enabling a whole host of advanced features. So many mixed feelings, so many unanswered questions, so many frustrating UI decisions on a tool that is ostensibly aimed at prosumers...
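
    For reference, min-p keeps only the tokens whose probability is at least some fraction of the top token's probability, then renormalizes; a minimal NumPy sketch (function name and example logits are mine, not any provider's API):

```python
import numpy as np

def min_p_filter(logits, min_p=0.1, temperature=1.0):
    # Temperature-scaled softmax, then zero out every token whose
    # probability falls below min_p * (top token's probability),
    # and renormalize over the survivors.
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

p = min_p_filter(np.array([4.0, 3.5, 1.0, -2.0]), min_p=0.2)
```

    Because the threshold scales with the top token's probability, the filter adapts to the model's confidence, which is why min-p stays coherent even at high temperatures where a fixed top-k or top-p cutoff degrades.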

  • espeed 4 months ago

    I paid for Gemini Pro. Am I getting Gemini 3 Pro (https://gemini.google.com)? "To be precise: You are currently interacting with Gemini 1.5 Pro." https://x.com/espeed/status/1991333475098718601

  • clusterhacks 4 months ago

    I wish I could just pay for the model and self-host on local/rented hardware. I'm incredibly suspicious of companies totally trying to capture us with these tools.

  • Retr0id 4 months ago

    > it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month

    Do regular users know how to disable AI Overviews, if they don't love them?

    • jeron 4 months ago

      it's as low tech as using adblock - select element and block

  • petesergeant 4 months ago

    Still insists the G7 photo[0] is doctored, and comes up with wilder and wilder "evidence" to support that claim, before getting increasingly aggressive.

    0: https://en.wikipedia.org/wiki/51st_G7_summit#/media/File:Pri...

  • DeathArrow 4 months ago

    It generated a quite cool pelican on a bike: https://imgur.com/a/yzXpEEh

    • rixed 4 months ago

      2025: solve the biking pelican problem

      2026: cure cancer

  • pclark 4 months ago

    I just want Gemini to access ALL my Google Calendars, not just the primary one. If they supported this I would be all in on Gemini. Does no one else want this?

  • mikeortman 4 months ago

    It's available for me now on gemini.google.com... but it's failing badly at accurate audio transcription.

    It transcribes the meeting but hallucinates badly, in both fast and thinking mode. Fast mode only transcribed about a fifth of the meeting before saying it was done. Thinking mode completely changed the topic and made up ENTIRE conversations. Gemini 2.5 actually transcribed it decently, with just occasional missteps when people talked over each other.

    I'm concerned.

  • zurfer 4 months ago

    It also tops LMSYS leaderboard across all categories. However knowledge cutoff is Jan 2025. I do wonder how long they have been pre-training this thing :D.

    • mudkipdev 4 months ago

      Isn't it the same cutoff as 2.5?

  • taf2 4 months ago

    I just wish Gemini could write well-formatted code. I do like the solutions it comes up with, and I know I can use a linter/formatter tool, but it would just be nice if, when I opened Gemini (CLI) and asked it to write a feature, it didn't mix up the indenting so badly... somehow Codex and Claude both get this right without any trouble...

  • lofaszvanitt 4 months ago

    A tad bit better, but it still has the same issues unpacking and understanding complex prompts. On a test of mine it now performs a bit better, but it still has zero understanding of what is happening and why. Gemini is the best of the best out there, but with complex problems it just goes down the drain :(.

  • hubraumhugo 4 months ago

    No gemini-3-flash yet, right? Any ETA on that mentioned? 2.5-flash has been amazing in terms of cost/value ratio.

    • 8note 4 months ago

      I've found Gemini 2.5 Flash works better (for agentic coding) than Pro, too.

  • nilsingwersen 4 months ago

    Feeling great to see something confidential

  • RobinL 4 months ago

    - Anyone have any idea why it says 'confidential'?

    - Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'. (That said, I don't have a paid plan, but I've always had pretty much unlimited access to 2.5 pro)

    [Edit: working for me now in ai studio]

  • oezi 4 months ago

    Probably invested a couple of billion into this release (it is great as far as I can tell), but they can't bring a proper UI to AI Studio for long prompts and responses (e.g. it animates new text being generated even when you just return to a tab that had already finished generating).

  • decide1000 4 months ago

    We hired a developer to build parsers for a complicated file format; it takes a week per parser. Gemini 3 is the first LLM that is able to create such a parser from scratch, and it does it very well: within a minute, one-shot, correct. I am blown away.

  • visioninmyblood 4 months ago

    Really exciting results on paper, but it would be truly interesting to see what data this was trained on. There is a thin line between accuracy improvements and training on user data. I hope the data used for training was obtained with consent from the creators.

  • auggierose 4 months ago

    > Gemini 3 is the best vibe coding and agentic coding model we’ve ever built

    Google goes full Apple...

  • jacky2wong 4 months ago

    What I loved about this release was that it was hyped up by a Polymarket leak with insider trading, NOT with nonsensical feel-the-AGI hype. Great model that's pushed the frontier of spatial reasoning by a long shot.

  • guluarte 4 months ago

    it is live in the api

    > gemini-3-pro-preview-ais-applets

    > gemini-3-pro-preview

    • spudlyo 4 months ago

      Can confirm. I was able to access it using GPTel in Emacs using 'gemini-3-pro-preview' as the model name.

  • BoorishBears 4 months ago

    So they won't release multimodal or Flash at launch, but I'm guessing people who blew smoke up the right person's backside on X are already building with it

    Glad to see Google still can't get out of its own way.

    • BoorishBears 4 months ago

      I don't want to be one of those assholes who only calls out when they were right: I was very wrong

  • taf2 4 months ago

    I had asked earlier in the day for gpt 5.1 high to refactor my apex visualforce page into a lightning component and it really didn’t do much here - Gemini 3 pro crushed this task… very promising

  • lofaszvanitt 4 months ago

    Oh that corpulent fella with glasses who talks in the video. Look how good mannered he is, he can't hurt anyone. But Google still takes away all your data and you will be forced out of your job.

  • scrollop 4 months ago

    Here it makes a text based video editor that works:

    https://youtu.be/MPjOQIQO8eQ?si=wcrCSLYx3LjeYDfi&t=797

  • thingsilearned 4 months ago

    I love that the recipe example is still being used as one of the main promising use cases for computers and now AGI. One day hopefully computers will solve that pressing problem...

  • thedelanyo 4 months ago

    Reading the introductory passage, all I can say now is: AI is here to stay.

  • I_am_tiberius 4 months ago

    I still need a google account to use it and it always asks me for a phone verification, which I don't want to give to google. That prevents me from using Gemini. I would even pay for it.

    • gpm 4 months ago

      > I would even pay for it.

      Is it just me, or is it generally the case that to pay for anything on the internet you have to enter credit card information, including a phone number?

  • sunaookami 4 months ago

    Gemini CLI crashes due to this bug: https://github.com/google-gemini/gemini-cli/issues/13050 and when applying the fix in the settings file I can't login with my Google account due to "The authentication did not complete successfully. The following products are not yet authorized to access your account" with useless links to completely different products (Code Assist).

    Antigravity uses Open-VSX and can't be configured differently even though it says it right there (setting is missing). Gemini website still only lists 2.5 Pro. Guess I will just stick to Claude.

  • energy123 4 months ago

    Impressive. Although the Deep Think benchmark results are suspicious given they're comparing apples (tools on) with oranges (tools off) in their chart to visually show an improvement.

  • vlmrun-admin 4 months ago

    https://www.youtube.com/watch?v=cUbGVH1r_1U

    side by side comparison of gemini with other models

  • CjHuber 4 months ago

    Interesting that they added an option to select your own API key right in AI studio‘s input field. I sincerely hope the times of generous free AIstudio usage are not over

  • nprateem 4 months ago

    OMG they've obviously had a major breakthrough because now it can reply to questions with actual answers instead of shit blog posts.

  • agentifysh 4 months ago

    My only complaint is I wish the SWE and agentic-coding results had been better, to justify the 1-2x premium.

    GPT-5.1 honestly looks very comfortable given the available usage limits and pricing.

    Although GPT-5.1 used from the ChatGPT website seems to be better for some reason.

    Sonnet 4.5 agentic coding is still holding up well, which confirms my own experience.

    I guess my reaction to Gemini 3 is a bit mixed, as coding is the primary reason many of us pay $200/month.

  • oceanplexian 4 months ago

    Suspicious that none of the benchmarks include Chinese models, even though they scored higher on these benchmarks than the models being compared?

  • jdthedisciple 4 months ago

    What I'd prefer over benchmarks is the answer to a simple question:

    What useful thing can it demonstrably do that its predecessors couldn't?

    • Ridius 4 months ago

      Keep the bubble expanding for a few months longer.

  • qingcharles 4 months ago

    Somebody "two-shotted" Mario Bros NES in HTML:

    https://www.reddit.com/r/Bard/comments/1p0fene/gemini_3_the_...

  • hamasho 4 months ago

    I just googled the latest LLM models and this page appeared at the top. It looks like Gemini 3 Pro can score 102% on high school math tests.

  • sylware 4 months ago

    Trained models should be able to use formal tools (for instance a logical solver, a computer?).

    Good. That said, I wonder if those models are still LLMs.

  • AstroBen 4 months ago

    First impression is I'm having a distinctly harder time getting this to stick to instructions as compared to Gemini 2.5

  • maczwei 4 months ago

    entity.ts is in types/entity.ts. It can't grasp that it should import it as "../types/entity"; instead it always writes "../types". I am using https://aistudio.google.com/apps

  • ilaksh 4 months ago

    Okay, since Gemini 3 is in AI Mode now, I switched from free Perplexity back to Google as my default search.

  • zen_boy 4 months ago

    Is the "thinking" dropdown option on gemini.google.com what the blog post refers to as Deep Think?

  • bilsbie 4 months ago

    Is there a way to use this without being in the whole google ecosystem? Just make a new account or something?

    • mtremsal 4 months ago

      If you mean the "consumer ecosystem", then Gemini 3 should be available as an API through Google's Vertex AI platform. If you don't even want a Google Cloud account, then I think the answer is no, unless they announce a partnership with an inference cloud like Cerebras.

    • tim333 4 months ago

      You could probably do a new account. I have the odd junk google account.

  • slackerIII 4 months ago

    What's the easiest way to set up automatic code review for PRs for my team on GitHub using this model?

  • pflenker 4 months ago

    > Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month.

    Come on, you can’t be serious.

    • muzani 4 months ago

      This is so disingenuous that it hurts the credibility of the whole thing.

  • raffkede 4 months ago

    Seems to be the first model that one-shots my secret benchmark about nested SQLite, and it did it in 30s.

    • osn9363739 4 months ago

      Out of interest. Does it one shot it every time?

  • gigatexal 4 months ago

    How does it do in coding tasks? I’ve been absolutely spoiled by Claude sonnet 4.5 thinking.

  • taikahessu 4 months ago

    Boring. Tried to explore sexuality related topics, but Alphabet is stuck in some Christianity Dark Ages.

    Edit: Okay, I admit I'm used to dealing with OpenAI models, and it seems you have to be extra careful with wording with Gemini. Once you have the right wording, like "explore my own sexuality", and avoid certain words, you can get it going pretty interestingly.

  • serjester 4 months ago

    It's disappointing there's no flash / lite version - this is where Google has excelled up to this point.

    • aoeusnth1 4 months ago

      Maybe they're slow-rolling the announcements to stay in the news longer.

  • AbstractH24 4 months ago

    Can someone ELI5 what the difference between AI Studio, Antigravity, and Colab is?

    • simlevesque 4 months ago

      AI Studio is a web chat.

      Antigravity is an IDE you install.

      Colab is a place to run notebooks in the cloud.

  • NullCascade 4 months ago

    I'm not a mathematician, but I think we underestimate how useful pure mathematics can be for telling whether we are approaching AGI.

    Can the mathematicians here try asking it to invent novel math related to [insert your field of specialization] and see if it comes up with something new and useful?

    Try lowering the temperature, use SymPy, etc.
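    A minimal sketch of the "use SymPy" suggestion (my own illustration; the identity and workflow are assumptions, not from the thread): instead of trusting the model's prose, have it emit a conjectured closed form and verify it symbolically.

```python
import sympy as sp

# Hypothetical model conjecture: sum_{k=1}^{n} k^3 == (n*(n+1)/2)**2
n, k = sp.symbols("n k", positive=True, integer=True)

lhs = sp.summation(k**3, (k, 1, n))  # SymPy derives the closed form of the sum
rhs = (n * (n + 1) / 2) ** 2

# If the difference simplifies to zero, the identity holds for all n
print(sp.simplify(lhs - rhs) == 0)  # True
```

    The same pattern scales to any field where a claim reduces to a symbolic identity; a result the model can't phrase this way is a result you can't cheaply check.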

    • ducttapecrown 4 months ago

      Terry Tao is writing about this on his blog.

  • thrownaway561 4 months ago

    Yea, great... when will I be able to have it dial a number on my Google Pixel? Seriously, Gemini absolutely sucks on Pixel since it can't interact with the phone itself, so it can't dial numbers.

  • elcapithanos 4 months ago

    > AI overviews now have 2 billion users every month

    More like 2 billion hostages

  • m3kw9 4 months ago

    If it ain't quantum leap, new models are just "OS updates".

  • fosterfriends 4 months ago

    Gemini 3 and 3 Pro are a good bit cheaper than Sonnet 4.5 as well. Big fan.

  • Frannky 4 months ago

    I tried it on a landing page. Very, very impressive.

  • Dquiroga 4 months ago

    I asked Gemini to write "a comment response to this thread. I want to start an intense discussion".

    Gemini 3:

    The cognitive dissonance in this thread is staggering.

    We are sitting here cheering for a model that effectively closes the loop on Google’s total information dominance, while simultaneously training our own replacements.

    Two things in this thread should be terrifying, yet are being glossed over in favor of "look at the pelican SVG":

    1. *The Model Card Leak:* As user `rvz` pointed out, the leaked model card implies training on "user data... in accordance with Google's relevant terms of service." Given the recent terms updates, this essentially confirms that Gemini 3 is likely refined on the corpus of Gmail and Drive data that they’ve been hoarding for decades. We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.

    2. *The "Agentic" Trap:* Everyone praising how `Gemini 3` solves complex bugs one-shot or handles entire PRs is missing the macro view. We aren’t looking at a productivity tool; we are looking at the devaluation of the software engineer. When the entry barrier drops to "can you prompt a seminal agent," the economic value of what we do on this forum evaporates.

    Google has successfully gamified us into feeding the very beast that will make the "14-minute human solve time" (referenced by `lairv`) irrelevant. We are optimizing for our own obsolescence while paying a monopoly rent to do it.

    Why is the sentiment here "Wow, cool clock widget" instead of "We just handed the keys to the kingdom to the biggest ad-tech surveillance machine in history"?

  • DrNosferatu 4 months ago

    Does anyone have any idea if/when it's coming to paid Perplexity?

  • 4 months ago
    [deleted]
  • eterm 4 months ago

    > It seems there's a date conflict. The prompt claims it's 2025, but my internal clock says otherwise.

    > I'm now zeroing in on the temporal aspect. Examining the search snippets reveals dates like "2025-10-27," suggesting a future context relative to 2024. My initial suspicion was that the system time was simply misaligned, but the consistent appearance of future dates strengthens the argument that the prompt's implied "present" is indeed 2025. I am now treating the provided timestamps as accurate for a simulated 2025. It is probable, however, that the user meant 2024.

    Um, huh? It's found search results for October 2025, but this has led it to believe it's in a simulated future, not a real one?

  • samuelknight 4 months ago

    "Gemini 3 Pro Preview" is in Vertex

  • pgroves 4 months ago

    I was hoping Bash would go away or get replaced at some point. It's starting to look like it's going to be another 20 years of Bash but with AI doodads.

  • pk-protect-ai 4 months ago

    It is pointless to ask an LLM to draw an ASCII unicorn these days. Gemini 3 draws one of these (depending on the prompt):

    https://www.ascii-art.de/ascii/uvw/unicorn.txt

    However, it is amazing how far spatial comprehension has improved in multimodal models.

    I'm not sure the below would be properly displayed on HN; you'll probably need to cut and paste it into a text editor.

    Prompt: Draw me an ASCII world map with tags or markings for the areas and special places.

    Temperature: 1.85

    Top-P 0.98

    Answer: Edit (replaced with URL) https://justpaste.it/kpow3

  • vivzkestrel 4 months ago

    Has anyone managed to use any of the AI models to build a complete 3D FPS game using WebGL or OpenGL?

  • beezlewax 4 months ago

    Can't wait til Gemini 4 is out!

  • keepamovin 4 months ago

    I don't want to shit on the much-anticipated G3 model, but I have been using it for a complex single-page task and find it underwhelming. Pro 2.5 level, beneath GPT 5.1. Maybe it's launch jitters. It struggles to produce more than 700 lines of code in a single file (AI Studio). It struggles to follow instructions. Revisions omit previous gains. I feel cheated! 2.5 Pro has been clearly smarter than everything else for a long time, but now 3 seems not even as good as that, compared to the latest releases (5.1 etc). What is going on?

  • hekkle 4 months ago

    GOOGLE: "We have a new product".

    REALITY: It's just 3 existing products rolled into one, one of which isn't even a Google product.

    - Microsoft's VS Code

    - Gemini

    - Chrome browser

  • kmeisthax 4 months ago

    The most devastating news out of this announcement is that Vending-Bench 2 came out and it has significantly less clanker[0] meltdowns than the first one. I mean, seriously? Not even one run where the model tried to stock goods that hadn't arrived yet, only for it to eventually try and fail to shut down the business, and then e-mail the FBI about the $2 daily fee being deducted from the bot?

    [0] Fake racial slur for a robot, LLM chatbot, or other automated system

  • smarx007 4 months ago

    Is it coming to Google Jules?

  • iamA_Austin 4 months ago

    it started with OpenAI and Google took the competition damn seriously.

  • dankobgd 4 months ago

    every day, new game changer

  • VladimiOrlovsky 4 months ago

        import decimal

        def solve_kangaroo_limit():
            # Set precision to handle the "digits different from six" requirement
            decimal.getcontext().prec = 50

            # For U(0,1), H(x) approaches 2x + 2/3 very rapidly (exponential decay of error)
            # At x = 10^6, the value is indistinguishable from the asymptote
            x = 10**6
            limit_value = decimal.Decimal(2) * x + decimal.Decimal(2) / decimal.Decimal(3)

            print(f"H({x}) ≈ {limit_value}")
            # Output: 2000000.66666666666666666666...

        if __name__ == "__main__":
            solve_kangaroo_limit()

    P.S. for the skeptics: decimal.Decimal(2) / decimal.Decimal(3) == 0.66666666666666666666666666666666666666666666666667... This is your 'smart' computer?
  • t_minus_40 4 months ago

    Is there even a puzzle or math problem Gemini 3 can't solve?

  • chiragsrvstv 4 months ago

    Waiting for Google to nuke this as well, just like 2.5 Pro.

  • testfrequency 4 months ago

    I continue to not use Gemini, as I can't opt out of training on my data while also keeping chat history.

    Yes, I know the Workspaces workaround, but that’s silly.

  • skerit 4 months ago

    Not the preview crap again. Haven't they tested it enough? When will it be available in Gemini-CLI?

  • XCSme 4 months ago

    How's the pelican?

  • cognitive-gl 4 months ago

    Wild

  • casey2 4 months ago

    The first paragraph is pure delusion. Why do investors like delusional CEOs so much? I would take it as a major red flag.

  • jennyholzer 4 months ago

    boooooooooooooo

  • 4 months ago
    [deleted]
  • gabriel666smith 4 months ago

    Interestingly, Gemini 3 Pro Thinking clocks in at 2nd place - behind Horizon Alpha, but ahead of any actually-released model checkpoints - on the TvP Benchmark, with a score of 71.8%.

    This blows Sonnet 4.5 (64.8%), GPT-5.1 (58.7%) and Opus 4.1 (66.3%) away.

    An example output:

    '''

    The cabin was cold. It smelled of pine needles and old dust. Bryan sat on the porch with a glass of scotch balanced on his knee. It was cheap scotch. He missed the good stuff, but he wasn't in Paris anymore, and he certainly wasn't in Beverly Hills.

    He checked his watch. It was a bulky thing, black rubber. He checked the Glock on the table next to him. That was black, too.

    "I told you," Bryan said. He wasn't talking to anyone. "I told you not to go into the woods."

    He was thinking about Kim. He was always thinking about Kim. She was safe, probably. At a mall in Santa Monica. Or maybe a library. But it felt like she was gone. It always felt like she was gone.

    The trees at the edge of the clearing shifted. There wasn't any wind. The air was heavy and humid, unusual for October. A bird stopped singing. Then the crickets stopped. It got very quiet.

    Bryan took a sip of the scotch. It burned his throat. He liked the burn. It reminded him he was still there.

    Three red dots appeared on his chest. They made a small triangle.

    Bryan didn't look at them. He set the glass down on the railing. He picked up the phone. It was an old flip phone. He pressed a button.

    "I don't know who you are," Bryan said into the phone. "I don't know what you want. If you are looking for ransom, I can tell you I don't have money."

    The leaves rustled. A shape began to form out of the empty air. It looked like heat rising off the highway in August. It was tall. Much taller than a man.

    "But what I do have," Bryan said, "are a very particular set of skills."

    The shape solidified. It was ugly. It had skin like a lizard and dreadlocks that looked like heavy black cables. It wore a metal mask. It made a clicking sound. Click-click-click. Like a Geiger counter.

    "Skills I have acquired over a very long career," Bryan said. He stood up. His knees popped. He felt the ache in his lower back. He was getting too old for this.

    The thing roared. It wasn't a human sound. It was a wet, guttural noise.

    "I will look for you," Bryan said. He put the phone in his pocket. "I will find you."

    The creature raised a metal gauntlet. Two jagged blades slid out. They caught the moonlight.

    "And I will kill you," Bryan said.

    He raised the Glock and fired twice. Pop-pop.

    The creature didn't fall. It took a step forward. Green blood, thick and glowing, leaked from its shoulder. It looked like antifreeze. The creature touched the wound, then looked at its hand. It seemed surprised.

    Bryan didn't wait. He moved off the porch. He moved fast for a man his age. He went left, toward the woodpile. The creature fired a blast of blue light from its shoulder. It hit the porch swing. The swing exploded. Wood splinters rained down like confetti.

    Bryan rolled behind the logs. He breathed in. He breathed out. He could hear the heavy footsteps crunching the dry leaves. Crunch. Crunch.

    "You're big," Bryan whispered. "That makes you slow."

    He grabbed a splitting maul from the stump. It was heavy. The handle was smooth hickory.

    The creature came around the woodpile. It was scanning the trees. It didn't see Bryan down low. Bryan swung the maul. He put his hips into it. He swung it like he was chopping a cord of oak.

    The blade hit the creature in the knee. There was a snap. A loud, wet snap.

    The thing howled. It fell onto one knee.

    Bryan dropped the maul. He stepped inside the creature’s reach. He knew exactly where to hit. The throat. The armpit. The soft spot under the ribs. He hit the creature three times, hard chops with the side of his hand. It was like hitting a brick wall, but he felt something give.

    The creature swung its arm back. It hit Bryan in the chest.

    Bryan flew backward. He hit the dirt. The wind went out of him. He lay there for a second, staring up at the stars. They looked very far away. He wondered if Lenore was looking at the same stars. Probably not. She was probably sleeping.

    He sat up. His ribs hurt. Maybe broken.

    The creature was trying to stand. It was clicking again. It tapped something on its wrist. A series of red symbols started flashing. They counted down.

    Bryan knew a bomb when he saw one.

    "No," Bryan said.

    He tackled the thing. He didn't think about it. He just did it. He grabbed the creature’s arm. He twisted the wrist mechanism. He’d seen something like it in Baghdad once. Or maybe Istanbul. The memories ran together now.

    He ripped the gauntlet loose. Wires sparked. He threw it as hard as he could into the darkness of the woods.

    Three seconds later, there was a flash. A boom. A shockwave that shook the pine needles from the trees.

    Silence came back.

    The creature lay on the ground. It was breathing shallowly. The green blood was pooling under it. It took off its mask.

    The face was hideous. Mandibles. Beady eyes. It looked at Bryan. It said something, a garbled copy of Bryan's own voice.

    "...good luck..."

    Then it died. It just stopped.

    Bryan stood up. He dusted off his pants. He walked back to the porch. The swing was gone. The railing was scorched.

    His glass of scotch was still sitting there, untouched. The ice hadn't even melted.

    He picked it up. He took a drink. It still tasted cheap.

    He took his phone out and looked at it. No service.

    "Well," he said.

    He went inside the cabin and locked the door. He sat on the couch and waited for the sun to come up. He hoped Kim would call. He really hoped she would call.

    '''

  • otikik 4 months ago

    … agentic …

    Meh, not interested already

  • John-Tony12 4 months ago

    [flagged]

  • recitedropper 4 months ago

    [flagged]

  • John-Tony 4 months ago

    [flagged]

  • jennyholzer 4 months ago

    "AI" benchmarks are and have consistently been lies and misinformation. Gemini is dead in the water.

  • denysvitali 4 months ago

    Finally!

  • mihau 4 months ago

    @simonw wen pelican

  • nextworddev 4 months ago

    It’s over for Anthropic. That’s why Google’s cool with Claude being on Azure.

    Also probably over for OpenAI

  • alksdjf89243 4 months ago

    Pretty obvious how contaminated this site is with goog employees upvoting nonsense like this.

  • poemxo 4 months ago

    It's amazing to see Google take the lead while OpenAI worsens their product every release.

  • WXLCKNO 4 months ago

    Valve could learn from Google here

  • 4 months ago
    [deleted]
  • informal007 4 months ago

    It seems that Google didn't prepare the Gemini 3 release well and leaked a lot of content early, including the model card earlier today and Gemini 3 itself on aistudio.google.com.

  • kachapopopow 4 months ago

    It's joeover for OpenAI and Anthropic. I have been using it for 3 hours now for real work, and GPT-5.1 and Sonnet 4.5 (thinking) do not come close.

    The token efficiency and context are also mindblowing...

    It feels like I am talking to someone who can think, instead of a **rider that just agrees with everything you say and then fails at basic changes. GPT-5.1 feels particularly slow and weak in real-world applications that are larger than a few dozen files.

    Gemini 2.5 felt really weak considering the amount of data and their proprietary TPU hardware, which in theory allows them way more flexibility, but Gemini 3 just works, and it truly understands, which is something I didn't think I'd be saying for a couple more years.

  • vlmrun-admin 4 months ago

    https://www.youtube.com/watch?v=cUbGVH1r_1U

    Everyone is talking about the release of Gemini 3. The benchmark scores are incredible. But as we know in the AI world, paper stats don't always translate to production performance on all tasks.

    We decided to put Gemini 3 through its paces on some standard Vision Language Model (VLM) tasks – specifically simple image detection and processing.

    The result? It struggled where I didn't expect it to.

    Surprisingly, VLM Run's Orion (https://chat.vlm.run/) significantly outperformed Gemini 3 on these specific visual tasks. While the industry chases the "biggest" model, it’s a good reminder that specialized agents like Orion are often punching way above their weight class in practical applications.

    Has anyone else noticed a gap between Gemini 3's benchmarks and its VLM capabilities?

  • irthomasthomas 4 months ago

    I asked it to summarize an article about the Zizians which mentions Yudkowsky SEVEN times. Gemini-3 did not mention him once. Tried it ten times and got zero mention of Yudkowsky, despite him being a central figure in the story. https://xcancel.com/xundecidability/status/19908286970881311...

    Also, can you guess which pelican SVG was gemini 3 vs 2.5? https://xcancel.com/xundecidability/status/19908113191723213...

  • rvz 4 months ago

    I expect almost no-one to read the Gemini 3 model card. But here is a damning excerpt from the early leaked model card from [0]:

    > The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.

    So your Gmail is being read by Gemini and put into the training set for future models. Oh dear, and Google is already being sued over using Gemini to analyze users' data, which potentially includes Gmail by default [1].

    Where is the outrage?

    [0] https://web.archive.org/web/20251118111103/https://storage.g...

    [1] https://www.yahoo.com/news/articles/google-sued-over-gemini-...