This reminds me of Antirez's "Don't fall into the anti-AI hype" [0].
In a sentence: these foundation models are really good at optimizing extremely high-level, extremely well-defined problem spaces (e.g. "multiply matrices faster"). In Antirez's case, it's "make Redis faster".
There have been two reactions: "Oh, it would never work for me" and "I have seen months of my life accomplished in an hour", and I think they're both right. I think we should be excited for Antirez (who has since been popping off [1]), and the rest of us can rest easy knowing that LLMs can't (and maybe were never meant to) tackle the tacit-knowledge-filled, human-system-centric, ambiguously-defined problem spaces most mortals work in.
[0] https://antirez.com/news/158 [1] https://antirez.com/news/164
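What makes "multiply matrices faster" such a friendly target is that the entire problem statement fits in a few lines of evaluation code: lower wall-clock time on random inputs wins, full stop. A toy sketch (the function names and sizes here are mine, not from either post):

```python
import random
import time

def naive_matmul(a, b):
    """Plain triple-loop matrix multiply: the slow baseline an optimizer must beat."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c

def score(matmul_fn, size=64, trials=3):
    """The whole objective: best wall-clock time over a few trials, lower is better."""
    a = [[random.random() for _ in range(size)] for _ in range(size)]
    b = [[random.random() for _ in range(size)] for _ in range(size)]
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        matmul_fn(a, b)
        best = min(best, time.perf_counter() - t0)
    return best
```

Anything a model proposes can be checked mechanically against `score`, which is exactly the property the tacit-knowledge jobs lack.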
How many more times do we have to hear about Erdős problems? :) It sounds like a great achievement for humanity at first, but they just keep coming back!
Ugh, one of the examples of a novel approach is a ripoff of something I was working on. Local LLMs are where it's at.
AI improving itself (or at least the architecture it runs on): the singularity is near, as they say.
Do we have other examples of AI being used to improve LLMs, apart from the creation of synthetic data and the testing of models?
I feel like the most viral lately is https://github.com/karpathy/autoresearch
> Do we have other examples of AI being used to improve the LLMs
Yes, last year when they revealed AlphaEvolve, they used a previous Gemini model to improve kernels used in training this generation's models, netting them a 1% faster training run. Not much, but still.
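For intuition, the AlphaEvolve-style loop is essentially propose / measure / keep-if-better. A toy sketch with a stand-in fitness function in place of real kernel timings (everything below is illustrative, not DeepMind's actual code):

```python
import random

def evaluate(params):
    """Stand-in fitness: in AlphaEvolve this would be a measured kernel runtime.
    Here it is just distance to a hidden optimum; lower is better."""
    target = (3.0, -1.0)
    return (params[0] - target[0]) ** 2 + (params[1] - target[1]) ** 2

def mutate(params):
    """Stand-in for the LLM proposing a variant of the current best kernel."""
    return tuple(p + random.gauss(0.0, 0.3) for p in params)

def evolve(generations=200, seed=0):
    """Greedy evolutionary search: keep a candidate only if it measurably improves."""
    random.seed(seed)
    best = (0.0, 0.0)
    best_score = evaluate(best)
    for _ in range(generations):
        candidate = mutate(best)
        s = evaluate(candidate)
        if s < best_score:  # only measurable improvements survive
            best, best_score = candidate, s
    return best, best_score
```

The point of the pattern is that the model never has to be trusted: every proposal is gated by the evaluator, so even a 1% win is a verified 1%.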
> AI improving itself
This is the thing to look for in 2027, imho. All the big AI labs have major projects on research agents, including agents aimed specifically at improving AI (duh), and I expect a lot of that to get out of the experimental phase this year.
Next year they actually get to do a lot of work, and I think we will see the first big, effective architectural change co-invented by AI.
And then in 2028 we will be selling ice cream at the beach.
Shameless plug: https://huggingface.co/spaces/smolagents/ml-intern
It’s a simple harness around Opus, but with tight integration into Hugging Face infra, so the agent can read papers, test code, and launch experiments.
Are Googlers themselves happy using Gemini coding agent instead of Claude Code or Codex? (no snark, I'm really asking)
I for one can't tell the difference between Claude and Gemini for coding. And the internal agent tooling is many times faster than Claude Code in my experience.
Last month, Steve Yegge suggested that they are not: https://xcancel.com/Steve_Yegge/status/2043747998740689171
> He says the problem is that they can't use Claude Code because it's the enemy, and Gemini has never been good enough to capture people's workflows like Claude has, so basically agentic coding just never really took off inside Google. They're all just plodding along, completely oblivious to what's happening out there right now.
This is a bunch of gabagoo. Wrong on so many layers it's not even worth reading further.
a) Google has agentic coding in both Antigravity and CLI form. It's not at the level of Claude Code + Opus, but it's still decent.
b) Google has its own versions of models trained on internal code.
c) Google has Claude in Vertex, and can most definitely set it up in secure zones (like it does for its clients), so Googlers would be able to use Claude (at cost) on their own projects.
Demis Hassabis chimed in on that thread and called it what it is: clickbait.
I’m not so sure. Talking to some of my own friends at Google, they feel that Antigravity and the Gemini models are handicapping them, and they would much rather be using Claude Code (which only DeepMind gets to use).
This couldn't be further from the truth.
The point of dogfooding is exactly this: if we're unhappy with the tool, we're the ones who can improve it.
Note that coding is not the only use of Gemini or any of these models, and it's not what this article is talking about. Gemini may not be the best coding agent and still be very good at other things.
Codex?
I would be interested to see exactly how the agent helped. How was it used, where did it lead to the given improvement, and how long would it have taken a human to come to the same solution?
The blog post has many links to papers and preprints discussing this exact question.
RSI (recursive self-improvement) is here at both the hardware and the software level. Sprinkle in a couple of algorithmic breakthroughs and the results are nigh unimaginable.
What I'm most curious about is how this translates to messy, real-world codebases without well-defined metrics. Most production software isn't chip design or kernel optimization - it's business logic with unclear success criteria. The infrastructure story is impressive, but I'd love to see how they handle domains where the evaluation function itself is ambiguous.
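A toy way to see the gap: a kernel objective is one unambiguous number, while a business-logic "objective" ends up as hand-picked weights over terms that themselves need human judgment. All names and weights below are made up purely for illustration:

```python
def kernel_objective(runtime_ms: float) -> float:
    """Well-defined: one measurable number, lower is better, no judgment calls."""
    return runtime_ms

def business_logic_objective(change: dict) -> float:
    """Ambiguous: the weights are guesses, and each term itself needs a human
    to score it. Building this function is the hard part, not the search."""
    return (0.5 * change["meets_requirements"]
            + 0.3 * change["maintainability"]
            + 0.2 * change["stakeholder_happiness"])
```

An agent can grind against the first function indefinitely; against the second, any "improvement" is only as trustworthy as the hand-rolled scoring behind it.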