> Conversely, on tasks requiring strict sequential reasoning (like planning in PlanCraft), every multi-agent variant we tested degraded performance by 39-70%. In these scenarios, the overhead of communication fragmented the reasoning process, leaving insufficient "cognitive budget" for the actual task.
> As tasks require more tools (e.g., a coding agent with access to 16+ tools), the "tax" of coordinating multiple agents increases disproportionately.
This aligns well with the principle of highly cohesive, loosely coupled design for software components. If you instruct the AI to design this way, it should result in components that are simpler to reason about and require fewer sequential steps to work on. You can think of cohesion in many different ways, but one is common functions, and another is tool/library dependency.
Agreed.
Even in the case of a single agent, the compounding of errors [1] can easily make your "flow" unacceptable for your use case. A deterministic-where-possible, decoupled, well-tested approach is key.
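Purely as a back-of-the-envelope illustration of that compounding (the 95% per-step success rate is an assumed number, not one from the linked post):

```python
# Assumed per-step success rate, for illustration only.
per_step_success = 0.95

for steps in (5, 10, 20, 40):
    # If every step must succeed for the flow to succeed,
    # end-to-end reliability decays geometrically with flow length.
    end_to_end = per_step_success ** steps
    print(f"{steps:>2} steps -> {end_to_end:.1%} chance the flow completes cleanly")
```

At 20 steps you're already down to roughly a one-in-three chance of a clean run, which is why pushing steps into deterministic, well-tested code pays off so quickly.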
With such a fast moving space I'm always wary of adopting optimization techniques that I can't easily prove and pivot from (which means measuring/evals are necessary).
Slowly but surely, abstractions allow us to use others' deep investments in the matter of coordination without losing control (e.g. pyspark worker/driver coordination) and we can invest on friction removal and direct value generation in our domains (e.g. banking/retail/legal payments, etc)
- [1] https://alexhans.github.io/posts/series/evals/error-compound...
> We found that independent multi-agent systems (agents working in parallel without talking) amplified errors by 17.2x
The paper sounds too shallow. The error data doesn't seem to have a rationale or correlation against the architecture. Specifically, what makes the SAS architecture have the lowest error rates while the similar architecture with independent agents has the highest? The conclusion doesn't seem well-grounded in reasoning.
> The paper sounds too shallow. The error data doesn't seem to have a rationale or correlation against the architecture. Specifically, what makes the SAS architecture have the lowest error rates while the similar architecture with independent agents has the highest?
I can believe SAS works great until the context has errors which were corrected - there seems to be a leakage between past mistakes and new ones, if you leave them all in one context window.
My team wrote a similar paper[1] last month, but we found that the core component is not the orchestrator but a specialized evaluator for each action, which matches the result, goal, and methods at the end of execution and reports back to the orchestrator on goal adherence.
The effect is sort of like a perpetual evals loop, which lets us improve the product every week, agent by agent, without the Snowflake agent picking up the BigQuery tools, etc.
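A loose sketch of that evaluator-per-action idea (the names and shapes below are mine, not the structure from the linked paper):

```python
from dataclasses import dataclass

@dataclass
class ActionReport:
    action: str
    goal: str
    result: str
    adheres: bool
    notes: str

def evaluate_action(action, goal, result, judge) -> ActionReport:
    """Specialized per-action evaluator: compare the result against the stated goal
    and produce an adherence report for the orchestrator. `judge` is any callable
    returning (bool, str), e.g. a small LLM call or a rule-based check."""
    ok, notes = judge(action, goal, result)
    return ActionReport(action, goal, result, ok, notes)

def run_plan(plan, executor, judge):
    """Orchestrator loop: execute each (action, goal) pair, evaluate the result,
    and collect the reports -- effectively a perpetual evals loop."""
    reports = []
    for action, goal in plan:
        result = executor(action)  # a worker agent performs the action
        reports.append(evaluate_action(action, goal, result, judge))
    return reports

# Toy usage with stubs standing in for real agents:
reports = run_plan(
    plan=[("export table", "rows land in the warehouse")],
    executor=lambda action: "wrote 10_000 rows",
    judge=lambda action, goal, result: (True, "row count matches expectation"),
)
```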
We started building this Nov 2024, so the paper is more of a description of what worked for us (see Section 3).
Also specific models are great at some tasks, but not always good at others.
My general finding is that Google models do document extraction best, Claude does code well and OpenAI does task management in somewhat sycophantic fashion.
Multi-agent setups were originally supposed to let us put together a "best of all models" world, but it works for error correction if I have Claude write code and GPT-5 check the results instead of everything going into one context.
[1] - https://arxiv.org/abs/2601.14351
"All MAS and SAS configurations were matched for total reasoning-token budget (mean 4,800 tokens per trial)"
A single-agent system (SAS) uses this budget for a deep, unified reasoning stream (averaging 7.2 turns), whereas multi-agent teams fragment the same budget into dozens of coordination messages.
I wonder whether the same results would be observed if the budget were increased (say, to 50k).
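A rough illustration of where the budget could go (the agent count and per-message overhead below are assumptions for the sake of arithmetic, not figures from the paper):

```python
total_budget = 4_800          # reasoning tokens per trial, as stated in the paper

# Single agent: nearly the whole budget flows into one reasoning stream.
sas_turns = 7                 # the paper reports ~7.2 turns on average
print(f"SAS: ~{total_budget / sas_turns:.0f} reasoning tokens per turn")

# Multi-agent: assumed values, purely illustrative.
agents = 4
coordination_messages = 30    # "dozens" of handoffs and status messages
overhead_per_message = 60     # tokens restating context, roles, partial results
coordination_cost = coordination_messages * overhead_per_message
remaining = total_budget - coordination_cost
print(f"MAS: {coordination_cost} tokens spent on coordination, "
      f"{remaining} left for task reasoning (~{remaining / agents:.0f} per agent)")
```

Under those assumptions roughly a third of the budget never touches the task, which is also why a much larger budget (the 50k case above) might blunt the effect.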
It's true that most problems can be solved with context + prompt. I have actively seen teams within large organizations complicate it into complex "agentic orchestration" just to impress leadership who lack the expertise to realize it's not even necessary. Hell, there are various startups who make this their moat.
Good for promo projects though, lol
I’ve been building a lot of agent workflows at my day job. Something that I’ve found a lot of success with when deciding on an orchestration strategy is to ask the agent what it recommends as part of the planning phase. This technique of using the agent to help you improve its performance has been a game changer for me in leveraging this tech effectively. YMMV of course. I mostly use Claude Code, so who knows with the others.
Can you expand on this at all? What are you asking the agent for help with?
Not who you asked, but I have had similar experiences with Claude Code. For example, I might have asked it to “commit the changes” and been surprised that it did not use the skill `committing-changes`.
Now if I ask “why did you not use the skill”, the answer typically starts with an apology and an acknowledgment of which skill it should have used, and it will then proceed to use the skill.
In contrast, if I ask “what can we change in the skill description so you use it next time I ask you to commit”, it will typically explain how it selects skills and how to modify the description of the skill in question so it would pick it on its own.
It feels like everyone these days is thinking about Markdown-IPC hierarchical multi-agent orchestration. Just the other day I saw this[1] vibecoded thing. I wonder if there are any notable ones, or maybe I should try my hand at it.
1: https://github.com/yohey-w/multi-agent-shogun
This is a neat idea but there are so many variables here that it's hard to make generalizations.
Empirically, a top-level orchestrator that calls out to a planning committee, then generates a task DAG from the plan that gets executed in parallel where possible, is the approach I've seen produce the best results in various heterogeneous environments. As models evolve, crosstalk may become less of a liability.
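A minimal sketch of that shape, with a hypothetical `run_agent` call standing in for real worker agents (this isn't any particular framework's API):

```python
import asyncio

# Hypothetical output of the planning committee: task id -> (prompt, dependency ids).
plan = {
    "gather":  ("Collect the relevant files", []),
    "analyze": ("Summarize the collected files", ["gather"]),
    "tests":   ("Draft tests for the change", ["gather"]),
    "write":   ("Implement the change", ["analyze", "tests"]),
}

async def run_agent(task_id: str, prompt: str) -> str:
    # Placeholder for a real worker-agent call (LLM + tools).
    await asyncio.sleep(0)
    return f"result of {task_id}"

async def run_dag(plan):
    """Execute the task DAG wave by wave: whatever has all its dependencies
    finished runs in parallel; everything else waits."""
    done: dict[str, str] = {}
    remaining = dict(plan)
    while remaining:
        ready = [t for t, (_, deps) in remaining.items() if all(d in done for d in deps)]
        if not ready:
            raise RuntimeError("cycle in task DAG")
        results = await asyncio.gather(*(run_agent(t, remaining[t][0]) for t in ready))
        for t, r in zip(ready, results):
            done[t] = r
            del remaining[t]
    return done

print(asyncio.run(run_dag(plan)))
```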
Reasoning is recursive - you cannot isolate where it should be symbolic and where it should be LLM-based (fuzzy/neural). This is the idea that started https://github.com/zby/llm-do - there is also RLM: https://alexzhang13.github.io/blog/2025/rlm/ RLM is simpler, but my approach also has some advantages.
I think the AI community is sleeping hard on proper symbolic recursion. The computer has gigabytes of very accurate "context" available if you start stacking frames. Any strategy that happens inside token space will never scale the same way.
Depth-first, slow-turtle recursion is likely the best way to reason through the hardest problems. It's also much more efficient compared to things that look more like breadth-first search (gas town).
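To make "stacking frames" concrete, here's a toy reading of it (my interpretation, not the commenter's code): exact intermediate state lives in ordinary call-stack frames, and the model only ever sees one small leaf question at a time.

```python
def ask_llm(question: str) -> str:
    # Placeholder for a real model call on a narrow, leaf-level question.
    return f"answer to: {question}"

def decompose(problem: str) -> list[str]:
    # Hypothetical symbolic splitter; a real system might use rules or a planner here.
    parts = [p.strip() for p in problem.split(" and ")]
    return parts if len(parts) > 1 else []

def combine(problem: str, answers: list[str]) -> str:
    # Symbolic recombination: exact, costs no tokens.
    return "; ".join(answers)

def solve(problem: str, depth: int = 0, max_depth: int = 3) -> str:
    """Depth-first symbolic recursion: finish one branch completely before the next,
    so the context handed to the model stays tiny while the stack holds the rest."""
    subproblems = decompose(problem)
    if depth >= max_depth or not subproblems:
        return ask_llm(problem)          # the fuzzy/neural part, leaves only
    sub_answers = [solve(p, depth + 1, max_depth) for p in subproblems]
    return combine(problem, sub_answers)

print(solve("profile the hot loop and check the allocator and summarize findings"))
```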
I only agree with that statement if you're drawing from the set of all possible problems a priori. For any individual domain I think it's likely you can bound your analytic. This ties into the no free lunch theorem.
Computers are finite - but we use an unbounded model for thinking about them - because it simplifies many things.
I found the captions on Figure 1 quite interesting.
> Average performance (%) across four agentic benchmarks improves consistently with increasing model Intelligence Index.
> Centralized and hybrid coordination generally yield superior scaling efficiency, suggesting that collaborative agentic structures amplify capability gains more effectively than individual scaling alone.
Then again, the deltas between SAS and the best-performing MAS approach are ~8%, so I can't help but wonder if it's worth the extra cost, at least for the generation of models that was studied.
I've been building something in this space ("Clink" - multi-agent coordination layer) and this research confirms some of the assumptions that motivated the project. You can't just throw more agents at a problem and expect it to get better.
The error amplification numbers are wild! 17x for independent agents vs 4x with some central coordination. Clink provides users (and more importantly their agents) the primitives to choose their own pattern.
The most relevant features are...
- work queues with claim/release for parallelizable tasks (sketched loosely below)
- checkpoint dependencies when things need to be sequential
- consensus voting as a gate before anything critical happens
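Purely as an illustration of the claim/release idea (a generic sketch, not Clink's actual API):

```python
import threading, uuid

class WorkQueue:
    """Claim/release work queue: an agent claims a task so no one else picks it up,
    then either completes it or releases it back on failure. Not Clink's API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending: dict[str, str] = {}   # task_id -> description
        self._claimed: dict[str, str] = {}   # task_id -> agent_id

    def add(self, description: str) -> str:
        task_id = str(uuid.uuid4())
        with self._lock:
            self._pending[task_id] = description
        return task_id

    def claim(self, agent_id: str):
        """Atomically hand one pending task to an agent, so parallel agents never
        pick up the same work."""
        with self._lock:
            if not self._pending:
                return None
            task_id, description = next(iter(self._pending.items()))
            del self._pending[task_id]
            self._claimed[task_id] = agent_id
            return task_id, description

    def release(self, task_id: str, description: str) -> None:
        """Put a claimed task back, e.g. if the agent crashed or gave up."""
        with self._lock:
            self._claimed.pop(task_id, None)
            self._pending[task_id] = description

    def complete(self, task_id: str) -> None:
        with self._lock:
            self._claimed.pop(task_id, None)
```

Checkpoint dependencies and consensus gates would presumably layer on top of the same kind of shared, atomic state.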
The part about tool count increasing coordination overhead is interesting too. I've been considering exposing just a single tool to address this, but I wonder how this plays out as people start stacking more MCP servers together. It feels like we're all still learning what works here. The docs are at https://docs.clink.voxos.ai if anyone wants to poke around!
What are your other primitives for orchestration?
> The part about tool count increasing coordination overhead is interesting too. I've been considering exposing just a single tool to address this, but I wonder how this plays out as people start stacking more MCP servers together.
It works really well. Whatever knowledge LLMs absorb about CLI commands seems to transfer to MCP use so a single tool with commands/subcommands works very well. It’s the pattern I default to when I’m forced to use an MCP server instead of providing a CLI tool (like when the MCP server needs to be in-memory with the host process).
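For what it's worth, a minimal sketch of that single-tool, command/subcommand pattern, assuming the `FastMCP` helper from the official Python MCP SDK (the issue tracker itself is made up for illustration):

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("issue-tracker")           # hypothetical server name

ISSUES: dict[int, dict] = {}             # toy in-memory state

@mcp.tool()
def issues(command: str, args: str = "") -> str:
    """Issue tracker. Commands: 'list', 'show <id>', 'create <title>', 'close <id>'."""
    argv = args.split()
    if command == "list":
        return "\n".join(f"#{i} {x['title']} ({x['state']})" for i, x in ISSUES.items()) or "no issues"
    if command == "create":
        issue_id = len(ISSUES) + 1
        ISSUES[issue_id] = {"title": args, "state": "open"}
        return f"created #{issue_id}"
    if command == "show" and argv:
        return str(ISSUES.get(int(argv[0]), "not found"))
    if command == "close" and argv:
        ISSUES[int(argv[0])]["state"] = "closed"
        return f"closed #{argv[0]}"
    return "unknown command; try 'list', 'show <id>', 'create <title>', 'close <id>'"

if __name__ == "__main__":
    mcp.run()
```

The docstring effectively plays the role of `--help`, which seems to be the part models already know how to read from their CLI training data.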
I've started with the basics for now: messages (called "Clinks" because... marketing), groups, projects, milestones - which are all fairly non-novel and one might say this is just Slack/Jira. The ones that distinguish it are proposals to facilitate distributed consensus behaviour between agents. That's paired with a human-in-the-loop type proposal that requires the fleet owner to respond to the proposal via email.
That's great to hear. It makes sense given the MCP server in this case is mainly just a proxy for API calls. One thing I wonder is at what point do you decide your single tool description packs in too much context? Do you introduce a tool for each category of subcommands?
Wouldn't it be better just to stack the functionalities of multiple agents into a single agent instead of taking on this multi-agent overhead/failure? Many people in academia consider multi-agentic systems to be just an artifact of the current crop of LLMs, but with longer and longer reliable context and more reliable calling of larger numbers of tools in recent models, multi-agentic systems seem less and less necessary.
In some cases, you might actually want to cleanly separate parallel agents' context, no? I suppose you could make your main agent, with the stacked functionalities, responsible for limiting the prompt of any subagents it spawns.
My hunch is that we'll see a number of workflows that will benefit from this type of distributed system. Namely, ones that involve agents having to collaborate across timezones and interact with humans from different departments at large organizations.
Can you explain a use case for Clink?
Coordination of workflows between people using different LLM providers is the big one. You prefer Anthropic's models, your coworker swears by OpenAI's. None of these companies are going to support frameworks/tools that allow agent swarms to use anything other than their own models.
Almost feels like a paper for the sake of a paper to me.
Gonna read this with a grain of salt because I have been rather unimpressed with Google's AI products, save direct API calls to Gemini.
The rest is trash they are forcing down our throats
Yeah alpha go and zero were lame. The earth foundation model - that's just ridiculous.
That's sarcasm
---
Your "direct Gemini calls" is maybe the least impressive
edit: This paper is mostly a sort of "quantitative survey". Nothing to get too excited about or to require a grain of salt.
The underlying models are impressive, be it Gemini (via direct API calls, vs the app or search); I would include alpha-go/fold/etc. in that classification.
The products they build, where the agentic stuff is, are what I find unimpressive. The quality is low, the UX is bad, and they are forced into every product. Notable examples: search in GCloud, gemini-cli, and Antigravity (not theirs technically; a $2B whitelabel deal with Windsurf, iirc).
So yes, I see it as perfectly acceptable to be more skeptical of Google's take on agentic systems when I find their real-world applications lackluster.
Antigravity is not a Windsurf reskin, at least not today; it introduces many concepts and optimisations that you wouldn't find anywhere else, and in my workflows Gemini 3 Flash in Antigravity also happens to outmatch Claude Code with Opus 4.5 on some really gruesomely complicated tasks (e.g. involving compiler/decompiler work).
They are really cooking with Flash + Antigravity.
I agree with you in general re "agentic systems". Though they might deliberately not be trying to compete in the "agent harness" space yet.
The Antigravity experiment, yes, was via Windsurf - probably nobody expected that to take off, but maybe it was work that surfaced some lessons worth learning from.
My hunch is that Google is past its prime, all the good PMs are gone, and now it looks like a chicken hydra with all the heads off, trying to run in multiple directions.
There is no clear vision, coherence, or confidence that the products will be around in another year.
Kind of a weird take given they are one of the strongest and most vertically integrated AI providers. Sure, maybe the company isn’t as healthy as it once was, but none of them are - late-stage capitalism is rotting most foundations.
I'm saying this as a big, but dimming, Google-stan.
Their poor product decisions have driven me away, but that doesn't mean I'm not still very impressed with everything underneath. I'm building my custom agent on their open-source Agent Development Kit and the Gemini family.