Fusion of frontier models beating Fable, or cheaper models matching Fable performance at half the cost. Great announcement timing.
What's missing from the article is the reasoning/effort levels, so we can't rule out that the results differ simply because of different reasoning budgets.
I would also be interested in seeing coding performance on SWE benchmarks.
Came here to post the same article!
The headline result here: (Opus 4.8 + Opus 4.8) > Fable 5
It looks like "fusing" a model with itself gives almost as much gain as fusing two different models.
I saw promising numbers for model fusion before: https://news.ycombinator.com/item?id=44630724
(In this case, a different approach: they randomized the LLM provider for every agentic turn. They found this helped a lot.)
But it's funny (and not too surprising) that just "alloying" a model with itself has a very similar effect. It's basically just more test time compute right? More reasoning time. With the benefit that the reasoning is parallel. Same cost, less time!
I'd love to see more numbers on this, especially with the cheaper models. (For some models, caching is so good now that reprompting and forking are basically free.) Are the gains for tiny LLMs comparatively bigger or smaller? etc.
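For anyone who wants to play with the per-turn randomization idea from the linked thread, here is a rough sketch of what I understand it to be. Everything in it is an assumption on my part: `make_stub`, `run_alloyed_agent`, and the provider names are hypothetical, and the stubs stand in for real LLM API calls. It is not the method from either article, just the "pick a random provider each agentic turn" loop in its simplest form.

```python
import random
from typing import Callable, Dict, List

# Hypothetical stand-in for a real LLM client: takes the transcript so far
# and returns the next message. Swap in actual API calls here.
Provider = Callable[[List[str]], str]


def make_stub(name: str) -> Provider:
    """Build a fake provider that just labels its output (assumption, not a real API)."""
    def call(history: List[str]) -> str:
        return f"[{name}] step {len(history)}"
    return call


def run_alloyed_agent(
    providers: Dict[str, Provider],
    task: str,
    max_turns: int,
    rng: random.Random,
) -> List[str]:
    """Run an agent loop where each turn is handled by a randomly chosen provider."""
    history = [task]
    for _ in range(max_turns):
        # Sort the names so the choice is reproducible for a given seed.
        name = rng.choice(sorted(providers))
        # Every provider sees the full shared transcript, whoever wrote it.
        history.append(providers[name](history))
    return history


if __name__ == "__main__":
    providers = {"model_a": make_stub("model_a"), "model_b": make_stub("model_b")}
    for line in run_alloyed_agent(providers, "fix the bug", 4, random.Random(0)):
        print(line)
```

The only real design point is that all providers share one transcript, so each turn builds on whatever the previous model wrote.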
> It's basically just more test time compute right?
I think this is the key takeaway from here.
I took a better look at their graphs. Their Opus+Opus fusion indeed matches Fable on this benchmark, but costs nearly twice as much as Fable!
Meanwhile their "budget" fusion almost matches Fable, and costs half as much.
At least on this benchmark. (Which looks a bit odd to me, e.g. DeepSeek outranks GPT-5.5, ???)
Would love to see more benchmarks testing this technique.