I work on LLM benchmarks and human evals for a living in a research lab (as opposed to product). I can say: it’s pretty much the Wild West and a total disaster. No one really has a good solution, and researchers are also in a huge rush and don’t want to end up making their whole job benchmarking. Even if you could, and even if you have the right background, you could do benchmarks full time and they would still be a mess.
Product testing (with traditional A/B tests) is kind of the best bet since you can measure what you care about _directly_ and at scale.
I would say there is of course “benchmarketing”, but generally people do sincerely want to make good benchmarks; it’s just hard or impossible. For many of these problems we’re hitting capabilities where we don’t even have a decent paradigm to use.
For what it's worth, I work on platforms infra at a hyperscaler and benchmarks are a complete fucking joke in my field too lol.
Ultimately we are measuring extremely measurable things that have an objective ground truth. And yet:
- we completely fail at statistics (the MAJORITY of analysis is literally just "here's the delta in the mean of these two samples". If I ever do see people gesturing at actual proper analysis, if prompted they'll always admit "yeah, well, we do come up with a p-value or a confidence interval, but we're pretty sure the way we calculate it is bullshit")
- the benchmarks are almost never predictive of the performance of real world workloads anyway
- we can obviously always just experiment in prod but then the noise levels are so high that you can entirely miss million-dollar losses. And by the time you get prod data you've already invested at best several engineer-weeks of effort.
AND this is a field where the economic incentives for accurate predictions are enormous.
In AI, you are measuring weird and fuzzy stuff, and you kinda have an incentive to just measure some noise that looks good for your stock price anyway. AND then there's contamination.
Looking at it this way, it would be very surprising if the world of LLM benchmarks was anything but a complete and utter shitshow!
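To make the first bullet concrete, here is a minimal sketch (Python, with fabricated latency samples) of one step beyond "delta of the means": a bootstrap confidence interval for the difference, which at least doesn't depend on a closed-form formula nobody trusts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fabricated latency samples (ms) from two benchmark runs of the same workload.
baseline = rng.normal(100, 15, size=500)
candidate = rng.normal(96, 15, size=500)

# The usual report: just the delta of the means.
print("delta of means:", candidate.mean() - baseline.mean())

# A slightly less hand-wavy report: a bootstrap 95% CI for that delta,
# which needs no distributional assumptions beyond independent sampling.
deltas = [
    rng.choice(candidate, candidate.size).mean()
    - rng.choice(baseline, baseline.size).mean()
    for _ in range(10_000)
]
low, high = np.percentile(deltas, [2.5, 97.5])
print(f"bootstrap 95% CI for the delta: [{low:.2f}, {high:.2f}]")
```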
> we completely fail at statistics (the MAJORITY of analysis is literally just "here's the delta in the mean of these two samples". If I ever do see people gesturing at actual proper analysis, if prompted they'll always admit "yeah, well, we do come up with a p-value or a confidence interval, but we're pretty sure the way we calculate it is bullshit")
Sort of tangential, but as someone currently taking an intro statistics course and wondering why it's all not really clicking given how easy the material is, this for some reason makes me feel a lot better.
FWIW, I don't think intro stats is easy the way I normally see it taught. It focuses on formulae, tests, and step-by-step recipes without spending the time to properly develop intuition as to why those work, how they work, which ones you should use in unfamiliar scenarios, how you might find the right thing to do in unfamiliar scenarios, etc.
Pair that with skipping all the important problems (what is randomness, how do you formulate the right questions, how do you set up an experiment capable of collecting data which can actually answer those questions, etc), and it's a recipe for disaster.
It's just an exercise in box-ticking, and some students get lucky with an exceptional teacher, and others are independently able to develop the right instincts when they enter the class with the right background, but it's a disservice to almost everyone else.
In AI though, you also have the world trying to compete with you. Even if you totally cheat, put the benchmark answers in your training set, and overfit, it doesn't matter how much your marketing department tells everyone you scored 110% on SWE-bench: if the model doesn't work that well in production, your announcement is going to flop as users discover it doesn't hold up on their personal/internal secret benchmarks and tell /r/localLLAMA it isn't worth the download.
I have actually been thinking of hiring some training contractors to come in and teach people the basics of applied statistical inference. I think with a bit of internal selling, engineers would generally be interested enough to show up and pay attention. And I don't think we need very deep expertise, just a moderate bump in the ambient level of statistical awareness would probably go a long way.
It's not like there's a shortage of skills in this area, it seems like our one specific industry just has a weird blindspot.
I don't know how it is in the US and other countries, but in my country I would say statistics is typically not taught well, at least in CS degrees. I was a very good student and always had a good understanding of the subjects at university, but in the case of statistics they just taught us formulae and techniques as dogmas, without much explanation of where they came from, why, and when to use them. It didn't help either that the exercises always applied them to things outside CS (clinical testing, people's heights and things like that), with no application we could directly relate to. As a result, when I finished the degree I had forgotten most of it, and when I started working I was surprised that it was actually useful.
When I talk about this with other CS people in my own country (Spain), they tend to report similar experiences.
A/B testing is radioactive too. It's indirectly optimizing for user feedback - less stupid than directly optimizing for user feedback, but still quite dangerous.
Human raters are exploitable, and you never know whether the B has a genuine performance advantage over A, or just found a meat exploit by an accident.
It's what fucked OpenAI over with 4o, and fucked over many other labs in more subtle ways.
Are you talking about just preferences or A/B tests on like retention and engagement? The latter I think is pretty reliable and powerful though I have never personally done them. Preferences are just as big a mess: WHO the annotators are matters, and if you are using preferences as a proxy for like correctness, you’re not really measuring correctness you’re measuring e.g. persuasion. A lot of construct validity challenges (which themselves are hard to even measure in domain).
Yes. All of them are poisoned metrics, just in different ways.
GPT-4o's endless sycophancy was great for retention, GPT-5's style of ending every response in a question is great for engagement.
Are those desirable traits though? Doubt it. They look like simple tricks and reek of reward hacking - and A/B testing rewards them indeed. Direct optimization is even worse. Combining the two is ruinous.
Mind, I'm not saying that those metrics are useless. Radioactive materials aren't useless. You just got to keep their unpleasant properties in mind at all times - or suffer the consequences.
> Brittle performance – A model might do well on short, primary school-style maths questions, but if you change the numbers or wording slightly, it suddenly fails. This shows it may be memorising patterns rather than truly understanding the problem
The big problem is that tech companies and journalists aren't transparent about this. They tout benchmark numbers constantly, like they're an objective measure of capabilities.
That's because they are as close to an "objective measure of capabilities" as anything we're ever going to get.
Without benchmarks, you're down to evaluating model performance based on vibes and vibes only, which plain sucks. With benchmarks, you have numbers that correlate to capabilities somewhat.
So because there isn't a better measure, it's okay that tech companies effectively lie and treat these benchmarks like they mean more than they actually do?
That's assuming these benchmarks are the best we're ever going to get, which they clearly aren't. There's a lot to improve even without radical changes to how things are done.
In my experience everyone openly talks about how benchmarks are bullshit. On Twitter or on their podcast interviews or whatever everyone knows benchmarks are a problem. It's never praise.
Of course they tout benchmark numbers because, let's be real, if they didn't tout benchmarks you're not going to bother using it. For example, if someone posts some random model on huggingface with no benchmarks, you just won't proceed.
Humans have a really strong prior to not waste time. We always always evaluate things hierarchically. We always start with some prior, and then whatever is easiest goes next, even if it's a shitty, unreliable measure.
For example, for Gemini 3 everyone will start with a prior that it is going to be good. Then they will look at benchmarks, and only then will they move to harder evaluations on their own use cases.
I also work in LLM evaluation. My cynical take is that nobody is really using LLMs for stuff, and so benchmarks are mostly just make up tasks (coding is probably the exception). If we had real specific use cases it should be easier to benchmark and know if one is better, but it’s mostly all hypothetical.
The more generous take is that you can't benchmark advanced intelligence very well, whether LLM or person. We don't have good procedures for assessing a person's fit-for-purpose, e.g. for a job, and certainly not standardized question sets. Why would we expect to be able to do this with AI?
I think both of these takes are present to some extent in reality.
We have 20+ services in prod that use LLMs. So I have 50k (or more) data points per service per day to evaluate. The question is: do people actually evaluate properly?
And how do you do an apples to apples evaluation of such squishy services?
Has your lab tried using any of the newer causal inference–style evaluation methods? Things like interventional or counterfactual benchmarking, or causal graphs to tease apart real reasoning gains from data or scale effects. Wondering if that’s something you’ve looked into yet, or if it’s still too experimental for practical benchmarking work.
It's a shifting goalpost, but one of the things that struck me was how some questions could still be trivial for a fairly qualified human (a doctor in this case) but difficult for an AI model. Reasoning, visual or logical, is built on a set of assumptions that are better gained through IRL experience than crawling datasets and matching answers.
This leads me to believe that much of the future for training AI models will lie in exposing them to "meatspace" and annotating their inferences, much like how we train a child. This is a long, long process, and one that is already underway at scale. But it's what might give us emergent intelligences rather than just a basket of competing yet somehow-magic thesauruses.
Benchmarks are like SAT scores. Can they guarantee you'll be great at your future job? No, but we are still roughly okay with what they signify. Clearly LLMs are getting better in meaningful ways, and benchmarks correlate with that to some extent.
I wish this were explained more broadly to people…
There are LLMs, the engines that make these products run, and then the products themselves.
GPT anything should not be asked math problems. LLMs are language models, not math engines.
The line is going to get very blurry because ChatGPT, or Claude or Gemini, are not LLMs. They're products driven by LLMs.
The question or requirement should not be "can my LLM do math?" It's "can I build an LLM-driven product that can reason through math problems?" Those are different things.
A coworker of mine told me that GPT’s LLM can use Excel files. No, it can’t. But the tools they plugged into it can.
> A coworker of mine told me that GPT’s LLM can use Excel files. No, it can’t. But the tools they plugged into it can.
It's a bit like saying that a human can't use Excel files, but when given a keyboard, mouse and monitor connected to a computer running Excel, it can. But then obviously the "Excel usage" competency is in the human; not in the tools, and a cat for example cannot use Excel proficiently however many training hours it gets and however good the keyboard is.
Taking it back to the LLMs, it is clear to me that some modern LLMs like the one running ChatGPT can be integrated with tools in a way that makes them somewhat proficient with Excel, while other simpler LLMs cannot, regardless of the tools.
> A coworker of mine told me that GPT’s LLM can use Excel files. No, it can’t. But the tools they plugged into it can.
And there's a 50/50 chance they'll use the right tool for the job. I tried the math question above multiple times on GPT-5 and it gets it right about 50% of the time. If I ask it to "try again" it usually gets it on the 2nd or 3rd try. Most times that it's wrong, it's not far off, but it looks deceptively accurate at first glance.
People often use "clearly" or "obviously" to elide the subject that is under discussion. People are saying that they do not think that it is clear that LLMs are getting better in meaningful ways, and they are saying that the benchmarks are problematic. "Clearly" isn't a counterargument.
Definitely one of the weaker areas in the current LLM boom. Comparing models, or even different versions of the same model, is a pseudo-scientific mess.
I'm still using https://lmarena.ai/leaderboard. Perhaps there is something better and someone will pipe up to tell me about it. But we use LLMs at work and have unexplainable variations between them.
And when we get a prompt working reliably on one model, we often have trouble porting it to another LLM - even straight "version upgrades" such as from GPT-4 to -5. Your prompt and your model become highly coupled quite easily.
I dunno what to do about it and am tending to just pick Gemini as a result.
Even professional human evaluators are quite vulnerable to sycophancy and overconfident-and-wrong answers. And LMArena evaluators aren't professionals.
A lot of the sycophancy mess that seeps from this generation of LLM stems from reckless tuning based on human feedback. Tuning for good LMArena performance has similar effects - and not at all by a coincidence.
It's biased to small context performance, which is why I don't pay much attention to it as a developer aside from a quick glance. I need performance at 40-100k tokens which models like Deepseek can't deliver but Gemini 2.5 Pro and ChatGPT 5.0 Thinking can.
And even "long term performance" splits itself into "performance on multi-turn instruction following" and "performance on agentic tasks" down the line. And "performance on agentic tasks" is a hydra in itself.
Capturing LLM performance with a single metric is a hopeless task. But even a single flawed metric beats no metrics at all.
This is something I've struggled with for my site. I made https://aimodelreview.com/ to compare the outputs of LLMs over a variety of prompts and categories, allowing a side-by-side comparison between them. I ran each prompt 4 times for each model, with different temperature values available as toggles.
My thinking was to just make the responses available to users and let them see how models perform. But from some feedback, turns out users don't want to have to evaluate the answers and would rather see a leaderboard and rankings.
The scalable solution to that would be LLM as judge that some benchmarks already use, but that just feels wrong to me.
LM Arena tries to solve this with the crowd sourced solution, but I think the right method would have to be domain expert human reviewers, so like Wirecutter VS IMDb, but that is expensive to pull off.
I’m working a lot with TTS (Text-to-Speech), and it’s also a total Wild West - even worse than LLMs in some ways. The demos are always perfect, but once you generate hundreds of minutes you start seeing volume drift, pacing changes, random artifacts, and occasional mispronunciations that never show up in the curated clips.
The big difference from LLMs is that we don’t really have production-grade, standardized benchmarks for long-form TTS. We need things like volume-stability across segments, speech-rate consistency, and pronunciation accuracy over a hard corpus.
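As a rough sketch of what such long-form checks could look like (assuming a hypothetical generated file `generated_chapter.wav` and an arbitrary 10-second segment length), per-segment RMS level is one way to quantify volume drift:

```python
import numpy as np
import soundfile as sf

# Hypothetical long-form TTS output; the path and the 10 s segment length
# are arbitrary choices for this sketch.
audio, sr = sf.read("generated_chapter.wav")
if audio.ndim > 1:          # mix down to mono if the file is stereo
    audio = audio.mean(axis=1)

seg_len = 10 * sr
segments = [audio[i:i + seg_len] for i in range(0, len(audio) - seg_len, seg_len)]

# Volume stability: RMS level per segment (in dB relative to full scale)
# and its spread across the whole recording.
rms_db = [20 * np.log10(np.sqrt(np.mean(s ** 2)) + 1e-12) for s in segments]
print(f"per-segment RMS: mean {np.mean(rms_db):.1f} dB, "
      f"std {np.std(rms_db):.1f} dB (large std suggests volume drift)")

# Speech-rate consistency and pronunciation accuracy would need an aligned
# transcript (e.g. words per minute per segment against a reference text),
# which is left out of this sketch.
```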
This is solvable at the level of an individual developer. Write your own benchmark for code problems that you've solved. Verify tests pass and that it satisfies your metrics like tok/s and TTFT. Create a harness that works with API keys or local models (if you're going that route).
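A minimal sketch of such a harness, assuming an OpenAI-compatible endpoint (hosted or local); the base URL, model name, tasks, and `check()` hooks are all placeholders:

```python
"""Minimal personal benchmark harness sketch: run private tasks against any
OpenAI-compatible endpoint (hosted or local) and record pass rate and TTFT."""
import time
from openai import OpenAI

# Works with hosted APIs or local servers (llama.cpp, vLLM, Ollama, ...)
# that expose an OpenAI-compatible /v1 endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

TASKS = [  # each task: a prompt plus a checker you wrote for a problem you solved
    {"prompt": "Write a Python function dedupe(xs) that ...",
     "check": lambda out: "def dedupe" in out},  # ideally: actually run tests
]

def run(model: str) -> None:
    passed, ttfts = 0, []
    for task in TASKS:
        start, first, chunks = time.time(), None, []
        stream = client.chat.completions.create(
            model=model, stream=True,
            messages=[{"role": "user", "content": task["prompt"]}])
        for chunk in stream:
            delta = chunk.choices[0].delta.content or ""
            if delta and first is None:
                first = time.time() - start        # time to first token
            chunks.append(delta)
        ttfts.append(first if first is not None else float("nan"))
        passed += bool(task["check"]("".join(chunks)))
    print(f"{model}: {passed}/{len(TASKS)} passed, "
          f"median TTFT {sorted(ttfts)[len(ttfts) // 2]:.2f}s")

run("my-local-model")  # placeholder model name
```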
At the developer level all my LLM use is in the context of agentic wrappers, so my benchmark is fairly trivial:
Configure aider or claude code to use the new model, try to do some work. The benchmark is pass/fail: if after a little while I feel the performance is better than the last model I was using, it's a pass; otherwise it's a fail and I go back.
Building your own evaluations makes sense if you're serving an LLM up to customers and want to know how it performs, but if you are the user... use it and see how it goes. It's all subjective anyway.
> Building your own evaluations makes sense if you're serving an LLM up to customers and want to know how it performs, but if you are the user... use it and see how it goes. It's all subjective anyway.
I'd really caution against this approach, mainly because humans suck at removing emotions and other "human" factors when judging how well something works, but also because comparing across models gets a lot easier when you can see 77/100 vs 91/100 as a percentage score, over your own tasks that you actually use the LLMs for. Just don't share this benchmark publicly once you're using it for measurements.
So what? I'm the one that's using it, I happen to be a human, my human factor is the only one that matters.
At this point anyone using these LLMs every day has seen those benchmark numbers go up without an appreciable improvement in the day-to-day experience.
> So what? I'm the one that's using it, I happen to be a human, my human factor is the only one that matters.
Yeah no you're right, if consistency isn't important to you as a human, then it doesn't matter. Personally, I don't trust my "humanness", and correctness is the most important thing for me when working with LLMs, so that's what my benchmarks focus on.
> At this point anyone using these LLMs every day has seen those benchmark numbers go up without an appreciable improvement in the day-to-day experience.
Yes, this is exactly my point. The benchmarks from the makers of these LLMs always seem to show a better and better score, yet the top scores in my own benchmarks have been more or less the same for the last 1.5 years, and I'm trying every LLM I can come across. The "best LLM to date!" is hardly ever actually the best available LLM, and while you could make that judgement by just playing around with LLMs, actually being able to point to specifically why that is, is something I at least find useful. YMMV.
> "For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."
When models figure out how to exploit an effect that every clever college student does, that should count as a win. That’s a much more human-like reasoning ability, than the ability to multiply large numbers or whatever (computers were already good at that, to the point that it has become a useless skill for humans to have). The point of these LLMs is to do things that computers were bad at.
I don’t think the fact that LLMs can handle small numbers more reliably has anything to do with their reasoning ability. To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
However:
> Testing only on these problems would not predict performance on larger numbers, where LLMs struggle.
Since performance on large numbers is not what these exams are intended to test for, I don’t see this as a counterargument, unless the benchmarks are misrepresenting what is being tested for.
> reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
Or given a calculator. Which it's running on. Which it in some sense is. There's something deeply ironic about the fact that we have an "AI" running on the most technologically advanced calculator in the history of mankind and...it can't do basic math.
This is like saying it's ironic that an alternator in a car cannot combust gasoline when the gasoline engine is right beside it, even though the alternator 'runs' on the gasoline engine.
Or similarly having a gasoline engine without an alternator and making the observation that there's an absurdity there in that you're generating large amounts of energy, yet aren't able to charge a relatively small 12V battery with any of it. It's a very practical and natural limitation, yet in some sense you have exactly what you want - energy - you just can't use it because of the form. If you step back there's an amusing irony buried in that. At least in my humble opinion :-)
Thing is, an LLM is nothing but a prediction algorithm based upon what it was trained on. So its missing basic calculator functionality is a given. This is why tool usage is more and more of a thing for LLMs: so that the LLM can by itself use a calculator for the actual math parts it needs, thus increasing accuracy ...
If they were selling LLMs as “LLMs” instead of magic code-writing, answer-giving PhD replacements, the lack of basic arithmetic capability would be a given… but they aren’t. Judging a paid service using their own implied claims is perfectly reasonable.
Why is it a given? The universal approximation theorem should apply since addition is a continuous function. Now whether the network is sufficiently trained for that is another question but I don’t think it's a given that a trillion parameter model can’t approximate the most basic math operations.
I think the tokenization is a bigger problem than the model itself.
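A quick way to see the tokenization issue for yourself, using one public GPT-style encoding (cl100k_base here, purely as an example):

```python
import tiktoken

# cl100k_base is just one public GPT-style encoding, used here as an example.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["2+2=4", "123456789 * 987654321"]:
    pieces = [enc.decode([t]) for t in enc.encode(text)]
    print(text, "->", pieces)

# Large numbers get chopped into irregular multi-digit chunks, so the model
# never sees a clean digit-by-digit representation to carry over, which is
# the tokenization issue pointed at above.
```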
Easy to answer that one ... predictions are based upon accuracy. So if you have an int4 vs a float16, the chance that the prediction goes off is higher with an int4. But even with a float16, you're still going to run into issues where your prediction model goes off. It's going to happen a lot less, but you're still going to get rounding issues, which may result in a 5 being an 8 (just an example).
So while it can look like an LLM calculates correctly, it's still restricted by this accuracy issue. And when you get a single number wrong in a calculation, everything is wrong.
A calculator, by contrast, does not deal with predictions but with basic adding/multiplying/subtracting etc ... Things that are 100% accurate (if we do not count issues like cosmic ray hits, failures in the silicon, etc).
A trillion parameter model is just that, a trillion parameters, but what matters is not the tokens but the accuracy, as in: do they use int, float16, float32, float64 ... The issue is, the higher we go, the more the memory usage explodes.
There is no point in spending terabytes of memory just to get a somewhat accurate predictive calculator, when we can just have the LLM call an actual calculator to ensure its results are accurate.
Think of an LLM more like somebody with Dyslexia / Dyscalculia... It does not matter how good you are, all it takes is to switch one number in an algebraic calculation to get a 0/10 ... The reason why I mention this is because I often think of an LLM like a person with Dyslexia / Dyscalculia. It can have insane knowledge, be smart, but be considered dumb by society because of that less than accurate prediction (or number-swapping issue).
Take it from somebody that wasted a few years in school thanks to that issue: it really does not matter if you're a good programmer later in life, when you flunk a few years thanks to undiagnosed issues. And yet, just like an LLM, I simply rely on tool usage to fix my inaccuracy issues. No point in wasting good shoulder space trying to graft a dozen more heads/brains onto me, when I can simply delegate the issue away. ;)
The fact that we can get computer models that can almost program, write texts, ... and do so much more, like a slightly malfunctioning human, amazes me. And at the same time, I curse at it like my teachers did, and also call it dumb at times hehehe ... I now understand how my teachers felt loool
That's confusing basic arithmetic as a user feature with basic arithmetic as an implementation requirement.
I guarantee that computer vision and email clients both use basic arithmetic in implementation. And it would be trivially easy to bolt a calculator into an email app, because the languages used to write email apps include math features.
That's not true of LLMs. There's math at the bottom of the stack. But LLMs run as a separate closed and opaque application of a unique and self-contained type, which isn't easily extensible.
They don't include hooks into math features on the GPUs, and there's no easy way to add hooks.
If you want math, you need a separate tool call to conventional code.
IMO testing LLMs as if they "should" be able to do arithmetic is bizarre. They can't. They're not designed to. And even if they did, they'd be ridiculously inefficient at it.
> Pretty sure the only thing computer vision does is math.
That is only marginally less pedantic than saying that the only thing computer vision does is run discrete electrical signals through billions of transistors.
Yes, everything that a computer does, it does using math. This does not imply that things running on the computer can do basic arithmetic tasks for the user.
On some level this makes sense, but on the other hand LLMs already have perfect recall of thousands of symbols built into them, which is what pencil and paper gives to a human test taker.
If you're not doing clever hacks for very long windows, I thought the basic design feeds in the entire window, and it's up to the weights to use it properly.
Agreed. I don't like when the prompt sets up a good portion of how to go about finding the answer by saying which tools to use and how. The LLM needs to decide when and how to use them, not the prompt.
I don't think it should be completely open ended. I mean, you could have an "ask_hooman" tool that solves a ton of problems with current LLMs. But that doesn't mean the LLM is capable with respect to the benchmark.
Why not? One of the most intelligent things to do when stuck on a problem is to get outside help.
If allowing this behaviour raises a problem, you can always add constraints to the benchmark such as "final answer must come out under 15s" or something. The LLM can then make the decision to ask around in accordance to the time risk.
Because AI are good at devolving to the highest score, regardless of test intent. For most problems "ask_hooman", or especially the plural, would be much more effective. So, the degenerate case would dominate and tell you precisely zero about the intelligence of the AI. If a specific "tool" is more adept than the "AI" then "choose tool" will always be the correct answer. But I agree, a tight time constraint would help.
I'm not addressing an argument, just stating that's already a form of LLM testing done today for people wanting to look at the difference in results the same as the human analogy.
> To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
People interested can see the results of giving LLMs pen and paper today by looking at benchmarks with tools enabled. It's an addition to what you said, not an attack on a portion of your comment :).
I see now. My focus was on the effect of LLMs’ (and by analogy, humans’) reasoning abilities argued by bee_rider. The fact that tool use can enable more reliable handling of large numbers has no bearing on that, hence I found the reply confusing.
Hmm, maybe it depends on the specific test and the reasoning in it? I certainly think reasoning about how and when to use allowed tools, and when not to, is a big part of the reasoning and verification process. E.g. most human math tests allow for a pen-and-paper calculation, or even a calculator, and that can be a great way to, say, spot check a symbolic derivative and see it needs to be revisited, without relying on the calculator/paper to do the actual reasoning for the testee. Or to see that the equation of motion for a system can't possibly have been right with some test values (without which I'm not sure I'd have passed my mid-level physics course haha).
At the very least, the scores for benchmarking a human on such a test with and without tools would be different to comparing an LLM without the analogous constraints. Which is (IMO) a useful note in comparing reasoning abilities and why I thought it was interesting to note this kind of testing is just called testing with tools on the LLM side (not sure there is an equally as standard term on the human testing side? Guess the same could be used for both though).
At the same time I'm sure other reasoning tests don't gain much from/expect use of tools at all. So it wouldn't be relevant for those reasoning tests.
> Since performance on large numbers is not what these exams are intended to test for,
How so? Isn't the point of these exams to test arithmetic skills? I would hope we'd like arithmetic skills to be at a constant level regardless of the size of the number?
No. AIME is a test for advanced high schoolers that mostly tests higher level math concepts like algebra and combinatorics. The arithmetic required is basic. All the answers are 3-digit numbers so that judging is objective and automated while making guessing infeasible. You have 12 minutes on average for each question, so even if you are terribly slow at arithmetic, you should still be able to calculate the correct answer if you can perform all the other math.
That's probably a great test for high schoolers but it doesn't really test what we want from AI, no? I would expect AI to be limited by the far greater constraints of its computing ability, and not the working memory of a human high schooler.
College exam takers use those tricks because they are on a time limit and are gaming the system. It's clever and wink wink nudge nudge ok everyone does it. But it's one tiny signal in a huge spectrum of things we use to evaluate people.
Instead, these metrics are gamed and presented as the entire multi-spectral signal of competence for LLMs, because it is literally impossible to say that success in one domain would translate the way it might with a good hire.
What I want is something I don't have to guard against gaming. Something conscientious and capable like my coworkers. Until then it's Google version 2 married to IntelliSense, and I'm not letting it do anything by itself.
IMO I think the calculator problem goes away with tool use or NN architectures that basically add a calculator equivalent as one of the potential 'experts' or similar. It won't be much of a trope for longer.
LLMs can probably be taught or configured to use external tools like Excel or Mathematica when such calculations are needed. Just like humans. There are plenty of untapped optimization opportunities.
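A minimal sketch of that kind of wiring, using the OpenAI tool-calling interface; the model name is a placeholder, and it assumes the model actually decides to call the advertised calculator tool rather than guess the digits:

```python
import json
from openai import OpenAI

client = OpenAI()  # the model name below is a placeholder

# Advertise a calculator tool; the model decides when to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate a basic arithmetic expression exactly.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

messages = [{"role": "user", "content": "What is 123456789 * 987654321?"}]
resp = client.chat.completions.create(model="gpt-4o-mini",
                                      messages=messages, tools=tools)

# Assumes the model chose to call the tool.
call = resp.choices[0].message.tool_calls[0]
expr = json.loads(call.function.arguments)["expression"]

# Exact integer arithmetic; a real system would use a safe parser, not eval.
result = str(eval(expr, {"__builtins__": {}}))

messages += [resp.choices[0].message,
             {"role": "tool", "tool_call_id": call.id, "content": result}]
final = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(final.choices[0].message.content)
```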
I tried making a spreadsheet application and found that they're not that great at working with 2D data, especially if there's a lot of it. It's harder to search a large spreadsheet than a large text file - you might get a range of thousands of numbers; how do you search that? And things like headers or important information may not be anywhere near where it's focused, which means it needs to read a ton of irrelevant context. For small sheets it works perfectly though; it'll have to be something I'll take another look at in the future.
> The point of these LLMs is to do things that computers were bad at.
That's a good point imo but we achieved this stuff by at least 2022 when ChatGPT was released. The thing about these giant black boxes is that they also fail to do things that directly human-written software ("computers") does easily. The inability to print text onto generated images or do general arithmetic is important. And sure, some of these limits look like "limits of humans". But it is important to avoid jumping from "they do this human-thing" to "they're like humans".
I don't claim to know anything, but I thought tool usage was a major sign of intelligence. For example, floats are a wonderful technology, but people use them as if chainsaws are great for cutting bread and butter. We now have entire languages that can't do basic arithmetic. I thought it was alarming: people, it can't compute like this! Now we have language models, which are still computers, so why can't we just give them... you know... calculators? Arguably the best thing their universe has to offer.
edit: I forgot my point: calculating big numbers is not a real world problem anyone has.
>the point of these LLMs is to do things that computers were bad at.
The way they’re being deployed it feels like the point of LLMs is largely to replace basic online search or to run your online customer support cheaply.
I’m a bit out on a limb here because this is not really my technical expertise by any stretch of the imagination, but it seems to me these benchmark tests don’t really tell us much about how LLMs perform in the ways most people actually use them. Maybe I’m off base here though.
Nobody really knows "the point" of LLMs yet. They weren't even "invented" as much as they emerged as a trick to get computers to better understand human language.
They're still brand spanking new and everyone's trying to figure out how to best use them. We don't even really know if they're ever going to be "really good at" any given task!
Are they "really good at" these things or are they merely "OK-ish"?
* Answering factual questions.
* Programming.
* Understanding what the user wants from natural language.
* Searching/recommending stuff.
Real world testing suggests that with billions and billions of dollars spent, you really can get an LLM to be "OK-ish" at all those things :D
Yet literally hundreds of billions of dollars are being invested in them. That’s what’s so concerning. And I can tell you not one of these startups would EVER acknowledge the truth of your statement.
We should make a collective git repo full of every kind of annoying bug we (expert developers) can think of. Then use that to benchmark LLMs.
Someone want to start? I've got a Yjs/CRDT collaborative editing bug that took like a week and a half of many, many attempts with Claude Code (Sonnet 4.5), GPT5-codex (medium), and GLM-4.6 to figure out. Even then they didn't really get it... Just came up with a successful workaround (which is good enough for me but still...).
Aside: You know what really moved the progress bar on finding and fixing the bug? When I had a moment of inspiration and made the frontend send all its logs to the backend so the AIs could see what was actually happening on the frontend (near real-time). Really, I was just getting sick of manual testing and pasting the console output into the chat (LOL). Laziness FTW!
I have the Google Chrome Dev Tools MCP but for some reason it doesn't work as well :shrug:
Have you tried the Playwright libraries? Not the MCP, instead telling Claude Code to use the Node.js or Python Playwright libraries directly. I have had some really good results for this for gnarly frontend challenges.
I don't really like MCPs, at least when I'm working with coding agents like Claude Code or Codex CLI. I'd rather let the agents write code that can do anything the underlying library is capable of, rather than restricting them to just the functionality that the MCP exposes.
It's more token efficient too since I don't need to load the full MCP description into my context.
When I have a bug I’m iterating on it’s much easier and faster to have it write out the playwright script. That way it does not have to waste time or tokens performing the same actions over and over again.
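A minimal sketch of that kind of repro script, using the Playwright Python library directly (the URL and selector are placeholders for whatever app you're debugging):

```python
# Minimal repro script using the Playwright Python library directly (not the
# MCP); the URL and selector are placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Stream everything the frontend logs straight into the agent's output.
    page.on("console", lambda msg: print(f"[console.{msg.type}] {msg.text}"))
    page.on("pageerror", lambda err: print(f"[pageerror] {err}"))

    page.goto("http://localhost:3000/editor")  # placeholder URL
    page.fill("#doc", "trigger the collaborative-editing bug here")
    page.wait_for_timeout(2000)                # let the CRDT sync settle

    browser.close()
```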
> took like a week and a half of attempts with Claude Code ...
What kind of expert developer wastes that much time prompting a bunch of different LLMs to end up with a workaround, instead of actually debugging and fixing the bug themselves?
To be charitable to the parent poster, I've had multi-week bugs that turned out to be a tiny change, where every test iteration took hours of compile time...
There is a lot of disdain for vibe coding/coders, as I'm sure you already know. I was going to post something similar as soon as I read "a week and a half" of prompts. I pray that any gainfully employed expert coders don't spend 10 days prompting rather than coding lol
> We should make a collective git repo full of every kind of annoying bug we (expert developers) can think of. Then use that to benchmark LLMs.
I think any LLM user worth their salt has been doing this pretty much since we got API access to LLMs, as otherwise there is no way to actually see if they can solve the things you care about.
The only difference is that you must keep the actual benchmarks to yourself, don't share them with anyone and even less put them publicly. The second you do, you probably should stop using it as an actual benchmark, as newly trained LLMs will either intentionally or unintentionally slurp up your benchmark and suddenly it's no longer a good indicator.
I think I personally started keeping my own test cases for benchmarking around the GPT-3 launch, when it became clear the web would be effectively "poisoned" from that point on, and anything on the public internet could be slurped up by the people feeding the LLMs training data.
Once you have this up and running, you'll get a much more measured view of how well new LLMs work, and you'll quickly see that a lot of the fanfare doesn't actually hold up when testing it against your own private benchmarks. On a happier note, you'll also be surprised when a model suddenly does a lot better in a specific area that wasn't even mentioned at release, and then you could switch to it for specifically that task :)
I actually started a collection of annoying bugs I’ve seen in the wild. I give the llm the buggy implementation and ask it to write a test that catches it. So far not even a frontier model (Claude Sonnet) can do it, even though they can find and fix the bug itself.
This may be intentional, but I'd like to point out that you're basically suggesting that others aggregate high-quality training data for AI companies to use, free of charge, to replace software engineers.
Benchmarks are nothing more than highly contextual specs (in traditional code). They demonstrate your code works in a certain way in certain use cases, but they do not prove your code works as expected in all use cases.
> Our systematic review of 445 benchmarks reveals prevalent gaps that undermine the construct validity needed to accurately measure targeted phenomena
Intelligence has an element of creativity, and as such the true measurement would be on metrics related to novelty, meaning tasks that have very little resemblance to any other existing task. Otherwise it's hard to parse out whether it's solving problems based on pattern recognition instead of actual reasoning and understanding. In other words, "memorizing" 1000 of the same type of problem, and solving #1001 of that type is not as impressive as solving a novel problem that has never been seen before.
Of course this presents challenges to creating the tests because you have to avoid however many petabytes of training data these systems are trained with. That's where some of the illusion of intelligence arises from (illusion not because it's artificial, since there's no reason to think the brain algorithms cannot be recreated in software).
In my opinion a major weakness in how people reason about this issue is that they describe solving problems as EITHER recall of an existing solution OR creative problem solving. Sure it is possible for a specific solution to be recalled, but it's not possible for a problem to be absolutely unrelated to anything the system has ever seen before and still be solvable. There are many shades of gray in the similarity a problem may have to previously seen problems. In fact, I expect that there are as many shades of gray as there are problems.
The difference is that humans don't memorize petabytes of problems, so from a relative perspective people are constantly solving novel problems they never saw before. I'm thinking this is a requirement for dynamic, few-shot learning. We can clearly see LLMs fail when you throw even a small wrench in the prompt.
Humans encounter massive numbers of problems in their experience that informs their problem solving. The same is true of LLM. LLMs do not actually have all their training data memorized.
I’m not sure what your basis is for saying “LLMs fail if there is a small wrench in the prompt.” They also succeed despite wrenches in the prompt with great regularity.
Let's clarify, this isn't about whether the models are capable. They are very capable and impressive. This is more about whether we can use the same type of metric we use for humans to compare and conclude if they are "intelligent".
It's not just semantics, the metrics are supposed to tell us the potential of the model. If they can solve extremely hard PhD problems, it should be the case that we're already in the singularity, and they should be solving absolutely everything in whatever field they were trained in, because it's not just PhD level, it's a machine that has a ton of memory, compute and never sleeps. However, once you use these models extensively, it becomes apparent they are just synthesizing data, and not as much understanding it in a way that would allow them to extrapolate into anything else as humans do.
I think this point is a little hard to explain. I'll just emphasize, these are smart systems, and they can do a lot, but there is still a disconnect between, let's say, a PhD level model and a human with a PhD, in the "quality" of what we would call "intelligence" of both entities (human and machine).
Human metrics of intelligence have always felt like rubbish; we never did this well. I would describe intelligence as effective adaptation leading to survival and growth or prospering. Memorization, comprehension, speed of response, etc. are magnifying factors that we value and view as components of intelligence, but LLMs are proving this is not the whole: without effective application, they are not intelligence. Perhaps learning is the difference? How to measure that?
Someone describing string theory is the literary equivalent of fractal structures in snowflakes. Lovely, complex, possibly unique, but not proof of a level of intelligence. For the string theorist maybe it is intelligent, perhaps persuading someone to fund their grant, which enables them to eat, shelter, etc. Might be a bit harsh on string theory. Saying it is proof of an amount of intelligence leads us to falsifiable statements.
Benchmarks optimize for fundraising, not users. The gap between "state of the art" and "previous gen" keeps shrinking in real-world use, but investors still write checks based on decimal points in test scores.
we try to make benchmarks for users, but it's like that 20% article - different people want different 20% and you just end up adding "features" and whackamoling the different kinds of 20%
if a single benchmark could be a universal truth, and it was easy to figure out how to do it, everyone would love that.. but that's why we're in the state we're in right now
The problem isn’t with the benchmarks (or the models, for that matter) it’s their being used to prop up the indefensible product marketing claims made by people frantically justifying asking for more dump trucks of thousand-dollar bills to replace the ones they just burned through in a few months.
I wish the big providers would offer some sort of trial period where you can evaluate models in a _realistic_ setting yourself (i.e. CLI tools or IDE integrations). I wouldn't even mind strict limits -- just give me two hours or so of usage and I'd already be happy. Seriously.
My use-case is probably pretty far from the usual tasks: I'm currently implementing a full observability platform based on VictoriaMetrics / Victorialogs + Grafana. It's quite elaborate and has practically no overlap with the usual/cloud solutions you find out there. For example, it uses an authenticated query stack: I use the Grafana oauth token to authenticate queries by injecting matchers via prom-label-proxy and forward that to promxy for fan-out to different datasources (using the label filter to only query some datasources). The IaC stuff is also not mainstream as I'm not using any of the big cloud providers, but the provider I use nonetheless has a terraform provider.
As you can imagine, there's probably not much training data for most of this, so quality of the responses varies widely. From my experience so far, Claude (Sonnet 4.5) does a _much_ better job than GPT-5 (Codex or normal) with the day-to-day tasks. Stuff like keeping documentation up to date, spotting inconsistencies, helping me find blind spots in the alerting rules, etc. It also seems to do better working with provided documentation / links.
I've been using Claude for a couple of weeks now but recently switched to codex after my subscription to Claude ran out. I was really curious after reading a lot of good things about it but I gotta say, so far, I'm not impressed. Compared to Claude it gives wrong answers much more frequently (at least in this domain). The results it produces take much more effort to clean up than Claude's. Probably on a level where I could just invest the time myself. Might be that I do not yet know how to correctly prompt GPT but giving both tools the same prompt, Claude does a better job 90% of the time.
Anyway, I guess this is my long-winded way of saying that the quality of responses "off the beaten track" varies widely and is worth testing several models on. Especially if your work is not 70+% coding. Even then, I guess many benchmarks have ceased being useful by now?
There's the GitHub Copilot 30-day trial? "Access to Anthropic Claude Sonnet 4, GPT-5, Gemini 2.5 Pro, and more. 300 premium requests to use the latest models and code review."
I'd like to see some video generation benchmarks. For example, one that tested a model's ability to generate POV footage of a humanoid form carrying out typical household tasks.
Even if it requires human evaluators at first, and even if the models completely suck at this task right now: it seems like the kind of task you'd want them to be good at, if you want these models to eventually carry out these tasks in embodied forms in the real world.
Just having the benchmark in the first place is what gives model makers something to optimize for.
Generating footage wouldn't help with the opposite, but navigating a simulation would, which is a pretty standard type of evaluation for multimodal AIs designed to act in the real world.
Do you mean that it wouldn't help with ingesting footage and then determining how to act?
I can imagine a robotics architecture where you have one model generating footage (next frames for what it is currently seeing) and another dumber model which takes in the generated footage and only knows how to generate the motor/servo control outputs needed to control whatever robot platform it is integrated with.
I think that kind of architecture decoupling would be nice. It allows the model with all the world and task-specific knowledge to be agnostic from its underlying robot platform.
yes it does - it has to be meaningful or rigorous for the comparative ranking to be meaningful or rigorous, or else wtf are you doing? Say I have all the information on my side but only these questions that you are showing the user? Who cares about that comparison?
Humans are much better at out of sample prediction than LLMs. And inherently benchmarks cannot be out of sample. So I believe that leads to the disconnect between LLMs getting better and better at in sample prediction (benchmarks) while not improving nearly as much at out of sample (actual work).
For statistical AI models, we can use out of sample prediction error as an objective measure to compare models. What makes evaluating LLMs difficult is that comparisons are inextricable from utility (whereas statistical AI models do have a pre-utility step wherein it can be shown out of sample prediction epsilon is minimized).
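A small sketch of that "pre-utility" step for statistical models, on synthetic data with scikit-learn; the point is that held-out prediction error gives a single objective number for comparing models, which has no clean analogue for LLM utility:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    print(type(model).__name__, f"held-out MSE: {mse:.3f}")

# There is no analogous single number for an LLM: low held-out loss on text
# is not the utility anyone actually cares about.
```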
They should laugh while they can ;) Still waiting for the crash and to see what lives on and what gets recycled. My bet is that grok is here to stay ;)
(Don't hurt me, I just like his chatbot. It's the best I've tried at, "Find the passage in X that reminded me of the passage in Y given this that and the other thing." It has a tendency to blow smoke if you let it, but they all seek to affirm more than I'd like, but ain't that the modern world? It can also be hilariously funny in surprisingly apt ways.)
If models get commoditised, distribution (and vertical integration) become key. OpenAI and xAI are the only companies that seem to be well hedged for this risk.
The problem with the LLM benchmarks is that if you see one that shows high performance by something that isn’t from Anthropic, Google or OpenAI, you don’t believe it, even if it were “true.” In that sense, benchmarks are a holistic social experience in this domain, less a scientific endeavour.
Tech companies/bloggers/press/etc are perpetually bad at benchmarks. For browsers they kept pushing simplistic javascript-centric benchmarks even when it was clear for at least 15 years that layout/paint/network/etc were the dominant bottlenecks in real-world usage.
It's primarily marketing-driven. I think the technical parts of companies need to attempt to own this more.
It gets really weird when engineering priorities shift because of these mostly irrelevant benchmarks.
I'm already quite put off by the title (it's science -- if you have a better benchmark, publish it!), but the contents aren't great either. It keeps citing numbers about "445 LLM benchmarks" without confirming whether any of the ones they deem insufficiently statistical are used by any of the major players. I've seen a lot of benchmarks, but maybe 20 are used regularly by large labs, max.
"For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."
For a math-based critique, this seems to ignore a glaring problem: is it even possible to randomly sample all natural numbers? As another comment pointed out we wouldn't even want to ("LLMs can't accurately multiply 6-digit numbers" isn't something anyone cares about/expected them to do in the first place), but regardless: this seems like a vacuous critique dressed up in a costume of mathematical rigor.
At least some of those who design benchmark tests are aware of these concerns.
In related news, at least some scientists studying climate change are aware that their methods are imperfect. More at 11!
When people claim that there is such a thing as "X% accuracy in reasoning", it's really hard to take anything else seriously, no matter how impressive.
AI (and humans!) aside, claiming that there was an oracle that could "answer all questions" is a solved problem. Such a thing cannot exist.
But this is going already too deep IMO.
When people start talking about percentages or benchmark scores, there has to be some denominator.
And there can be no bias-free such denominator for
- trivia questions
- mathematical questions (oh, maybe I'm wrong here, intuitively I'd say it's impossible for various reasons: varying "hardness", undecidable problems etc)
- historical or political questions
I wanted to include "software development tasks", but it would be a distraction. Maybe there will be a good benchmark for this; I'm aware there are plenty already. Maybe AI will be capable of being a better software developer than me in some capacity, so I don't want to include this part here. That also maps pretty well to "the better the problem description, the better the output", which doesn't seem to work so neatly with the other categories of tasks and questions.
Even if the whole body of questions/tasks/prompts would be very constrained and cover only a single domain, it seems impossible to guarantee that such benchmark is "bias-free" (I know AGI folks love this word).
Maybe in some interesting special cases? For example, very constrained and clearly defined classes of questions, at which point, the "language" part of LLMs seems to become less important and more of a distraction. Sure, AI is not just LLMs, and LLMs are not just assistants, and Neural Networks are not just LLMs...
That's where the problem begins, to be honest: I don't even know how to align the "benchmark" claims with the kind of AI they are examining and the ones I know exist.
Sure it's possible to benchmark how well an AI decides whether, for example, a picture shows a rabbit.
Even then: for some pictures, it's gotta be undecidable, no matter how good the training data is?
I'm just a complete layman and commenting about this; I'm not even fluent in the absolute basics of artificial neural networks like perceptrons, gradient descent, backpropagation and typical non-LLM CNNs that are used today, GANs etc.
I am and was impressed by AI and deep learning, but to this day I am thoroughly disappointed by the hubris of snake-oil salespeople who think it's valuable and meaningful to "benchmark" machines on "general reasoning".
I mean, it's already a thing in humans. There are IQ tests for the non-trivia parts. And even these have plenty of discussion revolving around them, for good reason.
Is there some "AI benchmark" that exclusively focuses on doing recent IQ tests on models, preferably editions that were published after the particular knowledge cutoff of the respective models? I found (for example) this study [1], but to be honest, I'm not the kind of person who is able to get the core insights presented in such a paper by skimming through it.
Because I think there are impressive results, it's just becoming very hard to see through the bullshit as an average person.
I would also love to understand more about the current state of the research on the "LLMs as compression" topic [2][3].
I've been getting flagged by high-on-their-own-supply AI boosters for identifying that LLM benchmarks have been obvious bullshit for at least the last year and a half.
What changed to make "the inevitable AI bubble" the dominant narrative in last week or so?
Link those comments please because I checked your history and the flagged ones were pure nonsense with zero insights. Also, calling out LLM benchmarks has never been a radical take and basically the default on this site.
It is possible to be right on the main theme but only by accident (with the arguments and claims being wrong), communicating in a highly faulty way, with pointless insults, doing it in offtopic derails, being correct on a minor point while being mostly wrong, etc.
Can you link some of these comments you consider useful but got flagged?
The market was down for AI-related stocks especially. While it was only down a bit over 3%, it's the worst week since April, and there's no single event to blame; it just looks like market sentiment has shifted away from the previous unchecked exuberance.
“Here’s the throughput at sustained 100% load with the same ten sample queries repeated over and over.”
“The customers want lower latency at 30% load for unique queries.”
“Err… we can scale up for more throughput!”
ಠ_ಠ
And then when you ask if they disabled the query result cache before running their benchmarking, they blink and look confused.
Then you see a 25% cache hit rate in production and realise that disabling it for the benchmark is not a good option either.
I'd say your experience is being more monetized, for growth for growth's sake.
In AI though, you also have the world trying to compete with you, so even if you do totally cheat and put the benchmark answers in your training set and over fit, if it turns out that you model sucks, it doesn't matter how much your marketing department tells everyone you scored 110% on SWE bench, if it doesn't work out that well in production, your announcement's going to flow as users discover it doesn't work that well on their personal/internal secret benchmarks and tell /r/localLLAMA it isn't worth the download.
Whatever happened with Llama 4?
Even a p-value is insufficient. Maybe we can use some of this stuff https://web.stanford.edu/~swager/causal_inf_book.pdf
I have actually been thinking of hiring some training contractors to come in and teach people the basics of applied statistical inference. I think with a bit of internal selling, engineers would generally be interested enough to show up and pay attention. And I don't think we need very deep expertise, just a moderate bump in the ambient level of statistical awareness would probably go a long way.
It's not like there's a shortage of skills in this area, it seems like our one specific industry just has a weird blindspot.
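As a sketch of what going beyond "here's the delta of the means" can look like in practice, here is a percentile-bootstrap confidence interval for the difference between two benchmark runs. The samples and iteration count below are made up for illustration.

    import random

    def bootstrap_diff_ci(a, b, iters=10_000, alpha=0.05):
        # Percentile bootstrap CI for mean(b) - mean(a).
        diffs = []
        for _ in range(iters):
            ra = [random.choice(a) for _ in a]  # resample with replacement
            rb = [random.choice(b) for _ in b]
            diffs.append(sum(rb) / len(rb) - sum(ra) / len(ra))
        diffs.sort()
        return diffs[int(alpha / 2 * iters)], diffs[int((1 - alpha / 2) * iters) - 1]

    # Made-up latency samples (ms) from a baseline config and a candidate config.
    baseline = [102, 98, 110, 97, 105, 99, 101, 108]
    candidate = [95, 97, 99, 94, 96, 101, 93, 98]
    print(bootstrap_diff_ci(baseline, candidate))
    # If the interval excludes 0, the delta is probably not just noise.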
Don’t most computer science programs require this? Mine had a statistics requirement
I don't know how it is in the US and other countries, but in my country I would say statistics is typically not taught well, at least in CS degrees. I was a very good student, always had a good understanding of the subjects at university, but in the case of statistics they just taught us formulae and techniques as dogmas, without much explanation of where they came from, why, and when to use them. It didn't help either that the exercises we did always applied them to things outside CS (clinical testing, people's heights and things like that) with no application we could directly relate to. As a result, when I finished the degree I had forgotten most of it, and when I started working I was surprised to find that it was actually useful.
When I talk about this with other CS people in my own country (Spain) they tend to report similar experiences.
A/B testing is radioactive too. It's indirectly optimizing for user feedback - less stupid than directly optimizing for user feedback, but still quite dangerous.
Human raters are exploitable, and you never know whether the B has a genuine performance advantage over A, or just found a meat exploit by accident.
It's what fucked OpenAI over with 4o, and fucked over many other labs in more subtle ways.
Are you talking about just preferences or A/B tests on like retention and engagement? The latter I think is pretty reliable and powerful though I have never personally done them. Preferences are just as big a mess: WHO the annotators are matters, and if you are using preferences as a proxy for like correctness, you’re not really measuring correctness you’re measuring e.g. persuasion. A lot of construct validity challenges (which themselves are hard to even measure in domain).
Yes. All of them are poisoned metrics, just in different ways.
GPT-4o's endless sycophancy was great for retention, GPT-5's style of ending every response in a question is great for engagement.
Are those desirable traits though? Doubt it. They look like simple tricks and reek of reward hacking - and A/B testing rewards them indeed. Direct optimization is even worse. Combining the two is ruinous.
Mind, I'm not saying that those metrics are useless. Radioactive materials aren't useless. You just got to keep their unpleasant properties in mind at all times - or suffer the consequences.
> Brittle performance – A model might do well on short, primary school-style maths questions, but if you change the numbers or wording slightly, it suddenly fails. This shows it may be memorising patterns rather than truly understanding the problem
This finding really shocked me
The big problem is that tech companies and journalists aren't transparent about this. They tout benchmark numbers constantly, like they're an objective measure of capabilities.
That's because they are as close to an "objective measure of capabilities" as anything we're ever going to get.
Without benchmarks, you're down to evaluating model performance based on vibes and vibes only, which plain sucks. With benchmarks, you have numbers that correlate to capabilities somewhat.
So because there isn't a better measure it's okay that tech companies effectively lie and treat these benchmarks like they mean more than they actually do?
That's assuming these benchmarks are the best we're ever going to get, which they clearly aren't. There's a lot to improve even without radical changes to how things are done.
In my experience everyone openly talks about how benchmarks are bullshit. On Twitter or on their podcast interviews or whatever everyone knows benchmarks are a problem. It's never praise.
Of course they tout benchmark numbers because, let's be real, if they didn't tout benchmarks you're not going to bother using it. For example, if someone posts some random model on huggingface with no benchmarks, you just won't proceed.
Humans have a really strong prior to not waste time. We always, always evaluate things hierarchically. We always start with some prior, and then whatever is easiest goes next, even if it's a shitty unreliable measure.
For example, for Gemini 3 everyone will start with a prior that it is going to be good. Then they will look at benchmarks, and only then will they move to harder evaluations on their own use cases.
I don't use them regardless of the benchmarks, but I take your point.
Regardless though, I think the marketing could be more transparent
I also work in LLM evaluation. My cynical take is that nobody is really using LLMs for stuff, and so benchmarks are mostly just made-up tasks (coding is probably the exception). If we had real, specific use cases it should be easier to benchmark and know if one is better, but it's mostly all hypothetical.
The more generous take is that you can't benchmark advanced intelligence very well, whether LLM or person. We don't have good procedures for assessing a person's fit-for-purpose, e.g. for a job, and certainly not standardized question sets. Why would we expect to be able to do this with AI?
I think both of these takes are present to some extent in reality.
We have 20+ services in prod that use LLMs. So I have 50k (or more) data points per service per day to evaluate. The question is: do people actually evaluate properly?
And how do you do an apples to apples evaluation of such squishy services?
Do you not have massive volumes of customer queries to extract patterns for what people are actually doing?
We struggle a bit with processing and extracting this kind of insight in a privacy-friendly way, but there’s certainly a lot of data.
You could have the world expert debate the thing. Someone who can be accused of knowing things. We have many such humans, at least as many as topics.
Publish the debate as-is so that others vaguely familiar with the topic can also be in awe or disgusted.
We have many gradients of emotion. No need to try to quantify them. Just repeat the exercise.
Has your lab tried using any of the newer causal inference–style evaluation methods? Things like interventional or counterfactual benchmarking, or causal graphs to tease apart real reasoning gains from data or scale effects. Wondering if that’s something you’ve looked into yet, or if it’s still too experimental for practical benchmarking work.
What gets measured, gets managed and improved, though.
I've written about Humanity's Last Exam, which crowdsources tough questions for AI models from domain experts around the world.
https://www.happiesthealth.com/articles/future-of-health/hum...
It's a shifting goalpost, but one of the things that struck me was how some questions could still be trivial for a fairly qualified human (a doctor in this case) but difficult for an AI model. Reasoning, visual or logic, is built on a set of assumptions that are better gained through IRL experience than crawling datasets and matching answers.
This leads me to believe that much of the future for training AI models will lie in exposing them to "meatspace" and annotating their inferences, much like how we train a child. This is a long, long process, and one that is already underway at scale. But it's what might give us emergent intelligences rather than just a basket of competing yet somehow-magic thesauruses.
Mercor is doing nine-digit per-year revenue doing just that. Micro1 and others also.
Benchmarks are like SAT scores. Can they guarantee you'll be great at your future job? No, but we are still roughly okay with what they signify. Clearly LLMs are getting better in meaningful ways, and benchmarks correlate with that to some extent.
There’s no a priori reason to expect a test designed to test human academic performance would be a good one to test LLM job performance.
For example a test of “multiply 1765x9392” would have some correlation with human intelligence but it wouldn’t make sense to apply it to computers.
Actually… ask gpt1 to multiply 1765x9392.
I wish this was more broadly explained to people…
There are LLMs, the engines that make these products run, and then the products themselves.
GPT-anything should not be asked math problems. LLMs are language models, not math engines.
The line is going to get very blurry because ChatGPT, Claude, and Gemini are not LLMs. They're products driven by LLMs.
The question or requisite should not be "can my LLM do math?" It's "can I build an LLM-driven product that can reason through math problems?" Those are different things.
A coworker of mine told me that GPT’s LLM can use Excel files. No, it can’t. But the tools they plugged into it can.
> A coworker of mine told me that GPT’s LLM can use Excel files. No, it can’t. But the tools they plugged into it can.
It's a bit like saying that a human can't use Excel files, but when given a keyboard, mouse and monitor connected to a computer running Excel, it can. But then obviously the "Excel usage" competency is in the human, not in the tools, and a cat for example cannot use Excel proficiently however many training hours it gets and however good the keyboard is.
Taking it back to the LLMs, it is clear to me that some modern LLMs like the one running ChatGPT can be integrated with tools in a way that makes them somewhat proficient with Excel, while other simpler LLMs cannot, regardless of the tools.
> A coworker of mine told me that GPT’s LLM can use Excel files. No, it can’t. But the tools they plugged into it can.
And there's a 50/50 chance they'll use the right tool for the job. I tried the math question above multiple times on GPT-5 and it gets it right about 50% of the time. If I ask it to "try again" it usually gets it on the 2nd or 3rd try. Most times that it's wrong, it's not far off, but it looks deceptively accurate at first glance.
Isn’t this like grading art critics?
We took objective computers, and made them generate subjective results. Isn’t this a problem that we already know there’s no solution to?
That grading subjectivity is just subjective itself.
People often use "clearly" or "obviously" to elide the subject that is under discussion. People are saying that they do not think that it is clear that LLMs are getting better in meaningful ways, and they are saying that the benchmarks are problematic. "Clearly" isn't a counterargument.
Definitely one of the weaker areas in the current LLM boom. Comparing models, or even different versions of the same model, is a pseudo-scientific mess.
I'm still using https://lmarena.ai/leaderboard. Perhaps there is something better and someone will pipe up to tell me about it. But we use LLMs at work and have unexplainable variations between them.
And when we get a prompt working reliably on one model, we often have trouble porting it to another LLM - even straight "version upgrades" such as from GPT-4 to -5. Your prompt and your model become highly coupled quite easily.
I dunno what to do about it and am tending to just pick Gemini as a result.
Ratings on LMArena are too easily gamed.
Even professional human evaluators are quite vulnerable to sycophancy and overconfident-and-wrong answers. And LMArena evaluators aren't professionals.
A lot of the sycophancy mess that seeps from this generation of LLM stems from reckless tuning based on human feedback. Tuning for good LMArena performance has similar effects - and not at all by a coincidence.
It's biased to small context performance, which is why I don't pay much attention to it as a developer aside from a quick glance. I need performance at 40-100k tokens which models like Deepseek can't deliver but Gemini 2.5 Pro and ChatGPT 5.0 Thinking can.
And even "long term performance" splits itself into "performance on multi-turn instruction following" and "performance on agentic tasks" down the line. And "performance on agentic tasks" is a hydra in itself.
Capturing LLM performance with a single metric is a hopeless task. But even a single flawed metric beats no metrics at all.
I'd rather quit than be forced to beta test idiocracy. What's your company so we can all avoid it?
Psychometric testing of humans has a lot of difficulties, too. It's hard to measure some things.
This is something I've struggled with for my site. I made https://aimodelreview.com/ to compare the outputs of LLMs over a variety of prompts and categories, allowing a side-by-side comparison between them. I ran each prompt 4 times for each model, with different temperature values available as toggles.
My thinking was to just make the responses available to users and let them see how models perform. But from some feedback, turns out users don't want to have to evaluate the answers and would rather see a leaderboard and rankings.
The scalable solution to that would be LLM-as-judge, which some benchmarks already use, but that just feels wrong to me.
LM Arena tries to solve this with the crowd sourced solution, but I think the right method would have to be domain expert human reviewers, so like Wirecutter VS IMDb, but that is expensive to pull off.
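For reference, a bare-bones sketch of what the LLM-as-judge pattern mentioned above usually involves; call_judge is a hypothetical stand-in for whatever model API you'd actually use, and the prompt and parsing are deliberately simplified.

    # call_judge() is a hypothetical placeholder for a real model API call.
    def call_judge(prompt: str) -> str:
        raise NotImplementedError("plug in your provider's client here")

    JUDGE_PROMPT = (
        "You are grading two answers to the same question.\n"
        "Question: {q}\nAnswer A: {a}\nAnswer B: {b}\n"
        "Reply with exactly one letter: A, B, or T for a tie."
    )

    def judge_pair(q, answer_a, answer_b):
        # Ask twice with the answers swapped to dampen position bias,
        # one of the better-known failure modes of LLM judges.
        first = call_judge(JUDGE_PROMPT.format(q=q, a=answer_a, b=answer_b)).strip()
        swapped = call_judge(JUDGE_PROMPT.format(q=q, a=answer_b, b=answer_a)).strip()
        swapped = {"A": "B", "B": "A"}.get(swapped, "T")  # undo the swap
        return first if first == swapped else "T"  # disagreement counts as a tie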
I’m working a lot with TTS (Text-to-Speech), and it’s also a total wild west - even worse than LLMs in some ways. The demos are always perfect, but once you generate hundreds of minutes you start seeing volume drift, pacing changes, random artifacts, and occasional mispronunciations that never show up in the curated clips.
The big difference from LLMs is that we don’t really have production-grade, standardized benchmarks for long-form TTS. We need things like volume-stability across segments, speech-rate consistency, and pronunciation accuracy over a hard corpus.
I wrote up what this could look like here: https://lielvilla.com/blog/death-of-demo/
This is solvable at the level of an individual developer. Write your own benchmark for code problems that you've solved. Verify tests pass and that it satisfies your metrics like tok/s and TTFT. Create a harness that works with API keys or local models (if you're going that route).
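A loose sketch of how small such a harness can be; run_model is a hypothetical adapter over whichever API or local runtime you use, and the cases and checks are placeholders for problems you have actually solved.

    import time

    # Hypothetical adapter: hide the API client or local model behind one function.
    def run_model(model_name: str, prompt: str) -> str:
        raise NotImplementedError

    # Each case: a prompt you actually care about plus a cheap pass/fail check.
    CASES = [
        ("Write a Python function slugify(s) that lowercases s and replaces spaces with '-'.",
         lambda out: "def slugify" in out),
        ("What does HTTP status code 429 mean? One sentence.",
         lambda out: "too many requests" in out.lower()),
    ]

    def score(model_name: str):
        start = time.time()
        passed = sum(1 for prompt, check in CASES if check(run_model(model_name, prompt)))
        return passed, len(CASES), time.time() - start  # passes, total, wall-clock seconds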
At the developer level all my LLM use is in the context of agentic wrappers, so my benchmark is fairly trivial:
Configure aider or claude code to use the new model, try to do some work. The benchmark is pass/fail, if after a little while I feel the performance is better than the last model I was using it's a pass, otherwise it's a fail and I go back.
Building your own evaluations makes sense if you're serving an LLM up to customers and want to know how it performs, but if you are the user... use it and see how it goes. It's all subjective anyway.
> Building your own evaluations makes sense if you're serving an LLM up to customers and want to know how it performs, but if you are the user... use it and see how it goes. It's all subjective anyway.
I'd really caution against this approach, mainly because humans suck at removing emotions and other "human" factors when judging how well something works, but also because comparing across models gets a lot easier when you can see 77/100 vs 91/100 as a percentage score, over your own tasks that you actually use the LLMs for. Just don't share this benchmark publicly once you're using it for measurements.
So what? I'm the one that's using it, I happen to be a human, my human factor is the only one that matters.
At this point anyone using these LLMs every day has seen those benchmark numbers go up without an appreciable improvement in the day-to-day experience.
> So what? I'm the one that's using it, I happen to be a human, my human factor is the only one that matters.
Yeah no, you're right, if consistency isn't important to you as a human, then it doesn't matter. Personally, I don't trust my "humanness", and correctness is the most important thing for me when working with LLMs, so that's what my benchmarks focus on.
> At this point anyone using these LLMs every day have seen those benchmark numbers go up without an appreciable improvement in the day to day experience.
Yes, this is exactly my point. The benchmarks the makers of these LLMs provide seem to always show a better and better score, yet the top scores in my own benchmarks have been more or less the same for the last 1.5 years, and I'm trying every LLM I can come across. These "best LLM to date!" releases hardly ever actually are the "best available LLM", and while you could make that judgement by just playing around with LLMs, actually being able to point to specifically why that is, is something at least I find useful. YMMV.
Well, the openai github is open for writing evaluations. Just add yours there, and it's guaranteed that the next model will perform better on them.
That’s called evals and yes any serious AI project uses them
I think that's what this site is doing: https://aistupidlevel.info/
We have to keep in mind that "solving" might mean having the LLM recognize the pattern of solving something.
> "For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."
When models figure out how to exploit an effect that every clever college student does, that should count as a win. That’s a much more human-like reasoning ability, than the ability to multiply large numbers or whatever (computers were already good at that, to the point that it has become a useless skill for humans to have). The point of these LLMs is to do things that computers were bad at.
I don’t think the fact that LLMs can handle small numbers more reliably has anything to do with their reasoning ability. To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
However:
> Testing only on these problems would not predict performance on larger numbers, where LLMs struggle.
Since performance on large numbers is not what these exams are intended to test for, I don’t see this as a counterargument, unless the benchmarks are misrepresenting what is being tested for.
> reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
Or given a calculator. Which it's running on. Which it in some sense is. There's something deeply ironic about the fact that we have an "AI" running on the most technologically advanced calculator in the history of mankind and...it can't do basic math.
This is like saying it's ironic that an alternator in a car cannot combust gasoline when the gasoline engine is right beside it, even though the alternator 'runs' on the gasoline engine.
Or similarly having a gasoline engine without an alternator and making the observation that there's an absurdity there in that you're generating large amounts of energy, yet aren't able to charge a relatively small 12V battery with any of it. It's a very practical and natural limitation, yet in some sense you have exactly what you want - energy - you just can't use it because of the form. If you step back there's an amusing irony buried in that. At least in my humble opinion :-)
Thing is, an LLM is nothing but a prediction algorithm based upon what it was trained on. So it missing basic calculator functionality is a given. This is why tool usage is more and more a thing for LLMs, so that the LLM can by itself use a calculator for the actual math parts it needs, thus increasing accuracy ...
If they were selling LLMs as “LLMs” instead of magic code-writing, answer-giving PhD replacements, the lack of basic arithmetic capability would be a given… but they aren’t. Judging a paid service using their own implied claims is perfectly reasonable.
Why is it a given? The universal approximation theorem should apply since addition is a continuous function. Now whether the network is sufficiently trained for that is another question but I don’t think it's a given that a trillion parameter model can’t approximate the most basic math operations.
I think the tokenization is a bigger problem than the model itself.
Easy to answer that one ... predictions are based upon accuracy. So if you have an int4 vs a float16, the chance that the prediction goes off is higher with an int4. But even with a float16, you're still going to run into issues where your prediction model goes off. It's going to happen a lot less, but you're still going to get rounding issues, which may result in a 5 becoming an 8 (just an example).
So while it can look like an LLM calculates correctly, it's still restricted by this accuracy issue. And when you get a single number wrong in a calculation, everything is wrong.
A calculator, on the other hand, does not deal with predictions but with basic adding/multiplying/subtracting etc ... things that are 100% accurate (if we do not count issues like cosmic rays hitting, failures in silicon etc).
A trillion parameter model is just that, a trillion parameters, but what matters is not the tokens but the accuracy, as in: do they use int, float16, float32, float64 ... The issue is, the higher we go, the more the memory usage explodes.
There is no point in spending terabytes of memory to get a somewhat accurate predictive calculator, when we can just have the LLM call an actual calculator to ensure its results are accurate.
Think of an LLM more like somebody with Dyslexia / Dyscalculia... It does not matter how good you are, all it takes is to switch one number in an algebraic calculation to get a 0/10 ... The reason I mention this is because I often think of an LLM like a person with Dyslexia / Dyscalculia. It can have insane knowledge, be smart, but be considered dumb by society because of that less-than-accurate prediction (or number-swapping issue).
Take it from somebody that wasted a few years in school thanks to that issue: it really does not matter if you're a good programmer later in life, when you flunk a few years thanks to undiagnosed issues. And yet, just like an LLM, I simply rely on tool usage to fix my inaccuracy issues. No point in wasting good shoulder space trying to graft a dozen more heads/brains onto me, when I can simply delegate the issue away. ;)
The fact that we can get computer models that can almost program, write texts, ... and do so much more, like a slightly malfunctioning human, amazes me. And at the same time, I curse at it like my teachers did, and also call it dumb at times hehehe ... I now understand how my teachers felt loool
This is a very unserious take. It's not ironic, because it's not a calculator.
What's the meaning of `computer`, remind me quick?
Computer vision algorithms run on computers and they can’t do basic arithmetic.
My email client runs on my computer and it doesn’t do basic arithmetic either.
Something running on a computer does not imply that it can or should do basic arithmetic
That's confusing basic arithmetic as a user feature and as an implementation requirement.
I guarantee that computer vision and email clients both use basic arithmetic in implementation. And it would be trivially easy to bolt a calculator into an email app, because the languages used to write email apps include math features.
That's not true of LLMs. There's math at the bottom of the stack. But LLMs run as a separate closed and opaque application of a unique and self-contained type, which isn't easily extensible.
They don't include hooks into math features on the GPUs, and there's no easy way to add hooks.
If you want math, you need a separate tool call to conventional code.
IMO testing LLMs as if they "should" be able to do arithmetic is bizarre. They can't. They're not designed to. And even if they did, they'd be ridiculously inefficient at it.
Yes, you are agreeing with me.
Pretty sure the only thing computer vision does is math.
I’ve also observed email clients tallying the number of unread emails I have. It’s quite obnoxious actually, but I qualify adding as math.
> Pretty sure the only thing computer vision does is math.
That is only marginally less pedantic than saying that the only thing computer vision does is run discrete electrical signals through billions of transistors.
Yes, everything that a computer does, it does using math. This does not imply that things running on the computer can do basic arithmetic tasks for the user.
Pencil and paper is just testing with tools enabled.
On some level this makes sense, but on the other hand LLMs already have perfect recall of thousands of symbols built into them, which is what pencil and paper gives to a human test taker.
If only context recall was actually perfect! The data is certainly stored well, accurately accessing the right part... maybe worse than a human :D.
If you're not doing clever hacks for very long windows, I thought a basic design just feeds in the entire window and it's up to the weights to use it properly.
I’d say it’s fair for LLMs to be able to use any tool in benchmarks, so long as they are the ones to decide to use them.
Agreed. I don't like when the prompt sets up a good portion of how to go about finding the answer by saying which tools to use and how. The LLM needs to decide when and how to use them, not the prompt.
I don't think it should be completely open ended. I mean, you could have an "ask_hooman" tool that solves a ton of problems with current LLMs. But that doesn't mean the LLM is capable with respect to the benchmark.
Why not? One of the most intelligent things to do when stuck on a problem is to get outside help.
If allowing this behaviour raises a problem, you can always add constraints to the benchmark such as "final answer must come out under 15s" or something. The LLM can then make the decision to ask around in accordance to the time risk.
Because AI are good at devolving to the highest score, regardless of test intent. For most problems "ask_hooman", or especially the plural, would be much more effective. So, the degenerate case would dominate and tell you precisely zero about the intelligence of the AI. If a specific "tool" is more adept than the "AI" then "choose tool" will always be the correct answer. But I agree, a tight time constraint would help.
You seem to be addressing an argument that wasn’t made.
Personally, I’d say that such tool use is more akin to a human using a calculator.
I'm not addressing an argument, just stating that's already a form of LLM testing done today for people wanting to look at the difference in results the same as the human analogy.
Okay, but then I don’t understand why you replied to my comment for that, there is no direct connection to what I wrote, nor to what bee_rider wrote.
> To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
People interested can see the results of giving LLMs pen and paper today by looking at benchmarks with tools enabled. It's an addition to what you said, not an attack on a portion of your comment :).
I see now. My focus was on the effect of LLMs’ (and by analogy, humans’) reasoning abilities argued by bee_rider. The fact that tool use can enable more reliable handling of large numbers has no bearing on that, hence I found the reply confusing.
Hmm, maybe it depends on the specific test and the reasoning in it? I certainly think reasoning about how and when to use allowed tools, and when not to, is a big part of the reasoning and verification process. E.g. most human math tests allow for a pen and paper calculation, or even a calculator, and that can be a great way to, say, spot check a symbolic derivative and see it needs to be revisited, without relying on the calculator/paper to do the actual reasoning for the testee. Or to see the equation for motion of a system can't possibly have been right with some test values (without which I'm not sure I'd have passed my mid level physics course haha).
At the very least, the scores for benchmarking a human on such a test with and without tools would be different to comparing an LLM without the analogous constraints. Which is (IMO) a useful note in comparing reasoning abilities and why I thought it was interesting to note this kind of testing is just called testing with tools on the LLM side (not sure there is an equally as standard term on the human testing side? Guess the same could be used for both though).
At the same time I'm sure other reasoning tests don't gain much from/expect use of tools at all. So it wouldn't be relevant for those reasoning tests.
> Since performance on large numbers is not what these exams are intended to test for,
How so? Isn't the point of these exams to test arithmetic skills? I would hope we'd like arithmetic skills to be at a constant level regardless of the size of the number?
No. AIME is a test for advanced high schoolers that mostly tests higher level math concepts like algebra and combinatorics. The arithmetic required is basic. All the answers are 3-digit numbers so that judging is objective and automated while making guessing infeasible. You have 12 minutes on average for each question, so even if you are terribly slow at arithmetic, you should still be able to calculate the correct answer if you can perform all the other math.
That's probably a great test for high schoolers but it doesn't really test what we want from AI, no? I would expect AI to be limited by the far greater constraints of its computing ability, and not the working memory of a human high schooler.
Absolutely not.
College exam takers use those tricks because they are on a time limit and are gaming the system. It's clever and wink wink nudge nudge ok everyone does it. But it's one tiny signal in a huge spectrum of things we use to evaluate people.
Instead, these metrics are gamed and presented as the entire multi-spectral signal of competence for LLMs, because it is literally impossible to say that success in one domain would translate the way it might with a good hire.
What I want is something I don't have to guard against gaming. Something conscientious and capable like my co-workers. Until then it's Google version 2 married to IntelliSense, and I'm not letting it do anything by itself.
IMO I think the calculator problem goes away with tool use or NN architectures that basically add a calculator equivalent as one of the potential 'experts' or similar. It won't be much of a trope for longer.
Chatgpt has been calculating things in its python sandbox for years already. This is a trope indeed
A discussion on models "figuring out" things: https://www.youtube.com/watch?v=Xx4Tpsk_fnM (Forbidden Technique)
LLMs can probably be taught or configured to use external tools like Excel or Mathematica when such calculations are needed. Just like humans. There are plenty of untapped optimization opportunities.
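Roughly, that tool-use pattern can look like the sketch below: a calculator function is advertised to the model, and whatever call it emits gets executed. The tool call is hard-coded here as a stand-in for the model's side (reusing the 1765 x 9392 example from upthread), since the exact function-calling API differs per provider.

    import ast, operator

    # Tiny safe evaluator for the arithmetic expressions the model asks for.
    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def calculate(expr: str):
        def ev(node):
            if isinstance(node, ast.BinOp):
                return OPS[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            raise ValueError("unsupported expression")
        return ev(ast.parse(expr, mode="eval").body)

    # In a real agent loop the model would emit this tool call; hard-coded here.
    tool_call = {"name": "calculate", "arguments": {"expr": "1765 * 9392"}}
    print(calculate(tool_call["arguments"]["expr"]))  # 16576880, fed back into the model's context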
I tried making a spreadsheet application and found that they're not that great at working with 2D data, especially if there's a lot of it. It's harder to search a large spreadsheet than a large text file - you might get a range of thousands of numbers, how do you search that? And things like headers or important information may not be anywhere near where it's focused, which means it needs to read a ton of irrelevant context. For small sheets it works perfectly though; it'll have to be something I'll take another look at in the future.
Chatgpt spins up a python sandbox for any complex calculations. It’s been able to do that for a while now
> The point of these LLMs is to do things that computers were bad at.
That's a good point imo but we achieved this stuff by at least 2022 when ChatGPT was released. The thing about these giant black boxes is that they also fail to do things that directly human-written software ("computers") does easily. The inability to print text onto generated images or do general arithmetic is important. And sure, some of these limits look like "limits of humans". But it is important to avoid jumping from "they do this human-thing" to "they're like humans".
I don't claim to know anything, but I thought tool usage was a major sign of intelligence. For example, floats are a wonderful technology, but people use them as if chainsaws are great for cutting bread and butter. We now have entire languages that can't do basic arithmetic. I thought it was alarming: people, it can't compute like this! Now we have language models, which are still computers, so why can't we just give them... you know... calculators? Arguably the best thing their universe has to offer.
edit: I forgot my point: calculating big numbers is not a real-world problem anyone has.
We do? Tool use started coming in vogue around 2023
Actually, tool use started coming into vogue around 3.3 million years ago.
>the point of these LLMs is to do things that computers were bad at.
The way they’re being deployed it feels like the point of LLMs is largely to replace basic online search or to run your online customer support cheaply.
I’m a bit out on a limb here because this is not really my technical expertise by any stretch of the imagination, but it seems to me these benchmark tests don’t really tell us much about how LLM’s perform in the ways most people actually use them. Maybe I’m off base here though
Nobody really knows "the point" of LLMs yet. They weren't even "invented" as much as they emerged as a trick to get computers to better understand human language.
They're still brand spanking new and everyone's trying to figure out how to best use them. We don't even really know if they're ever going to be "really good at" any given task!
Are they "really good at" these things or are they merely "OK-ish"?
Real world testing suggests that with billions and billions of dollars spent, you really can get an LLM to be "OK-ish" at all those things :D
> Nobody really knows "the point" of LLMs yet
Yet literally hundreds of billions of dollars are being invested in them. That’s what’s so concerning. And I can tell you not one of these startups would EVER acknowledge the truth of your statement.
We should make a collective git repo full of every kind of annoying bug we (expert developers) can think of. Then use that to benchmark LLMs.
Someone want to start? I've got a Yjs/CRDT collaborative editing bug that took like a week and a half of attempts with Claude Code (Sonnet 4.5), GPT-5-Codex (medium), and GLM-4.6 to figure out. Even then they didn't really get it... They just came up with a successful workaround (which is good enough for me but still...).
Aside: You know what really moved the progress bar on finding and fixing the bug? When I had a moment of inspiration and made the frontend send all its logs to the backend so the AIs could see what was actually happening on the frontend (near real-time). Really, I was just getting sick of manual testing and pasting the console output into the chat (LOL). Laziness FTW!
I have the Google Chrome Dev Tools MCP but for some reason it doesn't work as well :shrug:
Have you tried the Playwright libraries? Not the MCP, instead telling Claude Code to use the Node.js or Python Playwright libraries directly. I have had some really good results for this for gnarly frontend challenges.
Curious why not the MCP? I use that
I don't really like MCPs, at least when I'm working with coding agents like Claude Code or Codex CLI. I'd rather let the agents write code that can do anything the underlying library is capable of, rather than restricting them to just the functionality that the MCP exposes.
It's more token efficient too since I don't need to load the full MCP description into my context.
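For anyone who hasn't seen it in action, the script the agent writes against the library directly is usually just a handful of lines of the Playwright sync API; the URL and selectors below are placeholders for whatever app is being debugged.

    from playwright.sync_api import sync_playwright

    # A reproduce-the-bug script the agent can rerun cheaply between code changes.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("console", lambda msg: print("console:", msg.text))  # surface frontend logs
        page.goto("http://localhost:3000/editor")   # placeholder URL
        page.fill("#doc", "hello from playwright")  # placeholder selector
        page.click("text=Share")
        assert page.locator(".collab-cursor").count() > 0, "remote cursor never rendered"
        browser.close()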
When I have a bug I’m iterating on it’s much easier and faster to have it write out the playwright script. That way it does not have to waste time or tokens performing the same actions over and over again.
Think of it as TDD.
I can't tell how much of this is sarcasm
> we (expert developers) ...
> took like a week and a half of attempts with Claude Code ...
What kind of expert developer wastes that much time prompting a bunch of different LLMs to end up with a workaround, instead of actually debugging and fixing the bug themselves?
To be charitable to the parent poster, I've had multi-week bugs that turned out to be a tiny change, where every test iteration took hours of compile time...
Fair question but I think the tone of this is a bit abrasive towards the poster, and unnecessarily so.
there is a lot of disdain for vibe coding/coders, as I'm sure you already know. I was going to post something similar as soon as I read "a week and a half" of prompts. I pray that any gainfully employed expert coders don't spend 10 days prompting, rather than coding lol
I really don't think so. "Expert" developer really needs to mean something other than "prompting and poking at Claude Code".
> We should make a collective git repo full of every kind of annoying bug we (expert developers) can think of. Then use that to benchmark LLMs.
I think any LLM user worth their salt has been doing this pretty much since we got API access to LLMs, as otherwise there is no way to actually see if they can solve the things you care about.
The only difference is that you must keep the actual benchmarks to yourself; don't share them with anyone, much less post them publicly. The second you do, you probably should stop using it as an actual benchmark, as newly trained LLMs will either intentionally or unintentionally slurp up your benchmark and suddenly it's no longer a good indicator.
I think I personally started keeping my own test cases for benchmarking around the GPT-3 launch, when it became clear the web would be effectively "poisoned" from that point on, and anything on the public internet can be slurped up by the people feeding the LLMs training data.
Once you have this up and running, you'll get a much more measured view of how well new LLMs work, and you'll quickly see that a lot of the fanfare doesn't actually hold up when testing it against your own private benchmarks. On a happier note, you'll also be surprised when a model suddenly does a lot better in a specific area that wasn't even mentioned at release, and then you could switch to it for specifically that task :)
I actually started a collection of annoying bugs I’ve seen in the wild. I give the llm the buggy implementation and ask it to write a test that catches it. So far not even a frontier model (Claude Sonnet) can do it, even though they can find and fix the bug itself.
> So far not even a frontier model (Claude Sonnet) can do it
Probably because Sonnet is no longer a frontier model, it isn't even the best model Anthropic offers, according to themselves.
This may be intentional, but I'd like to point out that you're basically suggesting that others aggregate high-quality training data for AI companies to use free of charge to replace software engineers.
It would be pretty easy to overfit the results with a static set of tests
What was the CRDT bug?
Benchmarks are nothing more than highly contextual specs (in traditional code). They demonstrate your code works in a certain way in certain use cases, but they do not prove your code works as expected in all use cases.
> Program testing can be used to show the presence of bugs, but never to show their absence. Edsger W. Dijkstra
Maybe we need something similar for benchmarks, and updated for today's LLMs, like:
> LLM benchmarks can be used to show what tasks they can do, but never to show what tasks they cannot.
This wasn't that hard to see.
> Our systematic review of 445 benchmarks reveals prevalent gaps that undermine the construct validity needed to accurately measure targeted phenomena
Intelligence has an element of creativity, and as such the true measurement would be on metrics related to novelty, meaning tasks that have very little resemblance to any other existing task. Otherwise it's hard to parse out whether it's solving problems based on pattern recognition instead of actual reasoning and understanding. In other words, "memorizing" 1000 of the same type of problem, and solving #1001 of that type is not as impressive as solving a novel problem that has never been seen before.
Of course this presents challenges to creating the tests because you have to avoid however many petabytes of training data these systems are trained with. That's where some of the illusion of intelligence arises from (illusion not because it's artificial, since there's no reason to think the brain algorithms cannot be recreated in software).
In my opinion a major weakness in how people reason about this issue is that they describe solving problems as EITHER recall of an existing solution OR creative problem solving. Sure it is possible for a specific solution to be recalled, but it's not possible for a problem to be absolutely unrelated to anything the system has ever seen before and still be solvable. There are many shades of gray in the similarity a problem may have to previously seen problems. In fact, I expect that there are as many shades of gray as there are problems.
The difference is that humans don't memorize petabytes of problems, so from a relative perspective people are constantly solving novel problems they never saw before. I'm thinking this is a requirement for dynamic, few-shot learning. We can clearly see LLMs fail when you throw even a small wrench in the prompt.
Humans encounter massive numbers of problems in their experience that informs their problem solving. The same is true of LLM. LLMs do not actually have all their training data memorized.
I’m not sure what your basis is for saying “LLMs fail if there is a small wrench in the prompt.” They also succeed despite wrenches in the prompt with great regularity.
Let's clarify, this isn't about whether the models are capable. They are very capable and impressive. This is more about whether we can use the same type of metric we use for humans to compare and conclude if they are "intelligent".
It's not just semantics, the metrics are supposed to tell us the potential of the model. If they can solve extremely hard PhD problems, it should be the case that we're already in the singularity, and they should be solving absolutely everything in whatever field they were trained in, because it's not just PhD level, it's a machine that has a ton of memory, compute and never sleeps. However, once you use these models extensively, it becomes apparent they are just synthesizing data, and not as much understanding it in a way that would allow them to extrapolate into anything else as humans do.
I think this point is a little hard to explain. I'll just emphasize, these are smart systems, and they can do a lot, but there is still a disconnect between, let's say, a PhD level model and a human with a PhD, in the "quality" of what we would call "intelligence" of both entities (human and machine).
Human metrics of intelligence have always felt like rubbish; we never did this well. I would describe intelligence as effective adaptation leading to survival and growth or prospering. Memorization, comprehension, speed of response etc. are magnifying factors that are valued, and we view them as components of intelligence, but LLMs are proving they are not the whole: without effective application, they are not intelligence. Perhaps learning is the difference? How to measure that?
Someone describing string theory is the literary equivalent of fractal structures in snowflakes. Lovely, complex, possibly unique, but not proof of a level of intelligence- for the string theorist maybe it is intelligent, perhaps persuading someone to fund their grant, which enables them to eat, shelter etc. Might be a bit harsh on string theory. Saying it is proof of an amount of intelligence leads us to falsifiable statements.
Benchmarks optimize for fundraising, not users. The gap between "state of the art" and "previous gen" keeps shrinking in real-world use, but investors still write checks based on decimal points in test scores.
we try to make benchmarks for users, but it's like that 20% article - different people want different 20% and you just end up adding "features" and whackamoling the different kinds of 20%
if a single benchmark could be a universal truth, and it was easy to figure out how to do it, everyone would love that.. but that's why we're in the state we're in right now
The problem isn’t with the benchmarks (or the models, for that matter) it’s their being used to prop up the indefensible product marketing claims made by people frantically justifying asking for more dump trucks of thousand-dollar bills to replace the ones they just burned through in a few months.
I wish the big providers would offer some sort of trial period where you can evaluate models in a _realistic_ setting yourself (i.e cli tools or IDE integrations). I wouldn't even mind strict limits -- just give me two hours or so of usage and I'd already be happy. Seriously.
My use-case is probably pretty far from the usual tasks: I'm currently implementing a full observability platform based on VictoriaMetrics / Victorialogs + Grafana. It's quite elaborate and has practically no overlap with the usual/cloud solutions you find out there. For example, it uses an authenticated query stack: I use the Grafana oauth token to authenticate queries by injecting matchers via prom-label-proxy and forward that to promxy for fan-out to different datasources (using the label filter to only query some datasources). The IaC stuff is also not mainstream as I'm not using any of the big cloud providers, but the provider I use nonetheless has a terraform provider.
As you can imagine, there's probably not much training data for most of this, so the quality of the responses varies widely. From my experience so far, Claude (Sonnet 4.5) does a _much_ better job than GPT-5 (Codex or normal) with the day-to-day tasks. Stuff like keeping documentation up to date, spotting inconsistencies, helping me find blind spots in the Alerting rules, etc. It also seems to do better working with provided documentation / links.
I've been using Claude for a couple of weeks now but recently switched to codex after my subscription to Claude ran out. I was really curious after reading a lot of good things about it but I gotta say, so far, I'm not impressed. Compared to Claude it gives wrong answers much more frequently (at least in this domain). The results it produces take much more effort to clean up than Claude's. Probably on a level where I could just invest the time myself. Might be that I do not yet know how to correctly prompt GPT but giving both tools the same prompt, Claude does a better job 90% of the time.
Anyway, I guess this is my long-winded way of saying that the quality of responses "off the beaten track" varies widely and is worth testing several models with. Especially if your work is not 70+% coding. Even then, I guess that many benchmarks have ceased being useful by now?
There's the github copilot 30 day trial? "Access to Anthropic Claude Sonnet 4, GPT-5, Gemini 2.5 Pro, and more 300 premium requests to use the latest models and code review"
https://wuu73.org/blog/aiguide1.html
You can get a lot of free usage out of the models.
This might explain the zeitgeist that new models feel same-ish, despite model developers saying they're getting spectacularly better.
I'd like to see some video generation benchmarks. For example, one that tested a model's ability to generate POV footage of a humanoid form carrying out typical household tasks
Even if it requires human evaluators at first, and even if the models completely suck at this task right now: it seems like the kind of task you'd want them to be good at, if you want these models to eventually carry out these tasks in embodied forms in the real world.
Just having the benchmark in the first place is what gives model makers something to optimize for.
Generating footage wouldn't help with the opposite, but navigating a simulation would, which is a pretty standard type of evaluation for multimodal AIs designed to act in the real world.
Do you mean that it wouldn't help with ingesting footage and then determining how to act?
I can imagine a robotics architecture where you have one model generating footage (next frames for what it is currently seeing) and another dumber model which takes in the generated footage and only knows how to generate the motor/servo control outputs needed to control whatever robot platform it is integrated with.
I think that kind of architecture decoupling would be nice. It allows the model with all the world and task-specific knowledge to be agnostic from its underlying robot platform.
A test doesn't need to be objectively meaningful or rigorous in any sense in order to still be useful for comparative ranking.
yes it does - it has to be meaningful or rigorous for the comparative ranking to be meaningful or rigorous, or else wtf are you doing? Say I have all the information on my side but only these questions that you are showing the user? Who cares about that comparison?
objectively vs comparatively
"Measuring money turns out to be easier than measuring intelligence." Don't ever change, El Reg.
Humans are much better at out of sample prediction than LLMs. And inherently benchmarks cannot be out of sample. So I believe that leads to the disconnect between LLMs getting better and better at in sample prediction (benchmarks) while not improving nearly as much at out of sample (actual work).
Id hope anyone using LLMs in production is testing them against their use directly.
Benchmarks make for a good first pass though to figure out which ones to test
For statistical AI models, we can use out of sample prediction error as an objective measure to compare models. What makes evaluating LLMs difficult is that comparisons are inextricable from utility (whereas statistical AI models do have a pre-utility step wherein it can be shown out of sample prediction epsilon is minimized).
for me the definition of AGI is the tool to measure https://arxiv.org/html/2510.18212v2
AI detractors can say whatever. As a developer Claude Code is almost an unfair cheat code. AI valuations may be absurd but the hype is justified.
They should laugh while they can ;) Still waiting for the crash and to see what lives on and what gets recycled. My bet is that grok is here to stay ;)
(Don't hurt me, I just like his chatbot. It's the best I've tried at "Find the passage in X that reminded me of the passage in Y, given this, that, and the other thing." It has a tendency to blow smoke if you let it, and they all seek to affirm more than I'd like, but ain't that the modern world? It can also be hilariously funny in surprisingly apt ways.)
Grok is terrible at coding though.
If models get commoditised, distribution (and vertical integration) become key. OpenAI and xAI are the only companies that seem to be well hedged for this risk.
Heh. I haven't tried it yet, but even grok says Claude is the way to go.
The problem with the LLM benchmarks is that if you see one that shows high performance by something that isn’t from Anthropic, Google or OpenAI, you don’t believe it, even if it were “true.” In that sense, benchmarks are a holistic social experience in this domain, less a scientific endeavour.
Tech companies/bloggers/press/etc are perpetually bad at benchmarks. For browsers they kept pushing simplistic javascript-centric benchmarks even when it was clear for at least 15 years that layout/paint/network/etc were the dominant bottlenecks in real-world usage.
It's primarily marketing-driven. I think the technical parts of companies need to attempt to own this more.
It gets really weird when engineering priorities shift because of these mostly irrelevant benchmarks.
Technically true but also a very dumb take and manipulative phrasing
I'm already quite put off by the title (it's science -- if you have a better benchmark, publish it!), but the contents aren't great either. It keeps citing numbers about "445 LLM benchmarks" without confirming whether any of the ones they deem insufficiently statistical are used by any of the major players. I've seen a lot of benchmarks, but maybe 20 are used regularly by large labs, max.
For a math-based critique, this seems to ignore a glaring problem: is it even possible to randomly sample all natural numbers? As another comment pointed out, we wouldn't even want to ("LLMs can't accurately multiply 6-digit numbers" isn't something anyone cares about/expected them to do in the first place), but regardless: this seems like a vacuous critique dressed up in a costume of mathematical rigor. In related news, at least some scientists studying climate change are aware that their methods are imperfect. More at 11!
If anyone doubts my concerns and thinks this article is in good faith, just check out this site's "AI+ML" section: https://www.theregister.com/software/ai_ml/
(We've since changed both title and URL - see https://news.ycombinator.com/item?id=45860056)
The article references this review:
https://openreview.net/pdf?id=mdA5lVvNcU
And the review is pretty damning regarding statistical validity of LLM benchmarks.
When people claim that there is such a thing as "X% accuracy in reasoning", it's really hard to take anything else seriously, no matter how impressive.
AI (and humans!) aside, the question of whether an oracle that could "answer all questions" can exist is a solved problem: such a thing cannot exist.
But this is going already too deep IMO.
When people start talking about percentages or benchmark scores, there has to be some denominator.
And there can be no bias-free denominator for
- trivia questions
- mathematical questions (oh, maybe I'm wrong here; intuitively I'd say it's impossible for various reasons: varying "hardness", undecidable problems, etc.)
- historical or political questions
I wanted to include "software development tasks", but it would be a distraction. Maybe there will be a good benchmark for this; I'm aware there are plenty already. Maybe AI will be capable of being a better software developer than me in some capacity, so I don't want to include that part here. It also maps pretty well to "the better the problem description, the better the output", which doesn't seem to work so neatly with the other categories of tasks and questions.
Even if the whole body of questions/tasks/prompts were very constrained and covered only a single domain, it seems impossible to guarantee that such a benchmark is "bias-free" (I know AGI folks love this word); the toy sketch below is roughly what I mean.
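A toy illustration of that point (entirely made-up numbers, and a deliberately crude notion of "difficulty"): the same model, with the same per-question behaviour, gets a very different headline score depending on how the benchmark's question pool is weighted, and no particular weighting is obviously the "unbiased" one.

    # Illustrative only: fixed per-difficulty accuracies, different question mixes.
    per_difficulty_accuracy = {"easy": 0.95, "medium": 0.70, "hard": 0.30}

    benchmark_mixes = {
        "mostly easy": {"easy": 0.70, "medium": 0.20, "hard": 0.10},
        "balanced":    {"easy": 0.34, "medium": 0.33, "hard": 0.33},
        "mostly hard": {"easy": 0.10, "medium": 0.20, "hard": 0.70},
    }

    for name, mix in benchmark_mixes.items():
        score = sum(mix[d] * per_difficulty_accuracy[d] for d in mix)
        print(f"{name}: {score:.0%}")
    # mostly easy: ~84%, balanced: ~65%, mostly hard: ~44% -- same "model",
    # three different headline numbers.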
Maybe in some interesting special cases? For example, very constrained and clearly defined classes of questions, at which point, the "language" part of LLMs seems to become less important and more of a distraction. Sure, AI is not just LLMs, and LLMs are not just assistants, and Neural Networks are not just LLMs...
That's where the problem begins, to be honest: I don't even know how to align the "benchmark" claims with the kinds of AI they are examining and the ones I know exist.
Sure it's possible to benchmark how well an AI decides whether, for example, a picture shows a rabbit. Even then: for some pictures, it's gotta be undecidable, no matter how good the training data is?
I'm just a complete layman and commenting about this; I'm not even fluent in the absolute basics of artificial neural networks like perceptrons, gradient descent, backpropagation and typical non-LLM CNNs that are used today, GANs etc.
I am and was impressed by AI and deep learning, but to this day I am thoroughly disappointed by the hubris of snake-oil salespeople who think it's valuable and meaningful to "benchmark" machines on "general reasoning".
I mean, it's already a thing in humans. There are IQ tests for the non-trivia parts. And even these have plenty of discussion revolving around them, for good reason.
Is there some "AI benchmark" that exclusively focuses on doing recent IQ tests on models, preferably editions that were published after the particular knowledge cutoff of the respective models? I found (for example) this study [1], but to be honest, I'm not the kind of person who is able to get the core insights presented in such a paper by skimming through it.
Because I think there are impressive results; it's just becoming very hard to see through the bullshit as an average person.
I would also love to understand more about the current state of the research on the "LLMs as compression" topic [2][3] (there's a toy sketch after the links below).
[1] https://arxiv.org/pdf/2507.20208
[2] https://www.mattmahoney.net/dc/text.html
[3] https://arxiv.org/abs/2410.21352
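My rough understanding of the connection in [2][3]: any model that assigns probabilities to the next symbol also defines a compressed code length (via arithmetic coding), roughly -log2 p(symbol) bits per symbol, so better prediction means better compression. A toy sketch of that bookkeeping, with a crude unigram character model standing in for a real LLM:

    # Illustrative only: model probabilities -> ideal code length in bits.
    import math
    from collections import Counter

    text = "the quick brown fox jumps over the lazy dog " * 20
    counts = Counter(text)
    total = sum(counts.values())
    prob = {ch: c / total for ch, c in counts.items()}

    bits = sum(-math.log2(prob[ch]) for ch in text)     # ideal arithmetic-coded length
    print(f"{bits / len(text):.2f} bits per character")  # ~4.3 for this toy model
    # A better predictive model puts higher probability on what actually comes next,
    # so -log2 p shrinks -- which is why compression ratio tracks model quality.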
Url changed from https://www.theregister.com/2025/11/07/measuring_ai_models_h..., which points to this.
I've been getting flagged by high-on-their-own-supply AI boosters for identifying that LLM benchmarks have been obvious bullshit for at least the last year and a half.
What changed to make "the inevitable AI bubble" the dominant narrative in the last week or so?
Link those comments, please, because I checked your history and the flagged ones were pure nonsense with zero insights. Also, calling out LLM benchmarks has never been a radical take; it's basically the default on this site.
Companies are talking about needing trillions of dollars is why.
And the government backstops.
Nothing says confidence that AGI is imminent like needing the US government to prevent your investments from losing you money.
It is possible to be right on the main theme but only by accident (with the arguments and claims being wrong), to communicate in a highly faulty way with pointless insults, to do it in off-topic derails, to be correct on a minor point while being mostly wrong, etc.
Can you link some of these comments you consider useful but got flagged?
Benchmarks in general have this problem, across pretty much all industries. "When a measure becomes a target" and all that.
The market was down, for AI-related stocks especially. While the drop was only a bit over 3%, it was the worst week since April, and there's no single event to blame; it just looks like market sentiment has shifted away from the previous unchecked exuberance.
Don’t get high on your own supply.