makes sense! we wrote something yesterday about the weaknesses of test-based evals like swe-bench [1]
they are definitely useful but they miss the things that are hard to encode in tests, like spec/intent alignment, scope creep, adherence to codebase patterns, team preferences (risk tolerance, etc)
and those factors are really important. which means that test-evals should be relied upon more as weak/directional priors than as definitive measures of real-world usefulness
[1] https://voratiq.com/blog/test-evals-are-not-enough/
> mid-2024 agents
Is this a post about AI archeology?
LLM-written code passed SWE Bench even back then. This may just say that SWE Bench is an inadequate test, and should not be used for serious evaluation.
This seems like an important caveat to SWE-bench, but the trend is still clearly AI becoming more and more capable.
Do these benchmarks make any sense? I tried a few local models that seem to score well on SWE-bench, but the results were pure rubbish. (For instance, MiniMax-M2.5 at 128GB from Unsloth: completely unusable.)
SWE-bench scores well in the narrow task of making tests pass, which means models get good at exactly that. Real codebases have style constraints, architecture choices, and maintainability concerns that don't show up in any test suite. Not surprised at all that the PRs wouldn't get merged; you'd expect that from an eval that can't measure what reviewers actually care about.
Edit: Nevermind
Well, no: one of the first things it says is that reviewers were blind to human vs. AI.
They might have tried, but this would be pretty hard to achieve in practice, especially for the older/worse models. For changes that do more than alter a couple of lines, LLM output can be very obvious. Stripping all comments from the changeset might go a long way toward making it more blind, but then you're missing context that you kind of need to review the code properly.
The comment you're replying to is talking about a hypothetical scenario.
In any case, the blinding didn't stop Reviewer #2 from calling out obvious AI slop. (Figure 5)
I feel like I don't have the context for this conversation. If slop is obvious as slop, I feel like we should block it.
If you look at the comment, it says what the code following it does. It doesn't matter whether a human or a machine wrote it: it is useless. It is actually worse than useless, because if someone needs to change the code, they now need to change two things. In that sense, you've made twice the work for anyone who touches the code after you, and for what benefit?
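To make the point concrete, here's a hypothetical sketch (the function and comment are invented for illustration, not taken from the code under discussion) of the kind of comment being described: it restates the line below it, so any future edit has to be made in two places, and a missed update turns the comment into misinformation.

```python
def apply_discount(price: float, rate: float) -> float:
    # Multiply the price by one minus the rate.  <- restates the code exactly
    return price * (1 - rate)

# If the logic later changes (say, to clamp the rate), the comment above is a
# second thing to update; forget it, and it actively misleads the next reader.
print(apply_discount(100.0, 0.25))  # 75.0
```

A comment that records *why* (intent, constraints, non-obvious edge cases) doesn't have this problem, because the "why" rarely changes in lockstep with the implementation.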
The point is that AI models do these kinds of things all the time. They're not really all that smart or intelligent, they just replicate patterns or boilerplate and then iterate until it sort of appears to work properly.
> appears to work
That "appears" is doing a lot of heavy lifting.
The code working isn't what's being selected for.
The code looking convincing IS what is being selected for.
That distinction is massive.