The original ProgramBench paper released by Meta (5/5/26) used a single run per task during benchmarking.
I was curious about the variance in the output, so I made some runs w/ deepseek v4 flash and found it was pretty high.
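For anyone curious how I looked at the spread, here's a minimal sketch of the kind of calculation involved. The `results` data below is made up for illustration (my actual per-task pass/fail logs aren't included here):

```python
import statistics

# Hypothetical data: one list of booleans per run (True = task passed).
# In my actual runs this was N=5 runs over the tasks I re-evaluated.
results = [
    [True, False, True],   # run 1
    [True, True,  True],   # run 2
    [False, False, True],  # run 3
    [True, False, False],  # run 4
    [True, True,  True],   # run 5
]

# Pass rate per run, then mean and spread across runs.
pass_rates = [sum(run) / len(run) for run in results]
print(f"mean pass rate:     {statistics.mean(pass_rates):.2f}")
print(f"stdev across runs:  {statistics.stdev(pass_rates):.2f}")
```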
There's also a possibility of strong model memorization on these tasks: in one run on the cmatrix task, the model generated a header containing the original authors' names (cmatrix was one of the 3 of 200 tasks I selected to re-evaluate).
I'm also curious whether anyone else thinks that passing in the gold executable as part of the task lowers the usefulness of this benchmark.
Caveats: N=5 on my runs, and I used my own generalized task prompt.