Evaluating Evolving Agents with Evolving Benchmarks

(frontier-cs.org)

3 points | by lihanc111 9 hours ago

1 comment

  • lihanc111 9 hours ago

    Evolving agent systems (agents that use continual learning, memory, or context refinement) are becoming mainstream. However, most papers to date show gains on only about 10 tasks, or on small subsets of existing benchmarks.

    The bigger issue is benchmark saturation. We highlight "circle packing" in the post as a prime example: four different recent papers (ThetaEvolve, TTT-Discover, AdaEvolve, and EvoX) all converge to the exact same ceiling (2.635983). Once multiple methods collapse to the same endpoint because they are essentially just wrapping the same SLSQP Python script, the benchmark stops measuring open-ended improvement.
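
    A minimal sketch of why SLSQP-wrapped pipelines hit a shared ceiling: the inner local solver converges to the same stationary point no matter which agent is wrapping it. Everything below (the 4-circle instance, starting layouts, tolerances) is a hypothetical illustration, not the setup used by Frontier-CS or any of the cited papers.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical illustration: pack 4 equal circles in the unit square,
# maximizing the common radius r, with SciPy's SLSQP -- the same kind of
# local solver the saturated circle-packing pipelines wrap.
# Decision vector z = [x0, y0, x1, y1, x2, y2, x3, y3, r].
N = 4

def neg_radius(z):
    return -z[-1]  # maximize r by minimizing -r

def constraints():
    cons = []
    for i in range(N):
        # circle i must stay inside the unit square
        cons.append({"type": "ineq", "fun": lambda z, i=i: z[2*i] - z[-1]})
        cons.append({"type": "ineq", "fun": lambda z, i=i: 1 - z[2*i] - z[-1]})
        cons.append({"type": "ineq", "fun": lambda z, i=i: z[2*i+1] - z[-1]})
        cons.append({"type": "ineq", "fun": lambda z, i=i: 1 - z[2*i+1] - z[-1]})
        for j in range(i + 1, N):
            # circles i and j must not overlap: dist^2 >= (2r)^2
            cons.append({"type": "ineq",
                         "fun": lambda z, i=i, j=j:
                         (z[2*i] - z[2*j])**2 + (z[2*i+1] - z[2*j+1])**2
                         - 4.0 * z[-1]**2})
    return cons

# Two different starting layouts collapse to the same local optimum
# (for 4 circles it is the 2x2 grid with r = 0.25).
starts = [
    [0.3, 0.3, 0.7, 0.3, 0.3, 0.7, 0.7, 0.7, 0.2],
    [0.2, 0.25, 0.75, 0.3, 0.25, 0.8, 0.8, 0.75, 0.15],
]
radii = []
for z0 in starts:
    res = minimize(neg_radius, np.array(z0), method="SLSQP",
                   constraints=constraints(),
                   options={"maxiter": 500, "ftol": 1e-12})
    radii.append(res.x[-1])

print(radii)  # both runs should land near r = 0.25
```

    At larger n the landscape has many local optima, but once every "evolving" method delegates the continuous refinement to the same solver, they inherit the same basin and the same ceiling.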

    To close this evaluation gap, we built Frontier-CS: 172 algorithmic tasks and 68 research tasks, designed by ICPC World Finalists and CS PhDs.

    Our main goal was to create environments with massive search spaces that demand genuinely information-efficient exploration and long-horizon reasoning (like our hidden Permutation task), while keeping the scoring 100% deterministic so we don't have to rely on "LLM-as-a-judge." SkyDiscover is already using this for a 200+ task benchmark suite to get much cleaner, less noisy data on agent performance.
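
    A toy sketch of what fully deterministic scoring can look like for a hidden-permutation-style task. The interface, query model, and penalty below are invented for illustration (the post does not describe the actual Frontier-CS task); the point is that the score is exact checks plus a query-cost term, with no judge model anywhere.

```python
import random

class HiddenPermutationEnv:
    """Toy stand-in for a hidden-permutation task (hypothetical interface).
    The agent may ask 'is value v at position i?' and is scored
    deterministically: exact correctness minus a per-query cost."""

    def __init__(self, n, seed=0):
        rng = random.Random(seed)       # fixed seed -> reproducible instance
        self.perm = list(range(n))
        rng.shuffle(self.perm)
        self.queries = 0

    def query(self, i, v):
        self.queries += 1
        return self.perm[i] == v

    def score(self, guess):
        # Deterministic: exact positional matches, penalized by query count.
        correct = sum(g == p for g, p in zip(guess, self.perm))
        return correct - 0.01 * self.queries

def solve(env, n):
    """Naive baseline agent: linear-search each position.
    An information-efficient agent would need far fewer queries."""
    guess = []
    for i in range(n):
        for v in range(n):
            if env.query(i, v):
                guess.append(v)
                break
    return guess

env = HiddenPermutationEnv(8, seed=0)
guess = solve(env, 8)
print(guess == env.perm, env.queries, env.score(guess))
```

    Because the instance is seeded and the checks are exact, two runs of the same agent get the same score bit-for-bit, which is what makes the leaderboard noise-free compared with judge-model grading.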