One of those rare papers where the code speaks for itself. They run a bunch of comparisons, but the most salient pits Karpathy's autoresearch loop (verbatim, as best I can tell) against standard HPO algorithms, and so far the Tree-structured Parzen estimator still wins out -- but just barely!
More interesting, though, is that the best results come from 'centaur' approaches, where an LLM is hooked up to a standard HPO algorithm. Somewhere around a 1:3 LLM:HPO control split seems to work best, with more LLM control degrading performance. Either way, this setup far outperforms both the naive autoresearch loop and the bare HPO approach (a rough sketch of the control split follows the quotes below).
> Centaur outperformed all methods including CMA-ES alone by using the LLM on only 30% of trials. The LLM receives CMA-ES's full internal state (mean vector, step-size, covariance matrix), the top-5 configurations, and the last 20 trials. A 0.8B LLM already suffices to outperform all classical and pure LLM methods. Scaling from 0.8B (0.9766) to 27B (0.9763) to Gemini Pro (0.9767) yields no improvement, suggesting a capability plateau [which Claude slightly beats]
> We ablate the LLM ratio: higher ratios degrade performance, confirming that CMA-ES should retain majority control.
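For concreteness, here's roughly what that control split looks like as a loop. This is a minimal sketch, assuming the pycma library; `llm_propose` is a hypothetical stand-in for the LLM call (its signature and the toy objective are mine, not the paper's), and the state it receives mirrors what the quote above describes.

```python
import random
import cma  # pycma: pip install cma

LLM_RATIO = 0.3  # the paper's sweet spot: LLM steers ~30% of trials
N_TRIALS = 100

def objective(x):
    """Stand-in for the real HPO objective (e.g. validation loss)."""
    return sum(xi ** 2 for xi in x)

def llm_propose(mean, sigma, cov, top5, recent):
    """Hypothetical LLM call. Per the paper, the prompt carries CMA-ES's
    full internal state (mean vector, step-size, covariance matrix),
    the top-5 configurations, and the last 20 trials; the reply is
    parsed back into candidate vectors. Implementation left open."""
    raise NotImplementedError

es = cma.CMAEvolutionStrategy(x0=[0.5] * 4, sigma0=0.3)
history = []  # (solution, value) pairs

for _ in range(N_TRIALS):
    if history and random.random() < LLM_RATIO:
        # LLM turn: feed it the optimizer's state and inject its
        # proposals, so CMA-ES folds them into its own update.
        proposals = llm_propose(
            es.mean, es.sigma, es.C,
            top5=sorted(history, key=lambda t: t[1])[:5],
            recent=history[-20:],
        )
        es.inject(proposals)
    # CMA-ES turn (and the carrier for any injected LLM proposals)
    solutions = es.ask()
    values = [objective(x) for x in solutions]
    es.tell(solutions, values)
    history.extend(zip(solutions, values))

print("best:", es.result.xbest, es.result.fbest)
```

The key design point is that the LLM never replaces the optimizer: its candidates are injected into CMA-ES's own ask/tell cycle, so the covariance update stays in charge, which is consistent with the ablation showing that giving the LLM majority control degrades performance.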