Deep Dive into G-Eval: How LLMs Evaluate Themselves

(medium.com)

10 points | by zlatkov 7 hours ago

5 comments

  • kirchoni 7 hours ago

    Interesting overview, though I still wonder how stable G-Eval really is across different model families. Auto-CoT helps with consistency, but I’ve seen drift even between API versions of the same model.

    • zlatkov 6 hours ago

      That's true. Even small API or model version updates can shift evaluation behavior. G-Eval helps reduce that variance, but it doesn’t eliminate it completely. I think long-term stability will probably require some combination of fixed reference models and calibration datasets.
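
      A rough sketch of what I mean by the calibration part (the calibration examples, reference scores, and the judge_score callable below are placeholders for whatever G-Eval setup you actually run, not anything from the article):

        # Hypothetical sketch: detect evaluator drift with a fixed calibration set.
        from statistics import mean
        from typing import Callable

        # Fixed calibration set: (prompt, candidate output, human reference score on a 1-5 scale).
        # These examples are made up for illustration.
        CALIBRATION_SET = [
            ("Summarize the article.", "The article explains how G-Eval scores outputs.", 4.0),
            ("Summarize the article.", "Cats are great pets.", 1.0),
        ]

        def drift_report(judge_score: Callable[[str, str], float]) -> dict:
            """Score the fixed set with a given judge and compare against the reference scores."""
            diffs = [judge_score(prompt, output) - ref for prompt, output, ref in CALIBRATION_SET]
            return {
                "mean_offset": mean(diffs),                    # systematic bias vs. references
                "max_abs_offset": max(abs(d) for d in diffs),  # worst-case disagreement
            }

        # Re-run after every model/API version bump and compare the two reports:
        # report_old = drift_report(judge_score=my_geval_judge_old)
        # report_new = drift_report(judge_score=my_geval_judge_new)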

  • eeasss 7 hours ago

    Are there any LLMs in particular that work best with G-Eval?

    • zlatkov 7 hours ago

      I haven’t come across any research showing that a specific LLM consistently outperforms others for this. It generally works best with strong reasoning models that produce consistent outputs.

    • lyuata 6 hours ago

      An LLM benchmark leaderboard for common evals sounds like a fun idea to me.