Golden Sets: Regression Engineering for Probabilistic Systems

(heavythoughtcloud.com)

12 points | by ryan-s a day ago

4 comments

JSR_FDED a day ago

“You can ship AI without evaluation.”

“You can also ship without tests.”

“Both approaches create compelling personal growth opportunities.”

Now THAT is how you start an article I’ll actually read!

ryan-s 20 hours ago

Appreciate it.

“There will be no tests, only consequences” is still a surprisingly popular architecture pattern :).

mpalmer 19 hours ago

Are you producing this solely for LLMs to read? If not, you should edit the material more thoughtfully.

You're presenting yourself as a subject matter expert, and I'm having trouble distinguishing your original thoughts because they are bogged down in seemingly unedited slop prose.

It's apparent that you asked the model to keep its sentences short and the content digestible, but the result is unbearable; it reads like it's written in monotone.

I don't enjoy reading 20 sets of 4-5 bullet points which look exactly the same. Where's the emphasis? Why are you telling me 17 times that single-number scores are bad and multi-metric gates are good?

And of course, seven(!) instances of the most glaring canary:

    Golden sets are not just datasets. They are versioned cases plus a scoring contract.

    A useful golden set is not just a folder of examples. It is a contract with explicit fields.

    They work with those systems. They do not replace them.

    [..] evaluation is not a one-time report. It is a release gate.

    You do not need a specialized eval platform to start. You need disciplined case design and a willingness to treat scoring as engineering rather than ceremony.

    That is not the gate being annoying. That is the gate doing its job.

    That is not academic ceremony. That is how you keep AI systems from slowly degrading while every weekly update insists things are "looking strong".

ryan-s 4 hours ago

Fair criticism.

I was aiming for “field guide / reference doc” structure rather than a looser essay voice, which is why the piece is so contract-heavy and sectional. But I think you’re right that the repetition probably started doing the opposite of what it was supposed to do and dulled the emphasis.

That’s fixable, and I’ll tighten it.