2 points | by yakirmat 8 hours ago
2 comments
Curious how you are handling benchmark reliability. Have you seen cases where evaluations pass but production behavior fails?
[flagged]