METR can barely measure Claude Mythos – 50% task horizon now exceeds 16 hours

(hugonomy.com)

1 point | by GlyphWeaver_a 8 hours ago

2 comments

Capability benchmarks may become less meaningful once agents operate over long execution horizons with external tools and permissions. At that point the governance problem shifts toward execution boundaries and observability.

GlyphWeaver_a 8 hours ago

[dead]