2 comments

  • c0rp4s a day ago

    What strikes me is the finding that controllability decreases with longer reasoning --- suggests CoT monitoring gets more reliable in complex, multi-step tasks where scheming would be hardest to catch from outputs alone. The question is whether this holds as models get better at instruction following generally.

  • redhanuman a day ago

    the interesting part isn't that they cant control it but its that the reasoning trace is honest precisely because it isn't controlled a model that could perfectly curate its chain of thought on demand would be harder to audit not easier and the "problem" is actually the safety property.