What strikes me is the finding that controllability decreases with longer reasoning: it suggests CoT monitoring gets more reliable in exactly the complex, multi-step tasks where scheming would be hardest to catch from outputs alone. The open question is whether this holds as models get better at instruction following generally.
The interesting part isn't that they can't control it; it's that the reasoning trace is honest precisely because it isn't controlled. A model that could perfectly curate its chain of thought on demand would be harder to audit, not easier. The "problem" is actually the safety property.