Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

(arxiv.org)

2 points | by krackers 11 hours ago

1 comment

krackers 11 hours ago

The fact that CoTs often "hallucinate" was known anecdotally, but the authors study it more systematically here and propose ways to mitigate it. Apparently SFT'ing on "meaningful" reasoning traces provides enough of a scaffold that later RL produces meaningful/"truthful" traces rather than just the appearance of reasoning. See also the author's summary at https://x.com/qinan_yu/status/2049865788304380239