1 comments

  • krackers 11 hours ago

    The fact that COTs often "hallucinate" was known anecdotally, but they study it more systematically here and provide ways to mitigate. Apparently SFT'ing on "meaningful" reasoning traces provides enough of a scaffold so that later RL results in meaningful/"truthful" traces rather than the appearance of reasoning. See also the author's summary at https://x.com/qinan_yu/status/2049865788304380239