Refusal in Language Models Is Mediated by a Single Direction

(arxiv.org)

13 points | by fagnerbrack 2 hours ago

2 comments

akersten an hour ago

This is from 2024, which is ancient history. It's not true anymore: models are now trained to resist abliteration by spreading out the refusal encoding.

See https://arxiv.org/abs/2505.19056
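
For context, here is a minimal sketch of what abliteration (directional ablation) does, assuming the refusal direction is estimated as a difference of mean activations between harmful and harmless prompts. The toy 3-dimensional activations and the function names are made up for illustration and are not taken from any real model or library.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along (mean harmful activation - mean harmless activation)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def ablate(x, r_hat):
    """Remove the component of x along r_hat: x - (x . r_hat) * r_hat."""
    dot = sum(a * b for a, b in zip(x, r_hat))
    return [a - dot * b for a, b in zip(x, r_hat)]

# Toy activations (hypothetical): only the first coordinate differs between sets.
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r_hat = refusal_direction(harmful, harmless)
ablated = ablate([5.0, 2.0, -1.0], r_hat)

# After ablation the activation has no component left along r_hat.
residual = sum(a * b for a, b in zip(ablated, r_hat))
```

The sketch removes exactly one direction. If refusal were encoded across several directions, as the claim above suggests, projecting out a single vector would leave the other components untouched.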

Der_Einzige 4 minutes ago

That doesn't prevent abliteration. The creator of XTC/DRY is also a chad who makes sure that you can really access the full model capabilities. Censorship is the devil.

https://github.com/p-e-w/heretic