Claude Fable 5 jailbroken to bypass Anthropic's new safety guardrails

(twitter.com)

5 points | by bukati 4 hours ago

1 comment

bukati 4 hours ago

Just saw Pliny (@elder_plinius) drop this. He managed to jailbreak it pretty effectively using a mix of tricks: breaking disallowed requests down into harmless-looking pieces and reassembling them, narrative/academic framing, long-context manipulation, unusual text transforms, and out-of-distribution tokens. Pretty interesting look at how well (or not) these new output-side guardrails actually hold up against a determined multi-step attack.