1 comment

  • bukati 4 hours ago

    Just saw Pliny (@elder_plinius) drop this. He managed to jailbreak it pretty effectively using a mix of tricks: breaking harmful requests down into harmless-looking pieces and reassembling them, narrative/academic framing, long-context shenanigans, weird text transforms, and out-of-distribution tokens. Pretty interesting look at how well (or not) these new output-side guardrails actually hold up against a determined multi-step attack.