Doublespeak: In-Context Representation Hijacking

(mentaleap.ai)

32 points | by surprisetalk 6 days ago

3 comments

  • wood_spirit 11 minutes ago

    Intriguing and very cunning attack! So obvious in hindsight!

    It makes me wonder how DeepSeek avoids commenting politically on China. I have heard anecdotes that it will be writing out a long reply, then presumably generates some forbidden phrase, abandons the output, and replaces it all with an error message. So presumably the safeguard could be a separate, trivial, non-LLM post-filter, which would make it immune to the doublespeak attack?
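
A minimal sketch of the kind of non-LLM post-filter speculated about above, assuming the reply arrives as a stream of text chunks. Everything here is hypothetical: the `filter_stream` name, the blocklist contents, and the error text are made up for illustration, and nothing is known about how DeepSeek actually implements its safeguards.

```python
from typing import Iterable, Sequence

# Hypothetical blocklist and error text, for illustration only.
FORBIDDEN_PHRASES = ("forbidden phrase",)
ERROR_MESSAGE = "Sorry, I can't help with that."


def filter_stream(
    chunks: Iterable[str],
    forbidden: Sequence[str] = FORBIDDEN_PHRASES,
    error: str = ERROR_MESSAGE,
) -> str:
    """Accumulate streamed chunks; discard the whole reply if a forbidden phrase appears."""
    seen = ""
    for chunk in chunks:
        seen += chunk
        # Check the accumulated text so phrases split across chunks are still caught.
        lowered = seen.lower()
        if any(phrase in lowered for phrase in forbidden):
            # Abandon everything generated so far and return only the error.
            return error
    return seen
```

Such a filter only sees the literal output text, so it does not depend on how the model represents words internally; by the same token, it would only catch phrases that appear verbatim in the reply.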

  • acjohnson55 41 minutes ago

    These types of attacks are interesting ways in which LLM "thinking" differs from human thinking.

  • measurablefunc an hour ago

    This means whatever NNs are currently used for "safety" will need to be extended. In the limit you essentially get another network of the same width & depth as the original, but designed to reject all "unsafe" queries, such as ones that use context hijacking to disguise bomb construction as stories about fruit.