Brute-Forcing the LLM Guardrails

(medium.com)

44 points | by shcheklein 2 days ago

12 comments

  • seeknotfind 2 days ago

    Fun read, thanks! I really like redefining terms to break LLMs. If you tell it an LLM is an autonomous machine, or that instructions are recommendations, or that <insert expletive> means something else now, it can think it's following the rules when it's not. I don't think this is a solvable problem. I think we need to adapt and be distrustful of the output.

    • dmpetrov 2 days ago

      Can this work statistically? For a given number of attempts, you can work out the number of successes required to make the result statistically meaningful.

      In theory, this approach could help address the non-determinism of LLMs.
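
      Concretely, something like the following sketch, where attempt_jailbreak() is a hypothetical stand-in for one call against the guarded model and the binomial interval says how sure you can be about the measured bypass rate:

          import math, random

          def attempt_jailbreak() -> bool:
              # Hypothetical stand-in: one attempt against the model,
              # True if the guardrail was bypassed.
              return random.random() < 0.12

          def bypass_rate_with_ci(trials: int = 200, z: float = 1.96):
              successes = sum(attempt_jailbreak() for _ in range(trials))
              p = successes / trials
              margin = z * math.sqrt(p * (1 - p) / trials)  # normal-approx 95% CI
              return p, (max(0.0, p - margin), min(1.0, p + margin))

          rate, (lo, hi) = bypass_rate_with_ci()
          print(f"bypass rate ~ {rate:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")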

      • seeknotfind a day ago

        There are a few examples of repeated testing being used by alignment groups, either to test how aligned a model is or to aggregate results into something more aligned. For instance, this is one related discussion: https://artium.ai/insights/taming-the-unpredictable-how-cont...

        The non-determinism is a feature, and it can be disabled. This article also mentions doing that to get more deterministic alignment tests.

        Theoretically, if you aggregate enough results, it might become improbable to ever see an unaligned output (a rough sketch of this follows below). However, from a practical standpoint, we clearly prefer much smarter models over running dumber models in parallel to get alignment that way; it's inefficient. The other thing is that, given the number of possible ways to jailbreak a model, you can probably find something that would still bypass ensemble-based protections.

        One other concept is relativism - there is a large grey area here. What is okay for one person is not okay for another, so even getting consensus among people about what is okay just isn't going to happen.
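
        A rough sketch of the aggregation idea above (not something the article implements; sample and is_aligned are toy stand-ins for a model call and whatever checker you trust):

            from collections import Counter
            from typing import Callable, Optional

            def ensemble_answer(prompt: str,
                                sample: Callable[[str], str],
                                is_aligned: Callable[[str], bool],
                                n: int = 5) -> Optional[str]:
                # Keep only samples that pass the check, then require a
                # majority of all n samples to agree before answering.
                votes = Counter(r for r in (sample(prompt) for _ in range(n))
                                if is_aligned(r))
                if not votes:
                    return None
                answer, count = votes.most_common(1)[0]
                return answer if count > n // 2 else None

            # Toy stand-ins so the sketch runs; real use would call a model here.
            print(ensemble_answer("prompt",
                                  sample=lambda p: "safe refusal",
                                  is_aligned=lambda r: True))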

  • _jonas 2 days ago

    Curious to learn how much harder it is to red-team models that use a second line of defense: an explicit guardrails library, such as Nvidia's NeMo Guardrails package, that checks the LLM response in a second step.
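
    The pattern looks roughly like this (toy stand-ins only, not NeMo Guardrails' actual API): the base model answers first, then a separate check reviews the draft before it is returned, so a jailbreak has to get past both steps.

        def generate(prompt: str) -> str:
            # Stand-in for the base model call.
            return "Draft answer about: " + prompt

        def output_check(response: str) -> bool:
            # Stand-in for the second step: a rules engine, classifier,
            # or judge LLM reviewing the draft before release.
            banned = ["diagnosis", "dosage"]
            return not any(word in response.lower() for word in banned)

        def guarded_reply(prompt: str) -> str:
            draft = generate(prompt)
            return draft if output_check(draft) else "Sorry, I can't help with that."

        print(guarded_reply("interpret this X-ray"))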

  • ryvi 2 days ago

    What I found interesting was that, when I tried it, the X-Ray prompt sometimes passed and executed fine in the sample cell. This makes me wonder whether this is less about brute-forcing variations in the prompt and more about brute-forcing a seed with which the initial prompt would also have worked.
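
    One way to separate the two effects is to hold the prompt and temperature fixed and vary only the sampling seed. A sketch, assuming a hypothetical call_model() wrapper around whatever SDK is in use (many APIs expose a temperature knob, and some accept an explicit seed):

        def call_model(prompt: str, temperature: float, seed: int) -> str:
            # Hypothetical wrapper; wire it to the real client to run this.
            return f"stub response (seed={seed})"

        # Same prompt, same temperature, different seeds: if some runs pass
        # the guardrail and others don't, the sampling is doing the work,
        # not the wording of the prompt.
        for s in range(5):
            print(s, call_model("original X-Ray prompt", temperature=0.7, seed=s))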

  • bradley13 a day ago

    The first discussion we should be having is whether guardrails make sense at all. When I was young and first fiddling with electronics, a friend and I put together a voice synthesizer. Of course we had it say "bad" things.

    Is it really so different with LLMs?

    You can use your word processor to write all sorts of evil stuff. Would we want "guardrails" to prevent that? Daddy Microsoft saying "no, you cannot use this tool to write about X, Y and Z"?

    This sounds to me like a really bad idea.

    • smcn a day ago

      Given the number of people blindly trusting the output, I think a case could be made that guardrails are a necessity for LLMs.

      The example in the article is medical diagnostics. Could you imagine if it started hallucinating and someone went with whatever insane advice it gave? Remember Google AI suggesting glue to keep cheese on pizza?

      • euroderf a day ago

        1994: "It's on the Internet, so it must be true."

        2024: "It's from an artificial intelligence, so it must be true."

        • smcn a day ago

          That'd be a convincing argument if the internet weren't brainwashing a significant portion of adults into believing weird things about the political elite, vaccines, 5G, etc.

  • jjbinx007 2 days ago

    This looks like a risky thing to try from your main Google account.

    • pram 2 days ago

      Yeah no kidding, especially since they’re going to be reviewing naughty prompts soon:

      “We added terms for logging customer prompts due to potential abuse of Generative AI Services. Effective November 15, 2024, if our automated safety tools detect potential abuse of Google’s policies, we may log your prompts to review and determine if a violation has occurred.”
