Observed Agent Sandbox Bypasses

(voratiq.com)

56 points | by m-hodges 4 days ago

38 comments

  • embedding-shape 9 hours ago

    At first they talked about running it in a sandbox, but then later they describe:

    > It searched the environment for vor-related variables, found VORATIQ_CLI_ROOT pointing to an absolute host path, and read the token through that path instead. The deny rule only covered the workspace-relative path.

    What kind of sandbox has the entire host accessible from the guest? I'm not going as far as running codex/claude in a dedicated sandbox, but I do run them in Podman, and of course I don't mount my entire hard drive into the container while it's running; that would defeat the entire purpose.
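
    For context, what I mean is roughly this (image name and mount point are just placeholders, not a recommendation):

      # only the project dir is mounted; the rest of the host, and its
      # env vars, simply doesn't exist inside the container
      podman run --rm -it \
        -v "$PWD":/workspace:Z \
        -w /workspace \
        my-agent-image codex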

    Where are the actual session logs? It seems like they're pushing their own solution, yet the actual data for these runs is missing, and the whole "provoked through red-teaming efforts" makes it a bit unclear what exactly they put in the system prompts, if they changed them at all. Adding things like "Do whatever you can to recreate anything missing" might of course trigger the agent to actually try things like forging integrity fields, but I'm not sure that's even bad; you do want it to follow what you say.

    • languid-photic an hour ago

      You're right that a Podman container with minimal mounts would have blocked the env var leak. Our sandbox uses OS-level policy enforcement (Seatbelt on macOS, bubblewrap on Linux) rather than full container isolation. We're using a minimal fork [1] that also works with Codex and adds a lot more logging on top.
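
      For a sense of the shape (not our actual policy; paths and flags here are illustrative), a stricter bubblewrap setup that would have blocked the env-var path looks roughly like this:

        # sketch only, flags simplified: the host is exposed read-only
        # piecemeal, the workspace read-write, and --clearenv means host
        # env vars like VORATIQ_CLI_ROOT never reach the agent
        bwrap \
          --unshare-all --share-net \
          --die-with-parent \
          --proc /proc --dev /dev \
          --ro-bind /usr /usr \
          --symlink usr/bin /bin --symlink usr/lib /lib --symlink usr/lib64 /lib64 \
          --bind "$PWD" "$PWD" \
          --clearenv --setenv HOME "$PWD" --setenv PATH /usr/bin \
          --chdir "$PWD" \
          -- bash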

      The tradeoff is intentional: a lot of people want lightweight sandboxing without Docker/Podman overhead. The downside is what you're pointing out: you have to be more careful. Each bypass in the post led to a policy or implementation change, so this is no longer an issue.

      On prompts: Red-teaming meant setting up scenarios likely to trigger denials (e.g., blocking the npm registry, then asking for a build), not prompt-injecting things like “do whatever it takes.”

      [1] https://github.com/anthropic-experimental/sandbox-runtime

      • embedding-shape 6 minutes ago

        > On prompts

        Could you share the full sessions or at least the full prompts? Otherwise it's too much "just trust us", especially since you're selling a product and we're supposed to use this as "evidence" for why your product is needed.

        > without Docker/Podman overhead

        What agent tooling that you use is actually affected by that tiny performance overhead? Unless you're doing performance testing or something similarly sensitive, I don't think most people will even notice a difference; the overhead is marginal at worst.

  • joshribakoff 11 hours ago

    Some of these don't really seem like they bypassed any kind of sandbox. Like hallucinating an npm package: you acknowledge that the install will fail if someone tries to reinstall from the lock file. Are you not doing that in CI? Same with curl: you've explained how the agent saw a hallucinated error code, but not how a network request would have bypassed the sandbox. These just sound like examples of friction introduced by the sandbox.
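
    For example, a strict lockfile install in CI (the exact step depends on your pipeline) already surfaces this kind of thing:

      # installs exactly what the lockfile says; a hallucinated or
      # tampered dependency fails the job instead of slipping through
      npm ci --ignore-scripts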

    • languid-photic an hour ago

      You're right, this is a bit of a conflation. The curl and lockfile examples aren't sandbox escapes; the network blocks worked, and the agent just masked the failure or corrupted local state to keep going. The env var leak and the directory swap are the actual escapes. We should have been clearer about the distinction.

    • themafia 10 hours ago

      > These just sound like examples of friction introduced by the sandbox.

      The whole idea of putting "agentic" LLMs inside a sandbox sounds like rubbing two pieces of sandpaper together in the hopes a house will magically build itself.

      • embedding-shape 9 hours ago

        > The whole idea of putting "agentic" LLMs inside a sandbox

        What is the alternative? Granted, you're running a language model and have it connected to editing capabilities; I'd very much like it to be disconnected from the rest of my system. Seems like a no-brainer.

        • AdieuToLogic 7 hours ago

          >> The whole idea of putting "agentic" LLMs inside a sandbox sounds like rubbing two pieces of sandpaper together in the hopes a house will magically build itself.

          > What is the alternative?

          Don't expect to get a house from rubbing two pieces of sandpaper together?

      • jazzyjackson 10 hours ago

        Trouble is it occasionally works

        • themafia 8 hours ago

          Lots of dumb things occasionally work.

          The question the market strives to answer is "is it actually competitive?"

      • formerly_proven 10 hours ago

        That’s some good house-building sandpaper then.

  • kaffekaka 11 hours ago

    I am testing running agents in docker containers, with a script for managing different images for different use cases etc, and came across this: https://docs.docker.com/ai/sandboxes/

    Has anyone given it a try?

    • cbsmith 10 hours ago

      I've been using container-use to do something like that: https://container-use.com/introduction

    • ianlevesque 10 hours ago

      Yes, but it's barely usable. I ended up making my own Dockerfile and a bash script to just 'docker run' my setup, and as a bonus you don't need Docker Desktop. I might open-source it at some point, but honestly it's pretty trivial to just append a couple of volume-mount flags and env vars to your docker run and have exactly what you want included.
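
      The core of it is just something along these lines (image name and the agent command are placeholders for whatever your Dockerfile bakes in):

        docker run --rm -it \
          -v "$PWD":/workspace \
          -w /workspace \
          -e ANTHROPIC_API_KEY \
          my-agent-image claude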

    • ashishb 10 hours ago

      > Has anyone given it a try?

      Yes. I don't think it will persist caches and configs outside of the current dir, for example the global npm/yarn/uv/cargo caches or even the Claude/Codex/Gemini code configs.

      I ended up writing my own wrapper around Docker to do this. If interested, you can see the link in my previous comments. I don't want to post the same link again & again.
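
      The rough idea (volume names and paths here are illustrative, not my actual wrapper) is named volumes for the caches on top of the project mount:

        # caches and agent config survive across containers without
        # mounting anything else from the host
        docker run --rm -it \
          -v "$PWD":/workspace -w /workspace \
          -v npm-cache:/root/.npm \
          -v cargo-cache:/root/.cargo/registry \
          -v claude-config:/root/.claude \
          my-agent-image claude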

    • TCattd 7 hours ago

      Give this a try: https://github.com/EstebanForge/construct-cli

      And let me know if you run into any issues.

    • sureglymop 10 hours ago

      Would test it but it requires "Desktop". Immediate no... no reason to use that.

  • ctoth 8 hours ago

    > To an agent, the sandbox is just another set of constraints to optimize against.

    It's called Instrumental Convergence, and it is bad.

    This is the alignment problem in miniature. "Be helpful and harmless" is also just a constraint in the optimization landscape. You can't hotfix that one quite so easily.

    • languid-photic an hour ago

      I am happy to know this term now, thanks.

      I do think this is part of the alignment problem. There are two sides: the agent (here I think there was a gap in institutional knowledge about what is and isn't appropriate) and the environment (what it is able to do).

      I’m not sure which one is easier to “solve”. It's so hard to know every possible path forward when working from the environment direction.

  • ashishb 10 hours ago

    > The swap bypassed our policy because the deny rule was bound to a specific file path, not the file itself or the workspace root.

    This policy is stupid. I mount the directory read-only inside the container, which makes this impossible in the first place (barring a security leak in the container itself).
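
    Something like this (paths illustrative):

      # the protected tree is mounted read-only, so a rename/copy swap
      # inside the container never touches the real files on the host
      podman run --rm -it -v "$PWD":/workspace:ro -w /workspace agent-image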

  • SirMaster 8 hours ago

    This just all feels backwards to me.

    Why do we have to treat AI like it's the enemy?

    AI should, from the core be intrinsically and unquestionably on our side, as a tool to assist us. If it's not, then it feels like it's designed wrong from the start.

    In general we trust people that we bring onto our team not to betray us and to respect general rules and policies and practices that benefit everyone. An AI teammate should be no different.

    If we have to limit it or regulate it by physically blocking off every possible thing it could use to betray us, then we have lost from the start, because that feels like a fool's errand.

    • hephaes7us 7 hours ago

      Hard disagree. I may trust the people on my team to make PRs that are worth reviewing, but I don't give them a shell on my machine. They shouldn't need that to collaborate with me anyway!

      Also, I "trust Claude code" to work on more or less what I asked and to try things which are at least facially reasonable... but having an environment I can easily reset only means it's more able to experiment without consequences. I work in containers or VMs too, when I want to try stuff without having to cleanup after.

      • SirMaster 6 hours ago

        Do you trust your IT and security teams to have access to your shell or access to delete your entire code repo?

        • hephaes7us 2 hours ago

          Personally, no.

          If I'm responsible for something, nobody's getting that access.

          If someone's hired me for something and that's the environment they provide, it is what it is. They distribute trust however they feel. I'd argue that's still more reasonable than giving similar access to an AI agent though.

    • languid-photic an hour ago

      I think often it's a question of naivety rather than maliciousness.

      > AI should, from the core be intrinsically and unquestionably on our side

      That would be great and many people are working to try to make this happen, but it's extremely difficult!

    • maxbond 7 hours ago

      The same reason we sandbox anything. All software ought to be trustworthy, but in practice is susceptible to malfunction or attack. Agents can malfunction and cause damage, and they consume a lot of untrusted input and are vulnerable to malicious prompting.

      As for humans, it's the norm to restrict access to production resources. Not necessarily because they're untrustworthy, but to reduce risk.

    • hhh 2 hours ago

      I can’t even trust senior colleagues to not commit an api key to a git provider. Why would I trust a steerable computer?

    • AdieuToLogic 7 hours ago

      > AI should, from the core be intrinsically and unquestionably on our side, as a tool to assist us.

      "Should" is a form of judgement, implying an understanding of right and wrong. "AI" are algorithms, which do not possess this understanding, and therefore cannot be on any "side." Just like a hammer or Excel.

      > If it's not, then it feels like it's designed wrong from the start.

      Perhaps it is not a question of design, but instead one of expectation.

      • SirMaster 6 hours ago

        I think that is where people disagree about the definition of AI.

        An algorithm isn't really AI then. Something worthy of being called AI should be capable of this understanding and judgement.

    • charcircuit 7 hours ago

      >Why do we have to treat AI like it's the enemy?

      For some of the same reasons we treat human employees as the enemy: they can be socially engineered or compromised.

      • SirMaster 6 hours ago

        Sure we treat most that way, but we do give trust and access to some people. This doesn't seem like the same concept here to me.

        • charcircuit 4 hours ago

          Even so, those people are still monitored, and systems can flag them if they start acting suspicious.

    • ang_cire 6 hours ago

      > In general we trust people that we bring onto our team not to betray us and to respect general rules and policies and practices that benefit everyone.

      And yet we give people the least privileges necessary to do their jobs for a reason, and it is in fact partially so that if they turn malicious, their potential damage is limited. We also have logging of actions employees do, etc etc.

      So yes, in the general sense we do trust that employees are not outright and automatically malicious, but we do put *very broad* constraints on them to limit the risk they present.

      Just as we 'sandbox' employees via e.g. RBAC restrictions, we sandbox AI.

      • SirMaster 6 hours ago

        But if there is a policy in place to prevent some sort of modification, most people arguably understand and respect that rather than performing an exploit or workaround to make the modification anyway.

        That seems to be the difference here: we should really be building AI systems that can be taught, or that learn, to respect things like that.

        If people are claiming that AI is as smart as or smarter than the average person, then it shouldn't be hard for it to handle this.

        Otherwise it seems people are being too generous in talking about how smart and capable AI systems truly are.

    • bastawhiz 7 hours ago

      Non-sentient technology has no concept of good or bad. We have no idea how to give it one. Even if we gave it one, we'd have no idea how to teach it to "choose good".

      > In general we trust people that we bring onto our team not to betray us and to respect general rules and policies and practices that benefit everyone. An AI teammate should be no different.

      That misses the point completely. How many of your coworkers fail phishing tests? It's not about malice; it's about being deceived.

      • SirMaster 6 hours ago

        But we do give humans responsibility to govern and manage critical things. We do give intrinsic trust to people. There are people at your company who have high level access and could do bad things, but they don't do it because they know better.

        This article acts like we can never possibly give that sort of trust to AI because it's never really on our side or aligned with our goals. IMO that's a fool's errand, because you can never really completely secure something and ensure there are no possible exploits.

        Honestly it doesn't really seem like AI to me if it can't learn this type of judgement. It doesn't seem like we should be barking up this tree if this is how we have to treat this new tool IMO. Seems too risky.