A bit confused, all this to say you folks use standard containerization?
Same. I didn't really understand what the difference is compared to containerization
Fundamentally, there is no difference. Blocking syscalls in a Docker container is nothing new; it's one of the standard ways to achieve "sandboxing" and can already be done today.
The only thing that caught people's attention was that it was applied to "AI Agents".
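For concreteness, a minimal sketch (not from the article) of what "blocking syscalls in a Docker container" can look like, here via the Docker SDK for Python with a custom seccomp profile. The image name and the list of blocked syscalls are placeholders:

```python
# Sketch only: run a containerized agent with a seccomp profile that blocks a
# few syscalls. Uses the Docker SDK for Python ("pip install docker"); the
# image name and syscall list are illustrative, not anything from the thread.
import json
import docker

# Allow-by-default profile that denies a handful of syscalls. Production
# profiles (like Docker's default) are allowlists instead.
seccomp_profile = {
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [
        {
            "names": ["ptrace", "mount", "kexec_load", "init_module"],
            "action": "SCMP_ACT_ERRNO",
        }
    ],
}

client = docker.from_env()
container = client.containers.run(
    "review-agent:latest",                                     # hypothetical image
    security_opt=["seccomp=" + json.dumps(seccomp_profile)],   # profile JSON is passed inline to the engine
    network_disabled=True,                                     # no network
    read_only=True,                                            # read-only root filesystem
    detach=True,
)
```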
What is so fundamentally different for AI agents?
The fact that the first thing people are going to do is punch holes in the sandbox with MCP servers?
Other than the current popular thing which is "AI agents", like all programs, it changes absolutely nothing.
They seem to be looking to let the agent access the source code for review. But in that case, the agent should only see the codebase and nothing else. For a code review agent, all it really needs are:
- Access to files in the repository(ies)
- Access to the patch/diff being reviewed
- Ability to perform text/semantic search across the codebase
That doesn’t require running the agent inside a container on a system with sensitive data. Exposing an API to the agent that specifically gives it access to the above data (a sketch follows below) avoids the risk altogether.
If it's really important that the agent is able to use a shell, why not use something like codespaces and run it in there?
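A rough sketch of how small that API surface could be: three calls backed by a read-only checkout, with no shell in sight. Every name here (paths, functions) is hypothetical, not anyone's actual product:

```python
# Minimal sketch of the narrow API described above -- read repo files, fetch
# the diff, search the codebase -- with no shell access.
import subprocess
from pathlib import Path

REPO_ROOT = Path("/srv/checkouts/repo").resolve()  # read-only checkout; nothing else is reachable

def read_file(rel_path: str) -> str:
    """Return one file from the repository, rejecting paths that escape it."""
    target = (REPO_ROOT / rel_path).resolve()
    if not target.is_relative_to(REPO_ROOT):
        raise PermissionError("path escapes the repository checkout")
    return target.read_text()

def get_diff(base: str, head: str) -> str:
    """Return the patch under review."""
    return subprocess.run(
        ["git", "-C", str(REPO_ROOT), "diff", f"{base}...{head}"],
        capture_output=True, text=True, check=True,
    ).stdout

def search(pattern: str) -> str:
    """Plain text search; a semantic-search index would sit behind a similar call."""
    return subprocess.run(
        ["git", "-C", str(REPO_ROOT), "grep", "-n", "-e", pattern],
        capture_output=True, text=True,
    ).stdout
```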
It would also need:
- Access to repo history
- Access to CI/CD logs
- Access to bug/issue tracking
This is a good explanation of how standard filesystem sandboxing works, but it's hopefully not trying to be convincing to security engineers.
> At Greptile, we run our agent process in a locked-down rootless podman container so that we have kernel guarantees that it sees only things it’s supposed to.
This sounds like a runc container because they've not said otherwise. runc has a long history of filesystem exploits based on leaked file descriptors and `openat` without `O_NOFOLLOW`[1].
The agent ecosystem seems to have already settled on VMs or gVisor[2] being table-stakes. We use the latter.
1. https://github.com/opencontainers/runc/security/advisories/G...
2. https://gvisor.dev/docs/architecture_guide/security/
OT: I wonder if WASM is ready to fulfill the sandboxing needs expressed in this article, i.e. can we put the AI agent into a web assembly sandbox and have it function as required?
You'll probably need some kind of WebGPU bindings, but I think it sounds feasible.
Just gonna toss this out there, using an agent for code review is a little weird. You can calculate a covering set for the PR deterministically and feed that into a long context model along with the diff and any relevant metadata and get a good review in one shot without the hassle.
That used to be how we did it, but this method performed better on super large codebases. One of the reasons is that grepping is a highly effective way to trace function calls to understand the full impact of a change. It's also great for finding other examples of similar code (for example the same library being used) to ensure consistency of standards.
If that's the case, isn't a grep tool a lot more tractable than a Linux agent that will end up mostly calling `grep`?
But then you can't say it's powered by AI and get that VC money.
Ah ha.
You shouldn't need the entire codebase, just a covering set for the modified files (you can derive this by parsing the files). If your PR is atomic, covering set + diff + business context is probably going to be less than 300k tokens, which Gemini can handle easily. Gemini is quite good even at 500k, and you can run it multiple times with KV cache for cheap to get a distribution (tell it to analyze the PR from different perspectives).
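As a sketch of what "derive this by parsing the files" might look like for a Python codebase (illustrative only; one level of imports, no transitive closure, run from the repo root):

```python
# Sketch of deriving a "covering set": files touched by the PR plus the local
# modules they import. Names and depth are hypothetical, not a product.
import ast
import subprocess
from pathlib import Path

def changed_files(base: str, head: str) -> set[Path]:
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{head}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {Path(p) for p in out.splitlines() if p.endswith(".py")}

def local_imports(path: Path) -> set[Path]:
    """Modules imported by `path` that resolve to files inside the repo."""
    deps: set[Path] = set()
    tree = ast.parse(path.read_text())
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            candidate = Path(name.replace(".", "/")).with_suffix(".py")
            if candidate.exists():
                deps.add(candidate)
    return deps

def covering_set(base: str, head: str) -> set[Path]:
    files = changed_files(base, head)
    for f in list(files):
        if f.exists():  # skip files deleted by the PR
            files |= local_imports(f)
    return files  # covering set + diff + business context -> long-context model
```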
"How can I sandbox a coding agent?"
"Early civilizations had no concept of zero..."
If you only care about filesystem sandboxing isn't Landlock the easiest solution?