I got a WebAssembly build of this working and fired up a web playground for trying it out: https://simonw.github.io/research/monty-wasm-pyodide/demo.ht...
It doesn't have class support yet!
But it doesn't matter, because LLMs that try to use a class will get an error message and rewrite their code to not use classes instead.
Notes on how I got the WASM build working here: https://simonwillison.net/2026/Feb/6/pydantic-monty/
This is a really interesting take on the sandboxing problem. This reminds me of an experiment I worked on a while back (https://github.com/imfing/jsrun), which embedded V8 into Python to allow running JavaScript with tightly controlled access to the host environment. Similar in goal to run untrusted code in Python.
I’m especially curious about where the Pydantic team wants to take Monty. The minimal-interpreter approach feels like a good starting point for AI workloads, but the long tail of Python semantics is brutal. There is a trade-off between keeping the surface area small (for security and predictability) and providing enough language capability to handle the non-trivial snippets that LLMs generate for complex tasks.
Monty is the missing link that's made me ship my Rust-based RLM implementation - and I'm certain it'll come in handy in plenty of other contexts.
Just beware of panics!
rlm-rs: https://crates.io/crates/rlm-rs src: https://github.com/synth-laboratories/Horizons
I don't quite understand the purpose. Yes, it's clearly stated, but, what do you mean "a reasonable subset of Python code" while "cannot use the standard library"? 99.9% of Python I write for anything ever uses standard library and then some (requests?). What do you expect your LLM-agent to write without that? A pseudo-code sorting algorithm sketch? Why would you even want to run that?
They plan to use it for "Code Mode", which means the LLM will use this to run Python code that it writes to call tools, instead of having to load the tools up front into the LLM context window.
It's Pydantic: they're verifying types and syntax, and those don't require the stdlib. Type hints, syntax checks, likely logical issues, etc. Static type checking is good at that, but LLMs can take it to the next level, where they analyze the intended data flow and find logical bugs, or code with valid syntax and typing that isn't what was intended.
For example, incorrect levels of indentation. Let me use dots instead of spaces because of HN formatting:
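A rough sketch of the Code Mode idea in plain Python, to make it concrete. Everything here is an assumption for illustration: `get_orders`, `send_report` and `run_code_mode` are made-up names, and the plain `exec()` call stands in for a restricted interpreter (it is not a sandbox and this is not Monty's API). The point is only that the script chains tools and hands back a small result, so the 100 intermediate rows never enter the model's context.

```python
# Hypothetical host-side tools; names and data are illustrative only.
def get_orders(customer_id: str) -> list:
    """Pretend tool that returns far more rows than we want in the LLM context."""
    return [{"id": i, "total": i * 10} for i in range(1, 101)]


def send_report(text: str) -> str:
    """Pretend tool that delivers a short summary somewhere."""
    return f"sent: {text}"


TOOLS = {"get_orders": get_orders, "send_report": send_report}

# Code the model might write: it calls tools directly and keeps only a summary.
LLM_SCRIPT = """
orders = get_orders("c-42")
big = [o for o in orders if o["total"] > 900]
result = send_report(f"{len(big)} orders over 900")
"""


def run_code_mode(script: str, tools: dict):
    # WARNING: exec() is NOT a security boundary. It is a placeholder for
    # whatever restricted interpreter actually runs the untrusted script.
    namespace = {"__builtins__": {"len": len}, **tools}
    exec(script, namespace)
    return namespace.get("result")


summary = run_code_mode(LLM_SCRIPT, TOOLS)
print(summary)
```

Only `summary` goes back to the model; the tool definitions themselves are just names in the script's namespace.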
for key,val in mydict.items():
..if key == "operation":
....logging.info("Executing operation %s",val)
..if val == "drop_table":
....self.drop_table()
This is valid syntax, and the logging part comes from the stdlib, so I assume it would ignore it or replace it with dummy code? That shouldn't prevent it from analyzing the loop and determining that the second if-block was intended to be nested under the first; as written, the key check doesn't guard the drop.
In other words, if you don't want to validate proper stdlib/module usage, but proper __Python__ usage, this makes sense. Although I'm speculating on exactly what they're trying to do.
EDIT: I think my speculation was wrong, it looks like they might have developed this to write code for pydantic-ai: https://github.com/pydantic/pydantic-ai , I'll leave the comment above as-is though, since I think it would still be cool to have that capability in pydantic.
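Here is the same loop as runnable Python, next to what was presumably intended. The function names and the sample dict are mine, and the drop is recorded in a list instead of calling `self.drop_table()`, so the difference is easy to see.

```python
import logging


def as_written(mydict: dict) -> list:
    """The loop as posted: the second `if` is a sibling of the first."""
    dropped = []
    for key, val in mydict.items():
        if key == "operation":
            logging.info("Executing operation %s", val)
        if val == "drop_table":  # checked for every key, not just "operation"
            dropped.append(key)
    return dropped


def as_intended(mydict: dict) -> list:
    """Presumed intent: the drop only happens under the "operation" key."""
    dropped = []
    for key, val in mydict.items():
        if key == "operation":
            logging.info("Executing operation %s", val)
            if val == "drop_table":
                dropped.append(key)
    return dropped


data = {"operation": "select", "comment": "drop_table"}
print(as_written(data))   # the "comment" key triggers the drop
print(as_intended(data))  # nothing is dropped
```

Both versions parse and type-check cleanly, which is why a purely syntactic or type-level check would not flag the first one.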
This feels like the time I was a Mercurial user before I moved to Git.
Everyone was using git for reasons that seemed bandwagon-y to me, when Mercurial just had such a better UX and mental model.
Now, everyone is writing agent `exec`s in Python, when I think TypeScript/JS is far better suited for the job (it was always fast + secure, not to mention more reliable and information dense b/c of typing).
But I think I'm gonna lose this one too.
Tangentially, I wonder whether the recent GIL changes will translate into any improvements for Mercurial.
Yep, still using good old hg for personal repos. Interop with outside projects defaults to git, since almost all the hg hosts have withered.
A big benefit of letting agents run code is they can process data without bloating their context.
LLMs are really good at writing Python for data processing. I suspect it's due to Python having a really good ecosystem around this niche.
And the type safety/security issues can hopefully be mitigated by ty and pyodide (already used by cf’s python workers)
https://pyodide.org/en/stable/
https://github.com/astral-sh/ty
Can we please make as little js as possible?
Why one would drag this godforsaken abomination onto the server side is beyond me.
Even effing C# nowadays can be run in a script-like manner from a single file.
—
Even the latest Codex UI app is Electron. The one that is supposed to write itself with AI wonders but couldn’t manage native SwiftUI, WinUI, and Qt or whatever is on Linux these days.
My favourite languages are F# and OCaml, and from my perspective, TypeScript is a far better language than C#.
TypeScript’s types are far more adaptable and malleable, even compared with the latest C# 15, which is belatedly adding sum types. If I set TypeScript to its strictest settings, I can even make it mimic a poor man’s Haskell and write existential types or monoids.
And JS/TS have by far the best libraries and utilities for JSON and XML parsing and string manipulation this side of Perl (the difference being that the TypeScript version is actually readable), and maybe Nushell, but I’ve never used Nushell in production.
Recently I wrote a Linux CLI tool for managing Podman/Quadlet containers in TypeScript, and it was a joy to use. The Effect library gave me proper Error types and immutable data types, and the Bun Shell makes writing shell commands in TS nearly as easy as Bash. And I got it to compile to a single self-contained binary which I can run on any server and which has a lower memory footprint and faster startup time than any equivalent .NET code I’ve ever written.
And yes, had I written it in Rust it would have been faster and probably even safer, but for a quick and dirty tool, development speed matters, and I can tell you that I really appreciated not having to think about ownership and fight the borrow checker the whole time.
TypeScript might not be perfect, but it is a surprisingly good language for many domains and is still undervalued IMO given what it provides.
I would say the same about Python, a language that has clearly got far too big for its boots.
Maybe a dumb question, but couldn't you use seccomp to restrict which syscalls the Python interpreter has access to? For example, if you don't want it messing with your host filesystem, you could just deny it any filesystem-related system calls? What is the benefit of using a completely separate interpreter?
Yours is a valid approach. But you always gotta wonder if there’s some way around it. When you start with a runtime that has ways of accessing every aspect of your system, there are a lot of ways an attacker might try to defeat the blocks you put in place. The point of starting with something super minimal is that the attack surface is tiny. Really hard to see how anything could break out.
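To illustrate the "start minimal" side of that trade-off, here is a toy allowlist evaluator built on Python's `ast` module. It is only a sketch of the general idea and not how Monty works; `safe_eval` and `ALLOWED_BINOPS` are names I made up. Anything not explicitly handled is rejected, so there is nothing to block after the fact.

```python
import ast
import operator

# The only operations this toy evaluator knows about.
ALLOWED_BINOPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
}


def _eval(node):
    # Numeric literals only (bool is excluded because type(True) is bool).
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    # Binary arithmetic from the allowlist, evaluated recursively.
    if isinstance(node, ast.BinOp) and type(node.op) in ALLOWED_BINOPS:
        func = ALLOWED_BINOPS[type(node.op)]
        return func(_eval(node.left), _eval(node.right))
    # Everything else (calls, names, attributes, imports) is refused.
    raise ValueError(f"unsupported syntax: {type(node).__name__}")


def safe_eval(source: str):
    """Evaluate a tiny arithmetic subset of Python expressions."""
    tree = ast.parse(source, mode="eval")
    return _eval(tree.body)


print(safe_eval("2 + 3 * 4"))

try:
    safe_eval("__import__('os').system('ls')")
except ValueError as exc:
    print(exc)
```

The seccomp route is the opposite shape: the full interpreter is there and you subtract capabilities, so the question becomes whether the deny rules cover every path.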
I'm enjoying watching the battle for where to draw the sandbox boundaries (and I don't have any answers, either!)
Well I love the name, so definitely trying this out later, but first...
And now for something completely different.
If we’re going to have LLMs write the code, why not something more performant? Like pages and pages of Java maybe?
This is pretty performant for short scripts if you measure time "from code to Rust", which can be as low as 1us.
Of course it's slow for complex numerical calculations, but that's not the primary use case.
I think the consensus is that LLMs are very good at writing python and ts/js, generally not quite as good at writing other languages, at least in one shot. So there's an advantage to using python/js/ts.
Seems like we should fix the LLMs instead of bending over backwards, no?
I like the idea a lot but it's still unclear from the docs what the hard security boundary is once you start calling LLMs - can it avoid "breaking out" into the host env in practice?
Wow, a start latency of 0.06ms
It is absurd for any user to use a half-baked Python interpreter, especially one that will always lag far behind CPython in what it supports. I advise sandboxing CPython with OS features instead.
Python already has a lot of half-baked (all the way up to nearly-fully-baked) interpreters, what's one more?
https://en.wikipedia.org/wiki/List_of_Python_software#Python...
The repo does make a case for this, namely speed, which does make sense.
True, but while CPython does have a reputation for slow startup, completely re-implementing it isn't the only way to work around that - e.g. with eryx [1] I've managed to pre-initialize and snapshot the Wasm and pre-compile it, to get real CPython starting in ~15ms, without compromising on language features. It's doable!
[1] https://github.com/eryx-org/eryx