An LLM is optimized for its training data, not for newly built formats or abstractions. I don’t understand why we keep building so-called "LLM-optimized" X or Y. It’s the same story we’ve seen before with TOON.
If a new programming language doesn’t need to be written by humans (though it should ideally still be readable for auditing), I hope people research languages that support formal methods and model checking tools. Formal methods have a reputation for being too hard or not scaling, but now we have LLMs that can write that code.
https://martin.kleppmann.com/2025/12/08/ai-formal-verificati...
Absolutely agreed. My theory is that the more tools you give the agent to lock down the possible output, the better it will be at producing correct output. My analogy is something like starting a simulated annealing run with bounds and heuristics to eliminate categorical false positives, or perhaps like starting the Sieve of Eratosthenes using a prime wheel to lessen the busywork.
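To make the sieve half of that analogy concrete, here's a rough Python sketch of a 2/3 prime wheel (nothing LLM-specific - just the idea of ruling out whole classes of candidates before the real work starts):

```
def primes_up_to(n):
    # Sieve of Eratosthenes with a 2/3 wheel: after seeding 2 and 3, only
    # candidates of the form 6k +/- 1 are ever visited, so two thirds of
    # the numbers are ruled out before any sieving work happens.
    if n < 5:
        return [p for p in (2, 3) if p <= n]
    sieve = bytearray([1]) * (n + 1)
    c, step = 5, 2  # walk 5, 7, 11, 13, ... by alternating steps of 2 and 4
    while c * c <= n:
        if sieve[c]:
            sieve[c * c::c] = bytearray(len(range(c * c, n + 1, c)))
        c, step = c + step, 6 - step
    primes, c, step = [2, 3], 5, 2
    while c <= n:
        if sieve[c]:
            primes.append(c)
        c, step = c + step, 6 - step
    return primes

print(primes_up_to(50))  # [2, 3, 5, 7, 11, ..., 43, 47]
```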
I also think opinionated tooling is important - for example, in the toy language I'm working on there are no warnings and no ignore pragmas, so the LLM has to confront error messages before it can continue.
The Validation Locality piece is very interesting and really got my brain going. It would be cool to denote test conditions inline with definitions. It would get gross for a human, but could work for an LLM with consistent delimiters. Something like (pseudocode):
```
fn foo(name::"Bob"|genName(2)):
    if len(name) < 3
        Err("Name too short!")
    print("Hello ", name)
    return::"Hello Bob"|Err
```
Right off the bat I don't like that it relies on accurately remembering list indexes to keep track of tests (something you brought up), but it was fun to think about this and I'll continue to do so. To avoid the counting issue you could provide tools like "runTest(number)", "getTotalTests", etc.
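To sketch how that could look in an existing language today, here's a rough Python approximation - cases, run_test, and get_total_tests are made-up names of mine, just mirroring the "runTest(number)" / "getTotalTests" idea:

```
# Rough approximation of "test conditions inline with the definition":
# a decorator records (args, expected) cases next to the function, and
# small helpers play the role of runTest(number) / getTotalTests.
_CASES = []

def cases(*specs):
    """Attach (args, expected) pairs to a function at definition time."""
    def wrap(fn):
        for args, expected in specs:
            _CASES.append((fn, args, expected))
        return fn
    return wrap

@cases((("Bob",), "Hello Bob"),
       (("Al",), ValueError))  # a too-short name should raise
def foo(name):
    if len(name) < 3:
        raise ValueError("Name too short!")
    return f"Hello {name}"

def get_total_tests():
    return len(_CASES)

def run_test(number):
    fn, args, expected = _CASES[number]
    try:
        result = fn(*args)
    except Exception as exc:
        result = type(exc)
    assert result == expected, f"case {number}: got {result!r}, expected {expected!r}"

if __name__ == "__main__":
    for i in range(get_total_tests()):
        run_test(i)
    print(f"{get_total_tests()} inline cases passed")
```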
One issue: The Loom spec link is broken.
I'm looking for a language optimized for use with coding agents. Something which helps me to make a precise specification, and helps the agent meet all the specified requirements.
I'm working on something similar. Dependently typed, theorem proving, regular syntax, long-form English words instead of symbols or abbreviations. It's not very well baked yet, but Claude/Codex are already doing really well generating it. I expect that once the repo has been around long enough to be included in training data it'll improve - probably next year or the year after.
I'm looking for a language optimized for human use given the fundamental architectural changes in computing in the last 50 years. That way we could skip both the boilerplate and the LLMs generating boilerplate.
He has good points about languages.
But it reminds me of the SEO guys optimizing for search engines. At the end of the day, the real long term strategy is to just "make good content", or in this case, "make a good language".
In the futuristic :) long term, in a "post programming-language world", I predict each big LLM provider will have its own proprietary compiler/VM/runtime. Why bother transpiling if you can own the experience and the result 100%, and compete on that with other LLM providers?
Weeks ago I was also noodling around with the idea of programming languages for LLMs, but as a means to co-design DSLs: https://blog.evacchi.dev/posts/2025/11/09/the-return-of-lang...
So where are the millions of lines of code you need to train the LLM on your new language? Remember, AI is just a statistical prediction thing. No input -> no output.
I get that this is essentially vibe coding a language, but it still seems lazy to me. He just asked the language model, zero-shot, to design a language with no real guidance. You could at least use the Rosetta Code examples and ask it to identify design patterns for a new language.
There's also the issue, which the author notes as well, that LLM-optimization quite often just becomes token-minimization, when it shouldn't be only that.
I was thinking the same. Maybe if he had tried to think instead of just asking the model. The premise is interesting: "We optimize languages for humans, maybe we can do something similar for LLMs." But then he just asks the model to do the thing instead of thinking about the problem. Instead of prompting "Hey, make this," a more granular, guided approach could've been better.
For me this is just lost potential on the topic, and an interesting read that turned boring pretty fast.
There was one other just yesterday: https://news.ycombinator.com/item?id=46571166
LLM-optimized in reality would mean you asked and answered millions of Stack Overflow questions about it and then waited a year or so for all the major models to retrain.
This is part of my strategy with my toy language, actually. By putting a repo on GitHub and hopefully building up useful examples there, I expect that within a year or two the language will be understood at a minimal level by the next generation of LLMs, which will make it more useful.
Honestly they're already pretty great at fitting the syntax if you provide the right context, so that may not be as much of an advantage as I initially thought, but it's fun to think about "just let them train on it next year once it's complete".
I've thought about this too.
The primary constraint is the size of the language specification. Any new programming language starts out not being in the training data, so in-context learning is all you've got. That makes it similar to a compression competition - the size of the codec is considered part of the output size in such contests, so you have to balance codec code against how effective it is. You can't win by making a gigantic compressor that produces a tiny output.
To me that suggests starting from a base of an existing language and using an iterative tree-based agent exploration. It's a super expensive technique and I'm not sure the ROI is worth it, but that's how you'd do it. You don't want to create a new language from scratch.
I don't think focusing on tokenization makes sense. The more you drift from the tokenization of the training text, the harder it will be for the model to work, just like with a human (and that's what the author finds). At best you might get small savings by asking it to write in something like Chinese, but the GPT-4/5 token vocabularies already have a lot of programming-related tokens like ".self", ".Iter", "-server" and so on. So trying to make something look shorter to a human can easily be counterproductive.
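If you want to check that intuition rather than guess, token counts are easy to measure. A quick sketch with tiktoken (the two snippets are arbitrary examples of mine, not from the article):

```
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era vocabulary

samples = {
    "conventional": "for (const item of items) { server.send(item); }",
    "symbol-dense": "∀x∈I: srv.snd(x)",  # shorter on screen, but off-vocabulary
}

for label, text in samples.items():
    tokens = enc.encode(text)
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens")
```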
A better approach is to look at where models struggle and try to optimize a pre-existing language for those issues. It might all be rendered obsolete by a better model released tomorrow, of course, but what I see are problems like these:
1. Models often want to emit imports or fully qualified names into the middle of code, because they can't go backwards and edit what they already emitted to add an import line at the top. So a better language for an LLM is one that doesn't require you to move the cursor upwards as you type (see the sketch after this list). Python/JS benefit here because you can run an import statement anywhere. Languages like Java or Kotlin are just about workable, because you can write out names in full and importing something is just a convenience, but languages that force you to import types only at the very top of the file are going to be hell for an LLM.
Taking this principle further, it may be useful to have a PL that lets you emit "delete last block" type tokens (a smarter ^H). If the model emits code that it then realizes was wrong, it no longer has to commit to it and build on it anyway; it can wipe it and redo it. I've often noticed GPT-5 use "no-op" patterns when it emits patches, where it deletes a line and then immediately re-adds the exact same line, and I think it's because it changed what it wanted to do halfway through emitting a patch but had no way to stop except by doing a no-op.
The nice thing about this idea is that it's robust to model changes. For as long as we use auto-regression this will be a problem. Maybe diffusion LLMs find it easier but we don't use those today.
2. As the article notes, models can struggle with counting indentation, especially when emitting patches. That suggests NOT using a whitespace-sensitive language like Python. I keep hearing that Python is the "language of AI", but objectively models do sometimes still make mistakes with indentation. In a brace-based language this isn't a problem: you can just mechanically reformat any file the LLM edits after it's done. In a whitespace-sensitive language that's not an option.
3. Heavy use of optional type inference. Types communicate lots of context in a small number of tokens, but demanding that the model actually write out types is also inefficient (it knows in its activations what the types are meant to be). So what you want is to encourage the model to rely heavily on type inference even if the surrounding code is explicit, then use a CLI tool that automatically adds in the missing type annotations, i.e. you enrich the input and shrink the output. TypeScript, Kotlin etc. are all good for this. Languages like Clojure, I think, not so good, despite appearing token-efficient on the surface.
4. In the same way you want to let the model import code halfway through a file, it'd be good to also be able to add dependencies halfway through a file, without needing to manually edit a separate file somewhere else. Even if it's redundant, you should be able to write something like "import('@foo/bar:1.2.3').SomeType.someMethod". Languages like JS/TS are the closest to this. You can't do it in most languages, where the definition of a package+version is very far, both textually and semantically, from the place where it's used.
5. Agree with the author that letting test and production code be interleaved sounds helpful. Models often forget to write tests but are good at following the style of what they see. If they see test code intermixed with the code they're reading and writing they're more likely to remember to add tests.
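To make point 1 concrete (the sketch referenced above), here's a trivial Python example of the "no upward edits needed" property:

```
def parse_payload(raw: str) -> dict:
    # The import can be emitted exactly where the model realizes it needs it;
    # nothing above this point has to be revisited or edited.
    import json
    return json.loads(raw)

def checksum(raw: bytes) -> str:
    # Same story for a second dependency discovered later in the file.
    import hashlib
    return hashlib.sha256(raw).hexdigest()
```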
There are probably dozens of ideas like these. The nice thing is, if you implement it as a pre-processor on top of some other language, you exploit the existing training data as much as possible, and in fact the codebase it's working on becomes 'training data' as well, just via ICL.
A language is LLM-optimized if there’s a huge amount of high-quality prior art, and if the language tooling itself can help the LLM iterate and catch errors.
I think I've come full circle back to the idea that a human should write the high-level code unassisted, and the LLM should autocomplete the glue code and draft the function implementations. The important part is that the human maintains these narrow boundaries and success criteria within them. The better the scaffolding, the better the result.
Nothing else really seems to make sense or work all that well.
On the one extreme you have people wanting the AI to write all the code on vibes. On the other extreme you have people who want agents that hide all low-level details behind plain English except the tool calls. To me these are basically the same crappy result, where we hide the code the wrong way.
I feel like what we really need is templating instead of vibes or agent frameworks. Put another way, I just want the code folding in my editor to magically write the code for me when I unfold. I just want to distribute that template and let the user run it in a sandbox. If we're going to hide code from the user, at least this way it's not a crazy mess behind the scenes, and the user can judge what it actually does when the template is written in a "literate code" style.
> Humans don't have to read or write or understand it. The goal is to let an LLM express its intent as token-efficiently as possible.
Maybe in the future humans won't have to verify the spelling, logic, or ground truth of programs either, because we'll all have to give up and assume that the LLM knows everything. /s
Sometimes, when I read these blogs from vibe-coders who have become completely complacent with LLM slop, I have to keep reminding others why regulations exist.
Imagine if LLMs became fully autonomous pilots on commercial planes, or planes were optimized for AI control, and the humans just boarded and flew for the vibes. Maybe call it "Vibe Airlines".
Why didn't anyone think of that great idea? Why not completely remove the human from the loop as well?
Good idea, isn't it?
There are multiple layers and implicit perspectives here that I think most people are purposely omitting, as a play for engagement or something else.
The reason LLMs are still restricted to higher-level programming languages is that there are no guarantees of correctness - any guarantee has to be provided by a human - and it is already difficult for humans to review other humans' code.
If there comes a time when LLMs can generate code - whether some call it slop or not - with a guarantee of correctness, it is probably the right move to have a more token-efficient language, or at least a different abstraction from the programming abstractions designed for humans.
Personally, I think in the coming years there will be a subset of programming that LLMs can probably perform while providing a guarantee of correctness - likely using other tools, such as Lean.
I believe this capability can be stated as: LLMs should be able to obfuscate any program code - which is a pretty decent guarantee.
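To make the Lean point concrete, here's a minimal Lean 4 sketch (a toy example of my own, not anything from the article) of what a machine-checked guarantee looks like - the theorem statement is the specification, and the kernel checks the proof regardless of who or what wrote it:

```
-- A toy Lean 4 proof: the statement is the spec, the kernel verifies the
-- proof, so a reviewer only has to trust the statement, not the author.
theorem add_comm_spec (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```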