Previously:
Writing a C compiler in 500 lines of Python - https://news.ycombinator.com/item?id=37383913 - Sept 2023 (165 comments)
I find it surprising that a single-pass compiler is easier to implement than a traditional lexer->parser->AST->emitter. (I'm not a compiler expert, though.) I'd have expected that generating an AST would be at least as simple, if not simpler. Plus, by generating an AST, simple optimizations become a lot easier: one can pattern-match parts of the AST and replace them with more efficient equivalents. Maybe I'm overthinking this, though. I tend to like extensible program designs, even when they don't necessarily make sense for the scale of the program…
Still a really cool article and an impressive project, though. I especially like the StringPool technique; I'll have to keep it in mind if I ever write a compiler!
I think it depends on the language. I heard Turbo Pascal compiled pretty fast because of (1) Pascal's language features and (2) the absence of optimization, in TP 1.0 at least.
Not sure if fewer LoC necessarily implies easier!
I think this might depend on the language you're writing in.
Historically, at least, it's pretty verbose to define a data type in Python compared to languages that are more designed for writing compilers. Consider these definitions from my prototype Bicicleta interpreter, which is written in ML, specifically OCaml:
type methods = NoDefs
(* name, body, is_positional ... *)
| Definition of string * bicexpr * bool * methods
and bicexpr = Name of string
| Call of bicexpr * string
| Literal of string option * methods
| Derivation of bicexpr * string option * methods
| StringConstant of string
| Integer of int
| Float of float
| NativeMethod of (lookup -> bicobj)

Those ten lines of code would be ten classes in Python, with an average of 1.6 attributes each. Using dataclasses or attrs, that would be 36 lines of code, and then (if you're doing it the OO way) every function I defined on one of these OCaml types becomes a method implemented in each class implementing a particular protocol, with a copy of its argument signature in every class. (If you used namedtuple instead, it's no less code, but you write it in fewer lines.) So, for example, this function on bicexprs becomes six to eight method definitions in the different classes. (You can cut it down to six if you define an abstract base class for the constant classes.) And Literal.freevars needs an if-then-else. So that's another 20 lines of code.

Python does support pattern matching now, so functions like this might not be any more verbose than the ML version if you write them the same way instead of in the OO fashion. I haven't tried Python's pattern matching, so I don't really know.
In general, though, Python is more verbose than ML-family languages for this kind of thing by a factor of about 2–4, and that's before you count the test code you need in Python to get the kind of confidence in correctness that ML's type-checking gives you with no extra code. To my knowledge, Mypy doesn't do the kinds of pattern-matching-exhaustiveness checks that ML compilers do.
I've sometimes "cheated" by trying to write code like this in Python using regular tuples rather than named tuples. You can definitely make it work, but it's a real pain to debug.
This article breaks it down well enough to make me feel like I could write my own C compiler targeting AVR. (I probably could... but it would not be easy.)
Never actually looked into how compilers work before, it's surprisingly similar/related to linguistics.
It's because when Chomsky invented the theory of formal grammars, he was studying natural languages and the universality of abstract grammar¹. Computer scientists later realized that they could use the same theory as a foundation for formalizing the grammatical structures of programming languages.
¹https://en.wikipedia.org/wiki/Chomsky_hierarchy
Similar experience in DNA/genome analysis. A large part of DNA analysis was based on parser theory.
This paper was my introduction to DNA analysis as well as Chomsky hierarchy: https://www.jstor.org/stable/29774782 (I wasn't able to find a free copy).
IIRC, pseudoknots in RNA can't be described by context-free grammars; parsing them requires something context-sensitive (or at least mildly context-sensitive).
"Never actually looked into how compilers work before, it's surprisingly similar/related to linguistics."
Since compilers transform languages with a clearly defined grammar ... the connection to linguistics is maybe not so surprising after all.
We've come full circle
Now write a Python compiler in 500 lines of C.
I could probably do it, but you wouldn't like it. My dictionaries would be linked lists, so looking up a key becomes a linear search... (If you gave me C++, I'd use std::map.) I'm assuming you'll allow me to use the C standard library; if I have to implement strlen or malloc within that 500 lines of C, I'm not sure I can pull it off. 500 lines is aggressive, but the IOCCC gives me plenty of tricks to get the line count down, and the language isn't that big. I'm also going to assume 100% valid Python code is fed in; if there is a bug or error of any sort, that's undefined behavior.
Note that most of what makes Python great isn't the language; it's the library. I believe large parts of the Python standard library are also written in C (for speed), so you won't be able to use my 500-line Python compiler for anything useful, because you won't have any useful libraries.
A hash table in C is about 30 lines of code, so I don't think you have to stick to linked lists for dictionaries.
Indeed, a decent closed hash table is maybe 30 lines. An open hash table with linear probing is even less, especially if you don't need to remove entries. It's almost identical to a linear search through an array; you just change where you start iterating.
In my first stage Onramp linker [1], converting linear search to an open hash table adds a grand total of 24 bytecode instructions, including the FNV-1a hash function. There's no reason to ever linear search a symbol table.
[1]: https://github.com/ludocode/onramp/blob/develop/core/ld/0-gl...
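As a rough sketch of the technique (in Python rather than C or Onramp bytecode, and insert-only with a fixed capacity, which are simplifications not in the comment above): FNV-1a plus linear probing is barely more code than a linear scan.

```python
def fnv1a(s: str) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5  # FNV-1a 32-bit offset basis
    for byte in s.encode():
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF  # FNV 32-bit prime
    return h

class SymbolTable:
    """Open-addressing hash table, insert-only (no deletion, no resizing).
    The caller is responsible for keeping it below capacity."""

    def __init__(self, capacity: int = 64):
        self.keys = [None] * capacity
        self.values = [None] * capacity

    def _probe(self, key: str) -> int:
        # Like a linear search through an array, except we start at the
        # hashed index instead of index 0.
        i = fnv1a(key) % len(self.keys)
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % len(self.keys)  # linear probing
        return i

    def insert(self, key: str, value) -> None:
        i = self._probe(key)
        self.keys[i], self.values[i] = key, value

    def lookup(self, key: str):
        i = self._probe(key)
        return self.values[i] if self.keys[i] == key else None
```

The only structural difference from linear search is the starting index, which is what makes the upgrade so cheap.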
A Python VM that consumes bytecode might be doable in a not-ludicrous amount of C. Not 500 lines, I suppose, but something manageable, I think? Especially if you targeted the older releases.
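To show how small the core dispatch loop of such a VM can be, here's a toy stack-based example (in Python for brevity, though the shape translates directly to a C switch). The instruction set is made up for illustration; it is not CPython's bytecode.

```python
def run(code, consts):
    """Execute a list of (opcode, arg) pairs against a constant pool."""
    stack = []
    pc = 0
    while pc < len(code):
        op, arg = code[pc]
        pc += 1
        if op == "PUSH_CONST":
            stack.append(consts[arg])   # arg indexes the constant pool
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "RETURN":
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode: {op}")

# run([("PUSH_CONST", 0), ("PUSH_CONST", 1), ("ADD", None), ("RETURN", None)],
#     [2, 3]) → 5
```

The bulk of a real Python VM's bulk is elsewhere: the object model, the standard library, and the long tail of opcodes.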
Maybe 500 lines of Pythonic, macro-heavy C. If the macros' LOC don't count. Maybe.
Not to be that guy, but Python is an interpreted language.
That said, I guess technically you could make something that compiles Python to an executable? This is Hacker News, after all.
I wrote one in 2 lines:
import sys, subprocess
subprocess.run(["gcc", sys.argv[1], "-o", "a.out"])
Now do it without imports.
That is not a compiler. That is called a wrapper script. But funny nonetheless.
The original cc was just a wrapper like this Python example around a bunch of external programs: it called c00, c01, and so on, until something could be fed to as and then linked using ld.
GCC does basically the same thing even today.
Yeah, but c00 and c01 actually do stuff.
So does gcc