Writing a C compiler in 500 lines of Python (2023)

(vgel.me)

122 points | by ofou 6 hours ago

23 comments

  • tomhow 4 hours ago

    Previously:

    Writing a C compiler in 500 lines of Python - https://news.ycombinator.com/item?id=37383913 - Sept 2023 (165 comments)

  • MarsIronPI 2 hours ago

    I find it surprising that a single-pass compiler is easier to implement than a traditional lexer->parser->AST->emitter. (I'm not a compiler expert, though.) I'd have expected that generating an AST would be at least as simple, if not simpler. Plus by generating an AST, doing some simple optimization is a lot easier: one can pattern-match parts of the AST and replace them with more efficient equivalents. Maybe I'm overthinking this, though. I tend to like extensible program designs, even when they don't necessarily make sense for the scale of the program…

    Still a really cool article and an impressive project, though. I especially like the StringPool technique; I'll have to keep it in mind if I ever write a compiler!

    • markus_zhang 4 minutes ago

      I think it depends on the language. I heard Turbo Pascal was pretty fast because of 1) Pascal’s language features and 2) no optimization, in TP 1.0 at least.

    • arjvik an hour ago

      Not sure if fewer LoC necessarily implies easier!

    • kragen 17 minutes ago

      I think this might depend on the language you're writing in.

      Historically, at least, it's pretty verbose to define a data type in Python compared to languages that are more designed for writing compilers. Consider these definitions from my prototype Bicicleta interpreter, which is written in ML, specifically OCaml:

          type methods = NoDefs
                                     (* name, body, is_positional ... *)
                         | Definition of string * bicexpr * bool * methods
          and bicexpr = Name of string
                        | Call of bicexpr * string
                        | Literal of string option * methods
                        | Derivation of bicexpr * string option * methods
                        | StringConstant of string
                        | Integer of int
                        | Float of float
                        | NativeMethod of (lookup -> bicobj)
      
      Those ten lines of code would be ten classes in Python with an average of 1.6 attributes each. Using dataclasses or attrs, that would be 36 lines of code, and then (if you're doing it the OO way) every function that I defined on one of these OCaml types becomes a method implemented in each class implementing a particular protocol, with a copy of its argument signature in every class. (If you used namedtuple instead, it's no less code, but you write it in fewer lines.) So, for example, this function on bicexprs

          let rec freevars = function
              Name n -> stringset [n]
            | Integer _ | StringConstant _ | Float _ -> stringset ["prog"]
            | NativeMethod _ -> stringset []
            | Literal (Some selfname, methods) -> 
                  StringSet.diff (freevars_methods methods) (stringset [selfname])
            | Literal (None, methods) -> freevars_methods methods
            | Derivation(object_, self, methods) ->
                  StringSet.union (freevars object_) (freevars (Literal(self, methods)))
            | Call(object_, _) -> freevars object_
      
      becomes six to eight method definitions in the different classes. (You can cut it down to six if you define an abstract base class for the constant classes.) And Literal.freevars needs an if-then-else. So that's another 20 lines of code.

      Python does support pattern-matching now, so functions like this might not be any more verbose than the ML version if you program them the same way instead of in the OO fashion. I haven't tried using Python pattern-matching, so I don't really know.

      In general, though, Python is more verbose than ML-family languages for this kind of thing by a factor of about 2–4, and that's before you count the test code you need in Python to get the kind of confidence in correctness that ML's type-checking gives you with no extra code. To my knowledge, Mypy doesn't do the kinds of pattern-matching-exhaustiveness checks that ML compilers do.

      I've sometimes "cheated" by trying to write code like this in Python using regular tuples rather than named tuples. You can definitely make it work, but it's a real pain to debug.

  • Liftyee 4 hours ago

    This article breaks it down well enough to make me feel like I could write my own C compiler targeting AVR. (I probably could... but it would not be easy.)

    Never actually looked into how compilers work before, it's surprisingly similar/related to linguistics.

    • measurablefunc 3 hours ago

      It's because when Chomsky developed the theory of formal grammars, he was studying natural languages and the universality of grammar¹. Computer scientists later realized they could use the same theory as a foundation for formalizing the grammatical structure of programming languages.

      ¹https://en.wikipedia.org/wiki/Chomsky_hierarchy

    • dekhn 28 minutes ago

      Similar experience in DNA/genome analysis. A large part of DNA analysis was based on parser theory.

      This paper was my introduction to DNA analysis as well as Chomsky hierarchy: https://www.jstor.org/stable/29774782 (I wasn't able to find a free copy).

      IIRC, pseudoknots in RNA can't be described by a context-free grammar; parsing them requires a context-sensitive (or at least mildly context-sensitive) grammar.

    • lukan an hour ago

      "compilers work before, it's surprisingly similar/related to linguistics."

      Since compilers transform languages with a clearly defined grammar ... the connection to linguistics is maybe not so surprising after all.

  • TZubiri 2 hours ago

    We've come full circle

  • weregiraffe 5 hours ago

    Now write a Python compiler in 500 lines of C.

    • bluGill 2 hours ago

      I could probably do it, but you wouldn't like it. My dictionaries would be a linked list, so looking up a key becomes a linear search (if you gave me C++ I'd use std::map). I'm assuming you'll allow me the C standard library; if I have to implement strlen or malloc within that 500 lines of C, I'm not sure I can pull it off. 500 lines is aggressive, but the IOCCC gives me plenty of tricks to get the line count down, and the language isn't that big. I'm also going to assume 100% valid Python code is fed in; if there is a bug or error of any sort, that is undefined behavior.

      Note that most of what makes Python great isn't the language; it's the library. I believe large parts of the Python library are written in C (for speed), so you won't be able to use my 500-line Python compiler for anything useful, because you won't have any useful libraries.

      • dekhn 23 minutes ago

        A hash table in C is about 30 lines of code, so I don't think you have to stick to linked lists for dictionaries.

        • ludocode 4 minutes ago

          Indeed, a decent closed hash table is maybe 30 lines. An open hash table with linear probing is even less, especially if you don't need to remove entries. It's almost identical to a linear search through an array; you just change where you start iterating.

          In my first stage Onramp linker [1], converting linear search to an open hash table adds a grand total of 24 bytecode instructions, including the FNV-1a hash function. There's no reason to ever linear search a symbol table.

          [1]: https://github.com/ludocode/onramp/blob/develop/core/ld/0-gl...
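          As a concrete illustration, here is a sketch of the idea (in Python to match the rest of the thread, rather than Onramp's bytecode): an open-addressed table with linear probing and FNV-1a, no deletion, and no resizing, so the fixed capacity must stay below full:

```python
# Editorial sketch of an open-addressed hash table with linear probing,
# FNV-1a hashing, and no entry removal. Lookup is a linear scan that
# simply starts at the hashed slot instead of slot 0.

def fnv1a(key: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5                       # FNV offset basis
    for b in key:
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF  # xor, then FNV prime
    return h

class SymbolTable:
    def __init__(self, capacity: int = 64):
        # capacity must be a power of two so masking works as the modulo
        self.slots = [None] * capacity   # each slot: None or (key, value)

    def _probe(self, key: bytes) -> int:
        """Index of key's slot, or of the empty slot where it would go."""
        mask = len(self.slots) - 1
        i = fnv1a(key) & mask
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) & mask           # linear probing: try the next slot
        return i

    def insert(self, key: bytes, value) -> None:
        self.slots[self._probe(key)] = (key, value)

    def lookup(self, key: bytes):
        slot = self.slots[self._probe(key)]
        return slot[1] if slot is not None else None
```

          Without deletion, insert and lookup share the same probe loop, which is what keeps this so close to a plain linear search.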

    • wyldfire 4 hours ago

      A Python VM that consumes bytecode might be doable in a not-ludicrous amount of C. Not 500 lines, I suppose, but something manageable? Especially if you targeted the older releases.

    • nickpsecurity an hour ago

      Maybe 500 lines of Pythonic, macro-heavy C. If the macros' LOC don't count. Maybe.

    • TZubiri 2 hours ago

      Not to be that guy, but Python is an interpreted language.

      That said, I guess technically you could make something that compiles Python to an executable? This is Hacker News, after all.

  • tvickery 5 hours ago

    I wrote one in 2 lines:

        import sys, subprocess

        subprocess.run(["gcc", sys.argv[1], "-o", "a.out"])

    • rossant 3 hours ago

      Now do it without imports.

    • emilbratt 3 hours ago

      That is not a compiler; that is called a wrapper script. But funny nonetheless.

      • amszmidt 2 hours ago

        The original cc was just a wrapper, like this Python example, around a bunch of external programs, calling c00, c01, and so on, until something could be fed to as and then linked using ld.

        GCC does basically the same thing even today.