SectorC: A C Compiler in 512 bytes (2023)

(xorvoid.com)

304 points | by valyala 16 hours ago ago

59 comments

  • layer8 10 hours ago

    If this implementation had existed in the 1980s, the C standard would have a rule that different tokens hashing to the same 16-bit value invoke undefined behavior, and optimizing compilers in the 2000s would simply optimize such tokens away to a no-op. ;)

    • RodgerTheGreat 6 hours ago

      "you don't have -wTokenHashCollision enabled! it's your own foolish ignorance that triggered UB; the spec is perfectly clear!"

    • xorvoid 9 hours ago

      Too real! LMAO

  • zahlman 21 minutes ago

    > Big Insight #2 is that atoi() behaves as a (bad) hash function on ordinary text. It consumes characters and updates a 16-bit integer.

    I could have sworn I remembered atoi() being defined to return 0 for invalid input (i.e. text not representing an integer in base ten).

  • xorvoid 14 hours ago

    I may be the author.. enjoy! It was an absolute blast making this!

    • veltas 14 hours ago

      This is very nice. I'm currently writing a minimalist C compiler although my goal isn't fitting in a boot sector, it's more targeted at 8-bit systems with a lot more room than that.

      This is a great demonstration of how simple the bare bones of C are, which I think is one reason I and many others find it so appealing despite how Spartan it is. C really evolved from B which was a demake of Fortran, if Ken Thompson is to be trusted.

    • JamesTRexx 13 hours ago

      Would and how much would it shrink when if, while, and for were replaced by the simple goto routine? (after all, in assembly there is only jmp and no other fancy jump instruction (I assume) ).

      And PS, it's "chose your own adventure". :-) I love minimalism.

      • MobiusHorizons an hour ago

        The “fancy jump” is the branch instruction. As far as I know all ISAs have them. Even rv32i which is famously minimal has several branch instructions in addition to two forms of unconditional jump. Branches are typically used to construct if / for / while as well as && and || (because of short circuiting) and ternary (although some architectures may have special instructions for that that may or may not be faster than branches depending on the exact model). Without it you would have to use computed goto with a destination address computed without conditional execution using constant time techniques.

      • dzaima 9 hours ago

        It only does if & while, not for. A goto in a single-pass thing would need separate handling for forwards vs backwards jumps, which involves keeping track of data per name (in a form where you can tell when it's not yet set; whereas if/while data is freely held in recursion stack). And you'd still need to handle at least `if ( expr ) goto foo;` to do any conditionals at all.

      • SAI_Peregrinus 13 hours ago

        What fancy jumps are present in assembly depends on the CPU architecture. But there are always conditional jumps, like JNZ that jumps if the Zero flag isn't set.

      • direwolf20 9 hours ago

        It's "choose your own adventure"

        • globalnode 8 hours ago

          thats the most important thing i noticed about the article, apart from the forth tokenising ideas.

    • einpoklum 13 hours ago

      An interesting use case - for the compiler as-is or for the essentiall idea of barely-C - might be in bootstrapping chains, i.e. starting from tiny platform-specific binaries one could verify the disassembly of, and gradually building more complex tools, interpreters, and compiler, so that eventually you get to something like a version of GCC and can then build an entire OS distribution.

      Examples:

      https://github.com/cosinusoidally/mishmashvm/

      and https://github.com/cosinusoidally/tcc_bootstrap_alt/

  • mati365 13 hours ago

    Oh, it looks like my X86-16 boot sector C compiler that I made recently [1]. Writing boot sector games has a nostalgic magic to it, when programming was actually fun and showed off your skills. It's a shame that the AI era has terribly devalued these projects.

    [1] https://github.com/Mati365/ts-c-compiler

    • w4yai 4 hours ago

      > when programming was actually fun and showed off your skills

      Oh no. Now more people are able to do what I do. I'm not special anymore.

      • mlsu 4 hours ago

        Seems like this is facetious but to me, “I’m not special” is a pretty valid thing to be sad about.

      • tgv 3 hours ago

        The two dos in "do what I do" do absolutely not carry the same meaning.

  • shikaan an hour ago

    Such a great read! Reminds me of the bootsector OS I made some time ago[^1]

    Maybe it's time to equip it with a C compiler...

    [1]: https://github.com/shikaan/osle

  • riedel 14 hours ago

    Beautiful, but make sure to quickly add 2023 to the title.

    Discussed at the time: https://news.ycombinator.com/item?id=36064971

  • mojuba 13 hours ago

    Compare that to the C compiler in 100,000 lines written by Claude in two weeks for $20,000 (I think was posted on HN just yesterday)

    • vidarh 13 hours ago

      It's a fun comparison, but with the notable difference that that one can compile the Linux kernel and generate code for multiple different architectures, while this one can only compile a small proportion of valid C. It's a great project, but it's not so much a C compiler, as a compiler for a subset of C that allows all programs this compiler can compile to also be compiled by an actual C compiler, but not vice versa.

      • d_silin 12 hours ago

        But can it compile "Hello, World" example from its own README.md?

        https://github.com/anthropics/claudes-c-compiler/issues/1

        • Retr0id 12 hours ago

          It's fascinating how few people read past the issue title

          • fooker 8 hours ago

            And this is exactly why coding with AI is not-so-slowly taking over.

            Most people think they are more capable than they actually are.

        • vidarh 12 hours ago

          Noticed the part where all it requires is to actually have the headers in the right location?

          • d_silin 12 hours ago

            "The location of Standard C headers do not need to be supplied to a conformant compiler."

            From https://news.ycombinator.com/item?id=46920922 discussion.

            • vidarh 11 hours ago

              And it doesn't for the compiler in question either. As long as the headers exist in the places it looks for them. No compiler magically knows where the headers are if you haven't placed them in the right location

              • Retr0id 11 hours ago

                stddef.h (et al) should be shipped by the compiler itself, and so it should know where it is. But they rely on gcc for it, hence it doesn't always know where to look. Seems totally fine for a prototype.

                • vidarh 10 hours ago

                  Especially given they're not shipping anything. The GCC binaries can't find misplaced or not installed headers either.

              • d_silin 11 hours ago

                Would you accept the same quality of implementation from a human team?

                • dzaima 9 hours ago

                  I've certainly encountered clang & gcc not finding or just not having header files a good couple times. Mostly around cross-compilation, but there was a period of time for which clang++ just completely failed to find any C++ headers on my system.

                • fooker 8 hours ago

                  Yes, clang is famously in this category.

                  If you copy the clang binary to a random place in your filesystem, it will fail to compile programs that include standard headers.

                • vidarh 10 hours ago

                  A compiler that can't magically know how to find headers that don't exist in the expected directory?

                  Yes, that is the case for pretty much every compiler. I suppose you could build the headers into the binary, but nobody does that.

                  • tekne 8 hours ago

                    Consider: content-addressed headers.

      • mojuba 9 hours ago

        Well I'm pretty sure the author can make a compliant C compiler in a few more sectors.

  • sanufar 14 hours ago

    The way hashing is used for tokens and for making a pseudo symbol table is such an elegant idea.

    • avadodin a minute ago

      I actually "shipped" a parser using the symbols' hash(as the only identifier) for a test tool once. Hopefully, the users never used enough symbols to collide 32-bits.

    • fix4fun 13 hours ago

      I think the same. Really nice project and good trick with hashing tokens.

      PS. There left 21 bytes (21 * 0x00 - from 0x01e0 to 0x01fd). Maybe something can be packed there ;)

  • kreelman 5 hours ago

    There seems to be a good amount of interest for a boot sector compiler!!

    If you're running on Linux, adjust the qemu call to use alsa rather than coreaudio.

    I generated a pull request for this on Github. If the author is happy enough with my verbose shell scripting style :-) it might get included.

  • fooker 7 hours ago

    This is so cool!

    Fun fact, Tiny C Compiler was derived from such a C compiler submitted to the the International Obfuscated C Code Contest.

    https://www.ioccc.org/2001/bellard/index.html

  • userbinator 6 hours ago

    C-subset, to be precise; but microcomputer C compilers were in the tens of KB range, for one that can actually compile real C.

  • DeathArrow 2 hours ago

    For me is not interesting because it fits in 512 bytes, it's interesting because it's very simple. I think it would be a great introduction to learning about compilers.

  • SeanSullivan86 13 hours ago

    Why is it called a C Compiler if it's a subset of C?

    • userbinator 2 hours ago

      Why are you and others being downvoted for telling the truth?

  • wbsun 5 hours ago

    Nice, now you can dd it to your boot sector and ... Wait, it is 2026, there are 1000 ways of booting and memory mapping on so-called unified ARM architecture @,@

  • EGreg 10 hours ago

    Reminds me of Allegro SizeHack where we made games in 10KB - but we were using C and Allegro library!

    https://www.oocities.org/trentgamblin/sizehack/entries.html#...

  • NooneAtAll3 13 hours ago

    > I wrote a fairly straight-forward and minimalist lexer and it took >150 lines of C code

    was it supposed to be "<150"?

    • owalt 13 hours ago

      They're saying the naive implementation was more than 150 lines of C code (300-450 bytes), i.e. too big.

  • kayo_20211030 12 hours ago

    Nice. Very K&R-ish. Not a bad thing.

  • gonzus 12 hours ago

    Lacking support for structs, I think this is too minimalistic to be called "a C compiler".

    • userbinator 2 hours ago

      Yes, it is accurately described as C-subset, but it seems there are others here who don't want to speak of the truth.

    • pilord314 10 hours ago

      you bootstrap it into a library you can include optionally, duh