> Do $@ and $< look arcane at first glance? Yes. Is what they do simple, though?
And yet, I still don't actually know what `$@` and `$<` actually mean outside of your example. I can only assume that somehow that substitutes in `foo.o` and `foo.c` because you literally put the non-magic versions right below it (and above it a few paragraphs above as well), but I'm honestly not even positive that I'd run `make foo.o` rather than `make foo` with your example, and I feel fairly confident in guessing that this would _not_ compile into a binary called `fooexec` like the target you defined above. I have no idea whatsoever what I'd need to put in the target to be able to substitute in something like that, because the fact that the first `%` somehow becomes `$@` and the second becomes `$<` doesn't follow any obvious pattern I can use to figure out what the third symbol would be, or even if there is one.
The problem is that "simple" is not the same as "intuitive" or "discoverable". Yes, I could probably put in the effort to go and find out these things, but why would you expect someone to be motivated to do that by someone telling you that it's "trivial" and it "pains" them as they explain things in a way that doesn't actually clarify things in a way that helps me? If you actually want to try to make a strong case that this is something worth learning for people, repeatedly referring to it in ways like "very simple" and "so trivial" and "immediately obvious" is counterproductive.
As an aside, I think it's also a bit of a leap to assume that they're "rebuilding everything all the time". It looks to me like the build.py script is only needed a single time to bootstrap from the original compiler to this one, and rebuilding every single C file in the new compiler rather than using the artifacts built with the original one is kind of the whole point. After that's done once, I don't see why the new compiler couldn't be used via the Makefile. If you're complaining about Python build scripts in general rather than this specific one, I don't know why you assume they can't be written to check the timestamps on the files to see if a rebuild is necessary before doing so. That's exactly what `make` does, and hey, it's a very simple concept!
$@ is the name of the target -- the @ looks a bit like an old-fashioned target for practice and competition shooting.
$< is the name of the input (the file the rule depends on). If the rule depends on more than one input, it is the first one. So you write your dependencies as xxx.c xxx.h yyy.h zzz.h with the C file first. Mnemotechnically, it is "the thing that goes into the build".
$^ is the names of all the files the rule depends on. Think of it as a large hat above all the input file names.
Many things in make are clumsy and GNU make is a behemoth that does so much more than the original make that it is really hard to learn -- but these shortcuts have really good names!
One might ask how you learned C (or python, or anything else) in the first place, if you can't be bothered to learn what the very simple $< and $@ mean, which would take you literally seconds to look up if my example wasn't actually enough already.
The rest is just... of course you can reimplement make, or a subset of it, in python, python is turing complete after all. But why? make isn't that arcane. I'm sorry it uses the weird-looking $@ and $< for "target" and "dependency" respectively. Those two are extremely common and you have to type them in a lot of rules, so the authors chose a shorthand. One way or another your python make subset will have to bring in the same concepts they represent, too.
This is trivial in the sense that it's a part of any build system worth its salt. And I can almost guarantee you that OP for example, who wrote their own C compiler, understood it immediately.
> One might ask how you learned C (or python, or anything else) in the first place, if you can't be bothered to learn what the very simple $< and $@ mean, which would take you literally seconds to look up if my example wasn't actually enough already.
The problem is, you typically don't need to touch Makefiles that often once they work (or fulfil a reasonable interpretation of "work"), so unless you're a distro packager or the project(s) you work on are sufficiently complex and large to warrant an entire FTE just for build tooling, chances are high you need to touch that stuff only once every few years, by which time almost everyone has to dig into the documentation yet again... made worse by the fact that Google's search quality has gone down the drain.
Since you’re asking, $@ is the “target” of the build, that is, what you are aiming to build (left of the colon in the rule). This is why it looks like a bullseye.
And $< is of course the thing the target depends on. It’s the input, like standard Unix notation for stdin.
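For reference, a one-rule sketch (file names invented for illustration) showing all three of the automatic variables described above; as always, the recipe line must be indented with a real tab:

    # In this rule, $@ expands to "app", $^ to "main.o util.o",
    # and $< to just the first prerequisite, "main.o".
    app: main.o util.o
            clang -o $@ $^

So the link command that actually runs is: clang -o app main.o util.o.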
> Make is so unlike build tools from other languages
Interesting, I thought that "make" was like multiplication tables, something you learn very early on as it's a super simple tool. Also I assumed it was part of any unix 101 class, together with bash and command line stuff. But probably it shows my age. Most people don't have any reason to learn these tools nowadays.
> The thing about Makefiles is that simple ones at least are really easy to write, and read. Much simpler and quicker than a cumbersome python script, that will most likely do less with much more boilerplate code (e.g. dependencies), and be much harder to read.
Whether this is true or not depends a lot on from which programming culture/background you come.
Left depends on the right. I don't think the ability to learn that concept is related to programming culture/background.
What follows below that is what happens.
Left : right
cp right Left
echo and so on. Left is now updated.
The only tricky part of the above, and something I guess nobody has found any good reason for, is that the whitespace in front of the statements ('cp' in this case) has to be an actual tab, just spaces won't do.
When it comes to the more "advanced" concepts (wildcards etc) there are slight differences between Make versions. And to bother with that is to get into the mindset which created the (argh) Automake and Autoconf systems (to begin with), so back in the day our company simply decided that we'll use GNU Make on every single system (we supported lots of various UNIX systems) and not bother with any of that (no SYSV Make, no BSD Make or anything), because GNU Make was and is available on anything and everything. Made life very simple back then.
The story I have heard about why it must be a tab is that by the time the author of the original Make realised he didn't need to require a tab, he already had six users and didn't want to annoy them by breaking compatibility with the makefiles they'd written :-)
The classic XKCD graph that plots life satisfaction against days-since-editing-xorg.conf could equally apply to days-since-thinking-about-autotools.
In fact one of the few times I've thought about autotools in the last decade was when a failing python-based build script inflicted similar frustration!
If I were forced to find one nice thing to say about autotools it would probably be that at least it doesn't assume an internet connection is always available, reliable and without cost.
I honestly don't think so, and I think this here is a prime example proving that.
For example, a Makefile that does the same job as the build.py script in this project would be significantly smaller, simpler, and easier to read in several metrics that I'd reasonably call "objective" to a certain degree.
You need to know very little about Makefiles to make immediate sense of what that Makefile is doing, whereas you need to know much more about python to still not immediately see what the build.py script is doing. In fact, you will probably just "guess" that the python script is supposed to do a similar job only from its name before that.
And then the python script still does not do incremental builds at all!
Again, if it gets more complex that can change, but this is far away from that. It takes 10 or so minutes to learn enough about Makefiles to be productive with them, from scratch.
On the other hand, I'd argue that the point where you've stretched it so far that it's more trouble than it's worth is basically impossible to see until you've gone beyond it.
If make is too hard, ninja is simpler. Less powerful, but in ways that rarely matter. I've written both by hand; both are a little weird but not hard. Normally I use cmake these days (an awful language, but it works for hard problems).
i think it just supports smth like 70% of C.
I didn't bother supporting floating-point operators, or arrays (I mean, malloc and pointers are fine) or etc. I just wanted smth small and self-hosting!
Out of curiosity: would ./30cc_gcc (30cc compiled by gcc-hosted 30cc) and ./30cc (30cc compiled by self-hosted 30cc) be identical binary files, if 30cc was operating as expected?
No, gcc optimizes code output and can make code shorter and faster. A compiler can make many, many decisions which will show up as differences in the binary output.
What a fun project, thanks for sharing. I've dreamed of projects like this. What did you expect to learn from this project? Did you learn anything unexpected?
The first step merely shows that you wrote valid C code that gcc can compile. It doesn't prove that the program actually does what's promised. For example, if you forgot to implement 'for' loops, this step would still produce a compiler, which could still work for a subset of C.
Here comes the second step: it proves that it implements enough features to at least recompile itself. Now, this is still not a guarantee that all C constructs and features work, but we can reason that even if it only implements a subset of C, it's a large enough subset to build such a substantial app as a compiler. How likely is it that a compiler doesn't use a 'for' loop, not even once?
But what is the reason for the third step? It doesn't add anything, it doesn't trigger any code path that was excluded in the first one. After all, if you did miss 'for' loops, and the 2nd step hasn't detected it, it must be because there are no 'for' loops in the source, so you will never detect it this way, no matter how many steps of recompilation you run.
The 3rd step can flush out compiler bugs, as follows.
The 1st step uses gcc, so it's not going to find any bugs in this compiler.
The 2nd step uses the compiler as compiled by gcc. That will find some bugs, such as crash bugs or missing features (like your for loop example). However, it does not find bugs that lead to generating incorrect code.
The 3rd step uses the compiler compiled by itself. If there is a bug in code generation that led to the stage-2 compiler being compiled incorrectly, that is likely to lead to some error during stage-3 compilation, such as a crash. And if it doesn't crash outright, it is very likely to cause the resulting executable to differ between steps 2 and 3. The incorrectly compiled compiler is surely even worse at compiling correctly than the compiler that was presumably compiled correctly (using gcc)!
So, this 3-step process is a pretty good way of finding bugs, especially if you compare the stage-2 result against the stage-3 result.
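To make that stage-2 vs. stage-3 comparison concrete, here is a hypothetical Makefile sketch (not the project's actual build.py or Makefile; it assumes the compiler's sources are the *.c files in the current directory and that the compiler driver takes gcc-style -o arguments):

    SRCS := $(wildcard *.c)

    stage1: $(SRCS)          # built by the trusted host compiler
            gcc -o $@ $(SRCS)

    stage2: stage1 $(SRCS)   # the gcc-built compiler compiles its own source
            ./stage1 -o $@ $(SRCS)

    stage3: stage2 $(SRCS)   # the self-built compiler does it once more
            ./stage2 -o $@ $(SRCS)

    compare: stage2 stage3
            cmp stage2 stage3

If cmp reports a difference between the stage-2 and stage-3 binaries, code generation changed between the gcc-built and the self-built compiler, which is exactly the kind of bug described above.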
if you want more of a challenge try a compiler compiler that can compile itself... :)
i got pretty far with this myself, although it doesn't work for weirdo languages like Python that need an exceptional lexer. i keep meaning to revisit this problem, since the tools in this space are pretty much non-existent or trash quality.
By the way, I dislike the term “compiler compiler”, because that’s not really what it does. I like “parser generator” for yacc/bison, and “lexer generator” for flex.
Unlike any real machine (and most virtual ones), WASM is impossible to target with a simple single-pass C compiler. The reducible control flow requirement means you have to build a full control-flow graph and do graph algorithms to it before you can start emitting code. So IMO it makes for a bad starting point.
If you want to have "fun", generate code in a WASM-representation of Hoare's WHILE language, i.e. a program that only consists of a single outer while loop (in WASM a "loop" block), and conditionally decide for every single instruction in the loop whether it should be executed during a given pass.
I think that could be done with a single-pass C compiler. In a very trivial (and terrible) case, you could keep a running "line_counter" variable that every statement in the loop is predicated with, which either gets incremented at the end of the loop, or set to an arbitrary value for representing branches.
It would of course be a horrible (but very amusing) mess, and rather reinforces your actual point.
Very cool, congratulations!
I took a 2-second peek at the code, and just wanted to offer a small piece of advice that I think makes it better. Or two, really. Counting is hard.
Instead of (this is from parser.c):
apply the two common principles of DRY [1] and "don't cast the return value of malloc()" [2] and you get:
I think the latter is way better.
[1]: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself
[2]: https://stackoverflow.com/a/605858
thanks man!
yes that makes sense. unfortunately 30cc is not able to compile that syntax, and will probably give type-checker errors when passing a void* to pointer of another type. but will implement it sometime soon!
Aha, so it doesn't know about void* being compatible with all (object) pointers, but so far just requires the pointers to match exactly? I see.
Not sure why I was down-voted, I think that is ... not nice. :(
I downvoted just because I thought it wasn't great advice. It makes the code incompatible with C++ compilers, and being able to compile C with C++ compiler is often useful.
Also... it felt kind of... unnecessary? to tell someone who is making a compiler how to write basic code in the language they're writing a compiler for. It almost felt like telling an overweight medical student/doctor that he should avoid eating fatty foods.
I don't think you can be downvoted, there's no downvote button
Users with enough karma can downvote comments.
If you're implementing C23 compatibility, you could do
and you get both DRY and avoid (void *) casting.
I disagree with that advice, FWIW -- (void *) is the cause of a lot of bugs in C programs, and much stricter type checking was often the norm for compilers like CodeWarrior on platforms like classic Mac OS, where (a) if you got something wrong you'd corrupt app memory at best and the filesystem at worst and (b) many developers were used to Pascal, which was stricter. (Moving to GCC on Mac OS X and unexpectedly getting much more lenient type checking was a big surprise.)
It doesn't make it any safer because the cast is unchecked. It could be argued it's discouraged[1] for a reason that doesn't really apply to modern C versions where implicit-int isn't allowed anymore.
But while the cast is a matter of style (or C++ compatibility), sizeof(*ret) is definitely superior to sizeof(thing_t). The reason is that people sometimes change the type of ret but forget to change the size of the allocation, especially if ret is assigned to on a different line from the one where it's declared.
[1]: https://c-faq.com/malloc/mallocnocast.html
But adding the cast doesn't make type checking stricter, quite the opposite - it destroys any hope of the compiler checking the assignment.
C++ shot itself in the foot by removing automatic conversion from void* to T*, because it forces programmers to add casts that weaken type safety.
This kind of condescending advice doesn't help anybody. Redundant casts, like redundant braces, white space, and a plethora of similar choices are stylistic more than anything else.
If you really feel the need to give unsolicited coding style feedback maybe at least spend more than a few seconds looking at the code before you do so. Otherwise it's just rude.
> If you really feel the need to give unsolicited coding style feedback maybe at least spend more than a few seconds looking at the code before you do so. Otherwise it's just rude.
Mh, at the same time, people who know their craft can give solid feedback within a very short timeframe. It's very impressive how good sound engineers can just pick up your mixed recording, listen to some 5-second batches, 15-20 seconds in total, and already give very solid advice for improvement. Or when I was playing foosball more actively with league players: I could hold my own, but after 2-3 exchanges they could usually point out some technique issues.
I found GPs feedback (and the resulting discussion, heh) useful and the resulting code cleaner (in case I ever have to touch C-Code again, brr), and then learned that the compiler can't do that yet.
Giving feedback - solicited or not - is part of an engineer's job. Whether it's cosmetic feedback or not doesn't matter, as cosmetics are also part of the job, although not everybody thinks so.
Really? Wow. I had no idea, and I really tried to sound non-condescending.
I simply don't agree that redundant casts are something to be ignored, since they can be harmful and hide actual errors (the cast is a bit like `sudo make me a sandwich`, it makes the compiler do what you say which is not always a good idea).
I guess I believe that publicly posting code, and even writing a compiler for the language in question, kind of automatically marks you as interested in the language and open for ideas of usage and so on.
My apologies to the OP if you were offended. Thanks for the feedback, @gizmo.
FWIW, I liked your suggestion, the author’s explanation, and your follow-on written understanding.
If someone’s technical article is submitted and they are themselves participating in the discussion without feeling insulted, then accusations of unsolicited advice are unfounded.
I enjoyed reading the feedback, and their response, and felt no negativity whatsoever. Keep honestly interacting it's good for everyone!
I didn't find your tone condescending at all, it was a perfectly fine comment (esp with sources)
Redundant casts are really, really good at hiding errors...
(And few people know that sizeof is a unary operator so the parentheses are rarely necessary. Dropping them reduces the visual noise => improves readability.)
> Then run ./build.py. This will use the bootstrapped 30cc-compiler to compile 30cc itself. It then again uses the 30cc-compiled compiler to compile 30cc once again. The final compiler is then stored as ./30cc.
Why isn't that also done by the Makefile? The only catch I could see is that you'd need to have it build to different output names, but that seems fine for what it is?
---
Also, I'm curious - did you find yourself having to constrain your use of C in order to make sure that the compiler could compile itself? Or does it implement everything you would use naturally anyways?
(And of course, congrats/bravo; super impressive:])
> Also, I'm curious - did you find yourself having to constrain your use of C in order to make sure that the compiler could compile itself? Or does it implement everything you would use naturally anyways?
That would be the "bootstrapping" process. Nearly a half-century ago I took a compiler lab class where we were given a working, but slightly lame, compiler, and were tasked with adding new, less lame, language features by bootstrapping. That is: 1) implement new feature without using the new feature, 2) using the compiler that results from step 1, re-implement the feature using the new feature, and compile again. 3) Repeat with more features until end of semester for best grade.
Oh, and to the OP, well done!
It seems that a programmer making their own compiler is like a Jedi making their own lightsaber
I guess the full bootstrapping process would start with writing a basic compiler in assembly?
Back in university, we took the steps of:
- Breadboarding some CPU components
- Breadboarding a very, very minimal CPU. More like a simple Adding + Multiplication machine with 4 registers or so, with a few premade components.
- Eventually moving up to microcoding a simulated CPU
- Eventually writing binary code to control the microcoded CPU
- At that point I kinda cheated and wrote my own assembler because I got sick of checking so many bits.
- This project then stopped at an assembly level.
- But then we implemented our own ML-variant interpreter and later on ML-variant compiler in OCaml
- And later on we had that ML-Compiler compile itself and extended it from there.
Pick your poison where to start. Just be aware that breadboarding stuff becomes very... messy very quickly[1]. And I do include Hardware in this, because you needed simpler CPUs to design more complex CPUs.
In practice, you want to pick the best combination of a familiar language that is as high level as it can be. But back in the day, that would've been assembly, binary. In the case of C, it went from BCPL, in which a compiler for B was written, in which a compiler for New B (NB) was written, which then turned into C.
1: https://www.youtube.com/watch?v=l7rce6IQDWs&t=78s
Assembly? Ha! In my day, we would have considered ourselves lucky to have had assembly, let alone a C compiler. No, we had to bootstrap our computers using switches and paper tape. You probably want a terminal too!
In all seriousness, the bootstrap process is fascinating to me. At some point, it did all start with switches and manually loading commands directly into memory. And over time, we’ve slowly kept this thing going and growing. Also… I’m old enough to have seen an Altair with toggle switches, but I’m not old enough to have had to toggle in a boot loader on one. :)
It depends on the definition of the full bootstrapping process. At some point, someone has to bootstrap in raw assembly, but we don't need to do that in 2024. If I write a bootstrap compiler in C on a machine that can run C, I can write an alternate backend for any assembly language (that supports the necessary compilation features) and thus produce a compiler implemented in any assembly language without directly writing any assembly.
Taking this process to the extreme, no one could ever bootstrap anything because eventually you're bootstrapping mineral extraction.
The full bootstrapping process starts with entering instructions in binary (or hex or octal) to build an assembler in the first place :)
The full bootstrapping process starts with a soldering iron and a reel of copper wire.
Usually it’s easier to just cross-compile for the new target. No need to start from ground zero.
Sounds a lot like the kernel class at UIUC! Fun stuff
> Why isn't that also done by the Makefile?
My guess is that some people would rather write Python code than dig into Make’s documentation. That said, ChatGPT is excellent at creating valid make files.
Or just take a few moments to learn the basics about Makefiles.
The thing about Makefiles is that simple ones at least are really easy to write, and read. Much simpler and quicker than a cumbersome python script, that will most likely do less with much more boilerplate code (e.g. dependencies), and be much harder to read.
Of course, you may hit a point where you stretch your Makefile so far beyond its common capabilities that this no longer holds true, but in my experience that point is pretty far away.
As someone that learned C# and python before C and C++, to this day I couldn't explain to you how make rules work. Make is so unlike build tools from other languages that it doesn't surprise me someone would rather use python to bootstrap their compiler.
It is so trivial, it takes you 10 minutes to learn.
Less even, here is an attempt:
    foo.o: foo.c
            clang -c -o foo.o foo.c
This builds a file called foo.o, if foo.c is newer than foo.o. Imagine that there is an identical bar.o as well, that builds it from bar.c.
    fooexec: foo.o bar.o
            clang -o fooexec foo.o bar.o
This links together foo.o and bar.o into a file called fooexec, if at least one of foo.o or bar.o is newer (which in turn means that any of its respective dependencies changed; e.g. change bar.c, and bar.o and then fooexec will rebuild).
Now type this into the shell:
    make fooexec
Congratulations! You have a fully built fooexec. Now change bar.c, type it again, and like magic, you have a rebuilt bar.o and fooexec. foo.o will stay put, it’s up to date!
But of course all the .c -> .o rules look the same, so instead of writing a new one for every file, you'd use simple wildcards, and now the whole, entire Makefile is just:
Do $@ and $< look arcane at first glance? Yes. Is what they do simple, though? Also yes. (If your complaint is that you don't know what those "clang" lines do, then you don't know how to use the C compiler, and will have the same problem when writing a python script.)
It pains me if that is not immediately obvious. And it pains me even more that someone would want to write a boilerplate heavy, complex python program, that rebuilds everything all the time, instead of just learning this very simple concept.
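For reference, a wildcard Makefile along the lines being described would look roughly like this (same foo/bar example, clang as before):

    fooexec: foo.o bar.o
            clang -o $@ $^

    %.o: %.c
            clang -c -o $@ $<

make fooexec behaves exactly as before; the pattern rule %.o: %.c now covers foo.o, bar.o, and anything else added to the link line.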
This is an amazingly good 2-minute introduction to makefiles. It would have been better without the last two sentences (and most of your comments on the topic).
---
Something skipped over by your introduction (and most guides to makefiles) which very quickly either makes makefiles less short or much more complicated is header files.
C files often depend on one or more header files that aren't system headers (this C compiler certainly does).
They could be added to each individual rule by hand:
or they could be put into a variable:
which causes too much code to be recompiled every time a header file is changed.
Or we could do something really fancy and get only the right header files added as dependencies for each compilation unit. That requires using a feature of the gcc/clang compilers (their preprocessors, really) that causes them to output a small makefile snippet that lists the exact header dependencies for a compilation unit (the '-MD' option). These snippets can then be included in the makefile. To make this look relatively clean requires some fancy macro use that is waaaay beyond what people can learn in ten minutes.
This is an extremely common use case and it is a bit of an embarrassment that GNU make doesn't handle it better.
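Here is a sketch of the two variants described above, assuming a single-directory project and clang as in the tutorial. The blunt version makes every object depend on every header:

    HEADERS := $(wildcard *.h)

    %.o: %.c $(HEADERS)
            clang -c -o $@ $<

The generated version asks the compiler for per-file dependency fragments (-MD writes foo.d next to foo.o, -MP adds dummy targets so deleted headers don't break the build) and then includes whatever fragments exist:

    %.o: %.c
            clang -MD -MP -c -o $@ $<

    -include $(wildcard *.d)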
I did mention that in other comments, including the C preprocessor helping. make is completely language agnostic, so it cannot “guess” header dependencies. In make, you get the dependencies you specify (whether manual or generated), that’s what I wanted to show.
This is an excellent explanation of make! The only thing I will add is the slightly more abstract thought that make is meant to build files out of other files based on timestamps of the source files. And the target files can themselves become source files, which means that you can build dependency graphs. Lastly, it has “caching” built in: if bar.o is newer than bar.c, don’t rebuild it. If it doesn’t exist or is older, do rebuild it. So when you change one file, you don’t recompile the entire damn tree.
It also has “phony” targets that aren’t files. The canonical example of this is “clean” which basically just does something like rm -rf *.o fooexec
This is what the JavaScript tooling people have been chasing for the past 15 years and can’t seem to grasp. Someone actually did build a make-based JS build system I believe but it never got popular because it wasn’t written in JavaScript.
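In Makefile form, a minimal version of that phony clean target:

    # .PHONY tells make to run the recipe even if a file named "clean" happens to exist.
    .PHONY: clean
    clean:
            rm -f *.o fooexec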
> Lastly, it has “caching” built in:
A nitpick, but I don't think that should really be called caching, although it kind of is, in a looser sense. It's just comparing the file timestamps of the .o and .c files, and building the .o from the .c, if the .c is newer, or if the .o doesn't exist.
Saying this is an 'incremental makefile' isn't really correct, as changes to header files aren't going to lead to rebuilding.
So, either you need to manually keep references to which .h files you include in your Makefiles up to date, or start worrying about -M / -MM / -MG / -MP, and of course you'd like those to be re-run when you change your files, and suddenly your Makefile is an awful lot less simple.
This is the main reason I stopped teaching Makefiles in intro to C courses -- students often lost entire afternoons to forgetting this stuff, and failing to realise their mistake. It really shouldn't be any person's job to keep this up to date, when the computer knows it.
It is correct to call it an incremental Makefile, it merely doesn’t name the dependencies for you. A Makefile has exactly the dependencies you specify. It’s that simple.
That is not necessarily the most useful. Especially as your project gets larger, you probably will not want to specify every header file as a manual dependency, or alternatively rebuild everything every time you change a header file. (Though for me personally, either approach often works for surprisingly long.)
Then you can do things like letting the C preprocessor create the dependencies that get included in your Makefile (yes, common C compilers can create Makefile fragments). Or do something more complicated. Or switch to another build system, if your project really outgrew, or really mismatches, make.
But at its core, Makefiles follow a simple concept, are entirely language agnostic, and even with just the simpler concepts they remain useful for a long time.
Having 'incrementality', when it isn't actually going to incrementally rebuild things when files change, is (in my experience) worse than having no incrementality at all. Having to remember when I have to manually disable the incrementality is an annoying overhead, and easy to forget.
If you can remember exactly when you need to manually skip the incremental build, that's great for you, but I find Make has enough of these kinds of footguns I don't recommend it to people any more.
Meh, it can practically never be perfect without running into diminishing returns that at some point invalidate the advantage of any incremental builds anyway.
As I said, you can let the C compiler generate your header dependencies for you. But that's not enough, because what about headers and libraries that are external to your project? Keep in mind that they will have dependencies, too, and at some point your transitive closure will be really big.
To a lesser degree, what about a compiler, linker, or any other random part of your toolchain that changes (at least those are supposed to mostly keep a stable interface, which isn't the case at all for external libraries, but in some cases the toolchain can cause the footguns you mentioned)?
A lot of the time, you don't need to rebuild everything just because a system header or the toolchain changed, but your build system would have a hard time knowing that. Because at some point, the halting problem even gets in your way (i.e. you can't really reliably detect if a given change in a header file would need rebuilding of a source file, which you would need for "fully incremental" builds). So it's always a trade-off.
Personally, I've fared very well with generated header dependencies (sometimes manually declared ones or none at all for small projects), and many other projects have, too.
YMMV of course, but I don't observe this to be bad. Most people who program C and use make are, I think, aware of what header mismatches can cause, and how to avoid that.
I think one fundamental difference between the two of us (I'm guessing here, let me know if I'm wrong), is I'm willing to cope with longer compile times, as long as I always get a correctly built executable (so there are no 'halting problems', if in doubt, rebuild it!), while you would prefer faster build times, even if that then requires sometimes having to do a little manual work when the build system doesn't realise it needs to rebuild? Of course, no system will be 100% perfect, but you can choose where your trade-off point is.
Two reasonable viewpoints, but it probably affects how we do our build setups!
I think that's fair. On top of that, being a low level system programmer, I think I'm pretty good at realizing/intuiting when something critical changed that needs me to nuke the build dir, i.e. the manual work you mention to force a full recompile (and also good at noticing mismatched headers even through very weird symptoms).
I do have to switch between multiple SDK versions very often for example, and I always either nuke the build dir after doing that, or have separate build dirs per SDK in the first place.
Yes, this is quite important for C projects as it leads to subtle bugs. I also long for a way to let make detect a change of $CC (new build of the compiler itself) in the context of debugging a compiler against third party makefile projects.
Have a step that takes the SHA256 or similar of $CC, make that a dependency. You’d need slightly more advanced make features if you want to make it “pretty” (and not just, say, touch a file through some shell code comparison), but this is a slightly more advanced request.
Or potentially just specify the path to $CC itself as dependency? Presumably, if the C compiler changes, the timestamp of its executable does too. (Bad if it goes backwards, though.)
> Have a step that takes the SHA256 or similar of $CC, make that a dependency.
Careful, one more step and we'll start recreating nix in Makefiles;)
(Wait, could we? Should we?)
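A rough sketch of the fingerprint idea from this sub-thread, assuming GNU make and a POSIX shell with sha256sum; cc.fingerprint is an arbitrary file name:

    CC := gcc

    # Rewrite the fingerprint only when the compiler binary's hash changes; since
    # every object depends on it, swapping in a new $(CC) forces a full rebuild.
    cc.fingerprint: force
            sha256sum $$(command -v $(CC)) | cmp -s - $@ || sha256sum $$(command -v $(CC)) > $@

    .PHONY: force
    force:

    %.o: %.c cc.fingerprint
            $(CC) -c -o $@ $<

The simpler variant from the sibling comment is to list the compiler's own path as a prerequisite and rely on its timestamp instead of a hash.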
I often use a Python program called 'fabricate', which basically checks every file a command opens, and re-runs the command if any touched file changes.
I love it, and I hoped this would become the future of how build-systems are built -- you could automatically parallelise, rebuild, make clean, everything, just by tracking which files are read, and written.
It does add a bit of overhead, but I'd be willing to pay that for simplicity. Unfortunately, it doesn't seem to have caught on.
That is neat, but also super brittle, and full of abstraction leakiness, potentially leading to both false positives and negatives. Sounds like it would break down with ccache or distcc immediately, for example.
I never had any false negatives, they are basically impossible in this framework, except it does assume that if a program and all its inputs are unchanged, the output is unchanged, so this is no good for programs which randomly produce different outputs.
ccache can cause things to get built twice, because on the first run it sees the command read a file and then later write to it, so it knows something different might happen if you run again -- but then again that is just ccache doing its thing, so it's quick. Yes, distcc wouldn't work, but then again I find distcc super-brittle (so hard to make sure everything has the same compiler versions, etc); I may have used it 20 years ago, but I don't think I ever would now!
Fair!
The 'tup' build system also works this way.
You could have taught them `make clean` whenever there's any issues. I work with software engineers and I still have to remind them to try rm -rf ./build/* now and again, which always seems to solve the problem.
I've written a good amount of makefiles (and even one rather complex one.. before replacing it with a thing written in an actually sane language, bootstrapped by a stripped down version of the makefile), and I still mix up $@ $< $> or whatever else my brain comes up with. And this is coming from someone deep into array languages, which have entire sets of unicode glyphs.
For how often one typically writes/looks at build scripts, remembering the specifics about makefiles is quite likely to not be worth the effort.
Makefiles start becoming incredibly awful if you want to do anything outside of pure computations in the separate targets. Making the target directory requires either putting the mkdir call in each separate built thing (awfully redundant), introducing a whole another rule (and still adding it as a dependency on every rule that'll be outputting to said directory), doing some recursive make, or running it outside of rules (but putting it under a proper if statement is likely pain). That's a bunch of awful solutions to something that really really really should just be a single calls to mkdir in any sane system.
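For what it's worth, the least-bad of those variants is usually the extra rule combined with an order-only prerequisite (the part after the |, a GNU make feature), so the directory's ever-changing mtime doesn't retrigger builds; a sketch with made-up names:

    BUILDDIR := build

    $(BUILDDIR):
    	mkdir -p $@

    # Order-only: the directory must exist, but its timestamp is ignored.
    $(BUILDDIR)/%.o: %.c | $(BUILDDIR)
    	$(CC) -c -o $@ $<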
Another horrifically awful thing is if you want some reused data across rules (e.g. pkg-config results, current OS/arch). You can put them as globals, but that'll make them be calculated always, even if no build target needs them, unless you put them under an if, which is again non-trivial to get the condition correctly for. This becomes super bad if you do recursive make stuff.
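One partial workaround (sketched here with a made-up package name): define such variables with '=' instead of ':=', so the shell call only happens when a recipe that uses them actually runs, at the cost of possibly running more than once:

    # ':=' would run pkg-config on every invocation, even for 'make clean';
    # '=' defers the $(shell ...) until the variable is actually expanded.
    GTK_CFLAGS = $(shell pkg-config --cflags gtk4)
    GTK_LIBS   = $(shell pkg-config --libs gtk4)

    app: main.o
    	$(CC) -o $@ $^ $(GTK_LIBS)

    main.o: main.c
    	$(CC) $(GTK_CFLAGS) -c -o $@ $<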
More generally, makefiles start out simple, but doing anything that's not strictly simple is gonna be a pain, at the very least requiring looking at docs for a thing you'll probably never use again, and debuggability is also very low (amusingly, '-p' and '--debug' break on recursive make, seemingly losing some variables; and '-n' requires copying out & running the recursive call to get to the actually important part; and afaict none of those help with figuring out why a thing didn't build).
And you're completely SOL if you wanted to do things like reorder compilations based on how long they previously took, or want a nice display of currently-compiling things.
> Do $@ and $< look arcane at first glance? Yes. Is what they do simple, though?
And yet, I still don't actually know what `$@` and `$<` actually mean outside of your example. I only can assume that somehow that substitutes in `foo.o` and `foo.c` because you literally put the non-magic versions right below it (and above it a few paragraphs above as well), but I'm honestly not even positive that I'd run `make foo.o` rather than `make foo` with your example, and I feel fairly confident in guessing that this would _not_ compile into a binary called `fooexec` like the target you defined above. I don't have an idea whatsoever what I'd need to put in the target to be able to substitute in something like that because the fact that the first `%` somehow becomes `$@` and the second becomes `$<` doesn't follow any obvious pattern I can use to figure out what the third symbol would be, or even if there is one.
The problem is that "simple" is not the same as "intuitive" or "discoverable". Yes, I could probably put in the effort to go and find out these things, but why would I be motivated to do that by someone telling me it's "trivial" and that it "pains" them, while explaining things in a way that doesn't actually clarify anything for me? If you actually want to make a strong case that this is something worth learning, repeatedly referring to it as "very simple" and "so trivial" and "immediately obvious" is counterproductive.
As an aside, I think it's also a bit of a leap to assume that they're "rebuilding everything all the time". It looks to me like the build.py script is only needed a single time to bootstrap from the original compiler to this one, and rebuilding every single C file in the new compiler rather than using the artifacts built with the original one is kind of the whole point. After that's done once, I don't see why the new compiler couldn't be used via the Makefile. If you're complaining about Python build scripts in general rather than this specific one, I don't know why you assume they can't be written to check the timestamps on the files to see if a rebuild is necessary before doing so. That's exactly what `make` does, and hey, it's a very simple concept!
$@ is the name of the target -- the @looks a bit like an old-fashioned target for practice and competition shooting.
$< is the name of the input (the file the rule depends on). If the rule depends on more than one input, it is the first one. So you write your dependencies as xxx.c xxx.h yyy.h zzz.h with the C file first. Mnemotechnically, it is "the thing that goes into the build".
$^ is the names of all the files the rule depends on. Think of it as a large hat above all the input file names.
Many things in make are clumsy and GNU make is a behemoth that does so much more than the original make that it is really hard to learn -- but these shortcuts have really good names!
($^ is a GNU make extension, btw; $@ and $< go back to the original make, though POSIX only specifies $< for inference rules.)
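A tiny pair of rules showing all three side by side (file names are just placeholders):

    prog: main.o util.o
    	$(CC) -o $@ $^      # $@ = prog, $^ = main.o util.o

    main.o: main.c util.h
    	$(CC) -c -o $@ $<   # $< = main.c, the first prerequisite only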
One might ask how you learned C (or python, or anything else) in the first place, if you can't be bothered to learn what the very simple $< and $@ mean, which would take you literally seconds to look up if my example wasn't actually enough already.
The rest is just... of course you can reimplement make, or a subset of it, in Python; Python is Turing complete, after all. But why? make isn't that arcane. I'm sorry it uses the weird-looking $@ and $< for "target" and "dependency" respectively. Those two are extremely common and you have to type them in a lot of rules, so the authors chose a shorthand. One way or another your Python make subset will have to bring in the same concepts they represent, too.
This is trivial in the sense that it's a part of any build system worth its salt. And I can almost guarantee you that OP for example, who wrote their own C compiler, understood it immediately.
> One might ask how you learned C (or python, or anything else) in the first place, if you can't be bothered to learn what the very simple $< and $@ mean, which would take you literally seconds to look up if my example wasn't actually enough already.
The problem is, you typically don't need to touch Makefiles that often once they work (or fulfil a reasonable interpretation of "work"), so unless you're a distro packager or the project(s) you work on are sufficiently complex and large to warrant an entire FTE just for build tooling, chances are high you need to touch that stuff only once every few years, by which time almost everyone has to dig into the documentation yet again... made worse by the fact that Google's search quality has gone down the drain.
Since you’re asking, $@ is the “target” of the build, that is, what you are aiming to build (left of the colon in the rule). This is why it looks like a bullseye.
And $< is of course the thing the target depends on. It’s the input, like standard Unix notation for stdin.
Hopefully this helps?
> Make is so unlike build tools from other languages
Interesting, I thought that "make" was like multiplication tables, something you learn very early on as it's a super simple tool. Also I assumed it was part of any Unix 101 class, together with bash and command-line stuff. But it probably shows my age. Most people don't have any reason to learn these tools nowadays.
That is likely the case for anyone formally educated in C. If I needed to learn make I would, but I use cmake because I develop on multiple platforms.
> The thing about Makefiles is that simple ones at least are really easy to write, and read. Much simpler and quicker than a cumbersome python script, that will most likely do less with much more boilerplate code (e.g. dependencies), and be much harder to read.
Whether this is true or not depends a lot on from which programming culture/background you come.
A Makefile dependency (with made-up file names) looks like this:
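    backup.txt: original.txt
    	cp original.txt backup.txt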
Left depends on the right. I don't think the ability to learn that concept is related to programming culture/background. What follows below that is what happens. The only tricky part of the above, and something I guess nobody has found any good reason for, is that the whitespace in front of the statements ('cp' in this case) has to be an actual tab; just spaces won't do.

When it comes to the more "advanced" concepts (wildcards etc.) there are slight differences between Make versions. And to bother with that is to get into the mindset which created the (argh) Automake and Autoconf systems (to begin with), so back in the day our company simply decided that we'd use GNU Make on every single system (we supported lots of various UNIX systems) and not bother with any of that (no SysV Make, no BSD Make or anything), because GNU Make was and is available on anything and everything. Made life very simple back then.
The story I have heard about why it must be a tab is that by the time the author of the original Make realised he didn't need to require a tab, he already had six users and didn't want to annoy them by breaking compatibility with the makefiles they'd written :-)
> the (argh) Automake and Autoconf systems
The classic XKCD graph that plots life satisfaction against days-since-editing-xorg.conf could equally apply to days-since-thinking-about-autotools.
In fact one of the few times I've thought about autotools in the last decade was when a failing python-based build script inflicted similar frustration!
If I were forced to find one nice thing to say about autotools it would probably be that at least it doesn't assume an internet connection is always available, reliable and without cost.
I honestly don't think so, and I think this here is a prime example proving that.
For example, a Makefile that does the same job as the build.py script in this project would be significantly smaller, simpler, and easier to read in several metrics that I'd reasonably call "objective" to a certain degree.
In fact, contrast the Makefile in that project: https://github.com/keyvank/30cc/blob/main/Makefile
With the build.py script: https://github.com/keyvank/30cc/blob/main/build.py
You need to know very little about Makefiles to make immediate sense of what that Makefile is doing, whereas you need to know much more about Python and still won't immediately see what the build.py script is doing. In fact, before that, you will probably just "guess" from its name that the Python script is supposed to do a similar job.
And then the python script still does not do incremental builds at all!
Again, if it gets more complex that can change, but this is far away from that. It takes 10 or so minutes to learn enough about Makefiles to be productive with them, from scratch.
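For reference, the generic shape such a Makefile takes (only a sketch with assumed file names, not the project's actual Makefile linked above; header dependencies left out for brevity):

    CC   ?= gcc
    SRCS := $(wildcard *.c)
    OBJS := $(SRCS:.c=.o)

    30cc: $(OBJS)
    	$(CC) -o $@ $^

    %.o: %.c
    	$(CC) -c -o $@ $<

    .PHONY: clean
    clean:
    	rm -f 30cc $(OBJS)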
It is absolutely objectively true. Whether you already know it or not depends on your background.
On the other hand, I'd argue that the point where you've stretched it so far that it's more trouble than it's worth is basically impossible to see until you've gone beyond it.
If make is too hard, ninja is simpler. Less powerful, but in ways that rarely matter. I've written both by hand; both are a little weird but not hard. Normally I use cmake these days (an awful language, but it works for hard problems).
i think it just supports smth like 70% of C. I didn't bother supporting floating-point operators, or arrays (I mean, malloc and pointers are fine), etc. I just wanted smth small and self-hosting!
Similar: "A Small-C Compiler", by James E Hendrix (not Jim E, not Jimi).
Modify source, compile self, repeat.
This repo claims to have the code: https://github.com/DosWorld/smallc
Congratulations. You compiled yourself.
Out of curiosity: would ./30cc_gcc (30cc compiled by gcc-hosted 30cc) and ./30cc (30cc compiled by self-hosted 30cc) be identical binary files, if 30cc was operating as expected?
No, gcc optimizes code output and can make code shorter and faster. A compiler can make many, many decisions which will show up as differences in the binary output.
What a fun project, thanks for sharing. I've dreamed of projects like this. What did you expect to learn from this project? Did you learn anything unexpected?
i learnt that building a c compiler isn't that hard, just takes time :)
Was the linked list approach inspired by rui314's compiler?
Congrats!!
/me claps and cheers
tnx!
Nice job!
But why 3 steps of compilation?
The first step merely shows that you wrote valid C code that gcc can compile. It doesn't prove that the program actually does what it promises. For example, if you forgot to implement 'for' loops, this step would still produce a compiler, which could still work for a subset of C.
Here comes the second step: it proves that it implements enough features to at least recompile itself. Now, this is still not a guarantee that all C constructs and features work, but we can reason that even if it only implements a subset of C, it's a large enough subset to build such a substantial app as a compiler. How likely is it that a compiler doesn't use a 'for' loop, not even once?
But what is the reason for the third step? It doesn't add anything; it doesn't trigger any code path that was excluded in the first one. After all, if you did miss 'for' loops and the 2nd step hasn't detected it, it must be because there are no 'for' loops in the source, so you will never detect it this way, no matter how many steps of recompilation you run.
The 3rd step can flush out compiler bugs, as follows.
The 1st step uses gcc, so it's not going to find any bugs in this compiler.
The 2nd step uses the compiler as compiled by gcc. That will find some bugs, such as crash bugs or missing features (like your for loop example). However, it does not find bugs that lead to generating incorrect code.
The 3rd step uses the compiler compiled by itself. If there is a bug in code generation that led to the stage-2 compiler being compiled incorrectly, that is likely to lead to some error during stage-3 compilation, such as a crash. And if it doesn't crash outright, it is very likely to cause the resulting executable to differ between steps 2 and 3. The incorrectly compiled compiler is surely even worse at compiling correctly than the compiler that was presumably compiled correctly (using gcc)!
So, this 3-step process is a pretty good way of finding bugs, especially if you compare the stage-2 result against the stage-3 result.
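The same thing as a sketch, assuming a gcc-like command line for 30cc (which may not match its real interface; the actual project drives this via build.py and its Makefile):

    SRC := $(wildcard *.c)

    stage1: $(SRC)                 # step 1: gcc builds the compiler
    	gcc -o $@ $(SRC)

    stage2: stage1 $(SRC)          # step 2: the gcc-built compiler builds itself
    	./stage1 -o $@ $(SRC)

    stage3: stage2 $(SRC)          # step 3: the self-built compiler builds itself again
    	./stage2 -o $@ $(SRC)

    check: stage2 stage3           # identical binaries suggest stage 2 was compiled correctly
    	cmp stage2 stage3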
perfect explanation, tnx!
Thanks! The third step proves that the assembly code generated by the second-step compiler is valid even when compiling a big C project.
Thank You For Making And Sharing!
nice work.
if you want more of a challenge try a compiler compiler that can compile itself... :)
i got pretty far with this myself, although it doesn't work for weirdo languages like Python that need an exceptional lexer. i keep meaning to revisit this problem, since the tools in this space are pretty much non-existent or trash quality.
https://github.com/semiessessi/cp2
> if you want more of a challenge try a compiler compiler that can compile itself... :)
Is that not what OP did?
No (if the parent really meant ‘compiler compiler’ which, I think, would be what yacc/bison is).
Ah, thanks. My brain actually did not see that second "compiler" when reading, now it makes sense.
How could you tell the difference between missing it and your brain filtering it out?
Isn’t that the same now?
By the way, I dislike the term "compiler compiler", because that's not really what it does. I like "parser generator" for yacc/bison, and "lexer generator" for flex.
yacc calls itself a compiler compiler, but it's a parser compiler.
I'd go further and say "parser generator", the generated parser will be compiled by a C compiler after all.
Congratulations! That's no small feat.
Congratulations!
tnx!
Awesome work!
awesome! keep up the great work!
Why not use WASM to bootstrap your compiler?
Unlike any real machine (and most virtual ones), WASM is impossible to target with a simple single-pass C compiler. The reducible control flow requirement means you have to build a full control-flow graph and do graph algorithms to it before you can start emitting code. So IMO it makes for a bad starting point.
If you want to have "fun", generate code in a WASM-representation of Hoare's WHILE language, i.e. a program that only consists of a single outer while loop (in WASM a "loop" block), and conditionally decide for every single instruction in the loop whether it should be executed during a given pass.
I think that could be done with a single-pass C compiler. In a very trivial (and terrible) case, you could keep a running "line_counter" variable that every statement in the loop is predicated with, which either gets incremented at the end of the loop, or set to an arbitrary value for representing branches.
It would of course be a horrible (but very amusing) mess, and rather reinforces your actual point.
Reminds me of what I understand, from the "Usagi Electric" videos, of how the Bendix rotating-drum-memory computer executes code :)
Elaborate, why use WASM?