Knuth's warning about premature optimization is really about not increasing complexity based on guesswork without profiling the actual bottleneck. That's about overall architecture design. What's happening here (in this blog post) is local, done for the sake of learning and, further, for intellectual fun. I think it's completely justified. Usually, the 'don't do premature optimization' quote gets misused as an excuse to avoid careful design, but setting that aside, learning within these kinds of constraints and eventually producing something is not premature optimization in my view.
Looking at this code, they saved one AND instruction and reduced a pipeline stall, but it seems like it would be harder for a future maintainer to understand, because not_received feels a bit less readable. I always think code that's easy for the computer to read and code that's easy for humans to understand are different things.
After writing my comment, I realized it came across as overly critical. But actually, I think this work is completely justified and beautiful. Honestly, it's at a level I couldn't achieve myself. I respect it.
Premature optimization is that occasional dessert we serve ourselves because it tastes so good.
That said, I agree with you that the maxim is often used to justify poor planning or absence of planning. Premature pessimization is bad too.
We don't have the bandwidth to test every micro-decision. That's what learning domain-specific principles is for. Some choices you reason through, some you test, and when confronted with a perf problem, you test those reasoning-based choices that the benchmarks point fingers at.
On top of that, I use it as a way to express jealousy when I see impressive optimization work. "Don't do premature optimization like this (I can't do it myself)."
Is this about premature optimisation or just good architecture?
The number of data formats I see that miss obvious things like alignment etc. It isn't something you can easily change later if the format becomes widely used.
To me this post represents the minimum you should be thinking about when designing any kind of data structure/format.
The only time where I would say it strays into premature is inverting the received flag, but only because you're optimising it for a particular processor and in a way that isn't particularly obvious. The inversion can easily be hidden behind a function call.
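To make the "hide it behind a function call" idea concrete, here is a minimal sketch. Only the `not_received` name comes from the thread; the `PingSlot` type, the `payload` field and the accessor names are made up for illustration, and the real layout in the post may differ.

```cpp
#include <cstdint>

// Hypothetical record: the flag is stored inverted, as discussed above.
struct PingSlot {
    std::uint32_t not_received : 1;   // 1 means "no reply seen yet"
    std::uint32_t payload      : 31;  // placeholder for the other fields
};

// Callers only ever see the positive form; the inversion stays in one place.
inline bool received(const PingSlot& slot) {
    return slot.not_received == 0;
}

inline void set_received(PingSlot& slot, bool value) {
    slot.not_received = value ? 0u : 1u;
}
```

With accessors like these, a maintainer reads `received(slot)` and never has to reason about the inverted bit, while the storage stays whatever the hot path prefers.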
>The number of data formats I see that miss obvious things like alignment etc.
After reading that sentence, I felt a little guilty. It's actually a basic principle of design, but in practice, I just don't pay much attention to it and only write code for readability.
It doesn't even have to impact readability.
See for example
https://github.com/MayaPosch/NymphRPC/blob/master/doc/nymphr...
There's an int8 in the middle of int32s knocking everything out of alignment. And it doesn't need to be because flags is int32 and only uses 4 bits.
Changing that doesn't have to impact readability. It makes the data easier to process and makes a struct representation more compact.
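A rough sketch of the effect, not the actual NymphRPC layout: the field names below are invented, and the sizes in the comments assume a typical ABI where `uint32_t` is 4-byte aligned. On the wire, a lone byte pushes every later 32-bit field to an odd offset; in a struct, the compiler pads instead.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical header with a single byte between 32-bit fields.
struct HeaderWithByte {
    std::uint32_t length;
    std::uint8_t  kind;       // the lone byte
    std::uint32_t flags;      // only the low 4 bits are used
    std::uint32_t method_id;
};                            // typically 16 bytes: 3 bytes of padding after kind

// Same information, with kind folded into the unused upper bits of flags.
struct HeaderAllWords {
    std::uint32_t length;
    std::uint32_t flags;      // bits 0..3 flags, bits 8..15 kind
    std::uint32_t method_id;
};                            // typically 12 bytes, no padding

inline std::uint32_t pack_flags(std::uint32_t flags4, std::uint8_t kind) {
    return (static_cast<std::uint32_t>(kind) << 8) | (flags4 & 0xFu);
}

inline std::uint8_t kind_of(std::uint32_t packed) {
    return static_cast<std::uint8_t>((packed >> 8) & 0xFFu);
}

inline std::uint32_t flags_of(std::uint32_t packed) {
    return packed & 0xFu;
}
```

The accessors keep call sites about as readable as separate fields would be, which is the point being made above.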
Cool
> Usually, the 'don't do premature optimization' quote gets misused as an excuse to avoid careful design ...
The "don't do premature optimization" mindset is the reason why we've got monstrous Electron apps doing jack shit.
I hate that quote, I hate that mindset. It's the reason everything is bloated and sucks.
Everything would still be bloated, it’s just that the people doing the bloating point to that quote as though it’s defending their position.
What kills me is that a lot of the stuff just isn’t that hard. You need a data structure that you’ll later check membership of? Use a set. Might a list / array be more front-of-mind? Yes. But if you don’t need a list, why use one? Is the list noticeably slower? Often not really. Is it objectively the correct answer? No, and the set costs you essentially nothing, so just use the correct one.
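For what it's worth, the two options look almost identical at the call site. This is only an illustrative sketch with made-up helper names (`in_vector`, `in_set`); the difference is that the vector lookup scans every element while the hash set lookup is constant time on average.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Membership test on a vector: linear scan over all elements.
inline bool in_vector(const std::vector<std::string>& items, const std::string& key) {
    return std::find(items.begin(), items.end(), key) != items.end();
}

// Membership test on a hash set: average constant-time lookup.
inline bool in_set(const std::unordered_set<std::string>& items, const std::string& key) {
    return items.count(key) != 0;
}
```

For a handful of elements the difference is unlikely to show up in a profile, which is the "not really slower" part; the set simply states the intent and scales if the collection grows.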
Oh of course it's fun, it's seductive.
It's like writing a thriller where you are the main operative, heroically saving the day with your skill, foresight and tenacity.
The problem is, it sets a rigid path far too early that you are unwilling to move away from, either because you had ambitions for those empty stubs, or because the obvious solution means admitting that your current $thing is not as successful as it should be.
The problem I have found recently is that it bleeds into the training set that LLMs use to make software. Our platform is pretty well defined and has excellent metrics and logging that come for free.
But the LLMs are creating OTel forwarders with custom NATS transport, even though we have all of that for free already (and in the agents.md).
If it wasn't fun you wouldn't need to have an aphorism against it.
If you don't align the array to a 4K boundary in memory, fitting within a page isn't as big of an optimization win.
If the array is dynamically allocated on its own, you will likely get a whole page from any decent allocator. If it is statically allocated, the compiler can at least theoretically optimize the alignment. If it's allocated as part of a larger struct, then yes, it's up to you.
Even if it is not aligned, an unaligned 8K array would be spread across three pages, which is still worse than two.
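The page arithmetic above can be checked with a small helper. This is just a sketch: `pages_spanned` is a made-up name and the 4096-byte page size is an assumption (real code would query the OS).

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPageSize = 4096;  // assumed page size

// Number of pages touched by a buffer of `size` bytes starting at `addr`.
inline std::size_t pages_spanned(std::uintptr_t addr, std::size_t size,
                                 std::size_t page = kPageSize) {
    if (size == 0) return 0;
    const std::uintptr_t first = addr / page;              // page holding the first byte
    const std::uintptr_t last  = (addr + size - 1) / page; // page holding the last byte
    return static_cast<std::size_t>(last - first + 1);
}
```

An aligned 4K array touches one page and an unaligned one touches two; an aligned 8K array touches two and an unaligned one touches three, so the smaller array comes out ahead either way.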
Cries for IPv4-only structure in 2025!
The system pings a few different servers, so it should be a target ID which indexes into another table of target addresses. It's likely to benefit from having a separate ping array per target, as well.
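Roughly what that could look like, as a sketch only: every type and field name here (`TargetId`, `PingRecord`, `TargetTable`) is invented for illustration and none of it is taken from the post. Each record would carry only a small ID, while the address (IPv4 or IPv6, kept as text here) lives once in the table together with that target's own ping array.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using TargetId = std::uint8_t;  // assumption: at most 256 targets

struct PingRecord {             // hypothetical per-ping data
    std::uint32_t sent_ms;
    std::uint32_t rtt_us;
};

struct Target {
    std::string address;            // textual, so not tied to IPv4
    std::vector<PingRecord> pings;  // separate ping array per target
};

class TargetTable {
public:
    // Registers an address and returns the ID that indexes it.
    // No check for exceeding 256 targets; this is only a sketch.
    TargetId add(std::string address) {
        targets_.push_back(Target{std::move(address), {}});
        return static_cast<TargetId>(targets_.size() - 1);
    }

    void record(TargetId id, PingRecord ping) {
        targets_.at(id).pings.push_back(ping);  // at() throws on a bad ID
    }

    const Target& get(TargetId id) const { return targets_.at(id); }

private:
    std::vector<Target> targets_;
};
```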
Cries for lines of code in 2026!
It's fun when you are aware of what you are doing.
Yes and I usually say premature optimization is a skill.
Balance maintenance with optimization. Is it easier to maintain? Win win!
Cool trick, but personally I don’t trust C bitfields. When I need something like that, I usually create a C++ class or C# structure with a single private uint64 field, and public methods to extract or manipulate the logical fields.
Because the class/structure only has a single uint64 field, compilers are likely to pass the value in a single general-purpose register. I believe that’s unlikely to happen for a structure with bit fields.
If you target AVX2 or newer you also have BMI1 and BMI2; intrinsics like bextr and bzhi are probably faster than whatever code compilers generate for bit fields.
Binary compatibility of bit fields is a moot point; using them at the API surface across compilers or languages is not ideal. A structure with a single uint64 field is very compatible.
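A minimal sketch of that pattern, under stated assumptions: the `PackedPing` name and the field layout (32-bit timestamp, 16-bit sequence, 1-bit received flag) are invented for illustration and are not the layout from the post. It uses plain shifts and masks rather than BMI intrinsics; whether a compiler lowers them to bextr or similar depends on the target flags.

```cpp
#include <cstdint>

class PackedPing {
public:
    PackedPing() = default;
    explicit PackedPing(std::uint64_t raw) : bits_(raw) {}

    std::uint32_t timestamp() const { return static_cast<std::uint32_t>(extract(kTimeShift, kTimeWidth)); }
    std::uint16_t sequence()  const { return static_cast<std::uint16_t>(extract(kSeqShift, kSeqWidth)); }
    bool          received()  const { return extract(kRecvShift, kRecvWidth) != 0; }

    void set_timestamp(std::uint32_t v) { insert(kTimeShift, kTimeWidth, v); }
    void set_sequence(std::uint16_t v)  { insert(kSeqShift, kSeqWidth, v); }
    void set_received(bool v)           { insert(kRecvShift, kRecvWidth, v ? 1u : 0u); }

    std::uint64_t raw() const { return bits_; }  // stable representation for APIs or files

private:
    // Hypothetical layout: bits 0..31 timestamp, 32..47 sequence, 48 received.
    static constexpr unsigned kTimeShift = 0,  kTimeWidth = 32;
    static constexpr unsigned kSeqShift  = 32, kSeqWidth  = 16;
    static constexpr unsigned kRecvShift = 48, kRecvWidth = 1;

    static std::uint64_t mask(unsigned width) {
        return (std::uint64_t{1} << width) - 1;  // every width used here is below 64
    }
    std::uint64_t extract(unsigned shift, unsigned width) const {
        return (bits_ >> shift) & mask(width);
    }
    void insert(unsigned shift, unsigned width, std::uint64_t value) {
        bits_ = (bits_ & ~(mask(width) << shift)) | ((value & mask(width)) << shift);
    }

    std::uint64_t bits_ = 0;
};

static_assert(sizeof(PackedPing) == sizeof(std::uint64_t),
              "wrapper adds no storage beyond the single 64-bit field");
```

Because the bit positions are written out explicitly, the layout does not depend on how a particular compiler orders bit fields, which is the compatibility argument made above.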