Optimizers need a rethink

(typesanitizer.com)

100 points | by ingve 5 days ago

101 comments

  • kragen 11 minutes ago

    Daniel Bernstein has given an interesting talk about "the death of optimizing compilers" (PDF slides at https://cr.yp.to/talks/2015.04.16/slides-djb-20150416-a4.pdf) claiming that less and less of our code is in that middle ground where it's performance-critical enough to risk running it through an optimizer but not so performance-critical that it needs to be rewritten in assembly.

    If we think of the high-level source code (whether in SQL, JS, or Python) as being a functional spec that the low-level implementation should provably follow, maybe we should check them both in, and at build time, run a proof assistant to verify that the implementation still complies with the spec? Rather than running an optimizing compiler every time. I think this is what nanolith is suggesting in https://news.ycombinator.com/item?id=41965701.

    Maybe you run the optimizing compiler once when you first write the code, and again when the proof assistant tells you an optimization is no longer valid, but you check the output before you check it in. Or maybe you have the compiler running all the time on a GPU cluster looking for new optimizations.

  • pizlonator 13 hours ago

    Compiler optimizers are designed from a “win in the average” mindset - so whether a particular optimization succeeds or not is really not something the user ought to rely on. If you’re relying on it then you’re playing with fire. (I say that as someone who has played with this particular kind of fire. Sometimes I’m ok, other times I get burned.)

    A great example of when winning in the average works is register allocation. It’s fine there because the cost of any particular variable getting spilled is so low. So, all that matters is that most variables are in registers most of the time. If spill heuristics change for the better, it usually means some of your variables that previously got spilled now are in registers while others that were in registers are now spilled - and the compiler writer declares victory if this is a speedup in some overall average of large benchmarks. Similar thinking plays out in stuff like common subexpression elimination or basically any strength reduction. (In fact, most of those optimizations have the peculiar property that you’ll always be able to craft a program that shows the optimization to be a bad idea; we do them anyway because on average they are a speedup.)

    In my view, if a compiler optimization is so critical that users rely on it reliably “hitting” then what you really want is for that optimization to be something guaranteed by the language using syntax or types. The way tail calls work in functional languages comes to mind. Also, the way value types work in C#, Rust, C++, etc - you’re guaranteed that passing them around won’t call into the allocator. Basically, relying on the compiler to deliver an optimization whose speedup from hitting is enormous (like order of magnitude, as in the escape analysis to remove GC allocations case) and whose probability of hitting is not 100% is sort of a language design bug.
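
    For a concrete flavor of "guaranteed by the language using syntax", clang has an opt-in attribute for mandatory tail calls - the call either becomes a tail call or the build fails. A minimal, clang-specific sketch (not portable C++):

        // [[clang::musttail]]: the compiler must emit a tail call here
        // or refuse to compile - no silent fallback to a growing stack.
        long countdown(long n, long acc)
        {
            if (n == 0)
                return acc;
            [[clang::musttail]] return countdown(n - 1, acc + n);
        }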

    This is sort of what the article is saying, I guess. But for example on the issue of the optimizer definitely removing a GC allocation: the best design there is for the GC’d language to have a notion of value types that don’t involve allocation at all. C# has that, Java doesn’t.

    • WalterBright 11 hours ago

      One thing I learned about optimizers is some optimizations undo earlier optimizations. The state of the optimized program will sometimes flip-flop back and forth and never converge on a solution. Then you have to rely on a heuristic to guess at the better form, and live with it. The optimizer has a counter in it so it will "pull the plug" if it fails to converge, and will just go with the last state.
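
      A minimal sketch of that "pull the plug" pattern (illustrative only, nothing like the actual compiler source):

          struct Function { /* IR for one function */ };

          // Returns true if any pass changed the IR this round (stub).
          bool runAllPasses(Function &f);

          void optimize(Function &f)
          {
              const int maxRounds = 10;   // the convergence counter
              for (int round = 0; round < maxRounds; ++round)
              {
                  if (!runAllPasses(f))
                      return;             // converged
              }
              // hit the cap: passes kept undoing each other,
              // so just go with the last state
          }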

      My optimizer first appeared in Datalight C around 1984 or so. It was the first DFA optimizer for any C compiler on the PC. C compiler benchmark roundups were popular articles in programming magazines at the time. We breathlessly waited for the next roundup article.

      When it came, Datalight C was omitted from it! The journalist said they excluded Datalight C because it was buggy, as it deleted the benchmark code and just printed the success message. The benchmarks at the time consisted of things like:

          for (i = 0; i < 1000; ++i) a = 3;
      
      so of course it deleted the useless code.

      I was really angry about that, as the journalist never bothered to call us and ask about DLC's behavior. Our sales tanked after that article.

      But it wasn't long until the industry realized that optimizers were the future, the benchmarks were revised, and the survivors in the compiler business all did optimizers. DLC recovered, but as you can tell, I am still annoyed at the whole incident.

      • WalterBright 9 hours ago

        Recently, a D user remarked that a non-optimized build took a few seconds, but turning on the optimizer caused it to take 35 minutes. I was wondering how this was possible. It turns out his code had a function with 30,000 lines of code in it!

        Some things one just doesn't anticipate.

        Working on compilers is never dull.

        • subharmonicon 9 hours ago

          I’m honestly a bit surprised if that’s the first time you’ve seen something like that.

          I’ve been working on compilers 30 years, primarily on optimizers (although some frontend work as well). Learned C from Borland C and C++ from…Zortech C++ 1.0, so thank you for that!

          Within a few years of working on compilers I came across my first examples of 50k+ line functions. These were often (but not always) the result of source code generators that were translating some kind of problem description to code. It taught me very early on that you really need to focus on scalability and compile time in compilers, whether it’s the amount of code within a function, or across functions (for IPO / LTO).

          And yes, working on compilers is never dull. 25 years ago I thought we’d end up in a monolithic world with x86, C++, and Java being the focus of all work. Instead, there’s been an absolute explosion of programming models, languages, and architectures, as well as entirely new problem spaces like graph compilers for ML.

          • badmintonbaseba an hour ago

            > Within a few years of working on compilers I came across my first examples of 50k+ line functions. These were often (but not always) the result of source code generators that were translating some kind of problem description to code.

            Like the issue in the Linux kernel recently where they had some simple looking min/max macros that generated megabytes of source code.

          • WalterBright 8 hours ago

            I did see slow optimized builds before, but that was on much, much slower machines. I thought that was a thing of the past with modern CPUs.

      • pizlonator 10 hours ago

        > One thing I learned about optimizers is some optimizations undo earlier optimizations. The state of the optimized program will sometimes flip-flop back and forth and never converge on a solution. Then you have to rely on a heuristic to guess at the better form, and live with it. The optimizer has a counter in it so it will "pull the plug" if it fails to converge, and will just go with the last state.

        The way every optimizer I've worked on (and written) deals with this is canonical forms. Like, you decree that the canonical form of "multiply integer by 2" is "x << 1", and then you make sure that no optimization ever turns "x << 1" into anything else (though the instruction selector may then turn "x << 1" into "x + x" since that's the best thing on most CPUs).
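
        As a toy illustration of the idea (nothing like real LLVM IR, just the shape of it):

            // Toy IR node; rhsConst is the constant operand.
            struct Expr { enum Op { Mul, Shl } op; long rhsConst; };

            // Every pass funnels its output through this; no pass is
            // allowed to produce the non-canonical form, otherwise two
            // passes can ping-pong forever.
            void canonicalize(Expr &e)
            {
                if (e.op == Expr::Mul && e.rhsConst == 2) {
                    e.op = Expr::Shl;   // "x * 2" becomes "x << 1"
                    e.rhsConst = 1;
                }
            }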

        But that doesn't necessarily make this problem any easier. Just gives you a principled story for how to fix the problem if you find it. I think that usually the canonical forms aren't even that well documented, and if you get it wrong, then you'll still have valid IR so it's not like the IR verifier will tell you that you made a mistake - you'll just find out because of some infinite loop.

        And yeah, lots of compiler optimization fixpoints have a counter to kill them after some limit. The LLVM inliner fixpoint is one example of such a thing.

        > I was really angry about that, as the journalist never bothered to call us and ask about DLC's behavior. Our sales tanked after that article.

        Whoa! That's a crazy story! Thanks for sharing!

        • saagarjha 2 hours ago

          I'm not sure if any "real" optimizers are using them yet but in theory e-graphs can store a set of forms and avoid this kind of issue.

          • Sesse__ 2 hours ago

            Cranelift, the (perennially?) up-and-coming Rust compiler, is based on e-graphs.

            • saagarjha 2 hours ago

              I guess it's borderline "real" ;)

    • typesanitizer 9 hours ago

      > In my view, if a compiler optimization is so critical that users rely on it reliably “hitting” then what you really want is for that optimization to be something guaranteed by the language using syntax or types. The way tail calls work in functional languages comes to mind. Also, the way value types work in C#, Rust, C++, etc - you’re guaranteed that passing them around won’t call into the allocator. Basically, relying on the compiler to deliver an optimization whose speedup from hitting is enormous (like order of magnitude, as in the escape analysis to remove GC allocations case) and whose probability of hitting is not 100% is sort of a language design bug.

      > This is sort of what the article is saying, I guess.

      I agree with this to some extent but not fully. I think there are shades of grey to this -- adding language features is a fairly complex and time-consuming process, especially for mainstream languages. Even for properties which many people would like to have, such as "no GC", there are complex tradeoffs (e.g. https://em-tg.github.io/csborrow/)

      My position is that language users need to be empowered in different ways depending on the requirements. If you look at the Haskell example involving inspection testing/fusion, there are certain guarantees around some type conversions (A -> B, B -> A) being eliminated -- these are somewhat specific to the library at hand. Trying to formalize each and every performance-sensitive library's needs using language features is likely not practical.

      Rather, I think it makes sense to instead focus on a more bottom-up approach, where you give somewhat general tools to the language users (doesn't need to expose a full IR), and see what common patterns emerge before deciding whether to "bless" some of them as first-class language features.

      • pizlonator 9 hours ago

        I think I agree with all of that.

        My point is that if in the course of discovering common patterns you find that the optimizer must do a heroic optimization with a 10x upside when it hits and weird flakiness about when it hits, then that’s a good indication that maybe a language feature that lets you skip the optimization and let the programmer sort of dictate the outcome is a good idea.

        By the way, avoiding GC is not the same thing as having value types. Avoiding GC altogether is super hard. But value types aren’t that hard and aren’t about entirely avoiding GC - just avoiding it in specific cases.

    • pjmlp 12 hours ago

      Java doesn't have it yet, but they are making progress,

      https://jdk.java.net/valhalla/

      Yes, it was a bummer that Java didn't take up on the ideas of Cedar, the Oberon lineage, Modula-3, Eiffel,... even though some are quoted as its influences.

      Still, I am confident that Java might get value types before C++ reflection, networking, senders/receivers, or safety gets sorted out. Or even before we can finally write portable C++ code using C++20 modules.

      • astrange 11 hours ago

        When are they going to finish making progress? I've been hearing about this for years now and it doesn't seem to have happened. Which means Java continues with just about the worst design for memory efficiency you could have.

        • saagarjha 2 hours ago

          Don't worry, Python is always there to do a worse job.

        • pjmlp 4 hours ago

          When they are sure the design is sound enough not to do a Python 2/Python 3.

          If it was easy it would be done by now.

          There are plenty of long-time efforts in other language ecosystems that have also taken decades and are still not fully done, e.g. C++ modules, contracts, reflection,...

          If you really need value-type-like objects today, it is possible in Panama, even without language syntax for them.

      • dzaima 12 hours ago

        Unfortunately, it doesn't look like there's much of a guarantee that value objects wouldn't result in heap allocations. https://openjdk.org/jeps/401 even contains:

        > So while many small value classes can be flattened, classes that declare, say, 2 int fields or a double field, might have to be encoded as ordinary heap objects.

        There's a further comment about the potential of opting out of atomicity guarantees to not have that problem, but then there are more problems - looks like pre-JIT would still allocate, and who knows how consistent the JIT would be about scalarization. IIRC there was also some mention somewhere about just forcing large enough value objects to always be heap allocations.

        • elygre 11 hours ago

          That’s some very selective quoting there. Let’s do the full thing:

          > Heap flattening must maintain the integrity of objects. For example, the flattened data must be small enough to read and write atomically, or else it may become corrupted. On common platforms, "small enough" may mean as few as 64 bits, including the null flag. So while many small value classes can be flattened, classes that declare, say, 2 int fields or a double field, might have to be encoded as ordinary heap objects.

          And maybe the end of the next paragraph is even more relevant:

          > In the future, 128-bit flattened encodings should be possible on platforms that support atomic reads and writes of that size. And the Null-Restricted Value Types JEP will enable heap flattening for even larger value classes in use cases that are willing to opt out of atomicity guarantees.

          • pizlonator 11 hours ago

            If value types still require allocation for stuff larger than 128 bits then that really sucks! That’s not how value types usually work.

            • bremac 10 hours ago

              Value types still require allocation for types larger than 128 bits if the value is either nullable or atomic — that seems like a reasonable trade-off to me.

              • pizlonator 9 hours ago

                Oh! Reasonable trade-off indeed. Thank you for clarifying.

          • dzaima 11 hours ago

            Yeah, had edited my comment to expand a bit on that. Nevertheless, this is just one place where allocation of value objects may appear; and it looks like eliding allocations is still generally in the "optimization" category, rather than "guarantee".

      • pizlonator 12 hours ago

        That is really cool!

        It’s a dang hard feature to retrofit into the way the JVM works. I wish those folks the best of luck.

        • pjmlp 12 hours ago

          Yes, that is the biggest engineering effort of the whole thing, how to add the value types concept into the JVM, without breaking the ecosystem.

          JARs and modules that worked on the JVM before the introduction of value types should keep running, and new code has to be able to interoperate with such JARs.

    • crabmusket 12 hours ago

      > In my view, if a compiler optimization is so critical that users rely on it reliably “hitting” then what you really want is for that optimization to be something guaranteed by the language using syntax or types. The way tail calls work in functional languages comes to mind.

      Automatic vectorisation is another big one. It feels to me like vectorisation is less reliable / more complex than TCO? But on the other hand the downside is a linear slowdown, not "your program blows the stack and crashes".

      • dzaima 12 hours ago

        With clang you can add "#pragma clang loop vectorize(assume_safety)" to a loop to reduce the burden of proof of vectorizability and to do it if at all possible, giving a warning when it fails to. gcc has "#pragma GCC ivdep" to reduce dependency analysis, but it's not as powerful as clang's pragma.
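
        Roughly like this (the pragma is real; the loop is just a made-up example):

            void scale(float *dst, const float *src, float k, int n)
            {
                // Tell clang the iterations are independent, so it can
                // vectorize without proving the pointers don't alias.
                #pragma clang loop vectorize(assume_safety)
                for (int i = 0; i < n; ++i)
                    dst[i] = k * src[i];
            }

        Building with clang's -Rpass=loop-vectorize / -Rpass-missed=loop-vectorize remarks also tells you whether the vectorization actually happened.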

        • nextaccountic 3 hours ago

          > giving a warning when it fails to

          This is the critical point. If CI fails, or you are otherwise warned when the loop doesn't vectorize, then you can count on it always happening.

      • pizlonator 12 hours ago

        If you’re talking about autovectorization in C++ then you have the option of using intrinsics to get real vector code. So I think that’s fine because you have a way to tell the compiler, “I really want simd”.
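
        For instance, a minimal SSE sketch where the SIMD is spelled out instead of hoped for:

            #include <immintrin.h>

            // Adds four floats at a time; no autovectorizer involved.
            void add4(float *dst, const float *a, const float *b)
            {
                __m128 va = _mm_loadu_ps(a);
                __m128 vb = _mm_loadu_ps(b);
                _mm_storeu_ps(dst, _mm_add_ps(va, vb));
            }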

        • astrange 11 hours ago

          A better fundamental design would be a SPMD language (like ispc, or GPU languages) and then autoscalarization, which is a lot easier to do reliably than autovectorization.

          Intrinsics work poorly in some compilers, and Intel's intrinsics are so hard to read because of inscrutable Hungarian notation that you should just write in asm instead.

          • pizlonator 10 hours ago

            Ispc is great but I think you’re also saying that:

            - it would be better if the intrinsics had sensible names. I couldn’t agree more.

            - it would be better if compilers consistently did a good job of implementing them. I wonder which compilers do a bad job? Does clang do a good job or not so much?

            I think intrinsics make sense for the case where the language being used is not otherwise simd and that language already has value types (so it’s easy to add a vector type). It would be great if they at least worked consistently well and had decent names in that case.

            • saagarjha 2 hours ago

              I have yet to see a compiler that does a good job with autovectorization.

            • neonsunset 10 hours ago

              Intrinsics have somewhat saner names in C#, and unify under the same types for both portable and platform-specific variants (Vector64/128/256/512<T>, numeric operators on them and methods on VectorXXX.* or Sse42/Avx/AdvSimd/etc.)

  • keybored 13 hours ago

    The standard optimization use-case story is absurd.

    1. You’re not gonna get any guarantees that the optimization will happen. That makes it High Level. Just write code. We won’t force you to pollute your code with ugly annotations or pragmas.

    2. In turn: check the assembly, or whatever concrete thing reveals whether the optimization you wished for in your head actually went through.

    There’s some kind of abstraction violation in the above somewhere.

    • glitchc 13 hours ago

      Agreed. You put your finger directly on something that's always bugged me about optimistic code optimization strategies. The code I write is supposed to have deterministic behaviour. If the determinism changes, it should be because of a change I made, not a flag in the compiler. The behaviour is completely opaque and uncorrelatable, which makes it very hard to figure out if a given change will lead to better or worse performance.

      "Abstraction violation" is a good way to put it.

      • rocqua 12 hours ago

        The excuse for this is that performance is not considered part of the behavior of the program. So it doesn't matter to the question of whether your program is deterministic.

        • bluGill 12 hours ago

          These days you don't know what CPU you run on, so you can't make performance guarantees anyway. Even in embedded we have been burned by the only CPU going out of production often enough to not depend on it anymore in most cases.

        • glitchc 9 hours ago

          Except for the fact that performance optimization frequently does change the behaviour of the program.

    • GuB-42 11 hours ago

      It is a leaky abstraction, as all abstractions are. I haven't found it to be much of a problem in practice.

      Usually, if you don't get the optimization you wished for, it means that there is something you didn't account for. In C++, it may be exception processing, aliasing rules, etc... Had the compiler made the optimization you wished for, it wouldn't have been correct with regard to the specifications of the language; it might even have hidden a bug. The solution is then to write it in a way that is more explicit, to make the compiler understand that the edge case can never happen, which will then enable the optimization. It is not really an abstraction violation, more like a form of debugging.
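
      A typical example of "making the edge case impossible" is aliasing. A sketch (__restrict is a compiler extension, not standard C++):

          // Without __restrict the compiler must assume y and x might
          // overlap, and often adds a runtime check or skips vectorizing.
          void saxpy(float *__restrict y, const float *__restrict x,
                     float a, int n)
          {
              for (int i = 0; i < n; ++i)
                  y[i] += a * x[i];
          }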

      If you really need to get low level, there is some point where you need to write assembly language, which is obviously not portable, but getting every last bit of performance is simply incompatible with portability.

    • jiggawatts 13 hours ago

      I have the same experience with database query planners. The promise is that you just write your business logic in SQL and the planner takes care of the rest. In practice you spend weeks staring at execution plans.

      • dspillett 12 hours ago

        A key difference between many compilers and DB query planners is that a compiler can spend more time on its optimisations, because it is run dev-side and the benefits (and none of the time taken) are felt by the users. A query planner needs to be more conservative with its resource use.

        This is not true of JIT compilers, of course, which have similar constraints to DB query planners. In these cases the goal is to do a good job pretty quickly, rather than an excellent job in a reasonable time.

        • pjmlp 2 hours ago

          However, most big-boy databases also package a JIT compiler into the mix, making it more fun.

        • jiggawatts 11 hours ago

          > A query planner needs to be more conservative with its resource use.

          The number of possible distinct query plans grows very rapidly as the complexity increases (exponentially or factorially... I can't remember). So even if you have 10x as much time available for optimisation, it makes a surprisingly small difference.

          One approach I've seen with systems like Microsoft Exchange and its underlying Jet database is that queries are expressed in a lower-level syntax tree DOM structure. The specific query plan is "baked in" by developers right from the beginning, which provides stable and consistent performance in production. It's also lower latency because the time spent by the optimiser at runtime is zero.

    • randomNumber7 11 hours ago

      When you already know what you want to have after the optimization, you could just write that code in the beginning.

      • aleph_minus_one 10 hours ago

        Not all programming languages have this amount of expressiveness. How do you, for example, tell a database what the intended execution plan is for some complicated SQL query?

        You can normally only send SQL queries to a database and not execution plans.

        • magicalhippo 2 hours ago

          Using bit rotation instructions in C is another case I recall.

          Since there's no bit rotate operator in C, you're left hoping the compiler recognizes what the shifts and bitwise-ands are trying to do.
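
          The usual idiom looks something like this (and C++20 finally added std::rotl so you no longer have to hope):

              #include <cstdint>

              // The pattern GCC/clang/MSVC recognize and turn into a
              // single rotate instruction; masking the count avoids UB
              // when n is 0 or >= 32.
              std::uint32_t rotl32(std::uint32_t x, unsigned n)
              {
                  n &= 31;
                  return (x << n) | (x >> ((32 - n) & 31));
              }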

        • Spivak 9 hours ago

          I've always wanted to just write the execution plan directly and bypass the planner for a few hot queries.

  • froh 41 minutes ago

    in the "closing thoughts" [1] of the OP article there's a pointer to "Proebstrings law" [2] which got me to a benchmark [3] how LLVM has improved over 10 years from 2.7 to 11

    which leaves me with the strong intuition that indeed what might benefit more from a rethink is not optimizers but a focus on programmer productivity

    > Perhaps this means Programming Language Research should be concentrating on something other than optimizations. Perhaps programmer productivity is a more fruitful arena.

    [1] https://typesanitizer.com/blog/rethink-optimizers.html#closi...

    [2] https://proebsting.cs.arizona.edu/law.html

    [3] https://zeux.io/2022/01/08/on-proebstings-law/

  • leni536 14 hours ago

    I think this is mostly a philosophical question, rather than one of optimizer quality.

    Once an optimization becomes part of the interface and it is guaranteed, is it really an optimization? Or did it just become part of the language/library/database/whatever?

    One example is return value optimization in C++. In C++17 the "optimization" became mandatory in some contexts. What really happened though is that the rules of temporary materialization changed, and in those contexts it just never happens prematurely by the language rules. This ceased to be an optimization and became a mechanism in the language.
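
    Concretely, a minimal example of the C++17 rule:

        #include <vector>

        std::vector<int> make()
        {
            // A prvalue: since C++17 it is constructed directly in the
            // caller's storage. No copy, no move - by rule, not by the
            // optimizer's good graces.
            return std::vector<int>(1000);
        }

        int main()
        {
            auto v = make();
            return v.size() == 1000 ? 0 : 1;
        }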

    What I'm getting at is that unreliability is a defining quality of optimizations.

    Sure, there are certain optimizations that become load-bearing, in which case it would be better if they became part of the language's semantics and guarantees, thereby ceasing to be optimizations.

    • gizmo686 13 hours ago

      It is still a useful distinction. Programming languages are complex enough to understand as it is. It is useful to have a base of "this is the simplest description of how programs will work" that is as simple as possible. Then, a separate set of "and here is a separate description of its performance characteristics; nothing in this portion should be understood to change the defined behavior of any program".

      Even if that second description is stable and part of the guarantees you make, keeping it separate is still incredibly useful from a user perspective.

      From an implementation perspective, there is also a useful distinction. Optimizations take a valid representation, and turn it into a different valid representation of the same type that shares all defined behavior. This is a fairly different operation than compilation, which converts between representations. In particular, for the compilation step, you typically have only one compilation function for a given pair of representations; and if you have multiple, you select one ahead of time. For optimizations, each representation has a set of optimization functions, and you need to decide what order to apply them in and how many times to do so. Compilation functions, for their part, need to deal with every difference between the two representations, whereas optimization functions get to ignore everything except the part they care about.

  • h1fra 12 hours ago

    I don't hate query planning; some people are better than me at writing algorithms, and the database knows my data better than I do.

    However, what I hate is the lack of transparency (and I feel like this article tries to pinpoint just this). When I execute a query locally I get a different plan vs staging vs prod. A plan that can also change depending on some parameters or load or size.

    I don't care about understanding all the underlying optimizations, I just care that the query plan I saw is the same and is still the same in prod, and that I can be warned when it changes. PG does not return the hash of the query plan or metrics along with the data, which is imo a mistake. With this you could track it in your favorite metrics store and be able to pinpoint when and why stuff is executing differently.

  • nanolith 13 hours ago

    One area that I have been exploring is building equivalence proofs between high-level specifications, an implementation in C, and the machine code output from the compiler. I'm still very early in that work, but one of my hopes is to at least demonstrate that the output still meets the specifications, and that we can control things like timing (e.g. no branching on secret data) and cache in this output.

    I think that the compilation and optimization step, as a black box, is a disservice for highly reliable software development. Compiler and optimizer bugs are definitely a thing. I was bitten by one that injected timing attacks into certain integer operations by branching on the integer data in order to optimize 32-bit multiplications on 8-bit microcontrollers. Yeah, this makes perfect sense when trying to optimize fixed point multiplication, but it completely destroys the security of DLP or ecDLP based cryptography by introducing timing attacks that can recover the private key. Thankfully, I was fastidious about examining the optimized machine code output of this compiler, and was able to substitute hand coded assembler in its place.
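
    For context, the kind of code in question is a branchless select like the sketch below; the point is that nothing in C or C++ semantics stops the optimizer from turning it back into a branch, which is why inspecting the machine code (or hand-writing assembly) was necessary:

        #include <cstdint>

        // Branchless constant-time select: mask must be all-ones or
        // all-zeros. Nothing in the language guarantees the compiler
        // keeps it branch-free.
        std::uint32_t ct_select(std::uint32_t mask,
                                std::uint32_t a, std::uint32_t b)
        {
            return (a & mask) | (b & ~mask);
        }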

    • kragen 4 minutes ago

      What do you think about Jasmin?

    • cesarb 13 hours ago

      > One area that I have been exploring is building equivalence proofs between high-level specifications, an implementation in C, and the machine code output from the compiler.

      AFAIK, that's how seL4 is verified. Quoting from https://docs.sel4.systems/projects/sel4/frequently-asked-que...

      "[...] Specifically, the ARM, ARM_HYP (ARM with virtualisation extensions), X64, and RISCV64 versions of seL4 comprise the first (and still only) general-purpose OS kernel with a full code-level functional correctness proof, meaning a mathematical proof that the implementation (written in C) adheres to its specification. [...] On the ARM and RISCV64 platforms, there is a further proof that the binary code which executes on the hardware is a correct translation of the C code. This means that the compiler does not have to be trusted, and extends the functional correctness property to the binary. [...] Combined with the proofs mentioned above, these properties are guaranteed to be enforced not only by a model of the kernel (the specification) but the actual binary that executes on the hardware."

      • nanolith 13 hours ago

        Indeed it is. What I'm working toward is a more efficient way to do the same, that doesn't take the touted 30 man years of effort to accomplish.

        I'm working on a hybrid approach between SMT solving and constructive proofs. Model checking done with an SMT solver is pretty sound. I'm actually planning a book on a scalable technique to do this with CBMC. But, the last leg of this really is understanding the compiler output.

    • nlewycky 12 hours ago

      > I was bitten by one that injected timing attacks into certain integer operations by branching on the integer data in order to optimize 32-bit multiplications on 8-bit microcontrollers.

      FWIW, I think this should be considered a language design problem rather than an optimizer design problem. Black box optimizer behaviour is good for enabling language designs that have little connection to hardware behaviour, and good for portability including to different extensions within an ISA.

      C doesn't offer a way to express any timing guarantees. The compiler, OS, CPU designer, etc. can't even do the right thing if they wanted to because the necessary information isn't being received from the programmer.

      • jcranmer 12 hours ago

        To a large degree, constant-time programming is hampered by the fact that even hardware is often unwilling to provide constant-time guarantees, let alone any guarantees that the compiler would care to preserve. (Although, to be honest, constant-time guarantees are the sort of things that most compiler writers prefer to explicitly not guarantee in any circumstances whatsoever).

        • bluGill 12 hours ago

          8-bit CPUs offer constant time. 16-bit was starting to get into the issues where you cannot offer it.

          • wiml 6 hours ago

            Your comment makes me wonder about the idea of building a superscalar, out of order, speculative implementation of the 6502 or 8080 instruction sets. Might make a good educational project.

      • nanolith 12 hours ago

        Few languages provide such guarantees. But, there really was no way with this particular compiler to pass a hint to generate constant time code.

        Black box designs work until the knob or dial you need to control it isn't there. I would have taken a pragma, a command-line option to the compiler, or even a language extension.

        This is one example of many as to why I think that user-guided code generation should be an option of a modern tool suite. If I build formal specifications indicating the sort of behavior I expect, I should be able to link these specifications to the output. Ultimately, this will come down to engineering, and possibly, overriding or modifying the optimizer itself. An extensible design that makes it possible to do this would significantly improve my work. Barring that, I have to write assembler by hand to work around bad assumptions made by the optimizer.

    • gizmo686 13 hours ago

      If you haven't done so, you might want to look at some of the work done by the seL4 microkernel project.

      They start with a Haskell prototype that is translated programatically into a formal specification for the theorem prover.

      They then implement the same thing in C, and use a refinement proof to demonstrate that it matches their Haskell implementation.

      They then compile the program, and create another refinement proof to demonstrate that the binary code matches the C semantics.

      • nanolith 13 hours ago

        I'll refer you to my reply to a sibling comment. I'm hoping that I can build a more efficient means of doing similar work as with seL4, but without the 30 man year effort.

        They are on the right track. But, I think there have been some improvements since their effort that can lead to more streamlined equivalence proofs.

  • WalterBright 11 hours ago

    > D has first-class support for marking functions as “no GC”.

    It never occurred to me that this would be considered a hint to the optimizer. It doesn't affect code generation. What it does do is flag any use of the gc in the function and any functions it transitively may call.

    Optimizers have been likened to turning a cow into a hamburger. If you're symbolically debugging optimized code, you're looking at the hamburger. Nobody has been able to solve that problem.

    It's true that optimizers themselves are hard to show being correct. The one in the D compiler is a conventional DFA optimizer that uses data flow equations I learned from Hennessy and Ullman in a 1982 seminar they taught. So it has been battle tested for 42 years now(!) and it's pretty rare to find a problem with it, unless it's a new pass I added like SROA. The idea is anytime a problem is identified and corrected, it goes into the test suite. This has the effect of always ratcheting it forward and never regressing.

    The GC dates from around 2000, when I wrote it for a Javascript engine. It was brutally tested for that, and has been pretty solid ever since. People complain about the GC, but not about it being buggy. A buggy GC is a real horror show as it is painfully difficult to debug.

    • typesanitizer 10 hours ago

      Thanks for the feedback.

      The preceding paragraph had "and occasionally language features" so I thought it would be understood that I didn't mean it as an optimizer-specific thing, but on re-reading the post, I totally see how the other wording "The knobs to steer the optimizer are limited. Usually, these [...]" implies the wrong thing.

      I've changed the wording to be clearer and put the D example into a different bucket.

      > In some cases, languages have features which enforce performance-related properties at the semantic checking layer, hence, granting more control that integrates with semantic checks instead of relying on the optimizer:

      > - D has first-class support for marking functions as “no GC”.

    • saagarjha 2 hours ago

      Languages that run in a VM are pretty good at turning the hamburger back into a cow when you're looking at it.

      • dzaima 2 hours ago

        But they only allow you to look at it at certain points in time (unless requested more precisely ahead-of-time), and/or lose the ability to do a bunch of meaningful optimizations. Whereas compiled programs are expected to be pausable and made sense of at any assembly instruction, and there'll always be some people wanting to trade perfect debuggability for perf.

        • saagarjha 2 hours ago

          Not really. If you look at HotSpot for example it will deoptimize a function for you all the way if you debug it and then reoptimize it when you stop single-stepping.

          • dzaima an hour ago

            But how do you get into the "debug it" mode? If you mark it as to-be-debugged ahead of time (e.g. adding a breakpoint) then it can indeed set up exactly that spot ahead of time to be debuggable from. If you just pause the VM at a random time, it can freely choose to quietly run maybe a couple dozen to a thousand instructions to reach a safepoint. Compiled languages don't have that luxury, as they don't get to control when a coredump happens or ptrace attaches, but people nevertheless expect them to provide a debugging environment.

            Even if HotSpot had perfect assembly-level debug information (which it cannot, as it does do (a tiny bit of) autovectorization, which by necessity can reorder operations, potentially leading to intermediate states that cannot be mapped to any program state), that just means it'd come at a performance cost (e.g. no autovectorization).

            • dzaima an hour ago

              Also a general problem is that HotSpot does not tolerate memory corruption, but for memory-unsafe languages you'd quite like to have an as-reasonable-as-possible debugging environment even if 60% of the heap was zeroed by bad code, which throws a pretty large wrench in being able to complete a "just deoptimize everything" as the first step to debugging.

            • dzaima an hour ago

              Perhaps there's an argument for having two debugging modes for compiled programs - the "basic" best-effort one that is what we already have, and a feature-complete one that requires tight runtime integration and cannot relate to any of the compiled assembly. But that's adding a decent amount of complication & confusion, doesn't really help coredumps, and still has to pay the cost of having things be deoptimizable even if at limited locations.

  • Mikhail_Edoshin 3 hours ago

    One of reasons SQL is declarative is that queries run concurrently yet give the illusion of working in isolation. And unlike the concurrency of independent processes, which is mostly mechanical, this one involves detailed coordination because database queries work on the same data. With these constraints it may be not that easy to come up with an imperative model that is as simple as the current declarative one.

  • andyferris 11 hours ago

    If you have any language where it is "semantically correct" to execute it with a simple interpreter, then all optimizations in that language are not semantically important by definition, right? I've even seen interpreters for C... so while I 100% have felt the pain in this article, I don't know where that leaves us? These are more like compilation assertions/strategies than language features. (Not that these shouldn't be written inline with the code, e.g. it can be nice to write unit tests in the same files as the code in some languages).

    In the case of SQL, I'd love access to a flavor where I do the joins on indices explicitly, the query is executed as written, and each join (or filter) can be annotated with a strategy (btree lookup, etc). (The most difficult part about using indices by hand is correctly writing all the synchronous triggers on updates, not the queries, IMO).

    • typesanitizer 9 hours ago

      > If you have any language where it is "semantically correct" to execute it with a simple interpreter, then all optimizations in that language are not semantically important by definition, right?

      Technically, yes. :)

      But I think this should perhaps be treated as a bug in how we define/design languages, rather than as an immutable truth.

      - We already have time-based versioning for languages.

      - We also have "tiers" of support for different platforms in language implementations (e.g. rarer architectures might have Tier 2 or Tier 3 support where the debugging tooling might not quite work)

      One idea would be to introduce "tiers" into a language's definition. A smaller implementation could implement the language at Tier 1 (perhaps this would even be within reach for a university course project). An industrial-strength implementation could implement the language at Tier 3.

      (Yes, this would also introduce more complications, such as making sure that the dynamic semantics at different tiers are equivalent. At that point, it becomes a matter of tradeoffs -- does introducing tiers help reduce complexity overall?)

    • Archit3ch 5 hours ago

      > If you have any language where it is "semantically correct" to execute it with a simple interpreter, then all optimizations in that language are not semantically important by definition, right?

      Not when "correct" needs optimizations to meet real-time guarantees. It's hard to argue that a program which doesn't run is "correct".

  • yxhuvud 2 hours ago

    No. Language designers need to think about what their language promises and when some users have differing needs, they may need to fulfill those needs.

  • skybrian 12 hours ago

    > The optimizer’s behavior is in some contexts load-bearing and has little margin for error (e.g. missing an optimization).

    Well, sure, sometimes, to an extent. But if it's load-bearing, maybe that's a bug? You might have written non-portable code that won't last, because it depends on an implementation detail that isn't standardized.

    There are widespread applications where performance isn't critical. For example, any time you do a network request. Web pages shouldn't break because the network is slow today, if you can possibly avoid it.

    The web provides no performance guarantees, but it tries pretty hard to provide compatibility guarantees. Your code should, usually, work on new browser versions and on devices that haven't been released yet. New browsers will have different JIT compilers with different performance cliffs. And yet, there are websites written many years ago that still work.

    When standardizing things, we need to be precise about what's standardized and what isn't, or protocols "rust shut" and can't be changed without breaking lots of stuff. Not standardizing performance is often a win. (With some exceptions like video game consoles where the hardware is standardized.)

    Hyrum's law suggests that all compiler optimizations will eventually be load-bearing for someone, but we should usually try to avoid it. To make sure that the code is robust, perhaps performance should vary. Maybe it would be useful to have something like a chaos monkey for programs, where optimizations vary based on a random seed?

    • bsder 11 hours ago

      > But if it's load-bearing, maybe that's a bug?

      If your "debug optimization" code is so slow as to be unusable (see: Rust), then your optimizations qualify as load-bearing.

      The problem is that "optimization level" needs a mindset change. The optimization levels should be "release", "small" and "experimental".

      "Release level" needs to be perfectly usable for debugging as well as in production--"debug level" should be "release level". Compilation time should be reasonable and run time should be functional.

      After that, for embedded, you should have "small level"--checks should get turned off and any optimizations that make code significantly bigger should get turned off (loop unrolling, for example). You might enable some optimizations that make compile time brutally slow.

      Finally, there should be an "experimental level" which tests out optimizations before they go into release.

      And there should be no "fast" optimization level. If your optimization is that situation specific, it should stay stuck in "experimental".

      And through all of this, the compiler should also eject files that carry enough information to allow debuggers to make sense of the compiled code, unwind the optimizations when the users asks, and present a coherent version of what is going on. This is actually where compilers really break down nowadays. The compiler needs to eject enough information and context that a debugger can unwind what is going on rather than being an afterthought.

      We need the equivalent of an LSP (Language Server Protocol) for debuggers.

      • dzaima 10 hours ago

        DWARF does contain a Turing-complete VM for doing unwinding, so theoretically it should already be possible to make a compiler that gives precise debug info everywhere, no new protocol necessary.

        But completely-reversible-anywhere optimizations are quite restrictive, disallowing a bunch of things, including, but not limited to, dead code elimination (more generally, forgetting some fraction of state from somewhere), identical code merging, and instruction reordering (esp. vectorization, which essentially reorders across multiple loop iterations) - severely limiting what your optimizer can even do.

      • astrange 10 hours ago

        > And through all of this, the compiler should also eject files that carry enough information to allow debuggers to make sense of the compiled code, unwind the optimizations when the users asks, and present a coherent version of what is going on.

        You can't do this because some optimizations can't be seen through; most obvious one is identical code merging. If you're in a merged function then you don't know which of the original source lines you're looking at.

        I'd like to see a system which only had optimized compilation and used an interpreter for debugging.

      • FridgeSeal 10 hours ago

        How would this model handle code that needs to be as fast as possible, at the cost of debugging? E.g. all hot-loop code, all high performance numerical code, all latency sensitive and instruction sensitive code?

        > If your optimization is that situation specific, it should stay stuck in "experimental".

        Yeah, I have a feeling that there are low-key lots of applications for which many of the compiler optimisations you’re branding “experimental” are in fact run-of-the-mill; they would just enable “experimental” mode so frequently it’d be synonymous with “actually release mode”.

        > And through all of this, the compiler should also eject files that carry enough information to allow debuggers to make sense of the compiled code, unwind the optimizations when the users asks, and present a coherent version of what is going on.

        This is a pretty cool idea, semi-related, you should check out some of the work around E-Graphs, which attempt to enable the same level of optimisations but also preserve more original context. It’d be neat to have the equivalent of source-maps for highly optimised code, but I’m not sure how we’d manage them without them becoming enormous. After things like inlining, constant-folding, desugaring, rearrangement operations, etc all take place I suspect you’d be left with code that bears only the merest passing resemblance to its original form.

      • skybrian 11 hours ago

        For the web, there are debuggers, but no apparent "debug mode."

  • f33d5173 12 hours ago

    the optimizer should be a magic black box. As soon as you start demanding a particular optimization, it shouldn't be an optimization anymore. In C you have inline assembly and intrinsics and pragmas and macros and so on. If you want the compiler to compile your code a particular way you should be using these, not trying to wrangle the optimizer to invoke a particular optimization.

    • dzaima 12 hours ago

      Except intrinsics aren't even that much of a guarantee - clang converts them to its internal operations and applies the same optimization passes over them as it does on its own autovectorized code; and there are no intrinsics for x86's inline memory operands, so issues can arise around those (or the inverse - I've seen clang do a memory load of the same constant twice (the second one being behind some branches), despite there being only one such load).

      And there also are no intrinsics for most scalar operations, e.g. if you wanted to force "x>>48 == 0x1234" to be actually done via the shift and not "x & 0xffff000000000000 == 0x1234000000000000" (or vice versa).

      And of course assembly means writing platform-specific code (potentially undesirable even if you want to only do the optimization for a single architecture, as it means having to learn to write assembly of said architecture).

      There is some potential middle ground of doing black-boxing; as-is in C/C++ the way to do this is with a no-op asm block, but that can make register allocation worse, and it still requires some platform-specific logic for deriving the register kind from the value type.
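
      For reference, the no-op asm trick looks like this (GCC/clang syntax; the "r" constraint is exactly the platform-specific part mentioned above):

          // Forces the compiler to treat x as read and rewritten by an
          // opaque instruction, so it can't fold or re-derive the value.
          inline void opaque(unsigned long &x)
          {
              asm volatile("" : "+r"(x));
          }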

  • mcfig 11 hours ago

    Great article. One real case I encountered that I find thought-provoking is where a bunch of test failures got lumped into the same bucket because link-time code generation had noticed that a bunch of C++ getter functions had the same output code and combined them all. So stack traces became confusing because the address-to-symbol mapping was more complicated than the logic we had in place was prepared for.

    i.e. optimization had violated a rule we were implicitly relying on (that each non-inlined function should start at a distinct address, so that address-to-symbol mapping could be done easily). But that’s not an explicit guarantee and optimizers don’t seem to think about it much. (Well for inlining it seems to have had some thought, still sucks, but anyway this case doesn’t fit the pattern of inlining).

    I find it hard to say anyone is dead wrong in this case… but I would turn off that LTCG optimization any time I could, except where proven necessary.

  • pornel 11 hours ago

    Optimizations in compilers like LLVM are done by many individual code transformation passes, one applied to the result of the previous.

    This layering makes the order of the passes important and very sensitive. The passes usually don't have a grand plan, they just keep shuffling code around in different ways. A pass may only be applicable to code in a specific form created by a previous simplification pass. One pass may undo optimizations of a previous pass, or optimize-out a detail required by a later pass.

    Separation into passes makes it easier to reason about correctness of each transformation in isolation, but the combined result is kinda slow and complicated.

  • CalChris 11 hours ago

    We are nearing the death of Proebsting's Law. AMD CEO Lisa Su's HotChips’19 keynote said that compilers had accounted for an 8% performance increase over the decade. That means compilers are now only doubling performance every 90 years.

    https://www.youtube.com/watch?v=nuVBid9e3RA

  • gpm 12 hours ago

    Arguing against query planning by pointing at a quote about databases is wild. Automatic query planning is ubiquitous and hugely successful in databases.

    • typesanitizer 10 hours ago

      I've added a clarification in the post to make my position explicit:

      > This is not to imply that we should get rid of SQL or get rid of query planning entirely. Rather, more explicit planning would be an additional tool in database user’s toolbelt.

      I'm not sure if there was some specific part of the blog post that made you think I'm against automatic query planning altogether; if there was, please share that so that I can tweak the wording to remove that implication.

      • gpm 9 hours ago

        > I'm not sure if there was some specific part of the blog post that made you think I'm against automatic query planning altogether; if there was, please share that so that I can tweak the wording to remove that implication.

        The quote from another article (which I didn't read) starting with "I dislike query planners".

        "Against ... altogether" is mildly stronger than I took away from this, more like "generally of the opinion that the tradeoff nearly everyone is making with sql isn't worth it".

        Judging by the lack of upvotes other people didn't react as strongly to this quote as I did, so take it as you will.

    • bsder 12 hours ago

      And sometimes highly problematic.

      I'm surprised that the "query planner" doesn't have a way to eject an opaque object that is the "assembly language of the query", which you can run and which it is not allowed to change.

      • gpm 9 hours ago

        Sure. It's definitely a tradeoff, one which hurts on rare occasion. I agree that the lack of a fallback in most databases is a bit strange. Altogether though, the productivity benefits of defaulting to a query planner have proven larger than the drawbacks.

  • sebmellen 13 hours ago

    I like that the author included their intended audience up front. Definitely not me, but it helped me read the article with a different perspective.

  • jonstewart 12 hours ago

    At least databases have Explain. I'd love to get feedback from clang or gcc about why particular optimizations were not applied.

    • typesanitizer 10 hours ago

      I'm guessing you've tried these flags mentioned in the blog post but haven't had luck with them?

      > LLVM supports an interesting feature called Optimization Remarks – these remarks track whether an optimization was performed or missed. Clang support recording remarks using -fsave-optimization-record and Rustc supports -Zremark-dir=<blah>. There are also some tools (opt-viewer.py, optview2) to help view and understand the output.

    • einpoklum 12 hours ago

      Explain doesn't give you that information in many (most?) DBMSes. It's a bit like seeing the compiler IR code of your program. It lets you understand some things, while others remain a mystery.

  • QuadmasterXLII 12 hours ago

    Frankly, the problem is that (generally, across languages and various compiler hints) @inline sometimes fails to inline. At this point I’ve given up on ever having an @inline that reliably inlines, and I would very happily settle for an @assert_inline that doesn’t change the generated assembly at all but reliably crashes out if the function isn’t inlined.

    Julia is by far the worst language about this. It would be vastly more usable with the addition of @assert_type_stable, @assert_doesn’t_allocate, and @assert_doesn’t_error macros.

    • andyferris 11 hours ago

      On the Julia front, I believe the evolving effect system may help a little bit with obtaining/surfacing that information. JET.jl should be able to solve the type stability (inference) one from a whole-of-program perspective. We have `@inferred` for unit tests. The macros would be a cool addition - I wonder if they would get overused though?

      I agree that Julia takes the idea of optimization to the extreme - it's semantically a very dynamic language and only fast due to non-semantically guaranteed optimization. On the other hand, getting access to the generated IR, LLVM and assembly and iteratively improving it is far easier than any other language I've seen.

      • pjmlp 2 hours ago

        What Julia offers is quite common in Common Lisp systems.

    • dzaima 11 hours ago

      With gcc & clang you can use __attribute__((always_inline)) to force-inline, even at -O0, giving an error if it's impossible.

  • einpoklum 12 hours ago

    From the article:

    > Have a good mental model of what the optimizer can and cannot do.

    Most DB query planner designers and implementers have little imagination, and their mental model of what optimizers can and cannot do is, well, extremely narrow-minded. There is a huge unexplored space of what query planning can be (at least for analytic queries, when we think in columnar terms) - if we just stop insisting on thinking of DBMS operations as black boxes.