One Billion Nested Loop Iterations

(benjdd.com)

49 points | by behnamoh 4 hours ago

36 comments

  • mcdeltat an hour ago

    You have to be extremely careful with benchmarks like these. It's very easy to measure and conclude different things. In this case, it really only measures the compiler's ability to optimise the `for` loop construct - nothing else. It doesn't say anything much about the languages' general "speed", whatever that means (which is generally the implication with tests like this).

    Some points to consider:

    - `for` is not how you'd write an optimised loop in all languages.

    - Be VERY suspicious of microbenchmarks on C, C++, and similar - the compilers are smarter than you think and you won't get the code you wrote.

    - Languages do arrays differently - added memory effects on performance.

    - Languages do numbers differently.

    - Summing and modulo is a somewhat contrived example. Modulo is one of the slower instructions out there.

  • eiathom 2 minutes ago

    It's interesting to imagine the head scratching going on in this thread. All these clever people, stumped. 'How can this be!???'. Quite the post, bravo OP.

  • ayaros 27 minutes ago

    I'm working on a web app in JS, the sort of thing I think people on here will enjoy and I'll definitely share it when it's presentable. It's all in JS, including the graphical aspects of it (I'm drawing to a canvas). For per-pixel rendering I have 4 levels of nested loops. I feel like I've done everything in my power to optimize the process, and it's still not at the ideal speed. Really, it ought to be about 5-10 times faster than it is now.

    I've been thinking of getting it into WASM somehow - it's really just one function that needs to be replaced by a WASM module - but one of the things that has given me pause is benchmarks like this, as well as other anecdotes, which suggest I'd only get a minor speedup from WASM. I guess part of this is due to modern JS engines being so optimized.

    Just about all of the relevant code is adding and subtracting integers, but there's some Boolean logic. There's also some multiplication in there that I could eliminate if I really had to, although at this point I'm not certain it would make a significant difference. Most of the data that contributes to the value of each pixel is spread across a tree of objects, and there's some polymorphism involved too. The point is, it's not like I'm just dealing with a bunch of contiguous data in memory that is easily cached by the CPU, which might speed things up considerably. I guess I'm just venting now, lamenting a nightmare of my own creation...

  • iambvk an hour ago

    Golang folks, after looking at the code, I am surprised and don't understand why the Go version is slow compared to C/C++/Rust. Can anyone please explain? Thank you.

    • bddicken an hour ago

      Author here: Yeah, I was surprised that there don't seem to be many options for extra optimizations in the Go compiler. Would be curious if Go experts have more insight or other compiler recommendations.

      • boyter 19 minutes ago

        I doubt it's the GC kicking in, but you could run it with the following environment variable set just to ensure that it's not doing anything.

            GOGC=-1
        
        EDIT: Trying it out quickly actually shows a small improvement, but it's so small it's likely just noise, since I was doing other things on the machine.

            Summary
              ./mainnogc 1000 ran
                1.01 ± 0.06 times faster than ./maingc 1000
  • maxmcd 2 hours ago

    Another fave of mine by this person: https://benjdd.com/az/

  • superjan 2 hours ago

    It's a billion integer modulus operations, summed. Div/mod is by far the slowest arithmetic instruction on a CPU, so it underestimates C/Rust performance. I guess the author chose that to prevent the C compiler from optimizing the loop away.

    • bddicken an hour ago

      Author here: Yeah, the mod was added to reduce the risk of the loop(s) getting optimized away.

      • remcob 38 minutes ago

        Did you verify this through disassembly? These loops can still be replaced by a closed-form expression, and I wouldn't be surprised if LLVM figured that out.
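
        For illustration, a rough sketch in Python of what such a rewrite could look like (the loop bound and names here are assumptions for illustration, not the benchmark's actual code): the inner sum of `j % u` has a closed form, so a sufficiently aggressive optimizer could in principle drop the loop entirely.

            # Hypothetical sketch: the sum of (j % u) for j = 0..n-1 has a closed form,
            # which is the kind of rewrite an optimizer could apply to skip the loop.
            def inner_loop(n, u):
                total = 0
                for j in range(n):
                    total += j % u
                return total

            def inner_closed_form(n, u):
                q, r = divmod(n, u)  # q complete 0..u-1 cycles, then a partial run of length r
                return q * u * (u - 1) // 2 + r * (r - 1) // 2

            assert inner_loop(100_000, 7) == inner_closed_form(100_000, 7)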

  • obituary_latte 30 minutes ago

    The animation is confusing... is one trip to the end and back the 1B iterations? I thought maybe they were counting up and that once the slowest reached the end they would all stop, but that didn't seem to be it. Maybe if it did (and there was a counter on each showing how many bounces they'd made), it'd be a little clearer.

    Interesting nonetheless.

    • ayaros 20 minutes ago

      Looks like they are just bouncing back and forth at speeds relative to each other; the speed of each ball is all that matters. I also found this diagram confusing at first.

  • pbw 35 minutes ago

    I wrote this to probe the question of why “slow” languages are popular and valid when used in conjunction with fast ones: https://tobeva.com/articles/beyond/

  • wanderingmind 30 minutes ago

    How in the world is PyPy faster than CPython by almost 2 orders of magnitude? Is PyPy normally faster in other benchmarks? If so, why? Can some Python expert pitch in?

  • xutopia 2 hours ago

    It blows my mind to see that speed difference between languages. Even in the slowest language there, we have lots of ways to optimize this contrived code, but that's not something we can always do in the real world. Seeing the visualization here really underscores how fast some languages are.

    • swatcoder an hour ago

      > Seeing the visualization here really underscores how fast some languages are.

      Yes and no.

      It emphasizes that there are some scenarios where language choice can have an outsized effect on performance when running on modern architectures, which is a reminder some people could use or have never really witnessed.

      But it's a very particular scenario that relatively few projects will encounter at a scale that meaningfully impacts them.

      In particular, C and Rust are much better positioned to benefit from instruction pipelining and to avoid cache busting because there are no conditionals and little to no out-of-loop code to execute. Even with a massive loop like this, that can be a real-world scenario in things like data/signal processing and numerical computation. When you're doing those things, language choice can make a huge difference that may not be apparent without understanding how they work or at least seeing visualizations like this.

      But even in C and Rust, most applications don't have to do this kind of thing. Most are dominated by smaller loops, with branching that breaks pipelining, complexity that busts the CPU cache, and calls and jumps that might interfere with both, etc. In those much more common cases, the practical performance difference is going to be quite a lot less significant, and the fluency/ergonomics/safety/convenience of working in some other language can easily matter more.

      • PaulHoule an hour ago

        I like the extreme boost that PyPy gets here, which is much more than you usually get with PyPy. On one hand it shows the potential of PyPy, but it is testing just one area where PyPy is strong; other workloads depend on performance that is not as well optimized.

        My take, though, is that branchy, old-AI, heavyweight code (say, keeping every fact in a triple store rather than as a Python object, with the benefit of having an index for every permutation of ?s-?p-?o) that cannot be handled with numpy or similar tools benefits greatly from PyPy, but not 100x.

  • JohnMakin an hour ago

    Looking at the source code, is the random value generated before the main loop iterations to stop the compiler from shortcutting all the iterations? Does that work on some of the better compilers out there? The random value is only generated once. I would like to see a similar benchmark on creating/destroying objects in memory.

    • neonsunset an hour ago

      If you are interested in looking into the details of microbenchmarks of this type, there is always https://benchmarksgame-team.pages.debian.net/benchmarksgame/... which has a clearly defined methodology and helps you understand wall-clock time vs. CPU cost too.

      The benchmark code linked in the submission unfortunately suffers from the mistakes that happen whenever you write a benchmark for the first time. The visuals are pretty but might as well have been based on something more comprehensive.

      (also, please stop measuring performance with loops that perform no work and fibonacci sequences, this was bad 30 years ago, this is bad today too)

      • JohnMakin an hour ago

        Cool, thanks! In another lifetime I had an internship at an embedded software company where my job was to write tests and optimize compute or throughput for various chips and chip firmware, and reading the functions here set off some alarm bells in my head. Still a fun idea though, and cool graphics.

    • swatcoder an hour ago

      Yeah, the random value prevents compile-time pre-calculation, and the print makes sure the loops actually get run and the values actually get calculated (as modern compilers are wont to simply omit code that clearly has no consequence).
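
      As a rough sketch of that pattern (in Python for readability; the loop bounds and names are guesses, not the author's actual code): the divisor comes from outside the program, the index is random, and the final print makes the result observable, so a compiler can neither precompute the answer nor discard the loops as dead code.

          # Hypothetical shape of the benchmark, not the actual source:
          import random
          import sys

          u = int(sys.argv[1])          # outside input: unknown at compile time
          r = random.randint(0, 9999)   # random index: result can't be precomputed
          a = [0] * 10000

          for i in range(10000):
              for j in range(100000):
                  a[i] += j % u         # the work being timed
              a[i] += r

          print(a[r])                   # printing keeps the loops observable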

  • o11c an hour ago

    Note that these benchmarks ignore object-related stuff, which is usually the expensive part.

  • fifilura an hour ago

    Doing Python or R right means using vector/matrix operations in e.g. numpy. Those will be faster.

    Just think twice before using a for loop. Most of the time it is not needed.
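
    As a minimal sketch of that point (assuming the same sum-of-modulo work as the benchmark; not the author's code): the explicit loop pays interpreter overhead on every iteration, while numpy does the same arithmetic in one vectorized pass.

        import numpy as np

        u, n = 7, 100_000

        # explicit Python loop: one interpreted iteration per element
        total = 0
        for j in range(n):
            total += j % u

        # vectorized: the same arithmetic done in bulk inside numpy
        total_np = int((np.arange(n) % u).sum())

        assert total == total_np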

    • MooseBurger an hour ago

      I disagree w.r.t. Python. You shouldn't generally need a numerical computing library to do a for loop.

      • fifilura 17 minutes ago

        Kick the habit. For loops are what "goto" was 20 years ago: error-prone and impossible to abstract over, for example for parallel execution.
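
        As a sketch of that abstraction point (illustrative only, not a benchmark; the work function here is made up): the same work expressed as a map can swap in a parallel executor without rewriting the loop body.

            # Hypothetical illustration: expressing the work as a mapping makes it
            # trivial to switch between serial and parallel execution.
            from multiprocessing import Pool

            def work(j, u=7):
                return j % u

            if __name__ == "__main__":
                # serial: map the work function over the inputs
                serial = sum(map(work, range(100_000)))

                # parallel: the same expression, just a different executor
                with Pool() as pool:
                    parallel = sum(pool.map(work, range(100_000)))

                assert serial == parallel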

  • fbn79 2 hours ago

    It would be nice to know the max RAM usage of each implementation.

  • davikr 2 hours ago

    This does not seem like a realistic benchmark target.

    • mrkeen 2 hours ago

      Perhaps it's unrealistic in the sense that this is unrealistically optimisable code.

      Real-world code has closures, allocations, and pointer dereferencing. This code is just iterations, additions, and modulus.

      • behnamoh 2 hours ago

        > This code is just iterations...

        Which is what the average programmer out there writes. Therefore, I think this is actually a good realistic benchmark.

        • steveBK123 an hour ago

          Precisely - it's bad code, which is what most code is!

        • elzbardico an hour ago

          What the average programmer out there writes is just the tip of the iceberg of what gets executed. And an incredibly small tip.

    • moralestapia an hour ago

      Nor does it pretend to be.

      It's literally just "1 Billion nested loop iterations", methods and results.

      You make of that whatever you want.

    • deanCommie an hour ago

      The other problem with it is that there are other variables involved besides pure speed - one of which is CONSISTENCY of speed.

      Folks are always surprised to see just how fast Java is able to perform on benchmarks - it has a reputation as such a slow language, so how come it's able to execute almost as fast as C++?

      But Java's problem isn't execution performance. It's:

      * Startup performance. It takes time to instantiate the JVM. That matters for some, not for others.

      * Consistent performance. This is the nature of Garbage Collection. It can surprise you when it runs and drop a performance blip when you least expect it.

      Most developers who think they need the fastest performance actually need CONSISTENT performance more than the fastest.

      • PaulHoule an hour ago

        I even see some video games (say, Dome Keeper) that periodically go out to lunch even on a top-of-the-line gaming PC, and I understand that kind of game is often written in C#, which has a garbage collector. The thing is, I remember playing games on the much weaker PS Vita that were written in C#, but I never remember pauses like that.

        VR games are now the frontier of gaming (in terms of the industry outdoing itself), and there you're approaching hard real-time requirements because it is so awful to get seasick. I appreciate it.

        • bob1029 an hour ago

          I think it might have more to do with the specific implementation than the tool.

          Most incarnations of C# offer a GC.Collect method and the ability to configure the threading model around collections. Used properly, these let you keep maximum delays well bounded. You still have to work your ass off to minimize allocations, but when some do inevitably occur due to framework internals, etc., you don't want to have to clean up 3 hours' worth of them at once. Do it every frame or scene change.