The Cost of a Closure in C: The Rest

(thephd.dev)

66 points | by ingve 5 days ago

62 comments

  • uecker 2 days ago

    Since it mentions my paper again: "It also brings into question whether a “lean” approach that grabs the “environment pointer” or the “stack frame” pointer directly, as in n3654 is a good idea to start with." This is annoying, since the approach promoted in n3654 is not even tested here. At least he continues with "But, it’s premature to condemn n3654 because it’s unknown if the problem is the fact that the use of accessing variables through what is effectively __builtin_stack_address and a trampoline is why performance sucks so bad, or if it’s the way the trampoline screws with the stack". So why insinuate it when you have no idea? It is well known that the trampoline approach is costly, and it seems particularly bad on ARM. But n3654 does not promote trampolines; it instead promotes wide function pointers, which are similar to std::function_ref but without wasting the static chain register and without the need to create a thunk when converting a regular function pointer.

    I am also not entirely sure whether "man or boy" is a good benchmark, but for estimating function call overhead it should be OK. For the variable-access part, other considerations are far more important IMHO. So I am not sure I would put too much weight on the "accessing k" part of the post.

    n3654 contains a criticism of the lambda approach from a language-design perspective. The issue is that it does not seem to be a good fit for C. Capturing by value is always cheap in toy examples, but not necessarily for interesting data structures. In C++ you could then capture a pointer by value, but then you haven't avoided an indirection either. In C++ this is generally fine, as smart pointers deal with the memory management; but in C, having a captured value in a lambda that you can no longer access explicitly does not make much sense.

    The "unfortunately, is that unlike C++ there are no templates in C" is also interesting. I fled from C++ back to C exactly because of templates. In a performance context, the fallacy is that you can always create super-optimized code using compile-time techniques that absolutely shine in microbenchmarks (such as this one) but cause a lot of bloat at larger scale (and long compilation times). If you want this, I think you should stick with C++.

    • dfawcus 2 days ago

      Well, his "Normal Functions" (benchmarks/closures/source/normal_functions.cpp in his repo) looks quite similar to what I had with my GNU nested functions using a stand-in "wide pointer", and hence no generated trampoline.

      (https://news.ycombinator.com/item?id=46243298)

      Which rather suggests to me that such a scheme, but generated by the compiler, should have a similar performance to said "Normal Functions" and hence similar to his preferred lambda form.

      Since his benchmark environment is so unwieldy, I may have a go at extracting those two code sets to a standalone environment, and measure them to see...

      • uecker 2 days ago

        So here are my preliminary benchmarks with my own implementation on an AMD EPYC 9334 32-core processor. I still need to double-check things, so take this with a grain of salt for now. Time is in seconds for 100000 iterations of manorboy(10). So far, the only implementation which clearly sucks is std::function<>. Even trampolines are surprisingly good (but I can imagine that they are much worse on other CPUs / architectures).

          xgcc (GCC) 16.0.0 20260103 (experimental)
          1.50 gcc -ftrampoline-impl=stack -Wl,-no-warn-execstack
          1.11 gcc -ftrampoline-impl=stack -Wl,-no-warn-execstack -DREFARG
          7.21 gcc -ftrampoline-impl=heap
          7.34 gcc -ftrampoline-impl=heap -DREFARG
          0.93 gcc -DWIDEPTR
          1.38 gcc -DWIDEPTR -DREFARG
          1.40 gcc -DDIRECT
          1.05 gcc -xc++ -std=c++26 -DFUNCREF -DDEDUCING
          19.68 gcc -xc++ -std=c++26 -DDEDUCING
          20.73 gcc -xc++ -std=c++26
          6.31 gcc -xc++ -std=c++26 -DDEDUCING -DREFARG
          6.31 gcc -xc++ -std=c++26 -DREFARG
          Debian clang version 16.0.6 (15~deb12u1)
          21.11 clang -xc++
          6.16 clang -xc++ -DREFARG
          1.66 clang -fblocks
          1.70 clang -fblocks -DREFARG
    • tialaramex 2 days ago

      > I fled from C++ back to C exactly because of templates

      Obviously I sympathize as someone who spent the front half of their career writing C and has no interest in writing C++ beyond toy examples. Nevertheless, IMO you're cutting off a choice here, forcing yourself to make potentially sub-optimal choices.

      Sometimes the templates would have been better and writing X-macros is just worse, ergonomically at least. Perhaps C++ people use templates more often than they should, but I'm sure "never" is not the correct amount.

      Still, if C gets fat pointers that's at least a step in the right direction. So I must wish you the best with that part.

      • uecker 2 days ago

        I used to write expression template libraries, so I know what C++ can do. I just figured out that I am a lot more productive in C than in C++, and that in the time I waste on C++ I can do a lot more in C. I also rarely use macros in C, so what C++ people get wrong is assuming that a C programmer would do the same things just using macros. This is not the case. One has to realize that compile-time programming is 95% of the time a waste of time and effort, and that the same thing can be achieved in an easier and better way.

        I also recently removed some code written with C++ templates from one of my projects, because this single file caused compilation times for a ca. 450-file C project to increase by 50%, from 20s to ca. 30s.

      • 1718627440 2 days ago

        > Still, if C gets fat pointers that's at least a step in the right direction.

        AFAIK the C standard doesn't prevent an implementation from using fat pointers; this is one of the reasons why the conversion between pointers and integers only works in one direction, and it is actually necessary for segmented memory. The compiler is allowed to optimize based on assumptions about which allocation a pointer comes from (including the allocation's boundaries, i.e. its size), even if bytewise the pointers would be equal. So you could argue that the C abstract machine already has fat pointers.

        • uecker 2 days ago

          Indeed. A C implementation could just use fat pointers, but it is unrealistic that implementations would now break their ABI. Also, the cost of fat pointers is higher when they are not needed. So the plan is to provide a new wide pointer type instead.

    • oflebbe 2 days ago

      Actually the author is building on my suggestion https://github.com/soasis/idk/pull/17/files which I created several years ago, but adapted a bit too hastily when submitting. And yes, it looks very similar to GNU nested functions, since I started by tinkering with those.

      I am really not sure if all these observations mentioned in the article are 100% correct, though.

      First: The code seems to be compiled with clang. On Linux with gcc, the native-function variant is way faster than the clang one.

      Second: The author runs the code on ARM64/macOS.

      At least on my Ryzen CPU on Linux with gcc, the "normal C code" is way faster than anything else. Not that we do not need to think about "closure"-type functionality, but one should be careful about extrapolating from one compiler on one platform to the rest of the pack.

      Regarding N3654 I am not sure how to benchmark it here, since C could potentially use __builtin_call_with_static_chain , but I am not sure how to write the function to use the chain for accessing the variables.

      I tried to estimate N3654 by using "tinygo", which AFAIK uses the usual calling ABI, but it was a factor of two slower than clang. Even "go" with its very specific ABI is still much slower. But I discovered this isn't representative, since the runtime calling costs were totally shadowed by the cost of allocations.

      Even the rust example I am usually using http://www.reddit.com/r/rust/comments/2t80mw/the_man_or_boy_... is much slower than anything else, presumably because of the "Cell" needed

      TL;DR: This microbenchmark might be misleading.

      • uecker 2 days ago

        See my other comment in this thread with my preliminary benchmarking results. I have a patched GCC with two new builtins __builtin_static_chain and __builtin_nested_code (needs a better name), that give you the static chain pointer and the code pointer without creating a trampoline. Then I put both into a structure to simulate a wide pointer. Later I call it with __builtin_call_with_static_chain.

        A trick one can do is to let it create the trampoline and then read off the two pointers from the position in the code where they are stored. Not portable, and you still have the overhead of creating the trampoline, but you no longer need the executable stack.

  • userbinator 2 days ago

    Those graph axes units are... perplexing, to say the least.

    > So, we need to swap to the logarithmic graphs to get a better picture

    I wish more people would know about decibels.

    • throw_await 2 days ago

      Logarithmic axes are infinitely better than decibels, because dB is just too semantically overloaded.

    • Someone 2 days ago

      >> So, we need to swap to the logarithmic graphs to get a better picture

      > I wish more people would know about decibels.

      Huh? Is there any difference? https://en.wikipedia.org/wiki/Decibel:

      “The decibel […] expresses the ratio of two values of a power or root-power quantity on a logarithmic scale”

  • greatgib a day ago

    This article is so badly written, it makes it hard to understand something that should be easy to understand...

  • Panzerschrek 2 days ago

    Why struggle with qsort? std::sort from C++ is much better in terms of usability and performance.

    • flohofwoe 2 days ago

      FWIW, when the C compiler has access to the qsort implementation body and the comparator function, like the C++ compiler has, it will also stamp out a specialized version that will in most cases be identical to the C++ compiler's output for std::sort (assuming it implements the same sorting algorithm as C's qsort).

      That whole 'zero cost abstraction' idea is not unique to C++ since all the important work to make the 'zero cost thing' happen is performed down in the optimizer passes, and those are identical between C and C++.

      • kccqzy 2 days ago

        Yeah but C++ encourages you to put things into templates which are inside headers, so of course the implementation body is exposed to the compiler all the time. Whereas a C compiler usually won’t have access to the qsort implementation body.

      • Panzerschrek a day ago

        I see no inlining of qsort in at least two major compilers: https://godbolt.org/z/59xehaqnv.

        • flohofwoe a day ago

          Yes, that's why C++ std::sort is usually recommended over C qsort. But either compiling with -flto or providing your own fully inlineable qsort should do the trick.

    • pjmlp 2 days ago

      Because some people will never move beyond C, no matter what.

      • tliltocatl 2 days ago

        People will never move beyond C (among other reasons) because C allows precise control over memory allocation. Closures and precise control over memory allocation don't play very well together.

        • pjmlp 2 days ago

          Urban myths by people frozen in time; systems programming languages predate C by a decade, and it isn't as if the world stopped outside Bell Labs in the 1970s.

          • tliltocatl 2 days ago

            Which systems language (as in zero-runtime) would you choose if not C (even if we disregard the fact that no SoC vendor supports anything but C)?

            - Ada? Many cool features, but that's also the problem - it's damn complicated.

            - C++ combines simplicity of Ada with safety of C.

            - Rust and Zig are only ~10 years old and haven't really stabilized yet. They also start to suffer from npm-like syndrome, which is much more problematic for a systems language.

            - ATS? F#? Not all low-level stuff needs (or can afford) this level of assurance.

            - Idris? Much nicer than ATS, but it's even younger than Rust and still a running target (and I'm not sure if zero-runtime support is there yet).

            I mean, yes, C is missing tons of potentially useful features (syntactic macros, for one thing), but closures are not one of them.

            • pjmlp 2 days ago

              You can start by learning that even C has a tiny runtime, followed by history of systems programming languages, starting with JOVIAL in 1958.

              Even if not perfect, "Typescript for C" came out in 1983.

              • tliltocatl 2 days ago

                > even C has a tiny runtime

                Which functions are required for a C program to run, aside from those it calls explicitly? memcpy? That's basically a missing CPU instruction. malloc will never get called unless you call it yourself.

                > followed by history of systems programming languages, starting with JOVIAL in 1958.

                All systems languages designed as such before C (e.g. BLISS, PL/S, NEWT) were vendor-specific. Some of them had nice things missing from C, but none propagated beyond their original platform. And today we have no option but C.

                > "Typescript for C" came out in 1983.

                C++ is not just "not perfect", it is far worse in every way. Let's let people overload everything. And let's overload shift operator to do IO. And make every error message no less than 1000 lines, with actual error somewhere in the middle. Let's break union and struct type punning because screw you that's why. You say C macros are unreadable? Behold, template metaprogramming! C is not perfect, but it has the justification of being minimal. C++ is a huge garbage dumpster on fire.

                • pjmlp 2 days ago

                  Everything that runs before main() and at exit, floating-point emulation and IEEE support when needed, signal handlers, threading support since C11, at the very least; beyond that it depends on what extensions the C compiler ships with.

                  C was also vendor-specific to Bell Labs.

                  C++ is definitely Typescript for C.

            • tialaramex 2 days ago

              It is insane to assert that Rust hasn't "really stabilized yet" in comparison to C, a language which has been replaced twice (as C17 and C23) after Rust 1.0

              • tliltocatl 2 days ago

                C17 didn't introduce any changes to the language. C23 mostly standardized existing compiler-specific goodies. None of this broke existing code. When I run `sudo dnf upgrade gcc` I can be 99.9% sure my old code still compiles.

                Compare e. g. https://github.com/rust-lang/rust/pull/102750/ (I'm not following Rust development closely, picked up the first one). Yes, developers do rely on non-guaranteed behavior, that's their nature. C would likely standardize old behavior. Basically all of the C standard is a promoted vendor-specific extension - for better or worse.

                Here is another good one: https://internals.rust-lang.org/t/type-inference-breakage-in...

                • tialaramex 2 days ago

                  C17 ostensibly "didn't change" the language, but it still needed to fix a bunch of stuff, and it chose to do that by issuing an entirely new language standards document.

                  C23 on the other hand landed a bunch of hard changes, and "it didn't break my code" only matches the reality most people observe with Rust too. The changes you mention didn't break my Rust either.

                  But some trivial C did break because C23 is a new language

                  https://c.godbolt.org/z/n9vhMGYW5

                  In fact those Rust changes aren't language changes, unlike C23's. The first one you linked is a choice by the compiler to improve a layout it never guaranteed: anybody relying on that layout (and there were a few) was never promised it wouldn't change in the next version, next week, or indeed the next time they compiled the same code. You can even ask Rust's compiler to light a fire under you here by arbitrarily varying layout between builds, so that code depending on unpromised layout breaks earlier, reminding you to actually tell Rust repr(C) ("I need the same layout guarantees for my type T as I'd get in C") or repr(transparent) ("I need my wrapper type T to have the same layout as the type it wraps").

                  The second isn't a language change at all; it's a change to a library feature. So now we're down to "C isn't stable until the C library is forever unchanging", which hopefully you recognise as entirely silly. Needless to say, nobody does that.

              • 1718627440 2 days ago

                Except that those are more like bug-fix releases; it's unusual to see anything newer than C11, and most code is still on C99.

                • dfawcus 2 days ago

                  Actually lots is still on C89.

                  I'm trying to drag one program at $employer up to C99 (plus C11 _Generic), so I can then subsequently drag it to the bits of C23 which GCC 13 supports.

                  This all takes time, and means having to convince colleagues during code reviews.

                  What C23 has done is legitimise some of the extensions which GCC has had for some time (typeof, etc.).

                  However the ability to adopt is also limited by what third party linters in use at $employer may also support.

                • pjmlp 2 days ago

                  Show us that you haven't read ISO C documents without telling us.

                  • 1718627440 2 days ago

                    True, at most I read excerpts when I have a question. Can you tell me what gave it away? I thought the saying was that C17 is the sane version of C11, and that C23 has quite some changes but is way too new to be counted on.

                    • uecker 2 days ago

                      C17 is indeed a bug fix release. C23 finally removed some features that were deprecated a long time ago already in the first standardized version (auto as storage classifier, K&R function definitions, empty parenthesis as "arbitrary arguments") and also support for ones' complement. So yes, C is extremely backwards compatible.

                      • 1718627440 2 days ago

                        Auto as storage classifier was deprecated? TIL.

                        • uecker 2 days ago

                          Ah no, sorry, my mistake. And it is still a storage classifier, but now (in C23) it is mentioned that it will become a type specifier.

        • spacechild1 2 days ago

          > Closures and precise control over memory allocation doesn't play very well together.

          How so? In C++ a lambda is just a regular type that does not allocate any memory by itself. You have in fact precise control over how/where a lambda is allocated.

          • flohofwoe 2 days ago

            True, but once a capture needs to survive the parent function's scope you'll need to store it somewhere: either in a std::function object, which has opaque rules about whether a heap allocation happens, or in your own std::function implementation where you define the heap-allocation rules, but then you'll have to face discussions about why you're rewriting the stdlib ;)

            Any C implementation of capturing lambdas has the same problem of course, that's why the whole idea doesn't really fit into the C language IMHO.

      • flohofwoe a day ago

        Anecdotal and probably my bubble, but quite a few people moved from C to C++ a long time ago, then got fed up with where modern C++ was heading (mostly around C++14 and C++17) and went back to C.

        • pjmlp a day ago

          Except it doesn't offer the safety alternatives that C++ has, thus it is a big bunch of unsafe code.

          Using C++ doesn't mean having to use the whole standard.

          Now if C type safety actually was like Modula-2, Object Pascal or Zig, that would not be as bad.

      • 1718627440 2 days ago

        Because C is a much nicer language than C++ by some measures.

        • pjmlp 2 days ago

          Nah, C++ gives you the safety tools, even if with some flaws, that WG14 does not have any interest to ever add to C.

        • dfawcus 2 days ago

          Nah - more that a lot of commercial code is written in it; and it doesn't make sense to replace (or rewrite) it at this time.

          For example, I'm maintaining some 20 year old C code, which the employer adopted around 10 years ago. It will likely stay in use at least until the current product is replaced, whenever that may be.

          • 1718627440 2 days ago

            Let me rephrase that: I feel addressed by "some people will never move beyond C, no matter what", and I prefer C over C++ because it is a much simpler language. Each time I leave the cozy world of C and write C++, I am annoyed and miss things.

    • tialaramex 2 days ago

      If you care about performance and are willing to use a different programming language, surely it would make more sense to use Rust's unstable sort?

  • djaouen 2 days ago

    Lisp solves the performance hit of closures with macros. However, given what macros look like in C, I hope it never amounts to that!

    • skavi 2 days ago

      There’s no necessary performance hit for closures. The performance cost here is caused by these closures needing to conform to a function pointer looking interface in order to be generally useful in C.

      • djaouen 2 days ago

        IIRC, in Lisp, closures are heap-allocated unless they are created at compile time with macros. Therefore there is the added overhead of malloc and free calls for each closure created. Am I wrong here?

        • kazinator 12 hours ago

          Definitely. Lisp is a family of languages that has now existed for almost 70 years. There simply isn't one way that closures work in "Lisp". Some implementations of languages in the lisp family have compilers that have sophisticated ways of handling closures. They carefully classify what is or is not captured by closure (and what is shared with other closures and what is mutably shared), and try to guess whether a closure can escape from the environment where it's created. Different code generation strategies, and strategies for representing the environment, apply to different situations.

        • pfdietz 2 days ago

          Lisp implementations typically don't use malloc and (especially) free. There is GC overhead.

        • BoingBoomTschak 2 days ago

          Yes, in the sense that whether normal closures are stack- or heap-allocated is an implementation detail, at least in CL. SBCL can stack-allocate some closures if DYNAMIC-EXTENT is used: https://www.sbcl.org/manual/#Stack-allocation-1

  • andsoitis 2 days ago

    > Cost of a Closure in C

    C does not have closures. You can simulate them, but it is neither robust nor automatic compared to languages that truly support them.

    • skavi 2 days ago

      I think maybe this post would make more sense if you were familiar with its antecedent [0], which compares existing closure extensions to C. IIRC the comparison was in the context of considering designs for standardization.

      [0]: https://thephd.dev/the-cost-of-a-closure-in-c-c2y

    • aragilar 2 days ago

      C does not currently have closures; the post is looking at their performance properties with an eye toward what form closures should take if added to the standard.

    • flohofwoe 2 days ago

      The post is about a potential new feature of C. The author is working on the C committee and has contributed new C features in the past.

    • pjmlp 2 days ago

      Completely missed the point of the article, this is about ongoing C2y efforts at WG14.