I write type-safe generic data structures in C

(danielchasehooper.com)

186 points | by todsacerdoti 6 hours ago ago

75 comments

  • o11c 5 hours ago

    For your level 2 code, `uint64_t data[];` is wrong for types whose alignment is greater than `uint64_t`, and also wasteful for types whose alignment is smaller (for example, under an ilp32 ABI on 64-bit architectures).

    For your level 3 code, it should be `int main() { List(Foo) foo_list = {NULL};`

    Note that working around a lack of `typeof` means you can't return anything. Also, your particular workaround allows `const`ness errors since `==` is symmetrical.

    You can't safely omit `payload` since you need it to know the correct size. Consider a `List(int64_t)` and you try to add an `int32_t` to it - this should be fine, but you can't `sizeof` the `int32_t`. Your code is actually lacking quite a bit to make this work.

    =====

    There are 2 major limitations to generics in C right now:

    * Delegating to a vtable (internal or external) is limited in functionality, since structs cannot contain macros, only functions.

    * Delegating to an external vtable (mandatory to avoid overhead) means that you have to forward-declare all of the types you'll ever use a vtable with. So far the best approach I've found is to declare (but not define) static functions in the same forwarding header I declare the typedefs in; note that GCC and Clang differ in what phase the "undefined static" warning appears in for the case where you don't actually include that particular type's header in a given TU.

    (think about writing a function that accepts either `struct SizedBuffer {void *p; size_t len;};` or `struct BoundedBuffer {void *begin; void *end;};`, and also const versions thereof - all from different headers).

    • rectang 3 hours ago

      > Delegating to an external vtable (mandatory to avoid overhead) means that you have to forward-declare all of the types you'll ever use a vtable with.

      We went down the rabbit hole of writing a compiler for this as part of a project I used to work on (Apache Clownfish[1], a subproject of the retired Apache Lucy project). We started off parsing .h files, but eventually it made sense to create our own small header language (.cfh "Clownfish Header" files).

      Here's some generated code for invoking the CharBuf version of the "Clone" method defined in parent class "Obj":

          typedef cfish_CharBuf*
          (*CFISH_CharBuf_Clone_t)(cfish_CharBuf* self);
      
          extern uint32_t CFISH_CharBuf_Clone_OFFSET;
      
          static inline cfish_CharBuf*
          CFISH_CharBuf_Clone(cfish_CharBuf* self) {
              const CFISH_CharBuf_Clone_t method
                  = (CFISH_CharBuf_Clone_t)cfish_obj_method(
                      self,
                      CFISH_CharBuf_Clone_OFFSET
                  );
              return method(self);
          }
      
      Usage:

          cfish_CharBuf *charbuf = cfish_CharBuf_new();
          cfish_CharBuf *clone = CFISH_CharBuf_Clone(charbuf);
      
      We had our reasons for going to these extremes: the point of Clownfish was to provide a least-common-denominator object model for bindings to multiple dynamic languages (similar problem domain to SWIG), and the .cfh files also were used to derive types for the binding languages. But there was truly an absurd amount of boilerplate being generated to get around the issue you identify.

      This is why almost everybody just uses casts to void* for the invocant, skipping type safety.

      [1] https://github.com/apache/lucy-clownfish

      • zem an hour ago

        i am firmly of the opinion that compiling to c is a better route than doing clever c tricks to sort of get what you want. the compiler can be pretty minimal and as you note it pays for itself.

    • kccqzy 3 hours ago

      > it should be `int main() { List(Foo) foo_list = {NULL};`

      In C `int main()` means the function takes an unknown number of arguments. You need `int main(void)` to mean it doesn't take any arguments. This is a fact frequently forgotten by those who write C++.

      • flohofwoe 34 minutes ago

        That had been harmonized with C++ in C23 (e.g. func() is equivalent with func(void) now).

        It's not really relevant for main() though, even in older C versions main() works fine and simply means "I don't need argc and argv".

        • el_pollo_diablo 28 minutes ago

          This is about a function definition, not a random function declarator. C23 does not change anything in that case.

      • tedunangst 2 hours ago

        This is incorrect. In a function definition, an empty list means it takes no parameters. 6.7.5.3 Function declarators

        > 14. An empty list in a function declarator that is part of a definition of that function specifies that the function has no parameters.

        • s3graham 2 hours ago

          As you surely know if you're quoting the standard, it depends on which standard!

          • gpderetta an hour ago

            I believe that since C23 foo() is now a nullary function. As this is the last approved standard and it supersedes all previous standards, it is technically correct to say that de-jure this is what the (unqualified) C standard mandates.

            Of course de-facto things are more nunanced.

          • tedunangst an hour ago

            Quote a different standard.

    • 25 minutes ago
      [deleted]
    • n_plus_1_acc 3 hours ago

      This is also problematic, because there might be padding and the calculated size might be too small:

      `malloc(sizeof(*node) + data_size);`

      • o11c 2 hours ago

        There's no a problem with the author's current code, since the padding is already included in the node size, but it would be a problem after doing alignment more intelligently.

    • EPWN3D 4 hours ago

      I would love for `union`s to be federated, that is, a type could declare itself as thought it was part of a union with another type, without having to pre-declare all possible types in one place.

      • o11c 4 hours ago

        For layout-compatible types, you can often just include a `_base` member in each child. Maybe twice (once named and once unnamed) to avoid excess typing - I don't understand the common-initial-subsequence rule but people do this enough that compilers have to allow it.

  • gritzko 5 hours ago

    Hi. I object.

    The trick#0 you mention is how I made an entire C dialect. Here is a generic binary heap, for example https://github.com/gritzko/librdx/blob/master/abc/HEAPx.h The syntax is a bit heavyweight, but a huge huge advantage is: you get regular C structs in the end, very plain, very predictable, very optimizable. Compiler would eat them like donuts.

    In the other cases, it is void* and runtime memory sizing and you have to define macros anyway.

    • dhooper 4 hours ago

      Author here. Binary heaps and linked lists are different use cases. A binary heap must read the data you put in it to store it correctly, but a linked list doesn't. If I were writing a generic binary heap, maybe I would weigh my options differently. I mentioned this in the footnotes.

      • wordglyph an hour ago

        And that's why I like C++ templates

    • variadix 4 hours ago

      I agree, there are actually several reasons to prefer the header impl. Debugging is better, both because you can step through the header code where you can’t with a macro function, and because the type information available to the debugger is better. There are more opportunities for compiler optimizations because each instantiation is monomorphized and you don’t pay a runtime cost with variable sizing, generic structures can also be placed on the stack because of the fixed sizing.

      There are workarounds for at least two of the problems the author mentions. Naming can be changed from Bar_func(args…) to func(Bar)(args…) with a function name macro that just does name mangling. You can avoid some of the binary bloat by using weak symbols, letting the linker deduplicate functions shared between translation units at link time.

      There are other problems for generic containers of pointer types however, you can work around them by using a typedef or a type alias.

      Intrusive data structures are more convenient in C still, but working with them in a debugger is a pain.

      • dhooper 4 hours ago

        Author here. It's worth noting that no work is being done in the macros of my article, they compile down to a normal c function call which you can step through in a debugger.

        There is little benefit in monomorphizing the implementation of a data structure like a linked list where its behavior doesn't depend on the contents of the data it contains (compared to, say, a max heap)

    • knutwannheden 4 hours ago

      > Compiler would eat them like donuts.

      Made me laugh out loud!

  • layer8 4 hours ago

    The casting of the function type assumes that the item pointer type (e.g. Foo*) has the same representation as void*, which the C standard doesn’t guarantee (in standardese: the two types aren’t “compatible”). Calling the function with the converted type therefore constitutes undefined behavior. It also impacts aliasing analysis by compilers (see [0], incidentally), even if the pointer representation happens to be the same.

    This casting of the functions to different argument types constitutes the core of the type safety of the generic invocations; I’m not sure it can be fixed.

    [0] https://news.ycombinator.com/item?id=44421185

    • dhooper 4 hours ago

      This is addressed in the footnotes. casting is not the core of the type safety. Read the whole article.

      • layer8 3 hours ago

        Ah, that’s what I get for not reading the footnotes. However, the alternative solution presented evaluates the item argument twice, which is problematic as well (but could probably be worked around by passing `(list)->payload` on instead). Secondly, the assignment for type-checking doesn’t work for read-only operations on a const List, or does it? And doesn’t the assignment overwrite the head? Lastly, the do-while construction means you can’t use it for operations that return a value (without compiler extensions).

        I also don’t agree it’s “squeamish” to be wary of aliasing analysis going wrong. It’s not a clean abstraction and can hide subtle bugs down the road.

  • david2ndaccount 24 minutes ago

    The “typeof on old compilers” section contains the code:

             (list)->payload = (item); /* just for type checking */\
    
    That is not a no-op. That is overwriting the list head with your (item). Did you mean to wrap it in an `if(0)`?
  • cherryteastain 4 hours ago

    Why would you jump through all these hoops instead of just writing C++ if you want "C with generics"

    • Kranar 4 hours ago

      Because for many of the use cases where C is used, switching to C++ involves jumping through even more hoops.

      • lionkor 3 hours ago

        Do you have a couple of real world examples?

        • mikepurvis 3 hours ago

          Any established C codebase, for example the kernel or Postgres?

          Traditionally microcontroller firmwares as well, though those are increasingly friendly to C++, you just have to be careful about allocations as C++ makes it way easier to accidentally allocate than C does.

          • neonz80 an hour ago

            I'm not sure about other compilers, but compiling C code as C++ with MSVC ends up with pretty much the exact same code, instruction by instruction. C++ is a bit more strict though especially with casting, so a lot of code won't compile out of the box.

            • vbezhenar 20 minutes ago

              C++ code compiles to a different function names in object file (name mangling). You probably need to put a lot of ugly `#ifdef __cplusplus extern "C" {` boilerplate in your headers, otherwise C and C++ files will not compile together.

        • adastra22 3 hours ago

          Embedded systems, for example.

          • teraflop 3 hours ago

            I know it used to be, but is it really still common for embedded systems to use weird architectures that G++/Clang don't support?

            • adastra22 2 hours ago

              Unless it is a popular system or common architecture, yes.

        • rectang 3 hours ago

          Writing extensions for projects that support C extensions but may not support C++ extensions, e.g. many dynamic languages.

          • Snarwin 3 hours ago

            You can still write the extension in C++ and expose an extern "C" interface.

            • rectang 2 hours ago

              That's possible, but then the people building your extension need a C++ toolchain.

              The question was "please provide examples where switching to C++ involves jumping through even more hoops", and in my view requiring downstream to use a C++ environment when they're expecting to use a C environment qualifies.

              • uecker 2 hours ago

                True. For me, C++ itself is the maze of hoops I would rather want to avoid.

    • 4 hours ago
      [deleted]
  • mfuzzey 4 hours ago

    There's also the method used in the Linux kernel to embed the list information (struct list_head) within the type specific struct. https://kernelnewbies.org/FAQ/LinkedLists

    • nixpulvis an hour ago

      The naming of LIST_HEAD_INIT and INIT_LIST_HEAD is confusing to me.

      • mfuzzey an hour ago

        The way I remember it is:

        INIT_LIST_HEAD is of form VERB_NOUN so is called from within a function to programatically initialise the list.

        LIST_HEAD_INIT is NOUN_VERB and is used within a structure initialiser not from a function.

        But my main point was to show the "embed the list in the data" approach rather than "embed the data in the list" or "point to the data from the list" and not to discuss the naming details in the kernel implementation of the concept.

  • WalterBright 2 hours ago

    Here's how to do it in D:

        struct ListNode(T) {
            ListNode* next;
            T data;
        }
    
        T!int node;
    
    Why suffer the C preprocessor? Using preprocessor macros is like using a hammer for finish carpentry, rather than a nail gun. A nail gun is 10x faster, drives the nail perfectly every time, and no half moon dents in your work.
    • dhooper 2 hours ago

      Thanks, this post is about C.

      On some projects you must use C.

      • WalterBright 2 hours ago

        If I may may be provocative :-) this post isn't about C. It's about layering on a custom language using C preprocessor macros.

        My compilers were originally written in C. I started using the C preprocessor to do metaprogramming. After some years I got fed up with it and removed nearly all of the preprocessor use, and never looked back. My code was much easier to understand.

        An amusing story: long ago, a friend of mine working for Microsoft was told by a team leader that a 50K program had a bug in it, and sadly the developer was long gone. He'd assigned programmer after programmer to it, who could not fix it. My friend said he'd give it a try, and had it fixed in 2 hours.

        The source code was written in Microsoft MASM, where the M stood for "Macro". You can guess where this is going. The developer had invented his own high level language using the macro system (which was much more powerful than C's). Unfortunately, he neglected to document it, and the other programmers spent weeks studying it and could not figure it out.

        The leader, astonished, asked him how he figured it out in 2 hours? My friend said simple. He assembled it to object code, then disassembled the object code with obj2asm (a disassembler I wrote that converts object code back to source code). He then immediately found and fixed the bug, and checked in the "new" source code which was the disassembled version.

        I've seen many very smart and clever uses of the C macros, the article is one of them. But isn't it time to move on?

        • ryao 30 minutes ago

          If the C compiler accepts it, it is C.

  • 2 hours ago
    [deleted]
  • ryao 33 minutes ago

    uint64_t data[] in level 2 violates the strict aliasing rule. Use the char type instead to avoid the violation.

  • eqvinox 5 hours ago

    Interesting idea with the union and using typeof(). We (I) went with large macros defining wrappers instead, which, I believe, is a necessity with intrusive data structures, or at least I don't immediately see how to do that with unions & typeof. Maybe it's possible...?

    e.g. hash table wrapper: https://github.com/FRRouting/frr/blob/master/lib/typesafe.h#...

    (cf. https://docs.frrouting.org/projects/dev-guide/en/latest/list...)

  • uecker 6 hours ago

    It is cool trick. I already use in my experimental library though ;-) https://github.com/uecker/noplate/blob/main/src/list.h

    • eqvinox 5 hours ago

      I guess if anyone might know it might be you—do you see any way of doing this for intrusive data structures, embedding the node struct in the data (and as side effect supporting an object to be on multiple containers) rather than the data in the node like you're doing there?

      • uecker 4 hours ago

        You could put the dummy member into the embedded node. But for intrusive data structures you often want them to erase the type so that you write generic algorithms as regular functions. In this case, it makes more sense to have a run-time check do to down casts. I do this with my variadic type which has an intrusive "super" member: https://godbolt.org/z/ofdKe7Pfv The overhead is often completely removed by the compiler.

        • eqvinox 3 hours ago

          Mhm. Putting the dummy member into the embedded node doesn't give a path from the proper object to find the embedded node "mid-struct". run-time checks are the "easy way out". We/I'll stick to macro soup probably, so we have compile-time checks.

          btw. For ISO C WG14… has anyone suggested adding _Include to the preprocessor, along the lines of _Pragma? It'd really help with doing this kind of really long macros, hiding the clunky "#define, #define, #define, #include" inside a macro…

  • 5 hours ago
    [deleted]
  • HexDecOctBin 5 hours ago

    The key idea here seems to be to use function pointer's type to enforce type safety rather than using the data "handle" type (that is often found in implementations inspired by Sean Barrett's strechy_buffers).

    > One annoying thing about C is that it does not consider these two variables to have the same type

    C23 solves that too: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3037.pdf

    Supported by latest GCC and Clang, but not by MSVC.

    • dhooper 5 hours ago

      Author here. Not quite. The key idea is about using a union to associate type information with a generic data type. Type casting a function is not the only way to use that type information. I discuss that as well as the C23 changes in the footnotes and the "typeof on old compilers" section.

      • wahern 2 hours ago

        FWIW, as far back as 2015 my feature check library documents Visual Studio as supporting "__typeof".[1] Note the leading but not trailing underscores. Perhaps I was mistaken, but I usually tested that sort of thing. It's also possible __typeof had slightly different semantics.

        [1] See https://github.com/wahern/autoguess/blob/b44556e4/config.h.g... (that's the 2015 revision, but HEAD has the same code).

        • dhooper 2 hours ago

          msvc 19.39 is the first to support it, which I mention in the article. You can confirm it didn't work up through 19.38 in godbolt [1]. I don't use Visual Studio, so I don't know what version of that first started using msvc 19.39

          [1] https://godbolt.org/z/M7zPYdssP

  • 5 hours ago
    [deleted]
  • hgs3 4 hours ago

    I'm curious what a hashmap looks like with this approach. It's one thing to pass through or hold onto a generic value, but another to perform operations on it. Think computing the hash value or comparing equality of generic keys in a generic hashmap.

  • monkeyelite 4 hours ago

    Another way is to not try to write generic data structures. When you tailor them to the use case you can simplify.

    The #1 data structure in any program is array.

  • asplake 5 hours ago

    Interesting! I’m working on toy/educational generator of ML-style tagged variants and associated functions in C (for a compiler) and when I’m a bit further along I will see if they’re compatible.

  • ape4 4 hours ago

    Or write in CFront and have it translated to C

    • zabzonk 2 hours ago

      And where are you going to get a cfront compiler these days?

  • b0a04gl 4 hours ago

    what happens if two types have same size and alignment but different semantics : like `int` vs `float` or `struct { int a; }` vs `int`? does the type tag system catch accidental reuse . aka defending against structual collisions

  • notnmeyer 5 hours ago

    pretty sure C is the new Go.

    • qustrolabe an hour ago

      pretty sure C has to go

    • revskill 4 hours ago

      Without the concurreny part.

      • oflebbe 4 hours ago

        OpenMP to the rescue

      • sltkr 3 hours ago

        Or garbage collection. Or interfaces. Or packages. Or actual generics.

  • luppy47474 5 hours ago

    [flagged]

  • luppy47474 5 hours ago

    [flagged]