My understanding of the word "quick" in the title is that this code is quick to write. But then this brings up the question of why. Why do we prioritize code that is easy to write, which presumably means it needs to be written over and over again. Is this a reflection of the sad state of C library packaging?
A quick perusal of the hashmap landscape[0] in C++ shows a vibrant ecosystem where newer and faster implementations appear every year or so.
> Is this a reflection of the sad state of C library packaging?
No, it's more of a reflection of how many different ISAs and compilers there are for C.
To get a packaging system going for C you'd have to support quite a few of them and, even then, some degree of forensic analysis would have to be done for existing codebases in order to determine if they truly support what they claim to support.
For instance, in 2017 I had to modify hiredis to properly support the Qualcomm Snapdragon 800 since it had quite a few instances of type punning which were causing bus errors.
Did hiredis support this platform in 2017? Who knows?! Hell, it doesn't even mention on its GitHub which ISAs and compilers it's intended to support.
At MINIMUM, packages need something akin to the table on the following page in order to get a packaging system going for C: https://liblfds.org
--
I work with C daily. Unfortunately, I so often have to patch dependencies to work on the targets I support that I'm in the habit of "vendoring" any code I depend on. This practice, I imagine, is pretty common for multi-platform C projects.
> Is this a reflection of the sad state of C library packaging?
No, it is a sad state of the ability of C to encapsulate things. You cannot define vector<T> in C. Using some macros you probably can devise a way to get an implementation of a dynamic array for your type of an element, but... macros... And in any case it probably will be ugly. If you can implement hash table in a few dozen of lines, it is much better.
The hash table is still doing a string compare per bucket.
The best technique I have seen is to have an open addressing hash table whose keys are 64 bit hashes and values are pointers.
Then you layer on top of that to make a string dictionary. The value of the hash table is a linked list chain of key value pairs where the key is a string.
In the vast majority of cases you never touch the chain. Only when two keys have identical 64 bit hashes (not bucket indices) do you need to disambiguate with string equality.
This design is much faster and can be easily reconfigured for key types which are not strings, with no changes to the underlying hash table.
In school (and Wikipedia?) they told me chaining is just moving elements with the same bucket index into a linked list.
Sometimes textbooks focus too much on the abstract idea, and throw out what they think are unnecessary details, when they are actually important contributions by practitioners.
My understanding of the word "quick" in the title is that this code is quick to write. But then this brings up the question of why. Why do we prioritize code that is easy to write, which presumably means it needs to be written over and over again. Is this a reflection of the sad state of C library packaging?
A quick perusal of the hashmap landscape[0] in C++ shows a vibrant ecosystem where newer and faster implementations appear every year or so.
[0]: https://martin.ankerl.com/2022/08/27/hashmap-bench-01/
> Is this a reflection of the sad state of C library packaging?
No, it's more of a reflection of how many different ISAs and compilers there are for C.
To get a packaging system going for C you'd have to support quite a few of them and, even then, some degree of forensic analysis would have to be done for existing codebases in order to determine if they truly support what they claim to support.
For instance, in 2017 I had to modify hiredis to properly support the Qualcomm Snapdragon 800 since it had quite a few instances of type punning which were causing bus errors.
Did hiredis support this platform in 2017? Who knows?! Hell, it doesn't even mention on its GitHub which ISAs and compilers it's intended to support.
At MINIMUM, packages need something akin to the table on the following page in order to get a packaging system going for C: https://liblfds.org
--
I work with C daily. Unfortunately, I so often have to patch dependencies to work on the targets I support that I'm in the habit of "vendoring" any code I depend on. This practice, I imagine, is pretty common for multi-platform C projects.
> Is this a reflection of the sad state of C library packaging?
No, it is a sad state of the ability of C to encapsulate things. You cannot define vector<T> in C. Using some macros you probably can devise a way to get an implementation of a dynamic array for your type of an element, but... macros... And in any case it probably will be ugly. If you can implement hash table in a few dozen of lines, it is much better.
Less to write means less to read.
It's comparing apples to oranges. If in C++ you removed allocators, exceptions, references, and generics, then unordered_map would be "quick" too.
The hash table is still doing a string compare per bucket.
The best technique I have seen is to have an open addressing hash table whose keys are 64 bit hashes and values are pointers.
Then you layer on top of that to make a string dictionary. The value of the hash table is a linked list chain of key value pairs where the key is a string.
In the vast majority of cases you never touch the chain. Only when two keys have identical 64 bit hashes (not bucket indices) do you need to disambiguate with string equality.
This design is much faster and can be easily reconfigured for key types which are not strings, with no changes to the underlying hash table.
AFAICT, what you write here is more-or-less the standard approach to open addressed hash tables since they were invented.
I'm not sure why Chris overlooked this in writing his article. The string comparison is supposed to only happen when the hash values are equal.
In school (and Wikipedia?) they told me chaining is just moving elements with the same bucket index into a linked list.
Sometimes textbooks focus too much on the abstract idea, and throw out what they think are unnecessary details, when they are actually important contributions by practitioners.
Why chain rather than reprobing on full key collisions?