An overengineered solution to `sort | uniq -c` with 25x throughput (hist)

(github.com)

6 points | by noamteyssier 15 hours ago ago

1 comments

Was sitting around in meetings today and remembered an old shell script I had to count the number of unique lines in a file. Gave it a shot in rust and with a little bit of (over-engineering)™ I managed to get 25x throughput over the naive approach using coreutils as well as improve over some existing tools.

Some notes on the improvements:

1. using csv (serde) for writing leads to some big gains

2. arena allocation of incoming keys + storing references in the hashmap instead of storing owned values heavily reduced the number of allocations and improves cache efficiency (I'm guessing, I did not measure).

There are some regex functionalities and some table filtering built in as well.

happy hacking