4 comments

  • juliusgeo 5 hours ago

    This is super cool! If I'm understanding your implementation correctly, you run bit-by-bit state machine logic to check whether quotes should be escaped, etc. You can do that in a single pass by using a carry-less polynomial multiplication instruction (PCLMULQDQ, exposed as the _mm_clmulepi64_si128 intrinsic; I believe AVX-512 adds wider variants via VPCLMULQDQ), or by just computing the prefix XOR directly on the quote mask and then ANDing the inverse with the bitmask for quotes. Simdjson uses this trick, and I use it as well in my Rust SIMD CSV parser:

    https://github.com/juliusgeo/csimdv-rs/blob/681df3b036f30c5a...

    This is a good write-up on how the approach works: https://nullprogram.com/blog/2021/12/04/
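
A minimal Go sketch of the prefix-XOR idea described above, under stated assumptions: the helper names (`bitmask`, `prefixXOR`, `unquotedSeparators`) are hypothetical, the bitmasks are built with a scalar loop standing in for a SIMD compare plus movemask, and the sketch masks commas as one possible use of the inverted in-quote mask. It handles a single chunk of up to 64 bytes and does not carry the quote state into the next chunk, which a real parser would need to do.

```go
package main

import (
	"fmt"
	"math/bits"
)

// prefixXOR returns a mask in which bit i is the XOR of bits 0..i of x.
// Applied to a quote bitmask, the set bits cover each quoted region from the
// opening quote up to, but not including, the closing quote.
func prefixXOR(x uint64) uint64 {
	x ^= x << 1
	x ^= x << 2
	x ^= x << 4
	x ^= x << 8
	x ^= x << 16
	x ^= x << 32
	return x
}

// bitmask sets bit i where chunk[i] == c, for the first 64 bytes of chunk.
// Hypothetical scalar stand-in for a SIMD compare followed by a movemask.
func bitmask(chunk []byte, c byte) uint64 {
	var m uint64
	for i, b := range chunk {
		if i >= 64 {
			break
		}
		if b == c {
			m |= 1 << uint(i)
		}
	}
	return m
}

// unquotedSeparators returns the mask of commas that lie outside quoted fields.
func unquotedSeparators(chunk []byte) uint64 {
	quotes := bitmask(chunk, '"')
	commas := bitmask(chunk, ',')
	inQuotes := prefixXOR(quotes)
	return commas &^ inQuotes // AND with the inverse of the in-quote mask
}

func main() {
	m := unquotedSeparators([]byte(`a,"b,c",d`))
	var positions []int
	for m != 0 {
		positions = append(positions, bits.TrailingZeros64(m))
		m &= m - 1 // clear the lowest set bit
	}
	fmt.Println(positions) // prints [1 7]
}
```

The six shift-and-XOR steps compute the same result as a carry-less multiplication of the quote mask by an all-ones value, which is why the cascade can substitute for CLMUL where that instruction is unavailable.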

    • tokkyokky 5 minutes ago

      Thanks for the tip! Your comment prompted me to refactor the quote handling - replaced the bit-by-bit state machine loop with prefix XOR, and switched to adjacent bit masking for double-quote detection. Seeing a nice performance improvement in benchmarks. Go's simd/archsimd doesn't have CLMUL yet, but the XOR cascade works well. Appreciate your feedback!
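
One way the adjacent bit masking mentioned above could look in Go, as a sketch only: the actual refactor is not shown in this thread, so the function names (`quoteMask`, `adjacentQuotes`) and the scalar mask construction are assumptions made for illustration. The mask flags every quote that is immediately followed by another quote within the same 64-byte chunk. It does not see a pair that straddles a chunk boundary, and a run of three or more quotes produces overlapping hits that a caller would have to resolve.

```go
package main

import (
	"fmt"
	"math/bits"
)

// quoteMask sets bit i where chunk[i] is a double quote, for the first
// 64 bytes of chunk. Hypothetical scalar stand-in for a SIMD compare.
func quoteMask(chunk []byte) uint64 {
	var m uint64
	for i, b := range chunk {
		if i >= 64 {
			break
		}
		if b == '"' {
			m |= 1 << uint(i)
		}
	}
	return m
}

// adjacentQuotes returns a mask with bit i set where both byte i and
// byte i+1 are quotes. Shifting right by one lines up each bit with its
// successor, so the AND keeps only quotes followed directly by a quote.
func adjacentQuotes(quotes uint64) uint64 {
	return quotes & (quotes >> 1)
}

func main() {
	pairs := adjacentQuotes(quoteMask([]byte(`"a""b"`)))
	fmt.Println(bits.TrailingZeros64(pairs)) // prints 2
}
```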

  • zigzag312 14 hours ago

    Benchmark comparison with C# SIMD optimized CSV parser [1] would be fun to see.

    [1] https://github.com/nietras/Sep

    • tokkyokky 12 hours ago

      Oh, nice! I’ll try to do it!!