49 points | by sanix-darker 9 hours ago

9 comments

  • mpalmer 6 hours ago

    It's unfortunate. I was willing to look past the clumsy, braggart clickbait title because there might actually be something behind the bluster.

    I was ready to learn something, and instead I found that the "author" doesn't actually respect their reader's time, because the substance of the piece - the part where they're supposed to be schooling us - is clearly unedited LLM output.

    Lorrrrrd but I'm fed up with this particular flavor of submission. I want to see it and its kind die hard every time it hits HN. It's a waste of human energy to produce them, and doubly so to read them.

    The people who make these use LLMs enough to understand that they're offering us nothing except the service of writing prompts for us. And it's not even to share their own knowledge, or their unique perspective. The author fails to mention that using SIMD for CSV parsing has been a thing for years. They say that they are "pioneering" these techniques.

    Why should serious people entertain this misrepresentation?

    It's not to illuminate others, it's to masquerade as someone who knows or has learned better. Not a lot offends me, but this clears the bar easy.

    Congrats on publishing.

  • Neywiny 8 hours ago

    I don't get it. The Miller CSV tool beats it in the row counting, medium, and large benchmarks, and then it's dropped from the rankings and not mentioned again. So... it's not the fastest ever made, right?

    • hbbio 6 hours ago

      The article (and maybe the code) is/are written by LLMs.

  • rfool 6 hours ago

    This reads as if the author is awaiting a falsification of their statement(s). But the thing is: it was neither accidentally created, nor is it the fastest CSV parser, for sure.

    But I am too lazy to prove or falsify that.

    We are in a post-factual age anyway, right?

  • ivanjermakov 8 hours ago

    I think SIMD would be a lot more approachable and popular if intrinsics APIs didn't look like this: _mm512_cmpeq_epi8_mask. I understand that keeping naming close to instruction names is a good idea, but this ain't a great newcomer experience.

    Modern LLVM already does a decent job with SIMD auto-vectorization, maybe even with AVX-512 when targeting a supported CPU.

    I wonder if we'll soon see a language that employs SIMD/MIMD optimizations without explicit asm/intrinsics from the user. I remember @cmuratori said somewhere that this is one thing he misses a lot in today's languages.
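
    For anyone unfamiliar with the intrinsic named above, here is a rough sketch in portable C of what `_mm512_cmpeq_epi8_mask` computes when paired with `_mm512_set1_epi8` (compare 64 bytes against one value and return a 64-bit mask), next to the kind of plain loop an optimizing compiler can often auto-vectorize without intrinsics. The function names `eq_mask64` and `count_byte` are made up for this illustration; whether the loop actually gets vectorized depends on the compiler, flags, and target CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for the AVX-512 pattern
 *   _mm512_cmpeq_epi8_mask(chunk, _mm512_set1_epi8(c))
 * Bit i of the result is set when chunk[i] == c.
 * (eq_mask64 is a hypothetical name, not a real intrinsic.) */
uint64_t eq_mask64(const unsigned char chunk[64], unsigned char c)
{
    uint64_t mask = 0;
    for (size_t i = 0; i < 64; i++) {
        mask |= (uint64_t)(chunk[i] == c) << i;
    }
    return mask;
}

/* A plain counting loop with no intrinsics. Loops of this shape are
 * common candidates for compiler auto-vectorization at -O2/-O3,
 * but that is not guaranteed. */
size_t count_byte(const unsigned char *buf, size_t len, unsigned char c)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        n += (buf[i] == c);
    }
    return n;
}
```

    Calling `eq_mask64(chunk, ',')` on a 64-byte block gives one bit per comma position, which is the building block SIMD CSV parsers use to find delimiters in bulk.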

  • VoidWhisperer 8 hours ago

    Just to clarify something from the article and the GitHub repo code (sorry if this is readily explained in the repo, C code using AVX stuff is not something I have any experience with): does the SIMD version still handle things like escaped quotes in between the quotes that delimit a string field in the CSV file? It isn't mentioned in the article.

    • jiggawatts 8 hours ago

      The readme in the source code mentions RFC 4180 compliance.

      This snippet of the code shows that it is correctly handled:

          // from init_tables()
          } else {
              // When in a quoted string, a quote char transitions to the escape state.
              st[S_QUOTED][q] = S_QUOTE_ESC;
      
              // By default, any character seen after the escape state's quote
              // will transition back to the unquoted state (ending the field).
              memset(st[S_QUOTE_ESC], S_UNQUOTED, 256);
      
              // CRITICAL OVERRIDE: If the character is another quote, it was an
              // escaped quote, so we go back to the quoted state.
              st[S_QUOTE_ESC][q] = S_QUOTED;
          }
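
      To show how such a table behaves end to end, here is a minimal standalone sketch (not the repository's code). The state names `S_UNQUOTED`, `S_QUOTED`, and `S_QUOTE_ESC` and the table name `st` come from the snippet above; everything else, including this version of `init_tables` and the `count_fields` helper, is assumed for illustration. It ignores newlines inside quoted fields and other details a real parser needs.

```c
#include <stddef.h>
#include <string.h>

enum { S_UNQUOTED, S_QUOTED, S_QUOTE_ESC, S_COUNT };

/* st[state][byte] gives the next state. */
static unsigned char st[S_COUNT][256];

/* Hypothetical simplified table setup; q is the quote character. */
void init_tables(unsigned char q)
{
    /* By default, ordinary bytes keep the parser in its current state. */
    memset(st[S_UNQUOTED], S_UNQUOTED, 256);
    memset(st[S_QUOTED], S_QUOTED, 256);

    /* A quote opens a quoted field. */
    st[S_UNQUOTED][q] = S_QUOTED;

    /* Inside a quoted field, a quote is either the closing quote
     * or the first half of an escaped "" pair. */
    st[S_QUOTED][q] = S_QUOTE_ESC;

    /* After that quote, any byte ends the quoted field ... */
    memset(st[S_QUOTE_ESC], S_UNQUOTED, 256);

    /* ... except a second quote, which was an escaped quote. */
    st[S_QUOTE_ESC][q] = S_QUOTED;
}

/* Count the fields in one NUL-terminated record.
 * A delimiter only separates fields when it lands in S_UNQUOTED. */
size_t count_fields(const char *line, char delim)
{
    unsigned char state = S_UNQUOTED;
    size_t fields = 1;
    for (const char *p = line; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        state = st[state][c];
        if (c == (unsigned char)delim && state == S_UNQUOTED) {
            fields++;
        }
    }
    return fields;
}
```

      After `init_tables('"')`, a record such as `a,"b ""x"", c",d` is counted as three fields: the comma inside the quotes is skipped, and the doubled quotes keep the parser in the quoted state.
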
  • boogheta 8 hours ago

    Nice! Funny coincidence, @Yomguithereal has been working for the past few weeks pretty much on the same idea of using SIMD for CSV parsing, but in Rust! https://github.com/medialab/simd-csv
