Full Unicode Search at 50× ICU Speed with AVX‑512

(ashvardanian.com)

20 points | by ashvardanian 20 hours ago

6 comments

  • mgaunard 32 minutes ago

    In practice you should always normalize your Unicode data, then all you need to do is memcmp + boundary check.

    Interestingly enough, this library doesn't provide grapheme cluster tokenization or boundary checking, which is one of the most useful primitives for this.
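
    The normalize-then-compare approach described above can be sketched in Python. This is a minimal illustration, not the library's actual method; NFC is assumed as the normalization form, and `str.find` stands in for a memcmp-style scan:

    ```python
    import unicodedata

    def normalized_find(haystack: str, needle: str) -> int:
        # Normalize both strings to NFC so canonically equivalent
        # sequences (e.g. precomposed "é" vs "e" + combining acute)
        # compare equal byte-for-byte.
        h = unicodedata.normalize("NFC", haystack)
        n = unicodedata.normalize("NFC", needle)
        # After normalization a plain substring scan suffices.
        return h.find(n)

    precomposed = "caf\u00e9"    # "é" as a single codepoint U+00E9
    decomposed = "cafe\u0301"    # "e" followed by U+0301 combining acute
    print(normalized_find(decomposed, precomposed))  # → 0
    ```

    A real implementation would still need the boundary check the comment mentions, so a match inside a larger grapheme cluster isn't reported as a hit.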

    • stingraycharles 12 minutes ago

      That’s not practical in many situations, as the normalization alone may very well be more expensive than the search.

      If you’re in control of all data representations in your entire stack, then yes of course, but that’s hardly ever the case and different tradeoffs are made at different times (eg storage in UTF-8 because of efficiency, but in-memory representation in UTF-32 because of speed).
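
      The storage-vs-speed tradeoff mentioned here is easy to see in Python: UTF-8 is compact for ASCII-heavy text, while UTF-32 spends a fixed four bytes per codepoint in exchange for constant-time indexing. A quick illustration (encoding names and sizes only; the stack described in the comment is hypothetical):

      ```python
      text = "Unicode search"          # 14 codepoints, all ASCII
      utf8 = text.encode("utf-8")      # 1 byte per ASCII codepoint
      utf32 = text.encode("utf-32-le") # fixed 4 bytes per codepoint
      print(len(utf8), len(utf32))     # → 14 56
      ```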

      • mgaunard 3 minutes ago

        That doesn't make sense; the search is doing on-the-fly normalization as part of its algorithm, so it cannot be faster than normalization alone.

    • orthoxerox 6 minutes ago

      In practice the data is not always yours to normalize. You're not going to case-fold your library, but you still want to be able to search it.

  • andersa 32 minutes ago

    From a German user perspective, ICU and your fancy library are incorrect, actually. Mass is not a different casing of Maß; they are different words. Google likely changed this because it didn't do what users wanted.
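
    The behavior being objected to comes from Unicode's full case folding, which maps U+00DF (ß) to "ss". Python's `str.casefold` exposes the same mapping, which is presumably what makes ICU-style search treat the two as equal:

    ```python
    # Unicode full case folding expands ß (U+00DF) to "ss",
    # so a fold-based comparison considers "Maß" and "Mass" equal,
    # even though German readers treat them as distinct words.
    print("Maß".casefold())                        # → 'mass'
    print("Maß".casefold() == "Mass".casefold())   # → True
    ```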