Full Unicode Search at 50× ICU Speed with AVX‑512

(ashvardanian.com)

20 points | by ashvardanian 20 hours ago

6 comments

  • mgaunard 32 minutes ago

    In practice you should always normalize your Unicode data, then all you need to do is memcmp + boundary check.

    Interestingly enough, this library doesn't provide grapheme cluster tokenization or boundary checking, which is one of the most useful primitives for this.
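
    The normalize-then-compare approach described above can be sketched in Python. This is a minimal illustration, not the library's actual method; NFC is assumed as the normalization form, and `str.find` stands in for a memcmp-style scan:

    ```python
    import unicodedata

    def normalized_find(haystack: str, needle: str) -> int:
        # Normalize both strings to NFC so canonically equivalent
        # sequences (e.g. precomposed "é" vs "e" + combining acute)
        # compare equal byte-for-byte.
        h = unicodedata.normalize("NFC", haystack)
        n = unicodedata.normalize("NFC", needle)
        # After normalization a plain substring scan suffices.
        return h.find(n)

    precomposed = "caf\u00e9"    # "é" as a single codepoint U+00E9
    decomposed = "cafe\u0301"    # "e" followed by U+0301 combining acute
    print(normalized_find(decomposed, precomposed))  # → 0
    ```

    A real implementation would still need the boundary check the comment mentions, so a match inside a larger grapheme cluster isn't reported as a hit.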

    • stingraycharles 12 minutes ago

      That’s not practical in many situations, as the normalization alone may very well be more expensive than the search.

      If you’re in control of all data representations in your entire stack, then yes of course, but that’s hardly ever the case and different tradeoffs are made at different times (eg storage in UTF-8 because of efficiency, but in-memory representation in UTF-32 because of speed).
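
      The storage-vs-speed tradeoff mentioned here is easy to see in Python: UTF-8 is compact for ASCII-heavy text, while UTF-32 spends a fixed four bytes per codepoint in exchange for constant-time indexing. A quick illustration (encoding names and sizes only; the stack described in the comment is hypothetical):

      ```python
      text = "Unicode search"          # 14 codepoints, all ASCII
      utf8 = text.encode("utf-8")      # 1 byte per ASCII codepoint
      utf32 = text.encode("utf-32-le") # fixed 4 bytes per codepoint
      print(len(utf8), len(utf32))     # → 14 56
      ```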

      • mgaunard 3 minutes ago

        That doesn't make sense; the search is doing on-the-fly normalization as part of its algorithm, so it cannot be faster than normalization alone.

    • orthoxerox 6 minutes ago

      In practice the data is not always yours to normalize. You're not going to case-fold your library, but you still want to be able to search it.

  • andersa 32 minutes ago

    From a German user perspective, ICU and your fancy library are incorrect, actually. Mass is not a different casing of Maß; they are different words. Google likely changed this because it didn't do what users wanted.
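
    The behavior being objected to comes from Unicode's full case folding, which maps U+00DF (ß) to "ss". Python's `str.casefold` exposes the same mapping, which is presumably what makes ICU-style search treat the two as equal:

    ```python
    # Unicode full case folding expands ß (U+00DF) to "ss",
    # so a fold-based comparison considers "Maß" and "Mass" equal,
    # even though German readers treat them as distinct words.
    print("Maß".casefold())                        # → 'mass'
    print("Maß".casefold() == "Mass".casefold())   # → True
    ```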