14 comments

  • bluedino 15 minutes ago

    Now I'm remembering all the fun we had a long time ago with PHP websites that used an AS/400 as a data source. The two didn't sort the same way, and the mom-and-pop web dev shop hired to build the site didn't understand the issue, tried to hack around it, and failed.
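One plausible cause of that kind of mismatch, offered as an assumption since the comment doesn't say, is that the AS/400 side ordered keys by EBCDIC byte values while the web side ordered them by ASCII. A rough sketch using Python's `cp037` codec (one common EBCDIC code page, picked here only for illustration; the function names and sample keys are made up):

```python
# Illustration only: the original system's code page and data are unknown.
# cp037 is used as a stand-in for "some EBCDIC code page".

def ascii_order(items):
    """Sort by ASCII byte values: digits < uppercase < lowercase."""
    return sorted(items, key=lambda s: s.encode("ascii"))


def ebcdic_order(items):
    """Sort by EBCDIC (cp037) byte values: lowercase < uppercase < digits."""
    return sorted(items, key=lambda s: s.encode("cp037"))


keys = ["a1", "B2", "3c"]  # hypothetical record keys
print(ascii_order(keys))   # ['3c', 'B2', 'a1']
print(ebcdic_order(keys))  # ['a1', 'B2', '3c']
```

If two systems page through the same data in these two orders, any merge or "next page" logic that assumes one shared order will quietly skip or repeat rows.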

  • asveikau 5 hours ago

    Sorting is language-specific even if you're restricted to languages using Latin characters. E.g., how do you sort N relative to Ñ? How do you treat the Turkish variations on the letter I?

    Doing a dumb sort by character or byte values is obviously the wrong call for any diacritics, but the right call may also depend on the language.
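To make the N/Ñ and Turkish I questions concrete, here is a minimal sketch that compares a plain code-point sort with a per-language alphabet table. The `alphabet_key` helper and the sample words are hypothetical, the tables are simplified to lowercase letters, and real collation (accents, case, digraphs) is considerably more involved:

```python
def alphabet_key(alphabet):
    """Build a sort key that ranks characters by their position in `alphabet`."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    unknown = len(alphabet)  # characters outside the table sort last

    def key(word):
        return [rank.get(ch, unknown) for ch in word]

    return key


# Simplified lowercase alphabets. Inputs are assumed to be lowercase already,
# because case mapping is itself language-specific (Turkish I/ı and İ/i).
SPANISH = "abcdefghijklmnñopqrstuvwxyz"
TURKISH = "abcçdefgğhıijklmnoöprsştuüvyz"

spanish_words = ["nube", "oso", "ñandú", "zorro"]
print(sorted(spanish_words))                             # ['nube', 'oso', 'zorro', 'ñandú']
print(sorted(spanish_words, key=alphabet_key(SPANISH)))  # ['nube', 'ñandú', 'oso', 'zorro']

turkish_words = ["iyi", "hava", "ılık"]
print(sorted(turkish_words))                             # ['hava', 'iyi', 'ılık']
print(sorted(turkish_words, key=alphabet_key(TURKISH)))  # ['hava', 'ılık', 'iyi']
```

The code-point sort pushes ñ and ı past z, while each language's own alphabet places them right next to n and i.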

    • dmurray 4 hours ago

      And that's why there are a hundred different possible values for LC_COLLATE, and it's completely normal that two popular Unix distributions picked different default values for that setting...right?

      It would have been reasonable to conclude the article a third of the way through, and say "sorting is locale-dependent, if what you value is consistent behaviour between different OSs (instead of sorting based on the user's preferences) you need to implement the sorting yourself."

      • harrall an hour ago

        Set LC_ALL=C, which gives you consistent sorting behavior.

        The article does mention it but in passing.
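For code that can't rely on the environment, the same idea can be approximated in-process. The sketch below sorts by raw UTF-8 bytes, which is how byte-wise comparison in the C locale behaves as far as I understand it; `c_locale_sort` and the sample names are invented for illustration:

```python
def c_locale_sort(lines):
    """Order strings by their UTF-8 byte values, independent of the user's locale."""
    return sorted(lines, key=lambda s: s.encode("utf-8"))


names = ["apple", "Banana", "cherry", "Éclair", "_private"]
print(c_locale_sort(names))
# ['Banana', '_private', 'apple', 'cherry', 'Éclair']
```

The result is consistent across machines, but it is not what a human would call alphabetical: uppercase sorts before lowercase, and anything non-ASCII lands at the end.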

    • encom 3 hours ago

      Before the Danish language adopted the letter "å" (in 1948), the vowel was written as "aa". In the Danish alphabet, "å" is the last letter. Therefore a list of three Danish city names would be correctly sorted as:

        * Albertslund
        * Odense
        * Aarhus

      This feels like material for another Tom Scott video.

      • tpmoney an hour ago

        Not Tom Scott, but Dylan Beattie has done a handful of interesting talks[1], effectively on "there's no such thing as plain text", which in part cover this sort of thing. In fact, I think your Danish cities list is actually one of his examples.

        [1]: https://www.youtube.com/watch?v=gd5uJ7Nlvvo
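The Danish city ordering from this subthread can be sketched as follows. This is a deliberately simplified rule (every "aa" is treated as "å", and æ, ø, å follow z); the `danish_key` helper is hypothetical, and real Danish collation has exceptions this ignores:

```python
# Simplified Danish alphabet: the three extra vowels come after z.
DANISH = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(DANISH)}


def danish_key(name):
    """Rank a name by Danish alphabet position, treating 'aa' as 'å'."""
    text = name.lower().replace("aa", "å")
    # Characters outside the table (spaces, hyphens, ...) sort last.
    return [RANK.get(ch, len(DANISH)) for ch in text]


cities = ["Aarhus", "Odense", "Albertslund"]
print(sorted(cities))                  # ['Aarhus', 'Albertslund', 'Odense']
print(sorted(cities, key=danish_key))  # ['Albertslund', 'Odense', 'Aarhus']
```

A plain code-point sort puts Aarhus first; the Danish key moves it to the end, matching the list above.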

  • OptionOfT 6 hours ago

  • 1a527dd5 2 hours ago

    Ask anyone who has done a Postgres upgrade. The words "collate" and "glibc" are enough to make me pause now. Learnt loads, never going to really use it again, but man, do I understand the pain it causes now.

  • o11c 3 hours ago

    Minor note: on Debian (and possibly other distros), you don't have to use `locale-gen` to dynamically build things into `$complocaledir/locale-archive` (which, incidentally, can cause random breakage for programs that happen to start during system upgrades).

    The `locales-all` package works more like macOS. It's only a ~10MB download but unpacks to take ~250MB of disk space (these numbers will vary based on your libc version and packaging format).

    There are a lot of sparse arrays and UTF-32 character data in compiled locales.

    Incidentally, the command to dump a locale's data is:

      LC_ALL=whatever locale -ck $(locale | sed 's/=.*//; /LANG\|LC_ALL/d')

  • skopje 5 hours ago

    So the ISO way is the right way, right?

    • dataflow 4 hours ago

      I wondered the same. What's the right ordering?

  • greesil 3 hours ago

    It's not a stable sort?

  • loeg 6 hours ago

    (2020)

  • pjmlp 5 hours ago

    Yet another one of those POSIX and ISO things that most people don't bother to know about.

    https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1...