Just scripted something to find them all:
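The script itself didn't make it into the comment; here's a minimal reconstruction (my sketch, not the original), assuming the 31 characters are exactly those with Unicode general category Lt, "Letter, titlecase":

```python
import sys
import unicodedata

# Characters whose general category is "Lt" (Letter, titlecase):
# the cased letters that have a distinct third form.
titlecase_chars = [
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lt"
]

print(len(titlecase_chars))      # 31 as of recent Unicode versions
print(" ".join(titlecase_chars))
```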
You can find them all with this UnicodeSet query (though the query alone naturally won’t show you the lower and upper forms):
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%...
It’s a handy way of finding all kinds of things along these lines. Look at the properties of some characters you care about, and see how you can add, subtract and intersect them.
TIL:
Polytonic orthography (from Ancient Greek πολύς (polýs) 'much, many' and τόνος (tónos) 'accent') is the standard system for Ancient Greek and Medieval Greek and includes:
- acute accent (´)
- circumflex accent (ˆ)
- grave accent (`); these 3 accents indicate different kinds of pitch accent
- rough breathing (῾) indicates the presence of the /h/ sound before a letter
- smooth breathing (᾿) indicates the absence of /h/.
Since in Modern Greek the pitch accent has been replaced by a dynamic accent (stress), and /h/ was lost, most polytonic diacritics have no phonetic significance, and merely reveal the underlying Ancient Greek etymology.
(https://en.wikipedia.org/wiki/Greek_diacritics)
Reminds me of Vietnamese and its use of diacritics to mark tones. Vietnamese also uses diacritical markings to differentiate some vowels.
https://en.wikipedia.org/wiki/Vietnamese_phonology#Tone?wpro...
The other day I posted similar tables/scripts for related character properties and there was some good discussion: https://news.ycombinator.com/item?id=42014045
- Unicode codepoints that expand or contract when case is changed in UTF-8: https://gist.github.com/rendello/d37552507a389656e248f3255a6...
- Unicode roundtrip-unsafe characters: https://gist.github.com/rendello/4d8266b7c52bf0e98eab2073b38...
For example, if we do uppercase→lower→upper, some characters don't survive the roundtrip:
Ω ω Ω
İ i̇ İ
K k K
Å å Å
ẞ ß SS
ϴ θ Θ
I'm using the scripts to build out a little automated-testing generator library, something like "Tricky Unicode/UTF-8 case-change characters". Any other weird case quirks anyone can think of to put in the generators?
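For the generator library, a few of these quirks reproduce directly with CPython's built-in case mappings (other languages' tables may differ slightly):

```python
# upper -> lower -> upper round trips that land on a different character.
for ch in ["\u2126", "\u0130", "\u212a", "\u212b", "\u1e9e", "\u03f4"]:
    low = ch.lower()
    print(ch, "->", low, "->", low.upper())

# Case changes can also change the UTF-8 byte length.
for ch in ["\u0131", "\u2c65"]:  # dotless i, a with stroke
    up = ch.upper()
    print(f"{ch} ({len(ch.encode('utf-8'))} bytes) -> {up} ({len(up.encode('utf-8'))} bytes)")
```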
Seems like a lot of these would be taken care of by normalization though? Pre-composed characters are a bit of a mess.
I do feel it is an error that unit/math symbols get changed; imho they should stay as-is through case conversions.
These lists (and the future library) were made to test normalization and break software that made bad assumptions. I initially generated the list because I knew that some of the assumptions in the parser I was writing were not solid, and sure enough the tests broke it.
Someone pointed out the canonical source, which I'll have to look at more closely:
https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt
The Unicode names of these 31 chars, compressed:
LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
LATIN CAPITAL LETTER {_} WITH SMALL LETTER {_}
  L,J
  N,J
  D,Z
GREEK CAPITAL LETTER {ALPHA,ETA,OMEGA} WITH PROSGEGRAMMENI
GREEK CAPITAL LETTER {ALPHA,ETA,OMEGA} WITH {PSILI,DASIA} AND {_}
  PROSGEGRAMMENI
  VARIA AND PROSGEGRAMMENI
  OXIA AND PROSGEGRAMMENI
  PERISPOMENI AND PROSGEGRAMMENI
What's the difference with letter Ch [0]? When it's capitalized at the beginning of the word, it also looks like uppercase C and lowercase h.
[0]https://en.wikipedia.org/wiki/Ch_(digraph)
There is no single Unicode character representing "Ch".
Here's a list of Unicode digraphs: DZ, Dz, dz, DŽ, Dž, dž, IJ, ij, LJ, Lj, lj, NJ, Nj, nj, ᵺ
https://en.wikipedia.org/wiki/Digraph_(orthography)#In_Unico...
Ch may be a digraph in many languages, but is it implemented in Unicode as a single character?
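One way to check from Python, using unicodedata (the Dz forms exist as single codepoints; a "Ch" analogue does not):

```python
import unicodedata

# The titlecase Dz digraph is a real single codepoint with its own name...
assert unicodedata.name("\u01c5") == (
    "LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON"
)

# ...but looking up the analogous name for "Ch" finds nothing.
try:
    unicodedata.lookup("LATIN CAPITAL LETTER C WITH SMALL LETTER H")
except KeyError:
    print("no single codepoint for Ch")
```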
Uppest case! [0]
[0] Lowestcase and uppestcase letters: Advances in derp learning, Sigbovik 1st April 2021
https://sigbovik.org/2021/proceedings.pdf
> no more than “lav” should match “law” just because the first part of the letter “w” looks like a “v”.
Well of course not, it’s double u, not double v… so maybe “lau” should match “law”!
(That’s one thing French got right. Dooblah vay, double v. (Is there a proper French spelling for that pronunciation? Like how h is aitch in English.))
> Is there a proper French spelling for that pronunciation? Like how h is aitch in English.
No, French doesn't have spellings for the names of its letters.
(I'm a native French speaker.)
W is fairly recent in the official French alphabet and it's officially called "double v".
In Belgium it can be pronounced or heard as "way" (wé), usually for:
- BMW as "bay-hem-way" (bé-m-wé)
- www as "way-way-way" (wé-wé-wé)
- WC as "way-say" (wé-c)
And it's so convenient, too! No letter thus requires several syllables to be pronounced.
It's one thing I keep using from Belgian French despite having lived in Switzerland for over a decade, because it's objectively better.
(Swiss French has the objectively better names for 70-80-90, though. No quatre-vingt-dix BS like in France. :-p)
On Serbian Wikipedia you have an option to automatically transliterate from Cyrillic to Latin script, so I guess this would come up in similar contexts.
In Croatian it doesn't matter, literally nobody uses the digraph Unicode characters because they do not appear on the keyboard. Instead you just write these digraphs as two regular Latin characters: nj, lj and dž.
Strange that this exists. Polish also has dz (it's the same phoneme), along with dź, dż, sz, cz, all of which use title case in, among other instances, acronyms (e.g. RiGCz), but I'm not aware of any special code points for them - dz is definitely always spelled as d-z.
Per the article:
> These digraphs owe their existence in Unicode ... to Serbo-Croatian. Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.
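A toy sketch of that one-to-one property (the mapping table here is deliberately tiny and illustrative; a real transliterator covers the whole alphabet and must also choose between Ǌ and ǋ from context):

```python
# Each Cyrillic letter maps to exactly one Latin codepoint -- even љ, њ and џ,
# thanks to the single-codepoint digraphs ǉ, ǌ and ǆ.
CYR2LAT = str.maketrans({
    "љ": "ǉ", "њ": "ǌ", "џ": "ǆ",
    "е": "e", "г": "g", "о": "o", "ш": "š",
})

word = "његош"
latin = word.translate(CYR2LAT)
print(latin, len(word) == len(latin))  # lengths match: the mapping is reversible
```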
For an intereſting third, deleted caſe of 'S' I preſent to you: ſ (the long-s).
It has the advantage that while ſome programming languages feature “class” as a reſerved word, “claſs” almoſt never is, ſo you can uſe that inſtead of a mis-ſpelling.
Presumably you can also use claß then
claß is much more concise.
The eszett is a ligature of the long ess and a zee. I never understood why Germans "expand" it to double ess.
"In the course of the Second Sound Shift in the 7th and 8th centuries, two different sounds emerged from Germanic /t/ and /tː/ - a fricative and an affricate - both of which were initially rendered with zz. Since Old High German, spellings such as sz for the fricative and tz for the affricate were used to differentiate between them."
"The sound written with ss, which goes back to an inherited Germanic /s/, differed from the sound written with sz; the ss was pronounced as a voiceless alveolo-palatal fricative [ɕ], whereas the sz was pronounced as a voiceless alveolar fricative [s]. Even when these two sounds merged, both spellings were retained. However, they were confused because no one knew anymore where an sz had originally been and where an ss had been."
https://de.m.wikipedia.org/wiki/%C3%9F
Recently they've been able to expand it to the capital ẞ, too:
https://en.wikipedia.org/wiki/%C3%9F
Yes my uſe of this caſe is correct.
I would say, hiragana and katakana, in a way.
Each nominal syllable sound in Japanese can be written using a character in one of these two scripts:
Roman transcription: a i u e o ka ki ku ke ko
Hiragana: あ い う え お か き く け こ
Katakana: ア イ ウ エ オ カ キ ク ケ コ
There are some rough parallels between upper case and katakana.
- Katakana is used less than hiragana; "katakana heavy" text will be something that is loaded with foreign words (like a software manual) or terms from zoology and botany.
- It is sometimes used to denote SHOUTING, like in quoted speech such as cartoon bubbles.
- Some early computing displays in the West could only produce upper case characters; in Japan, some early displays only featured katakana, which stays legible at lower resolution.
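The parallel extends to the encoding: the hiragana and katakana blocks are laid out in the same order, 0x60 codepoints apart, so a rough converter (ignoring a few edge characters outside the common range) is just an offset:

```python
def hira_to_kata(text: str) -> str:
    # Shift hiragana letters (U+3041..U+3096) up by 0x60 into the katakana block.
    return "".join(
        chr(ord(c) + 0x60) if "\u3041" <= c <= "\u3096" else c
        for c in text
    )

print(hira_to_kata("かきくけこ"))  # カキクケコ
```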
Dutch also has a digraph-which-counts-as-a-letter, "ij". But that doesn't get title-cased internally - there is a city called IJmuiden, not Ijmuiden.
As a test for your browser's internationalisation support, try
In Firefox, this displays correctly as "IJmuiden" (thanks to the lang attribute; without that, it would show "Ijmuiden").
Technology is not implemented for internal consistency or to make sense; technological implementations are an artifact of history.
Of course, at the time it made sense to someone.
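Back to IJmuiden: generic capitalization gets the Dutch rule wrong, and special-casing it by hand is easy to sketch (illustrative only, not locale-aware library code):

```python
def dutch_capitalize(word: str) -> str:
    # Dutch treats "ij" as a single letter, so both halves are capitalized together.
    if word[:2].lower() == "ij":
        return "IJ" + word[2:]
    return word.capitalize()

print("ijmuiden".capitalize())       # Ijmuiden -- the generic (wrong) result
print(dutch_capitalize("ijmuiden"))  # IJmuiden
```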
Wait what? He writes "For example, the first ten letters of the Hungarian alphabet are¹", but the note is "I got this information from the Unicode Standard, Version 15.0, Chapter 7: “Europe I”, Section 7.1: “Latin”, subsection “Latin Extended-B: U+0180-U+024F”, sub-subsection “Croatian Digraphs Matching Serbian Cyrillic Letters.”"
Actually it kinda makes sense to have two Latin letters form a digraph if they are used to represent a single Cyrillic letter, while it makes less sense for Hungarian, which (AFAIK) has always been written with Latin letters? I mean, of course you could do it, but then I want an extra Unicode code point for the German "sch" too!
If you look at the whole Hungarian alphabet (https://learnhungarianfromhome.com/wp-content/uploads/2020/0...), you get a total of 8 digraphs and 1 trigraph (plus 9 letters with diacritics), but "Lj" and "Nj" are not among them...
> Access denied [...] The owner of this website (learnhungarianfromhome.com) does not allow hotlinking to that resource
https://en.wikipedia.org/wiki/Hungarian_alphabet
Copy the link and open in a new tab.
From the article:
> These digraphs owe their existence in Unicode not to Hungarian but to Serbo-Croatian. Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.¹
Yeah, but then why bring up Hungarian (which has very little in common with Serbo-Croatian, although spoken in a neighboring country) in the first place?
Because Hungarian is an example of having 3 cases, but only some of the Hungarian digraphs have these 3 cases encoded in Unicode.
Yes, buuuut Serbo-Croatian obviously has those 3 cases too, so he could have made the post much clearer by leaving out Hungarian and only focusing on Serbo-Croatian (or mentioning Hungarian only as an aside). I mean, if three of these four digraphs don't even exist in Hungarian, and "dz" is the only encoded Hungarian digraph, it's pretty obvious that the fact that it was encoded is only a coincidence?
For anyone wondering, this doesn't seem to be a problem for Java's toLowerCase and toUpperCase.
Small caps?
title case for digraphs
Solution: NFKD. It's the equivalent of type-casting, but for Unicode.
https://unicode.org/reports/tr15/#Norm_Forms
This changes the original text, though, which might not always be suitable.
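Concretely, in Python's unicodedata: NFKC turns the one-codepoint digraph into two letters, while NFKD additionally pulls the caron off as a combining mark; both really do change the original text:

```python
import unicodedata

dz_title = "\u01c5"  # ǅ, a single codepoint

print(unicodedata.normalize("NFKC", dz_title))       # "Dž": two codepoints
print(len(unicodedata.normalize("NFKD", dz_title)))  # 3: D, z, combining caron
```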
So taking the first character of a word and uppercasing it is wrong because you'd get "dzen" -> "DZen".
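With the single-codepoint digraph, Python already distinguishes the naive approach from the titlecase mapping (str.capitalize() has used titlecase for the first character since Python 3.8):

```python
word = "\u01c6en"  # "ǆen", with a leading single-codepoint dž

naive = word[0].upper() + word[1:]  # uppercases the whole digraph: "Ǆen"
better = word.capitalize()          # uses the titlecase form:      "ǅen"
print(naive, better)
```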
I really wish the Unicode consortium would learn to say "No". If you added a three-letter letter to your alphabet, you can probably make do with three letters in your text files.
There are so many characters with little to no utility and weird properties that seem to exist just to trip up programs attempting to commit the unforgivable sin of basic text manipulation.
This is just your monoculture speaking. Transliterations between alphabets are actually mentioned in the article, did you read it? Nobody added anything to their alphabet, alphabets are invented and then grow and shrink organically.
Bringing up "monoculture" here is hilarious, as this whole situation is a direct consequence of a people attempting to enforce just that by replacing their native Cyrillic alphabet with the Latin one.
My native language also happens to use a Cyrillic alphabet and has letters that would translate to multiple ones in the Latin alphabet:
ш -> sh
щ -> sht
я -> ya
Somehow we manage to get by without special sh, sht, and ya Unicode characters, weird.
There are other ways around without making the standard impossible to get right. Great, we have a standard that can cope with any alphabet... oh pity that it is impossible to write programs that use it correctly.
It's tricky, but that's why, nearly all of the time, you should use standard libraries. E.g., in Python, ".upper()" and ".capitalize()" do the work for you.
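Agreed, though even the standard library's results can surprise; a few checks worth knowing about (Python, per the Unicode case mappings):

```python
# .upper() applies the full mappings, so lengths can change...
assert "straße".upper() == "STRASSE"

# ...and the round trip doesn't come back:
assert "STRASSE".lower() != "straße"

# For caseless comparison, prefer casefold() over lower():
assert "straße".casefold() == "STRASSE".casefold()
print("ok")
```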
While I do have some reservations about Unicode, I think it's important to note that nobody forces you to deal with all of it. I think programmers should embrace the idea of picking subsets of Unicode that they know how to handle correctly, instead of trying (and failing) to handle everything. DIN 91379 is one good example: https://en.wikipedia.org/wiki/DIN_91379
Incidentally, I believe this is kinda also the approach HN takes; there is at least some Unicode filtering going on here.
I agree in some cases, but note that lots of the ugly and weird things in Unicode are there for backwards compatibility with older encodings.
The purpose of Unicode is to encode written text. There's an inherent level of complexity that comes with that, like the fact that not all languages obey the same rules as English. If you don't want to deal with text from other systems, don't accept anything except ASCII/the basic Latin block and be upfront about it.