Just scripted something to find them all:
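The script itself didn't make it into the comment; here's a minimal reconstruction (my sketch, not the original), assuming the 31 characters are exactly those with Unicode general category Lt, "Letter, titlecase":

```python
import sys
import unicodedata

# Characters whose general category is "Lt" (Letter, titlecase):
# the cased letters that have a distinct third form.
titlecase_chars = [
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lt"
]

print(len(titlecase_chars))      # 31 as of recent Unicode versions
print(" ".join(titlecase_chars))
```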
You can find them all with this UnicodeSet query (though the query alone naturally won’t show you the lower and upper forms):
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%...
It’s a handy way of finding all kinds of things along these lines. Look at the properties of some characters you care about, and see how you can add, subtract and intersect them.
TIL:
Polytonic orthography (from Ancient Greek πολύς (polýs) 'much, many' and τόνος (tónos) 'accent') is the standard system for Ancient Greek and Medieval Greek and includes:
- acute accent (´)
- circumflex accent (ˆ)
- grave accent (`); these 3 accents indicate different kinds of pitch accent
- rough breathing (῾) indicates the presence of the /h/ sound before a letter
- smooth breathing (᾿) indicates the absence of /h/.
Since in Modern Greek the pitch accent has been replaced by a dynamic accent (stress), and /h/ was lost, most polytonic diacritics have no phonetic significance, and merely reveal the underlying Ancient Greek etymology.
(https://en.wikipedia.org/wiki/Greek_diacritics)
Reminds me of Vietnamese and its use of diacritics to mark tones. Vietnamese also uses diacritical markings to differentiate some vowels.
https://en.wikipedia.org/wiki/Vietnamese_phonology#Tone?wpro...
The other day I posted similar tables/scripts for related character properties and there was some good discussion: https://news.ycombinator.com/item?id=42014045
- Unicode codepoints that expand or contract when case is changed in UTF-8: https://gist.github.com/rendello/d37552507a389656e248f3255a6...
- Unicode roundtrip-unsafe characters: https://gist.github.com/rendello/4d8266b7c52bf0e98eab2073b38...
For example, if we do uppercase→lower→upper, some characters don't survive the roundtrip:
Ω ω Ω
İ i̇ İ
K k K
Å å Å
ẞ ß SS
ϴ θ Θ
I'm using the scripts to build out a little automated-testing generator library, something like "Tricky Unicode/UTF-8 case-change characters". Any other weird case quirks anyone can think of to put in the generators?
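For the generator library, a few of these quirks reproduce directly with CPython's built-in case mappings (other languages' tables may differ slightly):

```python
# upper -> lower -> upper round trips that land on a different character.
for ch in ["\u2126", "\u0130", "\u212a", "\u212b", "\u1e9e", "\u03f4"]:
    low = ch.lower()
    print(ch, "->", low, "->", low.upper())

# Case changes can also change the UTF-8 byte length.
for ch in ["\u0131", "\u2c65"]:  # dotless i, a with stroke
    up = ch.upper()
    print(f"{ch} ({len(ch.encode('utf-8'))} bytes) -> {up} ({len(up.encode('utf-8'))} bytes)")
```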
Seems like a lot of these would be taken care of by normalization though? Pre-composed characters are a bit of a mess.
I do feel it is an error that unit/math symbols get changed; imho they should stay as-is through case conversions.
These lists (and the future library) were made to test normalization and break software that made bad assumptions. I initially generated the list because I knew that some of the assumptions in the parser I was writing were not solid, and sure enough the tests broke it.
Someone pointed out the canonical source, which I'll have to look at more closely:
https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt
The Unicode names of these 31 chars, compressed:
LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
LATIN CAPITAL LETTER {_} WITH SMALL LETTER {_}
  L,J
  N,J
  D,Z
GREEK CAPITAL LETTER {ALPHA,ETA,OMEGA} WITH PROSGEGRAMMENI
GREEK CAPITAL LETTER {ALPHA,ETA,OMEGA} WITH {PSILI,DASIA} AND {_}
  PROSGEGRAMMENI
  VARIA AND PROSGEGRAMMENI
  OXIA AND PROSGEGRAMMENI
  PERISPOMENI AND PROSGEGRAMMENI
What's the difference with letter Ch [0]? When it's capitalized at the beginning of the word, it also looks like uppercase C and lowercase h.
[0]https://en.wikipedia.org/wiki/Ch_(digraph)
There is no single Unicode character representing "Ch".
Here's a list of Unicode digraphs: DZ, Dz, dz, DŽ, Dž, dž, IJ, ij, LJ, Lj, lj, NJ, Nj, nj, ᵺ
https://en.wikipedia.org/wiki/Digraph_(orthography)#In_Unico...
Ch may be a digraph in many languages, but is it implemented in Unicode as a single character?
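One way to check from Python, using unicodedata (the Dz forms exist as single codepoints; a "Ch" analogue does not):

```python
import unicodedata

# The titlecase Dz digraph is a real single codepoint with its own name...
assert unicodedata.name("\u01c5") == (
    "LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON"
)

# ...but looking up the analogous name for "Ch" finds nothing.
try:
    unicodedata.lookup("LATIN CAPITAL LETTER C WITH SMALL LETTER H")
except KeyError:
    print("no single codepoint for Ch")
```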
Uppest case! [0]
[0] Lowestcase and uppestcase letters: Advances in derp learning, Sigbovik 1st April 2021
https://sigbovik.org/2021/proceedings.pdf
> no more than “lav” should match “law” just because the first part of the letter “w” looks like a “v”.
Well of course not, it’s double u, not double v… so maybe “lau” should match “law”!
(That’s one thing French got right. Dooblah vay, double v. (Is there a proper French spelling for that pronunciation? Like how h is aitch in English.))
> Is there a proper French spelling for that pronunciation? Like how h is aitch in English.
No, French doesn't have spellings for the names of its letters.
(I'm a native French speaker.)
W is fairly recent in the official French alphabet and it's officially called "double v".
In Belgium it can be pronounced or heard as "way" (wé), usually for:
- BMW as "bay-hem-way" (bé-m-wé)
- www as "way-way-way" (wé-wé-wé)
- WC as "way-say" (wé-c)
And it's so convenient, too! No letter thus requires several syllables to be pronounced.
It's one thing I keep using from Belgian French despite having lived in Switzerland for over a decade, because it's objectively better.
(Swiss French has the objectively better names for 70-80-90, though. No quatre-vingt-dix BS like in France. :-p)
On Serbian Wikipedia you have an option to automatically transliterate from Cyrillic to Latin script, so I guess this would come up in similar contexts.
In Croatian it doesn't matter, literally nobody uses the digraph Unicode characters because they do not appear on the keyboard. Instead you just write these digraphs as two regular Latin characters: nj, lj and dž.
Strange that this exists. Polish also has dz (it's the same phoneme), along with dź, dż, sz, cz, all of which use title case in, among other instances, acronyms (e.g. RiGCz), but I'm not aware of any special code points for them - dz is definitely always spelled as d-z.
Per the article:
> These digraphs owe their existence in Unicode ... to Serbo-Croatian. Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.
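A toy sketch of that one-to-one property (the mapping table here is deliberately tiny and illustrative; a real transliterator covers the whole alphabet and must also choose between Ǌ and ǋ from context):

```python
# Each Cyrillic letter maps to exactly one Latin codepoint -- even љ, њ and џ,
# thanks to the single-codepoint digraphs ǉ, ǌ and ǆ.
CYR2LAT = str.maketrans({
    "љ": "ǉ", "њ": "ǌ", "џ": "ǆ",
    "е": "e", "г": "g", "о": "o", "ш": "š",
})

word = "његош"
latin = word.translate(CYR2LAT)
print(latin, len(word) == len(latin))  # lengths match: the mapping is reversible
```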
For an intereſting third, deleted caſe of 'S' I preſent to you: ſ (the long-s).
It has the advantage that while ſome programming languages feature “class” as a reſerved word, “claſs” almoſt never is, ſo you can uſe that inſtead of a mis-ſpelling.
Presumably you can also use claß then
claß is much more concise.
The eszett is a ligature of the long ess and a zee. I never understood why Germans "expand" it to double ess.
"In the course of the Second Sound Shift in the 7th and 8th centuries, two different sounds emerged from Germanic /t/ and /tː/ - a fricative and an affricate - both of which were initially rendered with zz. Since Old High German, spellings such as sz for the fricative and tz for the affricate were used to differentiate between them."
"The sound written with ss, which goes back to an inherited Germanic /s/, differed from the sound written with sz; the ss was pronounced as a voiceless alveolo-palatal fricative [ɕ], whereas the sz was pronounced as a voiceless alveolar fricative [s]. Even when these two sounds merged, both spellings were retained. However, they were confused because no one knew anymore where an sz had originally been and where an ss had been."
https://de.m.wikipedia.org/wiki/%C3%9F
Recently they've been able to expand it to the capital ẞ, too:
https://en.wikipedia.org/wiki/%C3%9F
Yes my uſe of this caſe is correct.
I would say, hiragana and katakana, in a way.
Each nominal syllable sound in Japanese can be written using a character in one of these two scripts:
Roman transcription: a i u e o ka ki ku ke ko
Hiragana: あ い う え お か き く け こ
Katakana: ア イ ウ エ オ カ キ ク ケ コ
There are some rough parallels between upper case and katakana.
- Katakana is used less than hiragana; "katakana heavy" text will be something that is loaded with foreign words (like a software manual) or terms from zoology and botany.
- It is sometimes used to denote SHOUTING, like in quoted speech such as cartoon bubbles.
- Some early computing displays in the West could only produce upper case characters; in Japan, some early displays only featured katakana, which stays legible at lower resolution.
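The parallel extends to the encoding: the hiragana and katakana blocks are laid out in the same order, 0x60 codepoints apart, so a rough converter (ignoring a few edge characters outside the common range) is just an offset:

```python
def hira_to_kata(text: str) -> str:
    # Shift hiragana letters (U+3041..U+3096) up by 0x60 into the katakana block.
    return "".join(
        chr(ord(c) + 0x60) if "\u3041" <= c <= "\u3096" else c
        for c in text
    )

print(hira_to_kata("かきくけこ"))  # カキクケコ
```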
Dutch also has a digraph-which-counts-as-a-letter, "ij". But that doesn't get title-cased internally - there is a city called IJmuiden, not Ijmuiden.
As a test for your browser's internationalisation support, try
In Firefox, this displays correctly as "IJmuiden" (thanks to the lang attribute; without that, it would show "Ijmuiden").
Technology is not implemented for internal consistency or to make sense; technological implementations are an artifact of history.
Of course, at the time it made sense to someone.
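Back to IJmuiden: generic capitalization gets the Dutch rule wrong, and special-casing it by hand is easy to sketch (illustrative only, not locale-aware library code):

```python
def dutch_capitalize(word: str) -> str:
    # Dutch treats "ij" as a single letter, so both halves are capitalized together.
    if word[:2].lower() == "ij":
        return "IJ" + word[2:]
    return word.capitalize()

print("ijmuiden".capitalize())       # Ijmuiden -- the generic (wrong) result
print(dutch_capitalize("ijmuiden"))  # IJmuiden
```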
Wait what? He writes "For example, the first ten letters of the Hungarian alphabet are¹", but the note is "I got this information from the Unicode Standard, Version 15.0, Chapter 7: “Europe I”, Section 7.1: “Latin”, subsection “Latin Extended-B: U+0180-U+024F”, sub-subsection “Croatian Digraphs Matching Serbian Cyrillic Letters.”"
Actually it kinda makes sense to have two Latin letters form a digraph if they are used to represent a single Cyrillic letter, while it makes less sense for Hungarian, which (AFAIK) has always been written with Latin letters? I mean, of course you could do it, but then I want an extra Unicode code point for the German "sch" too!
If you look at the whole Hungarian alphabet (https://learnhungarianfromhome.com/wp-content/uploads/2020/0...), you get a total of 8 digraphs and 1 trigraph (plus 9 letters with diacritics), but "Lj" and "Nj" are not among them...
> Access denied [...] The owner of this website (learnhungarianfromhome.com) does not allow hotlinking to that resource
https://en.wikipedia.org/wiki/Hungarian_alphabet
Copy the link and open in a new tab.
From the article:
> These digraphs owe their existence in Unicode not to Hungarian but to Serbo-Croatian. Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.¹
Yeah, but then why bring up Hungarian (which has very little in common with Serbo-Croatian, although spoken in a neighboring country) in the first place?
Because Hungarian is an example of having 3 cases, but only some of the Hungarian digraphs have these 3 cases encoded in Unicode.
Yes, buuuut Serbo-Croatian obviously has those 3 cases too, so he could have made the post much clearer by leaving out Hungarian and only focusing on Serbo-Croatian (or mentioning Hungarian only as an aside). I mean, if three of these four digraphs don't even exist in Hungarian, and "dz" is the only encoded Hungarian digraph, it's pretty obvious that the fact that it was encoded is only a coincidence?
For anyone wondering, this doesn't seem to be a problem for Java's toLowerCase and toUpperCase.
Small caps?
title case for digraphs
Solution: NFKD. It's the equivalent of type-casting, but for Unicode.
https://unicode.org/reports/tr15/#Norm_Forms
This changes the original text, though, which might not always be suitable.
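Concretely, in Python's unicodedata: NFKC turns the one-codepoint digraph into two letters, while NFKD additionally pulls the caron off as a combining mark; both really do change the original text:

```python
import unicodedata

dz_title = "\u01c5"  # ǅ, a single codepoint

print(unicodedata.normalize("NFKC", dz_title))       # "Dž": two codepoints
print(len(unicodedata.normalize("NFKD", dz_title)))  # 3: D, z, combining caron
```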
So taking the first character of a word and uppercasing it is wrong because you'd get "dzen" -> "DZen".
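With the single-codepoint digraph, Python already distinguishes the naive approach from the titlecase mapping (str.capitalize() has used titlecase for the first character since Python 3.8):

```python
word = "\u01c6en"  # "ǆen", with a leading single-codepoint dž

naive = word[0].upper() + word[1:]  # uppercases the whole digraph: "Ǆen"
better = word.capitalize()          # uses the titlecase form:      "ǅen"
print(naive, better)
```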
I really wish the Unicode consortium would learn to say "No". If you added a three-letter letter to your alphabet, you can probably make do with three letters in your text files.
There are so many characters with little to no utility and weird properties that seem to exist just to trip up programs attempting to commit the unforgivable sin of basic text manipulation.
This is just your monoculture speaking. Transliterations between alphabets are actually mentioned in the article, did you read it? Nobody added anything to their alphabet, alphabets are invented and then grow and shrink organically.
Bringing up "monoculture" here is hilarious, as this whole situation is a direct consequence of a people attempting to enforce just that by replacing their native Cyrillic alphabet with the Latin one.
My native language also happens to use a Cyrillic alphabet and has letters that would translate to multiple ones in the Latin alphabet:
ш -> sh
щ -> sht
я -> ya
Somehow we manage to get by without special sh, sht, and ya Unicode characters, weird.
There are other ways around without making the standard impossible to get right. Great, we have a standard that can cope with any alphabet... oh pity that it is impossible to write programs that use it correctly.
It's tricky, but that's why, nearly all of the time, you should use standard libraries. E.g., in Python, ".upper()" and ".capitalize()" do the work for you.
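Agreed, though even the standard library's results can surprise; a few checks worth knowing about (Python, per the Unicode case mappings):

```python
# .upper() applies the full mappings, so lengths can change...
assert "straße".upper() == "STRASSE"

# ...and the round trip doesn't come back:
assert "STRASSE".lower() != "straße"

# For caseless comparison, prefer casefold() over lower():
assert "straße".casefold() == "STRASSE".casefold()
print("ok")
```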
While I do have some reservations about Unicode, I think it's important to note that nobody forces you to deal with all of it. I think programmers should embrace the idea of picking subsets of Unicode that they know how to handle correctly, instead of trying (and failing) to handle everything. DIN 91379 is one good example: https://en.wikipedia.org/wiki/DIN_91379
Incidentally, I believe this is kinda also the approach HN takes; there is at least some Unicode filtering going on here.
I agree in some cases, but note that lots of the ugly and weird things in Unicode are there for backwards compatibility with older encodings.
The purpose of Unicode is to encode written text. There's an inherent level of complexity that comes with that, like the fact that not all languages obey the same rules as English. If you don't want to deal with text from other systems, don't accept anything except ASCII/the basic Latin block and be upfront about it.