MessageFormat: Unicode standard for localizable message strings

(github.com)

76 points | by todsacerdoti 3 hours ago

35 comments

jp1016 2 hours ago

One practical thing I appreciated about MessageFormat is how it eliminates a bunch of conditional UI logic.

I used to write switch/if blocks for:

• 0 rows → “No results”
• 1 row → “1 result”
• n rows → “{n} results”

Which seems trivial in English, but gets messy once you support languages with multiple plural categories.

I wasn’t really aware of how nuanced plural rules are until I dug into ICU. The syntax looked intimidating at first, but it actually removes a lot of branching from application code.

I’ve been using an online ICU message editor (https://intlpull.com/tools/icu-message-editor) to experiment with plural/select cases and different locales; it helped me understand edge cases much faster than reading the spec alone.
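
The branching described above can move out of application code and into message data. A minimal sketch in Python, assuming only English plural rules; `english_category`, `format_plural` and the `RESULTS` table are hypothetical names for illustration, not part of any MessageFormat library:

```python
def english_category(n: int) -> str:
    """Plural category for an English integer count: 'one' for 1, else 'other'."""
    return "one" if n == 1 else "other"


def format_plural(n, variants, category=english_category):
    """Pick a variant in the order ICU plural selection uses:
    an exact '=N' key first, then the plural category, then 'other'."""
    for key in (f"={n}", category(n), "other"):
        if key in variants:
            # '#' stands for the number, as in ICU plural messages.
            return variants[key].replace("#", str(n))
    raise KeyError("variants must include an 'other' entry")


# The three cases from the list above, expressed as data instead of if/else.
RESULTS = {"=0": "No results", "one": "# result", "other": "# results"}

for n in (0, 1, 7):
    print(format_plural(n, RESULTS))
```

Supporting another language then means swapping the category function and the variant table, while the calling code stays the same.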

Vinnl an hour ago

This post shows a lot of the challenges with localisation that many seemingly simple tools don't have an answer to: https://hacks.mozilla.org/2019/04/fluent-1-0-a-localization-...

(Fluent informed much of the design of MessageFormat 2.)

draw_down an hour ago

Indeed, if only it were as simple as “{n} rows”.

I18n / l10n is full of things like this, important details that couldn’t be more boring or fiddly to implement.

Joker_vD 13 minutes ago

Which is why Windows UI is littered with language like "number of rows: {n}".

chokma 28 minutes ago

This reminds me of https://perldoc.perl.org/Locale::Maketext::TPJ13

Seems like to get it right for every use case / language, you would need functions to translate phrases - so switch statements may be a valid solution. The number of text elements needed for pagination, CRUD operations and similar UI elements should be finite :)

pferde 2 hours ago

Hasn't gettext had this for decades? https://www.gnu.org/software/gettext/manual/html_node/Plural...

Muromec 2 hours ago

Gettext has everything; it just takes knowing five languages to understand what to use it for.

Sharlin 31 minutes ago

Yeah, some sort of pluralization support is pretty much the second most important feature in any message localization tool, right after the ability to substitute externally-defined strings in the first place. Even in a monolingual application, spamming plural formatting logic in application code isn't exactly the best practice.

iririririr 5 minutes ago

that's a lazy feature. dealing with this on the front end is the right thing so you can have rich empty states anyway.

Muromec 2 hours ago

I checked the spec and don't really get it. Something should specify the formula for choosing the correct form (i.e. the form used for 1 is also used for 21 in Slavic languages), and the format isn't any better than the gettext of 30 years ago.
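
For comparison, a gettext catalog carries that formula in its `Plural-Forms` header. A minimal sketch, transcribing into Python the expression the gettext manual gives for Russian-type languages (`n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2`); the function name and the word list are illustrative assumptions:

```python
def russian_plural_index(n: int) -> int:
    """Index of the plural form for n, following the gettext Plural-Forms
    expression for Russian (nplurals=3)."""
    if n % 10 == 1 and n % 100 != 11:
        return 0  # 1, 21, 31, ... but not 11
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1  # 2-4, 22-24, ... but not 12-14
    return 2      # everything else: 0, 5-20, 25-30, ...


# Illustrative forms of the word "file", in the order msgstr[0..2] would hold them.
FORMS = ["файл", "файла", "файлов"]

for n in (1, 2, 5, 11, 21, 22):
    print(n, FORMS[russian_plural_index(n)])
```

Note that 21 selects the same form as 1, which is the case raised above.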

gcr an hour ago

This confused me too but the formula and rules for variants are specified by the configured language out-of-band, so there is support for this.

Let's take your example. In English, counting files looks like this:

    You have {file_count, plural,
       =0 {no files}
       one {1 file}
       other {# files}
    }

In Polish, there are several possible variants depending on the count:

    Masz 1 plik
    Masz 2,3,4 pliki
    Masz 5-21 pliko'w
    Masz 22-24 pliki
    Masz 25-31 pliko'w

Your Polish translators would write:

    Masz {file_count, plural,
       one {# plik}
       few {# pliki}
       other {# pliko'w}
    }

The library (and your translators) know that in Polish, the `few` variant kicks in when `i%10 = 2..4 && i%100 != 12..14`, etc. I think the library just knows these rules for each language as part of the standard. Mozilla says that it was an explicit design goal to put "variant selection logic in the hands of localizers rather than developers"

The point is that it's supported, it simplifies developer logic, and your translators know how to work with it.
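
To make the out-of-band rule concrete, here is a minimal Python sketch of the Polish selection for integer counts. It hand-codes the CLDR conditions rather than calling a real library, and `polish_category`, `format_pl` and `PLIKI` are hypothetical names. For integers, CLDR assigns the remaining counts to `many`; the message above has no `many` variant, so those counts fall back to the required `other` variant:

```python
def polish_category(n: int) -> str:
    """CLDR cardinal plural category for a non-negative integer in Polish."""
    if n == 1:
        return "one"
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    return "many"  # all remaining integers; CLDR reserves 'other' for fractions


def format_pl(n: int, variants: dict) -> str:
    """Select a variant by category, falling back to 'other' when it is missing."""
    template = variants.get(polish_category(n), variants["other"])
    return "Masz " + template.replace("#", str(n))


# Same variants as the translator's message, written with the letter ó.
PLIKI = {"one": "# plik", "few": "# pliki", "other": "# plików"}

for n in (1, 3, 5, 12, 22):
    print(format_pl(n, PLIKI))
```

The developer only passes the count; which variant applies is decided by the locale's rule and the translator's variant list.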

See https://www.unicode.org/cldr/charts/48/supplemental/language...

(Apologies if I got the above translation strings wrong, I don't speak Polish. Just working from the GNU gettext example.)

npodbielski 43 minutes ago

usually it is ó instead of o' but otherwise very good :)

1116574 2 hours ago

Looks a lot like Mozilla's Project Fluent, at least in the basic use case.

https://projectfluent.org/

I wonder why it hasn't been adopted more widely.

Vinnl an hour ago

Yes, Fluent informed much of the design of MessageFormat. See this FOSDEM talk: https://archive.fosdem.org/2023/schedule/event/mozilla_intme...

xeeeeeeeeeeenu 2 hours ago

Here's a comparison between the two on Fluent's wiki: https://github.com/projectfluent/fluent/wiki/Fluent-and-ICU-...

It seems the last edit of the page was in 2019, so I'm not sure how up to date it is.

Vinnl 30 minutes ago

Yeah it's actually MessageFormat 2 [1] that's very informed by Fluent's design I believe; I think that comparison is to "normal" MessageFormat.

[1] https://messageformat.unicode.org/

hobofan 2 hours ago

There seems to be a strong overlap of people behind both projects, so that likely explains the similarities.

xnorswap 2 hours ago

I often wonder this myself, this really should be a standard by now.

hobofan 2 hours ago

I can't speak for the status quo, but for at least the first ~5 years (so until 3 years ago when I last attempted to use it), the JS implementation of Fluent was a mess. Constant issues with an incomplete API, wrong TS typings (which at that point were external) and build/bundling issues, to the point where we opted for a homebrew solution.

I imagine that I probably wasn't the only one driven away by that (and I gave it many attempts!).

creshal an hour ago

The standard is, for better or worse, gettext; it's good enough that any attempt to replace it runs into the problem that people can't agree on how much better an alternative needs to be to be worth migrating to; so you get a constant churn that so far hasn't seen any clear winner.

revetkn 2 hours ago

My project Lokalized attempts to solve many of these complex plural/gender/ordinal/etc. rules with a tiny expression language:

https://lokalized.com

frizlab 2 hours ago

Same here (linked to a test because I don’t have a (meaningful) readme…)

That being said your project looks very cool!

https://github.com/Frizlab/XibLoc/blob/e85a5179bdd93e0174731...

BoppreH 3 hours ago

The meeting notes in the repo were a nice surprise. Overall it looks great, striking a good balance.

  .input {$var :number maximumFractionDigits=0}
  .local $var2 = {$var :number maximumFractionDigits=2}
  .match $var2
  0 {{The selector can apply a different function to {$var} for the purposes of selection}}
  * {{A placeholder in a pattern can apply a different function to {$var :number maximumFractionDigits=3}}}

Oof, that's a programming language already. And new syntax to be inevitably iterated on. I feel like we have too many of those already, from Python f-strings to template engines.

I hope it'll at least stay small: no nesting, no plugins, no looping, no operators, no side effects or calls to external functions (see Log4J).

silvestrov 2 hours ago

English has just singular and plural: one car, two cars, three cars (and zero cars).

Some languages have more variations. E.g. Czech and Russian treat 1, 2-4 and 5+ as different cases (Slovene additionally has a dual for 2).

Personally I think the syntax is too brittle. It looks too much like TeX code and it has the Lisp-like problem of lines ending with too many } braces.

I would separate it into two cases: simple strings with just simple interpolation, and then a fuller markup language, more like a simplified XML.

There is more example code at https://github.com/unicode-org/message-format-wg/blob/main/d...

BoppreH 2 hours ago

Oh, the language aspect gets a lot worse than that. They explicitly have a non-goal of "all grammatical features of all languages", but the "common" cases are hard enough. From https://github.com/unicode-org/message-format-wg/blob/main/s... :

  .local $hasCase = {$userName :ns:hasCase}
  .match $hasCase
  vocative {{Hello, {$userName :ns:person case=vocative}!}}
  accusative {{Please welcome {$userName :ns:person case=accusative}!}}
  * {{Hello!}}

But if anyone can find a good compromise, it's the Unicode team.

alexchamberlain 30 minutes ago

Apologies if this is obvious and I missed it. Does this define a way to store the strings in various languages?

strogonoff 2 hours ago

Does anyone know the ETA of MessageFormat 2.0? I have been aware of the effort since pre-COVID times. I recall that some of the developers behind Mozilla Fluent have been among the people working on MF 2.0, and it’d be great to know whether Fluent and ICU MF are going to be interoperable in the foreseeable future.

Vinnl 25 minutes ago

IIRC, the goal was for Fluent to have a converter or something to be able to work with MessageFormat 2.0, but I don't quite remember where I heard that. My approach has just been to stick to Fluent for now.

Brosper an hour ago

I discovered it working in https://tolgee.io but I am kind of surprised it boomed today :D

What I can say is that it's a well-maintained format but also kinda hard to learn.

rocqua 3 hours ago

This seems great in concept, and totally infeasible. But if anyone can do it, Unicode seems like a great candidate.

Does anyone have reason for more optimism?

hobofan 3 hours ago

Care to explain why you think it's infeasible? Then one could provide targeted counter-optimism ;)

I don't see what's infeasible about it. It doesn't seem too different from .po files (gettext catalogs) meshed with hooks for post-processing as you would see in e.g. Handlebars, both of which have individually found great adoption.

bmn__ 2 hours ago

> why you think it's infeasible?

GP based his opinion on the assumption that this spec is new and that no implementations of it exist.

junon 2 hours ago

Unicode consortium already manages a ton of language specs. If there's any group of folks I'd trust to understand languages (natural or otherwise), it's them.

tuyiown 2 hours ago

I've been using this format for almost 10 years, and I only see increasing adoption. Why would I be pessimistic?

bmn__ 2 hours ago

Looking for an expert who knows both libintl/Gettext and MessageFormat.

What is the equivalent of xgettext.pl, the file extension for the main catalog file `.po`, the __ function?

How does gender work (small example)? How does layering pt_BR on pt_PT work?

What is a compelling reason to switch?