The WTF-8 Encoding

(simonsapin.github.io)

35 points | by pabs3 2 days ago

16 comments

  • HelloUsername 3 hours ago

    Related:

    Charset="WTF-8" (xn--stpie-k0a81a.com) 25-nov-2024 437 comments https://news.ycombinator.com/item?id=42226953

    The WTF-8 encoding (simonsapin.github.io) 27-may-2015 104 comments https://news.ycombinator.com/item?id=9611710

  • fanf2 3 hours ago

    I regret never finishing my April 1st RFC which is closely based on the UTF-8 RFC salted with a lot of frustration about bad interoperability https://fanf2.user.srcf.net/hermes/doc/qsmtp/draft-fanf-wtf8... (I never wrote the jokes for section 7)

  • CodesInChaos 3 hours ago

    > WTF-8 is a hack intended to be used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons.

    > Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet.

    I strongly disagree with that part. When you need to be able to serialize every possible Windows filename, WTF-8 is a great choice. This could be a backup tool, or an NTFS driver for Linux.

    I also think Rust's serde should always serialize OsString as a bytestring, using WTF-8 on Windows, instead of the system-dependent union of u16/u8 sequences it currently uses.

    • Rygian 3 hours ago

      The way I read the "Intended Audience", I think the use cases you mention are non-goals for WTF-8:

      > There is no and will not be any encoding label [ENCODING] or IANA charset alias [CHARSETS] for WTF-8.

      The goal is to ensure WTF-8 remains fully contained, so that ill-formed strings don't end up processed by systems that expect well-formed strings.

      If you need to serialize every possible Windows filename, then you must also own the corresponding de-serializer (i.e. make your solution self-contained), and cannot expect users to work with the serialized contents using tools you do not control.

    • RedShift1 3 hours ago

      Which characters are not available in UTF-8 that warrant using WTF-8?

      • badmintonbaseba 2 hours ago

        Invalid UTF-16 with unpaired surrogates. Put another way, WTF-8 is an alternate encoding of arbitrary 16-bit code unit sequences (i.e. UCS-2). The subset that is well-formed UTF-16 encodes to valid UTF-8 under WTF-8. The encoding is invertible: valid UTF-8 decodes to well-formed UTF-16, and any other WTF-8 sequence decodes back to the original ill-formed code units.
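The invertibility claim can be sketched in Python, whose `surrogatepass` error handler happens to produce the same bytes as WTF-8 for unpaired surrogates (an illustration under that assumption, not a reference implementation of the spec):

```python
# Ill-formed UTF-16 (big-endian): "hi" followed by a lone high
# surrogate, 0xD800 -- strict UTF-8 has no encoding for it.
ill_formed = b"\x00h\x00i\xd8\x00"

# "surrogatepass" tolerates lone surrogates, so the 16-bit code
# units round-trip through a WTF-8-style byte encoding.
s = ill_formed.decode("utf-16-be", "surrogatepass")   # 'hi\ud800'
wtf8 = s.encode("utf-8", "surrogatepass")             # b'hi\xed\xa0\x80'
assert wtf8.decode("utf-8", "surrogatepass") == s     # invertible

# Strict UTF-8 refuses the same string: the surrogate is unencodable.
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 rejects the lone surrogate")
```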

      • chrismorgan 2 hours ago

        Just read the abstract:

        > WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired.
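The "superset of UTF-8" part of the abstract is easy to check in Python (again using `surrogatepass` as a stand-in for a WTF-8 encoder, which is an assumption of this sketch): well-formed text encodes to identical bytes either way.

```python
# For well-formed text (no lone surrogates), a WTF-8-style encoder
# emits exactly the same bytes as plain UTF-8 -- the encodings only
# diverge on unpaired surrogate code points.
text = "héllo \U0001F30D"  # ASCII, Latin range, and an astral-plane emoji
assert text.encode("utf-8", "surrogatepass") == text.encode("utf-8")
```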

        • RedShift1 2 hours ago

          Ok, but in practice, what does this mean for the characters? Are there certain characters unavailable?

          • chrismorgan an hour ago

            It’s the unpaired surrogate code points. That’s the whole thing. It’s about encoding ill-formed UTF-16, which is distressingly common in the real world.

          • numpad0 an hour ago

            Broken emoji? There are apparently known issues where some frameworks split Unicode strings at the wrong boundaries; maybe the author saw that regularize into a deeper mess.

            • masklinn an hour ago

              It’s not just broken emoji, it’s straight up broken content: UTF-8 can not represent unpaired surrogates.

              WTF-8 is necessary for Rust’s compatibility with Windows filesystems (it underlies OsString on Windows), as e.g. file names are sequences of UTF-16 code units and thus may contain unpaired surrogates.

  • zelphirkalt 3 hours ago

    This encoding should be twice as good as UTF-8.

  • Aardwolf 2 hours ago

    Why do surrogates even exist? UTF-8 is a code to represent roughly 21-bit integers, and UTF-16 is another code that represents roughly 21-bit integers.

     Somehow UTF-16 reserves some of those decoded integer values (instead of solving whatever problem it had within its encoding itself)

    The fact that UTF-8 didn't need to also destroy some output integer values to work proves it's not necessary to do that

    Encoding and decoded value should be separate concerns

    That's like having a mathematical encoding of integers that's like base 10, but for some reason you decide that integer values 100 to 110 are reserved and may never be used by anyone, not even other legit encodings like regular base 10

    • thristian 2 hours ago

      The fact that U+D800-U+DFFF are reserved means that it's generally pretty easy to distinguish UTF-16 text from UCS-2 text - if you spot even one 16-bit value in that reserved range, it should be UTF-16.

      This property is not true of UTF-8 - if you get a byte-string with bytes between 0x80 and 0xFF, it might be UTF-8, or it might be one of a bunch of other encodings, you need to do a more involved check to be sure.

      Granted, the presence of a value between 0xD800 and 0xDFFF does not guarantee that the text is UTF-16, which is why this "WTF-8" encoding exists. But confusion would be a whole lot more likely if the U+D800-U+DFFF range were not reserved.
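The heuristic described above can be sketched as follows (`looks_like_utf16` and its big-endian assumption are invented for this toy illustration):

```python
import struct

SURROGATE_LO, SURROGATE_HI = 0xD800, 0xDFFF

def looks_like_utf16(data: bytes) -> bool:
    """Toy heuristic: any 16-bit unit in the reserved surrogate range
    suggests UTF-16 rather than UCS-2 (assumes big-endian units)."""
    units = struct.unpack(f">{len(data) // 2}H", data)
    return any(SURROGATE_LO <= u <= SURROGATE_HI for u in units)

# U+10348 needs a surrogate pair in UTF-16, so the check fires;
# BMP-only text contains no units in the reserved range.
assert looks_like_utf16("\U00010348".encode("utf-16-be"))
assert not looks_like_utf16("hi".encode("utf-16-be"))
```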

    • manwe150 2 hours ago

      UTF-8 has many similar problems with malformed sequences, such as overlong encodings. A scheme like this one is needed if you want to handle arbitrary bytes as almost-UTF-8, instead of treating them as (often inaccurate) Latin-1 as is commonly done. (Julia's basic String type has such an ability, for a reference point.)
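The overlong-encoding point can be seen directly in Python, whose strict UTF-8 decoder rejects them; the lossless "almost UTF-8" fallback is shown here with `surrogateescape`, which is loosely analogous to what the comment describes for Julia (not Julia's actual mechanism):

```python
# 0xC0 0x80 is an "overlong" two-byte encoding of U+0000; UTF-8
# requires the shortest form, so strict decoders must reject it.
overlong_nul = b"\xc0\x80"
try:
    overlong_nul.decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)

# A lossless fallback: surrogateescape smuggles the raw bytes
# through as lone low surrogates, and the round trip is exact.
s = overlong_nul.decode("utf-8", "surrogateescape")
assert s == "\udcc0\udc80"
assert s.encode("utf-8", "surrogateescape") == overlong_nul
```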

    • Leszek 2 hours ago

      Because Unicode 1.0 had already defined characters both at the start and end of the 16-bit range (https://www.unicode.org/versions/Unicode1.0.0/CodeCharts1.pd...), and UCS-2/UTF-16 had to be compatible with that.