Rich Text, Poor Text (2013)

(laemeur.sdf.org)

57 points | by SerCe 3 months ago

29 comments

II2II 3 months ago

You pretty much need to use markup (or control codes) for rich text. Take bold, italic, underline, strikeout: those four can be, and are, used in nearly any combination. You would need one bit for each of them. You would need two bits to specify four levels of headings. If you don't allow for that, you are back to using markup. You would also need one bit to specify proportional/fixed width font, because that is a thing too. The remaining bit would have to be used for superscript, since superscripts are commonly used for footnotes and simple mathematical expressions.
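A minimal Python sketch of the one-byte layout described above. The bit positions and the names (`Style`, `pack`, `unpack`) are assumptions made for illustration; the comment only says how many bits each feature needs, not where they go.

```python
from enum import IntFlag


class Style(IntFlag):
    """Independent on/off attributes; the bit positions are assumed."""
    BOLD = 0x01
    ITALIC = 0x02
    UNDERLINE = 0x04
    STRIKEOUT = 0x08
    FIXED_WIDTH = 0x40
    SUPERSCRIPT = 0x80


HEADING_SHIFT = 4    # bits 4-5 hold the 2-bit heading field
HEADING_MASK = 0x30


def pack(style: Style, heading: int = 0) -> int:
    """Pack the flags and a 2-bit heading field (0-3) into one byte."""
    if not 0 <= heading <= 3:
        raise ValueError("heading field must fit in two bits (0-3)")
    return int(style) | (heading << HEADING_SHIFT)


def unpack(byte: int) -> tuple[Style, int]:
    """Split a style byte back into its flags and heading field."""
    heading = (byte & HEADING_MASK) >> HEADING_SHIFT
    return Style(byte & ~HEADING_MASK & 0xFF), heading


byte = pack(Style.BOLD | Style.ITALIC, heading=2)
print(f"{byte:08b}")  # 00100011
```

Because the four basic attributes each get their own bit, any combination of them is representable; the heading field is the only part that behaves like an enumerated value rather than a set of flags.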

Okay, you can now create passable rich text documents for a limited (though common) range of purposes with that 8/24-bit breakdown that was suggested. But you may have noticed the author mentioned subscripts, which weren't in my list. Well, it turns out that subscript and superscript have a terribly limited range of applications if you are specifying them per character: x^2^2 would be visually identical to x^22, and x^a_b would look different from x_b^a (with both presentations being nonsensical). The use of subscripts and superscripts in any technical applications would be severely limited. You need a much richer markup language to be truly expressive. So there really isn't much of a point in offering subscripts. Superscripts, sure, because they have a few non-technical uses.

Yet the reality is that people want a much richer set of formatting options. At a minimum, they want to select fonts and font sizes. Some of the formatting options have semantics. I know I crammed four levels of headings in those eight bits, but that only makes sense in headings. It doesn't make sense to specify it per character. Then there are other common document elements, like tables. You can create decent tables using monospaced fonts, but that is limiting and would produce undesirable results in some cases (try displaying April 5^th sensibly, using a monospace font so that it won't affect the width of the columns). On top of that, you are ditching the concept of styles because that implies some sort of markup.

mmooss 3 months ago

Also, different languages have different formatting varieties. 256 combinations don't seem like nearly enough.

Note that is 256 combinations. If you want both bold and italics, either it's one of the 256 combinations, separate from the bold-only combination and from the italics-only combination, or you need another 8 bits for each option.

I think HN made a very aesthetically pleasing decision to exclude bold and underline. Imagine the appearance of comment pages if those were options.

chias 3 months ago

On the topic of subscripts and superscripts, it's also worth noting that `2` and `2` are different things. And, in practice, they can be further nested.

tinthedev 3 months ago

Hah, I was about to criticise the text for far too lightly conflating markup and punctuation, just to see the afterword.

I actually do think the author has a point, in that most solutions today are inelegant, but I also don't think this is a problem which has a real elegant solution. Where to draw the line? Why not encode fonts into the standard too, if we're doing bold? Etc.

I'm still mostly in favour of keeping everything markdown (in my own writing), however much it pollutes the "purity" of text.

astrobe_ 3 months ago

Yes, it's not markup but typesetting [1]. Well before 2013 people used to use *stars*, _underscores_ or /slashes/ in Usenet forums or mailing lists to mimic typesetting, which led to Markdown.

The name still maintains the confusion as it tries to be an alternative to markup systems such as HTML which had the purpose to introduce semantic clues for computers.

We all know how it went; the semantic part was entirely thrown away and markup was thoroughly abused for layout (HTML tables before CSS - CSS which also has little to do with "style" and more to do with typesetting and layout), as no browser today can just show a table of contents based on the HTML title tags.

[1] https://en.wikipedia.org/wiki/Typesetting

hello_computer 3 months ago

This person is confused. He's citing a Ted Nelson paper about separating these things into layers (content, structure, & special effects), while personally advocating that we mash it all into unicode.

https://www.xml.com/pub/a/w3j/s3.nelson.html

LegionMammal978 3 months ago

Nelson's arguments sound odd to me. He says that embedded markup is bad for WYSIWYG editors since they have to maintain a connection between the raw and formatted text streams (which can have different character counts, etc.), but out-of-line styling would similarly need careful implementation work to keep it synchronized with the text stream at all times, even with concurrent editing and other such features.

(Cf. how the cross-reference stream in PDF files makes it painful to edit objects in them, even when the files are nominally encoded in plaintext.)

He then goes into how a separate styling layer can assist with transcluding text from other people's work while modifying the style. But style variations are hardly the only legitimate changes typically made to direct quotations: people often want to modify capitalization or punctuation, elide portions, or insert bracketed notes. And at that point, you're modifying the content as well as the styling, so style-only modifications would be very limiting for that use case.

As for the structure layer, this would have the same issues as every other attempt in the last three decades to create a semantic web or whatever. Authors don't want to spend their time carefully curating metadata that 99.9% of readers won't care about, while bad actors want to game their relevancy metrics through any mechanism available.

hello_computer 3 months ago

I think anyone who has done the work quickly realizes all of that (i.e. no point kicking Ted while he’s down). Just thought it odd that the article is citing Ted to endorse the anti-Ted.

lewisjoe 3 months ago

The lack of a universally recognized rich-text format is really a problem. Why? Practically any rich text that needs to be rendered across platforms (web and mobile devices) is now stored as HTML, Markdown, or app-dependent JSON.

HTML was never envisioned as a cross-platform rich-text format, and Markdown lacks almost half of all formatting features. Specialized JSON is even more evil because the content becomes unrenderable when the parent app goes out of existence.

OP's suggestion (accommodating formatting as Unicode bytes) might not be optimal; however, I'm happy at least somebody thought of this as a problem to solve.

ht_th 3 months ago

The odd thing is, you can do quite some bold/italics/superscript in Unicode nowadays. Because, at least from the ASCII letter range, they have been used in symbolic ways in Mathematics, etc., and have been added to Unicode as symbols rather than bold variants of letters. For example:

, !

ᴴᵉˡˡᵒ, ᵂᵒʳˡᵈ!

So, there's almost no bold/italic punctuation. And non-ASCII Unicode letters aren't "supported" this way either. But you can get quite far with "formatted" ASCII letters in Unicode, if you're so inclined.
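As a rough illustration of the point above, here is a small Python sketch that maps ASCII letters and digits onto the Mathematical Bold symbols (which start at U+1D400 for capitals, U+1D41A for small letters, and U+1D7CE for digits). The function name `to_math_bold` is made up for this example, and it only covers the bold alphabet, which has no gaps in its code point range; other styles such as italic have exceptions it does not handle.

```python
def to_math_bold(text: str) -> str:
    """Map ASCII letters and digits to Unicode Mathematical Bold symbols."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        elif "0" <= ch <= "9":
            out.append(chr(0x1D7CE + ord(ch) - ord("0")))
        else:
            # Punctuation, spaces, and non-ASCII letters are passed through
            # unchanged, which is exactly the limitation described above.
            out.append(ch)
    return "".join(out)


print(to_math_bold("Hello, World!"))
```

Whether the result displays as bold depends on the font and on whether the site strips these code points.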

ht_th 3 months ago

Of course, hackernews or the font it uses (?), doesn't seem to support the bold and italics Unicode symbols. Although it does seem to support the superscript ones.

wruza 3 months ago

HN actively erases unicode regions to prevent emoji abuse and other zalgoing. Sites and apps do it nowadays, just not with emoji. It's the other side of your point – unicode can do too much and it's not a regular text, so you can't search within that sort of bold, validate, etc. So people choose to work with a subset, which may still leak: https://news.ycombinator.com/item?id=42231608
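One common workaround for the search problem, sketched in Python under the assumption that compatibility normalization (NFKC) is acceptable for the application: fold the "styled" code points back to their plain forms before comparing. The helper names `fold_styled` and `contains` are hypothetical. Note that NFKC also folds other things (ligatures, full-width forms), so it is a lossy step rather than a clean "remove bold" operation.

```python
import unicodedata


def fold_styled(text: str) -> str:
    """Fold compatibility variants (math bold, superscripts, ...) to plain forms."""
    return unicodedata.normalize("NFKC", text)


def contains(haystack: str, needle: str) -> bool:
    """Substring search that ignores Unicode "styling" on both sides."""
    return fold_styled(needle) in fold_styled(haystack)


# "Hello" written with Mathematical Bold code points
styled = "\U0001D407\U0001D41E\U0001D425\U0001D425\U0001D428"

print("Hello" in styled)          # False: different code points
print(contains(styled, "Hello"))  # True after NFKC folding
```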

Tomte 3 months ago

body { font-family:Verdana, Geneva, sans-serif; font-size:10pt; color:#828282; }

td { font-family:Verdana, Geneva, sans-serif; font-size:10pt; color:#828282; }

AlienRobot 3 months ago

People are limited by their tools.

The author believes that plain text should encode bold, italic, etc., because that's all they had exposure to. Were the text written today, they would claim emojis belong in unicode as well.

Most social media don't support it, but on Tumblr, for example, you can specify the color of the text and even choose a different font. I think there was some other social media that allowed you to have animated effects on the text as well, but I forgot the name.

tomxor 3 months ago

> Were the text written today, they would claim emojis belong in unicode as well.

Not sure what you mean, unicode does contain emojis. That's what most platforms use for emojis now.

AlienRobot 3 months ago

But should it contain emoji? I can copy and paste bold text from one rich text editor to another just fine. Why not use XML to encode emoji?

rhet0rica 3 months ago

nextos 3 months ago

Yes, Unicode even defines characters for subindex and superindex. It's quite capable for basic inline math equations.

wruza 3 months ago

Which adds complexity and solves nothing. We'd better have a standard markup (we already have) than this half-assed wannabe-markup that is so complex and a minefield that modern forums tend to filter it anyway. https://en.m.wikipedia.org/wiki/Unicode_subscripts_and_super... – The wikipedia article about unicode sub/superscripts can't even render half of these symbols on any of my {ios,windows,android} devices. In theory we have it, in practice it's dead baggage.

mmooss 3 months ago

Weren't many of those formatting codes - maybe not sub/superindex? - deprecated (but preserved for backward compatibility)?

nextos 3 months ago

I don't think so, but I'm not so familiar with that topic.

Julia uses them quite extensively to make source code closer to math.

timeflex 3 months ago

Sad what things like Markdown have done to people. It's like they forgot about all the amazing semantic markup of HTML 5 to create strong relations between their data. I'll take a Lexical editor with SQLite to store my data any day.

scelerat 3 months ago

HTML5 is a long way from the simple hypertext document format embodied in early versions of HTML. It is that way from necessity: how the web and applications built upon it using HTML have evolved. HTML started off being very simple and very accessible. One just had to grasp the concept of a tag, and that tags had to be nested and closed (sometimes), and you could make a hypertext document to serve on the WWW. One influential cohort of users, application developers, ran hard with HTML, pushed it to its limits, and continuously revised it to be an application rendering language. Anyone who merely wanted to write a hypertext encountered a complex and growing language spec.

Markdown emerged to fulfill that "simple" hypertext document role. If you're writing READMEs and blog posts, you probably don't need more than that. And I think it's more accessible (certainly less error-prone) than HTML for most people.

If you need richer semantics, HTML5 is available. And if semantics are important to you, you're probably still using HTML5 as a rendering layer and your actual semantics are processed, stored and delivered in layers much more purpose-built for that.

kstrauser 3 months ago

I don't think it’s that so much as that all that extra context is overkill in lots of situations. If I'm writing a blog post or a Slack message or my own internal-use note, I probably just want some lightweight formatting. Making rich semantic connections wouldn't have a good payoff for the extra work in those cases.

HankB99 3 months ago

> I'll take a Lexical editor with SQLite to store my data any day.

Do you have tools that do this or an example?

I'm pretty happy with Markdown and mkdocs (on Linux) to manage and format my notes. VS Code does a pretty good job with this providing both a preview and facilitating linking between documents (both file and heading links.) I'm always open to something better.

keepamovin 3 months ago

I like the idea of keeping the presentation out of the content, but keeping it in the character encoding. It's a cool idea. Never thought of it before reading this.

wruza 3 months ago

It's an old idea, as old as at least CGA. Text modes used one byte for attributes, one byte for ascii. But it's still markup and compressing markup into binary is a bad idea in the era of 48-bit address bus.
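For reference, a Python sketch of that two-byte text-mode cell: one character byte followed by one attribute byte, with the attribute split into a 4-bit foreground color, a 3-bit background color, and a blink bit. The function names are invented for this example, and `chr`/`ord` are used as a stand-in for the adapter's own character set, which is an assumption that only holds for plain ASCII.

```python
def pack_cell(char: str, fg: int, bg: int, blink: bool = False) -> bytes:
    """Encode one text-mode cell as (character byte, attribute byte)."""
    code = ord(char)
    if code > 0xFF:
        raise ValueError("character must fit in one byte")
    if not (0 <= fg <= 15 and 0 <= bg <= 7):
        raise ValueError("fg is 4 bits (0-15), bg is 3 bits (0-7)")
    # Attribute layout: bit 7 blink, bits 4-6 background, bits 0-3 foreground
    attr = (int(blink) << 7) | (bg << 4) | fg
    return bytes([code, attr])


def unpack_cell(cell: bytes) -> tuple[str, int, int, bool]:
    """Decode a two-byte cell into (char, fg, bg, blink)."""
    code, attr = cell
    return chr(code), attr & 0x0F, (attr >> 4) & 0x07, bool(attr & 0x80)


cell = pack_cell("A", fg=14, bg=1)  # yellow on blue in the standard palette
print(cell.hex())  # 411e
```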

keepamovin 3 months ago

Thank you for the history lesson! :)

curtisszmania 3 months ago

[dead]