Why the Sanitizer API is just `setHTML()`

(frederikbraun.de)

90 points | by birdculture 2 days ago ago

33 comments

  • brainbag 4 hours ago

    With context, this article is more interesting than the title might imply.

    > The Sanitizer API is a proposed new browser API to bring a safe and easy-to-use capability to sanitize HTML into the web platform [and] is currently being incubated in the Sanitizer API WICG, with the goal of bringing this to the WHATWG.

    Which would replace the need for sanitizing user-entered content with libraries like DOMPurify by having it built into the browser's API.

    The proposed specification has additional information: https://github.com/WICG/sanitizer-api/

    • crote 3 hours ago

      Yeah, I was expecting something closer to "because that's what people Google for".

      A big part of designing a security-related API is making it really easy and obvious to do the secure thing, and hide the insecure stuff behind a giant "here be dragons" sign. You want people to accidentally do the right thing, so you call your secure and insecure functions "setHTML" and "setUnsafeHTML" instead of "setSanitizedHTML" and "setHTML".

    • mubou2 4 hours ago

      The author really needs to start with that. They say "the API that we are building" and assume I know who they are and what they're working on, all the way until the very bottom. I just assumed it's some open source library.

      > HTML parsing is not stable and a line of HTML being parsed and serialized and parsed again may turn into something rather different

      Are there any examples where the first approach (sanitize to string and set inner html) is actually dangerous? Because it's pretty much the only thing you can do when sanitizing server-side, which we do a lot.

      Edit: I also wonder how one would add for example rel="nofollow noreferrer" to links using this. Some sanitizers have a "post process node" visitor function for this purpose (it already has to traverse the dom tree anyway).

      • crote 3 hours ago

        > Are there any examples where the first approach (sanitize to string and set inner html) is actually dangerous?

        The article links to [0], which has some examples of instances in which HTML parsing is context-sensitive. The exact same string being put into a <div> might be totally fine, while putting it inside a <style> results in XSS.

        [0]: https://www.sonarsource.com/blog/mxss-the-vulnerability-hidi...

      • LegionMammal978 3 hours ago

        They had a link in their post [0]: it seems like most of the examples are with HTML elements with wacky contextual parsing semantics such as <svg> or <noscript>. Their recommendation for server-side sanitization is "don't, lol", and they don't offer much advice regarding it.

        Personally, my recommendation in most cases would be "maintain a strict list of common elements/attributes to allow in the serialized form, and don't put anything weird in that list: if a serialize-parse roundtrip has the remote possibility of breaking something, then you're allowing too much". Also, "if you want to mutate something, then do it in the object tree, not in the serialized version".

        [0] https://www.sonarsource.com/blog/mxss-the-vulnerability-hidi...

        • tlb 3 hours ago

          setHTML needs to support just about every element if it's going to be the standard way of rendering dynamic content. Certainly <svg> has to work or the API isn't useful.

          SanitizeHTML functions in JS have had big security holes before, around edge cases like null bytes in values, or what counts as a space in Unicode. Browsers decided to be lenient in what they accept, so that means any serialize-parse chain creates some risk.

          • LegionMammal978 2 hours ago

            If you're rendering dynamic HTML, then either the source is authorized to insert arbitrary dynamic content onto the domain, or it isn't. And if it isn't, then you'll always have a hard time unless you're as strict as possible with your sanitization, given how many nonlocal effects can be embedded into an HTML snippet.

            The more you allow, the less you know about what might happen. E.g., <svg> styling can very easily create clickjacking attacks. (If I wanted to allow SVGs at all, I'd consider shunting them into <img> tags with data URLs.) So anyone who does want to use these more 'advanced' features in the first place had better know what they're doing.

        • mubou2 3 hours ago

          Ah, I see what they're talking about. That's a good article; my brain totally skipped over that link. Thanks.

      • tobr 3 hours ago

        > They say "the API that we are building" and assume I know who they are and what they're working on, all the way until the very bottom.

        This is a common and rather tiresome critique of all kinds of blog posts. I think it is fair to assume the reader has a bit of contextual awareness when you publish on your personal blog. Yes, you were linked to it from a place without that context, but it’s readily available on the page, not a secret.

        • mubou2 3 hours ago

          Well that's... certainly a take. But I have to disagree. Most traffic coming to blog posts is not from people who know you and are personally following your posts, they're from people who clicked a link to the article someone shared or found it while googling something.

          It's not hard to add one line of context so readers aren't lost. Here, take this for example, combining a couple parts of the GitHub readme:

          > For those who are unfamiliar, the Sanitizer API is a proposed new browser API being incubated in the Sanitizer API WICG, with the goal of bringing this to the WHATWG.

          Easy. Can fit that in right after "this blog post will explain why", and now everyone is on the same page.

          • swiftcoder 3 hours ago

            > Most traffic coming to blog posts is not from people who know you and are personally following your posts

            Do we have data to back that up? Anecdotally the blogs I have operated over the years tend to mostly sustain on repeat traffic from followers (with occasional bursts of external traffic if something trends on social media)

          • tobr 3 hours ago

            > It's not hard

            It’s also not hard to look around for a few seconds to find that information, is my point.

      • masklinn 3 hours ago

        > Are there any examples where the first approach (sanitize to string and set inner html) is actually dangerous?

        The term to look for is “mutation xss” (or mxss).

  • cxr an hour ago

    > This is pretty similar to the Sanitizer that I wanted to build into the browser: […] But that is NOT the Sanitizer we ended up with.¶ And the reason is essentially Mutated XSS (mXSS). To quickly recap, the idea behind mXSS is[…]

    No, the reason is that the problem is underspecified and unsatisfiable.

    The whole notion of HTML "sanitization" is the ultimate "just do what I mean". It's the customer who cannot articulate what they need. It's «Hey, how about if there were some sort of `import "nobugs"`?»

    "HTML sanitization" is never going to be solved because it's not solvable.

    There's no getting around knowing whether or any arbitrary string is legitimate markup from a trusted source or some untrusted input that needs to be treated like text. This is a hard requirement. (And if you already have this information, then the necessary tools have been available for years—decades, even: `innerHTML` and `textContent`—or if you don't like the latter, then it's trivial to write your own `escapeText` subroutine that's correct, well-formed, and sound.) No new DOMPurify alternative or native API baked into the browser is going to change this, ever.

  • bikeshaving 3 hours ago

    This is interesting. The argument which I’m gleaning from the essay is that the old proposed API of having an intermediary new Sanitizer() class with a sanitize(input) method which returns a string is actually insecure because of mutated XSS (MXSS) bugs.

    The theory is that the parse->serialize->parse round-trip is not idempotent and that sanitization is element context-dependent, so having a pure string->string function opens a new class of vulnerabilities. Having a stateful setHTML() function defined on elements means the HTML context-specific rules for tables, SVG, MathML etc. are baked in, and eliminates double-parsing errors.

    Are MXSS errors actually that common?

  • jamesbvaughan 3 hours ago

    Aside from the article's content, I really like the inline exercise for the reader with the hidden/expandable answer section. It's fun and it successfully got me to read the proceeding section more closely than I would have otherwise.

  • cobbal 4 hours ago

    Makes sense. I think this is a variant of the "parse, don't validate" motto, but is more "parse, don't parse-serialize-parse" in the implementation.

  • IshKebab an hour ago

    > Traverse the HTML fragment and remove elements as configured.

    Well this is clearly wrong isn't it? You need a whitelist of elements, not a blacklist. That lesson is at least 2 decades old.

    • jkrems 9 minutes ago

      I mean... "as configured" can me either an allow OR a denylist. That sentence doesn't really prescribe doing it one way or the other..? You have to parse the denylisted elements because they will affect the rest of the parse, so you _have_ to remove them afterwards in the general case.

  • philipwhiuk 3 hours ago

    The downside of a new method is that it leaves innerHtml as a source of future security issues.

    • crote 3 hours ago

      Yes, but you can also easily lint on it: all uses of `context.innerHTML` are now suspect and should get a suggestion to use `context.setHTML` instead.

      With `const clean = DOMPurify.sanitize(input); context.innerHTML = clean;` your linter suddenly needs to do complex code analysis and keep track if each variable passed to `context.innerHTML` is clean or tainted.

    • wbobeirne 3 hours ago

      I feel like calling this a downside implies there's an alternative, but there's no way that `innerHtml`'s behavior could be changed. There are a lot of valid reasons for arbitrary HTML to be set, and changing that would break so many things.

      • cortesoft 2 hours ago

        There could be a better name for it? like `innerSanitizedHTML` or something, that makes it clear what the difference between the two calls are. There is nothing in the wording of setHTML that makes it clear it sanitizes where innerHTML doesn't.

    • cluckindan 3 hours ago

      Yes, one could simply make a setter for innerHTML which calls setHTML(). No code changes needed.

      • masklinn 3 hours ago

        That breaks existing usages of innerhtml which may legitimately need its more dangerous features.

        • cxr 33 minutes ago

          It seems obvious enough that parent is talking about changing the behavior of innerHTML within their own application, not for browser makers to change the implementation. It's unfair to take the most uncharitable interpretation and upbraid the other commenter for being insufficiently defensive[1] when they pressed "reply".

          1. <https://pchiusano.github.io/2014-10-11/defensive-writing.htm...>

          • masklinn 12 minutes ago

            > It seems obvious

            Doesn't seem obvious unless your dutch.

            Especially as the first things I would think obvious is: if breaking the behaviour of innerHTML is not a concern for your software why keep it at all? Delete the property or make it readonly.

  • nayuki 3 hours ago

    > HTML parsing is not stable and a line of HTML being parsed and serialized and parsed again may turn into something rather different

    This is why people should really use XHTML, the strict XML dialect of HTML, in order to avoid these nasty parsing surprises. It has the predictable behavior that you want.

    In XHTML, the code does exactly what it says it does. If you write <table><a></a></table> like the example on the mXSS page, then you get a table element and an anchor child. As another example, if you write <table><td>xyz</td></table>, that's exactly what you get, and there are no implicit <tbody> or <tr> inserted inside.

    It's just wild as I continue to watch the world double down for decades on HTML and all its wild behavior in parsing. Furthermore, HTML's syntax is a unique snowflake, whereas XML is a standardized language that just so happens to be used in SVG, MathML, Atom, and other standards - no need to relearn syntax every single time.

    • bayesnet 3 hours ago

      I don’t think this is right. XHTML guarantees well-formedness (matched closing tags et al) but doesn’t do anything for validity. It’s not semantically valid for <td> to be a direct child of <table>, so the user agent has to make the call as to what to display regardless of the (X)HTML flavor. The alternative is parsing failure on improperly nested HTML which I don’t think is desirable.

      • intrasight 3 hours ago

        > The alternative is parsing failure on improperly nested HTML which I don’t think is desirable.

        It was that decision that resulted in the current mess. Browser vendors could have given us a grace period to fix HTML that didn't validate against the schema. Instead they said "there is no schema"

        • bayesnet 2 hours ago

          The issue as I see it is that XML schemas are fine[0] for immutable documents but not suited for dynamic content. As a user it would be extraordinarily frustrating for a site or web app to break midflow because of a schema validation failure after a setHTML call or something.

          [0]: I’ve worked with XML schemas a lot and have grown to really dislike them actually but that’s neither here nor there