A deep dive into Linux's new mseal syscall

(blog.trailofbits.com)

107 points | by todsacerdoti 5 hours ago

26 comments

  • ykonstant 4 hours ago

    Interesting. The article mentions "spicy discussions" in the kernel mailing list. Is there any insider who can summarize objections and concerns? I tend to avoid reading the mailing list itself since it can get too spicy, and my headaches are already strong enough!

    The mechanism itself seems reasonable, but I am surprised that something like this doesn't already exist in the kernel.

    • ziddoap 3 hours ago

      Not sure if there was much more to it than the thread linked to, but it was basically Linus being Linus. He said stuff that made sense in a pretty blunt fashion.

      There were flags proposed that allowed the seal to be ignored.

      >So you say "we can't munmap in this *one* place, but all others ignore the sealing".

      The spice came later.

      >And dammit, once something is sealed, it is SEALED. None of this crazy "one place honors the sealing, random other places do not".

      And later, even spicier, Linus says that seals cannot be ignored and that is non-negotiable. Any further suggestions to ignore a seal via a flag would result in the person being added to Linus' ignore list. (He, of course, said this with some profanities and capitals sprinkled in.)

      • js2 3 hours ago

        Wasn't just Linus. Earlier, from Theo de Raadt:

        > I don't think you understand the problem space well enough to come up with your own solution for it. I spent a year on this, and ship a complete system using it. You are asking such simplistic questions above it shocks me.

        https://lwn.net/ml/linux-kernel/95482.1697587015@cvs.openbsd...

        Via https://lwn.net/Articles/948129/

        • 0xbadcafebee 2 hours ago

          Not a great perspective... "It took me a year [or more] to understand this. The fact that you don't understand it shocks me." Dude, not everybody's as smart or experienced as you. Here's an opportunity to be a mentor.

          • nativeit an hour ago

            > Google has no shortage of experienced developers who could have reviewed this submission before it was posted publicly, but that does not appear to have happened, with the result that a relatively inexperienced developer was put into a difficult position. Feedback on the proposal was resisted rather than listened to. The result was an interaction that pleased nobody.

            • refulgentis 44 minutes ago

              > Google has no shortage of experienced developers who could have reviewed this submission before it was posted publicly,

              You'd be surprised. My understanding from folks on Chrome OS is that they've already shed most, if not all, of the most experienced old hands. (N.B. Chrome OS was absorbed by Android, and development on it has by and large ceased, according to the same sources directly, and indirectly via Blind.)

    • greenavocado 2 hours ago

      https://lwn.net/ml/linux-kernel/7071.1697661373@cvs.openbsd....

          From:   Theo de Raadt <deraadt-AT-openbsd.org>
          To:   Jeff Xu <jeffxu-AT-google.com>
      
          > On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox <willy@infradead.org> wrote:
          > >
          > > Let's start with the purpose.  The point of mimmutable/mseal/whatever is
          > > to fix the mapping of an address range to its underlying object, be it
          > > a particular file mapping or anonymous memory.  After the call succeeds,
          > > it must not be possible to make any address in that virtual range point
          > > into any other object.
          > >
          > > The secondary purpose is to lock down permissions on that range.
          > > Possibly to fix them where they are, possibly to allow RW->RO transitions.
          > >
          > > With those purposes in mind, you should be able to deduce for any syscall
          > > or any madvise(), ... whether it should be allowed.
          > >
          > I got it.
          > 
          > IMO: The approaches mimmutable() and mseal() took are different, but
          > we all want to seal the memory from attackers and make the linux
          > application safer.
      
          I think you are building mseal for chrome, and chrome alone.
      
          I do not think this will work out for the rest of the application space
          because
      
          1) it is too complicated
          2) experience with mimmutable() says that applications don't do any of it
          themselves, it is all in execve(), libc initialization, and ld.so.
          You don't strike me as an execve, libc, or ld.so developer.
    • greenavocado 2 hours ago

          From:   Matthew Wilcox <willy-AT-infradead.org>
          To:   Jeff Xu <jeffxu-AT-google.com>
      
          ...
      
          Yes, thank you for demonstrating that you have no idea what you need to
          block.
      
          > It is practical to keep syscall extentable, when the business logic is the same.
      
          I concur with Theo & Linus.  You don't know what you're doing.  I think
          the underlying idea of mimmutable() is good, but how you've split it up
          and how you've implemented it is terrible.
      
          ...
    • lathiat 3 hours ago
      • ykonstant 3 hours ago

        Very nice, thanks!

        Edit: I always find it funny that these articles on the mailing list tend to read like a sports announcer describing a boxing match!

  • metadat 3 hours ago

    Will it be possible to override / disable the `mseal` syscall with the LD_PRELOAD trick?

    • eska 3 hours ago

      mseal digresses from prior memory protection schemes on Linux because it is a syscall tailored specifically for exploit mitigation against remote attackers seeking code execution rather than potentially local ones looking to exfiltrate sensitive secrets in-memory.

      If a remote attacker can change the local environment then they must have already broken into your system.

    • monocasa 27 minutes ago

      There's a bunch of ways to override it if you have early control over the process. Another example: ptrace the executable, watch the system calls, and skip over any mseal(2)s.

      This system call is meant for a different threat model than "attacker has early access to your process before it started initializing".

    • Dwedit 3 hours ago

      Probably not LD_PRELOAD. It would need to be an imported function in order for LD_PRELOAD to have any effect. A raw syscall would not be interceptable that way.

      Discussion about intercepting linux syscalls: https://stackoverflow.com/questions/69859/how-could-i-interc...

      But building your own patched kernel that pretends that mseal works would be the simplest way to "disable" that feature. Programs that use mseal could still do sanity checks to see if mseal actually works or not. Then a compromised kernel would need secret ways to disable mseal after it has been applied, to stop the apps from checking for a non-functional mseal.

      • jandrese 3 hours ago

        I'm not sure what protection you could expect on any system where the kernel has been replaced by the attacker. Sure they can bypass mseal, but they are also bypassing all other security on the box.

        • Dwedit an hour ago

          Two different considerations for when you'd want to deny memory to other processes:

          - Protecting against outside attackers
          - Digital Rights Management

          Faking "mseal" is something you might intentionally do if you are trying to break DRM, and something you would not want to do if you are trying to defend against outside attackers.

    • chucky_z 3 hours ago

      You can override the mseal call wrapper but not the syscall itself.

      This is an interesting thought, so I looked it up, and this is how (all?) preload syscall overrides work. You override the wrapper but not the syscall itself, so if you’re doing direct syscalls I don’t think that can be overridden. Technically you could override the syscall function itself maybe?

      • jmmv an hour ago

        > Technically you could override the syscall function itself maybe?

        But then you can just write assembly code to issue the system call.
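
To illustrate the wrapper-versus-syscall distinction in this sub-thread, here is a minimal sketch of an LD_PRELOAD shim. It assumes the target program is dynamically linked and calls a libc wrapper named `mseal` with this signature; the wrapper name, the file name `fake_mseal.c`, and the counter `fake_mseal_calls` are illustrative assumptions, not something confirmed in the thread.

```c
/* fake_mseal.c: hypothetical LD_PRELOAD shim (illustrative only).
 * Build: cc -shared -fPIC fake_mseal.c -o fake_mseal.so
 * Use:   LD_PRELOAD=./fake_mseal.so ./some_program
 */
#include <stddef.h>

/* Counts intercepted calls so the effect is observable. */
int fake_mseal_calls = 0;

/* Same symbol name as the assumed libc wrapper. The dynamic linker
 * resolves calls to this definition first, so the real wrapper (and
 * therefore the kernel) is never reached through this path. */
int mseal(void *addr, size_t len, unsigned long flags)
{
    (void)addr;
    (void)len;
    (void)flags;
    fake_mseal_calls++;
    return 0; /* pretend the seal succeeded; nothing is actually sealed */
}
```

A shim like this only catches calls that go through the named symbol. A program that uses `syscall(SYS_mseal, ...)` or inline assembly never looks up `mseal` at all, which is the point made in the replies above.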

    • the8472 3 hours ago

      https://lwn.net/Articles/978010/ says there'll be a glibc tunable

    • cataphract 3 hours ago

      Depends whether the program calls into libc or inlines the syscalls, I imagine. Though you could use other mechanisms like seccomp.

  • throw0101a 2 hours ago

    mseal() and what comes after, October 20, 2023: https://lwn.net/Articles/948129/

    mseal() gets closer, January 19, 2024: https://lwn.net/Articles/958438/

    Memory sealing for the GNU C Library, June 12, 2024: https://lwn.net/Articles/978010/

  • unwind 3 hours ago

    Meta: the mseal() prototype in the article needs some editing; it is not syntactically correct as shown now. The first argument is shown as

        unsigned start addr
    
    But should probably be

        unsigned long start_addr
    • hifromwork an hour ago

      Seems to be OK now:

          int mseal(unsigned long start, size_t len, unsigned long flags)
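
For anyone who wants to try the prototype above, here is a minimal sketch that seals one anonymous page and then checks that `mprotect` is refused. It issues the syscall directly through `syscall(2)` because a libc wrapper may not be available. The helper names `raw_mseal` and `seal_demo` are made up for this example, and the fallback number 462 is an assumption about `__NR_mseal` for the case where the installed headers predate Linux 6.10.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_mseal
#define SYS_mseal 462 /* assumption: __NR_mseal in Linux 6.10+ headers */
#endif

/* Hypothetical helper: call mseal without relying on a libc wrapper. */
static long raw_mseal(void *start, size_t len, unsigned long flags)
{
    return syscall(SYS_mseal, (unsigned long)start, len, flags);
}

/* Returns 1 if the page was sealed and mprotect was refused with EPERM,
 * 0 if mseal is unavailable here (old kernel, 32-bit, sandbox, ...),
 * -1 on an unexpected result. */
int seal_demo(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (raw_mseal(p, page, 0) != 0) { /* flags must currently be 0 */
        perror("mseal");
        munmap(p, page); /* not sealed, so unmapping still works */
        return 0;
    }

    /* The mapping is sealed: changing its permissions should now fail.
     * It also cannot be unmapped, so it stays until the process exits. */
    errno = 0;
    if (mprotect(p, page, PROT_READ) == -1 && errno == EPERM)
        return 1;
    return -1;
}
```

Calling `seal_demo()` from a small `main` and printing the result is enough to see whether the running kernel honors the seal. The same check is roughly the kind of sanity test Dwedit describes for detecting a non-functional mseal.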
  • westurner an hour ago

    - "Memory Sealing "Mseal" System Call Merged for Linux 6.10" (2024) https://news.ycombinator.com/item?id=40474510#40474551 :

    > How should CPython support the mseal() syscall?

  • xterminator 24 minutes ago

    OpenBSD has had it since forever [1]. Why is such an obvious feature only reaching Linux now?

    [1] https://man.openbsd.org/mimmutable.2

    • gilgamesh3 17 minutes ago

      >OpenBSD has had it since forever.

      OpenBSD introduced mimmutable in OpenBSD 7.3, which was released 10/4/2023 (for US people, it would be 4/10/2023), so it isn't "forever".

      Meanwhile Linux and FreeBSD have had "memfd_create" forever, while OpenBSD doesn't have anonymous files and relies on "shm_open".