From: Theo de Raadt <deraadt-AT-openbsd.org>
To: Jeff Xu <jeffxu-AT-google.com>
> On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Let's start with the purpose. The point of mimmutable/mseal/whatever is
> > to fix the mapping of an address range to its underlying object, be it
> > a particular file mapping or anonymous memory. After the call succeeds,
> > it must not be possible to make any address in that virtual range point
> > into any other object.
> >
> > The secondary purpose is to lock down permissions on that range.
> > Possibly to fix them where they are, possibly to allow RW->RO transitions.
> >
> > With those purposes in mind, you should be able to deduce for any syscall
> > or any madvise(), ... whether it should be allowed.
> >
> I got it.
>
> IMO: The approaches mimmutable() and mseal() took are different, but
> we all want to seal the memory from attackers and make the linux
> application safer.
I think you are building mseal for chrome, and chrome alone.
I do not think this will work out for the rest of the application space
because
1) it is too complicated
2) experience with mimmutable() says that applications don't do any of it
themselves, it is all in execve(), libc initialization, and ld.so.
You don't strike me as an execve, libc, or ld.so developer.
From: Matthew Wilcox <willy-AT-infradead.org>
To: Jeff Xu <jeffxu-AT-google.com>
...
Yes, thank you for demonstrating that you have no idea what you need to
block.
> It is practical to keep syscall extentable, when the business logic is the same.
I concur with Theo & Linus. You don't know what you're doing. I think
the underlying idea of mimmutable() is good, but how you've split it up
and how you've implemented it is terrible.
...
Interesting. The article mentions "spicy discussions" in the kernel mailing list. Is there any insider who can summarize objections and concerns? I tend to avoid reading the mailing list itself since it can get too spicy, and my headaches are already strong enough!
The mechanism itself seems reasonable, but I am surprised that something like this doesn't already exist in the kernel.
Not sure if there was much more to it than the thread linked to, but it was basically Linus being Linus. He said stuff that made sense in a pretty blunt fashion.
There were flags proposed that allowed the seal to be ignored.
>So you say "we can't munmap in this *one* place, but all others ignore the sealing".
Later was the spice.
>And dammit, once something is sealed, it is SEALED. None of this crazy "one place honors the sealing, random other places do not".
And later, even spicier, Linus says that seals cannot be ignored and that is non-negotiable. Any further suggestions to ignore a seal via a flag would result in the person being added to Linus' ignore list. (He, of course, said this with some profanities and capitals sprinkled in.)
Wasn't just Linus. Earlier, from Theo de Raadt:
> I don't think you understand the problem space well enough to come up with your own solution for it. I spent a year on this, and ship a complete system using it. You are asking such simplistic questions above it shocks me.
https://lwn.net/ml/linux-kernel/95482.1697587015@cvs.openbsd...
Via https://lwn.net/Articles/948129/
Not a great perspective... "It took me a year [or more] to understand this. The fact that you don't understand it shocks me." Dude, not everybody's as smart or experienced as you. Here's an opportunity to be a mentor.
> Google has no shortage of experienced developers who could have reviewed this submission before it was posted publicly, but that does not appear to have happened, with the result that a relatively inexperienced developer was put into a difficult position. Feedback on the proposal was resisted rather than listened to. The result was an interaction that pleased nobody.
> Google has no shortage of experienced developers who could have reviewed this submission before it was posted publicly,
You'd be surprised. My understanding from folks on Chrome OS is that they've already shed most, if not all, of the most experienced old hands. (n.b. Chrome OS was absorbed by Android and development on it has, by and large, ceased, according to the same sources directly, and indirectly via Blind.)
https://lwn.net/ml/linux-kernel/7071.1697661373@cvs.openbsd....
This may help a bit: https://lwn.net/Articles/948129/
Very nice, thanks!
Edit: I always find it funny that these articles on the mailing list tend to read like a sports announcer describing a boxing match!
Will it be possible to override / disable the `mseal` syscall with the LD_PRELOAD trick?
mseal digresses from prior memory protection schemes on Linux because it is a syscall tailored specifically for exploit mitigation against remote attackers seeking code execution rather than potentially local ones looking to exfiltrate sensitive secrets in-memory.
If a remote attacker can change the local environment then they must have already broken into your system.
There's a bunch of ways to override it if you have early control over the process. Another example: ptrace the executable, watch the system calls, and skip over any mseal(2)s.
This system call is meant for a different threat model than "attacker has early access to your process before it started initializing".
Probably not LD_PRELOAD. It would need to be an imported function in order for LD_PRELOAD to have any effect. A raw syscall would not be interceptable that way.
Discussion about intercepting linux syscalls: https://stackoverflow.com/questions/69859/how-could-i-interc...
But building your own patched kernel that pretends that mseal works would be the simplest way to "disable" that feature. Programs that use mseal could still do sanity checks to see if mseal actually works or not. Then a compromised kernel would need secret ways to disable mseal after it has been applied, to stop the apps from checking for a non-functional mseal.
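For what it's worth, the sanity check mentioned above could look roughly like this. It's only a sketch: `probe_mseal` is a made-up helper name, and the fallback syscall number 462 is an assumption for systems whose headers predate mseal. It seals one throwaway page and then checks that `mprotect()` and `munmap()` on it are refused with EPERM.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_mseal
#define __NR_mseal 462 /* assumption: number used on x86-64 and most other architectures */
#endif

/*
 * Hypothetical helper: probe whether mseal(2) is really enforced.
 * Returns  1 if the seal was accepted and then blocked mprotect() and munmap(),
 *          0 if the seal was "accepted" but later calls still succeeded,
 *         -1 if mseal is unavailable or the probe could not run.
 */
static int probe_mseal(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    /* One anonymous scratch page that nothing else depends on. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Raw syscall, so the probe does not depend on a libc wrapper. */
    if (syscall(__NR_mseal, p, len, 0UL) != 0) {
        munmap(p, len); /* not sealed, so this cleanup is allowed */
        return -1;
    }

    /* On a kernel that honors the seal, both calls must fail with EPERM. */
    int prot_blocked = (mprotect(p, len, PROT_READ) == -1 && errno == EPERM);
    int unmap_blocked = (munmap(p, len) == -1 && errno == EPERM);
    return prot_blocked && unmap_blocked;
}
```

Note that a successfully sealed probe page stays mapped for the life of the process, since it can no longer be unmapped. And as pointed out, a kernel that wants to lie could special-case a probe like this, so it only catches the lazy kind of fake.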
I'm not sure what protection you could expect on any system where the kernel has been replaced by the attacker. Sure they can bypass mseal, but they are also bypassing all other security on the box.
Two different considerations for when you'd want to deny memory to other processes:
Protecting against outside attackers
Digital Rights Management
Faking "mseal" is something you might intentionally do if you are trying to break DRM, and something you would not want to do if you are trying to defend against outside attackers.
You can override the mseal call wrapper but not the syscall itself.
This is an interesting thought so I looked it up and this is how (all?) preload syscall overrides work. You override the wrapper but not the syscall itself, so if you’re doing direct syscalls I don’t think that can be overridden. Technically you could override the syscall function itself maybe?
> Technically you could override the syscall function itself maybe?
But then you can just write assembly code to issue the system call.
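To make that concrete, here is a rough sketch of issuing the call with inline assembly, so there is no libc symbol for LD_PRELOAD to interpose. The names `raw_syscall3` and `raw_mseal` are made up for illustration, the fallback number 462 is an assumption for older headers, and only the x86-64 path actually avoids libc.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_mseal
#define __NR_mseal 462 /* assumption: number used on x86-64 and most other architectures */
#endif

/*
 * Hypothetical helper: issue a 3-argument system call.
 * Returns the kernel's result, or a negative errno value on failure.
 */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__)
    long ret;
    /* x86-64 Linux ABI: number in rax, args in rdi/rsi/rdx;
     * the syscall instruction clobbers rcx and r11. */
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "0"(nr), "D"(a1), "S"(a2), "d"(a3)
                     : "rcx", "r11", "memory");
    return ret;
#else
    /* Fallback so the sketch still builds elsewhere. This goes through
     * libc's syscall(), which a preloaded library could interpose. */
    long ret = syscall(nr, a1, a2, a3);
    return ret == -1 ? -(long)errno : ret;
#endif
}

/* Hypothetical wrapper: seal [addr, addr + len) without calling into libc. */
static long raw_mseal(void *addr, size_t len, unsigned long flags)
{
    return raw_syscall3(__NR_mseal, (long)addr, (long)len, (long)flags);
}
```

A caller would check for a negative return, e.g. `-ENOSYS` on kernels without mseal. Nothing here stops ptrace or a seccomp filter from interfering, it only takes the dynamic linker out of the picture.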
https://lwn.net/Articles/978010/ says there'll be a glibc tunable
Depends whether the program calls into libc or inlines the syscalls, I imagine. Though you could use other mechanisms like seccomp.
mseal() and what comes after, October 20, 2023: https://lwn.net/Articles/948129/
mseal() gets closer, January 19, 2024: https://lwn.net/Articles/958438/
Memory sealing for the GNU C Library, June 12, 2024: https://lwn.net/Articles/978010/
Meta: the mseal() prototype in the article needs some editing; it is not syntactically correct as shown now. The first argument is shown as
But should probably be
Seems to be OK now:
- "Memory Sealing "Mseal" System Call Merged for Linux 6.10" (2024) https://news.ycombinator.com/item?id=40474510#40474551 :
> How should CPython support the mseal() syscall?
OpenBSD has had it since forever [1]. Why is such an obvious feature only reaching Linux now?
[1]https://man.openbsd.org/mimmutable.2
>OpenBSD has had it since forever.
OpenBSD introduced mimmutable in OpenBSD 7.3, which was released 10/4/2023 (for US people, it would be 4/10/2023), so it isn't "forever".
Meanwhile Linux and FreeBSD have had "memfd_create" forever, while OpenBSD doesn't have anonymous files and relies on "shm_open".