When I stop and think about it, writing access times for everything seems extremely wasteful... does anything actually use this field? Any reason I shouldn't change all my file systems to mount with noatime?
It's hard for me to imagine using it for anything myself, considering the number of times I do something like run a search (or a backup command) across literally every file I care about.
Ideally, noatime would be the default, and applications that still care about atime would be updated to open with a new `O_ATIME` flag. Or, better yet, track it themselves independently.
It's completely reasonable to turn it on. And also, when you're writing applications for Linux, consider using the `O_NOATIME` flag in your file opens.
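A minimal sketch of that pattern, assuming Linux and glibc: `O_NOATIME` is only honoured if you own the file or hold CAP_FOWNER, so the usual idiom is to retry without it on EPERM.

```c
/* Sketch: open read-only without updating atime, falling back when the
 * kernel refuses O_NOATIME (EPERM unless we own the file or have CAP_FOWNER). */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int open_noatime(const char *path)
{
    int fd = open(path, O_RDONLY | O_NOATIME);
    if (fd < 0 && errno == EPERM)
        fd = open(path, O_RDONLY);  /* not our file: accept the atime update */
    return fd;
}

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;
    int fd = open_noatime(argv[1]);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* ... read as usual ... */
    close(fd);
    return 0;
}
```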
`relatime` is the default on many distros, and has long been "best-practice" for first-pass Linux performance tuning. From MOUNT(8):
> relatime
> Update inode access times relative to modify or change time. Access time is only updated if the previous access time was earlier than or equal to the current modify or change time. (Similar to noatime, but it doesn’t break mutt(1) or other applications that need to know if a file has been read since the last time it was modified.)
> Since Linux 2.6.30, the kernel defaults to the behavior provided by this option (unless noatime was specified), and the strictatime option is required to obtain traditional semantics. In addition, since Linux 2.6.30, the file’s last access time is always updated if it is more than 1 day old.
I think you're on to something, though it's optimistic to count on app devs to specify a novel platform-specific flag. "Don't break user-space" seems like it's in contention here.
Maybe if you could taint a process (or perhaps the inode and/or path instead) so that its and its children's opens get the new `O_ATIME` flag by default, then systemd or whatever could set it for legacy processes (or files/paths) that need it.
Then distros or SREs could put up with it without nagging all the SWEs about Linuxisms, some of whom may not know or care that their code runs on Linux.
> does anything actually use this field?
Systemd in a way does. One of the systemd-tmpfiles entry options is to clean up unused files after some time (it ships defaults of 10 days for /tmp/ and 30 days for /var/tmp/), and for this it checks atime, mtime, and ctime to determine whether it should delete the file. (I think you can also take a flock on the file to prevent it from being deleted.)
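If the flock exemption mentioned above holds (tmpfiles.d(5) documents BSD file locks as a way to exempt files from aging), the application side is just a long-lived flock. A sketch, with a made-up path:

```c
/* Sketch: keep a BSD flock() on a scratch file so age-based cleanup
 * (assuming it honours locks, as the parent comment suggests) leaves it
 * alone while this process is alive. The path is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/myapp-scratch.dat", O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (flock(fd, LOCK_SH) < 0) {  /* a shared lock is enough to mark it in use */
        perror("flock");
        return 1;
    }
    /* ... use the scratch file; the lock goes away on close() or exit ... */
    pause();  /* stand-in for the real work */
    return 0;
}
```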
The canonical example of an application that can break with 'noatime' is the "mutt" email client with mbox style single file email spools.
Most modern applications are not designed to operate on shared files like this, so in general 'noatime' is safe for 99.9% of software.
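Concretely, the heuristic that noatime breaks looks roughly like this; a sketch of the mbox "new mail since last read" check, not mutt's actual code:

```c
/* Sketch: if the spool's mtime is newer than its atime, mail arrived
 * after the reader last opened it. With noatime this signal disappears. */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;
    struct stat st;
    if (stat(argv[1], &st) < 0) {
        perror("stat");
        return 1;
    }
    if (st.st_mtime > st.st_atime)
        printf("%s: new mail (modified after last read)\n", argv[1]);
    else
        printf("%s: no new mail since last read\n", argv[1]);
    return 0;
}
```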
I use atime to identify archives that can be retired. It's common for circuit designers to release a lot of large files for their peers to analyze or incorporate into a parent/grandparent simulation. They will use that data for as long as it is still relevant, which means different things for different types of data, and the only consistent thing we've found is that if the data hasn't been accessed in a while, then we can retire it.
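A sketch of that kind of atime sweep; the directory and the 180-day threshold are placeholders, and note that under relatime the recorded atime can lag the latest access by up to a day:

```c
/* Sketch: walk a tree and flag files whose atime is older than a cutoff. */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <time.h>

#define RETIRE_AFTER_DAYS 180  /* placeholder policy */

static time_t now;

static int visit(const char *path, const struct stat *st, int type, struct FTW *ftw)
{
    (void)ftw;
    if (type == FTW_F && now - st->st_atime > (time_t)RETIRE_AFTER_DAYS * 86400)
        printf("retire candidate: %s\n", path);
    return 0;  /* keep walking */
}

int main(int argc, char **argv)
{
    now = time(NULL);
    return nftw(argc > 1 ? argv[1] : ".", visit, 16, FTW_PHYS) < 0;
}
```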
Mail readers that use the mbox format are pretty much the only common user.
I've found it useful for forensic reasons and debugging.
It's one of those things that you don't care about until you do.
As a former sysadmin through the dotcom booms, we regularly depended on atime for identifying which files are actively being used in myriad situations.
Sometimes you're just confirming a config file was actually reloaded in response to your HUP signal. Other times you're trying to find out which data files a customer's cgi-bin mess is making use of.
It's probably less relevant today where multi-user unix hosts are less common, but it was quite valuable information to maintain back then.
> we regularly depended on atime for identifying which files are actively being used in myriad situations
You can do that with bpf tooling now; for example, the `opensnoop` BCC program can capture all file opens on demand. You can also write tools which capture all POSIX IOs to specific files/directories. I can see atime still being useful in some super-niche cases, such as heisenbugs you cannot reproduce reliably, but I would be reaching for BPF tools first.
Forget bpf, Linux has the audit framework to solve exactly this problem, and it's been in heavy use for a couple of decades now
https://www.redhat.com/en/blog/configure-linux-auditing-audi...
https://linux-audit.com/linux-audit-framework/configuring-an...
This seems to me to be a cgroup issue, not a systemd issue, though systemd's pervasive use of cgroups makes it the most obvious trigger.
It was a shared queue with high contention and it was being sorted more than necessary. The fix is to use independent queues and to not sort the dirty list.
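A toy illustration of that shape, per-worker unsorted lists instead of one sorted list behind a global lock; this is not the kernel code, just the idea:

```c
/* Toy version: each worker appends to and drains its own unsorted dirty
 * list, so there is no global lock to fight over and nothing to keep sorted. */
#include <pthread.h>
#include <stddef.h>

struct item { struct item *next; /* payload would go here */ };

struct dirty_list {
    pthread_mutex_t lock;
    struct item *head;  /* unsorted: we only push and drain */
};

#define NR_WORKERS 4
static struct dirty_list lists[NR_WORKERS];

static void init_lists(void)
{
    for (int i = 0; i < NR_WORKERS; i++) {
        pthread_mutex_init(&lists[i].lock, NULL);
        lists[i].head = NULL;
    }
}

/* Producers push onto "their" list, so contention stays local and O(1). */
static void mark_dirty(int worker, struct item *it)
{
    struct dirty_list *l = &lists[worker % NR_WORKERS];
    pthread_mutex_lock(&l->lock);
    it->next = l->head;
    l->head = it;
    pthread_mutex_unlock(&l->lock);
}

/* Writeback drains one list at a time; order doesn't matter for correctness. */
static struct item *drain(int worker)
{
    struct dirty_list *l = &lists[worker % NR_WORKERS];
    pthread_mutex_lock(&l->lock);
    struct item *all = l->head;
    l->head = NULL;
    pthread_mutex_unlock(&l->lock);
    return all;
}

int main(void)
{
    init_lists();
    struct item it = { NULL };
    mark_dirty(1, &it);
    return drain(1) == &it ? 0 : 1;
}
```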
I was listening to the "Squashing Compilers" episode of Matt Godbolt's Two's Complement podcast and this got my attention. I think this was Ben Rady sharing his recent systemd issue; it seems related.
https://youtu.be/Au15lSiAkeQ?si=sxxP2ia9vUkWY5qy&t=982
From the YouTube transcript:
"I don't know what systemd is doing to take so long cuz this is the rub systemd essentially takes 100% CPU twice over. So on our two core machine that we run these things on, I can run top that when I actually got it, I said to you the machine was unresponsive, right? Because all in kernel land, locks are being taken out left, right, and center. Um, you know, we're trying to mount these things in parallel at sensible levels because we want to try and mount"
Another case of Accidentally Quadratic
I admit, I do not fully understand systemd, but having to add logic like this is very odd. If "too many" is reached, couldn't they add a pause and throw a message into /var/log/messages?
This indicates to me a very poor design. If not, it is a validation of the old UNIX saying "do one thing and do it well" and "keep programs small" (paraphrasing).
It’s not really specific to systemd, it’s about cgroups in the kernel. If your code runs as a systemd unit, it gets its own cgroup, and there you are.
> I admit, I do not fully understand systemd ...
> This indicates to me a very poor design. If not, it is a validation of the old UNIX saying "do one thing and do it well" and "keep programs small" (paraphrasing).
You don't need to fully understand systemd to understand that TFA describes a kernel fix.
This isn't a systemd problem, systemd just makes use of cgroups. The kernel has a degenerate case handling lazy atime updates combined with cgroups.
Kinda yeah?
I’d say it’s both a systemd issue and a kernel issue. The fact that systemd motivates kernel fixes does point to systemd being maybe just a bit overengineered
> I’d say it’s both a systemd issue and a kernel issue. The fact that systemd motivates kernel fixes does point to systemd being maybe just a bit overengineered
systemd is basically a victim here, you're quasi engaging in a tech form of victim blaming.
don't blame systemd for making use of kernel features (cgroups)
and without cgroups linux has no sandboxing capabilities, and would be largely irrelevant to today's workloads
If I was blaming only systemd then you’d be right.
Look, if I wrote a thing that caused kernel lockups, then I’d blame myself, even if the kernel dudes fixed the issue.
The kernel has a very clear API and expected behavior. Systemd is not doing anything wrong, it’s using the API correctly.
It’s a kernel bug.