Modernizing Linux swapping: introducing the swap table

(lwn.net)

66 points | by chmaynard 6 hours ago

79 comments

One pet peeve I have with virtual memory management on Linux is that, as memory usage approaches 100%, the kernel starts evicting executable pages because technically they're read-only and can be loaded from disk. Thus, the entire system grinds to a halt in a behavior that looks like swapping, because every program that wants to execute instructions has to load its instructions from disk again, only to have those instruction pages be evicted again when context switching to another program. This behavior is especially counterintuitive because disabling swap does not prevent this problem. There are no convenient settings that let administrators prevent it.

It's good that we have better swapping now, but I wish they'd address the above. I'd rather have programs getting OOMKilled or throwing errors before the system grinds to a halt, where I can't even ssh in and run 'ps'.

Rygian 2 hours ago

I suffer from the same behavior, ever since I moved from Ubuntu to Debian.

An interactive system that does not interact (terminal not reactive, can't ssh in, screen does not refresh) is broken. I don't understand why this is not a kernel bug.

On my system, to add insult to injury, when the system does come back twenty minutes later, I get a "helpful" pop-up from the Linux Kernel saying "Memory Shortage Avoided". Which is just plain wrong. The pop-up should say "sorry, the kernel bricked your system for a solid twenty minutes for no good reason, please file a report".

robinsonb5 3 hours ago

Indeed. I think what's really needed is some way to mark pages as "required for interactivity" so that nothing related to the user interface gets paged out, ever. That, I think, would go at least some way towards restoring the feeling of "having a computer's full attention" that we had thirty years ago.

FooBarWidget a few seconds ago

There is, mlock() or mlockall(), but it requires developer support. I wish there were an administrator knob that allowed me to mark whole processes without needing to modify them.
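
For reference, here is a minimal sketch of what that developer support can look like from Python via ctypes. The helper names `mlockall_flags` and `lock_all_memory` are made up for illustration, and the flag values are the ones used on x86 and ARM Linux; other architectures may differ, so treat them as an assumption.

```python
import ctypes
import ctypes.util
import os

# Flag values from <sys/mman.h> on x86 and ARM Linux. Other architectures
# (for example PowerPC) use different numbers, so this is an assumption.
MCL_CURRENT = 1  # lock pages that are mapped right now
MCL_FUTURE = 2   # also lock pages that get mapped later


def mlockall_flags(current: bool = True, future: bool = True) -> int:
    """Build the flags argument for mlockall()."""
    return (MCL_CURRENT if current else 0) | (MCL_FUTURE if future else 0)


def lock_all_memory(flags: int) -> None:
    """Ask the kernel to keep this process's pages resident in RAM.

    Raises OSError if the call fails, typically because RLIMIT_MEMLOCK
    is too low or the process lacks CAP_IPC_LOCK.
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    if libc.mlockall(flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

A process would call `lock_all_memory(mlockall_flags())` early in startup. Nothing here lets an administrator apply the lock to an unmodified program, which is the gap the comment describes.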

direwolf20 an hour ago

An Electron app would mark its entire 2GB as required for interactivity. If you run 4 electron apps on an 8GB system you run out of memory.

robinsonb5 19 minutes ago

I don't mean interactivity within apps, per se - I mean the desktop and underlying OS, so that if an electron app goes unresponsive and eats all the free RAM the window manager can still kill it. Or you can still open a new terminal window, log in and kill it. Right now it can take several minutes to get a Linux system back under control once a swapstorm starts.

M95D 9 minutes ago

Alt+[SysRq,f]

Or Alt+[SysRq,h] for help

akdev1l 3 hours ago

Seems the applications can call mlockall() to do this

nolist_policy 3 hours ago

Linux swap has been fixed on Chromebooks for years thanks to MGLRU. It has been upstream since Linux 6.1 and you can try it with

  echo y >/sys/kernel/mm/lru_gen/enabled

M95D 5 minutes ago

I had nothing but problems since that was introduced in 6.1. It seems that the kernel prefers to compact/defrag memory, repeatedly, each time freezing everything 1-2 seconds, rather than releasing some disk cache memory or swapping out.

tremon 2 hours ago

Documentation links:

https://docs.kernel.org/next/admin-guide/mm/multigen_lru.htm...

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

man8alexd 3 hours ago

Actively used executable pages are explicitly excluded from reclaim. And if they are not used, why should they stay in memory when the memory is constrained? It is not the first time I have heard complaints about executable pages, but it seems to be some kind of common misunderstanding.

https://news.ycombinator.com/item?id=45369516

rustyhancock 21 minutes ago

This explains some of the issues I was having on a laptop some months back.

And I was searching desperately for a "just kill the damn thing" option.

112233 3 hours ago

Is there a way to make the Linux kernel schedule in a "batch friendly way"? Say I do "make -j" and get 200 gcc processes doing jobserver LTO link with 2GB RSS each. In my head, the optimal way through such a mess is to get as many processes as can fit into RAM without swapping, run them to completion, and schedule additional processes as resources become available. A depth first, "infinite latency" mode.

Any combination of cgroups, /proc flags and other forbidden knobs to get such behaviour?

Neywiny 12 minutes ago

"make -j" has OOMd me more than it's worth. If it's a big project I just put in how many threads I want. I do hear your point but that is a solved problem.

direwolf20 an hour ago

It's not possible for the kernel to predict the memory needs of a process unfortunately

man8alexd an hour ago

It is possible to measure process memory utilisation and set appropriate cgroup limits.
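
One way to do the measuring part, as a rough sketch: read the peak resident set size (the `VmHWM` field) from the text of `/proc/<PID>/status` after a representative run and add some headroom. The function names and the 25% headroom are illustrative assumptions, not a recommendation from the thread.

```python
import re


def peak_rss_kib(status_text: str) -> int:
    """Return VmHWM (peak resident set size) in KiB from /proc/<pid>/status text."""
    match = re.search(r"^VmHWM:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    if match is None:
        # Kernel threads, for example, have no VmHWM line.
        raise ValueError("no VmHWM line found")
    return int(match.group(1))


def suggested_limit_bytes(status_text: str, headroom: float = 1.25) -> int:
    """Peak RSS in bytes, scaled by a headroom factor (an arbitrary choice here)."""
    return int(peak_rss_kib(status_text) * 1024 * headroom)
```

Feeding it the contents of `/proc/<PID>/status` for a finished or long-running job gives a starting value for a cgroup memory limit, to be adjusted after observing real behaviour.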

worldsavior 3 hours ago

The size of program instructions is small, so loading them is fast; no need to worry about that too much. I'd look at other things first.

twic 3 hours ago

Have you measured this, or is this just an opinion?

man8alexd 3 hours ago

Look into /proc/<PID>/status and /proc/<PID>/smaps
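
As a sketch of what to look for in smaps, the snippet below sums the resident size of file-backed executable mappings and compares it with the total. It works on the text of `/proc/<PID>/smaps`; the function name is hypothetical, and counting only mappings with an `x` permission bit and a non-zero inode is an assumption about what "program instructions" should mean here.

```python
import re

# Mapping header: address range, permissions, offset, device, inode, optional path.
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+(\S{4})\s+\S+\s+\S+\s+(\d+)")
RSS = re.compile(r"^Rss:\s+(\d+)\s+kB")


def executable_rss_kib(smaps_text: str) -> tuple:
    """Return (file-backed executable Rss, total Rss) in KiB."""
    exec_kib = 0
    total_kib = 0
    in_exec_mapping = False
    for line in smaps_text.splitlines():
        header = HEADER.match(line)
        if header:
            perms, inode = header.group(1), int(header.group(2))
            # Inode 0 means anonymous or special mappings such as [heap] or [vdso].
            in_exec_mapping = "x" in perms and inode != 0
            continue
        rss = RSS.match(line)
        if rss:
            kib = int(rss.group(1))
            total_kib += kib
            if in_exec_mapping:
                exec_kib += kib
    return exec_kib, total_kib
```

Running it over the smaps of a few large processes gives an actual measurement of how much resident memory is executable code, which is the question twic asked.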

ChocolateGod 2 hours ago

I'd like to see Linux gain support for actual memory compression, without the need to go through zram, similar to macOS/Windows.

homebrewer 2 hours ago

zram has been "obsolete" for years, I don't know why people still reach for it. Linux supports proper memory compression in the form of zswap

https://wiki.archlinux.org/title/Zswap

RealStickman_ an hour ago

I didn't realize zswap also uses in-memory compression. It might be a combination of poor naming and zram being continuously popular.

ChocolateGod 2 hours ago

Because I'd rather compress ram when running low on memory rather than swapping to my disks. zram is also default on some distros (e.g. Fedora).

homebrewer 2 hours ago

Did you read the link? Additional disk swap is optional, and if for some reason you would still like to have one, it's easy to disable writeback, using just the RAM.

And even if one enables zswap and configures nothing else, compressing RAM and only swapping out to disk under extreme pressure is still the default behavior.

JamesTRexx 2 hours ago

I use zswap, which is a non-fixed intermediate layer between RAM and swap and worked great on my old laptop which had a max of 4GB RAM. Even use it now on my current 32GB laptop.

Full compression would be nicer, but I'd also like to see ECC emulation (or alternative) as a cheaper alternative to the real hardware, although with current prices that might be less so.

mkurz 2 hours ago

zswap?

dist-epoch 2 hours ago

Both Canonical and Microsoft recommend enabling swap file for Ubuntu cloud images, even if you allocate plenty of RAM to the VM.

Any thoughts on that?

man8alexd 2 hours ago

https://chrisdown.name/2018/01/02/in-defence-of-swap.html

IshKebab 2 hours ago

Yeah because Linux's memory management is quite poor and running out of RAM without swap will often mean a hard reboot. Swap definitely helps a lot, even if it doesn't fully solve the problem.

To be honest I don't know why it's such an issue on Linux. Mac and Windows don't have this issue at all. Windows presumably because it doesn't over-commit memory. I'm not sure why Mac is so much better than Linux at memory management.

My eventual solution was to just buy a PC with a ton of RAM (128 GB). Haven't had any hard reboots due to OOM since then!

magicalhippo 10 minutes ago

> To be honest I don't know why it's such an issue on Linux.

edit: I wrote all this before realizing I overlooked that you answered it yourself, so below is my very elaborate explanation of what you said:

> Windows presumably because it doesn't over-commit memory.

I'm no expert but from what I've gathered this ultimately boils down to how Linux went with fork for multiprocessing, vs Windows focused on threads.

With fork, you clone the process. Since it's a clone it gets a copy of all the memory of the parent process. To make fork faster and consume less physical memory, Linux went with copy-on-write for the process' memory. This avoids an expensive copy, and also avoids duplicating memory which will only be read.

The downside is that Linux has no idea how much of the memory shared with the clone the clone or the parent will modify after the fork call. If the clone just does a small job and exits quickly, neither it nor the parent will modify a lot of pages, thus most of them are never actually copied. The fastest work is the work you never perform, so this is indeed fast.

However, in some cases the clone is long-lived and thus a lot of memory might eventually end up getting copied. Well, Linux needs to back those copies with physical memory, and so if there's not enough physical memory around it has to evict something. While Linux scrambles to perform the copy, the process which triggers it has to wait.

AFAIK one can configure Linux to reserve physical memory for a worst-case scenario where it has to copy all the cloned memory. However in almost all normal cases, this grossly overestimates the required memory and thus leads to swapping when technically it is not needed.
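
The setting alluded to is probably strict overcommit accounting (`vm.overcommit_memory=2`), which caps the total committed address space rather than reserving physical pages. As a sketch, the kernel documentation gives the cap as (RAM minus huge pages) times `overcommit_ratio` percent, plus swap; the function below just reproduces that arithmetic, and its name is made up.

```python
def commit_limit_kib(ram_kib: int, swap_kib: int,
                     overcommit_ratio: int = 50, hugetlb_kib: int = 0) -> int:
    """CommitLimit under vm.overcommit_memory=2, in KiB.

    overcommit_ratio defaults to 50, matching the kernel's default
    vm.overcommit_ratio.
    """
    return (ram_kib - hugetlb_kib) * overcommit_ratio // 100 + swap_kib
```

For an 8 GiB machine with 2 GiB of swap and the default ratio, this yields 6 GiB of committable memory, which illustrates why strict mode tends to refuse allocations long before RAM is actually full.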

On Windows this is very different. Instead of spawning a cloned process to do extra work, you spawn a thread. And all threads belonging to a process share the same memory. Thus there is no need to clone memory, no need for the copy-on-write optimization, and thus Windows has much better knowledge about how much free physical memory it actually has to work with.

Of course a thread on Windows can still allocate a huge amount of memory and trigger swapping that way, but Windows will never suddenly be in a situation where it then also needs to scramble to copy some shared pages.

direwolf20 an hour ago

My experience is different. Running out of RAM without swap will cause the most memory-hungry process to die, whereupon systemd restarts it. Running out of RAM with swap causes thrashing and you can't serve any requests or ssh logins. Someone has to press the reset button then.

rustyhancock 19 minutes ago

I suppose there are two common scenarios roughly.

If I did something (like try and decompress an archive) and I run out of memory I want that process to be killed.

If my system/config is simply not up to scratch and the normal services are causing thrashing, that needs to be addressed directly, and I don't think OOM kill is intended to help with that.

NekkoDroid an hour ago

> To be honest I don't know why it's such an issue on Linux. Mac and Windows don't have this issue at all. Windows presumably because it doesn't over-commit memory

To be fair, my Windows system grinds to a halt (not really, but it becomes very noticeably less responsive in basically anything) when JetBrains is installing an update (mind you I only have SSDs with all JetBrains stuff being on an NVMe). I don't know what JetBrains is doing, but it consistently makes itself noticeable when it is updating.

secondcoming 44 minutes ago

would that mean swapping involves network calls?

iberator 3 hours ago

Another useless feature into Linux kernel. Who uses swap space nowadays?! Last time I used swap on Linux device was around Pentium 2 era but in reality closer to 486DX era

Titan2189 3 hours ago

We use it in production. Workloads with unpredictable memory usage (32MB to 4GB per process), but we also want to start enough processes to saturate the CPU. Before we configured & enabled swap we were either sitting at low CPU utilisation or OOM

ch_123 3 hours ago

I ran Linux without swap for some years on a laptop with a large-for-the-time amount of RAM (about 8GB). It _mostly_ worked, but sudden spikes of memory usage would render the system unresponsive. Usually it would recover, but in some cases it required a power cycle.

Similarly, on a server where you might expect most of the physical memory to get used, it ends up being very important for stability. Think of VM or container hosts in particular.

GCUMstlyHarmls 3 hours ago

I don't get why anti-swap is so prevalent in Linux discussions. Like, what does it hurt to stick 8-16-32gb extra "oh fuck" space on your drive.

Either you're going to never exhaust your system ram, so it doesn't matter, minimally exhaust it and swap in some peak load but at least nothing goes down, or exhaust it all and start having things get OOM'd which feels bad to me.

Am I out of touch? Surely it's the children who are wrong.

manuel_w 3 hours ago

The pro-swap stance has never made sense to me because it feels like a logical loop.

There’s a common rule of thumb that says you should have swap space equal to some multiple of your RAM.

For instance, if I have 8 GB of RAM, people recommend adding 8 GB of swap. But since I like having plenty of memory, I install 16 GB of RAM instead—and yet, people still tell me to use swap. Why? At that point, I already have the same total memory as those with 8 GB of RAM and 8 GB of swap combined.

Then, if I upgrade to 24 GB of RAM, the advice doesn’t change—they still insist on enabling swap. I could install an absurd amount of RAM, and people would still tell me to set up swap space.

It seems that for some, using swap has become dogma. I just don’t see the reasoning. Memory is limited either way; whether it’s RAM or RAM + swap, the total available space is what really matters. So why insist on swap for its own sake?

xorcist a minute ago

There is too much focus in this discussion on low memory situations. You want to avoid those as much as possible. Set a reasonable ulimit for your applications.

The reason you want swap is because everything in Linux (and all of UNIX really) is written with virtual memory in mind. Everything from applications to schedulers will have that use case in mind. That's the short answer.

Memory is expensive and storage is cheap. Even if you have 16 GB RAM in your box, and perhaps especially then, you will have some unused pages. Paging out those and utilizing more memory to buffer I/O will give you higher performance under most normal circumstances. So having a little bit of swap should help performance.

For laptops hibernation can be useful too.

viraptor 2 hours ago

You're mashing together two groups. One claims having swap is good actually. The other claims you need N times ram for swap. They're not the same group.

> Memory is limited either way; whether it’s RAM or RAM + swap

For two reasons: usage spikes and actually having more usable memory. There's lots of unused pages on a typical system. You get free ram for the price of cheap storage, so why wouldn't you?

man8alexd 3 hours ago

This rule of thumb is outdated by two decades.

The proper rule of thumb is to make the swap large enough to keep all inactive anonymous pages after the workload has stabilized, but not too large to cause swap thrashing and a delayed OOM kill if a fast memory leak happens.

tremon 2 hours ago

That's not useful as a rule of thumb, since you can't know the size of "all inactive anonymous pages" without doing extensive runtime analysis of the system under consideration. That's pretty much the opposite of what a rule of thumb is for.

man8alexd 2 hours ago

You are right, it is not a rule of thumb, and you can't determine optimal swap size right away. But you don't need "extensive runtime analysis". Start with a small swap - a few hundred megabytes (assuming the system has GBs of RAM). Check its utilization periodically. If it is full, add a few hundred megabytes more. That's all.
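
The periodic check can be as simple as the sketch below, which reads `SwapTotal` and `SwapFree` from the text of `/proc/meminfo`. The function names and the 90% "full" threshold are arbitrary illustrations, not values from the comment.

```python
import re


def swap_usage_kib(meminfo_text: str) -> tuple:
    """Return (used, total) swap in KiB from /proc/meminfo text."""
    fields = dict(re.findall(r"^(\w+):\s+(\d+)", meminfo_text, re.MULTILINE))
    total = int(fields["SwapTotal"])
    free = int(fields["SwapFree"])
    return total - free, total


def should_grow_swap(meminfo_text: str, full_fraction: float = 0.9) -> bool:
    """True when swap exists and is at least full_fraction used."""
    used, total = swap_usage_kib(meminfo_text)
    return total > 0 and used / total >= full_fraction
```

Run from a cron job or timer against the contents of `/proc/meminfo`, this flags the moment to add another few hundred megabytes of swap.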

ZoomZoomZoom an hour ago

It's not like it's easy to shuffle partitions around. Swap files are a pain, so you need to reserve space at the end of the table. By the time you need to increase swap the previous partition is going to be full.

Better overcommit right away and live with the feeling you're wasting space.

man8alexd an hour ago

Exactly the opposite. Don't use swap partitions, and use swap files, even multiple if necessary. Never allocate too much swap space. It is better to get OOM earlier than to wait for an unresponsive system.

direwolf20 an hour ago

Hast thou discovered our lord and savior LVM?

dspillett 2 hours ago

> There’s a common rule of thumb that says you should have swap space equal to some multiple of your RAM.

That rule came about when RAM was measured in a couple of MB rather than GB, and hasn't made sense for a long time in most circumstances (if you are paging out a few GB of stuff on spinning drives your system is likely to be stalling so hard due to disk thrashing that you hit the power switch, and on SSDs you are not-so-slowly killing them due to the excess writing).

That doesn't mean it isn't still a good idea to have a little allocated just-in-case. And as RAM prices soar while IO throughput & latency are low, we may see larger Swap/RAM ratios being useful again as RAM sizes are constrained but working-sets aren't getting any smaller.

In a theoretical ideal computer, which the actual designs we have are leaky-abstraction laden implementations of, things are the other way around: all the online storage is your active memory and RAM is just the first level of cache. That ideal hasn't historically ended up being what we have because the disparities in speed & latency between other online storage and RAM have been so high (several orders of magnitude), fast RAM has been volatile, and hardware & software designs are not stable & correct enough such that regular complete state resets are necessary.

> Why? At that point, I already have the same total memory as those with 8 GB of RAM and 8 GB of swap combined.

Because your need for fast immediate storage has increased, so 8-quick-8-slow is no longer sufficient. You are right in that this doesn't mean 16-quick-16-slow is sensible, and 128-quick-128-slow would be ridiculous. But no swap at all doesn't make sense either: on your machine imbued with silly amounts of RAM are you really going to miss a few GB of space allocated just-in-case? When it could be the difference between slower operation for a short while and some thing(s) getting OOM-killed?

man8alexd an hour ago

Swap is not a replacement for RAM. It is not just slow. It is very-very-very slow. Even SSDs are 10^3 slower at random access with small 4K blocks. Swap is for allocated but unused memory. If the system tries to use swap as active memory, it is going to become unresponsive very quickly - 0.1% memory excess causes a 2x degradation, 1% - 10x degradation, 10% - 100x degradation.
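
Those degradation figures follow from a simple weighted average. Assuming, as a simplification, that every access to a swapped-out page costs about 1000 times a RAM access and that accesses are spread uniformly over memory, the expected slowdown is (1 - p) + 1000p for a swapped fraction p. The function name below is illustrative.

```python
def slowdown(swapped_fraction: float, swap_penalty: float = 1000.0) -> float:
    """Average access cost relative to RAM when a fraction of accesses hit swap."""
    return (1.0 - swapped_fraction) + swapped_fraction * swap_penalty
```

With the default penalty, `slowdown(0.001)` is about 2, `slowdown(0.01)` about 11 and `slowdown(0.1)` about 101, in line with the 2x, 10x and 100x quoted above.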

Balinares 2 hours ago

Another factor other commenters haven't mentioned, although the article does bring it up: you may disable swap and you will still get paging behavior regardless, because in a pinch the kernel will reclaim pages that are mmapped to files. Most typically binaries and libraries. Which means the process in question will incur a map page read next time it schedules. But of course you're out of memory, so the kernel will need to page out another process's code page to make room, and when that process next schedules... Etc.

This has far worse degradation behavior than normal swapping of regular data pages. That at least gives you the breathing space to still schedule processes when under memory pressure, such as whichever OOM killer you favor.

man8alexd an hour ago

Binaries and libraries are not paged out. Being read-only, they are simply discarded from the memory. And I'll repeat, actively used executable pages are explicitly excluded from reclaim and never discarded.

t-3 2 hours ago

The reason you're supposed to have swap equal in size to your RAM is so that you can hibernate, not to make things faster. You can easily get away with far less than that because swap is rarely needed.

dspillett 2 hours ago

> so that you can hibernate

The “paging space needs to be X*RAM” and “paging space needs to be RAM+Y” predate hibernate being a common thing (even a thing at all), with hibernate being an extra use for that paging space not the reason it is there in the first place. Some OSs have hibernate space allocated separately from paging/swap space.

Balinares 2 hours ago

I do wish there was a way to reserve swap spaces for hibernation that don't contribute to the virtual memory. Else by construction the hibernation space is not sufficient for the entire virtual memory space, and hibernation will fail when the virtual memory is getting full.

ch_123 2 hours ago

You're implying that people are telling you to set up swap without any reason, when in fact there are good reasons - namely dealing with memory pressure. Maybe you could fit so much RAM into your computer that you never hit pressure - but why would you do that vs allocating a few GB of disk space for swap?

Also, as has been pointed out by another commenter, 8GB of swap for a system with 8GB of physical memory is overkill.

tremon 2 hours ago

I'm also in the GP's camp; RAM is for volatile data, disk is for data persistence. The first "why would you do that" that needs to be addressed is why volatile data should be written to disk. And "it's just a few % of your disk" is not a sufficient answer to that question.

112233 2 hours ago

> RAM is for volatile data, disk is for data persistence.

Genuinely curious where this idea has come from. Is it something being taught currently?

ch_123 2 hours ago

Because of cost - particularly given the current state of the RAM market. In order to have so much memory that you never hit memory spikes, you will deliberately need to buy RAM to never be used.

Note that simply buying more RAM than what you expect to use is not going to help. Going back to my post from earlier, I had a laptop with 8GB of RAM at a time where I would usually only need about 2-4GB of RAM for even relatively heavy usage. However, every once in a while, I would run something that would spike memory usage and make the system unresponsive. While I have much more than 8GB nowadays, I'm not convinced that it's enough to have completely outrun the risk of this sort of behaviour recurring.

direwolf20 an hour ago

Swap causes thrashing, making the whole system unusable, instead of a clean OOM kill

NekkoDroid 43 minutes ago

IMO OOM killing should be reserved for single processes misbehaving. When a lot of different applications just use a decent amount of memory and exhaust the system RAM swapping to disk is the appropriate thing to do.

man8alexd 38 minutes ago

When you set cgroup limits, you tell the kernel how to determine when a process is misbehaving and needs to be OOM-killed.
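
On cgroup v2 those limits are plain files in the group's directory. A minimal sketch, assuming a cgroup already exists at a path you choose (the helper names and any example path are hypothetical): `memory.max` is the hard cap that triggers an OOM kill inside the group, and `memory.swap.max` caps how much of the group can be swapped out.

```python
from pathlib import Path


def limit_files(memory_max_bytes: int, swap_max_bytes: int) -> dict:
    """Map cgroup v2 control file names to the values to write."""
    return {
        "memory.max": str(memory_max_bytes),
        "memory.swap.max": str(swap_max_bytes),
    }


def apply_limits(cgroup_dir: str, files: dict) -> None:
    """Write each value into the cgroup directory (needs write permission)."""
    for name, value in files.items():
        (Path(cgroup_dir) / name).write_text(value + "\n")
```

For example, `apply_limits("/sys/fs/cgroup/build", limit_files(2 * 1024**3, 0))` would cap a hypothetical `build` group at 2 GiB with no swap, so a runaway job in it gets killed instead of dragging the whole machine into thrashing.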

man8alexd an hour ago

swap causes thrashing if you have too large swap and no cgroup limits.

ch_123 3 hours ago

I think it's some kind of misplaced desire to be "lightweight" and avoid allocating disk space that cannot be used for regular storage. My motivation way back when for wanting to avoid swap was due to concerns about SSD wear issues, but those were solved a long time ago.

man8alexd 3 hours ago

8-16-32gb of swap space without cgroup limits would get the system into swap thrashing and make it unresponsive.

solstice 3 hours ago

I had a similar experience with Kubuntu on an XPS 13 from 2016 with only 8GB of RAM and the system suddenly freezing so hard that a hard reboot was required. While looking for the cause, I noticed that the system had only 250 MB of swap space. After increasing that to 10 GB there have been no further instances of freezing so far.

wongarsu 3 hours ago

It's unloved on Linux because using Linux under memory pressure sucks. But that's not a good reason to abandon improvements. Even more so with the direction RAM prices are headed

gf000 2 hours ago

It sucks for interactive use only. It could be solved in user space (see the other comment with cgroups), it just isn't.

man8alexd 3 hours ago

It sucks without proper cgroup limits because swap makes OOM slower to trigger. Either set the cgroup limits or make the swap small.

ChocolateGod 2 hours ago

This requires additional setup from the user, the default setup should just "work".

man8alexd 2 hours ago

There are different definitions of "just work".

SCdF 3 hours ago

You should still use swap. It's not "2x RAM" as advice anymore, and hasn't been for years: https://chrisdown.name/2018/01/02/in-defence-of-swap.html

tl;dr; give it 4-8GB and forget about it.

ch_123 3 hours ago

I've heard "square root of physical memory" as a heuristic, although in practice I use less than this with some of my larger systems.

man8alexd 3 hours ago

boomlinde an hour ago

That's not so much a rule of thumb as an assessment you can only make after thorough experimentation or careful analysis.

man8alexd an hour ago

You don't need "thorough experimentation or careful analysis". Just keep free swap space below few hundred megabytes but above zero.

boomlinde an hour ago

"Keep swap space below few hundred megabytes but above zero" is a good example of a rule of thumb.

"Make the swap large enough to keep all inactive anonymous pages after the workload has stabilized, but not too large to cause swap thrashing and a delayed OOM kill if a fast memory leak happens" is not.

krautsauer 2 hours ago

I rely on it heavily. Have you tried zram swap?

sl-1 3 hours ago

It is still useful for many workloads, I use it in work and on my own machines