In defence of swap: common misconceptions (2018)

(chrisdown.name)

108 points | by jitl a day ago

138 comments

  • LegionMammal978 a day ago

    On my desktop system, most of my problems with swap come from dealing with the aftermath of an out-of-control process eating all my RAM. In this case, the offending program demands memory so quickly that everything from legitimate programs gets swapped out. These programs proceed to run poorly for the next several minutes to an hour depending on usage, since the OS only swaps pages back in once they are referenced, even if there is plenty of free space not even being used in the disk cache.

    Eventually I wrote a small script that does the equivalent of "sudo swapoff -a && sudo swapon -a" to eagerly flush everything to RAM, but I was surprised by how many people seemed to think there's no legitimate reason to ever want to do so.
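
    For anyone curious, the core of it is tiny; a minimal sketch (assumes root, and that everything currently in swap fits into free RAM):

      #!/bin/sh
      # drain swap back into RAM, then re-enable it
      # (fails partway through if free RAM can't hold what's currently swapped out)
      set -e
      swapoff -a
      swapon -a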

    • Sophira 16 hours ago

      > I was surprised by how many people seemed to think there's no legitimate reason to ever want to do so.

      Sounds like it's as legitimate as running the sync command - ie. ideally you should never need to do it, but in practice you sometimes do.

      • aidenn0 9 hours ago

        I still run "sync" before removing a USB drive. I'm sure it's entirely unnecessary now, but old habits die hard.

        • ziml77 8 hours ago

          You definitely want to ensure buffers are flushed. Because very, very annoyingly it's not default behavior on Linux distros for removable devices to be mounted with write caching disabled. I don't even know of an easy option to make Linux do that. I think you'd need to write some custom udev rule
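
          For a one-off you can get the old floppy-style behaviour with the sync mount option, though it chews through cheap flash; a sketch, assuming the stick shows up as /dev/sdb1:

            # synchronous writes: nothing is left sitting in the page cache to flush
            # (warning: "sync" on FAT filesystems increases write amplification/wear)
            sudo mount -o sync /dev/sdb1 /mnt/usb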

    • ciupicri 17 hours ago

      To add insult to injury, swapoff doesn't read from disk sequentially, but in some "random" order, so if you're using a hard disk it's a huge pain, although even SSDs would benefit from sequential reads.

      • grogers 13 hours ago

        When I was last messing with this ~10 years ago, even with SSD swapoff was just insanely slow. Even relatively small single digit GB swap partitions would take many minutes to drain. I think it was loading one page at a time from swap or something.

    • hugo1789 a day ago

      That works if there is enough memory after the "bad" process has been killed. The question is, is it necessary? Many systems can live with processes performing a little bit poorly for some minutes and I wouldn't do it.

      • creer a day ago

        It's fine that "many systems" can. But there is no easy way for when the user or system can't. Flushing back to RAM is slow - that's not controversial. So it would help if there were a way to do this in advance of the need, for the programs where it matters.

        • aeonik a day ago

          You mean like vmtouch and madvise?

          I use vmtouch all the time to preload or even lock certain data/code into RAM.
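
          For example (the path is just a placeholder):

            # preload a file's pages into the page cache
            vmtouch -t /srv/app/data.db
            # daemonize and mlock them so they can't be evicted (needs root or CAP_IPC_LOCK)
            sudo vmtouch -dl /srv/app/data.db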

      • michaelt a day ago

        > The question is, is it necessary? Many systems can live with processes performing a little bit poorly for some minutes and I wouldn't do it.

        The outage ain't resolved until things are back to operating normally.

        If things aren't back to 100% healthy, could be I didn't truly find the root cause of the problem - in which case I'll probably be woken up again in 30 minutes when the problem comes back.

        • whatevaa 14 hours ago

          Desktops are not servers. There could be no problem, just some hungry legitimate program (or vm).

    • Zefiroj 17 hours ago

      Check out the lru_gen_min_ttl from MGLRU.
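
      For reference, on kernels built with MGLRU the knob lives in sysfs; a sketch (the 1000 ms value is only illustrative):

        # keep the working set of the last N ms resident; if it can't fit,
        # the kernel OOM-kills instead of thrashing
        echo y    | sudo tee /sys/kernel/mm/lru_gen/enabled
        echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms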

    • a day ago
      [deleted]
    • man8alexd a day ago

      [dead]

  • ggm a day ago

    Recognition that older Linux swap strategies were sometimes unhelpful, which this piece of writing provides, validates our past sense that it wasn't working well. Regaining trust takes time.

    Sometimes I think if backing store and swap were more clearly delineated we might have got to decent algorithms sooner. Having a huge amount of swap pre-emptively claimed was making it look like starvation, when it was just a runtime planning strategy. It's also confusing how top and vmstat report things.

    Also, as a BSD mainly person, I think the differences stand out. I haven't noticed an OOM killer approach on BSD.

    Ancient model: twice as much swap as memory

    Old model: same amount of swap as memory

    New model: amount of swap your experience tells you this job mix demands to manage memory pressure fairly, which is a bit of a tall ask sometimes, but basically pick a number up to memory size.

    • creshal a day ago

      > Also, as a BSD mainly person, I think the differences stand out. I haven't noticed an OOM killer approach on BSD.

      BSD allocators simply return errors if no more memory is available; for backwards compatibility reasons Linux is stuck with a fatally flawed API that doesn't.

      • jcalvinowens a day ago

        You can trivially disable overcommit on Linux (vm.overcommit_memory=2) to get allocation failures instead of OOMs. But you will find yourself spending a lot more money on RAM :)
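
        For reference, something like this (the ratio is illustrative):

          # strict accounting: commit limit = swap + RAM * overcommit_ratio / 100
          sudo sysctl -w vm.overcommit_memory=2
          sudo sysctl -w vm.overcommit_ratio=80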

        • hugo1789 a day ago

          And debug many tools which still ignore the fact that malloc could fail.

      • man8alexd a day ago

        I assumed the same, but I just discovered that FreeBSD has vm.overcommit too. I'm not sure how it works, though.

        • toast0 19 hours ago

          Overcommit is subtle. If you allocate a bunch of address space and don't touch it, that's one thing.

          If you allocate and touch everything, and then try to allocate more, it's better to get an allocation error than an unsatisfiable page fault later.

          My understanding (which could very well be wrong) is Linux overcommit will continue to allocate address space when asked regardless of memory pressure; but FreeBSD overcommit will refuse allocations when there's too much memory pressure.

          I'm pretty sure I've seen FreeBSD's OOM killer, but it needs a specific pattern of memory use: it's much more likely for an application to get a failed allocation and exit, freeing memory, than for all the applications to have unused allocations that they then use.

          All that said, I prefer to run with a small swap, somewhere around 0.5-2GB. Memory pressure is hard to measure (although recent linux has a measure that I haven't used), but swap % and swap i/o are easy to measure. If your swap grows quickly, you might not have time to do any operations to fix it, but your stats should tell the tale. If your swap grows slowly enough, you can set thresholds and analyze the situation. If you have a lot of swap i/o that provides a measure of urgency.
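
          (The stock tools already expose both numbers, for example:)

            $ free -m       # the "Swap:" row shows total vs. used
            $ vmstat 5      # the "si"/"so" columns are swap-in/out in KiB/s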

          • jcalvinowens 17 hours ago

            > If you allocate and touch everything, and then try to allocate more, it's better to get an allocation error than an unsatisfiable page fault later.

            It depends, but generally speaking I'd disagree with that.

            The only time you actually want to see the allocation failures is if you're writing high reliability software where you've gone to the trouble to guarantee some sort of meaningful forward progress when memory is exhausted. That is VERY VERY hard, and quickly becomes impossible when you have non-trivial library dependencies.

            If all you do is raise std::bad_alloc or call abort(), handling NULL return from malloc() is arguably a waste of icache: just let it crash. Dereferencing NULL is guaranteed to crash on Linux, only root can mmap() the lowest page.

            Admittedly I'm anal, and I write the explicit code to check for it and call abort(), but I know very experienced programmers I respect who don't.

            • toast0 17 hours ago

              > If all you do is raise std::bad_alloc or call abort(), handling NULL return from malloc() is arguably a waste of icache: just let it crash. Dereferencing NULL is guaranteed to crash on Linux, only root can mmap() the lowest page.

              If you don't care to handle the error, which is a totally reasonable position, there's not a whole lot of difference between the allocator returning a pointer that will make you crash on use because it's zero, and a pointer that will make you crash on use because there are no pages available. There is some difference because if you get the allocation while there are no pages available, the fallible allocator has returned a permanently dead pointer and the unfailing allocator has returned a pointer that can work in the future.

              But if you do want to respond to errors, it is easier to respond to a NULL return rather than to a failed page fault. I certainly agree it's not easy to do much other than abort in most cases, but I'd rather have the opportunity to try.

    • a day ago
      [deleted]
    • kijin a day ago

      For modern Linux servers with large amounts of RAM, my rule of thumb is between 1/8 and 1/32 of RAM, depending on what the machine is for.

      For example, one of my database servers has 128GB of RAM and 8GB of swap. It tends to stabilize around 108GB of RAM and 5GB of swap usage under normal load, so I know that a 4GB swap would have been less than optimal. A larger swap would have been a waste as well.

      • ChocolateGod a day ago

        I no longer use disk swap for servers, instead opting for Zram with a maximum of 50% of RAM capacity and a high swappiness value.

        It'd be cool if Zram could apply to the RAM itself (like macOS) rather than needing a fake swap device.
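
        Roughly this kind of setup, for the curious (sizes and the sysctl value are illustrative, assuming a 32 GiB box with the zram module loaded):

          sudo zramctl --algorithm=zstd --size=16GiB /dev/zram0
          sudo mkswap /dev/zram0
          sudo swapon --priority 100 /dev/zram0
          # prefer swapping to compressed RAM over dropping page cache
          sudo sysctl -w vm.swappiness=180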

      • man8alexd a day ago

        The proper rule of thumb is to make the swap large enough to keep all inactive anonymous pages after the workload has stabilized, but not so large that it causes swap thrashing and a delayed OOM kill if a fast memory leak happens.

        Another rule of thumb is that performance degradation due to the active working set spilling into the swap is exponential - 0.1% excess causes 2x degradation, 1% - 10x degradation, 10% - 100x degradation (assuming 10^3 difference in latency between RAM and SSD).

        • kijin a day ago

          I would approach the issue from the other direction. Start by buying enough RAM to contain the active working set for the foreseeable future. Afterward, you can start experimenting with different swap sizes (swapfiles are easier to resize, and they perform exactly as well as swap partitions!) to see how many inactive anonymous pages you can safely swap out. If you can swap out several gigabytes, that's a bonus! But don't take that for granted. Always be prepared to move everything back into RAM when needed.
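
          A swapfile sketch, for reference (the size is illustrative; on btrfs the file has to be created NOCOW first):

            sudo fallocate -l 4G /swapfile
            sudo chmod 600 /swapfile
            sudo mkswap /swapfile
            sudo swapon /swapfile
            # resizing later is just: swapoff, recreate at the new size, mkswap, swapon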

  • creshal a day ago

    I wish people would actually read TFA instead of reflexively repeating nonsensical folk remedies.

    • JdeBP a day ago

      I've been telling people about this since the days when there were operating systems still around that actually did swapping (16-bit OS/2, old Unix, Standard Mode DOS+Windows) rather than paging (32-bit OS/2, 386 Enhanced Mode DOS+Windows, Windows NT). I wrote a Frequently Given Answer about it in 2007, I had had to repeat the point so many times since the middle 1990s; and I was far from alone even then.

      * http://jdebp.uk./FGA/dont-throw-those-paging-files-away.html

      The erroneous folk wisdom is widespread. It often seems to lack any mention of the concepts of a resident set and a working set, and is always mixed in with a wishful thinking idea that somehow "new" computers obviate this, when the basic principles of demand paging are the same as they were four decades ago, Parkinson's Law can still be observed operating in the world of computers, and the "new" computers all of those years ago didn't manage to obviate paging files either.

      • p_ing a day ago

        The swapfile.sys in Windows 8+ is used for process swapping (moving the entire private working set out of memory to disk), but only for UWP applications.

    • mmphosis 16 hours ago

      TFA means The Fine Article

  • Panzerschrek a day ago

    As I understand this article, swap is useful for cases where many long-lived programs (daemons) allocate a lot of memory, but almost never access it. But wouldn't it be better to avoid writing such programs? And how much memory can such daemons consume? A couple of hundred megabytes total? Is it really that much on modern systems?

    My experience with swap shows that it only makes things worse. When I program, my application may sometimes allocate a lot of memory due to some silly bug. In such a case the whole system practically stops working - even the mouse cursor can't move. If I am lucky, the OOM killer will eventually kill my buggy program, but after that it's not over - almost all used memory is now in swap and the whole system works snail-slow, presumably because the kernel doesn't think it should really unswap previously swapped memory and does this only on demand and only page by page.

    In a hypothetical case without swap this isn't so painful. When main system memory is almost fully consumed, the OOM killer kills the most memory-hungry program and all other programs just continue working as before.

    I think that overall reliance on swap is nowadays just a legacy of old times when main memory was scarce, and back then it maybe was useful to have swap. OS kernels should be redesigned to work without swap; this will make system behavior smoother and kernel code may be simpler (all this swapping code may be removed) and thus faster.

    • Rohansi a day ago

      > As I understand this article, swap is useful for cases where many long-lived programs (daemons) allocate a lot of memory, but almost never access it. But wouldn't it be better to avoid writing such programs?

      Ideally yes, but is that something you keep in mind when you write software? Do you ever consider freeing memory just because it hasn't been used in a while? How do you decide when to free it? This is all handled automatically when you have swap enabled, and at a much finer granularity than you could practically implement manually.

      • Panzerschrek a day ago

        I write mostly C++ or Rust programs. In these languages memory is freed as soon as it's no longer in use (thanks to destructors). So, usually this shouldn't be actively kept in mind. The only exceptions are cases like caches, but long-running programs should use caching carefully - limit cache size and free cache entries after some amount of time.

        Programs that allocate large amounts of memory without strict necessity to do so are just a consequence of swap existence. "Thanks" to swap they weren't properly tested in low-memory conditions and thus the necessary optimizations were never done.

        • Rohansi a day ago

          You'll also need to consider that the allocator you're using may not immediately return freed memory to the system. That memory is free to be reused by your application, but it still counts as used memory mapped to your program.

          Anyway, it's easy to discuss best practices but people actually following them is the actual issue. If you disable swap and the software you're running isn't optimized to minimize idle memory usage then your system will be forced to keep all of that data in RAM.

          • man8alexd a day ago

            You are both confusing swap and memory overcommit policy. You can disable swap by compiling the kernel with `CONFIG_SWAP=no`, but it won't change the memory overcommit policy, and programs would still be able to allocate more memory than available on the system. There is no problem in allocating the virtual memory - if it isn't used, it never gets mapped to the physical memory. The problem is when a program tries to use more memory than the system has, and you will get OOMs even with the swap disabled. You can disable memory overcommit, but this is only going to result in malloc() failing early while you still have tons of memory.

            • Rohansi 11 hours ago

              Overcommit is different. We are referring to infrequently used memory - allocated, has been written to, but is not accessed often.

              • man8alexd 6 hours ago

                > Programs that allocate large amounts of memory without strict necessity to do so are just a consequence of swap existence.

                This. The ability to allocate large amounts of memory is due to memory overcommit, not the "swap existence". If you disable swap, you can still allocate memory with almost no restrictions.

                > This is all handled automatically when you have swap enabled

                And this. This statement doesn't make any sense. If you disable swap, kernel memory management doesn't change, you only lose the ability to reclaim anon pages.

        • csmantle a day ago

          A side note: stack memory is usually not physically returned to the OS. When (de)allocating on the stack, only the stack pointer is moved within the pages preallocated by the OS.

        • jibal a day ago

          > Programs that allocate large amounts of memory without strict necessity to do so are just a consequence of swap existence. "Thanks" to swap they weren't properly tested in low-memory conditions and thus the necessary optimizations were never done.

          Who told you this? It's not remotely true.

          Here's an article about this subject that you might want to read:

          https://chrisdown.name/2018/01/02/in-defence-of-swap.html

        • rwmj a day ago

          > In these languages memory is freed as soon as it's no longer in use (thanks to destructors).

          Unless you have an almost pathological attention to detail, that is not true at all. And even if you do precisely scope your destructors, the underlying allocator won't return the memory to the OS (what matters here) immediately.

      • immibis a day ago

        And were you aware that freeing memory only allows it to be reallocated within your process but doesn't actually release it from your process? State-of-the-art general-purpose allocators are actually still kind of shit.

    • zozbot234 a day ago

      > In a hypothetical case without swap this isn't so painful. When main system memory is almost fully consumed, the OOM killer kills the most memory-hungry program

      That's not how it works in practice. What happens is that program pages (and read-only data pages) get gradually evicted from memory and the system still slows to a crawl (to the point where it becomes practically unresponsive) because every access to program text outside the current 4KB page now potentially involves a swap-in. Sure, eventually, the memory-hungry task will either complete successfully or the OOM killer will be called, but that doesn't help you if you care about responsiveness first and foremost (and in practice, desktop users do care about that - especially when they're trying to terminate that memory hog).

      • Panzerschrek a day ago

        Why not just always preserve program code in memory? It's usually not that much - typical executable is usually several megabytes in size and many processes can share the same code memory pages (especially with shared libraries).

        • creshal a day ago

          > It's usually not that much - typical executable is usually several megabytes in size and many processes can share the same code memory pages (especially with shared libraries)

          Have a look at Chrome. Then have a look at all the Electron "desktop" apps, which all ship with a different Chrome version and different versions of shared libraries, which all can't share memory pages, because they're subtly different. You find similar patterns across many, many other workloads.

          • teddyh 21 hours ago

            Or modern languages, like Rust and Go, which have decided that runtime dependencies are too hard and instead build enormous static binaries for everything.

        • man8alexd a day ago

          Programs and shared libraries (pages with VM_EXEC attribute) are kept in the memory if they are actively used (have the "accessed" bit set by the CPU) and are least likely to be evicted.

        • inkyoto 21 hours ago

          > Why not just always preserve program code in memory?

          Because the code is never required in its entirety – only «currently» active code paths need to be resident in memory, the rest can be discarded when inactive (or never even gets loaded into memory to start off with) and paged back into memory on demand. Since code pages are read only, the inactive code pages can be just dropped without any detriment to the application whilst reducing the app's memory footprint.

          > […] typical executable is usually several megabytes

          Executable size != the size of the actually running code.

          In modern operating systems with advanced virtual memory management systems, the actual resident code size can go as low as several kilobytes (or, rather, a handful of pages). This, of course, depends on whether the hot paths in the code have a close affinity to each other in the linked executable.

    • toast0 19 hours ago

      You may benefit by reducing your swap size significantly.

      The old rule of thumb of 1-2x your ram is way too much for most systems. The solution isn't to turn it off, but to have a sensible limit. Try with half a gig of swap and see how that does. It may give you time to notice the system is degraded and pick something to kill yourself and maybe even debug the memory issue if needed. You're not likely to have lasting performance issues from too many things swapped out after you or the OOM killer end the memory pressure, because not much of your memory will fit in swap.

    • MomsAVoxell a day ago

      > But wouldn't it be better to avoid writing such programs?

      Think of long-term recording applications, such as audio or studio situations where you want to "fire and forget" a reliable recording system that consistently captures large amounts of data from multiple streams for extended durations, for example.

      • dns_snek a day ago

        Why wouldn't you write that data to disk? Holding it all in RAM isn't exactly a reliable way of storing data.

        • MomsAVoxell a day ago

          What do you think is happening with swap, exactly?

          • robotresearcher 18 hours ago

            A process’s memory in swap does not persist after the process quits or crashes.

            • MomsAVoxell 18 hours ago

              That is true, but the point is that having swap available increases the time between recording samples and needing to commit them to disk.

              Well-written, long term recording software doesn’t quit or crash. It records what it needs to record, and - by using swap - gives itself plenty of time to flush the buffers using whatever techniques are necessary for safety.

              Disclaimer: I’ve written this software, both with and without swap available in various embedded contexts, in real products. The answer to the question is that having swap means higher data rates can be attained before needing to sync.

              • robotresearcher 10 hours ago

                > Well-written, long term recording software doesn’t quit or crash.

                Power outages, hardware failures, and OS bugs happen to the finest application software.

                I believe you, from your experience, that it can be useful to have recorded buffers swap out before flushing them to durable storage. But I do find it a bit surprising: since the swap system has to do the storage flush, you are paying for the IO anyway - why not do it durably?

                The fine article argued that you can save engineer cycles by having the OS manage optimizing out-of-working set memory for you, but that isn’t what you’re arguing here.

                I’m interested in understanding your point.

          • dns_snek 18 hours ago

            That's weirdly passive aggressive, swap isn't durable data storage.

            Reliably recording massive amounts of data for extended periods of time in a studio setting is the most obvious use case for a fixed-size buffer that gets flushed to durable storage at short and predictable time intervals. You wouldn't want a segfault wiping out the entire day's worth of work, would you?

            • MomsAVoxell 18 hours ago

              I didn’t mean to imply that swap was durable data storage.

              Having swap/more memory available just means you have more buffers before needing to commit and in certain circumstances this can be very beneficial, such as when processing of larger amounts of logged data is needed prior to committing, etc.

              There is certainly a case for both having and using swap, and disabling it entirely, depending on the data load and realtime needs of the application. Processing data and saving data have different requirements, and the point is really that there is no black and white on this. Use swap if it’s appropriate to the application - don’t use it, if it isn’t.

              • dns_snek 17 hours ago

                I don't really understand what problem you're solving by doing it that way.

                Instead of storing data (let's call them samples) to durable storage to begin with, you're letting the OS write them to swap which incurs the same cost, but then you need to read them from swap and write them to a different partition again (~triple the original cost).

    • jcynix a day ago

      > When I program, my application may sometimes allocate a lot of memory due to some silly bug. In such a case the whole system practically stops working [...]

      You can limit resource usage per process, so your buggy application could be killed long before the system slows to a crawl. See your shell's entry on its limit/ulimit built-in, or use

      man prlimit(1) - get and set process resource limits
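
      For example (the limit and program name are illustrative):

        # cap the address space of a single test run at ~8 GiB
        prlimit --as=$((8 * 1024 * 1024 * 1024)) ./my-leaky-program
        # or run it under a cgroup memory cap and let only that cgroup hit OOM
        systemd-run --user --scope -p MemoryMax=8G ./my-leaky-program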

    • blueflow a day ago

      Programs run from program text, program text is mapped in as named pages (disk cache). They are evictable! And without swap, they will get evicted on high memory pressure. Program text thrashing is worse than having swap.

      The problem is not the existence of swap, but that people are unaware that the disk cache is equally important for performance.

      • man8alexd a day ago

        VM_EXEC pages are explicitly deprioritized from the reclaim by the kernel. Unlike any other pages, they are put into the active LRU on the first use and remain in the active LRU if they are active.

        • blueflow 15 hours ago

          ... until there are no deprioritized pages left to evict.

          • man8alexd 15 hours ago

            Pages from the active LRU are not evicted. Pages from the inactive LRU with the "accessed" bit set are also not evicted.

      • Panzerschrek a day ago

        Loading program code from disk on demand is yet another piece of old crap. Nowadays it's just easier to load the whole executable into memory and always keep it there.

    • kalleboo a day ago

      > When I program, my application may sometimes allocate a lot of memory due to some silly bug

      I had one of those cases a few years ago when a program I was working on was leaking 12 MP raw image buffers in a drawing routine. I set it off running and went browsing HN/chatting with friends. A few minutes later I was like "this process is definitely taking too long" and when I went to check on it, it was using up 200+ GB of RAM (on a 16 GB machine) which had all gone to swap.

      I hadn't noticed a thing! Modern SSDs are truly a marvel... (this was also on macOS rather than Linux, which may have a better swap implementation for desktop purposes)

    • Ferret7446 a day ago

      Kinda, basically. Swap is a cost optimization for "bad" programs.

      Having more RAM always gives better performance, but swap allows you to skimp on RAM in certain cases for almost identical performance at a lower cost (of buying more RAM), if you run programs that allocate a lot of memory that they subsequently don't use. I hear Java is notoriously bad at this, so if you run a lot of heavy enterprise Java software, swap can get you the same performance with half the RAM.

      (It is also a "GC strategy", or stopgap for memory leaks. Rather than managing memory, you "could" just never free memory, and allocate a fat blob of swap and let the kernel swap it out.)

    • creshal a day ago

      > But wouldn't it be better to avoid writing such programs?

      Yes, indeed, the world would be a better place if we had just stopped writing Java 20 years ago.

      > And how much memory can such daemons consume? A couple of hundred megabytes total?

      Consider the average Java or .net enterprise programmer, who spends his entire career gluing together third-party dependencies without ever understanding what he's doing: Your executable is a couple hundred megabytes already, then you recursively initialize all the AbstractFactorySingletonFactorySingletonFactories with all their dependencies monkey patched with something worse for compliance reasons, and soon your program spends 90 seconds simply booting up and sits at two or three dozen gigabytes of memory consumption before it has served its first request.

      > Is it really that much on modern systems?

      If each of your Java/.net business app VMs needs 50 or so gigabytes to run smoothly, you can only squeeze ten of them into a 1U pizza box with a mere half terabyte of RAM; while modern servers allow you to cram in multiple terabytes, do you really want to spend several tens of thousands of dollars on extra RAM, when swap storage is basically free?

      Cloud providers do the same math, and if you look at e.g. AWS, swap on EBS costs as much per month as the same amount of RAM costs per hour. That's almost three orders of magnitude cheaper.

      > When I program, my application may sometimes allocate a lot of memory due to some silly bug.

      Yeah, that's on you. Many, many mechanisms let you limit per-process memory consumption.

      But as TFA tries to explain, dealing with this situation is not the purpose of swap, and never has been. This is a pathological edge case.

      > almost all used memory is now in swap and the whole system works snail-slow, presumably because the kernel doesn't think it should really unswap previously swapped memory and does this only on demand and only page by page.

      This requires multiple conditions to be met

      - the broken program is allocating a lot of RAM, but not quickly enough to trigger the OOM killer before everything has been swapped out

      - you have a lot of swap (do you follow the 1990s recommendation of having 1-2x the RAM amount as swap?)

      - the broken program sits in the same cgroup as all the programs you want to keep working even in an OOM situation

      Condition 1 can't really be controlled, since it's a bug anyway.

      Condition 2 doesn't have to be met unless you explicitly want it to. Why do you?

      Condition 3 is realistically met on desktop environments: despite years of messing around with flatpaks and snaps and all that nonsense, they're not making it easy for users to isolate programs they run that haven't been pre-containerized.

      But simply reducing swap to a more realistic size (try 4GB, see how far it gets you) will make this problem much less dramatic, as only parts of the RAM have to get flushed back.

      > In a hypothetical case without swap this isn't so painful. When main system memory is almost fully consumed, the OOM killer kills the most memory-hungry program and all other programs just continue working as before.

      And now you're wasting RAM that could be used for caching file I/O. Have you benchmarked how much time you're wasting through that?

      > I think that overall reliance on swap is nowadays just a legacy of old times when main memory was scarce, and back then it maybe was useful to have swap.

      No, you just still don't understand the purpose of swap.

      Also, "old times"? You mean today? Because we still have embedded environments, we have containers, we have VMs, almost all software not running on a desktop is running in strict memory constraints.

      > and kernel code may be simpler (all this swapping code may be removed)

      So you want to remove all code for file caching? Bold strategy.

    • inkyoto 20 hours ago

      Swapping (or, rather, paging – I don't think there is an operating system in existence today that swaps out entire processes) does not make modern systems slower – it is a delusion and an urban legend that originated in the sewers of the intertubes and is based on an uninformed opinion rather than an understanding of how virtual memory systems work. It has been regurgitated to death, and the article explains really well why it is a delusion.

      20-30 years ago, heavy paging often crippled consumer Intel-based PCs[0] because paging went to slow mechanical hard disks on PATA/IDE, a parallel device bus (until circa 2005), which had little parallelism and initially no native command queuing; SCSI drives did offer features such as tagged command queuing and efficient scatter-gather but were uncommon on desktops, let alone laptops. Today the bottlenecks are largely gone – abundant RAM, switched interconnects such as PCIe, SATA with NCQ/AHCI, and solid-state storage, especially NVMe, provide low-latency, highly parallel I/O – so paging still signals memory pressure yet is far less punishing on modern laptops and desktops.

      Swap space today has a quieter benefit: lower energy use. On systems with LPDDR4/LPDDR5, the memory controller can place inactive banks into low-power or deep power-down states; by compressing memory and paging out cold, dirty pages to swap, the OS reduces the number of banks that must stay active, cutting DRAM refresh and background power. macOS on Apple Silicon is notably aggressive with memory compression and swap and works closely with the SoC power manager, which can contribute to the strong battery life of Apple laptops compared with competitors, albeit this is only one factor amongst several.

      [0] RISC workstations and servers have had switched interconnects since day 1.

  • jitl a day ago

    I am testing a distributed database-like system at work that makes heavy use of swap. At startup, we read a table from S3 and compute a recursive materialized view over it. This needs about 4TB of “memory” per node while computing, which we provide as 512gb of RAM + 3900GB of NVMe zswap enabled swap devices. Once the computation is complete, we’re left with a much smaller working set index (about 400gb) we use to serve queries. For this use-case, swap serves as a performant and less labor intensive approach to manually spilling the computation to disk in application code (although there is some mlock going on; it’s not entirely automatic). This is like a very extreme version of the initialization-only pages idea discussed in the article.
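
    For context, the zswap side of a setup like this is just the stock sysfs knobs plus ordinary swap devices; a rough sketch (values and the device name are illustrative, not our exact config):

      echo 1    | sudo tee /sys/module/zswap/parameters/enabled
      echo zstd | sudo tee /sys/module/zswap/parameters/compressor
      echo 20   | sudo tee /sys/module/zswap/parameters/max_pool_percent
      sudo mkswap /dev/nvme1n1 && sudo swapon /dev/nvme1n1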

    The warm up computation does take like 1/4 the time if it can live entirely in RAM, but using NVMe as “discount RAM” reduces the United States dollar cost of the system by 97% compared to RAM-only.

    • zozbot234 a day ago

      The problem with heavy swapping on NVMe (or other flash memory) is that it wears out the flash storage very quickly, even for seemingly "reasonable" workloads. In a way, the high performance of NVMe can work against you. Definitely something you want to check out via SMART or similar wearout stats.

      • justsomehnguy 22 minutes ago

        > that it wears out the flash storage very quickly

        Only if you use consumer-grade flash with a non-consumer-grade usage pattern.

        For anything with DWPD >= 1 it's not an issue, e.g.:

        https://news.ycombinator.com/item?id=45273937

      • ciupicri 18 hours ago

        For what it's worth, these are the lifetime estimates for the Micron 7450 SSD [1]:

            Model  Capacity  4K Rand  128K Seq
                       [GB]    [TBW]     [TBW]
            PRO        3840    7_300    24_400
            PRO        7680   14_000    48_800
            MAX        3200   17_500    30_900
            MAX        6400   35_000    61_800
        
        > Values represent the theoretical maximum endurance for the given transfer size and type. Actual lifetime will vary by workload …

        > Total bytes written calculated assuming drive is 100% full (user capacity) with workload of 100% random aligned 4KB writes.

        [1]: page 6/17, https://assets.micron.com/adobe/assets/urn:aaid:aem:d133a40b...

      • jitl 20 hours ago

        Let’s say we’re spending $1 million on hardware hypothetically with the swap setup.

        At that price point, either we use swap and let the kernel engineers move data from RAM to disk and back, or we disable swap and need user space code to move the same data to disk and back. We’d need to price out writing & maintaining the user space implementation (mmap perhaps?) for it to be a fair price comparison.

        To avoid SSD wear and tear, we could spend $29 million a year more to put the data in RAM only. Not worth!

        (We rent EC2 instances from AWS, so SSD wear is baked into the pricing)

      • p_ing a day ago

        While what you stated is overall not true, who cares with a 97% cost savings vs RAM? Just pop in another NVMe when one fails.

      • inkyoto 20 hours ago

        Not an issue for the commenter – since they have mentioned S3, they are either using AWS EBS or instance attached scratch NVMe's which the vendor (AWS) takes care of.

        The AWS control plane will detect an ailing SSD backing up the EBS and will proactively evacuate the data before the physical storage goes pear shaped.

        If it is an EC2 instance with an instance attached NVMe, the control plane will issue an alert that can be automatically acted upon, and the instance can be bounced with a new EC2 instance allocated from a pool of the same instance type and get a new NVMe. Provided, of course, the design and implementation of the running system are stateless and can rebuild the working set upon a restart.

        • jitl 17 hours ago

          EBS is slow. No way we would use it for swap. Gotta be instance storage device. And yes, we can rebuild a node from source data, we do so regularly to release changes anyways.

          • inkyoto 8 hours ago

            I figured that you were using instance attached NVMe's since you mentioned the scale of your load – an EBS even with the io2 Express storage class can't keep up with a physical NVMe drive on high intensity I/O tasks.

            Regardless, AWS takes care of the hardware cycling / migration in either case.

      • man8alexd a day ago

        [dead]

    • dsr_ a day ago

      Have you considered having one box with 4TB of RAM to do the computation, then sending it around to all the other nodes?

      • jitl 21 hours ago

        Each node handles an independent ~4TB shard of data in horizontal scale-out fashion. Perhaps we could try some complex shenanigans where we rent 4TB RAM nodes, compute, send to 512GB RAM nodes then terminate the 4TB nodes but that’s a bunch of extra complexity for not much of a win.

    • dist-epoch a day ago

      What's the reduction of cost measured in Euros though?

  • tsoukase a day ago

    In my humble experience, if you run out of memory on Linux you are f... up, irrespective of whether swap is present and/or the OOM killer gets involved.

    On the other hand, a Raspberry Pi froze unexpectedly (not due to low memory) until a very small swap file was enabled. It was almost never used, but the freezes stopped. Fun swap stories.

    • fuzzfactor a day ago

      >There's also a lot of misunderstanding about the purpose of swap – many people just see it as a kind of "slow extra memory" for use in emergencies, but don't understand how it can contribute during normal load to the healthy operation of an operating system as a whole.

      That's the long-standing defect that needs to be corrected, then: there should be no dependence on swap existing whatsoever as long as you have more than enough memory for the entire workload.

  • ciupicri 15 hours ago

    The problem with swap for me is that, over time and for some reason, Chrome tends to be swapped out even if free shows me some available memory when I notice its sluggishness. I've tried setting vm.swappiness to <20, even 5, but it doesn't help much.

    • blueflow 15 hours ago

      Because Chrome's program text itself is reclaimable and counts as "available".

      • ciupicri 13 hours ago

        Though I can deactivate the swap partition on the HDD, and afterwards it runs acceptably. Perhaps it's using the faster zram, but I don't remember noticing zram usage increasing as shown by swapon -s. Or am I missing something?

        • blueflow 2 hours ago

          Do you have the pressure stall information data in /proc/pressure? If you have it enabled, reboot, use Chrome until you get your stall again, and then look at which of the files in there has the highest total= number; that's likely it.
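
          It looks like this (the numbers are just illustrative; total= is cumulative stall time in microseconds):

            $ cat /proc/pressure/memory
            some avg10=0.00 avg60=0.00 avg300=0.12 total=417963
            full avg10=0.00 avg60=0.00 avg300=0.05 total=203591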

          I doubt that disabling swap reduces your stalls, but i'd like to see the numbers for that.

      • man8alexd 14 hours ago

        active pages are not reclaimable.

        • blueflow 2 hours ago

          Pages will not stay in the active LRU if the accessor is stalled.

          Edit: You keep repeating that all over the thread - I started my week an hour ago and I had two of these memory stall events during the weekend. Again, it's the machines with big binaries running and no swap. Maybe you could provide a better explanation than "this doesn't happen"?

  • hedora 14 hours ago

    Is linux considered stable with swap completely disabled these days?

    In the past, they recommended against that for deadlock reasons.

    On the workloads I care about (desktop, and servers that avoid mmap), the anonymous dirty page part of the kernel heap is < 10% of RAM, so swap is mostly just there to waste space and slow down the oomkiller.

  • naniwaduni a day ago

    The way I learned it, swap is basically the inverse of file caching: in much the way that extra memory can be used to cache more frequently-used files, then evicted when a "better" use for that memory comes around; swap can be used to save rarely-used anonymous memory so that you can "evict" them when there are other things you'd rather have in memory, then pull them back into memory if they ever become relevant again.

    Or to look at it from another perspective... it lets you reclaim unprovable memory leaks.

  • fpoling 19 hours ago

    The article does not mention memory compression as an alternative to swap, which many Linux distributions enable by default.

    On the other hand, these days the latest SSDs are way faster than memory compression, even with LUKS encryption on and even when the memory compression uses LZ4. Plus modern SSDs do not suffer from frequent writes like they used to, so on my laptop I disabled memory compression, and then all the reasoning from the article applies again.

    Then, on a development laptop running compilations/containers/VMs/browsers, vm.swappiness does not seem to matter that much if one has enough memory. So I no longer tune it to 100 or more and leave it at the default of 60.

    • vlovich123 19 hours ago

      > these days latest SSD are way faster than memory compression

      That's a really provocative claim. Any benchmarks to support this?

      • fpoling 4 hours ago

        On my laptop with Samsung PRO 990 SSD and Intel Core Ultra 7 165U CPU with 64G RAM under Debian 13:

        Read test via copying to RAM from LUKS-encrypted BTRFS, with /tmp as a standard RAM disk:

          $ dd of=/tmp/input-90K.jsonl if=input-90K.jsonl conv=fdatasync bs=10M iflag=direct oflag=direct
          2596+1 records in
          2596+1 records out
          27225334502 bytes (27 GB, 25 GiB) copied, 20.4403 s, 1.3 GB/s
        
        Write test:

          $ dd if=/tmp/input-90K.jsonl of=tmp.jsonl conv=fdatasync bs=10M oflag=direct
          2596+1 records in
          2596+1 records out
          27225334502 bytes (27 GB, 25 GiB) copied, 16.8612 s, 1.6 GB/s
        
        
        Preparing RAM disk with zram compression to emulate zram:

          $ sudo zramctl --algorithm=lz4 --size=30GiB /dev/zram0
          $ sudo mkfs.ext4 /dev/zram0
          $ sudo mount /dev/zram0 /mnt
          $ df -h /mnt
          Filesystem      Size  Used Avail Use% Mounted on
          /dev/zram0       30G  2.1M   28G   1% /mnt
        
        Write test to lz4-compressed ZRAM:

          $ sudo dd if=/tmp/input-90K.jsonl of=/mnt/tmp.jsonl conv=fdatasync bs=10M oflag=direct iflag=direct
          2596+1 records in
          2596+1 records out
          27225334502 bytes (27 GB, 25 GiB) copied, 93.2813 s, 292 MB/s
        
        Read test from lz4-compressed ZRAM:

          $ dd of=/tmp/input-90K.jsonl if=/mnt/tmp.jsonl conv=fdatasync bs=10M oflag=direct iflag=direct
          2596+1 records in
          2596+1 records out
          27225334502 bytes (27 GB, 25 GiB) copied, 34.8479 s, 781 MB/s
        
        
        So SSD with LUKS is 1.5 times faster than zram for read and 5 times faster than zram for write.

        Note that without LUKS but with the SSD's native encryption, the SSD will be at least 2 times faster. Also, using a recent kernel is important so that LUKS uses the CPU's AES instructions. Without that, the SSD under LUKS will be several times slower.

        • man8alexd 2 hours ago

          You are measuring sequential throughput with a block size of 10M. Swap I/O is random 4K pages (with default readahead 32K and clustered swapout 1M), with the read latency being the most important factor.

          • fpoling an hour ago

            > are measuring sequential throughput

            On the SSD I am in fact measuring LUKS performance, as the I/O is much faster than the LUKS encryption even when it uses the specialized CPU instructions. As I wrote, without LUKS the numbers are at least twice as fast, even with random access.

            The point is that in 2025, with the latest SSDs, there is no point in using compressed memory. Even with LUKS encryption it will be faster than even a highly tuned swap setup.

            In 2022-23 when LUKS was not optimized it was different so I used hardware encryption on SSD after realizing that even lz4 compression was significantly slower than SSD.

  • tmtvl a day ago

    Nothing about zram/zswap? I know that zram is more performant but I wonder how it holds up under high memory pressure compared to zswap.

  • kijin a day ago

    > 6. Disabling swap doesn't prevent pathological behaviour at near-OOM, although it's true that having swap may prolong it. Whether the global OOM killer is invoked with or without swap, or was invoked sooner or later, the result is the same: you are left with a system in an unpredictable state. Having no swap doesn't avoid this.

    This is the most important reason I try to avoid having a large swap. The duration of pathological behavior at near-OOM is proportional to the amount of swap you have. The sooner your program is killed, the sooner your monitoring system can detect it ("Connection refused" is much more clear cut than random latency spikes) and reboot/reprovision the faulty server. We no longer live in a world where we need to keep a particular server online at all cost. When you have an army of servers, a dead server is preferable to a misbehaving server.

    OP tries to argue that a long period of thrashing will give you an opportunity for more visibility and controlled intervention. This does not match my experience. It takes ages even to log in to a machine that is thrashing hard, let alone run any serious commands on it. The sooner you just let it crash, the sooner you can restore the system to a working state and inspect the logs in a more comfortable environment.

    • mickeyp a day ago

      That assumes the OOM killer kills the right thing. It may well choose to kill something ancillary, which causes your OOM program to just hang or misbehave wildly.

      The real danger in all of this, swap or no, is the shitty OOMKiller in Linux.

      • xdfgh1112 a day ago

        You can apply memory quotas to the individual processes with cgroups. You can also adjust how likely a process is to be killed.
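
        For example, with cgroup v2 via systemd plus the per-process OOM score knob (names and limits are illustrative):

          # hard caps on RAM and swap for one job
          systemd-run --scope -p MemoryMax=2G -p MemorySwapMax=1G ./hungry-job
          # make a process the OOM killer's preferred victim (range is -1000..1000)
          echo 1000 | sudo tee /proc/$(pidof hungry-job)/oom_score_adj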

      • kijin a day ago

        The OOM killer will be just as shitty whether you have swap or not. But the more swap you have, the longer your program will be allowed to misbehave. I prefer a quick and painless death.

      • man8alexd a day ago

        Nowadays, the OOM killer always chooses the largest process in the system/cgroup by default.

    • bawolff a day ago

      > OP tries to argue that a long period of thrashing will give you an opportunity for more visibility and controlled intervention.

      I didn't get that impression. My read was that OP was arguing for user-space process killers so the system doesn't get to the point where the system becomes unresponsive due to thrashing.

      • kijin a day ago

        From the article:

        > With swap: ... We have more visibility into the instigators of memory pressure and can act on them more reasonably, and can perform a controlled intervention.

        But of course if you're doing this kind of monitoring, you can probably just check your processes' memory usage and curb them long before they touch swap.

    • danw1979 a day ago

      Amen to failing fast.

      A machine that is responding just enough to keep a circuit breaker closed is the scourge of distributed systems.

    • heavyset_go a day ago

      Maybe I'm just insane, but if I'm on a machine with ample memory, and a process for some reason can't allocate resources, I want that process to fail ASAP. Same thing with high memory pressure situations, just kill greedy/hungry processes, please.

      Like something is going very wrong if the system is in that state, so I want everything to die immediately.

      • gfv a day ago

        sysctl vm.overcommit_memory=2. However, programs for *nix-based systems usually expect overcommit to be on, for example, to support fork(). This is a stark contrast with Windows NT model, where an allocation will fail if it doesn't fit in the remaining memory+swap.

      • cmurf a day ago

        systemd-oomd does this.

        The kernel oom killer is concerned with kernel survival, not user space performance.

  • 01HNNWZ0MV43FF a day ago

    `sudo apt-get install earlyoom`

    Configure it to fire at like 5% and forget it.
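
    On Debian-style installs that's roughly the following (the exact file and flags may differ by distro):

      # /etc/default/earlyoom
      # kill the biggest offender when available RAM *and* free swap drop below 5%
      EARLYOOM_ARGS="-m 5 -s 5"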

    I've never seen the OOM do its dang job with or without swap.

  • shatsky a day ago

    The author pushes the abstract idea of "page reclamation" ahead of the things people actually want - performance, reliability and controllable service degradation - because the author believes it is the one and only solution to them, and then defends swap because it is good for it.

    No, this is just plain wrong. There are very specific problems which happen when there is not enough memory.

    1. File-backed page reads causing more disk reads, eventually ending with "programs being executed from disk" (shared libraries are also mmapped), which feels like a system lockup. This does not need any "egalitarian reclamation" abstraction or swap, and swap does not solve it. But it can be solved simply by reserving some minimal amount of memory for buf/cache, with which the system is still responsive.

    2. Eventually, failure to allocate more memory for some process. Any solution like "page reclamation" that pushes unused pages to some swap can only increase the maximum amount of memory which can be used before this happens, from one finite value to a bigger finite value. When there is no memory to free without losing data, some process must be killed. Swap does not solve this. The least bad solution would be to warn the user in advance and let them choose processes to kill.

    See also https://github.com/hakavlad/prelockd

    • man8alexd a day ago

      Neither executables nor shared libraries are going to be evicted if they are in active use and have the "accessed" bit set in their page tables. This code has been present in the kernel mm/vmscan.c at least since 2012.

      • shatsky 16 hours ago

        Will look into that again. If you're right about the unevictability of these pages, what is the mechanism that causes the sudden extreme degradation of performance when the system is almost out of memory because some app is gradually consuming it - going from a quite responsive system to a totally unresponsive one that can stay stuck thrashing the disk for ages until the OOM killer fires?

        • man8alexd 14 hours ago

          Once your active working set starts spilling into swap, the performance degradation goes exponential. The difference in latency between RAM and SSD is orders of magnitude. Assuming 10^3 difference, 0.1% memory excess causes 2x degradation, 1% - 10x degradation, 10% - 100x degradation.

  • tanelpoder a day ago

    This wasn't on Linux, but on one of the old-school commercial Unixes - a customer had memory leaks in some of their daemon processes. They couldn't fix them for some reason.

    So they invested in additional swap space, let the processes slowly grow, swap out leaked stuff and restart them all over the weekend...

    • rwmj a day ago

      I wrote a chat server back in the 2000s which would gradually use more and more memory over a period of months. After extensive debugging, I couldn't find any memory leak and concluded the problem was likely inside glibc or caused by memory fragmentation. Solution was to have a cron job that ran every 3 months and rebooted the machine.

  • userbinator a day ago

    Under no/low memory contention

    on some workloads this may represent a non-trivial drop in performance due to stale, anonymous pages taking space away from more important use

    WTF?

    • creshal a day ago

      Welcome to the wonderful world of Java programs. When your tomcat abomination pulls in 500 dependencies for one method call each, and 80% of the methods aren't even called in regular use except to perform dependency injection mumbo jumbo during the 90 seconds your tomcat needs to start up, you easily end up with 70% of your application's anon pages being completely useless, but if you can't banish them to swap, they'll prevent the code on the hot path from having any memory left over for file caching.

      So even if you never run into OOM situations, adding a couple gigabytes of swap lets you free up that many gigabytes of RAM for file caching, and suddenly your application is on average 5x faster - but takes 3 seconds longer to service that one obscure API call that needs to dig all those pages back up. YMMV if you prefer consistently poor performance over inconsistent but usually much better performance.

      • marginalia_nu a day ago

        Java's performance for cold code is bad, period. This doesn't really have to do with code being paged out (that very rarely happens) but with the JIT compiler not having warmed up the appropriate execution paths, so the code runs in interpreted mode - often made worse by static object initialization happening when the first code that needs that particular class runs, and if you're unlucky with how the system was designed, that may introduce cascading class initialization.

        Though any halfway competent Java developer following modern best practices will know to build systems that don't have these characteristics.

        • creshal a day ago

          > Though any halfway competent Java developer following modern best practices will know to build systems that don't have these characteristics.

          I'll let you know if I ever meet any. Until then, another terabyte of RAM for tomcat.

          • marginalia_nu 20 hours ago

            Java generally performs much better when it isn't given huge amounts of memory to work with.

  • a day ago
    [deleted]
  • goopypoop a day ago

    swap.avi is its own damning defence

  • shawnz 21 hours ago

    It's crazy to me that even Fedora disables swap to disk by default now. It really speaks to how broadly misunderstood swap is

  • vbezhenar a day ago

    I've always created swap of 1.5x - 4x the RAM size on every Linux computer I've had to manage and never had any issues with it. That's a rule I learned many years ago, follow to this day, and will keep following.

    Worst thing: I've left 5% of my SSD unused, which will actually be used for garbage collection and other stuff. That's OK.

    What I don't understand is why modern Linux is so shy of touching swap. With old kernels, Linux happily pushed unused pages to a swap, so even if you don't eat memory, your swap will be filled with tens or hundreds MB of memory and that's a great thing. Modern kernel just keeps swap usage at 0, until memory is exhausted.

    • bawolff a day ago

      > What I don't understand is why modern Linux is so shy of touching swap. With old kernels, Linux happily pushed unused pages to a swap, so even if you don't eat memory, your swap will be filled with tens or hundreds MB of memory and that's a great thing. Modern kernel just keeps swap usage at 0, until memory is exhausted.

      The article has the answer.

    • creshal a day ago

      > I've always created swap of 1.5x - 4x RAM size on every Linux computer I've had to manage and never had any issues with it.

      That's a couple terabyte of swap on servers these days, and even on laptops I wouldn't want to deal with 300-ish GB swap.

    • kevin_thibedeau a day ago

      I haven't used swap for 15 years. You have to be judicious about heavy app usage with only 16GiB. With 32GiB, I've never triggered OOM.

      • aidenn0 9 hours ago

        If you never trigger OOM, then you can basically only benefit from enabling a modest amount of swap.