User here, who also acts as a Level 2 support for storage.
The article contains some solid logic plus an assumption that I disagree with.
Solid logic: you should prefer zswap if you have a device that can be used for swap.
Solid logic: zram + other swap = bad due to LRU inversion (zram becomes a dead weight in memory).
Advice that matches my observations: zram works best when paired with a user-space OOM killer.
Bold assumption: everybody who has an SSD has a device that can be used for swap.
The assumption is simply false, and not due to the "SSD wear" argument. Many consumer SSDs, especially DRAMless ones (e.g., Apacer AS350 1TB, but also seen on Crucial SSDs), under synchronous writes, will regularly produce latency spikes of 10 seconds or more, due to the way they need to manage their cells. This is much worse than any HDD. If a DRAMless consumer SSD is all that you have, better use zram.
Thank you for reading, and for your critique! What you're describing is definitely a real problem, but I'd push back slightly and suggest the outcome is usually the inverse of what you might expect.
One of the counterintuitive things here is that _having_ disk swap can actually _decrease_ disk I/O. In fact, on some of our storage tiers this effect is essential to how we operate. Now, that sounds like patent nonsense, but hear me out :-)
With a zram-only setup, once zram is full, anonymous pages have nowhere left to go. The kernel can't evict them to disk because there is no disk swap, so when it needs to free memory it has no choice but to reclaim file cache instead. If you don't let the kernel choose which page is colder across both anonymous and file-backed memory, and instead force it to reclaim only file caches, it will inevitably evict file caches that needed to stay resident to avoid disk activity, and those reads and writes hit the same slow DRAMless SSD you were trying to protect.
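A quick way to check whether a zram-only box has entered this regime is to compare the anon vs. file reclaim counters in /proc/vmstat: when zram is full, file-cache scans race ahead of anonymous scans. A small Python sketch (the helper below is illustrative, not a standard tool; the counter names assume a reasonably modern Linux kernel):

```python
# Sketch: pull the anon/file reclaim counters out of /proc/vmstat-style
# text ("name value" per line). On a zram-only system whose zram device
# is full, pgscan_file typically grows much faster than pgscan_anon,
# since file cache is the only thing left for the kernel to reclaim.

WATCHED = ("pgscan_anon", "pgscan_file", "pgsteal_anon", "pgsteal_file")

def reclaim_counters(vmstat_text: str) -> dict:
    """Return just the watched reclaim counters as a name -> int dict."""
    counters = {}
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name in WATCHED:
            counters[name] = int(value)
    return counters

# Usage on a live Linux system:
#   with open("/proc/vmstat") as f:
#       print(reclaim_counters(f.read()))
# Sample the counters twice a few seconds apart and compare the deltas.
```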
In the article I mentioned that in some cases enabling zswap reduced disk writes by up to 25% compared to having no swap at all. Of course, the exact numbers will vary, but the direction holds across most workloads that accumulate cold anonymous pages over time, and we've seen it hold in constrained environments like BMCs, servers, desktops, VR headsets, etc.
So, counter-intuitively, in your case an appropriately sized swap device with zswap may well reduce disk I/O rather than increase it. If that's not what you see, that's exactly the kind of real-world data that helps us improve things on the mm side, and we'd love to hear about it :-)
1. Thanks for partially (in paragraph 4 but not paragraph 5) preempting the obvious objection. Distinguishing between disk reads and writes is very important for consumer SSDs, and you quoted exactly the right metric in paragraph 4: reduction of writes, almost regardless of the total I/O. Reads without writes are tolerable. Writes stall everything badly.
2. The comparison in paragraph 4 is between no-swap and zswap, and the results are plausible. But the relevant comparison here is a three-way one, between no-swap, zram, and zswap.
3. It's important to tune earlyoom "properly" when using zram as the only swap. Setting the "-m" argument too low causes earlyoom to miss obvious overloads that thrash the disk through the page cache and memory-mapped files. On the other hand, I could not find the right balance between unexpected OOM kills and missed brownouts, simply because the usage levels of RAM and zram-backed swap are the only signals earlyoom has available for a decision. Perhaps systemd-oomd will fare better. The article does mention that the userspace OOM killer needs an uncomfortable degree of tuning.
I have already tried zswap with a swap file on a bad SSD, but, admittedly, not together with earlyoom. With an SSD that cannot sustain even 10 MB/s of synchronous writes, it browns out, while zram + earlyoom can be tuned not to brown out (at the expense of OOM kills on a subjectively perfectly well-performing system). I will try backing-store-less zswap when it's ready.
And I agree that, on an enterprise SSD like Micron 7450 PRO, zswap is the way to go - and I doubt that Meta uses consumer SSDs.
It's very rare for disk reads to hang your UI (you would need to be running blocking operations in the UI thread).
But swap with high latency will occasionally hang the interface, and with it any means of freeing memory manually.
Counterargument: you can mostly disable zswap writeback, so it will only use the swap partition when hibernating[1].
[1]https://wiki.archlinux.org/title/Power_management/Suspend_an...
> Many consumer SSDs ... under synchronous writes, will regularly produce latency spikes of 10 seconds or more
Surely "regularly" is a significant overstatement. Most people have practically never seen this failure mode. And if it only occurs under a heavy write workload, that's not something that's supposed to happen purely as a result of swapping.
> Many consumer SSDs, especially DRAMless ones (e.g., Apacer AS350 1TB, but also seen on Crucial SSDs), under synchronous writes, will regularly produce latency spikes of 10 seconds or more, due to the way they need to manage their cells.
Is there an experiment you'd recommend to reliably show this behavior on such an SSD (or ideally to become confident a given SSD is unaffected)? Is it as simple as writing flat-out for, say, 10 minutes, with O_DIRECT so you can easily measure the latency of individual writes? Do you need a certain level of concurrency? Or a mixed read/write load? Repeated writes to a small region vs. writes to a large region (or maybe, given remapping, that doesn't matter)? Is this a one-liner with `fio`? Does it depend on longer-term state, such as how much of the SSD's capacity has been written and not TRIMed?
Also, what could one do in advance to know if they're about to purchase such an SSD? You mentioned one affected model. You mentioned DRAMless too, but do consumer SSD spec sheets generally say how much DRAM (if any) the devices have? Maybe some known unaffected consumer models? It'd be a shame to jump to enterprise prices to avoid this if that's not necessary.
I have a few consumer SSDs around that I've never really pushed; it'd be interesting to see if they have this behavior.
> Also, what could one do in advance to know if they're about to purchase such an SSD? You mentioned one affected model.
Typically QLC is significantly worse at this than TLC, since the "real" write speed is very low. In my experience any QLC is very susceptible to long pauses in write heavy scenarios.
It does depend on the controller, though. As an example, check out the sustained write benchmark graph here[1]: you can see that a number of models start an oscillating pattern after exhausting the pseudo-SLC buffer, indicating the controller is taking a time-out to rearrange things in the background. Others do it too, but more irregularly.
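For the sibling question about reproducing this: a minimal sustained-sync-write test can be written as a short fio job along these lines (a sketch; the file path, size, and runtime are assumptions to adjust for your drive, and note this writes heavily to the target):

```ini
; spike-hunt.fio -- hunt for multi-second latency outliers under sync writes
[sync-write-spikes]
filename=/mnt/target/testfile
size=16G             ; large enough to exhaust the pseudo-SLC buffer
rw=randwrite
bs=4k
ioengine=sync
direct=1             ; bypass the page cache
fsync=1              ; fsync after every write, like swap under pressure
time_based=1
runtime=600
write_lat_log=spike-hunt   ; per-I/O latency log, to find the outliers
```

Run with `fio spike-hunt.fio` and look at the max and high-percentile completion latencies ("clat") in the summary, or plot the latency log; the oscillation mentioned above tends to appear only once the written volume exceeds the drive's SLC cache, so a too-small `size` can hide the problem.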
> You mentioned DRAMless too, but do consumer SSD spec sheets generally say how much DRAM (if any) the devices have?
I rely on TechPowerUp; as an example, compare the Samsung 970 Evo[2] to the 990 Evo[3] under the DRAM cache section.
[1]: https://www.tomshardware.com/pc-components/ssds/samsung-990-... (second image in IOMeter graph)
[2]: https://www.techpowerup.com/ssd-specs/samsung-970-evo-1-tb.d...
[3]: https://www.techpowerup.com/ssd-specs/samsung-990-evo-plus-1...
At this point just throw your shitty SSD in the garbage bin^W^W USB box and buy a proper one. OOMing will always cost you more.
And if you still need to use a shitty SSD, then increase your swap size dramatically, giving the drive breathing room and implicitly overprovisioning it.
Would be nice if zswap could be configured to have no backing cache so it could completely replace zram. Having two slightly different systems is weird.
There's not really any difference between swap on disk being full and swap in RAM being full: either way, something needs to get OOM-killed.
Simplifying the configuration would probably also make it easier to enable by default in most distros. It's kind of backwards that the most common Linux distros other than ChromeOS are behind macOS and Windows in this regard.
This is actually something we're actively working on! Nhat Pham is working on a patch series called "virtual swap space" (https://lwn.net/Articles/1059201/) which decouples zswap from its backing store entirely. The goal is to consolidate on a single implementation with proper MM integration rather than maintaining two systems with very different failure modes. It should be out in the next few months, hopefully.
> Would be nice if zswap could be configured to have no backing cache
You can technically get this behavior today using /dev/ram0 as a swap device, but it's very awkward and almost certainly a bad idea.
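For the curious, the sketch looks roughly like this (requires root and the brd module; the size is arbitrary, and unlike zram the pages stored there are uncompressed, which is part of why it's a bad idea):

```
# Create a 2 GiB RAM-backed block device (rd_size is in KiB)
modprobe brd rd_nr=1 rd_size=2097152
mkswap /dev/ram0
swapon --priority 100 /dev/ram0   # rank it above any disk swap
```

You get swap that lives in RAM, but without zram's compression, writeback integration, or memory accounting niceties.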
Very much agreed. I feel like distros still regularly get this wrong (as evidence, Ubuntu, PopOS and Fedora all have fairly different swap configs from each other).
I wrote about this recently too:
https://www.theregister.com/2026/03/13/zram_vs_zswap/
I prefer zswap to zram and as I linked at the end of the piece, it's not just me:
https://linuxblog.io/zswap-better-than-zram/
Maybe I am overthinking but I am wondering if this piece about myths is in any way a response to my article?
With zram, I can just use zram-generator[0] and it does everything for me and I don't even need to set anything up, other than installing the systemd generator, which on some distros, it's installed by default. Is there anything equivalent for zswap? Otherwise, I'm not surprised most people are just using zram, even if sub-optimal.
[0]: https://crates.io/crates/zram-generator
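For reference, the config really is tiny (a sketch; the keys follow zram-generator's documented format, the values are illustrative):

```ini
# /etc/systemd/zram-generator.conf
[zram0]
# half of RAM, capped at 4 GiB (value is in MiB)
zram-size = min(ram / 2, 4096)
compression-algorithm = zstd
```

With the generator installed, the device is created and swapped on automatically at boot; no unit files to write by hand.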
Kernel arguments are the primary method: https://wiki.archlinux.org/title/Zswap#Using_kernel_boot_par...
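For example (a sketch; these are the standard zswap module parameters, but note that the chosen compressor, e.g. zstd, has to be built in or available in the initramfs for the boot-time setting to take effect):

```
# Kernel command line (e.g. in GRUB_CMDLINE_LINUX):
#   zswap.enabled=1 zswap.compressor=zstd zswap.max_pool_percent=20

# The same knobs can usually be flipped at runtime via sysfs:
echo 1    | sudo tee /sys/module/zswap/parameters/enabled
echo zstd | sudo tee /sys/module/zswap/parameters/compressor
echo 20   | sudo tee /sys/module/zswap/parameters/max_pool_percent
```

Current values can be read back from the same `/sys/module/zswap/parameters/` directory.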
Snag: I had issues getting it to use zstd at boot. Not sure if it's a bug or some peculiarity of Debian. I ended up compiling my own kernel for other reasons and was finally able to get zstd by default; otherwise I'd have had to add it to a startup script.
enabling != configuring. Are you saying this is all that's necessary, assuming a swap device already exists? That should be made clearer.
Edit: To be extra clear. When I was researching this, I ended up going with zram only because:
* It is the default for Fedora.
* zramctl gives me live statistics of used and compressed size.
* The zswap doc didn't help my confusion on how backing devices work (I guess they're any swapon'd device?)
From memory:
And then you have to create the systemd service unit file, which is the longest step. zramctl works as-is.
This should've been a bash script...
That's a banger article; I don't even like low-level stuff and yet I read the whole thing. Hopefully I'll have the opportunity to use some of it if I ever get around to switching my personal notebook back to Linux.
Is this advice also applicable to desktop installations?
The better distros have it (zram) enabled by default for desktops (I think PopOS and Fedora do). In my personal experience, every desktop Linux install should use memory compression (unless you have an absurd amount of RAM) because it helps so much, especially with everything related to browser and/or Electron usage!
Windows and macOS have had it enabled by default for many years (even if it works a little differently there).
I did an Archinstall setup this weekend, and that also suggested zram.
I used to put swap on zram when my laptop had one of those early SSDs that people would tell you not to put swap on for fear of wearing them out.
Setup was tedious
Can you make a follow-up here on the best way to set up swap to support full-disk encryption + hibernation?
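One common recipe (a sketch; device names and UUIDs are placeholders) is a LUKS-encrypted swap partition unlocked with a persistent key plus a `resume=` kernel parameter. A per-boot random key, which is otherwise popular for encrypted swap, would destroy the hibernation image:

```
# /etc/crypttab -- unlock swap with the normal LUKS passphrase/keyfile,
# NOT /dev/urandom (a fresh random key each boot makes resume impossible):
#   cryptswap  UUID=<swap-partition-uuid>  none  luks

# /etc/fstab:
#   /dev/mapper/cryptswap  none  swap  defaults  0 0

# Kernel command line, so the initramfs knows where to resume from:
#   resume=/dev/mapper/cryptswap
```

The details vary by distro (initramfs hooks in particular), so treat this only as the shape of the solution.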
So much polemic and no numbers? If it is a performance issue, show me the numbers!
There are quite a few numbers in the article, although of course I'm happy to hear any more you'd like presented.
* A counterintuitive 25% reduction in disk writes at Instagram after enabling zswap
* Eventual ~5:1 compression ratio on Django workloads with zswap + zstd
* 20-30 minute OOM stalls at Cloudflare with the OOM killer never once firing under zram
The LRU inversion argument follows plainly from the code presented and is a logical consequence of how swap priority and zram's block-device architecture interact; I'm not sure numbers would add much there.
> The LRU inversion argument follows plainly from the code presented and is a logical consequence of how swap priority and zram's block-device architecture interact; I'm not sure numbers would add much there.
Yes, it is all very plausible, but run times for a given workload (on a given, documented system) known to cause memory pressure to the point of swapping, measured under vanilla Linux (default or an appropriately chosen swappiness), zram, and zswap, would be appreciated.
https://linuxblog.io/zswap-better-than-zram/ at least qualifies that zswap performs better when using a fast NVMe device as swap device and zram remains superior for devices with slow or no swap device.
Thank goodness Kubernetes got support for swap; zswap has been a great boon for one of my workloads.