Looking at the htop screenshot, I notice the lack of swap. You may want to enable earlyoom, so your whole server doesn't go down when a service goes bananas. The Linux Kernel OOM killer is often a bit too late to trigger.
You can also enable zram to compress ram, so you can over-provision like the pros'. A lot of long-running software leaks memory that compresses pretty well.
Even better than earlyoom is systemd-oomd[0] or oomd[1].
systemd-oomd and oomd use the kernel's PSI[2] information which makes them more efficient and responsive, while earlyoom is just polling.
earlyoom keeps getting suggested, even though we have PSI now, just because people are used to using it and recommending it from back before the kernel had cgroups v2.
> systemd-oomd periodically polls PSI statistics for the system and those cgroups to decide when to take action.
It's unclear if the docs for systemd-oomd are incorrect or misleading; I do see from the kernel.org link that the recommended usage pattern is to use the `poll` system call, which in this context would mean "not polling", if I understand correctly.
Unrelated to the topic, it seems awfully unintuitive to name a function ‘poll’ if the result is ‘not polling.’ I’m guessing there’s some history and maybe backwards-compatible rewrites?
Another option would be to have more memory that required over-engineer and to adjust the oom score per app, adding early kill weight to non critical apps and negative weight to important apps. oom_score_adj is already set to -1000 by OpenSSH for example.
Another useful thing to do is effecively disable over-commit on all staging and production servers (0 ratio instead of 2 memory to fully disable as these do different things, memory 0 still uses formula)
vm.overcommit_memory = 0
vm.overcommit_ratio = 0
Also use a formula to set min_free and reserved memory using a formula from Redhat that I do not have handy based on installed memory. min_free can vary from 512KB to 16GB depending on installed memory.
At least that worked for me in about 50,000 physical servers for over a decade that were not permitted to have swap and installed memory varied from 144GB to 4TB of RAM. OOM would only occur when the people configuring and pushing code would massively over-commit and not account for memory required by the kernel. Not following best practices defined by Java and thats a much longer story.
Another option is to limit memory per application in cgroups but that requires more explaining than I am putting in an HN comment.
Another useful thing is to never OOM kill in the first place on servers that are only doing things in memory and need not commit anything to disk. So don't do this on a disked database. This is for ephemeral nodes that should self heal. Wait 60 seconds so drac/ilo can capture crash message and then earth shattering kaboom...
For a funny side note, those options can also be used as a holy hand grenade to intentionally unsafely reboot NFS diskless farms when failing over to entirely different NFS server clusters. setting panic to 15 mins, triggering OOM panic by setting min_free to 16TB at the command line via Ansible not in sysctl.conf, swapping clusters, arp storm and reconverge.
Yeah, no way. As soon as you hit swap, _most_ apps are going to have a bad, bad time. This is well known, so much so that all EC2 instances in AWS disable it by default. Sure, they want to sell you more RAM, but it's also just true that swap doesn't work for today's expectations.
Maybe back in the 90s, it was okay to wait 2-3 seconds for a button click, but today we just assume the thing is dead and reboot.
This is a wrong belief because a) SSDs make swap almost invisible, so you can have that escape ramp if something goes wrong b) SWAP space is not solely an escape ramp which RAM overflows into anymore.
In the age of microservices and cattle servers, reboot/reinstall might be cheap, but in the long run it is not. A long running server, albeit being cattle, is always a better solution because esp. with some excess RAM, the server "warms up" with all hot data cached and will be a low latency unit in your fleet, given you pay the required attention to your software development and service configuration.
Secondly, Kernel swaps out unused pages to SWAP, relieving pressure from RAM. So, SWAP is often used even if you fill 1% of your RAM. This allows for more hot data to be cached, allowing better resource utilization and performance in the long run.
So, eff it, we ball is never a good system administration strategy. Even if everything is ephemeral and can be rebooted in three seconds.
Sure, some things like Kubernetes forces "no SWAP, period" policies because it kills pods when pressure exceeds some value, but for more traditional setups, it's still valuable.
My work Ubuntu laptop has 40GB of RAM and and a very fast Nvme SSD, if it gets under memory pressure it slows to a crawl and is for all practical purposes frozen while swapping wildly for 15-20 minutes.
So no, my experience with swap isn't that it's invisible with SSD.
I don't know your exact situation, but be sure you're not mixing up "thrashing" with "using swap". Obviously, thrashing implies swap usage, but not the other way around.
I've experimented with no-swap and find the same thing happens. I think the issue is that linux can also evict executable pages (since it can just reload them from disk).
I've had good experience with linux's multi-generation LRU feature, specifically the /sys/kernel/mm/lru_gen/min_ttl_ms feature that triggers OOM-killer when the "working set of the last N ms doesn't fit in memory".
It's seldom invisible, but in my experience how visible it is depends on the size/modularity/performance/etc of what's being swapped and the underlying hardware.
On my 8gb M1 Mac, I can have a ton of tabs open and it'll swap with minimal slowdown. On the other hand, running a 4k external display and a small (4gb) llm is at best horrible and will sometimes require a hard reset.
I've seen similar with different combinations of software/hardware.
Linux being absolute dogshit if it’s under any sort of memory pressure is the reason, not swap or no swap. Modern systems would be much better off tweaking dirty bytes/ratios, but fundamentally the kernel needs to be dragged into the XXI century sometime.
This is not really true of most SSDs. When Linux is really thrashing the swap it’ll be essentially unusable unless the disk is _really_ fast. Fast enough SSDs are available though. Note that when it’s really thrashing the swap the workload is 100% random 4KB reads and writes in equal quantities. Many SSDs have high read speeds and high write speeds but have much worse performance under mixed workloads.
I once used an Intel Optane drive as swap for a job that needed hundreds of gigabytes of ram (in a computer that maxed out at 64 gigs). The latency was so low that even while the task was running the machine was almost perfectly usable; in fact I could almost watch videos without dropping frames at the same time.
How long is long running? You should be getting the warm caches after at most a few hours.
> Secondly, Kernel swaps out unused pages to SWAP, relieving pressure from RAM. So, SWAP is often used even if you fill 1% of your RAM. This allows for more hot data to be cached, allowing better resource utilization and performance in the long run.
Yes, and you can observe that even in your desktop at home (if you are running something like Linux).
> So, eff it, we ball is never a good system administration strategy. Even if everything is ephemeral and can be rebooted in three seconds.
I wouldn't be so quick. Google ran their servers without swap for ages. (I don't know if they still do it.) They decided that taking the slight inefficiency in memory usage, because they have to keep the 'leaked' pages around in actual RAM, is worth it to get predictability in performance.
For what it's worth, I add generous swap to all my personal machines, mostly so that the kernel can offload cold / leaked pages and keep more disk content cached in RAM. (As a secondary reason: I also like to have a generous amount of /tmp space that's backed by swap, if necessary.)
With swap files, instead of swap partitions, it's fairly easy to shrink and grow your swap space, depending on what your needs for free space on your disk are.
It doesn't. SSDs came a long way but so did memory dies and buses, and with that the way programs work also changed as more and more they are able to fit their stacks and heaps on memory more often than not.
I have had a problem with shellcheck that for some reason eats up all my ram when I open I believe .zshrc and trust me, it's not invisible. The system crawls to a halt.
If we're talking about SATA SSDs which top at 600MBps, then yes, an aggressive application can make itself known. However, if you have a modern NVMe, esp. a 4x4 one like Samsung 9x0 series or if you're using a Mac, I bet you'll notice the problem much later, if ever. Remember the SSD trashing problem on M1 Macs? People never noticed that system used SWAP that heavily and trashed the SSD on board.
Then, if you're using a server with a couple of SAS or NVMe SSDs, you'll not notice the problem again, esp. if these are backed by RAID (even md counts).
Now that you say, I have a new Lenovo yoga with those SoC ram with crazy parallel channel config (16gb spread across 8 dies of 2gb). It's noticeably faster than my Acer nitro with dual channel 16gb ddr5. I'll check that, but I'd say it's not what the average home user (and even server I'd risk saying) would have.
> it's not invisible. The system crawls to a halt.
I’m gonna guess you’re not old enough to remember computers with memory measured in MB and IDE hard disks? Swapping was absolutely brutal back then. I agree with the other poster, swap hitting an SSD is a barely noticeable in comparison.
I think I've not made myself as clear as I could. Swap is important for efficient system performance way before you hit OOM on main memory. It's not, however, going to save system responsiveness in case of OOM. This is what I mean.
The trade-off depends on how your system is set up.
Eg Google used to (and perhaps still does?) run their servers without swap, because they had built fault tolerance in their fleet anyway, so were happier to deal with the occasional crash than with the occasional slowdown.
For your desktop at home, you'd probably rather deal with a slowdown that gives you a chance to close a few programs, then just crashing your system. After all, if you are standing physically in front of your computer, you can always just manually hit the reset button, if the slowdown is too agonising.
Swap delays the 'fundamental issue', if you have a leak that keeps growing.
If your problem doesn't keep growing, and you just have more data that programs want to keep in memory than you have RAM, but the actual working set of what's accessed frequently still fits in RAM, then swap perfectly solves this.
Think lots of programs open in the background, or lots of open tabs in your browser, but you only ever rapidly switch between at most a handful at a time. Or you are starting a memory hungry game and you don't want to be bothered with closing all the existing memory hungry programs that idle in the background while you play.
I run a chat server on a small instance; when someone uploads a large image to the chat, the 'thumbnail the image' process would cause the OOM-killer to take out random other processes.
Adding a couple of gb of swap means the image resizing is _slow_, but completes without causing issues.
The problem is freezing the system for hours or more to delay the issue is not worth it. I'd rather a program get killed immediately than having my system locked up for hours before a program gets killed.
The fundamental issue here is what the linux fanboys literally think what killing a working process and most of the time the process[0] is a good solution for not solving the fundamental problem of memory allocation in the Linux kernel.
Availability of swap allows you to avoid malloc failure in a rare case your processes request more memory than physically (or 'physically', heh) present in the system. But in the mind of so called linux administrators even if a one byte of the swap would be used then the system would immediately crawl to a stop and never would recover itself. Why it always should be the worst and the most idiotic scenario instead of a sane 'needed 100MB more, got it - while some shit in the memory which wasn't accessed since the boot was swapped out - did the things it needed to do and freed that 100MB' is never explained by them.
[0] imagine a dedicated machine for *SQL server - which process would have the most memory usage on that system?
Also: When those processes that haven't been active since boot (and which may never be active again) are swapped out, more system RAM can become available for disk caching to help performance of things that are actively being used.
And that's... that's actually putting RAM to good use, instead of letting it sit idle. That's good.
(As many are always quick to point out: Swap can't fix a perpetual memory leak. But I don't think I've ever seen anyone claim that it could.)
What if I care more about the performance of things that aren't being used right now than the things that are? I'm sick of switching to my DAW and having to listen to my drive thrash when I try to play a (say) sampler I had loaded.
A long running Linux system uses 100% of its RAM. Every byte unused for applications will be used as a disk cache, given you read more data than your total RAM amount.
This cache is evictable, but it'll be there eventually.
Linux used to don't touch unused pages in the RAM in the older days if your RAM was not under pressure, but now it swaps out pages unused for a long time. This allows more cache space in RAM.
> how does caching to swap help?
I think I failed to convey what I tried to say. Let me retry:
Kernel doesn't cache to SSD. It swaps out unused (not accessed) but unevictable pages to SWAP, assuming that these pages will stay stale for a very long time, allowing more RAM to be used as cache.
When I look to my desktop system, in 12 days, Kernel moved 2592MB of my RAM to SWAP despite having ~20GB of free space. ~15GB of this free space is used as disk cache.
So, to have 2.5GB more disk cache, Kernel moved 2592 MB of non-accessed pages to SWAP.
Yes, and if I am writing an API service, for example, I don’t want to suddenly add latency because I hit pages that have been swapped out. I want guarantees about my API call latency variance, at least when the server isn’t overloaded.
I DON’T WANT THE KERNEL PRIORITIZING CACHE OVER NRU PAGES.
If you’re writing services in anything higher level than C you’re leaking something somewhere that you probably have no idea exists and the runtime won’t ever touch again.
You better not write your API in Python, or any language/library that uses amortised algorithms in the standard (like Rust and C++ do). And let's not mention garbage collection.
That’s a fair question. A page is the smallest allocatable unit of RAM, from the OS/kernel perspective. The size is set by the CPU, traditionally 4kB, but these days 8kB-4MB are also common.
When you call malloc(), it requests a big chunk of memory from the OS, in units of pages. It then uses an allocator to divide it up into smaller, variable length chunks to form each malloc() request.
You may have heard of “heap” memory vs “stack” memory. The stack of course is the execution/call stack, and heap is called that because the “heap allocator” is the algorithm originally used for keeping track of unused chunks of these pages.
(This is beginner CS stuff so sorry if it came off as patronizing—I assume you’re either not a coder or self-taught, which is fine.)
If you are interested in human consumption, there's "free --human" which decided on useful units by itself. The "--human" switch is also available for "du --human" or "df --human" or "ls -l --human". It's often abbreviated as "-h", but not always, since that also often stands for "--help".
The OS uses almost all the ram in your system (it just doesn't tell you because then users complain that their OS is too ram heavy). The primary thing it uses it for is caching as much of your storage system as possible. (e.g. all of the filesystem metadata and most of the files anyone on the system has touched recently). As such, if you have RAM that hasn't been touched recently, the OS can page it out and make the rest of the system faster.
At the cost of tanking performance for the less frequently used code path. Sometimes it is more important to optimize in ways that minimize worst case performance rather than a marginal improvement to typical work loads. This is often the case for distributed systems, e.g. SaaS backends.
From my understanding, the comment I'm replying to uses EC2 example to portray that swapping is wrong in any and all circumstances, and I just replied with my experience with my system administrator hat.
I'm not an AWS guy. I can see and touch the servers I manage, and in my experience, SWAP works, and works well.
Just for context EC2 typically uses network storage that, for obvious reasons, often has fairly rubbish latency and performance characteristics. Swap works fine if you have local storage, though obviously it burns through your SSD/NVME drive faster and can other side effects on it's performance (usually not particularly noticeable).
This is not about belief, but lived experience. Setting up swap to me is a choice between a unresponsive system (with swap) or a responsive system with a few oom kills or downed system.
Swap also works really well for desktop workloads. (I guess that's why Apple uses it so heavily on their Macbooks etc.)
With a good amount of swap, you don't have to worry about closing programs. As long as your 'working set' stays smaller than your RAM, your computer stays fast and responsive, regardless of what's open and idling in the background.
It doesn’t happen often, and I have a multi user system with unpredictable workloads. It’s also not about swap filling up, but giving the pretense the system is operable in a memory exhausted state which means oom killer doesn’t run, but the system is unresponsive and never recovers.
Without swap oom killer runs and things become responsive.
"as soon as you hit swap" is a bad way of looking at things. Looking around at some servers I run, most of them have .5-2GB of swap used despite a bunch of gigabytes of free memory. That data is never or almost never going to be touched, and keeping it in memory would be a waste. On a smaller server that can be a significant waste.
Swap is good to have. The value is limited but real.
Also not having swap doesn't prevent thrashing, it just means that as memory gets completely full you start dropping and re-reading executable code over and over. The solution is the same in both cases, kill programs before performance falls off a cliff. But swap gives you more room before you reach the cliff.
Many won't enable swap. For some swap wouldn't help anyways, but others it could help soak up spikes. The latter in some cases will upgrade to a larger instance without even evaluating if swap could help, generating AWS more money.
Either way it's far-fetched to derive intention from the fact.
>It's a bit wasteful to provision your computers so that all the cold data lives in expensive RAM.
But that's a job applications are already doing. They put data that's being actively worked on in RAM they leave all the rest in storage. Why would you need swap once you can already fit the entire working set in RAM?
Because then you have more active working memory as infrequently used pages are moved to compressed swap and can be used for more page cache or just normal resident memory.
Swap ram by itself would be stupid but no one doing this isn’t also turning on compression.
You mean to tell me most applications you've ever used read the entire file system, loading every file into memory, and rely on the OS to move the unused stuff to swap?
How programs use ram also changed from the 90s. Back then they were written targeting machines that they knew would have a hard time fitting all their data in memory, so hitting swap wouldn't hurt perceived performance too drastically since many operations were already optimized to balance data load between memory and disk.
Nowadays when a program hits swap it's not going to fallback to a different memory usage profile that prioritises disk access. It's going to use swap as if it were actual ram, so you get to see the program choking the entire system.
If your GC is a moving collector, then absolutely this is something to watch out for.
There are, however, a number of runtimes that will leave memory in place. They are effectively just calling `malloc` for the objects and `free` when the GC algorithm detects an object is dead.
Go, the CLR, Ruby, Python, Swift, and I think node(?) all fit in this category. The JVM has a moving collector.
Python’s not a mover but the cycle breaker will walk through every object in the VM.
Also since the refcounts are inline, adding a reference to a cold object will update that object. IIRC Swift has the latter issue as well (unless the heap object’s RC was moved to the side table).
MemBalancer is a relatively new analysis paper that argues having swap allows maximum performance by allowing small excesses, that avoids needing to over-provision ram instead. The kind of gc does not matter since data spends very little time in that state and on the flip side, most of the time the application has twice has access to twice as much memory to use
A moving collector has to move to somewhere and, generally by it's nature, it's constantly moving data all across the heap. That's what makes it end up touching a lot more memory while also requiring more memory. On minor collections I'll move memory between 2 different locations and on major collections it'll end up moving the entire old gen.
It's that "touching" of all the pages controlled by the GC that ultimately wrecks swap performance. But also the fact that moving collector like to hold onto memory as downsizing is pretty hard to do efficiently.
Non-moving collectors are generally ultimately using C allocators which are fairly good at avoiding fragmentation. Not perfect and not as fast as a moving collector, but also fast enough for most use cases.
Java's G1 collector would be the worst example of this. It's constantly moving blocks of memory all over the place.
> It's that "touching" of all the pages controlled by the GC that ultimately wrecks swap performance. But also the fact that moving collector like to hold onto memory as downsizing is pretty hard to do efficiently.
The memory that's now not in use, but still held onto, can be swapped out.
Every garbage collector has to constantly sift through the entire reference graph of the running program to figure out what objects have become garbage. Generational GC's can trace through the oldest generations less often, but that's about it.
Tracing garbage collectors solve a single problem really really well - managing a complex, possibly cyclical reference graph, which is in fact inherent to some problems where GC is thus irreplaceable - and are just about terrible wrt. any other system-level or performance-related factor of evaluation.
> Every garbage collector has to constantly sift through the entire reference graph of the running program to figure out what objects have become garbage.
There's a lot of "it depends" here.
For example, an RC garbage collector (Like swift and python?) doesn't ever trace through the graph.
The reason I brought up moving collectors is by their nature, they take up a lot more heap space, at least 2x what they need. The advantage of the non-moving collectors is they are much more prompt at returning memory to the OS. The JVM in particular has issues here because it has pretty chunky objects.
> The reason I brought up moving collectors is by their nature, they take up a lot more heap space, at least 2x what they need.
If the implementer cares about memory use it won't. There are ways to compact objects that are a lot less memory-intensive than copying the whole graph from A to B and then deleting A.
This is really interesting and I've never really heard about this. What is going on with the kernel team then? Are they just going to keep swap as-is for backwards compatibility then everyone else just disables it? Or if this advice just for high performance clusters?
No. I use swap for my home machines. Most people should leave swap enabled. In fact I recommend the setup outlined in the kernel docs for tmpfs: https://docs.kernel.org/filesystems/tmpfs.html which is to have a big swap and use tmpfs for /tmp and /var/tmp.
As someone else said, swap is important not only in the case the system exhaust main memory, but it's used to efficiently use system memory before that (caching, offload page blocks to swap that aren't frequently used etc...)
My 2cents is that in a lot of cases swap is being used for unimportant stuff leave more RAM for your app. Do a "ps aux" and look at all the RAM used by weird stuff. Good news is those things will be swapped out.
Example on my personal VPS
$ free -m
total used free shared buff/cache available
Mem: 3923 1225 328 217 2369 2185
Swap: 1535 1335 200
The beauty of ZRAM is that on any modern-ish CPU it's surprisingly fast. We're talking 2-3 ms instead of 2-3 seconds ;)
I regularly use it on my Snapdragon 870 tablet (not exactly a top of the line CPU) to prevent OOM crashes (it's running an ancient kernel and the Android OOM killer basically crashes the whole thing) when running a load of tabs in Brave and a Linux environment (through Tmux) at the same time.
ZRAM won't save you if you do actually need to store and actively use more than the physical memory but if 60% of your physical memory is not actively used (think background tabs or servers that are running but not taking requests) it absolutely does wonders!
On most (web) app servers I happily leave it enabled to handle temporary spikes, memory leaks or applications that load a whole bunch of resources that they never ever use.
I'm also running it on my Kubernetes cluster. It allows me to set reasonable strict memory limits while still having the certainty that Pods can handle (short) spikes above my limit.
Is it possible you misread the comment you're replying to? They aren't recommending adding swap, they're recommending adjusting the memory tunables to make the OOM killer a bit more aggressive so that it starts killing things before the whole server goes to hell.
YMMV. Garbage-collected/pointer-chasing languages suffer more from swapping because they touch more of the heap all the time. AWS suffers more from swap because EBS is ridiculously slow and even their instance-attached NVMe is capped compared physical NVMe sticks.
Does HDD vs SSD matter at all these days? I can think of certain caching use-cases where swapping to an SSD might make sense, if the access patterns were "bursty" to certain keys in the cache
It's still extremely slow and can cause very unpredictable performance. I have swap setup with swappiness=1 on some boxes, but I wouldn't generally recommend it.
what an ignorant and clueless comment. Guess what? Todays disks are NVMe drives which are orders of magnitude faster than the 5400rpm HDDs of the 90s. Today's swap is 90s RAM.
in either case, what do you do? if you can't reach a box and it's otherwise safe to do so, you just reboot it. so is it just a matter of which situation occurs more often?
The thing is you can survive memory exhaustion if the oom killer can do its job, which it can't many times when there's swap. I guess the topmost response to this thread talks about an earlyoom tool that might alleivate this, but I've never used it, and I don't find swap helpful anyway so there's no need for me to go down this route.
It's not just 3 seconds for a button click, every time I've run out of RAM on a Linux system, everything locks up and it thrashes. It feels like 100x slowdown. I've had better experiences when my CPU was underclocked to 20% speed. I enable swap and install earlyoom. Let processes die, as long as I can move the mouse and operate a terminal.
Yup, this is a thing. It happens because file-backed program text and read-only data eventually get evicted from RAM (to make room for process memory) so every access to code and/or data beyond the current 4K page can potentially involve a swap-in from disk. It would be nice if we had ways of setting up the system so that pages of code or data that are truly critical for real-time responsiveness (including parts of the UI) could not get evicted from RAM at all (except perhaps to make room for the OOM reaper itself to do its job) - but this is quite hard to do in practice.
Because some portion of the RAM used by your daemons isn't actually being accessed, and using that RAM to store file cache is actually a better use than storing idle memory. The old rule about "as much swap as main memory" definitely doesn't hold any more, but a few GB to store unneeded wired memory to dedicate more room to file cache is still useful.
As a small example from a default Ubuntu installation, "unattended-upgrades" is holding 22MB of RSS, and will not impact system performance at all if it spends next week swapped out. Bigger examples can be found in monolithic services where you don't use some of the features but still have to wire them into RAM. You can page those inactive sections of the individual process into swap, and never notice.
Like a highway brake failure ramp, you have room for handling failures gentler. So services don't just get outright killed. If you monitor your swap usage, any usage of swap gives you early warning that your services require more memory already.
Gives you some time to upgrade, or tune services before it goes ka-boom.
If your memory usage is creeping up, the way you'll find out that you need more memory is by monitoring memory usage via the same mechanisms you'd hypothetically use to monitor your swap usage.
If your memory usage spikes suddenly, a nominal amount of swap isn't stopping anything from getting killed; you're at best buying yourself a few seconds, so unless you spend your time just staring at the server, it'll be dead anyways.
Some workloads may do better with zswap. Cache is compressed, and pages evicted to disk based swap on an LRU basis.
The case of swap thrashing sounds like a misbehaving program, which can maybe be tamed by oomd.
System responsiveness though needs a complete resource control regime in place, that preserves minimum resources for certain critical processes. This is done with cgroupsv2. By establishing minimum resources, the kernel will limit resources for other processes. Sure, they will suffer. That’s the idea.
Of course swap should be enabled. But oom killer has always allowed access to an otherwise unreachable system. The pause is there so you can impress your junior padawan who rushed to you in a hurry.
Depends on the algorithm (and how much CPU is in use); if you have a spare CPU, the faster algorithms can more-or-less keep up with your memory bandwidth, making the overhead negligible.
And of course the overhead is zero when you don't page-out to swap.
Swap to disk involves a relatively small pipe (usually 10x smaller than RAM). So instead of paying the cost to page out to disk immediately, you create compressed pages and store that in a dedicated RAM region for compressed swap.
This has a number of benefits: in practice more “active” space is freed up as unused pages are compressed and often compressible. Often times that can be freed application memory that is reserved within application space but in the free space of the allocator, especially if that allocator zeroes it those pages in the background, but even active application memory (eg if you have a browser a lot of the memory is probably duplicated many times across processes). So for a usually invisible cost you free up more system RAM. Additionally, the overhead of the swap is typically not much more than a memcpy even compressed which means that you get dedup and if you compressed erroneously (data still needed) paging it back in is relatively cheap.
It also plays really well with disk swap since the least frequently used pages of that compressed swap can be flushed to disk leaving more space in the compressed RAM region for additional pages. And since you’re flushing retrieving compressed pages from disk you’re reducing writes on an SSD (longevity) and reducing read/write volume (less overhead than naiive direct swap to disk).
Basically if you think of it as tiered memory, you’ve got registers, l1 cache, l2 cache, l3 cache, normal RAM, compressed swap RAM, disk swap - it’s an extra interim tier that makes the system more efficient.
> zram, formerly called compcache, is a Linux kernel module for creating a compressed block device in RAM, i.e. a RAM disk with on-the-fly disk compression. The block device created with zram can then be used for swap or as a general-purpose RAM disk
To clarify OP's represention of the tool, it compresses swap space not resident ram. Outside of niche use-cases, compressing swap has overall little utility.
Incorrect, with zram you swap ram to compressed ram.
It has the benefit of absorbing memory leaks (which for whatever reason compress really well) and compressing stale memory pages.
Under actual memory pressure performance will degrade. But in many circumstances where your powerful CPU is not fully utilized you can 2x or even 3x your effective RAM (you can opt for zstd compression). zram also enables you to make the trade-off of picking a more powerful CPU for the express purpose of multiplying your RAM if the workload is compatible with the idea.
PS: On laptops/workstations, zram will not interfere with an SSD swap partition if you need it for hibernation. Though it will almost never be used for anything else if you configure your zram to be 2x your system memory.
To enable a swap file in Linux, first create the swap file using a command like sudo dd if=/dev/zero of=/swapfile bs=1G count=1 for a 1GB file. Then, set it up with sudo mkswap /swapfile and activate it using sudo swapon /swapfile. To make it permanent, add /swapfile swap swap defaults 0 0 to your /etc/fstab file.
Works really well with no problems that I've seen. Really helps give a bit more of a buffer before applications get killed. Like others have said, with SSD the performance hit isn't too bad.
They both offer virtualized guests under a hypervisor host. EC2 does have more offload specialization hardware but for the most part they are functionally equivalent, unless I'm missing something...
Just saw Nate Berkopec who does a lot of rails performance stuff posting about the same idea yesterday saying Heroku is 25-50x price for performance which is so insane. They clearly have zero interest in competing on price.
It's a shame they don't just license all their software stack at a reasonable price with a similar model like Sidekiq and let you sort out actually decent hardware. It's insane to consider Heroku if anything has gotten more expensive and worse compared to a decade ago yet in comparison similar priced server hardware has gotten WAY better of a decade. $50 for a dyno with 1 GB of ram in 2025 is robbery. It's even worse considering running a standard rails app hasn't changed dramatically from a resources perspective and if anything has become more efficient. It's comical to consider how many developers are shipping apps on Heroku for hundreds of dollars a month on machines with worse performance/resources than the macbook they are developing it on.
It's the standard playback that damn near everything in society is going for though just jacking prices and targeting the wealthiest least price sensitive percentiles instead of making good products at fair prices for the masses.
Jacked up prices isn't what is happening here. There is a psychological effect that Heroku and other cloud vendors are (wittingly or unwittingly) the beneficiary of. Customer expectations are anchored in the price they pay when they start using the service, and without deliberate effort, those expectations change in _linear_ fashion. Humans think in linear terms, while actual compute hardware improvements are exponential.
Heroku's pricing has _remained the same_ for at least seven years, while hardware has improved exponentially. So when you look at their pricing and see a scam, what you're actually doing is comparing a 2025 anchor to a mid-2010s price that exists to retain revenue. At the big cloud vendors, they differentiate customers by adding obstacles to unlocking new hardware performance in the form of reservations and updated SKUs. There's deliberate customer action that needs to take place. Heroku doesn't appear to have much competition, so they keep their prices locked and we get to read an article like this whenever a new engineer discovers just how capable modern hardware is.
I mean Heroku is also offering all of the ancillary stuff around their product. It's not literally "just" hosting. It's pretty nice to not have to manage a kube cluster, to get stuff like ephemeral QA envs and the like, etc....
Heroku has obviously stagnated now but their stack is _very cool_ for if you have a fairly simple system but still want all the nice parts of a mode developed ops system. It almost lets you get away with not having an ops team for quite a while. I don't know any other provider that is low-effort "decent" ops (Fly seems to directionally want to be new Heroku but is still missing a _lot_ in my book, though it also has a lot)
Heroku is the Vercel of Rails: people will pay a fortune for it simply because it works. This has always been their business model, so it’s not really a new development. There’s little competition since the demand isn’t explosive and the margin is thin, so you end up with stagnation
You'd be surprised. There are very few because it takes a lot more work to build reliable systems across mid-market cloud providers (flakey APIs, missing functionality, etc). Plus you need to know the idiosyncrasies of all the various frameworks + build systems.
That said, they are emerging. I'm actually working on a drop-in Vercel competitor at https://www.sherpa.sh. We're 70% lower cost by running on EU based CDN and dedicated servers (Hetzner, etc). But we had to build the relationships to solve all the above challenges first.
> It's a shame they don't just license all their software stack at a reasonable price with a similar model like Sidekiq and let you sort out actually decent hardware
We built and open sourced https://canine.sh for exactly that reason. There’s no reason PaaS providers should be charging such a giant markup over already marked up cloud providers.
Heroku is pricing for “# of FTE headcount that can be terminated for switching to Heroku”; in that sense, this article’s $3000/mo bill is well below 1.0 FTE/month at U.S. pricing, so it’s not interesting to Heroku to address. I’m not defending this pricing lens, but it’s why their pricing is so high: if you aren’t switching to Heroku to layoff at least 1-2 FTE of salary per billing period, or using Heroku to replace a competitor’s equivalent replacement thereof, Heroku’s value assigned to you as a customer is net negative and they’d rather you went elsewhere. They can’t slam the door shut on the small fry, or else the unicorns would start up elsewhere, but they can set the pricing in FTE-terms and VCs will pay it for their moonshots without breaking a sweat.
This looks decent for what it is. I feel like there are umpteen solutions for easy self-hosted compute (and tbh even a plain Linux VM isn't too bad to manage). The main reason to use a PAAS provider is a managed database with built-in backups.
Its the flexibility and power of Kubernetes that I think is incredible. Scaling to multiple nodes is trivial, if your entire data plane is blown away, the recovery is trivial.
You can also self host almost any open source service without any fuss, and perform internal networking with telepresence. (For example, if you want to run an internal metabase that is not available on public internet, you can just run `telepresence connect`, and then visit the private instance at metabase.svc.cluster.local).
Canine tries to leverage all the best practices and pre-existing tools that are already out there.
But agreed, business critical databases probably shouldn't belong on Kubernetes.
Fully agreed - our recommendation is to /not/ run your prod Postgres db yourself, but use one of the many great dedicated options out there - Crunchy Data, Neon, Supabase, or AWS RDS..!
It really depends upon how much data you have. If its enough to just dump then go crazy. If it isn't its a bit more trouble.
Regardless, you're going to have a much easier time developing your app if your datastore access latency is submillisecond rather than tens of milliseconds.
You're running at a pretty small scale if running your database locally for sub-milisecond latency is practical. The database solution provided by the DBA team in a data center is going to have about the same latency as RDS or equivalent. Typical intra-datacenter network latency alone is going to be 1-3ms.
> $50 for a dyno with 1 GB of ram in 2025 is robbery
AWS isn't much better honestly.. $50/month gets you an m7a.medium which is 1 vCPU (not core) and 4GB of RAM. Yes that's more memory but any wonder why AWS is making money hand-over-fist..
This, plus as a backup plan going from Heroku to AWS wouldn't necessarily solve the problem, at least with our infra. When us-east-1 went down this week so did Heroku for us.
Not sure if it's an apples-to-apples comparison with Heroku's $50 Standard-2X dyno, but an Amazon Lightsail instance with 1GB of RAM and 2 vCPUs is $7/month.
That is assuming you need that 1 core 24/7, you can get 2 core / 8gb for $43, this will most likely fit 90% of workloads (steady traffic with spikes, or 9-5 cadence).
If you reserve that instance you can get it for 40% cheaper, or get 4 cores instead.
Yes it's more expensive than OVH but you also get everything AWS to offer.
I am not sure what's there to license. The hard and expensive part is in the labor to keep everything running. You are paying to make DevSecOps Somebody Else's Problem. You are paying for A Solution. You are not paying for software. There are plenty of Heroku clones mentioned in this thread.
I know you mean this sarcastically but I actually 100% agree with this particular on the steak point. Especially with beef prices at all time record highs and restaurant inflation being out of control post pandemic. It takes so much of the enjoyment out of things for me if I feel i'm being ripped off left and right.
What you're missing here is that companies happily pay the premium to Heroku because it lets them focus on building the product and generating business rather than wasting precious engineering time managing infra.
By the time the product is a success and reaches a scale where it becomes cost prohibitive, they have enough resources to expand or migrate away anyway.
I suppose for solo devs it might be cheaper to setup a box for fun, but even then, I would argue that not everyone enjoys doing devops and prefers spending their time elsewhere.
Not the best comment but I agree with the sentiment. I fear far too often, people complain about price when there are competitors/other cheaper options that could be used with a little more effort. If people cared so much then they should just use the alternative.
No one gets hurt if someone else chooses to waste their money on Heroku so why are people complaining? Of course it applies in cases where there aren't a lot of competitors but there are literally hundreds of different of different options for deploying applications and at least a dozen of them are just as reliable and cheaper than Heroku.
I'm hurt because a service I'm using is based on Heroku. I'm on the "unlimited" plan but they have backtracked on that and now say I'm too big for them...
The problem with Heroku's pricing is that it's set high enough that I no longer use it and neither does anyone else I know. I suspect they either pivoted to a different target market than me, which would be inconvenient but I'd be okay with it, or killed off their own growth potential by trying to extract revenue, which I would find sad.
This argument doesn't work with such commoditized software. It's more like comparing an oil change for $100 plus an hour of research and a short drive against a convenient oil change right next door for $2,500.
Nobody is forced to go to the expensive one. If they are still in business then enough people apparently consider it a reasonable deal. You might not, but others do. Whether I'm being downvoted or not.
It's just trendy to bash cloud and praise on-premises in 2025. In a few years that will turn around. Then in another few years it will turn around again.
The cloud has made people forget how far you can get with a single machine.
Hosting staging envs in pricey cloud envs seems crazy to me but I understand why you would want to because modern clouds can have a lot of moving parts.
Teaching a whole bunch of developers some cloud basics and having a few cloud people around is relatively cheap for quite a while. Plus, having test/staging/prod on similar configurations will help catch mistakes earlier. None of that "localstack runs just fine but it turns out Amazon SES isn't available in region antartica-east-1". Then, eventually, you pay a couple people's wages extra in cloud bills, and leaving the cloud becomes profitable.
Cloud isn't worth it until suddenly it is because you can't deploy your own servers fast enough, and then it's worth it until it exceeds the price of a solid infrastructure team and hardware. There's a curve to how much you're saving by throwing everything in the cloud.
Deploying to your private cloud requires basically the same skills. Containers, k8s or whatnot, S3, etc. Operating a large DB on bare metal is different from using a managed DB like Aurora, bit for developers, the difference is hardly visible.
RDS/managed database is extremely nice I will admit, otherwise I agree. Similarly s3, if you're going to do object storage, then running minio or whatever locally is probably not cheaper overall than R2 or similar.
The cloud has made people afraid of linux servers. The markup is essentially just the price business has to pay because of developer insecurity. The irony is that self hosting is relatively simple, and alot of fun. Personally never got the appeal of Heroku, Vercel and similar, because theres nothing better than spinning up a server and setting it up from scratch. Every developer should try it.
> The irony is that self hosting is relatively simple, and alot of fun. Personally never got the appeal of Heroku, Vercel and similar, because theres nothing better than spinning up a server and setting it up from scratch.
It's fun the first time, but becomes an annoying faff when it has to be repeated constantly.
In Heroku, Vercel and similar you git push and you're running. On a linux server you set up the OS, the server authentication, the application itself, the systemctl jobs, the reverse proxy, the code deployment, the ssl key management, the monitoring etc etc.
I still do prefer a linux server due to the flexibility, but the UX could be a lot better.
I use NixOS and a lot of it is in a single file. I just saw some ansible coming by here, and although I have no experience with it, it looked a lot simpler than Nix (for someone from the old Linux world, like me… eventhough Nix is, looking through your eyelashes, just a pile of key/value pairs).
Ansible, Salt and Puppet are mostly industry standard. Those tools are commonly referred to as configuration management (systems).
Ansible basically automates the workflow of: log in to X, do step X (if Y is not present). It has broad support for distros and OSes. It's mostly imperative and can be used like a glorified task runner.
Salt let's you mostly declaratively describe the state of a system. It comes with a agent/central host system for distributing this configuration from the central host to the minions (push).
Puppet is also declarative and also comes with an agent/central host system but uses a pull based approach.
Specialized/ exotic options are also available, like mgmt or NixOS.
Takes less than a day, because most of the stuff is scriptable. And for a simple compute node setup at Hetzner (I.e. no bare metal, but just a VM) it takes me less than half an hour.
I dunno, the cloud has mostly made me afraid of the cloud. You can bury yourself in towering complexity so easily on AWS. (The highly managed stuff like Vercel I don't have much experience with, so maybe it's different.)
I will recommend to try GCP or Azure, the complexity is lower there! AWS is great for big corporate that needs a lot of lego pieces to do their own custom setup. At the contrario, GCP and Azure solutions are often more bundled.
"Self hosting" may actually be referring not to hosting your own on-prem hardware, but to renting bare metal in which case the concerns around power usage, networking, etc. are offloaded to the provider.
my take is that it's fun up until there's just enough brittleness and chaos.. too many instance of the same thing but with too many env variables setup by hand and then fuzzy bug starts to pile up
Honestly I think it's the database that makes devs insecure. The stakes are high and you usually want PITR and regular backups even for low traffic apps. Having a "simple" turnkey service for this that can run in any environment (dedicated, VPS, colo, etc.) would be huge.
I think this is partly responsible for the increased popularity of sqlite as a backend. It's super simple and lightstream for recovery isn't that complicated.
Most apps don't need 5 9s, but they do care about losing data. Eliminate the possibility of losing data, without paying tons of $ to also eliminate potential outages, and you'll get a lot of customers.
Never got the appeal of having someone else do something for you, and giving them money, in exchange for goods and services? Vercel is easy. You pay them to make it easy. When you're just getting started, you start on easy mode before you jump into the deep end of the pool. Everybody's got a different cup of tea, and some like it hot and others like it cold.
Flour, salt, and water are exceedingly cheap. I have to imagine the loaf of bread I buy from my baker reflects considerably more than a 50x markup compared to baking my own.
It’s a lot cheaper than me learning to bake as well as he does—not to mention dedicating the time every day to get my daily bread—and I’ll never need bread on the kind of scale that would make it worth my time to do so.
Bread is a great example! You can buy a loaf for $3-4. It is not a 50x markup. Like growing your own veggies, baking bread is for fun, not for economics.
But the cloud is different. None of the financial scale benefits are passed on to you. You save serious money running it in-house. The arguments around scale have no validity for the vast, vast majority of use cases.
Vercel isn't selling bread: they're selling a fancy steak dinner, and yes, you can make steak at home for much less, and if you eat fancy steak dinners at fancy restaurants every night you're going to go broke.
So the key is to understand whether your vendors are selling you bread, or a fancy steak dinner, and to not make the mistake of getting the two confused.
That’s a tremendously clarifying framework, and it makes a lot of sense to me. Thank you.
I wonder, though—at the risk of overextending the metaphor—what if I don’t have a kitchen, but I need the lunch meeting to be fed? Wouldn’t (relatively expensive) catering routinely make sense? And isn’t the difference between having steak catered and having sandwiches catered relatively small compared to the alternative of building out a kitchen?
What if my business is not meaningfully technical: I’ll set up applications to support our primary function, and they might even be essential to the meat of our work. But essential in the same way water and power are: we only notice it when it’s screwed up. Day-to-day, our operational competency is in dispatching vehicles or making sandwiches or something. If we hired somebody with the expertise to maintain things, they’d sit idle—or need a retainer commensurate with what the Vercels and Herokus of the world are charging. We only need to think about the IT stuff when it breaks—and maybe to the extent that, when we expect a spike, we can click one button to have twice as much “application.”
In that case, isn’t it conceivable that it could be worth the premium to buy our way out of managing some portion of the lower levels of the stack?
Please do yourself a flavour and check the price of flour.
Water is cheap, yes. Salt isn't all that cheap, but you only need a little bit.
> [...] and I’ll never need bread on the kind of scale that would make it worth my time to do so.
If you need bread by hand, it's a very small scale affair. Your physique and time couldn't afford you large scale bread making. You'd a big special mixer and a big special oven etc for that. And you'd probably want a temperature and moisture controlled room just for letting your dough rise.
I blush to admit that I do from time to time pay $21 for a single sourdough loaf. It’s exquisite, it’s vastly superior to anything I could make myself (or anything I’ve found others doing). So I’m happy to pay the extreme premium to keep the guy in business and maintain my reliable access to it.
It weighs a couple of pounds, though I’m not clear how the water weight factors in to the final weight of a loaf. And I’m sure that flour is fancier than this one. I take your point—I don’t belong in the bread industry :)
Well, in your case, you are mostly paying for the guy's labour, I presume.
(Similarly to how you pay Amazon or Google etc not just for the raw cloud resources, but for the system they provide.)
I grew up in Germany, but now live in Singapore. What's sold as 'good' sourdough bread here would make you fail your baker's training in Germany: huge holes in the dough and other defects. How am I supposed to spread butter over this? And Mischbrot, a mixture of rye and wheat, is almost impossible to find.
So we make our own. The goal is mostly to replicate the everyday bread you can buy in Germany for cheap, not to hit any artisanal highs. (Though they are massively better IMHO than anything sold as artisanal here.)
Interestingly, the German breads we are talking about are mostly factory made. Factory bread can be good, if that's what customers demand.
Going on a slight tangent: with tropical heat and humidity, non-sourdough bread goes stale and moldy almost immediately. Sourdough bread can last for several days or even a week without going moldy in a paper bag on the kitchen counter outside the fridge, depending on how sour you go. If you are willing to toast your bread, going stale during that time isn't much of an issue either.
(Going dry is not much of an issue with any bread here--- sourdough or not, because it's so humid.)
Wait, what? Salt is literally one of the cheapest of all materials per kilogram that exists in all contexts, including non-food contexts. The cost is almost purely transportation from the point of production. High quality salt is well under a dollar a pound. I am currently using salt that I bought 500g for 0.29 euro. You can get similar in the US (slightly more expensive).
This was a meme among chemical engineers. Some people complain in reviews on Amazon that the salt they buy is cut with other chemicals that make it less salty. The reality is that there is literally nothing you could cut it with that is cheaper than salt.
One way or another, salt is not a major driver of cost in bread, because there's relatively little salt in bread. (If there's 1kg of flour, you might have 20g of salt.)
Yeah, but then we're just haggling. If you know how to change the belt on your car and already have the tools, it's different from when you're stranded with no tools and no garage and no belt.
If you're a mechanic you're supposed to know how to change the belt on your car. It would be insane if you write code and work with computers for a living but you dont know how to set up a web server.
I am pretty sure I know much more about code than you do, and at the same time you probably know much more about web servers and sysadmin than I do. I don't mind if it stays like that. And I am saying this having programmed my own web server in Java about 25 years ago.
Hum... Writing a game engine is a high-difficulty task that should be available to any reasonably good software developer with a few months to study for it. Making it in assembly is a sure way to take 10 times the time of another low level language like C, but shouldn't be an impossibility either.
Configuring a web server is a low-difficulty task that should be available for any good software developer with 3 days to study for it. It's absurd for a developer to need to configure a web server, but insist on paying a large rent and cede control to some 3rd party instead of just doing it.
I think OP is using these less as staging and more as dev environments for individual developers. That seems like a great use of a single server to me.
I'd still like a staging + prod, but keeping the dev environments on a separate beefy server seems smart.
I've been using a development server for about 9 years and the best thing I ever did was move to a machine with a low-power Xeon D for a time. It made development painful enough that I quickly fixed the performance issues I was able to overlook on more powerful hardware. I recommend it, even just as an exercise.
For similar reasons, in the Google office I worked in you had the option to connect to a really intentionally crappy wifi that was simulating a 2G connection.
The "platform" software runs on is just other software. If your prod environment is managed kubernetes then you don't lose much if your staging environment is self-hosted kubernetes.
Load balancers, IAM roles, kubernetes upgrades, postgres upgrades, security settings, DNS records, http routes... there's a lot that can go wrong and makes it useful to have a staging environment.
The cloud was a good deal in 2006 when the smallest aws machine was about the size of a ok dev desktop and took over two years of renting to justify buying the physical machine outright.
Today the smallest, and even large, aws machines are a joke, comparable to a mobile phone from 15 years ago to a terrible laptop today, and take about three to six months to in rent as buying the hardware outright.
If you're on the cloud without getting 75% discount you will save money and headcount by doing everything on prem.
This could be the premise for a fun project based infra learning site.
You get X resources in the cloud and know that a certain request/load profile will run against it. You have to configure things to handle that load, and are scored against other people.
also how far you can get with a single machine has changed massively in the past 15 years. 15 years ago a (really beefy) single machine meant 8 cores with 256GB ram and a couple TB of storage. Now a single machine can be 256 cores on 8TB of ram and a PB of storage.
Exactly, and the performance of consumer tech is wildly faster. Eg, a Ryzen 5825U mini pc with 16GB memory is ~$250USD with 512GB nvme. That thing will outperform of 14 core Xeon from ~2016 on multicore workloads and absolutely thrash it in single thread. Yes lack of ECC is not good for any serious workload, but great for lower environments/testing/prototyping, and it sips power at ~50W full tilt.
The best part is when you start with a $3000/month cloud bill during development and finally realize that hosting the production instance this way would actually cost $300k/month, but now it's too late to change it quickly.
You put your staging env in the same (kind of) place you put your prod system because you need to replicate your prod environment as faithfully as possible. You also then get to re-use your deployment code.
Do you plan on keeping it in your home? At that point I'd be worried about ISP networking or power guarantees unless you plan on upgrading to business rates for both. If you mean colo, well, if you're sure you'll be using it in X years, it's worth it, but the flexibility of month-to-month might be preferable.
> And while Hetzner's price-performance is exceptional, its limited presence in the US was a consideration; for this staging workload, it wasn't an issue, but it's a factor for production services targeting US users.
What is this referring to? Concerns about capacity if you need to scale up quickly? Or just "political"/marketing considerations about people not being used to being served by a Hetzner server?
Just to be aware when you say "Even with all 6 environments and other projects running, the server's resource usage remained low. The average CPU load stayed under 10%, and memory usage sat at just ~14 GB of the available 32 GB."
The load average in htop is actually per CPU core. So if you have 8 CPU cores like in your screenshot, a load average of 0.1 is actually 1.25% (10% / 8) of total CPU capacity - even better :).
Cool blog! I've been having so much success with this type of pattern!
what does this service offer over an established tool like Coolify? currently hosting most of my services on a cheap Hetzner VPS so i'm interested what Disco has to offer
Coolify and other self-hosting options such as Kamal are great. We're all in the same boat!
I'd say the main differences is that we 1) we offer a more streamlined CLI and UI rather than offering extensive app/installation options 2) have an api-key based system that lets team members collaborate without having to manage ssh access/keys.
Generally speaking, I'd say our approach and tooling/UX tends to be more functional/pragmatic (like Heroku) than one with every possible option.
https://devpu.sh/ is another alternative, it has a nice UI built with Hypermedia (HTMX).
I am building https://github.com/openrundev/openrun/. Main difference is that OpenRun has a declarative interface, no need for manual CLI commands or UI operations to manage apps. Another difference is that OpenRun is implemented as a proxy, it does not depend on Traefik/Nginx etc. This allows OpenRun to implement features like scaling down to zero, RBAC access control for app access, audit logs etc.
Downside with OpenRun is that is does not plan to support deploying pre-packaged apps, no Docker compose support. Streamlit/Gradio/FastHTML/Shiny/NiceGUI apps for teams are the target use case. Coolify has the best support and catalog of pre-packaged apps.
We've had a similar experience at Hack Club, the nonprofit I run that helps high schoolers get into coding and electronics.
We used to be on Heroku and the cost wasn't just the high monthly bill - it was asking "is this little utility app I just wrote really worth paying $15/month to host?" before working on it.
This year we moved to a self-hosted setup on Coolify and have about 300 services running on a single server for $300/month on Hetzner. For the most part, it's been great and let us ship a lot more code!
My biggest realization is that for an organization like us, we really only need 99% uptime on most of our services (not 99.99%). Most developer tools are around helping you reach 99.99% uptime. When you realize you only need 99%, the world opens up.
Disco looks really cool and I'm excited to check it out!
Cheers, let me know if you do / hop onto our Discord for any questions.
We know of two similar cases: a bootcamp/dev school in Puerto Rico that lets its students deploy all of their final projects to a single VPS, and a Raspberry Pi that we've set up at the Recurse Center [0] which is used to host (double checking now) ~75 web projects. On a single Pi!
Heroku's pricing is wild. About a decade ago I just about fell out of my chair when I found out the startup I was at was burning upwards of $10k/mo just to generate QR codes (made out of html tables so that they would reliably display in emails). It worked out to something like $0.15/code
The lead who wrote it had never even profiled code before, after some changes we cut it down to ~$0.01/per, but that's still insane.
What in the world?? Surely there must be something more than "generate a HTML page with 500 elements". Any edge cloud hosting lets you do that for free.
The article's title seems inaccurate - as far as I understood there never was a $3000/mo bill; there was a $500/(mo,instance) staging setup that has been rightly optimized to $55/mo before running six instances.
> Critically, all staging environments would share a single "good enough" Postgres instance directly on the server, eliminating the need for expensive managed database add-ons that, on Heroku, often cost more than the dynos themselves.
Heroku also has cheaper managed database add-ons, why not use something like that for staging? The move to self hosting might still make sense, my point is that perhaps the original staging costs of $500/mo could have been lower from the start.
I answered elsewhere with the list of dynos, but the short version is that between the list of services that each deployment required, and the size of the database, it truly (and unfortunately) did end up costing $500 per staging.
The situation is interesting, and self-hosting is indeed a very nice solution often. However, I wanted to comment on the article itself - it seems to be very heavily AI-edited. Anyone who has spent time with LLMs will easily see it. But even that's not the issue; the main issue is that the article is basically a marketing piece.
For example, the "Bridging the Gap: Why Not Just Docker Compose?" section is a 1:1 copy of the points in the "Powerful simplicity" on the landing page - https://disco.cloud/
And this blog post is the (only) case study that they showcase on their main page.
You're absolutely right! Here are some three points why:
- ...
I'm kidding :-)
Our library is open source, and we're very happy and proud that Idealist is using us to save a bit of cash. Is it marketing if you're proud of your work? :-) Cheers
Marketing should be marketing and clearly so. Tech blogs are about sharing information with the community (Netflix Tech blog is a good example) NOT selling something. Marketing masquerading as a tech blog is offputting to a lot of people. People don't like being fooled with embedded advertising and putting ad copy into such pieces is at best annoying.
Nah, people are stupid. Including me. It's all marketing. Netflix's tech blog is marketing to engineers to want to go work there and to promote their product. If you want to see things though the lense that all advertising is bad, you'll make your life miserable because it's all advertising in one way or another.
That's the problem with framing everything that way. This HN comment is marketing for my brand, my username, I sell t-shirts on my website! That's not why I'm commenting here, but the term is that broad because we're using it colloquially. It's a human psychology thing that I get entrapped into too. Calling it out doesn't make it not work. When you use the lens of marketing, your comment is marketing that you are not marketing, which is a specific category and advertising profile to be filed away in a database somewhere, if we go to the theoretical extremes.
What you've done is taken something I've written, redefined a core term in a way I obviously didn't mean, and then told me I'm wrong because of your redefinition.
When you put it that way, you make me sound like an ass. Is that how I'm coming across? What did I redefine? I'm refuting the fairytale where some content is pure and untainted by marketing. Netflix writes posts that make engineers want to work there and people think, "hey, that's smart!" That’s marketing.
I think a big difference is when someone is pretending to be all about something else and tries to sneakily market to you. One thing is getting a free water bottle with an ad, another thing is when someone is inviting you to a "party" with free food and drinks and it turns out to be a MLM "party".
Netflix is giving away free water bottles (I hate them, but I use their fast.com super often to test the speeds), another is pretending to be a blog post, but actually being an ad (if that was the case here). You just feel lied to. You cannot take anything seriously you read there, as it will probably be super biased and you cannot get your time back now.
Maybe not an ass, that's too strong, but it's a common online pattern where someone transforms your point into an entirely different meaning and then disagrees with that transformation. It's annoying.
I'm complaining about thinly veiled ad copy wearing the mask of shared technical notes. This is seen as a bad faith effort by the publisher of such notes and a dirty trick played on the reader. Advertising should announce itself for what it is.
I'm very clearly making a distinction, I like A, I don't like B.
You're taking that, saying I must actually hate both A and B, and by the way C through Z because nobody is 111% pure of heart and everybody must have at least some motivation for doing something and nobody is entirely altruistic.... which is just this crazy extreme that it's clear I don't believe at all.
I like the incentive structure that leads Netflix to produce objectively high quality articles sharing with the community in a way that really seems to be entirely untainted by the motivation.
Ad copy in tech notes does seem to taint the motivation and quality of them, it can be innocent but it doesn't seem like it and is generally irritating to a lot of people.
Dislike of a certain kind of advertising doesn't mean I'm sitting around miserable because nobody is truly altruistic as you suggest, and that the issue. My lines of thinking aren't taken to a silly extreme. A lot of disagreements these days are people reinterpreting their opposition as exclusively extremist and that's a problem.
You keep saying it's clear when it isn't. We don't know what's going on in your mind. Did you know there are people out there that won't eat anything that came from any animal products? That's crazy extreme! But there are tons of vegans out there. So what's seems extreme to one person is someone else's normal, and someone else's normal is extreme.
You say you like A and don't like B. You don't like B because it has X in it. But A also has X in it. So why do you like A but not B? It's not logically consistent. We disagree on how much X is in A. You want X to be clearly marked with red tape. It's not clear how reasonable and feasible that is or isn't. I'm saying if you're looking for X, you're going to find trace amounts of it everywhere once you start looking for it. X isn't some previously unheard of chemical that's gonna give you cancer or leaky gut though, it's other people making money. It's been chosen for us, that money is how the world works. It's not how I would do it, but I'm not in charge of the world, so it's a moot point. Everyone is weird about money in their own special way. I am no exception. What sticks in my craw is when people have problems with other people making money. How they make money is material. I'm not okay with making money off of sex trafficking or CSAM, for example, but advertising a product with an interesting bit of writing beforehand isn't that. So on the spectrum of your kid's painting that they made for you in school with crayon that were ethically sourced and drew on recycled paper, to the in your face red plastic Coca-Cola banner wrapped around the side of a bus that's gonna be fed to whales to choke and die on, where this particular blog post lies is for you to determine for yourself. Where I'm really getting at is that requiring X to be at a certain level has the unintended consequence that only big corporations with giant bags of money can create content that passes this purity test of yours, is, if we do some extrapolating, self-defeating.
I'm not sure you're functionally literate and you're beginning to ramble. So yes you're coming off as an asshole and just shouldn't respond like this. When I glance at your reply and you're bringing up sex trafficking somehow... yeah no thanks. This is the kind of reply definitely not worth engaging in.
> But even that's not the issue; the main issue is that the article is basically a marketing piece.
Why is that an issue? Is it forbidden by HN guidelines? Or would you like all marketing to be marked as such? Which articles aren't marketing, one way or another?
It's funny they have this marketing blog post based on competing on price yet don't disclose any of their pricing on their site only a schedule a meeting which is just about the biggest RED FLAG on pricing there is.
Our library is open source, the price is 0!! :-) Haha
We're actually mostly talking to people (that "schedule a meeting") to see how we can help them migrate their stuff away (from Heroku, Vercel, etc.)
But we're not sure of the pricing model yet - probably Entreprise features like Gitlab does, while remaining open source. It's a tough(er) balance than running a hosted service where you can "just" (over)charge people.
This isn't the first time an article is also marketing. Besides, what is wrong with marketing something via a use case article?
This is a fairly tame example of it and I found it an interesting and useful read, knowing full well it was also marketing.
I guess I'm not quite understanding why you need six staging servers provisioned at $500 a pop? And if you need that because you have a large team...what percentage of your engineering spend is $3000 vs $100k+/yr salaries?
Especially when I got look at the site in question (idealist.org) and it seems to be a pretty boring job board product.
6 staging servers: main, dev, and any branches that you want to let other (non tech people) QA.
As for the staging servers, for each deployment, it was a mix of Performance-M dynos, multiple Standard dynos, RabbitMQ, a database large enough, etc. - it adds up quickly.
Finally, Idealist serves ~100k users per day - behind the product is a lot of boring tech that makes it reliable & fast. :-)
From what I read, they're using them as dev environments. Like running many services at once for a single developer to tie into. That's why they wanted multiple ones, one for each dev.
This thinking definitely drives enterprise products, and is exactly what makes it hard for small companies. "You can pay a lot simply because you clearly can afford to" doesn't lead to great products, even if it often does lead to profitable companies.
Any VPS you fancy that fits the price/performance/location/support you want, then point Coolify/Dokploy/whatever at it.
I did just this using Coolify, Mythic Beasts running Django & Postgres the other month from Google App Engine. Hilariously easy, even with my extremely rusty skills.
It is worth learning to use Docker Swarm. Deployments are as simple as pushing a new container to your registry and running one command. I built a free CLI tool rove.dev that simplifies provisioning and does service diffing.
Either that or use a PaaS that deploys to VMs. Can't make recommendations here but you could start by looking at Semaphore, Dokku, Dokploy.
https://render.com/ is probably the closest, I'm really enjoying using them. Workflow is the same as heroku, but cheaper, no nightly restarts, supports new python versions etc..
From looking at your docs, it appears like using and connecting GitHub is a necessary prerequisite for using Disco. Is that correct? Can disco also deploy an existing Docker image in a registry of my choosing without a build step? (Something like this with Kamal: `kamal --skip-push --version latest`)
Correct, GitHub is necessary at this point to deploy code.
However, yes, you can ask Disco to fetch an existing Docker image (we use that to self-host RabbitMQ). An example of deploying Meilisearch's image is here [0] with the tutorial here [1].
Do you typically build your Docker images and push them to a registry? Curious to learn more about your deployment process.
Yes, I try to keep my CI pipelines somewhat platform-agnostic so even though I'm mostly using GitHub, my workflow is typically to first build a Docker image and push it to a registry, then use Kamal to deploy that image.
Yes, I'm just as curious as you on _why_ does a staging setup needs the same amount of resources as prod.
All of my staging setups are on a ~$15 Hetzner server, with a GitHub Action to `docker compose build && docker compose up -d` remotely, with an Apache service with a wildcard certificate and dynamic host names. We have 3..n staging setups, with each PR spinning up a new staging site just for that PR.
It's been working with us for years, for a team of 10 developers.
It is clear that Heroku is not interested in reducing their prices. But I don’t think this is a Heroku problem. Vercel is also the same, which makes me think there is a fundamental issue with the PaaS business model that stops it from competing on price while the commoditised part their business (data centers) are always reducing their prices.
The challenge I always face with homebrew PaaS solutions is that you always end up moving from managing your app to managing your PaaS.
This might not be true right now but as complexity of your app grows it’s almost always the eventual outcome.
Having been in the industry for 20 years, I can remember we were processing high loads with... Algorithms. It wasn't a cloud cost saving initiative back then, but a necessity if you had scale, you could just not throw money at scaling. Feels like we shifted optimization from algorithms to cloud costs savings...
Title seems slightly exaggerated since by my reading there was no actual $3000 / month bill? Still a great use-case
This seems like a good idea to have plentiful dev environments and avoid a bad pricing model. If your production instance is still on Heroku, you might still want a staging environment on Heroku since a Hetzner server and your production instance might have subtle differences.
It is hilarious, don't get me wrong - I really appreciate more people moving away from these "Hi-Tech" deployment styles and cloud services and the rest, but it is like rediscovering hot water.
The draw of a docker-compose-like interface for deployment is so alluring that I have spent the last year or so working on a tool called Defang that takes a compose file and deploys it to the cloud. We don't support Hetzner (yet), but we do support AWS, GCP, and DO. We provision networking, IAM, compute, database, secrets, etc in your cloud account, so you maintain full control, but you also get the ergonomics of compose.
If you are on a PaaS and you want to reduce cost without losing ergonomics and scalability, it might be interesting.
Quite sad to see devs nowadays has lost abilities to self-host. I know it can be overwhelming with Linux, networking, db, backup, hardware load.... However, it's not rocket science!
Cool to hear on the savings.
But now the team has to maintain two different deployment models, so you have account for the ongoing cost that your team now has to own and maintain two different processes of deployment (prod & staging).
The key element here is the need to continuously exercise both processes (Heroku + your staging server), to work out both processes & maintain familiarity on both.
Depending on the amount of staff involved in the above, it might eclipse the compute savings, but only OP knows those details. I'm sure they are a smart bunch.
I work at Render (render.com); we have over 4 million developers on the platform, and we've migrated many large (and small) Heroku customers over because of our more modern capabilities and scalable pricing.
You have your range of options - it depends on the size of your team, the kind of apps you're running, etc. The answer can be anything from an "ssh script" to AWS (or K8S), etc.
If you're running something that's too expensive for your taste and can share more information, happy to brainstorm some options.
I was looking on Hetzner after that recent article and their server marketplace has $34/month server that had something like an Intel Core i7 with 64GB RAM and 2x512GB SSDs. Compare that to EC2 pricing.
Just something to consider if you are in a professional environment before switching your entire infra: maintenance cost is expensive. I strongly suggest to throw man-days in your cost calculation.
To prevent security vulnerabilities, the team will need to write some playbooks to auto-update regularly your machine, hoping for no breaking changes. Or instead write a pipeline for immutable OS images updates. And it often mean testing on an additional canary VM first.
Scaling up the VM from a compute point of view is not that straightforward as well, and will require depending of the provider either downtime or to migrate the entire deployments to a new instance.
Scaling from a disk size point of view, you will need to play with filesystems.
And depending on the setup you are using, you might have to manage lets encrypt, authentication and authorization, secrets vaults, etc (here at least Disco manages the SSL certs for you)
If you are large enough, you will need an ops team to manage allowing your developers to write terraform and manage AWS costs already.
If you are small enough, you are not going to be truly affected by downtime. If you are just a little bigger, a single hot spare is going to be sufficient.
The place where you get dinged is heavy growth in personnel and bandwidth. You end up needing to solve CPU bound activities quicker because it hurts the whole system. You need to start thinking about sticky round robin load balancing and other fun pieces.
This is where the cloud can allow you to trade money for velocity. Eventually, though, you will need to pay up.
That said, the average SaaS can go a long way with a single server per product.
> I strongly suggest to throw man-days in your cost calculation.
Only if those man-days actually incur a marginal cost. If it's just employees you already have spending their time on things, then it's not worth factoring in because it's a cost you pay regardless.
From having talked to many folks, migrations are psychologically very, very, very very hard.
At least, the "fear" factor (will the new system work? what bugs will it introduce? how much time will I spend, etc.) pushes a lot of folks to accept a very big price differential aka known knowns versus unknowns...
It's understandable really. It's just that once you've migrated, you almost definitely never want to go back :-)
...but this CX33 "server" being discussed - is a 6 bucks a month VPS [0]
Normally you build a prototype on laptop and move it out to fat hardware when it outgrows that. Here they started with 3k infra and then later realized it runs on toaster. Completely back to front.
Maybe they just never iterated on a local version and nobody developed an intuition for requirements. Switched straight to iterating on a nebulous cloud where you can't tell how much horsepower is behind the cloudfunctions etc.
Presumably there is a perfectly reasonably explanation and it's just not spelled out, it just seems weird based on given info
The longest is to adapt your app to a Dockerfile-based deployment, if it isn't already containerized. We have examples for most languages - for Flask, for example, the whole file is 10 lines long [0]
But to provision a new server, as these are "stateless" (per 12 Factor) servers, it's just 1) get a VPS 2) install Docker+Disco using our curl|sh install script 3) authorize github 4) deploy a "project" (what we call an app), setting the env vars.
One thing however to note, is that by having a different non-prod and prod environment, it will be possible to test only the application, and not the infra.
Which means, that if they want to test what it will look like running in cloud for prod, they are going to either need a pre-prod environment or go yolo
I don't mean to hate but i find it incredibly alarming that i'm lately seeing all the seemingly seniorly positioned people writing articles about how they just realized that you can actually just buy a vps, setup a deployment workflow and write a revealing blog about "drastically cutting costs".
It's like juniors who did not recieve a proper training/education got hired into companies where someone told them to go serverless on some heroku or vercel, or use some incredibly expensive aws service because that's a "modern correct way" to do it, except now they were a developer for long enough to get a "senior" title in their job title now are in positions of actually modelling this architecture themselves
It sounds more like poor choices. 6 staging environments sounds a bit overkill.
If you can fit them all on a 4 cpu / 32gb machine, you can easily forgo them and run the stack locally on a dev machine. IME staging environments are generally snowflakes that are hard to stand up (no automation).
> you can easily forgo them and run the stack locally
Not if you're running with external resources of specific type, or want to share the ongoing work with others. Or need to setup 6 different projects with 3 different databases at the same time. It really depends on your setup and way of working. Sometimes you can do local staging easily, sometimes it's going to be a lot of pain.
Once everything is installed/running, a very tldr diagram would be:
GitHub (webhook on git push) -> Docker swarm running Caddy -> Disco Daemon REST API which will ask Docker to build the image, and then does a blue-green zero-time deployment swap
But yeah, a clearer/better diagram would be great. Thanks for the push!
i'd be interested what the load is like on that CCX33 server - i've got a lower-spec VPS from Hetzner and even from there I'm only using about 25%-30% CPU/RAM with a moderate load
> Even with all 6 environments and other projects running, the server's resource usage remained low. The average CPU load stayed under 10%, and memory usage sat at just ~14 GB of the available 32 GB.
Every time I've worked somewhere without one, we've wanted it and wasted more developer hours than the cost of having it trying to reproduce issues while working around the differences in the environments.
Why people discover it only today? I remember making comments about it years ago.
I even shown one customer that their elaborate cluster costing £10k a month could run on a £10 vps faster and with less headache (they set it up for "big data" thinking 50GB is massive. There was no expectation of the database growing substantially beyond that).
Their response? Investors said it must run on the cloud, because they don't want to lose their money if homegrown setup goes down.
Yes. The "cloud" is sold on grounds of "efficiency" but really it's just an ideological decision to increase outsourcing and reduce the employees' bargaining power.
(Except this backfires, because a service running on a RHEL or Debian machine might go on for 5-10 years untouched without any particular issue, security aside, while anything relying on kubernetes or the hyperscaler's million little services needs to be tweaked every 6 months and re-engineered every few years or it will completely stop working.)
We do have a UI, we're just so behind on the documentation, it's not even funny ha.
If you setup a server with the curl|sh install script on the homepage, you'll get a url at the end that directs you there. And you can use the CLI too of course.
> The Real Insight: Staging Became a Free Commodity
Not free, it became a productivity boost.
You now have a $35k annual budget for the maintenance, other overhead, and lost productivity. What do you spend it on?
> The team also took on responsibility for server monitoring, security updates, and handling any infrastructure issues themselves
For a place that’s paying devs $150k a year that might math out. It absolutely does not for places paying devs $250k+ a year.
One of the great frustrations of my mid career is how often people tried to bargain for more speed by throwing developers at my already late project when what would have actually helped almost immediately was more hardware and tooling. But that didn’t build my boss’ or his bosses’ empires. Don’t give me a $150k employee to train, give me $30k in servers.
Absolutely no surprise at all when devs were complicit with Cloud migrations because now you could ask forgiveness instead of permission for more hardware.
Looking at the htop screenshot, I notice the lack of swap. You may want to enable earlyoom, so your whole server doesn't go down when a service goes bananas. The Linux Kernel OOM killer is often a bit too late to trigger.
You can also enable zram to compress ram, so you can over-provision like the pros'. A lot of long-running software leaks memory that compresses pretty well.
Here is how I do it on my Hetzner bare-metal servers using Ansible: https://gist.github.com/fungiboletus/794a265cc186e79cd5eb2fe... It also works on VMs.
Even better than earlyoom is systemd-oomd[0] or oomd[1].
systemd-oomd and oomd use the kernel's PSI[2] information which makes them more efficient and responsive, while earlyoom is just polling.
earlyoom keeps getting suggested, even though we have PSI now, just because people are used to using it and recommending it from back before the kernel had cgroups v2.
[0]: https://www.freedesktop.org/software/systemd/man/latest/syst...
[1]: https://github.com/facebookincubator/oomd
[2]: https://docs.kernel.org/accounting/psi.html
"earlyoom is just polling"?
> systemd-oomd periodically polls PSI statistics for the system and those cgroups to decide when to take action.
It's unclear if the docs for systemd-oomd are incorrect or misleading; I do see from the kernel.org link that the recommended usage pattern is to use the `poll` system call, which in this context would mean "not polling", if I understand correctly.
Unrelated to the topic, it seems awfully unintuitive to name a function ‘poll’ if the result is ‘not polling.’ I’m guessing there’s some history and maybe backwards-compatible rewrites?
Poll takes a timeout parameter. ‘Not polling’ is just a really long timeout
"Let the underlying platform do the polling and return once the condition is met"
Another option would be to have more memory that required over-engineer and to adjust the oom score per app, adding early kill weight to non critical apps and negative weight to important apps. oom_score_adj is already set to -1000 by OpenSSH for example.
Another useful thing to do is effecively disable over-commit on all staging and production servers (0 ratio instead of 2 memory to fully disable as these do different things, memory 0 still uses formula) Also use a formula to set min_free and reserved memory using a formula from Redhat that I do not have handy based on installed memory. min_free can vary from 512KB to 16GB depending on installed memory. At least that worked for me in about 50,000 physical servers for over a decade that were not permitted to have swap and installed memory varied from 144GB to 4TB of RAM. OOM would only occur when the people configuring and pushing code would massively over-commit and not account for memory required by the kernel. Not following best practices defined by Java and thats a much longer story.Another option is to limit memory per application in cgroups but that requires more explaining than I am putting in an HN comment.
Another useful thing is to never OOM kill in the first place on servers that are only doing things in memory and need not commit anything to disk. So don't do this on a disked database. This is for ephemeral nodes that should self heal. Wait 60 seconds so drac/ilo can capture crash message and then earth shattering kaboom...
For a funny side note, those options can also be used as a holy hand grenade to intentionally unsafely reboot NFS diskless farms when failing over to entirely different NFS server clusters. setting panic to 15 mins, triggering OOM panic by setting min_free to 16TB at the command line via Ansible not in sysctl.conf, swapping clusters, arp storm and reconverge.Thanks for sharing I think these are very useful suggestions.
Yeah, no way. As soon as you hit swap, _most_ apps are going to have a bad, bad time. This is well known, so much so that all EC2 instances in AWS disable it by default. Sure, they want to sell you more RAM, but it's also just true that swap doesn't work for today's expectations.
Maybe back in the 90s, it was okay to wait 2-3 seconds for a button click, but today we just assume the thing is dead and reboot.
This is a wrong belief because a) SSDs make swap almost invisible, so you can have that escape ramp if something goes wrong b) SWAP space is not solely an escape ramp which RAM overflows into anymore.
In the age of microservices and cattle servers, reboot/reinstall might be cheap, but in the long run it is not. A long running server, albeit being cattle, is always a better solution because esp. with some excess RAM, the server "warms up" with all hot data cached and will be a low latency unit in your fleet, given you pay the required attention to your software development and service configuration.
Secondly, Kernel swaps out unused pages to SWAP, relieving pressure from RAM. So, SWAP is often used even if you fill 1% of your RAM. This allows for more hot data to be cached, allowing better resource utilization and performance in the long run.
So, eff it, we ball is never a good system administration strategy. Even if everything is ephemeral and can be rebooted in three seconds.
Sure, some things like Kubernetes forces "no SWAP, period" policies because it kills pods when pressure exceeds some value, but for more traditional setups, it's still valuable.
My work Ubuntu laptop has 40GB of RAM and and a very fast Nvme SSD, if it gets under memory pressure it slows to a crawl and is for all practical purposes frozen while swapping wildly for 15-20 minutes.
So no, my experience with swap isn't that it's invisible with SSD.
I don't know your exact situation, but be sure you're not mixing up "thrashing" with "using swap". Obviously, thrashing implies swap usage, but not the other way around.
If it’s frozen, or if the mouse suddenly takes seconds to respond to every movement, then it’s not just using some swap. It’s thrashing for sure.
I've experimented with no-swap and find the same thing happens. I think the issue is that linux can also evict executable pages (since it can just reload them from disk).
I've had good experience with linux's multi-generation LRU feature, specifically the /sys/kernel/mm/lru_gen/min_ttl_ms feature that triggers OOM-killer when the "working set of the last N ms doesn't fit in memory".
It's seldom invisible, but in my experience how visible it is depends on the size/modularity/performance/etc of what's being swapped and the underlying hardware.
On my 8gb M1 Mac, I can have a ton of tabs open and it'll swap with minimal slowdown. On the other hand, running a 4k external display and a small (4gb) llm is at best horrible and will sometimes require a hard reset.
I've seen similar with different combinations of software/hardware.
Linux being absolute dogshit if it’s under any sort of memory pressure is the reason, not swap or no swap. Modern systems would be much better off tweaking dirty bytes/ratios, but fundamentally the kernel needs to be dragged into the XXI century sometime.
This is not really true of most SSDs. When Linux is really thrashing the swap it’ll be essentially unusable unless the disk is _really_ fast. Fast enough SSDs are available though. Note that when it’s really thrashing the swap the workload is 100% random 4KB reads and writes in equal quantities. Many SSDs have high read speeds and high write speeds but have much worse performance under mixed workloads.
I once used an Intel Optane drive as swap for a job that needed hundreds of gigabytes of ram (in a computer that maxed out at 64 gigs). The latency was so low that even while the task was running the machine was almost perfectly usable; in fact I could almost watch videos without dropping frames at the same time.
How long is long running? You should be getting the warm caches after at most a few hours.
> Secondly, Kernel swaps out unused pages to SWAP, relieving pressure from RAM. So, SWAP is often used even if you fill 1% of your RAM. This allows for more hot data to be cached, allowing better resource utilization and performance in the long run.
Yes, and you can observe that even in your desktop at home (if you are running something like Linux).
> So, eff it, we ball is never a good system administration strategy. Even if everything is ephemeral and can be rebooted in three seconds.
I wouldn't be so quick. Google ran their servers without swap for ages. (I don't know if they still do it.) They decided that taking the slight inefficiency in memory usage, because they have to keep the 'leaked' pages around in actual RAM, is worth it to get predictability in performance.
For what it's worth, I add generous swap to all my personal machines, mostly so that the kernel can offload cold / leaked pages and keep more disk content cached in RAM. (As a secondary reason: I also like to have a generous amount of /tmp space that's backed by swap, if necessary.)
With swap files, instead of swap partitions, it's fairly easy to shrink and grow your swap space, depending on what your needs for free space on your disk are.
> SSDs make swap almost invisible
It doesn't. SSDs came a long way but so did memory dies and buses, and with that the way programs work also changed as more and more they are able to fit their stacks and heaps on memory more often than not.
I have had a problem with shellcheck that for some reason eats up all my ram when I open I believe .zshrc and trust me, it's not invisible. The system crawls to a halt.
It depends on the SSD, I may say.
If we're talking about SATA SSDs which top at 600MBps, then yes, an aggressive application can make itself known. However, if you have a modern NVMe, esp. a 4x4 one like Samsung 9x0 series or if you're using a Mac, I bet you'll notice the problem much later, if ever. Remember the SSD trashing problem on M1 Macs? People never noticed that system used SWAP that heavily and trashed the SSD on board.
Then, if you're using a server with a couple of SAS or NVMe SSDs, you'll not notice the problem again, esp. if these are backed by RAID (even md counts).
Now that you say, I have a new Lenovo yoga with those SoC ram with crazy parallel channel config (16gb spread across 8 dies of 2gb). It's noticeably faster than my Acer nitro with dual channel 16gb ddr5. I'll check that, but I'd say it's not what the average home user (and even server I'd risk saying) would have.
> it's not invisible. The system crawls to a halt.
I’m gonna guess you’re not old enough to remember computers with memory measured in MB and IDE hard disks? Swapping was absolutely brutal back then. I agree with the other poster, swap hitting an SSD is a barely noticeable in comparison.
What do you prefer:
( ) a 1% chance the system would crawl to a halt but would work
( ) a 1% change the kernel would die and nothing would work
I think I've not made myself as clear as I could. Swap is important for efficient system performance way before you hit OOM on main memory. It's not, however, going to save system responsiveness in case of OOM. This is what I mean.
The trade-off depends on how your system is set up.
Eg Google used to (and perhaps still does?) run their servers without swap, because they had built fault tolerance in their fleet anyway, so were happier to deal with the occasional crash than with the occasional slowdown.
For your desktop at home, you'd probably rather deal with a slowdown that gives you a chance to close a few programs, then just crashing your system. After all, if you are standing physically in front of your computer, you can always just manually hit the reset button, if the slowdown is too agonising.
That’s very common to distributed systems: much better to have a failed node than a slow node. Slow nodes are often contagious.
Can someone explain this to me? Doesn't swap just delay the fundamental issue? Or is there a qualitative difference?
Swap delays the 'fundamental issue', if you have a leak that keeps growing.
If your problem doesn't keep growing, and you just have more data that programs want to keep in memory than you have RAM, but the actual working set of what's accessed frequently still fits in RAM, then swap perfectly solves this.
Think lots of programs open in the background, or lots of open tabs in your browser, but you only ever rapidly switch between at most a handful at a time. Or you are starting a memory hungry game and you don't want to be bothered with closing all the existing memory hungry programs that idle in the background while you play.
I run a chat server on a small instance; when someone uploads a large image to the chat, the 'thumbnail the image' process would cause the OOM-killer to take out random other processes.
Adding a couple of gb of swap means the image resizing is _slow_, but completes without causing issues.
The problem is freezing the system for hours or more to delay the issue is not worth it. I'd rather a program get killed immediately than having my system locked up for hours before a program gets killed.
https://news.ycombinator.com/item?id=45007821
> Doesn't swap just delay the fundamental issue?
The fundamental issue here is what the linux fanboys literally think what killing a working process and most of the time the process[0] is a good solution for not solving the fundamental problem of memory allocation in the Linux kernel.
Availability of swap allows you to avoid malloc failure in a rare case your processes request more memory than physically (or 'physically', heh) present in the system. But in the mind of so called linux administrators even if a one byte of the swap would be used then the system would immediately crawl to a stop and never would recover itself. Why it always should be the worst and the most idiotic scenario instead of a sane 'needed 100MB more, got it - while some shit in the memory which wasn't accessed since the boot was swapped out - did the things it needed to do and freed that 100MB' is never explained by them.
[0] imagine a dedicated machine for *SQL server - which process would have the most memory usage on that system?
Indeed.
Also: When those processes that haven't been active since boot (and which may never be active again) are swapped out, more system RAM can become available for disk caching to help performance of things that are actively being used.
And that's... that's actually putting RAM to good use, instead of letting it sit idle. That's good.
(As many are always quick to point out: Swap can't fix a perpetual memory leak. But I don't think I've ever seen anyone claim that it could.)
What if I care more about the performance of things that aren't being used right now than the things that are? I'm sick of switching to my DAW and having to listen to my drive thrash when I try to play a (say) sampler I had loaded.
Just set swappiness to [say] 5, 2, 1, or even 0, and move on with your project with a system that is more reluctant to go into swap.
And maybe plan on getting more RAM.
(It's your system. You're allowed to tune it to fit your usage.)
Sounds like you just need more memory.
Kubernetes supports swap now.
I still don’t use it though.
What pressure? If your ram is underutilized, what pressure are you talking about?
If the slowest drive on the machine is the SSD, how does caching to swap help?
A long running Linux system uses 100% of its RAM. Every byte unused for applications will be used as a disk cache, given you read more data than your total RAM amount.
This cache is evictable, but it'll be there eventually.
Linux used to don't touch unused pages in the RAM in the older days if your RAM was not under pressure, but now it swaps out pages unused for a long time. This allows more cache space in RAM.
> how does caching to swap help?
I think I failed to convey what I tried to say. Let me retry:
Kernel doesn't cache to SSD. It swaps out unused (not accessed) but unevictable pages to SWAP, assuming that these pages will stay stale for a very long time, allowing more RAM to be used as cache.
When I look to my desktop system, in 12 days, Kernel moved 2592MB of my RAM to SWAP despite having ~20GB of free space. ~15GB of this free space is used as disk cache.
So, to have 2.5GB more disk cache, Kernel moved 2592 MB of non-accessed pages to SWAP.
Yes, and if I am writing an API service, for example, I don’t want to suddenly add latency because I hit pages that have been swapped out. I want guarantees about my API call latency variance, at least when the server isn’t overloaded.
I DON’T WANT THE KERNEL PRIORITIZING CACHE OVER NRU PAGES.
The easiest way to do this is to disable swap.
If you’re writing services in anything higher level than C you’re leaking something somewhere that you probably have no idea exists and the runtime won’t ever touch again.
You better not write your API in Python, or any language/library that uses amortised algorithms in the standard (like Rust and C++ do). And let's not mention garbage collection.
Or you can set the vm.swappiness sysctl to 0.
I’m asking because I genuinely don’t know - what are “pages” here?
That’s a fair question. A page is the smallest allocatable unit of RAM, from the OS/kernel perspective. The size is set by the CPU, traditionally 4kB, but these days 8kB-4MB are also common.
When you call malloc(), it requests a big chunk of memory from the OS, in units of pages. It then uses an allocator to divide it up into smaller, variable length chunks to form each malloc() request.
You may have heard of “heap” memory vs “stack” memory. The stack of course is the execution/call stack, and heap is called that because the “heap allocator” is the algorithm originally used for keeping track of unused chunks of these pages.
(This is beginner CS stuff so sorry if it came off as patronizing—I assume you’re either not a coder or self-taught, which is fine.)
Edit:
The command you want to use is "free -m".
This is from another system I have close:
2MB of SWAP used, 1423 MB RAM used, 29GB cache, 1042 MB Free. Total RAM 32 GB.If you are interested in human consumption, there's "free --human" which decided on useful units by itself. The "--human" switch is also available for "du --human" or "df --human" or "ls -l --human". It's often abbreviated as "-h", but not always, since that also often stands for "--help".
Thanks! My other problem was formatting. Just wanted to share that I see 0 swap usage and nowhere near 100% memory usage as a counterpoint.
The OS uses almost all the ram in your system (it just doesn't tell you because then users complain that their OS is too ram heavy). The primary thing it uses it for is caching as much of your storage system as possible. (e.g. all of the filesystem metadata and most of the files anyone on the system has touched recently). As such, if you have RAM that hasn't been touched recently, the OS can page it out and make the rest of the system faster.
At the cost of tanking performance for the less frequently used code path. Sometimes it is more important to optimize in ways that minimize worst case performance rather than a marginal improvement to typical work loads. This is often the case for distributed systems, e.g. SaaS backends.
In EC2 using any kind of swapping is just wrong, the comment you replied to already made all the points that can be made though.
From my understanding, the comment I'm replying to uses EC2 example to portray that swapping is wrong in any and all circumstances, and I just replied with my experience with my system administrator hat.
I'm not an AWS guy. I can see and touch the servers I manage, and in my experience, SWAP works, and works well.
Just for context EC2 typically uses network storage that, for obvious reasons, often has fairly rubbish latency and performance characteristics. Swap works fine if you have local storage, though obviously it burns through your SSD/NVME drive faster and can other side effects on it's performance (usually not particularly noticeable).
This is a wrong belief
This is not about belief, but lived experience. Setting up swap to me is a choice between a unresponsive system (with swap) or a responsive system with a few oom kills or downed system.
> This is not about belief, but lived experience.
I mean, I manage some servers, and this is my experience.
> Setting up swap to me is a choice between a unresponsive system (with swap) or a responsive system with a few oom kills or downed system.
Sorry, but are you sure that you budgeted your system requirements correctly? A Linux system shall neither fill SWAP nor trigger OOM regularly.
Swap also works really well for desktop workloads. (I guess that's why Apple uses it so heavily on their Macbooks etc.)
With a good amount of swap, you don't have to worry about closing programs. As long as your 'working set' stays smaller than your RAM, your computer stays fast and responsive, regardless of what's open and idling in the background.
It doesn’t happen often, and I have a multi user system with unpredictable workloads. It’s also not about swap filling up, but giving the pretense the system is operable in a memory exhausted state which means oom killer doesn’t run, but the system is unresponsive and never recovers.
Without swap oom killer runs and things become responsive.
"as soon as you hit swap" is a bad way of looking at things. Looking around at some servers I run, most of them have .5-2GB of swap used despite a bunch of gigabytes of free memory. That data is never or almost never going to be touched, and keeping it in memory would be a waste. On a smaller server that can be a significant waste.
Swap is good to have. The value is limited but real.
Also not having swap doesn't prevent thrashing, it just means that as memory gets completely full you start dropping and re-reading executable code over and over. The solution is the same in both cases, kill programs before performance falls off a cliff. But swap gives you more room before you reach the cliff.
Yeahna, thats just memory exhaustion.
Swap helps you use ram more efficiently, as you put the hot stuff in swap and let the rest fester on disk.
Sure if you overwhelm it, then you're gonna have a bad day, but thats the same without swap.
Seriously, swap is good, don't believe the noise.
It's good, and Aws shouldn't disable it by default, but it won't save the system from OOM.
I bet there's a big "burns through our SSDs faster" spreadsheet column or similar that caused it to be disabled.
Maybe. Or maybe it's an arbitrary decision.
Many won't enable swap. For some swap wouldn't help anyways, but others it could help soak up spikes. The latter in some cases will upgrade to a larger instance without even evaluating if swap could help, generating AWS more money.
Either way it's far-fetched to derive intention from the fact.
I don’t understand. If you provision the system with enough RAM, then you can for every page in RAM, hot or not.
Only if you have more RAM than disk space, which is wasteful for many applications.
Running out of memory kills performance. It is better to kill the VM and restart it so that any active VM remains low latency.
That is my interpretation of what people are saying upthread, at least. To which posters such as yourself are saying “you still need swap.” Why?
RAM costs money, disk space costs less money.
It's a bit wasteful to provision your computers so that all the cold data lives in expensive RAM.
>It's a bit wasteful to provision your computers so that all the cold data lives in expensive RAM.
But that's a job applications are already doing. They put data that's being actively worked on in RAM they leave all the rest in storage. Why would you need swap once you can already fit the entire working set in RAM?
Because then you have more active working memory as infrequently used pages are moved to compressed swap and can be used for more page cache or just normal resident memory.
Swap ram by itself would be stupid but no one doing this isn’t also turning on compression.
Sure, some applications are written to manually do a job that your kernel can already do for you.
In that case, and if you are only running these applications, the need for swap is much less.
You mean to tell me most applications you've ever used read the entire file system, loading every file into memory, and rely on the OS to move the unused stuff to swap?
No? What makes you think so?
Then what do you mean, some applications organize hot and cold data in RAM and storage respectively? Just about every application does it.
When building distributed systems, service degradation means you’ll have to provision more systems. Cheaper to provision fewer systems with more RAM.
It depends on what you are doing, and how your system behaves.
If you size your RAM and swap right, you get no service degradation, but still get away with using less RAM.
But when I was at Google (about a decade ago), they followed exactly the philosophy you were outlining and disabled swap.
How programs use ram also changed from the 90s. Back then they were written targeting machines that they knew would have a hard time fitting all their data in memory, so hitting swap wouldn't hurt perceived performance too drastically since many operations were already optimized to balance data load between memory and disk.
Nowadays when a program hits swap it's not going to fallback to a different memory usage profile that prioritises disk access. It's going to use swap as if it were actual ram, so you get to see the program choking the entire system.
Exactly. Nowadays, most web services are run in a GC'ed runtime. That VM will walk pointers all over the place and reach into swap all the time.
Depends entirely on the runtime.
If your GC is a moving collector, then absolutely this is something to watch out for.
There are, however, a number of runtimes that will leave memory in place. They are effectively just calling `malloc` for the objects and `free` when the GC algorithm detects an object is dead.
Go, the CLR, Ruby, Python, Swift, and I think node(?) all fit in this category. The JVM has a moving collector.
Python’s not a mover but the cycle breaker will walk through every object in the VM.
Also since the refcounts are inline, adding a reference to a cold object will update that object. IIRC Swift has the latter issue as well (unless the heap object’s RC was moved to the side table).
MemBalancer is a relatively new analysis paper that argues having swap allows maximum performance by allowing small excesses, that avoids needing to over-provision ram instead. The kind of gc does not matter since data spends very little time in that state and on the flip side, most of the time the application has twice has access to twice as much memory to use
A moving GC should be better at this, because it can compact your memory.
A moving collector has to move to somewhere and, generally by it's nature, it's constantly moving data all across the heap. That's what makes it end up touching a lot more memory while also requiring more memory. On minor collections I'll move memory between 2 different locations and on major collections it'll end up moving the entire old gen.
It's that "touching" of all the pages controlled by the GC that ultimately wrecks swap performance. But also the fact that moving collector like to hold onto memory as downsizing is pretty hard to do efficiently.
Non-moving collectors are generally ultimately using C allocators which are fairly good at avoiding fragmentation. Not perfect and not as fast as a moving collector, but also fast enough for most use cases.
Java's G1 collector would be the worst example of this. It's constantly moving blocks of memory all over the place.
> It's that "touching" of all the pages controlled by the GC that ultimately wrecks swap performance. But also the fact that moving collector like to hold onto memory as downsizing is pretty hard to do efficiently.
The memory that's now not in use, but still held onto, can be swapped out.
Every garbage collector has to constantly sift through the entire reference graph of the running program to figure out what objects have become garbage. Generational GC's can trace through the oldest generations less often, but that's about it.
Tracing garbage collectors solve a single problem really really well - managing a complex, possibly cyclical reference graph, which is in fact inherent to some problems where GC is thus irreplaceable - and are just about terrible wrt. any other system-level or performance-related factor of evaluation.
> Every garbage collector has to constantly sift through the entire reference graph of the running program to figure out what objects have become garbage.
There's a lot of "it depends" here.
For example, an RC garbage collector (Like swift and python?) doesn't ever trace through the graph.
The reason I brought up moving collectors is by their nature, they take up a lot more heap space, at least 2x what they need. The advantage of the non-moving collectors is they are much more prompt at returning memory to the OS. The JVM in particular has issues here because it has pretty chunky objects.
> The reason I brought up moving collectors is by their nature, they take up a lot more heap space, at least 2x what they need.
If the implementer cares about memory use it won't. There are ways to compact objects that are a lot less memory-intensive than copying the whole graph from A to B and then deleting A.
Modern garbage collectors have come a long way.
Even not so modern ones: have you heard of generational garbage collection?
But even in eg Python they introduced 'immortal objects' which the GC knows not to bother with.
This is really interesting and I've never really heard about this. What is going on with the kernel team then? Are they just going to keep swap as-is for backwards compatibility then everyone else just disables it? Or if this advice just for high performance clusters?
No. I use swap for my home machines. Most people should leave swap enabled. In fact I recommend the setup outlined in the kernel docs for tmpfs: https://docs.kernel.org/filesystems/tmpfs.html which is to have a big swap and use tmpfs for /tmp and /var/tmp.
As someone else said, swap is important not only in the case the system exhaust main memory, but it's used to efficiently use system memory before that (caching, offload page blocks to swap that aren't frequently used etc...)
My 2cents is that in a lot of cases swap is being used for unimportant stuff leave more RAM for your app. Do a "ps aux" and look at all the RAM used by weird stuff. Good news is those things will be swapped out.
Example on my personal VPS
The beauty of ZRAM is that on any modern-ish CPU it's surprisingly fast. We're talking 2-3 ms instead of 2-3 seconds ;)
I regularly use it on my Snapdragon 870 tablet (not exactly a top of the line CPU) to prevent OOM crashes (it's running an ancient kernel and the Android OOM killer basically crashes the whole thing) when running a load of tabs in Brave and a Linux environment (through Tmux) at the same time.
ZRAM won't save you if you do actually need to store and actively use more than the physical memory but if 60% of your physical memory is not actively used (think background tabs or servers that are running but not taking requests) it absolutely does wonders!
On most (web) app servers I happily leave it enabled to handle temporary spikes, memory leaks or applications that load a whole bunch of resources that they never ever use.
I'm also running it on my Kubernetes cluster. It allows me to set reasonable strict memory limits while still having the certainty that Pods can handle (short) spikes above my limit.
Is it possible you misread the comment you're replying to? They aren't recommending adding swap, they're recommending adjusting the memory tunables to make the OOM killer a bit more aggressive so that it starts killing things before the whole server goes to hell.
YMMV. Garbage-collected/pointer-chasing languages suffer more from swapping because they touch more of the heap all the time. AWS suffers more from swap because EBS is ridiculously slow and even their instance-attached NVMe is capped compared physical NVMe sticks.
Does HDD vs SSD matter at all these days? I can think of certain caching use-cases where swapping to an SSD might make sense, if the access patterns were "bursty" to certain keys in the cache
It's still extremely slow and can cause very unpredictable performance. I have swap setup with swappiness=1 on some boxes, but I wouldn't generally recommend it.
HDDs are much, much slower than SSD.
If swapping to SSD is 'extremely slow', what's your term for swapping to HDD?
‘Hard reboot’ (not OP)
what an ignorant and clueless comment. Guess what? Todays disks are NVMe drives which are orders of magnitude faster than the 5400rpm HDDs of the 90s. Today's swap is 90s RAM.
Where on earth did you get this misconception?
Lived experience? With swap system stays up but is unresponsive, without it is either responsive due to oom kill or completely down.
in either case, what do you do? if you can't reach a box and it's otherwise safe to do so, you just reboot it. so is it just a matter of which situation occurs more often?
The thing is you can survive memory exhaustion if the oom killer can do its job, which it can't many times when there's swap. I guess the topmost response to this thread talks about an earlyoom tool that might alleivate this, but I've never used it, and I don't find swap helpful anyway so there's no need for me to go down this route.
It's not just 3 seconds for a button click, every time I've run out of RAM on a Linux system, everything locks up and it thrashes. It feels like 100x slowdown. I've had better experiences when my CPU was underclocked to 20% speed. I enable swap and install earlyoom. Let processes die, as long as I can move the mouse and operate a terminal.
> It feels like 100x slowdown.
Yup, this is a thing. It happens because file-backed program text and read-only data eventually get evicted from RAM (to make room for process memory) so every access to code and/or data beyond the current 4K page can potentially involve a swap-in from disk. It would be nice if we had ways of setting up the system so that pages of code or data that are truly critical for real-time responsiveness (including parts of the UI) could not get evicted from RAM at all (except perhaps to make room for the OOM reaper itself to do its job) - but this is quite hard to do in practice.
To learn tricks like this what resource do you recommend I read? System administrators handbook? (Still on my TOREAD queue)
It's always a good idea to have a tiny amount of swap just in case. Like 1GB.
Why?
Because some portion of the RAM used by your daemons isn't actually being accessed, and using that RAM to store file cache is actually a better use than storing idle memory. The old rule about "as much swap as main memory" definitely doesn't hold any more, but a few GB to store unneeded wired memory to dedicate more room to file cache is still useful.
As a small example from a default Ubuntu installation, "unattended-upgrades" is holding 22MB of RSS, and will not impact system performance at all if it spends next week swapped out. Bigger examples can be found in monolithic services where you don't use some of the features but still have to wire them into RAM. You can page those inactive sections of the individual process into swap, and never notice.
Like a highway brake failure ramp, you have room for handling failures gentler. So services don't just get outright killed. If you monitor your swap usage, any usage of swap gives you early warning that your services require more memory already.
Gives you some time to upgrade, or tune services before it goes ka-boom.
If your memory usage is creeping up, the way you'll find out that you need more memory is by monitoring memory usage via the same mechanisms you'd hypothetically use to monitor your swap usage.
If your memory usage spikes suddenly, a nominal amount of swap isn't stopping anything from getting killed; you're at best buying yourself a few seconds, so unless you spend your time just staring at the server, it'll be dead anyways.
Thanks for this. We resorted to setting ram thresholds in systemd.
Is earlyoom a better solution than that to prevent an erratic process from making an instance unresposnsive?
Some workloads may do better with zswap. Cache is compressed, and pages evicted to disk based swap on an LRU basis.
The case of swap thrashing sounds like a misbehaving program, which can maybe be tamed by oomd.
System responsiveness though needs a complete resource control regime in place, that preserves minimum resources for certain critical processes. This is done with cgroupsv2. By establishing minimum resources, the kernel will limit resources for other processes. Sure, they will suffer. That’s the idea.
Of course swap should be enabled. But oom killer has always allowed access to an otherwise unreachable system. The pause is there so you can impress your junior padawan who rushed to you in a hurry.
What's the performance hit from compressing ram?
It's sometimes not a hit, because CPUs have caches and memory bandwidth is the limiting factor.
Depends on the algorithm (and how much CPU is in use); if you have a spare CPU, the faster algorithms can more-or-less keep up with your memory bandwidth, making the overhead negligible.
And of course the overhead is zero when you don't page-out to swap.
I haven’t scientifically measured, but you don’t compress the whole ram. It is more about reserving a part of the ram to have very fast swap.
For an algorithm using the whole memory, that’s a terrible idea.
> It is more about reserving a part of the ram to have very fast swap.
I understand all of those words, but none of the meaning. Why would I reserve RAM in order to put fast swap on it?
Swap to disk involves a relatively small pipe (usually 10x smaller than RAM). So instead of paying the cost to page out to disk immediately, you create compressed pages and store that in a dedicated RAM region for compressed swap.
This has a number of benefits: in practice more “active” space is freed up as unused pages are compressed and often compressible. Often times that can be freed application memory that is reserved within application space but in the free space of the allocator, especially if that allocator zeroes it those pages in the background, but even active application memory (eg if you have a browser a lot of the memory is probably duplicated many times across processes). So for a usually invisible cost you free up more system RAM. Additionally, the overhead of the swap is typically not much more than a memcpy even compressed which means that you get dedup and if you compressed erroneously (data still needed) paging it back in is relatively cheap.
It also plays really well with disk swap since the least frequently used pages of that compressed swap can be flushed to disk leaving more space in the compressed RAM region for additional pages. And since you’re flushing retrieving compressed pages from disk you’re reducing writes on an SSD (longevity) and reducing read/write volume (less overhead than naiive direct swap to disk).
Basically if you think of it as tiered memory, you’ve got registers, l1 cache, l2 cache, l3 cache, normal RAM, compressed swap RAM, disk swap - it’s an extra interim tier that makes the system more efficient.
>...but you don’t compress the whole ram.
I do: https://postimg.cc/G8Gcp3zb (casualmeasurement.png)
> zram, formerly called compcache, is a Linux kernel module for creating a compressed block device in RAM, i.e. a RAM disk with on-the-fly disk compression. The block device created with zram can then be used for swap or as a general-purpose RAM disk
To clarify OP's represention of the tool, it compresses swap space not resident ram. Outside of niche use-cases, compressing swap has overall little utility.
Incorrect, with zram you swap ram to compressed ram.
It has the benefit of absorbing memory leaks (which for whatever reason compress really well) and compressing stale memory pages.
Under actual memory pressure performance will degrade. But in many circumstances where your powerful CPU is not fully utilized you can 2x or even 3x your effective RAM (you can opt for zstd compression). zram also enables you to make the trade-off of picking a more powerful CPU for the express purpose of multiplying your RAM if the workload is compatible with the idea.
PS: On laptops/workstations, zram will not interfere with an SSD swap partition if you need it for hibernation. Though it will almost never be used for anything else if you configure your zram to be 2x your system memory.
Haven't used swap since 2010.
How do you get swap on a VPS?
Search "linux enable swap in a file"
Yes. I think might also need to chmod 600 /swapfile. I do this on all my VPS, especially helps for small VPS with only 1GB ram:
Works really well with no problems that I've seen. Really helps give a bit more of a buffer before applications get killed. Like others have said, with SSD the performance hit isn't too bad.IME SWAP has been explicitly disabled by the VPS providers.
Partly it's a money thing (they want to sell you RAM), partly it's so that the shared disk isn't getting thrashed by multiple VPS
Strongly suggest you try doing that on a VPS, then report back
What do you think is going to happen? I tested it out on an ec2 instance just now and it seems to have worked as one would expect.
EC2 != VPS
They both offer virtualized guests under a hypervisor host. EC2 does have more offload specialization hardware but for the most part they are functionally equivalent, unless I'm missing something...
https://news.ycombinator.com/item?id=45007821
And that was like... two years ago? 1GB of RAM and actually ~700MB usable before I found the proper magik incantations to really disable kdump.
Also have used 1GB machines for literally years.
Strongly suggest you shouldn't strongly suggest.
Uh.. your link... doesn't show how a VPS can have SWAP enabled
You do understand what's being discussed... right?
Literally up the chain: https://news.ycombinator.com/item?id=45663111
Or you have a very peculiar understanding what 'VPS' means.
https://en.wikipedia.org/wiki/Virtual_private_server
As I said earlier
https://news.ycombinator.com/threads?id=awesome_dude#4566311...
Just saw Nate Berkopec who does a lot of rails performance stuff posting about the same idea yesterday saying Heroku is 25-50x price for performance which is so insane. They clearly have zero interest in competing on price.
It's a shame they don't just license all their software stack at a reasonable price with a similar model like Sidekiq and let you sort out actually decent hardware. It's insane to consider Heroku if anything has gotten more expensive and worse compared to a decade ago yet in comparison similar priced server hardware has gotten WAY better of a decade. $50 for a dyno with 1 GB of ram in 2025 is robbery. It's even worse considering running a standard rails app hasn't changed dramatically from a resources perspective and if anything has become more efficient. It's comical to consider how many developers are shipping apps on Heroku for hundreds of dollars a month on machines with worse performance/resources than the macbook they are developing it on.
It's the standard playback that damn near everything in society is going for though just jacking prices and targeting the wealthiest least price sensitive percentiles instead of making good products at fair prices for the masses.
Jacked up prices isn't what is happening here. There is a psychological effect that Heroku and other cloud vendors are (wittingly or unwittingly) the beneficiary of. Customer expectations are anchored in the price they pay when they start using the service, and without deliberate effort, those expectations change in _linear_ fashion. Humans think in linear terms, while actual compute hardware improvements are exponential.
Heroku's pricing has _remained the same_ for at least seven years, while hardware has improved exponentially. So when you look at their pricing and see a scam, what you're actually doing is comparing a 2025 anchor to a mid-2010s price that exists to retain revenue. At the big cloud vendors, they differentiate customers by adding obstacles to unlocking new hardware performance in the form of reservations and updated SKUs. There's deliberate customer action that needs to take place. Heroku doesn't appear to have much competition, so they keep their prices locked and we get to read an article like this whenever a new engineer discovers just how capable modern hardware is.
I mean Heroku is also offering all of the ancillary stuff around their product. It's not literally "just" hosting. It's pretty nice to not have to manage a kube cluster, to get stuff like ephemeral QA envs and the like, etc....
Heroku has obviously stagnated now but their stack is _very cool_ for if you have a fairly simple system but still want all the nice parts of a mode developed ops system. It almost lets you get away with not having an ops team for quite a while. I don't know any other provider that is low-effort "decent" ops (Fly seems to directionally want to be new Heroku but is still missing a _lot_ in my book, though it also has a lot)
Heroku is the Vercel of Rails: people will pay a fortune for it simply because it works. This has always been their business model, so it’s not really a new development. There’s little competition since the demand isn’t explosive and the margin is thin, so you end up with stagnation
Vercel should have a ton of competition on account of the frontend space being much larger than Heroku's market.
Netlify sets the same prices.
Just throw it into a cloud bucket from CI and be done with it.
You'd be surprised. There are very few because it takes a lot more work to build reliable systems across mid-market cloud providers (flakey APIs, missing functionality, etc). Plus you need to know the idiosyncrasies of all the various frameworks + build systems.
That said, they are emerging. I'm actually working on a drop-in Vercel competitor at https://www.sherpa.sh. We're 70% lower cost by running on EU based CDN and dedicated servers (Hetzner, etc). But we had to build the relationships to solve all the above challenges first.
> It's a shame they don't just license all their software stack at a reasonable price with a similar model like Sidekiq and let you sort out actually decent hardware
We built and open sourced https://canine.sh for exactly that reason. There’s no reason PaaS providers should be charging such a giant markup over already marked up cloud providers.
Heroku is pricing for “# of FTE headcount that can be terminated for switching to Heroku”; in that sense, this article’s $3000/mo bill is well below 1.0 FTE/month at U.S. pricing, so it’s not interesting to Heroku to address. I’m not defending this pricing lens, but it’s why their pricing is so high: if you aren’t switching to Heroku to layoff at least 1-2 FTE of salary per billing period, or using Heroku to replace a competitor’s equivalent replacement thereof, Heroku’s value assigned to you as a customer is net negative and they’d rather you went elsewhere. They can’t slam the door shut on the small fry, or else the unicorns would start up elsewhere, but they can set the pricing in FTE-terms and VCs will pay it for their moonshots without breaking a sweat.
This looks decent for what it is. I feel like there are umpteen solutions for easy self-hosted compute (and tbh even a plain Linux VM isn't too bad to manage). The main reason to use a PAAS provider is a managed database with built-in backups.
Its the flexibility and power of Kubernetes that I think is incredible. Scaling to multiple nodes is trivial, if your entire data plane is blown away, the recovery is trivial.
You can also self host almost any open source service without any fuss, and perform internal networking with telepresence. (For example, if you want to run an internal metabase that is not available on public internet, you can just run `telepresence connect`, and then visit the private instance at metabase.svc.cluster.local).
Canine tries to leverage all the best practices and pre-existing tools that are already out there.
But agreed, business critical databases probably shouldn't belong on Kubernetes.
Fully agreed - our recommendation is to /not/ run your prod Postgres db yourself, but use one of the many great dedicated options out there - Crunchy Data, Neon, Supabase, or AWS RDS..!
It really depends upon how much data you have. If its enough to just dump then go crazy. If it isn't its a bit more trouble.
Regardless, you're going to have a much easier time developing your app if your datastore access latency is submillisecond rather than tens of milliseconds.
So that extra trouble might be worth it...
You're running at a pretty small scale if running your database locally for sub-milisecond latency is practical. The database solution provided by the DBA team in a data center is going to have about the same latency as RDS or equivalent. Typical intra-datacenter network latency alone is going to be 1-3ms.
Does it run Sentry and I can send logs, metrics, and traces to it, and the queries are fast?
> $50 for a dyno with 1 GB of ram in 2025 is robbery
AWS isn't much better honestly.. $50/month gets you an m7a.medium which is 1 vCPU (not core) and 4GB of RAM. Yes that's more memory but any wonder why AWS is making money hand-over-fist..
This, plus as a backup plan going from Heroku to AWS wouldn't necessarily solve the problem, at least with our infra. When us-east-1 went down this week so did Heroku for us.
Not sure if it's an apples-to-apples comparison with Heroku's $50 Standard-2X dyno, but an Amazon Lightsail instance with 1GB of RAM and 2 vCPUs is $7/month.
m7a doesn't use HyperThreading; 1 vCPU is a full dedicated core.
To compare to Heroku's standard dynos (which are shared hosting) you want the t3a family which is also shared, and much cheaper.
That is assuming you need that 1 core 24/7, you can get 2 core / 8gb for $43, this will most likely fit 90% of workloads (steady traffic with spikes, or 9-5 cadence).
If you reserve that instance you can get it for 40% cheaper, or get 4 cores instead.
Yes it's more expensive than OVH but you also get everything AWS to offer.
Now I know why the teaching platform I use is trying to kick me off...
Every other time I login to the admin site I get a Heroku error.
I am not sure what's there to license. The hard and expensive part is in the labor to keep everything running. You are paying to make DevSecOps Somebody Else's Problem. You are paying for A Solution. You are not paying for software. There are plenty of Heroku clones mentioned in this thread.
Yeah, I choose railway app for my PaaS hosting for this reason
It's insane how much my local shop charges for an oil change, I can do it much cheaper myself!
It's insane how much a restaurant charges for a decent steak, I can do it much cheaper myself!
...!
I know you mean this sarcastically but I actually 100% agree with this particular on the steak point. Especially with beef prices at all time record highs and restaurant inflation being out of control post pandemic. It takes so much of the enjoyment out of things for me if I feel i'm being ripped off left and right.
What you're missing here is that companies happily pay the premium to Heroku because it lets them focus on building the product and generating business rather than wasting precious engineering time managing infra.
By the time the product is a success and reaches a scale where it becomes cost prohibitive, they have enough resources to expand or migrate away anyway.
I suppose for solo devs it might be cheaper to setup a box for fun, but even then, I would argue that not everyone enjoys doing devops and prefers spending their time elsewhere.
Where’s the beef inflation? Local butcher has prime rib fillet $30 AUD/KG cut to your liking.
My understanding is that here in Oz we get access to cheaper beef than the rest of the world...
One also doesn't get shamed by the steak snobs if you have different steak preferences.
Or having to cut the steak with a serrated "steak" knife that tears the meat.
Not the best comment but I agree with the sentiment. I fear far too often, people complain about price when there are competitors/other cheaper options that could be used with a little more effort. If people cared so much then they should just use the alternative.
No one gets hurt if someone else chooses to waste their money on Heroku so why are people complaining? Of course it applies in cases where there aren't a lot of competitors but there are literally hundreds of different of different options for deploying applications and at least a dozen of them are just as reliable and cheaper than Heroku.
I'm hurt because a service I'm using is based on Heroku. I'm on the "unlimited" plan but they have backtracked on that and now say I'm too big for them...
The problem with Heroku's pricing is that it's set high enough that I no longer use it and neither does anyone else I know. I suspect they either pivoted to a different target market than me, which would be inconvenient but I'd be okay with it, or killed off their own growth potential by trying to extract revenue, which I would find sad.
The price value proposition here seems similar to that of a stadium hot dog.
This argument doesn't work with such commoditized software. It's more like comparing an oil change for $100 plus an hour of research and a short drive against a convenient oil change right next door for $2,500.
Nobody is forced to go to the expensive one. If they are still in business then enough people apparently consider it a reasonable deal. You might not, but others do. Whether I'm being downvoted or not.
> If they are still in business then enough people apparently consider it a reasonable deal.
Or they didn't check. A business still existing is pretty weak evidence that the pricing is reasonable.
It's just trendy to bash cloud and praise on-premises in 2025. In a few years that will turn around. Then in another few years it will turn around again.
Indeed, there are levels to the asymmetry though. Oil change might be ~5x cheaper vs the 20-50x claimed for Heroku...
> for an oil change, I can do it much cheaper myself
Really? I mean oil changes are pretty cheap. You can get an oil change at walmart for like 40 bucks.
And you get the stripped out bolt hole for free too.
The cloud has made people forget how far you can get with a single machine.
Hosting staging envs in pricey cloud envs seems crazy to me but I understand why you would want to because modern clouds can have a lot of moving parts.
Teaching a whole bunch of developers some cloud basics and having a few cloud people around is relatively cheap for quite a while. Plus, having test/staging/prod on similar configurations will help catch mistakes earlier. None of that "localstack runs just fine but it turns out Amazon SES isn't available in region antartica-east-1". Then, eventually, you pay a couple people's wages extra in cloud bills, and leaving the cloud becomes profitable.
Cloud isn't worth it until suddenly it is because you can't deploy your own servers fast enough, and then it's worth it until it exceeds the price of a solid infrastructure team and hardware. There's a curve to how much you're saving by throwing everything in the cloud.
Deploying to your private cloud requires basically the same skills. Containers, k8s or whatnot, S3, etc. Operating a large DB on bare metal is different from using a managed DB like Aurora, bit for developers, the difference is hardly visible.
RDS/managed database is extremely nice I will admit, otherwise I agree. Similarly s3, if you're going to do object storage, then running minio or whatever locally is probably not cheaper overall than R2 or similar.
The cloud has made people afraid of linux servers. The markup is essentially just the price business has to pay because of developer insecurity. The irony is that self hosting is relatively simple, and alot of fun. Personally never got the appeal of Heroku, Vercel and similar, because theres nothing better than spinning up a server and setting it up from scratch. Every developer should try it.
> The irony is that self hosting is relatively simple, and alot of fun. Personally never got the appeal of Heroku, Vercel and similar, because theres nothing better than spinning up a server and setting it up from scratch.
It's fun the first time, but becomes an annoying faff when it has to be repeated constantly.
In Heroku, Vercel and similar you git push and you're running. On a linux server you set up the OS, the server authentication, the application itself, the systemctl jobs, the reverse proxy, the code deployment, the ssl key management, the monitoring etc etc.
I still do prefer a linux server due to the flexibility, but the UX could be a lot better.
I use NixOS and a lot of it is in a single file. I just saw some ansible coming by here, and although I have no experience with it, it looked a lot simpler than Nix (for someone from the old Linux world, like me… eventhough Nix is, looking through your eyelashes, just a pile of key/value pairs).
Nix is great, but it still requires some training and expertise.
And the overlap between what Nix does and what the 'cloud' does for you is only partial. (Eg it can still make sense to use Nix in the cloud.)
> It's fun the first time, but becomes an annoying faff when it has to be repeated constantly.
Certainly true, but there are a whole lot of tools to automate those operations so that you aren't doing them constantly.
Mind sharing these tools and what each one does?
Ansible, Salt and Puppet are mostly industry standard. Those tools are commonly referred to as configuration management (systems).
Ansible basically automates the workflow of: log in to X, do step X (if Y is not present). It has broad support for distros and OSes. It's mostly imperative and can be used like a glorified task runner.
Salt let's you mostly declaratively describe the state of a system. It comes with a agent/central host system for distributing this configuration from the central host to the minions (push).
Puppet is also declarative and also comes with an agent/central host system but uses a pull based approach.
Specialized/ exotic options are also available, like mgmt or NixOS.
Thanks, this is very detailed! Could you share some real-world use cases for these tools?
Actually I am looking for tools to automate DevOps and security for self-hosting
"The irony is that self hosting is relatively simple"
cloud is easy until is not, for 90% of us maybe we dont need a multi region with hot and cold storage
for those that need it, its neccesary
And all of that takes, what, a week? As a one time thing?
Takes less than a day, because most of the stuff is scriptable. And for a simple compute node setup at Hetzner (I.e. no bare metal, but just a VM) it takes me less than half an hour.
But if you're that familiar with it, the overpriced turnkey stuff wouldn't look so tempting in the first place.
[flagged]
Can you please edit out swipes, putdowns, name-calling, etc., from your HN posts? It's not what this site is for, and destroys what it is for.
This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.
I dunno, the cloud has mostly made me afraid of the cloud. You can bury yourself in towering complexity so easily on AWS. (The highly managed stuff like Vercel I don't have much experience with, so maybe it's different.)
I will recommend to try GCP or Azure, the complexity is lower there! AWS is great for big corporate that needs a lot of lego pieces to do their own custom setup. At the contrario, GCP and Azure solutions are often more bundled.
> the price business has to pay because of developer insecurity
Is it mostly developer insecurity, or mostly tech leadership insecurity?
It is way more than that though.
It offloads things like - Power Usage - Colo Costs - Networking (a big one) - Storage (SSD wear / HDD pools) - etc
It is a long list but what doesnt allow you do it make trade offs like spending way less but accept downtime if your switch dies etc etc.
For a staging env these are things you might want to do.
"Self hosting" may actually be referring not to hosting your own on-prem hardware, but to renting bare metal in which case the concerns around power usage, networking, etc. are offloaded to the provider.
my take is that it's fun up until there's just enough brittleness and chaos.. too many instance of the same thing but with too many env variables setup by hand and then fuzzy bug starts to pile up
Honestly I think it's the database that makes devs insecure. The stakes are high and you usually want PITR and regular backups even for low traffic apps. Having a "simple" turnkey service for this that can run in any environment (dedicated, VPS, colo, etc.) would be huge.
I think this is partly responsible for the increased popularity of sqlite as a backend. It's super simple and lightstream for recovery isn't that complicated.
Most apps don't need 5 9s, but they do care about losing data. Eliminate the possibility of losing data, without paying tons of $ to also eliminate potential outages, and you'll get a lot of customers.
isn't that just neon db???? but without losing data part
Neon is definitely way more complex than what I'm talking about.
Never got the appeal of having someone else do something for you, and giving them money, in exchange for goods and services? Vercel is easy. You pay them to make it easy. When you're just getting started, you start on easy mode before you jump into the deep end of the pool. Everybody's got a different cup of tea, and some like it hot and others like it cold.
Sure I love having someone else do work for me and paying them for that, the question is if that work is worth a 50x markup.
Flour, salt, and water are exceedingly cheap. I have to imagine the loaf of bread I buy from my baker reflects considerably more than a 50x markup compared to baking my own.
It’s a lot cheaper than me learning to bake as well as he does—not to mention dedicating the time every day to get my daily bread—and I’ll never need bread on the kind of scale that would make it worth my time to do so.
Bread is a great example! You can buy a loaf for $3-4. It is not a 50x markup. Like growing your own veggies, baking bread is for fun, not for economics.
But the cloud is different. None of the financial scale benefits are passed on to you. You save serious money running it in-house. The arguments around scale have no validity for the vast, vast majority of use cases.
Vercel isn't selling bread: they're selling a fancy steak dinner, and yes, you can make steak at home for much less, and if you eat fancy steak dinners at fancy restaurants every night you're going to go broke.
So the key is to understand whether your vendors are selling you bread, or a fancy steak dinner, and to not make the mistake of getting the two confused.
That’s a tremendously clarifying framework, and it makes a lot of sense to me. Thank you.
I wonder, though—at the risk of overextending the metaphor—what if I don’t have a kitchen, but I need the lunch meeting to be fed? Wouldn’t (relatively expensive) catering routinely make sense? And isn’t the difference between having steak catered and having sandwiches catered relatively small compared to the alternative of building out a kitchen?
What if my business is not meaningfully technical: I’ll set up applications to support our primary function, and they might even be essential to the meat of our work. But essential in the same way water and power are: we only notice it when it’s screwed up. Day-to-day, our operational competency is in dispatching vehicles or making sandwiches or something. If we hired somebody with the expertise to maintain things, they’d sit idle—or need a retainer commensurate with what the Vercels and Herokus of the world are charging. We only need to think about the IT stuff when it breaks—and maybe to the extent that, when we expect a spike, we can click one button to have twice as much “application.”
In that case, isn’t it conceivable that it could be worth the premium to buy our way out of managing some portion of the lower levels of the stack?
Please do yourself a flavour and check the price of flour.
Water is cheap, yes. Salt isn't all that cheap, but you only need a little bit.
> [...] and I’ll never need bread on the kind of scale that would make it worth my time to do so.
If you need bread by hand, it's a very small scale affair. Your physique and time couldn't afford you large scale bread making. You'd a big special mixer and a big special oven etc for that. And you'd probably want a temperature and moisture controlled room just for letting your dough rise.
$16 for a 50 pound sack right now
https://postmates.com/store/restaurant-depot-4538-s-sheridan...
I blush to admit that I do from time to time pay $21 for a single sourdough loaf. It’s exquisite, it’s vastly superior to anything I could make myself (or anything I’ve found others doing). So I’m happy to pay the extreme premium to keep the guy in business and maintain my reliable access to it.
It weighs a couple of pounds, though I’m not clear how the water weight factors in to the final weight of a loaf. And I’m sure that flour is fancier than this one. I take your point—I don’t belong in the bread industry :)
Well, in your case, you are mostly paying for the guy's labour, I presume.
(Similarly to how you pay Amazon or Google etc not just for the raw cloud resources, but for the system they provide.)
I grew up in Germany, but now live in Singapore. What's sold as 'good' sourdough bread here would make you fail your baker's training in Germany: huge holes in the dough and other defects. How am I supposed to spread butter over this? And Mischbrot, a mixture of rye and wheat, is almost impossible to find.
So we make our own. The goal is mostly to replicate the everyday bread you can buy in Germany for cheap, not to hit any artisanal highs. (Though they are massively better IMHO than anything sold as artisanal here.)
Interestingly, the German breads we are talking about are mostly factory made. Factory bread can be good, if that's what customers demand.
See https://en.wikipedia.org/wiki/Mischbrot
Going on a slight tangent: with tropical heat and humidity, non-sourdough bread goes stale and moldy almost immediately. Sourdough bread can last for several days or even a week without going moldy in a paper bag on the kitchen counter outside the fridge, depending on how sour you go. If you are willing to toast your bread, going stale during that time isn't much of an issue either.
(Going dry is not much of an issue with any bread here--- sourdough or not, because it's so humid.)
> Salt isn't all that cheap
Wait, what? Salt is literally one of the cheapest of all materials per kilogram that exists in all contexts, including non-food contexts. The cost is almost purely transportation from the point of production. High quality salt is well under a dollar a pound. I am currently using salt that I bought 500g for 0.29 euro. You can get similar in the US (slightly more expensive).
This was a meme among chemical engineers. Some people complain in reviews on Amazon that the salt they buy is cut with other chemicals that make it less salty. The reality is that there is literally nothing you could cut it with that is cheaper than salt.
Well, salt is more expensive than water.
But sure, it's cheap otherwise. Point granted.
One way or another, salt is not a major driver of cost in bread, because there's relatively little salt in bread. (If there's 1kg of flour, you might have 20g of salt.)
bread ingreadient is cheap but the equipment that you need to do baking is not
also skills, some people just bake better than others
> bread ingreadient is cheap but the equipment that you need to do baking is not
It's actually not too bad, if look at the capital cost of a bread factory amortised over each loaf of bread.
The equipment is comparatively more expensive for a home baker who only bakes perhaps two loafs a week.
Yeah, but then we're just haggling. If you know how to change the belt on your car and already have the tools, it's different from when you're stranded with no tools and no garage and no belt.
If you're a mechanic you're supposed to know how to change the belt on your car. It would be insane if you write code and work with computers for a living but you dont know how to set up a web server.
I am pretty sure I know much more about code than you do, and at the same time you probably know much more about web servers and sysadmin than I do. I don't mind if it stays like that. And I am saying this having programmed my own web server in Java about 25 years ago.
A whole lot of coding and working with computers doesn't involve setting up a web server. It's not insane at all.
It would be insane if you write code and work with computers for a living but you dont know how to write a game engine in assembly.
Hum... Writing a game engine is a high-difficulty task that should be available to any reasonably good software developer with a few months to study for it. Making it in assembly is a sure way to take 10 times the time of another low level language like C, but shouldn't be an impossibility either.
Configuring a web server is a low-difficulty task that should be available for any good software developer with 3 days to study for it. It's absurd for a developer to need to configure a web server, but insist on paying a large rent and cede control to some 3rd party instead of just doing it.
Installing a web server is in no way the same as writing a game engine, let alone in assembly, and I think you know that.
Fully replicating prod is helpful. Saves time since deployment is similar and does a better test of what prod will be.
Completely agree. It’s not a staging server if it’s hosted on a different platform.
I think OP is using these less as staging and more as dev environments for individual developers. That seems like a great use of a single server to me.
I'd still like a staging + prod, but keeping the dev environments on a separate beefy server seems smart.
I've been using a development server for about 9 years and the best thing I ever did was move to a machine with a low-power Xeon D for a time. It made development painful enough that I quickly fixed the performance issues I was able to overlook on more powerful hardware. I recommend it, even just as an exercise.
For similar reasons, in the Google office I worked in you had the option to connect to a really intentionally crappy wifi that was simulating a 2G connection.
The "platform" software runs on is just other software. If your prod environment is managed kubernetes then you don't lose much if your staging environment is self-hosted kubernetes.
Load balancers, IAM roles, kubernetes upgrades, postgres upgrades, security settings, DNS records, http routes... there's a lot that can go wrong and makes it useful to have a staging environment.
The cloud was a good deal in 2006 when the smallest aws machine was about the size of a ok dev desktop and took over two years of renting to justify buying the physical machine outright.
Today the smallest, and even large, aws machines are a joke, comparable to a mobile phone from 15 years ago to a terrible laptop today, and take about three to six months to in rent as buying the hardware outright.
If you're on the cloud without getting 75% discount you will save money and headcount by doing everything on prem.
This could be the premise for a fun project based infra learning site.
You get X resources in the cloud and know that a certain request/load profile will run against it. You have to configure things to handle that load, and are scored against other people.
All it means is that the cloud doesn't work like a power socket, which was the whole point of it.
Things like Lambda do fit in this model, but they are too inefficient to model every workload.
Amazon lacks vision.
also how far you can get with a single machine has changed massively in the past 15 years. 15 years ago a (really beefy) single machine meant 8 cores with 256GB ram and a couple TB of storage. Now a single machine can be 256 cores on 8TB of ram and a PB of storage.
Exactly, and the performance of consumer tech is wildly faster. Eg, a Ryzen 5825U mini pc with 16GB memory is ~$250USD with 512GB nvme. That thing will outperform of 14 core Xeon from ~2016 on multicore workloads and absolutely thrash it in single thread. Yes lack of ECC is not good for any serious workload, but great for lower environments/testing/prototyping, and it sips power at ~50W full tilt.
Curiously, RAM sizes haven't gone up much for consumer tech.
As an example: my Macbook Pro from 2015 had 16 GiB RAM, and that's what my MacBook Air from 2025 also has.
Ehhh Macbook Pros can be configured with up to 128 now, iirc 16 was the max back then. But I guess the baseline hasn't moved as much.
Yes, there has been some movement. But even an 8 fold increase (128/16) over a decade is nothing compared to what we used to see in the past.
Oh, and the new machine has unified RAM. The old machine had a bit of extra RAM in the GPU that I'm not counting here.
As far as I can tell, the new RAM is a lot faster. That counts for something. And presumably also uses less power.
The cloud has made people forget that the internet is decentralized.
The weird thing is the relationship between developer costs and operations costs. For startups that pay salaries, $3000 a month is a pittance!*
* The big caveat: If you don't incur the exact same devops costs that would have happened with a linux instance.
Many tools (containers in particular) have cropped up that have made things like quick, redundant deployment pretty straightforward and cheap.
The best part is when you start with a $3000/month cloud bill during development and finally realize that hosting the production instance this way would actually cost $300k/month, but now it's too late to change it quickly.
You put your staging env in the same (kind of) place you put your prod system because you need to replicate your prod environment as faithfully as possible. You also then get to re-use your deployment code.
Cloud often has everyone thinking it's still 2008.
With some obvious exceptions there isnt much you cant run on a 200 Core machine wrt web services.
you can literally buy a used dell desktop that matches the spec for hetzner (8 core, 32 gigs of ram) for under 500 USD. Why wouldnt you just do that?
As cloud marches on it continues to seem like a grift.
Do you plan on keeping it in your home? At that point I'd be worried about ISP networking or power guarantees unless you plan on upgrading to business rates for both. If you mean colo, well, if you're sure you'll be using it in X years, it's worth it, but the flexibility of month-to-month might be preferable.
Because that used desktop is subject to power outages, internet outages, the cleaners unplugging it, etc. Datacenters have redundancy on everything.
Also you still have to pay for the electricity on that thing.
The cloud costs includes everything.
And you'll need some $100/month to colocate that thing, so you are better spending some more and buying a reasonable server that uses only 1U.
> And while Hetzner's price-performance is exceptional, its limited presence in the US was a consideration; for this staging workload, it wasn't an issue, but it's a factor for production services targeting US users.
What is this referring to? Concerns about capacity if you need to scale up quickly? Or just "political"/marketing considerations about people not being used to being served by a Hetzner server?
From memory there aren't many dedicated servers available in the US.
Latency to the US from Europe?
Heya, Disco is the open source PaaS I've been working on with my friend Antoine Leclair.
Lots of conversation & discussion about self-hosting / cloud exits these days (pros, cons, etc.) Happy to engage :-)
Cheers!
Just to be aware when you say "Even with all 6 environments and other projects running, the server's resource usage remained low. The average CPU load stayed under 10%, and memory usage sat at just ~14 GB of the available 32 GB."
The load average in htop is actually per CPU core. So if you have 8 CPU cores like in your screenshot, a load average of 0.1 is actually 1.25% (10% / 8) of total CPU capacity - even better :).
Cool blog! I've been having so much success with this type of pattern!
Sharp eye! Thanks. Fixed
Thanks for sharing. I have an app I'm working on and this seems perfect for it.
what does this service offer over an established tool like Coolify? currently hosting most of my services on a cheap Hetzner VPS so i'm interested what Disco has to offer
Coolify and other self-hosting options such as Kamal are great. We're all in the same boat!
I'd say the main differences is that we 1) we offer a more streamlined CLI and UI rather than offering extensive app/installation options 2) have an api-key based system that lets team members collaborate without having to manage ssh access/keys.
Generally speaking, I'd say our approach and tooling/UX tends to be more functional/pragmatic (like Heroku) than one with every possible option.
Or Dokku, Dokploy or CapRover
Would be great to have a comparison on the main page of Disco
There's quite a few now. Coolify, Dokku, CapRover, Kamal.
https://devpu.sh/ is another alternative, it has a nice UI built with Hypermedia (HTMX).
I am building https://github.com/openrundev/openrun/. Main difference is that OpenRun has a declarative interface, no need for manual CLI commands or UI operations to manage apps. Another difference is that OpenRun is implemented as a proxy, it does not depend on Traefik/Nginx etc. This allows OpenRun to implement features like scaling down to zero, RBAC access control for app access, audit logs etc.
Downside with OpenRun is that is does not plan to support deploying pre-packaged apps, no Docker compose support. Streamlit/Gradio/FastHTML/Shiny/NiceGUI apps for teams are the target use case. Coolify has the best support and catalog of pre-packaged apps.
There's also Canine and Kubero
https://news.ycombinator.com/item?id=44292103
https://news.ycombinator.com/item?id=44873057
We've had a similar experience at Hack Club, the nonprofit I run that helps high schoolers get into coding and electronics.
We used to be on Heroku and the cost wasn't just the high monthly bill - it was asking "is this little utility app I just wrote really worth paying $15/month to host?" before working on it.
This year we moved to a self-hosted setup on Coolify and have about 300 services running on a single server for $300/month on Hetzner. For the most part, it's been great and let us ship a lot more code!
My biggest realization is that for an organization like us, we really only need 99% uptime on most of our services (not 99.99%). Most developer tools are around helping you reach 99.99% uptime. When you realize you only need 99%, the world opens up.
Disco looks really cool and I'm excited to check it out!
Cheers, let me know if you do / hop onto our Discord for any questions.
We know of two similar cases: a bootcamp/dev school in Puerto Rico that lets its students deploy all of their final projects to a single VPS, and a Raspberry Pi that we've set up at the Recurse Center [0] which is used to host (double checking now) ~75 web projects. On a single Pi!
[0] https://www.recurse.com/
Can I ask which hetzner instance you use?
And if you really needed 99.99%, you would be wise to avoid the hyperscalers: see AWS' recent multi-hour long outage.
300 services?? What do they all do?
Tons of little Slack bots and apps and stuff! It’s a vibrant community and people are always making cool little tools
Oh hey, you’re not getting booted after all!
(Just remember to take regular backups now, so that when this 5 year deal expires you don’t get into the same situation again :-)
Heroku's pricing is wild. About a decade ago I just about fell out of my chair when I found out the startup I was at was burning upwards of $10k/mo just to generate QR codes (made out of html tables so that they would reliably display in emails). It worked out to something like $0.15/code
The lead who wrote it had never even profiled code before, after some changes we cut it down to ~$0.01/per, but that's still insane.
What in the world?? Surely there must be something more than "generate a HTML page with 500 elements". Any edge cloud hosting lets you do that for free.
The article's title seems inaccurate - as far as I understood there never was a $3000/mo bill; there was a $500/(mo,instance) staging setup that has been rightly optimized to $55/mo before running six instances.
> Critically, all staging environments would share a single "good enough" Postgres instance directly on the server, eliminating the need for expensive managed database add-ons that, on Heroku, often cost more than the dynos themselves.
Heroku also has cheaper managed database add-ons, why not use something like that for staging? The move to self hosting might still make sense, my point is that perhaps the original staging costs of $500/mo could have been lower from the start.
I answered elsewhere with the list of dynos, but the short version is that between the list of services that each deployment required, and the size of the database, it truly (and unfortunately) did end up costing $500 per staging.
The situation is interesting, and self-hosting is indeed a very nice solution often. However, I wanted to comment on the article itself - it seems to be very heavily AI-edited. Anyone who has spent time with LLMs will easily see it. But even that's not the issue; the main issue is that the article is basically a marketing piece.
For example, the "Bridging the Gap: Why Not Just Docker Compose?" section is a 1:1 copy of the points in the "Powerful simplicity" on the landing page - https://disco.cloud/
And this blog post is the (only) case study that they showcase on their main page.
You're absolutely right! Here are some three points why:
- ...
I'm kidding :-)
Our library is open source, and we're very happy and proud that Idealist is using us to save a bit of cash. Is it marketing if you're proud of your work? :-) Cheers
There's a tone issue.
Marketing should be marketing and clearly so. Tech blogs are about sharing information with the community (Netflix Tech blog is a good example) NOT selling something. Marketing masquerading as a tech blog is offputting to a lot of people. People don't like being fooled with embedded advertising and putting ad copy into such pieces is at best annoying.
https://netflixtechblog.com/
Nah, people are stupid. Including me. It's all marketing. Netflix's tech blog is marketing to engineers to want to go work there and to promote their product. If you want to see things though the lense that all advertising is bad, you'll make your life miserable because it's all advertising in one way or another.
Is it? Was this, your HN comment, marketing?
Mine isn't, unless you make the meaning of that term so broad that it essentially lost any meaningful meaning. (Intentionally meta.)
That's the problem with framing everything that way. This HN comment is marketing for my brand, my username, I sell t-shirts on my website! That's not why I'm commenting here, but the term is that broad because we're using it colloquially. It's a human psychology thing that I get entrapped into too. Calling it out doesn't make it not work. When you use the lens of marketing, your comment is marketing that you are not marketing, which is a specific category and advertising profile to be filed away in a database somewhere, if we go to the theoretical extremes.
What you've done is taken something I've written, redefined a core term in a way I obviously didn't mean, and then told me I'm wrong because of your redefinition.
When you put it that way, you make me sound like an ass. Is that how I'm coming across? What did I redefine? I'm refuting the fairytale where some content is pure and untainted by marketing. Netflix writes posts that make engineers want to work there and people think, "hey, that's smart!" That’s marketing.
I think a big difference is when someone is pretending to be all about something else and tries to sneakily market to you. One thing is getting a free water bottle with an ad, another thing is when someone is inviting you to a "party" with free food and drinks and it turns out to be a MLM "party".
Netflix is giving away free water bottles (I hate them, but I use their fast.com super often to test the speeds), another is pretending to be a blog post, but actually being an ad (if that was the case here). You just feel lied to. You cannot take anything seriously you read there, as it will probably be super biased and you cannot get your time back now.
Maybe not an ass, that's too strong, but it's a common online pattern where someone transforms your point into an entirely different meaning and then disagrees with that transformation. It's annoying.
I'm complaining about thinly veiled ad copy wearing the mask of shared technical notes. This is seen as a bad faith effort by the publisher of such notes and a dirty trick played on the reader. Advertising should announce itself for what it is.
I'm very clearly making a distinction, I like A, I don't like B.
You're taking that, saying I must actually hate both A and B, and by the way C through Z because nobody is 111% pure of heart and everybody must have at least some motivation for doing something and nobody is entirely altruistic.... which is just this crazy extreme that it's clear I don't believe at all.
I like the incentive structure that leads Netflix to produce objectively high quality articles sharing with the community in a way that really seems to be entirely untainted by the motivation.
Ad copy in tech notes does seem to taint the motivation and quality of them, it can be innocent but it doesn't seem like it and is generally irritating to a lot of people.
Dislike of a certain kind of advertising doesn't mean I'm sitting around miserable because nobody is truly altruistic as you suggest, and that the issue. My lines of thinking aren't taken to a silly extreme. A lot of disagreements these days are people reinterpreting their opposition as exclusively extremist and that's a problem.
You keep saying it's clear when it isn't. We don't know what's going on in your mind. Did you know there are people out there that won't eat anything that came from any animal products? That's crazy extreme! But there are tons of vegans out there. So what's seems extreme to one person is someone else's normal, and someone else's normal is extreme.
You say you like A and don't like B. You don't like B because it has X in it. But A also has X in it. So why do you like A but not B? It's not logically consistent. We disagree on how much X is in A. You want X to be clearly marked with red tape. It's not clear how reasonable and feasible that is or isn't. I'm saying if you're looking for X, you're going to find trace amounts of it everywhere once you start looking for it. X isn't some previously unheard of chemical that's gonna give you cancer or leaky gut though, it's other people making money. It's been chosen for us, that money is how the world works. It's not how I would do it, but I'm not in charge of the world, so it's a moot point. Everyone is weird about money in their own special way. I am no exception. What sticks in my craw is when people have problems with other people making money. How they make money is material. I'm not okay with making money off of sex trafficking or CSAM, for example, but advertising a product with an interesting bit of writing beforehand isn't that. So on the spectrum of your kid's painting that they made for you in school with crayon that were ethically sourced and drew on recycled paper, to the in your face red plastic Coca-Cola banner wrapped around the side of a bus that's gonna be fed to whales to choke and die on, where this particular blog post lies is for you to determine for yourself. Where I'm really getting at is that requiring X to be at a certain level has the unintended consequence that only big corporations with giant bags of money can create content that passes this purity test of yours, is, if we do some extrapolating, self-defeating.
I'm not sure you're functionally literate and you're beginning to ramble. So yes you're coming off as an asshole and just shouldn't respond like this. When I glance at your reply and you're bringing up sex trafficking somehow... yeah no thanks. This is the kind of reply definitely not worth engaging in.
> But even that's not the issue; the main issue is that the article is basically a marketing piece.
Why is that an issue? Is it forbidden by HN guidelines? Or would you like all marketing to be marked as such? Which articles aren't marketing, one way or another?
It's funny they have this marketing blog post based on competing on price yet don't disclose any of their pricing on their site only a schedule a meeting which is just about the biggest RED FLAG on pricing there is.
Our library is open source, the price is 0!! :-) Haha
We're actually mostly talking to people (that "schedule a meeting") to see how we can help them migrate their stuff away (from Heroku, Vercel, etc.)
But we're not sure of the pricing model yet - probably Entreprise features like Gitlab does, while remaining open source. It's a tough(er) balance than running a hosted service where you can "just" (over)charge people.
heh my first instinct was to go see how they're making money. Guess that's coming soon
This isn't the first time an article is also marketing. Besides, what is wrong with marketing something via a use case article? This is a fairly tame example of it and I found it an interesting and useful read, knowing full well it was also marketing.
I guess I'm not quite understanding why you need six staging servers provisioned at $500 a pop? And if you need that because you have a large team...what percentage of your engineering spend is $3000 vs $100k+/yr salaries?
Especially when I got look at the site in question (idealist.org) and it seems to be a pretty boring job board product.
6 staging servers: main, dev, and any branches that you want to let other (non tech people) QA.
As for the staging servers, for each deployment, it was a mix of Performance-M dynos, multiple Standard dynos, RabbitMQ, a database large enough, etc. - it adds up quickly.
Finally, Idealist serves ~100k users per day - behind the product is a lot of boring tech that makes it reliable & fast. :-)
you're telling me 100k people are looking for jobs in non-profits on your specific site daily? Are you sure you don't have a bot/scraper problem?
Honestly, 100k/day sounds low for Idealist. It's the go-to place for volunteer and non-profit work, which is quite a considerable market.
From what I read, they're using them as dev environments. Like running many services at once for a single developer to tie into. That's why they wanted multiple ones, one for each dev.
$3000/month = 36k/year
That's more than 1/3 of the cost of a developer there.
That will save you some week of a person's work to set things up and half-a-day every couple of months to keep it running. Rounding way up.
Yes, everyone forget to compute man-days in the cost calculation
This thinking definitely drives enterprise products, and is exactly what makes it hard for small companies. "You can pay a lot simply because you clearly can afford to" doesn't lead to great products, even if it often does lead to profitable companies.
Hosting staging on a fundamentally different architecture and resources than prod (and dev I think) is a disater waiting to happen.
Unless they plan to move prod and dev as well, and using staging now as a test platform.
Once few problems glitch when moving to prod, they may no longer think they are saving much money.
What’s the best alternative to heroku today for someone that doesn’t want to do any sysadmin and just dump a Django site and database somewhere?
Any VPS you fancy that fits the price/performance/location/support you want, then point Coolify/Dokploy/whatever at it.
I did just this using Coolify, Mythic Beasts running Django & Postgres the other month from Google App Engine. Hilariously easy, even with my extremely rusty skills.
It is worth learning to use Docker Swarm. Deployments are as simple as pushing a new container to your registry and running one command. I built a free CLI tool rove.dev that simplifies provisioning and does service diffing.
Either that or use a PaaS that deploys to VMs. Can't make recommendations here but you could start by looking at Semaphore, Dokku, Dokploy.
I'm looking for simple k8s alternatives like docker swarm and kamal. Rove looks really interesting.
https://render.com/ is probably the closest, I'm really enjoying using them. Workflow is the same as heroku, but cheaper, no nightly restarts, supports new python versions etc..
Whats wrong with just spinning up a server on hetzner. At most you need to setup nginx and a systemctl service
Oracle has free VPS if your requirements aren't huge. Hobby project etc.
PythonAnywhere: https://www.pythonanywhere.com/
Any place you can get a vps from.
3000 to 55? Par for the course.
$55 server
$550 aws server
$3000 aws based paas server
Cool project!
From looking at your docs, it appears like using and connecting GitHub is a necessary prerequisite for using Disco. Is that correct? Can disco also deploy an existing Docker image in a registry of my choosing without a build step? (Something like this with Kamal: `kamal --skip-push --version latest`)
Correct, GitHub is necessary at this point to deploy code.
However, yes, you can ask Disco to fetch an existing Docker image (we use that to self-host RabbitMQ). An example of deploying Meilisearch's image is here [0] with the tutorial here [1].
Do you typically build your Docker images and push them to a registry? Curious to learn more about your deployment process.
[0] https://github.com/letsdiscodev/sample-meilisearch/blob/main...
[1] https://disco.cloud/docs/deployment-guides/meilisearch
Yes, I try to keep my CI pipelines somewhat platform-agnostic so even though I'm mostly using GitHub, my workflow is typically to first build a Docker image and push it to a registry, then use Kamal to deploy that image.
Doesn't staging need to be a (downsized) replica of prod, infra wise to give confidence that changes will be stable and working in prod?
Genuine question.
Yes, I'm just as curious as you on _why_ does a staging setup needs the same amount of resources as prod.
All of my staging setups are on a ~$15 Hetzner server, with a GitHub Action to `docker compose build && docker compose up -d` remotely, with an Apache service with a wildcard certificate and dynamic host names. We have 3..n staging setups, with each PR spinning up a new staging site just for that PR.
It's been working with us for years, for a team of 10 developers.
It is clear that Heroku is not interested in reducing their prices. But I don’t think this is a Heroku problem. Vercel is also the same, which makes me think there is a fundamental issue with the PaaS business model that stops it from competing on price while the commoditised part their business (data centers) are always reducing their prices.
The challenge I always face with homebrew PaaS solutions is that you always end up moving from managing your app to managing your PaaS.
This might not be true right now but as complexity of your app grows it’s almost always the eventual outcome.
Heroku and Vercel don’t ever have any intention of competing on price
They offer convenience
It’s not just convenience. This single box is a single point of failure.
On the other hand for $3k/month you can just hire someone to do it for you (part time at least, but I doubt it's remotely a full-time job).
Having been in the industry for 20 years, I can remember we were processing high loads with... Algorithms. It wasn't a cloud cost saving initiative back then, but a necessity if you had scale, you could just not throw money at scaling. Feels like we shifted optimization from algorithms to cloud costs savings...
Title seems slightly exaggerated since by my reading there was no actual $3000 / month bill? Still a great use-case
This seems like a good idea to have plentiful dev environments and avoid a bad pricing model. If your production instance is still on Heroku, you might still want a staging environment on Heroku since a Hetzner server and your production instance might have subtle differences.
It is hilarious, don't get me wrong - I really appreciate more people moving away from these "Hi-Tech" deployment styles and cloud services and the rest, but it is like rediscovering hot water.
> Bridging the Gap: Why Not Just Docker Compose?
The draw of a docker-compose-like interface for deployment is so alluring that I have spent the last year or so working on a tool called Defang that takes a compose file and deploys it to the cloud. We don't support Hetzner (yet), but we do support AWS, GCP, and DO. We provision networking, IAM, compute, database, secrets, etc in your cloud account, so you maintain full control, but you also get the ergonomics of compose.
If you are on a PaaS and you want to reduce cost without losing ergonomics and scalability, it might be interesting.
Quite sad to see devs nowadays has lost abilities to self-host. I know it can be overwhelming with Linux, networking, db, backup, hardware load.... However, it's not rocket science!
Cool to hear on the savings. But now the team has to maintain two different deployment models, so you have account for the ongoing cost that your team now has to own and maintain two different processes of deployment (prod & staging).
The key element here is the need to continuously exercise both processes (Heroku + your staging server), to work out both processes & maintain familiarity on both.
Depending on the amount of staff involved in the above, it might eclipse the compute savings, but only OP knows those details. I'm sure they are a smart bunch.
Congrats Greg & Antoine! disco.cloud is really needed, I hope you guys get the visibility you deserve and some momentum from the community!
Can anyone comment on how Disco compares to Dokku?
Two question.
What's in it for Disco ?
What's the pricing ?
How many work hours per month does keeping this thing stable take.
If it takes over 15 Heroku is cheaper.
Hosting with bare metal is still expensive, you pay in other ways.
Hetzner cloud has instances in US, which could work since they don't need the stability of dedicated for staging/dev.
I love the convenience of Heroku but hate their predatory pricing. Who's fixing this?
Fly was supposed to fix Heroku but my bill more than doubled since they changed how they charge for shared CPUs.
https://community.fly.io/t/cpu-quotas-update/23473
I work at Render (render.com); we have over 4 million developers on the platform, and we've migrated many large (and small) Heroku customers over because of our more modern capabilities and scalable pricing.
https://render.com/docs/migrate-from-heroku
You have your range of options - it depends on the size of your team, the kind of apps you're running, etc. The answer can be anything from an "ssh script" to AWS (or K8S), etc.
If you're running something that's too expensive for your taste and can share more information, happy to brainstorm some options.
AWS Elastic Beanstalk gives you more or less the same experience but charges you normal EC2 instance pricing. It's as cheap as PaaS gets.
I was looking on Hetzner after that recent article and their server marketplace has $34/month server that had something like an Intel Core i7 with 64GB RAM and 2x512GB SSDs. Compare that to EC2 pricing.
Single server is very cheap for hobbyist.
Just something to consider if you are in a professional environment before switching your entire infra: maintenance cost is expensive. I strongly suggest to throw man-days in your cost calculation.
To prevent security vulnerabilities, the team will need to write some playbooks to auto-update regularly your machine, hoping for no breaking changes. Or instead write a pipeline for immutable OS images updates. And it often mean testing on an additional canary VM first.
Scaling up the VM from a compute point of view is not that straightforward as well, and will require depending of the provider either downtime or to migrate the entire deployments to a new instance.
Scaling from a disk size point of view, you will need to play with filesystems.
And depending on the setup you are using, you might have to manage lets encrypt, authentication and authorization, secrets vaults, etc (here at least Disco manages the SSL certs for you)
If you are large enough, you will need an ops team to manage allowing your developers to write terraform and manage AWS costs already.
If you are small enough, you are not going to be truly affected by downtime. If you are just a little bigger, a single hot spare is going to be sufficient.
The place where you get dinged is heavy growth in personnel and bandwidth. You end up needing to solve CPU bound activities quicker because it hurts the whole system. You need to start thinking about sticky round robin load balancing and other fun pieces.
This is where the cloud can allow you to trade money for velocity. Eventually, though, you will need to pay up.
That said, the average SaaS can go a long way with a single server per product.
> I strongly suggest to throw man-days in your cost calculation.
Only if those man-days actually incur a marginal cost. If it's just employees you already have spending their time on things, then it's not worth factoring in because it's a cost you pay regardless.
Amazing to see this article in 2025. Feel like it's 2015 all over again!
Heroku is cool in that it helps you get running and autoscaled, but it would be much cheaper for anyone with traffic to just get a dedicated box
Good improvement, but 50x overpayment until a rethink is also pretty wild.
From having talked to many folks, migrations are psychologically very, very, very very hard.
At least, the "fear" factor (will the new system work? what bugs will it introduce? how much time will I spend, etc.) pushes a lot of folks to accept a very big price differential aka known knowns versus unknowns...
It's understandable really. It's just that once you've migrated, you almost definitely never want to go back :-)
On software stack I definitely get the fear.
...but this CX33 "server" being discussed - is a 6 bucks a month VPS [0]
Normally you build a prototype on laptop and move it out to fat hardware when it outgrows that. Here they started with 3k infra and then later realized it runs on toaster. Completely back to front.
Maybe they just never iterated on a local version and nobody developed an intuition for requirements. Switched straight to iterating on a nebulous cloud where you can't tell how much horsepower is behind the cloudfunctions etc.
Presumably there is a perfectly reasonably explanation and it's just not spelled out, it just seems weird based on given info
[0] https://www.hetzner.com/cloud
Small correction - the blog article talks about a CCX33 (go to "Dedicated General Purpose" [0]) with 32 Gb of RAM, not a "Shared" CX33.
[0 ]https://www.hetzner.com/cloud
I like Heroku for my needs but have noticed oddities in the pricing that can make a small app cost much more than a differently arranged large app.
Nice! Way to go for non-prod environments. (For prod you'd need some redundancy at least.)
Quick question: how long would it take to provision and set up another server if this one dies?
The longest is to adapt your app to a Dockerfile-based deployment, if it isn't already containerized. We have examples for most languages - for Flask, for example, the whole file is 10 lines long [0]
But to provision a new server, as these are "stateless" (per 12 Factor) servers, it's just 1) get a VPS 2) install Docker+Disco using our curl|sh install script 3) authorize github 4) deploy a "project" (what we call an app), setting the env vars.
All in all ~10 minutes for a new machine.
[0] https://github.com/gregsadetsky/example-flask-site/blob/main...
One thing however to note, is that by having a different non-prod and prod environment, it will be possible to test only the application, and not the infra.
Which means, that if they want to test what it will look like running in cloud for prod, they are going to either need a pre-prod environment or go yolo
I bet less time than it takes AWS to recover from a significant event. And I bet it happens less often too.
Ideally these things should go in an Ansible playbook or whatever people are using these days to manage their pets.
I mean the availability of the hardware. It's a dedicated server, AFAICT.
At $55/mo, they could buy another server in another state.
I don't mean to hate but i find it incredibly alarming that i'm lately seeing all the seemingly seniorly positioned people writing articles about how they just realized that you can actually just buy a vps, setup a deployment workflow and write a revealing blog about "drastically cutting costs".
It's like juniors who did not recieve a proper training/education got hired into companies where someone told them to go serverless on some heroku or vercel, or use some incredibly expensive aws service because that's a "modern correct way" to do it, except now they were a developer for long enough to get a "senior" title in their job title now are in positions of actually modelling this architecture themselves
Heroku's pricing model made me shy away even from using them for small stuff. Why get comfortable on a stack that disincentivizes success?
It is absolutely nuts to me that this machine:
AMD Ryzen™ 7 3700X CPU 8 cores / 16 threads @ 3.6 GHz Generation: Matisse (Zen2) RAM 64 GB DDR4 ECC
Drives 4 x 22 TB HDD 2 x 1 TB SSD
is only 104 euros a month on Hetzner.
The STORAGE alone would cost $1624 a month in most clouds
It sounds more like poor choices. 6 staging environments sounds a bit overkill.
If you can fit them all on a 4 cpu / 32gb machine, you can easily forgo them and run the stack locally on a dev machine. IME staging environments are generally snowflakes that are hard to stand up (no automation).
$500/month each is a gross overpayment.
> you can easily forgo them and run the stack locally
Not if you're running with external resources of specific type, or want to share the ongoing work with others. Or need to setup 6 different projects with 3 different databases at the same time. It really depends on your setup and way of working. Sometimes you can do local staging easily, sometimes it's going to be a lot of pain.
Very cool project. Is there an overview of the architecture? Perhaps a diagram or some drawing?
I mean something like a list of moving parts so I can understand how it works. Perhaps something like this:
https://caprover.com/#:~:text=CapRover%20Architecture%20at%2...
Although LLM generated, https://deepwiki.com/letsdiscodev/disco-daemon is pretty impressive and has some arch diagrams. But I fully agree, we should have that on the site.
Once everything is installed/running, a very tldr diagram would be:
GitHub (webhook on git push) -> Docker swarm running Caddy -> Disco Daemon REST API which will ask Docker to build the image, and then does a blue-green zero-time deployment swap
But yeah, a clearer/better diagram would be great. Thanks for the push!
Fair enough. That LLM generated doc was surprisingly educational.
And your description is a great macro view of it. Thanks!
i'd be interested what the load is like on that CCX33 server - i've got a lower-spec VPS from Hetzner and even from there I'm only using about 25%-30% CPU/RAM with a moderate load
From the article:
> Even with all 6 environments and other projects running, the server's resource usage remained low. The average CPU load stayed under 10%, and memory usage sat at just ~14 GB of the available 32 GB.
Do they really need a full mirror of production?
Every time I've worked somewhere without one, we've wanted it and wasted more developer hours than the cost of having it trying to reproduce issues while working around the differences in the environments.
Why people discover it only today? I remember making comments about it years ago.
I even shown one customer that their elaborate cluster costing £10k a month could run on a £10 vps faster and with less headache (they set it up for "big data" thinking 50GB is massive. There was no expectation of the database growing substantially beyond that).
Their response? Investors said it must run on the cloud, because they don't want to lose their money if homegrown setup goes down.
So there is that.
Yes. The "cloud" is sold on grounds of "efficiency" but really it's just an ideological decision to increase outsourcing and reduce the employees' bargaining power.
(Except this backfires, because a service running on a RHEL or Debian machine might go on for 5-10 years untouched without any particular issue, security aside, while anything relying on kubernetes or the hyperscaler's million little services needs to be tweaked every 6 months and re-engineered every few years or it will completely stop working.)
Uhh a “multi-gigabyte Postgres database” is not “substantial”
The kind of headline that is worth learning more about.
Dokku can be an option if needed to maintain heroku endpoints.
Might as well ask this: anyone know any server provides that are like half the cost of hetzner. I know that's asking a lot but still.
Netcup is cheaper than Hetzner, but it doesn’t have some of the other features and reviews are mixed.
I think that https://lowendbox.com/ might be a good place to start looking for that
noo UI???? I mean we already have coolify and dokploy that doing the same for more
but glad we have new product offering for this
We do have a UI, we're just so behind on the documentation, it's not even funny ha.
If you setup a server with the curl|sh install script on the homepage, you'll get a url at the end that directs you there. And you can use the CLI too of course.
But yeah, thanks for the reminder!
Any Elixir/Gleam/Erlang (distributed) support?
I don't know! I do see a Docker image for Elixir, so I'm pretty sure that would work. But the distributed aspect is harder to answer right now.
How do you typically deploy this?
Render (because it's on k8s) and Fly handle distributed erlang out of the box, so I don't have to think much about it. Heroku does not.
oh good point
I love these types of stories. Please submit more of this type.
Bring back sanity to tech.
> The Real Insight: Staging Became a Free Commodity
Not free, it became a productivity boost.
You now have a $35k annual budget for the maintenance, other overhead, and lost productivity. What do you spend it on?
> The team also took on responsibility for server monitoring, security updates, and handling any infrastructure issues themselves
For a place that’s paying devs $150k a year that might math out. It absolutely does not for places paying devs $250k+ a year.
One of the great frustrations of my mid career is how often people tried to bargain for more speed by throwing developers at my already late project when what would have actually helped almost immediately was more hardware and tooling. But that didn’t build my boss’ or his bosses’ empires. Don’t give me a $150k employee to train, give me $30k in servers.
Absolutely no surprise at all when devs were complicit with Cloud migrations because now you could ask forgiveness instead of permission for more hardware.