These sorts of core-density increases are how I win cloud debates in an org.
* Identify the workloads that haven't scaled in a year. Your ERPs, your HRIS, your dev/stage/test environments, DBs, Microsoft estate, core infrastructure, etc. (EDIT, from zbentley: also identify any cross-system processing where data will transfer from the cloud back to your private estate to be excluded, so you don't get murdered with egress charges)
* Run the cost analysis of reserved instances in AWS/Azure/GCP for those workloads over three years
* Do the same for one of these high-core "pizza boxes", but amortized over seven years
* Realize the savings to be had moving "fixed infra" back on-premises or into a colo versus sticking with a public cloud provider
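A back-of-the-envelope sketch of that reserved-instance-vs-amortized-hardware comparison (all dollar figures below are placeholder assumptions, not quotes):

```python
# Rough TCO sketch: 3-year cloud reserved instances vs. 7-year amortized hardware.
# Every dollar figure here is an illustrative assumption, not a real quote.

def monthly_cloud_cost(reserved_rate_per_hr: float, instances: int) -> float:
    """Reserved-instance cost per month (~730 hours/month)."""
    return reserved_rate_per_hr * 730 * instances

def monthly_onprem_cost(hardware_price: float, amort_years: int,
                        colo_and_power_per_month: float) -> float:
    """Hardware amortized linearly over its lifetime, plus colo/power opex."""
    return hardware_price / (amort_years * 12) + colo_and_power_per_month

cloud = monthly_cloud_cost(reserved_rate_per_hr=2.50, instances=10)
onprem = monthly_onprem_cost(hardware_price=120_000, amort_years=7,
                             colo_and_power_per_month=2_000)
print(f"cloud ${cloud:,.0f}/mo vs on-prem ${onprem:,.0f}/mo")
```

The crossover obviously moves with your real quotes, power/colo rates, and refresh cycle; running the arithmetic per workload is the point.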
Seriously, what took a full rack or two of 2U dual-socket servers just a decade ago can be replaced with three 2U boxes with full HA/clustering. It's insane.
Back in the late '10s, I made a case to my org at the time that a global hypervisor hardware refresh and accompanying VMware licenses would have an ROI of 2.5yrs versus comparable AWS infrastructure, even assuming a 50% YoY rate of license inflation (this was pre-Broadcom; nowadays, I'd be eyeballing Nutanix, Virtuozzo, Apache Cloudstack, or yes, even Proxmox, assuming we weren't already a Microsoft shop w/ Hyper-V) - and give us an additional 20% headroom to boot. The only thing giving me pause on that argument today is the current RAM/NAND shortage, but even that's (hopefully) temporary - and doesn't hurt the orgs who built around a longer timeline with the option for an additional support runway (like the three-year extended support contracts available through VARs).
If we can't bill a customer for it, and it's not scaling regularly, then it shouldn't be in the public cloud. That's my take, anyway. It sucks the wind from the sails of folks gung-ho on the "fringe benefits" of public cloud spend (box seats, junkets, conference tickets, etc...), but the finance teams tend to love such clear numbers.
The main cost with on-prem is not the price of the gear but the price of acquiring talent to manage the gear. Most companies simply don't have the skillset internally to properly manage these servers, or even the internal talent to know whether they are hiring a good infrastructure engineer or not during the interview process.
For those that do, your scaling example works against you. If today you can merge three services into one, then why do you need full time infrastructure staff to manage so few servers? And remember, you want 24/7 monitoring, replication for disaster recovery, etc. Most businesses do not have IT infrastructure as a core skill or differentiator, and so they want to farm it out.
> even the internal talent to know whether they are hiring a good infrastructure engineer or not during the interview process.
This is really the core problem. Every time I’ve done the math on a sizable cloud vs on-prem deployment, there is so much money left on the table that the orgs can afford to pay FAANG-level salaries for several good SREs but never have we been able to find people to fill the roles or even know if we had found them.
The numbers are so much worse now with GPUs. The cost of reserved instances (let alone on-demand) for an 8x H100 pod, even with NVIDIA Enterprise licenses included, leaves tens of thousands per pod for the salaries of the employees managing it. Assuming one SRE can manage at least four racks, the hardware pays for itself, if you can find even a single qualified person.
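To make that concrete with placeholder numbers (the hourly rate and hardware prices below are assumptions, not vendor quotes):

```python
# Illustrative only: 3-year reserved 8xH100 cloud cost vs. buying the pod.
# All prices are assumed placeholders, not real quotes.

HOURS_PER_YEAR = 8760

cloud_rate_per_pod_hr = 60.0      # assumed reserved rate for an 8xH100 node
cloud_3yr = cloud_rate_per_pod_hr * HOURS_PER_YEAR * 3

pod_purchase = 300_000            # assumed capex for one 8xH100 node
power_cooling_3yr = 60_000        # assumed power/cooling opex over 3 years
onprem_3yr = pod_purchase + power_cooling_3yr

print(f"3yr cloud: ${cloud_3yr:,.0f}, 3yr owned: ${onprem_3yr:,.0f}, "
      f"delta: ${cloud_3yr - onprem_3yr:,.0f}")
```

Even with generous opex padding, the per-pod delta over three years is the kind of money that funds SRE salaries many times over, which is exactly the hiring problem described above.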
$120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
If there's an issue with the server while they're sick or on vacation, you just stop and wait.
If they take a new job, you need to find someone to take over or very quickly hire a replacement.
There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.
Going on-prem like this is highly risky. It works well until the hardware starts developing problems or the person in charge gets a new job. The weeks and months lost to dealing with the server start to become a problem. The SRE team starts to get tired of having to do all of their work on weekends because they can't block active use during the week. Teams start complaining that they need to use cloud to keep their project moving forward.
> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
> Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
> If there's an issue with the server while they're sick or on vacation, you just stop and wait.
Very much depends on what you're doing, of course, but "you just stop and wait" for sickness/vacation sometimes is actually good enough uptime -- especially if it keeps costs down. I've had that role before... That said, it's usually better to have two or three people who know the systems though (even if they're not full time dedicated to them) to reduce the bus factor.
If a business requires at least a quarter million bucks' worth of hardware for its basic operation, yet can't pay the market rate for someone who would operate it, maybe the basics of that business are not okay?
This is factually not how it played out in my experience.
The company needed the exact same people to manage AWS anyway. And the cost difference was so large that it would have paid for five more hires, which we didn't need anyway.
Not only the cost, but also not needing to worry about going over bandwidth limits, and having so much extra compute power, made a very big difference.
IMO the cloud stuff is just too full of itself if you are trying to solve a problem that requires compute, like hosting databases or similar. Just renting a machine from a provider like Hetzner and starting from there is by far the best option.
> The company did need the same exact people to manage AWS anyway.
That is incorrect. On AWS you need a couple of DevOps engineers who will string together the already existing services.
With on-premises, you need someone who will install racks, change disks, set up high-availability block storage or object storage, etc. Those are not DevOps people.
People will install racks and swap drives for significantly less money than DevOps, lol. People who can build LEGO sets are cheaper than software developers.
Is it still a problem in 2026 when unemployment in IT is rising? Reasons can be argued (the end of ZIRP or AI) but hiring should be easier than it was at any time during the last 10 years.
I know of AWS's reputation as a business and what the devs say who work there, so I have no argument against your point, except to say that they do manage to make it work. Somewhere in there must be some unsung heroes keeping the whole thing online.
> main cost with on-prem is not the price of the gear but the price of acquiring talent to manage the gear
Not quite. If you hire bad talent to manage your 'cloud gear', you'll discover that mistakes which would cost you nothing on-premises can cost you in the cloud. Sometimes - a lot.
Given how good Apple Silicon is these days, why not just buy a spec'd out Mac Studio (or a few) for $15k (512 GB RAM, 8 TB NVMe), maybe pay for S3 only to sync data across machines. No talent required to manage the gear. AWS EC2 costs for similar hardware would net out in something ridiculous like 4 months.
That’s definitely the right call in some cases. But as soon as there’s any high-interconnect-rate system that has to be in cloud (appliances with locked in cloud billing contracts, compute that does need to elastically scale and talks to your DB’s pizza box, edge/CDN/cache services with lots of fallthrough to sources of truth on-prem), the cloud bandwidth costs start to kill you.
I’ve had success with this approach by keeping it to only the business process management stacks (CRMs, AD, and so on—examples just like the ones you listed). But as soon as there’s any need for bridging cloud/onprem for any data rate beyond “cronned sync” or “metadata only”, it starts to hurt a lot sooner than you’d expect, I’ve found.
Yep, 100%, but that's why identifying compatible workloads first is key. A lot of orgs skip right to the savings pitch, ignorant of how their applications communicate with one another - and you hit the nail on the head that applications doing even some processing in a cloud provider will murder you on egress fees by trying to hybrid your app across them.
Folks wanting one or the other miss savings had by effectively leveraging both.
Any experience with the mid-to-small cloud providers that provide un-metered network ports and/or free interconnect with partner providers?
(For various reasons, I just care about VPS/bare metal, and S3-compatiblity.)
I'm looking at those because I'm having difficulty forecasting bandwidth usage, and the pessimistic scenarios seem to have me inside the acceptable use policies of the small providers while still predicting AWS would cost 5-10x more for the same workload.
What has surprised me about the cloud is that per-core prices keep going up. Yet the market direction is the opposite: what used to be 1/2 or 1/4 of a box is now 1/256, and it's faster, yet the cloud price for that core has only risen. I think their business plan is to wipe out all the people who used to maintain on-premises machines, and then they can continue to charge similar prices for something that is only getting cheaper.
It's hard drive and SSD prices that stagger me in the cloud. Where a server CPU might only be about 2x the price of buying the CPU yourself amortized over a few years (albeit usually with lower clock speeds in the cloud), the drive space is at least 10-100x the price of doing it locally. It has a bit more redundancy built in, but for that overhead you could replicate the data many times over.
As time has gone on, the cloud deal has gotten worse as hardware has gained more cores.
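A rough sketch of that storage gap, with ballpark assumed prices:

```python
# Rough 3-year storage cost comparison (all prices are ballpark assumptions).

def cloud_storage_3yr(gb: int, price_per_gb_month: float) -> float:
    """Managed cloud block/SSD storage billed monthly for 36 months."""
    return gb * price_per_gb_month * 36

def local_storage_3yr(gb: int, price_per_gb_drive: float, copies: int = 3) -> float:
    """Buy the drives outright; keep N replicas for redundancy."""
    return gb * price_per_gb_drive * copies

cloud = cloud_storage_3yr(10_000, 0.08)   # assumed ~$0.08/GB-month SSD tier
local = local_storage_3yr(10_000, 0.10)   # assumed ~$0.10/GB retail SSD
print(f"3yr cloud: ${cloud:,.0f}, 3yr local (3 copies): ${local:,.0f}")
```

Even carrying three full replicas locally, the gap under these assumed prices is roughly an order of magnitude, consistent with the 10-100x range claimed above (the high end shows up once you compare against spinning rust).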
Do note though that, AIUI, these are all E-cores: they have poor single-threaded performance and won't support things like AVX-512. That is going to skew your performance testing a lot. Some workloads will be fine, but for many users who are actually USING the hardware they buy, this is likely to be a problem.
If that's you, then the Granite Rapids-AP platform that launched prior to this can hit similar numbers of threads (256 for the 6980P). There are a couple of caveats, though: firstly, there are "only" 128 physical cores, and if you're using VMs you probably don't want to share a physical core across VMs; secondly, it has a 500W TDP and retails north of $17,000, if you can even find one for sale.
Overall once you're really comparing like to like, especially when you start trying to have 100+GbE networking and so on, it gets a lot harder to beat cloud providers - yes they have a nice fat markup but they're also paying a lot less for the hardware than you will be.
Most of the time when I see takes like this it's because the org has all these fast, modern CPUs for applications that get barely any real load, and the machines are mostly sitting idle on networks that can never handle 1/100th of the traffic the machine is capable of delivering. Solving that is largely a non-technical problem not a "cloud is bad" problem.
E-cores aren't that slow, yesteryear ones were already around Skylake levels of performance (clock for clock). Now one might say that's a 10+ year old uarch, true, but those ten years were the slowest ten years in computing since the beginning of computing, at least as far as sequential programs are concerned.
At my job we use Hyper-V, and finding someone who actually knows Hyper-V is difficult and expensive. Throw in Cisco networking, storage appliances, etc. to hit 99.99% uptime...
Also, that means you have just one person; you need at least two if you don't want gaps in staffing, more likely three.
Then you still need all the cloud folks to run that.
We have a hybrid setup like this, and you do get a bit of best of both worlds, but ultimately managing onprem or colo infra is a huge pain in the ass. We only do it due to our business environment.
Cloud = the right choice when just starting. It isn't about infra cost, it is about mental cost. Setting up infra is just another thing that hurts velocity. By the time you are serving a real load for the first time though you need to have the discussion about a longer term strategy and these points are valid as part of that discussion.
I guess it depends, but infra is also a lot simpler when starting out. It really isn't much harder (easier, even?) to set up services on a box or two than to manage AWS.
I'm pretty sure a box like this could run our whole startup (hosting PG, k8s, our backend APIs, etc.), would be way easier to set up, and wouldn't cost 2 DevOps engineers and $40,000 a month.
Is infra really that hard to set up? It seems like infra is something a infra expert could establish to get the infra going and then your infra would be set up and you would always have infra.
As a big on-prem guy, I think cloud makes sense for early startups. Lead time on servers and networking setup can be significant, and if you don't know how much you need yet you will either be resource starved or burn all your cash on unneeded capacity.
On-prem wins for a stable organization every time though.
You have to pay that infra person and shield them from "infra works, why are we paying so much for IT staff" layoffs. Then you have ongoing maintenance costs like UPS battery replacement and redundant internet connections, on top of the usual hardware attrition.
Is using virtualization the only good way of taking a 288-core box and splitting it up into multiple parallel workloads? One time I rented a 384-core AMD EPYC bare-metal VM in GCP and could not for the life of me get parallelized workloads to scale just using bare-metal Linux. I wanted to run a bunch of CPU inference jobs in parallel (with each one getting 16 cores), but the scaling was atrocious: the more parallel jobs I added, the slower all of them ran. When I checked htop, the CPU was very underutilized, so my theory was that there was a memory bottleneck somewhere in ONNX/torch (something to do with NUMA nodes?). Anyway, I wasn't able to test using Proxmox or VMware on there to split up CPU/memory resources; we decided to just buy a bunch of smaller-core-count AMD Ryzen 1Us instead, which scaled way better with my naive approach.
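For what it's worth, one non-virtualization approach is pinning each job to its own block of cores, so its threads and (with Linux's default first-touch policy) its memory stay on one NUMA node. A hedged sketch; the contiguous core-block-per-node layout is an assumption here, and on a real box you'd read the mapping from `lscpu` or `/sys/devices/system/node/`:

```python
# Sketch (Linux only): pin each worker process to its own block of cores so a
# job's threads, and under first-touch allocation its memory, stay local to
# one NUMA node. Assumes cores on a node are contiguously numbered; verify
# with lscpu on real hardware.
import os
from multiprocessing import Process

CORES_PER_JOB = 16

def run_job(job_id: int, first_core: int, ncores: int):
    # Restrict this process (and any threads it spawns) to one core block.
    os.sched_setaffinity(0, range(first_core, first_core + ncores))
    # ... load the ONNX/torch model and run inference here ...
    print(f"job {job_id} pinned to cores {first_core}-{first_core + ncores - 1}")

if __name__ == "__main__":
    total = os.cpu_count() or 1
    jobs = [Process(target=run_job, args=(i, i * CORES_PER_JOB, CORES_PER_JOB))
            for i in range(total // CORES_PER_JOB)]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()
```

The shell equivalent is `numactl --cpunodebind=N --membind=N`, which also forces the memory policy; whether that fixes the scaling collapse depends on whether the bottleneck really was cross-node traffic.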
How did the speed of one or two jobs on the EPYC compare to the Ryzen?
And 384 actual cores or 384 hyperthreading cores?
Inference is so memory bandwidth heavy that my expectations are low. An EPYC getting 12 memory channels instead of 2 only goes so far when it has 24x as many cores.
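The bandwidth-per-core arithmetic bears that out (the channel counts and transfer rates below are nominal, assumed figures):

```python
# Back-of-envelope memory bandwidth per core. Numbers are nominal/assumed.

def gbps_per_core(channels: int, gt_per_s: float, cores: int) -> float:
    # Each DDR channel is 8 bytes wide: bandwidth = channels * GT/s * 8 bytes.
    return channels * gt_per_s * 8 / cores

desktop = gbps_per_core(channels=2, gt_per_s=5.6, cores=16)   # DDR5-5600 desktop
epyc = gbps_per_core(channels=12, gt_per_s=4.8, cores=96)     # 12ch DDR5-4800 EPYC
print(f"desktop: {desktop:.1f} GB/s per core, EPYC: {epyc:.1f} GB/s per core")
```

Under these assumptions the big EPYC has no more bandwidth per core than a desktop chip, so bandwidth-bound inference jobs saturate memory long before the cores show as busy in htop.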
Is that personnel cost more than running on someone else's infra? Just counting the number of people a company now needs to maintain their cloud/kubernetes/whatever setup, paired with "devops" meaning all devs now have to spend time on this stuff, I'd almost wager we would spend less on personnel if we just chucked a few laptops in a closet and SSHed in.
> These sorts of core-density increases are how I win cloud debates in an org.
The core density is bullshit when each core is so slow that it can't do any meaningful work. The reality is that Intel is 3 times behind AMD/TSMC on performance vs power consumption ratio.
People would be better off having a look at the high-frequency models (9xx5F models like the 9575F); that was the first generation of server CPU to reach ~5 GHz and sustain it on 32+ cores.
Intel seems to be deliberately hiding the clock frequency of this thing; the xeon-6-plus-product-deck.pdf has no mention of clock frequency or how the LLC is shared.
> These sorts of core-density increases are how I win cloud debates in an org.
AMD has had these sorts of densities available for a minute.
> Identify the workloads that haven't scaled in a year.
I have done this math recently, and you need to stop cherry picking and move everything. And build a redundant data center to boot.
Compute is NOT the major issue for this sort of move:
Switching and bandwidth will be major costs. 400Gb is a minimum for interconnects, and most orgs are going to need at least that much bandwidth at the top of rack.
Storage remains problematic. You might be able to amortize compute over this time scale, but not storage. 5 years would be pushing it (depending on use). And data center storage at scale was expensive before the recent price spike. Spinning rust is viable for some tasks (backup) but will not cut it for others.
Human capital: Figuring out how to support the hardware you own is going to be far more expensive than you think. You need to expect failures and staff accordingly, and that means people who are going to be, for the most part, idle.
> If we can't bill a customer for it, and it's not scaling regularly, then it shouldn't be in the public cloud. That's my take, anyway. It sucks the wind from the sails of folks gung-ho on the "fringe benefits" of public cloud spend (box seats, junkets, conference tickets, etc...), but the finance teams tend to love such clear numbers.
I agree, but.
For one, it's not just the machines themselves. You also need to budget in power, cooling, space, the cost of providing redundant connectivity and side gear (e.g. routers, firewalls, UPS).
Then, you need a second site, no matter what. At least for backups, ideally as a full failover. Either your second site is some sort of cloud, which can be a PITA to set up without introducing security risks, or a second physical site, which means double the expenses.
If you're a publicly listed company, or live in jurisdictions like Europe, or you want to have cybersecurity insurance, you have data retention, GDPR, SOX and a whole bunch of other compliance to worry about as well. Sure, you can do that on-prem, but you'll have a much harder time explaining to auditors how your system works when it's a bunch of on-prem stuff vs. "here's our AWS Backup plans covering all servers and other data sources, here is the immutability stuff, here are plans how we prevent backup expiry aka legal hold".
Then, all of that needs to be maintained, which means additional staff on payroll, if you own the stuff outright your finance team will whine about depreciation and capex, and you need to have vendors on support contracts just to get firmware updates and timely exchanges for hardware under warranty.
Long story short, as much as I prefer on-prem hardware vs the cloud, particularly given current political tensions - unless you are a 200+ employee shop, the overhead associated with on-prem infrastructure isn't worth it.
> Then, you need a second site, no matter what. At least for backups, ideally as a full failover. Either your second site is some sort of cloud, which can be a PITA to set up without introducing security risks, or a second physical site, which means double the expenses.
You can technically use Backblaze's unlimited backup option, which costs around $7/month for a given machine. It's more intended for Windows, but there have been people who make it work, with daily backups, and it should be GDPR-compliant (https://www.backblaze.com/company/policy/gdpr). If you're very worried about GDPR, pair it with something like Hetzner, or OVH storage boxes (36 TB for ~$55 IIRC makes a good backup box), and you should try to follow the 3-2-1 strategy.
> Then, all of that needs to be maintained, which means additional staff on payroll, if you own the stuff outright your finance team will whine about depreciation and capex, and you need to have vendors on support contracts just to get firmware updates and timely exchanges for hardware under warranty.
I can't speak for certain, but IIRC companies like Dell make products available on a monthly payment basis too, and you can simply colocate in a decent datacenter. Plus points: you can get 10-50 Gb ports as well if you are bandwidth hungry, a lot more is customizable, and the hardware is already pretty nice, as GP observed. (Yes, RAM prices are high; let's hope that's temporary, as GP noted too.)
I can't speak to firmware updates or timely exchanges for hardware under warranty.
That being said, I am not saying this is for everyone. It essentially boils down to whether they have expertise in this field, or can get it for cheaper than their AWS bills. With many large AWS bills in the tens of thousands of dollars, if not hundreds of thousands, I think far more companies would be better off with the above strategy than with AWS.
> The only thing giving me pause on that argument today is the current RAM/NAND shortage
Not a shortage - price gouging. And it will mean an increase in 'cloud' prices too, because they need to refresh their hardware as well. So by the summer the equation will be back to where it was.
With packages like this (lots of cores, multi-chip packaging, lots of memory channels), the architecture is increasingly a small cluster on a package rather than a monolithic CPU.
I wonder whether the next bottleneck becomes software scheduling rather than silicon - OS/runtimes weren’t really designed with hundreds of cores and complex interconnect topologies in mind.
Yes, there are scheduling issues, NUMA problems, etc. caused by the cluster-in-a-box form factor.
We had a massive performance issue a few years ago that we fixed by mapping our processes to the NUMA zone topology. The default design of our software would otherwise effectively route all memory accesses to the same NUMA zone, and performance went down the drain.
Often the Linux scheduling improvements come a year or two after the chip. Also, Linux makes moment-by-moment scheduling and allocation decisions that are unaware of the big picture of workload requirements.
I don't think there are any fundamental bottlenecks here. There's more scheduling overhead when you have a hundred processes on a single core than if you have a hundred processes on one hundred cores.
The bottlenecks are pretty much hardware-related: thermal, power, memory, and other I/O. Because of this, you presumably never get true "288 core" performance out of this; it's not going to mine Bitcoin 288x as fast as a single core. Instead, you have less context-switching overhead with 288 tasks that need to do stuff intermittently, which is how most hardware ends up being used anyway.
Maybe no fundamental bottlenecks but it's easy to accidentally write software that doesn't scale as linearly as it should, e.g. if there's suddenly more lock contention than you were expecting, or in a more extreme case if you have something that's O(n^2) in time or space, where n is core count.
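Amdahl's law makes that easy to quantify: if f is the serial (or contended) fraction of the work, the speedup on n cores is 1/(f + (1-f)/n). A quick sketch:

```python
# Amdahl's law: even a small serial (or lock-contended) fraction caps scaling.

def speedup(serial_fraction: float, cores: int) -> float:
    return 1 / (serial_fraction + (1 - serial_fraction) / cores)

for f in (0.01, 0.05):
    print(f"serial={f:.0%}: {speedup(f, 64):.1f}x on 64 cores, "
          f"{speedup(f, 288):.1f}x on 288 cores")
```

With just 1% of the work serialized, 288 cores deliver roughly a 74x speedup; at 5% it's under 19x, so lock contention that was invisible on a 16-core box dominates on a part like this.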
AFAIK the mainline limit is 4096 threads. HP sells a server with 32 sockets x 60 cores/socket x 2 threads/core = 3840 threads, so we are pretty close to that limit.
There definitely are bottlenecks. The one I always think of is the kernel's networking stack. There's no sense in using the kernel TCP stack when you have hundreds of independent workloads. That doesn't make any more sense than it would have made 20 years ago to have an external TCP appliance at the top of your rack. Userspace protocol stacks win.
> I wonder whether the next bottleneck becomes software scheduling rather than silicon
Yep, scheduling has been a problem for a while. There was an amazing article a few years ago about how the Linux kernel was accidentally hardcoded to 8 cores; you can probably google and find it.
IMO the most interesting problem right now is the cache: you get a cache miss every time a task moves cores. The problem is that with thousands of threads switching between hundreds of cores every few milliseconds, we're dangerously approaching the point where all the time is spent thrashing and reloading the CPU cache.
That's the one. Funny thing, it's not actually clickbait.
The bug made it to the kernel mailing list, where some Intel people looked into it and confirmed there is a bug. The kernel's allocation logic was capped at 8 cores, which leaves a few percent of performance on the table as the number of cores increases and the allocation becomes less and less optimal.
It's a classic tragedy of the commons. CPUs have gotten so complicated that there may only be a handful of people in the world who could work on and comprehend a bug like this.
As a Yocto enthusiast, I am curious as to how much elapsed realtime would be needed for a clean Yocto build. Yocto is thread heavy, so with 288, it oughta be good.
Helped a friend make a difficult career decision (cozy job vs something hard and new + moving to a new city) that ultimately ended up with him working on the project. Glad that happened. I love to see people grow.
A bad moment to have a make-or-break moment for your CPU business - a lot of customers will probably hold off purchases right now because of the RAM prices, no matter how good your CPU might be.
Isn't this new server CPU a drop-in replacement, though? So the DC could pull out the old CPU, drop in the new one, and not touch the existing RAM setup, yet deliver better performance within the limits of the existing RAM. Then once RAM prices drop (okay, that might be a while), upgrade the RAM separately at a different time.
I bought a bundle with 512GB of RAM and an older 24-core EPYC (7F72) + supermicro motherboard on ebay a bit over a year ago, it was really an amazing deal and has made for a truly nice NAS. If you're okay with stuff that's old enough that you can buy decommissioned server stuff, you can get really high-quality gear at surprisingly low prices.
Companies decommission hardware on a schedule after all, not when it stops working.
EDIT: Though looking for similar deals now, I can only find ones up to 128GB RAM and they're near twice the price I paid. I got 7F72 + motherboard + 512GB DDR4 for $1488 (uh, I swear that's what I paid, $1488.03. Didn't notice the 1488 before.) The closest I can find now is 7F72 + motherboard + 128GB DDR4 for over $2500. That's awful
I've heard it claimed that the era of being able to do this (buy slightly old used server hardware cheap on ebay) is coming to an end because, in the quest for ever more efficiency, the latest server hardware is no longer compatible with off-the-shelf power supplies etc. (there was more but that's the part that I remember) and therefore won't have any value on the second hand market.
I hope it was wrong, but it seems at least plausible to me. I'm sure that probably fixes could be made for all these issues, but the reason the current paradigm works is that, other than the motherboard and CPU, everything else you need is standard, consumer grade equipment which is therefore cheap. If you need to start buying custom (new) power supplies etc. to go along, then the price may not make as much sense anymore.
I'm curious, what is the power draw for such a system? Of course, it heavily depends on the disks, but does it idle under 200W?
I personally feel like I will downscale my homelab hardware to reduce its power draw. My hardware is rather old (and leagues below yours), and more recent hardware tends to be more efficient, but I have no idea how well these high-end server boards can lower their idle power consumption.
When I was looking in October, I hadn't bought hardware for the better part of a decade, and I saw all these older posts on forums for DDR4 at $1/GB, but the lowest I could find was at least $2/GB used. These days? HAH!
If I had a decent sales channel I might be speculating on DDR4/DDR5 RAM and holding it because I expect prices to climb even higher in the coming months.
AMD also has some weird CPUs like the 7C13 and 7R13 that are way, way below their normal price bands. You didn't even have to buy used to get a ridiculous system... until 4 months ago (RIP RAM prices).
https://www.servethehome.com/amd-epyc-7c13-is-a-surprisingly...
Oh, nice! I always wanted one of those; a many-core build server running ARM would be excellent for Yocto. Anything running in QEMU in the rootfs is so slow on x86, and I've seen the rootfs postprocess step take a long time.
Though... these days, getting enough RAM to support builds across 80 cores would be twice the price of the whole rest of the system I'm guessing.
Aside from the memory cost being exorbitant, 4th/5th gen ES CPUs aren’t horribly expensive for the core count you get. 8480s and 8592s have been quite accessible.
Stuffed an 8480+ ES with 192GB of memory across 8 channels and it's actually not too bad.
Just give it a few years and you'll be able to buy the thing for a fraction of the 'current' price. By that time it will be considered to be 'slow' and 'power-hungry' and people will wonder why you're intent on running older hardware but it'll still work just fine. The DL380 G7 under the stairs here also used to cost an arm and a leg while I got it for some finger nail clippings.
I’ve not kept up with Intel in a while, but one thing that stood out to me is these are all E cores— meaning no hyperthreading. Is something like this competitive, or preferred, in certain applications? Also does anyone know if there have been any benchmarks against AMDs 192 core Epyc CPU?
In HPC, like physics simulation, they are preferred; there's almost no benefit from HT. High clock frequencies are also preferred, though, and these high-core-count CPUs have to limit their clock frequencies.
Without the hyperthreading (E-cores) you get more consistent performance between running tasks, and cloud providers like this because they sell "vCPUs" that should not fluctuate when someone else starts a heavy workload.
Sort of. They can just sell even numbers of vCPUs, and dedicate each hyper-thread pair to the same tenant. That prevents another tenant from creating hyper-threading contention for you.
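A toy sketch of that pairing scheme (the sibling topology assumed here, thread i paired with thread i + n/2, is a common Linux enumeration, but a real allocator would read it from the host's CPU topology):

```python
# Toy vCPU allocator: hand out hyperthread sibling pairs so both threads of a
# physical core always belong to the same tenant. Topology is an assumption:
# thread i is sibling to thread i + nthreads/2 (common Linux enumeration).

def sibling_pairs(nthreads: int):
    half = nthreads // 2
    return [(i, i + half) for i in range(half)]

def allocate(pairs, tenants):
    # Round-robin whole sibling pairs across tenants; no pair is ever split.
    alloc = {t: [] for t in tenants}
    for i, pair in enumerate(pairs):
        alloc[tenants[i % len(tenants)]].append(pair)
    return alloc

print(allocate(sibling_pairs(8), ["a", "b"]))
```

Because allocation happens in whole pairs, a tenant only ever shares an L1/L2 with itself, which is the contention-isolation property the comment above describes.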
For those, wouldn't hyperthreading be a win? Some fraction of the time, you'd get evicted to the hyperthread that shares your L1 cache (and the hypervisor could strongly favor that).
"Is something like this competitive, or preferred, in certain applications?"
They cite a very specific use case in the linked story: Virtualized RAN. This is using COTS hardware and software for the control plane for a 5G+ cell network operation. A large number of fast, low power cores would indeed suit such a application, where large numbers of network nodes are coordinated in near real time.
It's entirely possible that this is the key use case for this device: 5G networks are huge money makers and integrators will pay full retail for bulk quantities of such devices fresh out of the foundry.
E-cores didn't just ruin P-cores; they ruined AVX-512 altogether. We were getting so close to near-universal AVX-512 support, enough to bother actually writing AVX-512 versions of things. Then Intel killed it.
That's finally set to be resolved with Nova Lake later this year, which will support AVX10 (the new iteration of AVX512) across both core types. Better very late than never.
I love the AVX512 support in Zen 5 but the lack of Valgrind support for many of the AVX512 instructions frustrates me almost daily. I have to maintain a separate environment for compiling and testing because of it.
It all depends on your exact workload, and I’ll wait to see benchmarks before making any confident claims, but in general if you have two threads of execution which are fine on an E-core, it’s better to actually put them on two E-cores than one hyperthreaded P-core.
I don't know the nitty-gritty of why, but some compute intensive tasks don't benefit from hyperthreading. If the processor is destined for those tasks, you may as well use that silicon for something actually useful.
For an application like a build server, the only metric that really matters is total integer compute per dollar and per watt. When I compile e.g. a Yocto project, I don't care whether a single core compiles a single C file in a millisecond or a minute; I care how fast the whole machine compiles what's probably hundreds of thousands of source files. If E-cores give me more compute per dollar and per watt than P-cores, give me E-cores.
Of course, having fewer faster cores does have the benefit that you require less RAM... Not a big deal before, you could get 512GB or 1TB of RAM fairly cheap, but these days it might actually matter? But then at the same time, if two E-cores are more powerful than one hyperthreaded P-core, maybe you actually save RAM by using E-cores? Hyperthreading is, after all, only a benefit if you spawn one compiler process per CPU thread rather than per core.
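The RAM arithmetic behind that last point can be sketched. The core counts mirror the chips discussed in this thread, but the 2 GiB-per-job figure is purely an assumed illustration:

```python
def build_ram_needed(cores: int, threads_per_core: int, gib_per_job: float,
                     jobs_per: str = "thread") -> float:
    """Rough peak RAM for a parallel build that spawns one compiler
    process per hardware thread (the usual `make -j$(nproc)`) or,
    alternatively, one per physical core."""
    jobs = cores * threads_per_core if jobs_per == "thread" else cores
    return jobs * gib_per_job

# 72 hyperthreaded P-cores, one job per thread, ~2 GiB per C++ compile job:
print(build_ram_needed(72, 2, 2.0))           # → 288.0 GiB
# 288 E-cores without SMT, one job per core:
print(build_ram_needed(288, 1, 2.0, "core"))  # → 576.0 GiB
```

Whether the E-core box "costs" or "saves" RAM then depends on how much throughput each job slot actually delivers, which is exactly the open question above.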
EDIT: Why in the world would someone downvote this perspective? I'm not even mad, just confused
It's for building embedded Linux distros, and your typical Linux distro contains quite a lot of C++ and Rust code these days (especially if you include, say, a browser, or Qt). But you have parallelism across packages, so even if one core is busy doing a serial linking step, the rest of your cores are busy compiling other packages (or maybe even linking other packages).
That said, there are sequential steps in Yocto builds too, notably installing packages into the rootfs (it uses dpkg, opkg or rpm, all of which are sequential) and any code you have in the rootfs postprocessing step. These steps usually aren't a significant part of a clean build, but can be a quite substantial part of incremental builds.
I think some of why is size on die. 288 E cores vs 72 P cores.
Also, there have been so many hyperthreading vulnerabilities of late - enough that it's often disabled on data center boards - that I'd imagine this de-risks that entirely.
Core density plus power makes so many things worthwhile. Generally the human cost of managing hardware scales with the number of components under management, and CPUs are very reliable, so once you get lots of CPU and RAM on a single machine you can run with very few of them.
But right pricing hardware is hard if you’re small shop. My mind is hard-locked onto Epyc processors without thought. 9755 on eBay is cheap as balls. Infinity cores!
Problem with hardware is lead time etc. cloud can spin up immediately. Great for experimentation. Organizationally useful. If your teams have to go through IT to provision machine and IT have to go through finance so that spend is reliable, everybody slows down too much. You can’t just spin up next product.
But if you’re small shop having some Kubernetes on rack is maybe $15k one time and $1.2k on going per month. Very cheap and you get lots and lots of compute!
Previously skillset was required. These days you plug Ethernet port, turn on Claude Code dangerously skip permissions “write a bash script that is idempotent that configures my Mikrotik CCR, it’s on IP $x on interface $y”. Hotspot on. Cold air blowing on face from overhead coolers. 5 minutes later run script without looking. Everything comes up.
Still, foolish to do on prem by default perhaps (now that I think about it): if you have cloud egress you’re dead, compliance story requires interconnect to be well designed. More complicated than just basics. You need to know a little before it makes sense.
Feel like reasoning LLM. I now have opposite position.
Not competitive at all. It's easily visible on the laptop lines, where the same GPU manufactured on TSMC has 3 times the power/performance ratio compared to the Intel one.
Adding more cores is just another desperate move to game the benchmarks. Power is roughly quadratic with frequency: every time you fall behind the competition, you can double the number of cores and reduce the frequency by 1.414 to compensate.
Repeat a few times and you get CPU with hundreds of cores, but each core is so slow it can hardly do any work.
??? GPU vs CPU workloads are completely different. Comparing Panther Lake iGPU vs Ryzen iGPU is not going to tell you much about how high density server CPU performance will work out.
The Panther Lake vs Ryzen laptop performance comparisons show that Panther Lake does well, basically trading blows with top-end Ryzen AI laptop chips in both absolute performance and performance per watt.
If you're not aware, Intel has released a lineup of laptop chips where some models have the GPU made in-house and some have the same GPU made by TSMC. That makes the comparison very direct. TSMC can deliver nearly 3 times the power/performance.
GPU and CPU manufacturing is the same thing, same node, same result. GPU is always maximizing perf/power ratio because it's embarrassingly parallel, leaving no room to game the benchmark. CPU can be gamed by having a single fast core, that drops performance in half as soon as you use another core.
I used to run many hosts with 28 cores per host. If performance scales, it's nicer to have a few 288 core hosts rather than a few hundred 28 core hosts.
Getting the performance to scale can be hard, of course. The less inter-core communication, the better. Things that tend to work well are either workloads where a bunch of data comes in and a single thread works on it for a significant amount of time before shipping the result, or workloads where you can rely on the NIC(s) to split traffic so you can process the network queue for a connection on the same core that handles the userspace work (see Receive Side Scaling) - but you need a fancy NIC to have 288 network queues.
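For the software side of that pattern, one common approach (a sketch, not something from the comment) is one listening socket per pinned worker using `SO_REUSEPORT`, so the kernel spreads connections across workers much like RSS spreads packets across queues. This is Linux-specific, and the worker layout here is hypothetical:

```python
import os
import socket

def reuseport_listener(port: int) -> socket.socket:
    """Each worker opens its own listening socket on the same port;
    with SO_REUSEPORT the kernel hashes incoming connections across
    all such sockets, so each worker keeps its own accept queue."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("0.0.0.0", port))
    s.listen(128)
    return s

def worker(port: int, core: int) -> None:
    """Pin this worker process to one core, then serve its own accept
    queue, keeping network and userspace processing on the same CPU."""
    os.sched_setaffinity(0, {core})  # Linux-only affinity call
    listener = reuseport_listener(port)
    while True:
        conn, _ = listener.accept()
        conn.close()  # real connection handling would go here
```

With one `worker(port, core)` process per core, this mirrors per-queue RSS in userspace without needing a NIC that supports 288 hardware queues.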
These almost always run many smaller virtual machines on top of a hypervisor. The target market is large enterprise or hyperscalers like the public clouds, Meta, etc...
Meanwhile, somebody put 8192 arm cores on a chip and ran a risc-v emulator on top of that which emulated a 6502 which then emulated a 288 core xeon and it used 0.01% of the power and outperformed the Intel chip in every other metric 10:1, probably.
So, they're selling this as an AI accelerator, with drop in compatibility with existing boards, and no boost to RAM bandwidth.
As I understand things, it would be extremely unusual to ship a chip that was bound by floating point throughput, not uncached memory access, especially in the desktop/laptop space.
I haven't been following the Intel server space too carefully, so it's an honest question: Was the old thing compute and not bandwidth limited, or is this going to be running inference at the same throughput (though maybe with lower power consumption)?
No, they're not selling this as an "AI accelerator":
Here is the quote:
"The company says operators deploying 5G Advanced and future 6G networks increasingly rely on server CPUs for virtualized RAN and edge AI inference, as they do not want to re-architect their data centers in a bid to accommodate AI accelerators."
Edge AI usually means very small models that run fine on CPUs.
These sorts of core-density increases are how I win cloud debates in an org.
* Identify the workloads that haven't scaled in a year. Your ERPs, your HRIS, your dev/stage/test environments, DBs, Microsoft estate, core infrastructure, etc. (EDIT, from zbentley: also identify any cross-system processing where data will transfer from the cloud back to your private estate to be excluded, so you don't get murdered with egress charges)
* Run the cost analysis of reserved instances in AWS/Azure/GCP for those workloads over three years
* Do the same for one of these high-core "pizza boxes", but amortized over seven years
* Realize the savings to be had moving "fixed infra" back on-premises or into a colo versus sticking with a public cloud provider
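A back-of-the-envelope version of the reserved-instance vs amortized-box comparison above might look like this sketch. Every figure is a placeholder assumption, not a quote from any provider:

```python
# Placeholder sketch of the comparison above; all prices are made-up
# assumptions for illustration, not real AWS/Azure/GCP or hardware quotes.
def monthly_cost_cloud(reserved_rate_per_hour: float) -> float:
    """Reserved-instance fleet cost per month at an effective hourly rate."""
    return reserved_rate_per_hour * 24 * 30

def monthly_cost_onprem(box_price: float, amortize_years: int,
                        colo_per_month: float) -> float:
    """High-core 'pizza box' amortized over its service life, plus colo fees."""
    return box_price / (amortize_years * 12) + colo_per_month

cloud = monthly_cost_cloud(4.00)               # assumed 3-yr RI rate for the fleet
onprem = monthly_cost_onprem(60_000, 7, 800)   # assumed box price + colo fees
print(round(cloud), round(onprem), round(cloud - onprem))  # → 2880 1514 1366
```

The interesting output is the monthly delta; multiply by fleet size and contract length before taking it to finance.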
Seriously, what took a full rack or two of 2U dual-socket servers just a decade ago can be replaced with three 2U boxes with full HA/clustering. It's insane.
Back in the late '10s, I made a case to my org at the time that a global hypervisor hardware refresh and accompanying VMware licenses would have an ROI of 2.5yrs versus comparable AWS infrastructure, even assuming a 50% YoY rate of license inflation (this was pre-Broadcom; nowadays, I'd be eyeballing Nutanix, Virtuozzo, Apache Cloudstack, or yes, even Proxmox, assuming we weren't already a Microsoft shop w/ Hyper-V) - and give us an additional 20% headroom to boot. The only thing giving me pause on that argument today is the current RAM/NAND shortage, but even that's (hopefully) temporary - and doesn't hurt the orgs who built around a longer timeline with the option for an additional support runway (like the three-year extended support contracts available through VARs).
If we can't bill a customer for it, and it's not scaling regularly, then it shouldn't be in the public cloud. That's my take, anyway. It sucks the wind from the sails of folks gung-ho on the "fringe benefits" of public cloud spend (box seats, junkets, conference tickets, etc...), but the finance teams tend to love such clear numbers.
The main cost with on-prem is not the price of the gear but the price of acquiring talent to manage the gear. Most companies simply don't have the skillset internally to properly manage these servers, or even the internal talent to know whether they are hiring a good infrastructure engineer or not during the interview process.
For those that do, your scaling example works against you. If today you can merge three services into one, then why do you need full time infrastructure staff to manage so few servers? And remember, you want 24/7 monitoring, replication for disaster recovery, etc. Most businesses do not have IT infrastructure as a core skill or differentiator, and so they want to farm it out.
> even the internal talent to know whether they are hiring a good infrastructure engineer or not during the interview process.
This is really the core problem. Every time I’ve done the math on a sizable cloud vs on-prem deployment, there is so much money left on the table that the orgs can afford to pay FAANG-level salaries for several good SREs but never have we been able to find people to fill the roles or even know if we had found them.
The numbers are so much worse now with GPUs. The cost of reserved instances (let alone on-demand) for an 8x H100 pod, even with NVIDIA Enterprise licenses included, leaves tens of thousands of dollars per pod for the salaries of employees managing it. Assuming one SRE can manage at least four racks, the hardware pays for itself - if you can find even a single qualified person.
Self-hosted 8xH100 is ~$250k, depreciated across three years => $80k/year, with power and cooling => $90k/year (~$10/hour total).
AWS charges $55/hour for EC2 p5.48xlarge instance, which goes down with 1 or 3 year commitments.
With 1 year commitment, it costs ~$30/hour => $262k per year.
3-year commitment brings price down to $24/hour => $210k per year.
This price does NOT include egress, and other fees.
So, yeah, there is a $120k-$175k difference that can pay for a full-time on-site SRE, even if you only need one 8xH100 server.
Numbers get better if you need more than one server like that.
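Sanity-checking the arithmetic above with the parent's own figures (assuming 24/7 utilization):

```python
HOURS_PER_YEAR = 24 * 365  # 8760

# Parent's figure: ~$250k of hardware over 3 years, plus power and
# cooling, comes to roughly $90k/year self-hosted.
self_hosted = 90_000
print(round(self_hosted / HOURS_PER_YEAR, 2))  # → 10.27 dollars/hour

aws_1yr = 30 * HOURS_PER_YEAR   # ~$30/hour with a 1-year commitment
aws_3yr = 24 * HOURS_PER_YEAR   # ~$24/hour with a 3-year commitment
print(aws_1yr, aws_3yr)         # → 262800 210240

print(aws_3yr - self_hosted)    # → 120240 (low end of the quoted range)
print(aws_1yr - self_hosted)    # → 172800 (high end of the quoted range)
```

That reproduces the quoted $120k-$175k annual gap per server.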
$120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
If there's an issue with the server while they're sick or on vacation, you just stop and wait.
If they take a new job, you need to find someone to take over or very quickly hire a replacement.
There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.
Going on-prem like this is highly risky. It works well until the hardware starts developing problems or the person in charge gets a new job. The weeks and months lost to dealing with the server start to become a problem. The SRE team starts to get tired of having to do all of their work on weekends because they can't block active use during the week. Teams start complaining that they need to use cloud to keep their project moving forward.
> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
> Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
> If there's an issue with the server while they're sick or on vacation, you just stop and wait.
Very much depends on what you're doing, of course, but "you just stop and wait" for sickness/vacation sometimes is actually good enough uptime -- especially if it keeps costs down. I've had that role before... That said, it's usually better to have two or three people who know the systems though (even if they're not full time dedicated to them) to reduce the bus factor.
If a business requires at least a quarter million bucks worth of hardware for its basic operation, yet can't pay the market rate for someone to operate it - maybe the basics of that business aren't okay?
> There's a second bus factor: What happens when that 8xH100 starts to get flakey?
These come in a non-flakey variant?
This factually did not play out like this in my experience.
The company needed the same exact people to manage AWS anyway. And the cost difference was so high that it would have been possible to hire 5 more people - who weren't needed.
Not only the cost but not needing to worry about going over the bandwidth limit and having soo much extra compute power made a very big difference.
Imo the cloud stuff is just too full of itself if you are trying to solve a problem that requires compute like hosting databases or similar. Just renting a machine from a provider like Hetzner and starting from there is the best option by far.
> The company did need the same exact people to manage AWS anyway.
That is incorrect. On AWS you need a couple of DevOps people who will string together the already-existing services.
With on-premise, you need someone who will install racks, change disks, set up high-availability block storage or object storage, etc. Those are not DevOps people.
To be clear, I'm not writing about on-premise. I mean difference between managed cloud and renting dedicated servers
Even if you do include physical server setup and maintenance, one or two days per month is probably enough for a couple hundred rack units.
Ops people are typically more useful given you probably already have devs.
People will install racks and swap drives for significantly less money than DevOps, lol. People who can build LEGO sets are cheaper than software developers.
"Those are not DevOps people."
Real Devops people are competent from physical layer to software layer.
Signed,
Aerospace Devop
> price of acquiring talent to manage the gear
Is it still a problem in 2026 when unemployment in IT is rising? Reasons can be argued (the end of ZIRP or AI) but hiring should be easier than it was at any time during the last 10 years.
Hiring people is still fucked in 2026 in my experience. HR processes are extremely dysfunctional at many organizations...
hiring in 2026 is 100x harder than ever before
As opposed to talent to manage the AWS? Sorry, AWS loses here as well.
I know of AWS's reputation as a business and what the devs say who work there, so I have no argument against your point, except to say that they do manage to make it work. Somewhere in there must be some unsung heroes keeping the whole thing online.
What about the cost of k8s and AWS experts etc.?
> main cost with on-prem is not the price of the gear but the price of acquiring talent to manage the gear
Not quite. If you hire bad talent to manage your 'cloud gear', you'll discover what the mistakes that would cost you nothing on-premises cost you in the cloud. Sometimes - a lot.
Managing AWS is a ton of work anyway
Given how good Apple Silicon is these days, why not just buy a spec'd out Mac Studio (or a few) for $15k (512 GB RAM, 8 TB NVMe), maybe pay for S3 only to sync data across machines. No talent required to manage the gear. AWS EC2 costs for similar hardware would net out in something ridiculous like 4 months.
That’s definitely the right call in some cases. But as soon as there’s any high-interconnect-rate system that has to be in cloud (appliances with locked in cloud billing contracts, compute that does need to elastically scale and talks to your DB’s pizza box, edge/CDN/cache services with lots of fallthrough to sources of truth on-prem), the cloud bandwidth costs start to kill you.
I’ve had success with this approach by keeping it to only the business process management stacks (CRMs, AD, and so on—examples just like the ones you listed). But as soon as there’s any need for bridging cloud/onprem for any data rate beyond “cronned sync” or “metadata only”, it starts to hurt a lot sooner than you’d expect, I’ve found.
Yep, 100%, but that's why identifying compatible workloads first is key. A lot of orgs skip right to the savings pitch, ignorant of how their applications communicate with one another - and you hit the nail on the head that applications doing even some processing in a cloud provider will murder you on egress fees by trying to hybrid your app across them.
Folks wanting one or the other miss savings had by effectively leveraging both.
Any experience with the mid-to-small cloud providers that provide un-metered network ports and/or free interconnect with partner providers?
(For various reasons, I just care about VPS/bare metal, and S3-compatiblity.)
I'm looking at those because I'm having difficulty forecasting bandwidth usage, and the pessimistic scenarios seem to have me inside the acceptable use policies of the small providers while still predicting AWS would cost 5-10x more for the same workload.
Vultr and Digital Ocean both offer Direct Connects. I've had good experience with their VPSes.
Netcup and OVH provide free un-metered ports. There are actually lots of options available on the market. BuyVM is another good one.
What has surprised me about the cloud is that the price per core keeps going up. Yet the market direction is the opposite: what used to be half or a quarter of a box is now 1/256th of one, and it's faster, yet the cloud price for that core has only risen. I think their business plan is to wipe out all the people who used to maintain the on-premise machines, so they can continue to charge similar prices for something that is only getting cheaper.
It's the hard drive and SSD space prices that stagger me on the cloud. Where a cloud server CPU might only be about 2x the price of buying a CPU outright over a few years (albeit usually with lower clock speeds in the cloud), the drive space is at least 10-100x the price of doing it locally. It's got a bit more potential redundancy, but for that overhead you could replicate the data many times over.
As time has gone on, the cloud deal has gotten worse as the hardware gained more cores.
Do note though that AIUI these are all E-cores, have poor single-threaded performance and won't support things like AVX512. That is going to skew your performance testing a lot. Some workloads will be fine, but for many users that are actually USING the hardware they buy this is likely to be a problem.
If that's you, then the Granite Rapids-AP platform that launched prior to this can hit similar numbers of threads (256 for the 6980P). There are a couple of caveats, though: firstly, there are "only" 128 physical cores, and if you're using VMs you probably don't want to share a physical core across VMs; secondly, it has a 500W TDP and retails north of $17,000, if you can even find one for sale.
Overall once you're really comparing like to like, especially when you start trying to have 100+GbE networking and so on, it gets a lot harder to beat cloud providers - yes they have a nice fat markup but they're also paying a lot less for the hardware than you will be.
Most of the time when I see takes like this it's because the org has all these fast, modern CPUs for applications that get barely any real load, and the machines are mostly sitting idle on networks that can never handle 1/100th of the traffic the machine is capable of delivering. Solving that is largely a non-technical problem not a "cloud is bad" problem.
E-cores aren't that slow, yesteryear ones were already around Skylake levels of performance (clock for clock). Now one might say that's a 10+ year old uarch, true, but those ten years were the slowest ten years in computing since the beginning of computing, at least as far as sequential programs are concerned.
I just don't know if the human capital is there.
At my job we use Hyper-V, and finding someone who actually knows Hyper-V is difficult and expensive. Throw in Cisco networking, storage appliances, etc. to reach 99.99% uptime...
Also that means you have just one person, you need at least two if you don't want gaps in staffing, more likely three.
Then you still need all the cloud folks to run that.
We have a hybrid setup like this, and you do get a bit of best of both worlds, but ultimately managing onprem or colo infra is a huge pain in the ass. We only do it due to our business environment.
Cloud = the right choice when just starting. It isn't about infra cost, it is about mental cost. Setting up infra is just another thing that hurts velocity. By the time you are serving a real load for the first time though you need to have the discussion about a longer term strategy and these points are valid as part of that discussion.
I guess it depends, but infra is also a lot simpler when starting out. It really isn't much harder (easier, even?) to set up services on a box or two than to manage AWS.
I'm pretty sure a box like this could run our whole startup - hosting PG, k8s, our backend APIs, etc. It would be way easier to set up, and it wouldn't cost 2 devops and $40,000 a month.
Is infra really that hard to set up? It seems like infra is something a infra expert could establish to get the infra going and then your infra would be set up and you would always have infra.
As a big on-prem guy, I think cloud makes sense for early startups. Lead time on servers and networking setup can be significant, and if you don't know how much you need yet you will either be resource starved or burn all your cash on unneeded capacity.
On-prem wins for a stable organization every time though.
Secure and reliable infrastructure is hard to set up and keep secure and reliable over time.
You have to pay that infra person and shield them from "infra works, why are we paying so much for IT staff" layoffs. Then you have ongoing maintenance costs like UPS battery replacement and redundant internet connections, on top of the usual hardware attrition.
It's unfortunately not so cut and dry
Based on the evidence, not only is infrastructure really hard to set up in the first place, it is incredibly error-prone to adjust to new demand.
Is using virtualization the only good way of taking a 288-core box and splitting it up into multiple parallel workloads? One time I rented a 384-core AMD EPYC baremetal VM in GCP and I could not for the life of me get parallelized workloads to scale just using baremetal linux. I wanted to run a bunch of CPU inference jobs in parallel (with each one getting 16 cores), but the scaling was atrocious - the more parallel jobs you tried to add, the slower all of them ran. When I checked htop the CPU was very underutilized, so my theory was that there was a memory bottleneck somewhere happening with ONNX/torch (something to do with NUMA nodes?) Anyway, I wasn't able to test using proxmox or vmware on there to split up cpu/memory resources; we decided instead to just buy a bunch of smaller-core-count AMD Ryzen 1Us instead, which scaled way better with my naive approach.
How did the speed of one or two jobs on the EPYC compare to the Ryzen?
And 384 actual cores or 384 hyperthreading cores?
Inference is so memory bandwidth heavy that my expectations are low. An EPYC getting 12 memory channels instead of 2 only goes so far when it has 24x as many cores.
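The per-core bandwidth point can be made concrete. The channel counts are from the comment; the 16-core Ryzen figure is an assumption consistent with the stated 24x core ratio:

```python
# Back-of-the-envelope memory-bandwidth share per core; channel counts
# from the comment, core counts assumed (384-core EPYC vs 16-core Ryzen).
epyc_channels, epyc_cores = 12, 384
ryzen_channels, ryzen_cores = 2, 16

epyc_share = epyc_channels / epyc_cores    # memory channels per core
ryzen_share = ryzen_channels / ryzen_cores

print(ryzen_share / epyc_share)  # → 4.0: each Ryzen core gets 4x the share
```

Which would explain why naive parallel inference scaled better on a fleet of small Ryzen boxes than on one big EPYC.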
Is your calculation also taking cost of energy and personnel that keeps your own infra running?
Is that personnel cost more than running on someone else's infra? Just counting the amount of people a company now need just to maintain their cloud/kubernetes/whatever setup, paired with "devops" meaning all devs now have to spend time on this stuff, I could almost wager we would spend less on personnel if we just chucked a few laptops in a closet and sshed in.
> These sorts of core-density increases are how I win cloud debates in an org.
The core density is bullshit when each core is so slow that it can't do any meaningful work. The reality is that Intel is 3 times behind AMD/TSMC on performance vs power consumption ratio.
People would be better off having a look at the high-frequency models (9xx5F models like the 9575F) - that was the first generation of server CPUs to reach ~5 GHz and sustain it on 32+ cores.
Intel seem to be deliberately hiding the clock frequency of this thing, the xeon-6-plus-product-deck.pdf has no mention of clock frequency or how LLC is shared.
That only works if purchasers in the organisation are immune to kickbacks.
> These sorts of core-density increases are how I win cloud debates in an org.
AMD has had these sorts of densities available for a minute.
> Identify the workloads that haven't scaled in a year.
I have done this math recently, and you need to stop cherry picking and move everything. And build a redundant data center to boot.
Compute is NOT the major issue for this sort of move:
Switching and bandwidth will be major costs. 400Gb is a minimum for interconnects, and most orgs are going to need at least that much bandwidth at top of rack.
Storage remains problematic. You might be able to amortize compute over this time scale, but not storage. 5 years would be pushing it (depending on use). And data center storage at scale was expensive before the recent price spike. Spinning rust is viable for some tasks (backup) but will not cut it for others.
Human capital: Figuring out how to support the hardware you own is going to be far more expensive than you think. You need to expect failures and staff accordingly, that means resources who are going to be, for the most part, idle.
> If we can't bill a customer for it, and it's not scaling regularly, then it shouldn't be in the public cloud. That's my take, anyway. It sucks the wind from the sails of folks gung-ho on the "fringe benefits" of public cloud spend (box seats, junkets, conference tickets, etc...), but the finance teams tend to love such clear numbers.
I agree, but.
For one, it's not just the machines themselves. You also need to budget in power, cooling, space, the cost of providing redundant connectivity and side gear (e.g. routers, firewalls, UPS).
Then, you need a second site, no matter what. At least for backups, ideally as a full failover. Either your second site is some sort of cloud, which can be a PITA to set up without introducing security risks, or a second physical site, which means double the expenses.
If you're a publicly listed company, or live in jurisdictions like Europe, or you want to have cybersecurity insurance, you have data retention, GDPR, SOX and a whole bunch of other compliance to worry about as well. Sure, you can do that on-prem, but you'll have a much harder time explaining to auditors how your system works when it's a bunch of on-prem stuff vs. "here's our AWS Backup plans covering all servers and other data sources, here is the immutability stuff, here are plans how we prevent backup expiry aka legal hold".
Then, all of that needs to be maintained, which means additional staff on payroll, if you own the stuff outright your finance team will whine about depreciation and capex, and you need to have vendors on support contracts just to get firmware updates and timely exchanges for hardware under warranty.
Long story short, as much as I prefer on-prem hardware vs the cloud, particularly given current political tensions - unless you are a 200+ employee shop, the overhead associated with on-prem infrastructure isn't worth it.
> Then, you need a second site, no matter what. At least for backups, ideally as a full failover. Either your second site is some sort of cloud, which can be a PITA to set up without introducing security risks, or a second physical site, which means double the expenses.
You can technically use Backblaze's unlimited backup option, which costs around $7 for a given machine. It's more intended for Windows, but people have made it work with daily backups, and it should be GDPR-compatible (https://www.backblaze.com/company/policy/gdpr) - pair it with something like Hetzner if you're worried about GDPR, or OVH storage boxes (36 TB iirc for ~$55 is a good backup box) - and you should try to follow a 3-2-1 strategy.
> Then, all of that needs to be maintained, which means additional staff on payroll, if you own the stuff outright your finance team will whine about depreciation and capex, and you need to have vendors on support contracts just to get firmware updates and timely exchanges for hardware under warranty.
I can't say for certain, but iirc companies like Dell make products available on a monthly payment basis too, and you can simply colocate in a decent datacenter. Plus points: you can then get 10-50 Gb ports if you're bandwidth hungry, things are a lot more customizable, and the hardware is already pretty nice, as GP observed. (Yes, RAM prices are high; let's hope that's temporary, as GP noted too.)
I can't speak to firmware updates or timely exchanges for hardware under warranty.
That being said, I am not saying this is for everyone. It essentially boils down to whether they have (or can acquire) expertise in this field for less than their AWS bills. With many large AWS bills in the tens of thousands of dollars, if not hundreds of thousands, I think far more companies would be better off with the above strategy than with AWS.
> The only thing giving me pause on that argument today is the current RAM/NAND shortage
Not a shortage - price gouging. And it would mean an increase in 'cloud' prices too, because the providers need to refresh their HW as well. So by the summer the equation should balance back out.
With packages like this (lots of cores, multi-chip packaging, lots of memory channels), the architecture is increasingly a small cluster on a package rather than a monolithic CPU.
I wonder whether the next bottleneck becomes software scheduling rather than silicon - OS/runtimes weren’t really designed with hundreds of cores and complex interconnect topologies in mind.
Yes, there are scheduling issues, NUMA problems, etc. caused by the cluster-in-a-box form factor.
We had a massive performance issue a few years ago that we fixed by mapping our processes to the NUMA zone topology. The default design of our software would otherwise effectively route all memory accesses to the same NUMA zone, and performance went down the drain.
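A minimal sketch of that kind of fix, assuming Linux: read a node's CPU list from sysfs and confine each worker to it with `sched_setaffinity`, so first-touch allocations land on local memory. (`numactl --cpunodebind=N --membind=N` is the shell-level equivalent.)

```python
import os

def parse_cpulist(text: str) -> set[int]:
    """Parse a sysfs cpulist like '0-3,8,10-11' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def pin_to_numa_node(node: int) -> None:
    """Confine the calling process (e.g. one inference worker) to the
    CPUs of a single NUMA node; with first-touch allocation its memory
    then mostly stays on that node's local channels. Linux-specific."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        os.sched_setaffinity(0, parse_cpulist(f.read()))
```

Spawning one worker per node (each calling `pin_to_numa_node` at startup) is the rough Python analogue of the process-to-zone mapping described above.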
Intel contributes to Linux, how is this a problem?
Often the Linux scheduling improvements come a year or two after the chip. Also, Linux makes moment-by-moment scheduling and allocation decisions that are unaware of the big picture of workload requirements.
I don't think there are any fundamental bottlenecks here. There's more scheduling overhead when you have a hundred processes on a single core than if you have a hundred processes on one hundred cores.
The bottlenecks are pretty much hardware-related - thermal, power, memory and other I/O. Because of this, you presumably never get true "288 core" performance out of this - as in, it's not going to mine Bitcoin 288x as fast as a single core. Instead, you have less context-switching overhead with 288 tasks that need to do stuff intermittently, which is how most hardware ends up being used anyway.
Maybe no fundamental bottlenecks but it's easy to accidentally write software that doesn't scale as linearly as it should, e.g. if there's suddenly more lock contention than you were expecting, or in a more extreme case if you have something that's O(n^2) in time or space, where n is core count.
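A toy illustration of the lock-contention point (in Python, where the GIL masks the real cost, but the pattern is identical in C/Go/Rust): a single global lock serializes every increment, while per-thread shards touch shared state only once at the end.

```python
import threading

def count_shared(n_threads, iters):
    """One global counter guarded by one lock: every increment contends,
    so adding cores adds contention rather than throughput."""
    total = [0]
    lock = threading.Lock()
    def work():
        for _ in range(iters):
            with lock:
                total[0] += 1
    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return total[0]

def count_sharded(n_threads, iters):
    """Per-thread counters merged at the end: the hot loop touches only
    thread-local state, so it scales with core count."""
    shards = [0] * n_threads
    def work(i):
        local = 0
        for _ in range(iters):
            local += 1
        shards[i] = local
    threads = [threading.Thread(target=work, args=(i,))
               for i in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return sum(shards)
```

Both produce the same answer; only the scaling behavior differs.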
That's a great point. Linux has introduced io_uring, and I believe that gives us the native primitives to hide latency better?
But that's just one piece of the puzzle, I guess.
I think Linux can handle up to 1024 cores just fine.
AFAIK the mainline limit is 4096 threads. HP sells a server with 32 sockets × 60 cores/socket × 2 threads/core = 3840 threads, so we are pretty close to that limit.
I had no idea we had socket counts so high, do you know where I could find a picture of one?
https://xkcd.com/619/
There definitely are bottlenecks. The one I always think of is the kernel's networking stack. There's no sense in using the kernel TCP stack when you have hundreds of independent workloads. That doesn't make any more sense than it would have made 20 years ago to have an external TCP appliance at the top of your rack. Userspace protocol stacks win.
io_uring?
If anything, uring makes the problem much worse by reducing the cost of one process flooding kernel internals in a single syscall.
> I wonder whether the next bottleneck becomes software scheduling rather than silicon
Yep, scheduling has been a problem for a while. There was an amazing article a few years ago about how the Linux kernel was accidentally hardcoded to 8 cores; you can probably google and find it.
IMO the most interesting problem right now is the cache: you get a cache miss every time a task moves to another core. The problem is, with thousands of threads switching between hundreds of cores every few milliseconds, we're dangerously approaching the point where all the time is spent thrashing and reloading the CPU cache.
I searched for "Linux kernel limited to 8 cores" and found this
https://news.ycombinator.com/item?id=38260935
> This article is clickbait and in no way has the kernel been hardcoded to a maximum of 8 cores.
That's the one. Funny thing, it's not actually clickbait.
The bug made it to the kernel mailing list, where some Intel people looked into it and confirmed there is a bug: the kernel's allocation logic was capped at 8 cores, which leaves a few percent of performance on the table as the number of cores increases and the allocation becomes less and less optimal.
It's a classic tragedy of the commons. CPUs have gotten so complicated that there may only be a handful of people in the world who can comprehend and work on a bug like this.
As a Yocto enthusiast, I am curious how much elapsed real time a clean Yocto build would need. Yocto is thread-heavy, so with 288 cores, it oughta be good.
Sure looks like a lot of glue holding that CPU together :)
As soon as I read chiplets I thought about this too! Glad even intel agrees that chiplet architecture is the way forward.
Helped a friend make a difficult career decision (cozy job vs something hard and new + moving to a new city) that ultimately ended up with him working on the project. Glad that happened. I love to see people grow.
A bad moment to have a make-or-break moment for your CPU business - a lot of customers will probably hold off purchases right now because of the RAM prices, no matter how good your CPU might be.
Isn't this new server CPU a drop-in replacement though? So the DC could pull out the old CPU, drop in the new one, and not touch the existing RAM setup, yet be able to deliver better performance within the limits of the existing RAM. Then once RAM prices drop (okay, that might be a while), separately upgrade the RAM at a different time.
If you have enough cores, you could pool the L1 together for makeshift RAM!
One day I hope to be rich enough to put a CPU like this (with proportional RAM and storage) in my proxmox cluster.
Some of the AMD offerings like this on Ebay are pretty close to affordable! It's the RAM that's killer these days...
I still regret not buying 1TB of RAM back in ~October...
I bought a bundle with 512GB of RAM and an older 24-core EPYC (7F72) + supermicro motherboard on ebay a bit over a year ago, it was really an amazing deal and has made for a truly nice NAS. If you're okay with stuff that's old enough that you can buy decommissioned server stuff, you can get really high-quality gear at surprisingly low prices.
Companies decommission hardware on a schedule after all, not when it stops working.
EDIT: Though looking for similar deals now, I can only find ones up to 128GB RAM and they're near twice the price I paid. I got 7F72 + motherboard + 512GB DDR4 for $1488 (uh, I swear that's what I paid, $1488.03. Didn't notice the 1488 before.) The closest I can find now is 7F72 + motherboard + 128GB DDR4 for over $2500. That's awful
I've heard it claimed that the era of being able to do this (buy slightly old used server hardware cheap on ebay) is coming to an end because, in the quest for ever more efficiency, the latest server hardware is no longer compatible with off-the-shelf power supplies etc. (there was more but that's the part that I remember) and therefore won't have any value on the second hand market.
I hope it was wrong, but it seems at least plausible to me. I'm sure that probably fixes could be made for all these issues, but the reason the current paradigm works is that, other than the motherboard and CPU, everything else you need is standard, consumer grade equipment which is therefore cheap. If you need to start buying custom (new) power supplies etc. to go along, then the price may not make as much sense anymore.
I'm curious, what is the power draw for such a system? Of course, it heavily depends on the disks, but does it idle under 200W?
I personally feel like I will downscale my homelab hardware to reduce its power draw. My HW is rather old (and leagues below yours); more recent HW tends to be more efficient, but I have no idea how well these high-end server boards can lower their idle power consumption.
RAM! (And NAND SSDs too now, probably...)
When I was looking in October, I hadn't bought hardware for the better part of a decade, and I saw all these older posts on forums for DDR4 at $1/GB, but the lowest I could find was at least $2/GB used. These days? HAH!
If I had a decent sales channel I might be speculating on DDR4/DDR5 RAM and holding it because I expect prices to climb even higher in the coming months.
AMD also has some weird cpus like the 7c13 7r13, that are way way way below their normal price bands. You don't even have to buy used to get a ridiculous systems... Until 4 months ago (RIP ram prices). https://www.servethehome.com/amd-epyc-7c13-is-a-surprisingly...
Wait long enough and these will be cheap on eBay.
By that point we'll be desiring the new 1000 core count CPUs though.
Do you remember what you dreamed about 7 years ago? An Ampere Altra 80-core-CPU was sold for less than 210€ on eBay in January.
Oh, nice! I always wanted one of those; a many-core build server running ARM would be excellent for Yocto. Anything running in QEMU during the rootfs build is so slow on x86, and I've seen the rootfs postprocess step take a long time.
Though... these days, getting enough RAM to support builds across 80 cores would be twice the price of the whole rest of the system I'm guessing.
Aside from the memory cost being exorbitant, 4th/5th gen ES CPUs aren’t horribly expensive for the core count you get. 8480s and 8592s have been quite accessible.
Stuffed an 8480+ ES with 192GB of memory across 8 channels and it's actually not too bad.
This is the 2026 version of "I need a beowulf cluster of these".
‘Can you imagine a Beowulf cluster of these’
Just give it a few years and you'll be able to buy the thing for a fraction of the 'current' price. By that time it will be considered to be 'slow' and 'power-hungry' and people will wonder why you're intent on running older hardware but it'll still work just fine. The DL380 G7 under the stairs here also used to cost an arm and a leg while I got it for some finger nail clippings.
> with proportional RAM and storage
Let's not get carried away here
Yeah this is their make or break moment.
Because if this is not thunder Intel will default.
I promise you. Heard it from some youtuber as well, trust me.
I've not kept up with Intel in a while, but one thing that stood out to me is these are all E-cores, meaning no hyperthreading. Is something like this competitive, or preferred, in certain applications? Also, does anyone know if there have been any benchmarks against AMD's 192-core EPYC CPU?
In HPC, like physics simulation, they are preferred; there's almost no benefit from HT. What's also preferred is high clock frequencies, and these high-core-count CPUs nerf their clock frequencies.
Without the hyperthreading (E-cores) you get more consistent performance between running tasks, and cloud providers like this because they sell "vCPUs" that should not fluctuate when someone else starts a heavy workload.
Sort of. They can just sell even numbers of vCPUs, and dedicate each hyper-thread pair to the same tenant. That prevents another tenant from creating hyper-threading contention for you.
OP is probably talking about shared vCPUs, not dedicated
For those, wouldn't hyperthreading be a win? Some fraction of the time, you'd get evicted to the hyperthread that shares your L1 cache (and the hypervisor could strongly favor that).
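For what it's worth, the sibling pairing described above can be derived straight from sysfs. Here's a sketch of parsing the kernel's CPU-list format — the one used by files like /sys/devices/system/cpu/cpuN/topology/thread_siblings_list — plus a hypothetical helper for deduplicating sibling groups so each pair can be handed to one tenant:

```python
def parse_cpu_list(s):
    """Parse the kernel's CPU-list format, e.g. "0-3,8,10-11"
    -> [0, 1, 2, 3, 8, 10, 11]."""
    cpus = []
    for part in s.strip().split(','):
        if '-' in part:
            lo, hi = part.split('-')
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def sibling_groups(sibling_lists):
    """Deduplicate per-CPU sibling lists into unique hyper-thread groups,
    so a scheduler can allocate whole pairs to the same tenant.

    sibling_lists would be the contents of each CPU's
    thread_siblings_list file (every sibling reports the same group).
    """
    seen, groups = set(), []
    for s in sibling_lists:
        cpus = tuple(parse_cpu_list(s))
        if cpus not in seen:
            seen.add(cpus)
            groups.append(cpus)
    return groups
```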
"Is something like this competitive, or preferred, in certain applications?"
They cite a very specific use case in the linked story: virtualized RAN. This is using COTS hardware and software for the control plane of a 5G+ cell network operation. A large number of fast, low-power cores would indeed suit such an application, where large numbers of network nodes are coordinated in near real time.
It's entirely possible that this is the key use case for this device: 5G networks are huge money makers and integrators will pay full retail for bulk quantities of such devices fresh out of the foundry.
Is RAM a concern in these cluster applications? Because if prices stay up, how do you get these off the shelf when you also need terabytes of memory?
E core vs P core is an internal power struggle between two design teams that looks on the surface like ARM’s big.LITTLE approach
E cores ruined P cores by forcing the removal of AVX-512 from consumer P cores
Which is why I used AMD in my last desktop computer build
E cores didn't just ruin P cores, it ruined AVX-512 altogether. We were getting so close to near-universal AVX-512 support; enough to bother actually writing AVX-512 versions of things. Then, Intel killed it.
That's finally set to be resolved with Nova Lake later this year, which will support AVX10 (the new iteration of AVX512) across both core types. Better very late than never.
I love the AVX512 support in Zen 5 but the lack of Valgrind support for many of the AVX512 instructions frustrates me almost daily. I have to maintain a separate environment for compiling and testing because of it.
I guess it competes with the like of Ampere's ARM servers? I'm sure there are use cases for lots and lots of weak cores, in telecom especially.
It all depends on your exact workload, and I’ll wait to see benchmarks before making any confident claims, but in general if you have two threads of execution which are fine on an E-core, it’s better to actually put them on two E-cores than one hyperthreaded P-core.
I don't know the nitty-gritty of why, but some compute intensive tasks don't benefit from hyperthreading. If the processor is destined for those tasks, you may as well use that silicon for something actually useful.
https://www.comsol.com/support/knowledgebase/1096
Yeah, if you are running COMSOL you need real cores + high clock frequency + high memory bandwidth.
Gaming CPUs and some EPYCs are the best
For an application like a build server, the only metric that really matters is total integer compute per dollar and per watt. When I compile e.g. a Yocto project, I don't care whether a single core compiles a single C file in a millisecond or a minute; I care how fast the whole machine compiles what's probably hundreds of thousands of source files. If E-cores give me more compute per dollar and watt than P-cores, give me E-cores.
Of course, having fewer faster cores does have the benefit that you require less RAM... Not a big deal before, you could get 512GB or 1TB of RAM fairly cheap, but these days it might actually matter? But then at the same time, if two E-cores are more powerful than one hyperthreaded P-core, maybe you actually save RAM by using E-cores? Hyperthreading is, after all, only a benefit if you spawn one compiler process per CPU thread rather than per core.
EDIT: Why in the world would someone downvote this perspective? I'm not even mad, just confused
Yocto's for embedded projects though, right?
I imagine that means less C++/Rust than most, which means much less time spent serialized on the linker / cross compilation unit optimizer.
It's for building embedded Linux distros, and your typical Linux distro contains quite a lot of C++ and Rust code these days (especially if you include, say, a browser, or Qt). But you have parallelism across packages, so even if one core is busy doing a serial linking step, the rest of your cores are busy compiling other packages (or maybe even linking other packages).
That said, there are sequential steps in Yocto builds too, notably installing packages into the rootfs (it uses dpkg, opkg or rpm, all of which are sequential) and any code you have in the rootfs postprocessing step. These steps usually aren't a significant part of a clean build, but can be a quite substantial part of incremental builds.
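Those sequential steps are exactly where Amdahl's law bites on a 288-core box. A quick back-of-the-envelope (the 1% serial fraction below is an assumed number for illustration, not a measured Yocto figure):

```python
def amdahl_speedup(serial_fraction, cores):
    """Amdahl's law: overall speedup from parallelizing everything
    except the serial fraction of the work."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# Even 1% serial work (rootfs assembly, final link, etc.) caps a
# 288-core machine well below 288x:
s288 = amdahl_speedup(0.01, 288)   # ~74x
s_limit = 1.0 / 0.01               # cap as cores -> infinity: 100x
```

So past a point, shaving the serial steps matters more than adding cores.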
I think some of why is size on die. 288 E cores vs 72 P cores.
Also, there have been so many hyperthreading vulnerabilities lately that it's often disabled on hyperthreaded data center deployments anyway, so I'd imagine this de-risks that entirely.
It's a trade off. Hyperthreading takes up space on the die and the power budget.
As to E core itself - it's ARM's playbook.
Core density plus power makes so many things worthwhile. Generally human cost of managing hardware scales with number of components under management. CPUs very reliable. So once you get lots of CPU and RAM on single machine you can run with very few.
But right pricing hardware is hard if you’re small shop. My mind is hard-locked onto Epyc processors without thought. 9755 on eBay is cheap as balls. Infinity cores!
Problem with hardware is lead time etc. cloud can spin up immediately. Great for experimentation. Organizationally useful. If your teams have to go through IT to provision machine and IT have to go through finance so that spend is reliable, everybody slows down too much. You can’t just spin up next product.
But if you’re small shop having some Kubernetes on rack is maybe $15k one time and $1.2k on going per month. Very cheap and you get lots and lots of compute!
Previously skillset was required. These days you plug Ethernet port, turn on Claude Code dangerously skip permissions “write a bash script that is idempotent that configures my Mikrotik CCR, it’s on IP $x on interface $y”. Hotspot on. Cold air blowing on face from overhead coolers. 5 minutes later run script without looking. Everything comes up.
Still, foolish to do on prem by default perhaps (now that I think about it): if you have cloud egress you’re dead, compliance story requires interconnect to be well designed. More complicated than just basics. You need to know a little before it makes sense.
Feel like reasoning LLM. I now have opposite position.
Am I the only one disappointed they didn't settle for 286 cores?
During the 8th gen they made an i7-8086... Hopefully Intel hasn't fired that person.
8086K, actually. I still run one inside one of my PCs!
At least you got the Intel® Core™ Ultra 9 Processor 386H :)
I wonder if they can bin out ones that have a dead core or two specifically for this purpose.
So TLDR is it competitive?
What are the dimensions and dynamics here vs EPYC?
This is really what I want to understand. Where can we see real world performance benchmarks?
Phoronix should have them soon. Or if they don't it means the performance is bad.
Not competitive at all. It's easily visible on the laptop lines, where the same GPU manufactured on TSMC has 3 times the power/performance ratio compared to the Intel one.
Putting more cores in is just another desperate move to game the benchmark. Power is roughly quadratic with frequency, so every time you fall behind the competition, you can double the number of cores and reduce the frequency by a factor of 1.414 (√2) to compensate.
Repeat a few times and you get a CPU with hundreds of cores, but each core is so slow it can hardly do any work.
??? GPU vs CPU workloads are completely different. Comparing Panther Lake iGPU vs Ryzen iGPU is not going to tell you much about how high density server CPU performance will work out.
The Panther Lake vs Ryzen laptop performance comparisons show that Panther Lake does well, basically trading against top-end Ryzen AI laptop chips in both absolute performance and performance per watt.
If you're not aware, Intel has released a lineup of laptops, with some models having the GPU made by them and some having the same GPU made by TSMC. That makes the comparison very direct. TSMC can deliver nearly 3 times the power/performance.
GPU and CPU manufacturing is the same thing, same node, same result. GPU is always maximizing perf/power ratio because it's embarrassingly parallel, leaving no room to game the benchmark. CPU can be gamed by having a single fast core, that drops performance in half as soon as you use another core.
What do you need so many cores for? Apache threads? Any old-school wizards here?
I used to run many hosts with 28 cores per host. If performance scales, it's nicer to have a few 288 core hosts rather than a few hundred 28 core hosts.
Getting the performance to scale can be hard, of course. The less inter-core communication, the better. Things that tend to work well are either workloads where a bunch of data comes in and a single thread works on it for a significant amount of time before shipping the result, or workloads where you can rely on the NIC(s) to split traffic and process the network queue for a connection on the same core that handles the userspace stuff (see Receive Side Scaling) - but you need a fancy NIC to have 288 network queues.
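The RSS idea boils down to hashing the connection 4-tuple to a fixed queue. Real NICs use a Toeplitz hash with a programmable key; this sketch substitutes an ordinary hash, which still preserves the property that matters (flow affinity — every packet of one connection lands on the same queue/core):

```python
import hashlib

def rss_queue(src_ip, src_port, dst_ip, dst_port, n_queues):
    """Map a connection 4-tuple to a fixed RX queue index.

    Deterministic by construction: the same 4-tuple always hashes to
    the same queue, so one core handles the whole flow and no
    cross-core handoff is needed in the hot path.
    """
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_queues
```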
Host it in Proxmox, run 8 different services on it, each with 32 cores.
These almost always run many smaller virtual machines on top of a hypervisor. The target market is large enterprise or hyperscalers like the public clouds, Meta, etc...
Yeah, virtualization, many (small) containers / VMs.
data processing
If 18A is Intel's make-or-break, it's a break. Their next node looks promising.
Meanwhile, somebody put 8192 arm cores on a chip and ran a risc-v emulator on top of that which emulated a 6502 which then emulated a 288 core xeon and it used 0.01% of the power and outperformed the Intel chip in every other metric 10:1, probably.
You know, a link would be great for this comment.
Well Linux was booted on an Intel 4004, emulating a MIPS R3000. Looks like it booted in 4.76 days. I don't believe this article was AI fabricated.
https://arstechnica.com/gadgets/2024/09/hacker-boots-linux-o...
Somehow, that still doesn't sound real, but it looks like it is. Wow. Though that one was written by their recently fired hallucination writer.
Too risky.
https://theonion.com, probably
Ah, nice to see a fellow lover of the finest news publication on the planet.
Only slightly related, but six years ago I was able to run 400 ZX Spectrum (Z80) emulator instances simultaneously on an AWS graphics workstation.
https://youtu.be/BjeVzEQW4C8?si=0I7UGU0Xz5WUT4ek
I remember that. Neat stuff.
So, they're selling this as an AI accelerator, with drop in compatibility with existing boards, and no boost to RAM bandwidth.
As I understand things, it would be extremely unusual to ship a chip that was bound by floating point throughput, not uncached memory access, especially in the desktop/laptop space.
I haven't been following the Intel server space too carefully, so it's an honest question: Was the old thing compute and not bandwidth limited, or is this going to be running inference at the same throughput (though maybe with lower power consumption)?
No, they're not selling this as an "AI accelerator":
Here is the quote:
"The company says operators deploying 5G Advanced and future 6G networks increasingly rely on server CPUs for virtualized RAN and edge AI inference, as they do not want to re-architect their data centers in a bid to accommodate AI accelerators."
Edge AI usually means very small models that run fine on CPUs.
A very small model is going to be, what, 8GB? That'll easily blow through the caches. You're going to end up bottlenecked on DRAM either way.
So, I wonder if this is going to be any faster than the previous generation for edge AI.
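Back-of-the-envelope for why DRAM is the ceiling here: in bandwidth-bound autoregressive decoding, each generated token streams roughly the whole model through memory once, so the core count drops out entirely. (The 8 GB model and ~300 GB/s figures below are assumptions for illustration, not specs of this part.)

```python
def tokens_per_second(model_bytes, mem_bandwidth_bytes_per_s):
    """Upper bound on single-stream decode throughput when the
    workload is memory-bandwidth bound: every token requires reading
    (roughly) all model weights from DRAM once."""
    return mem_bandwidth_bytes_per_s / model_bytes

# Assumed: an 8 GB quantized model on a socket with ~300 GB/s of
# DRAM bandwidth (e.g. 8 channels of DDR5).
tps = tokens_per_second(8e9, 300e9)   # ~37.5 tokens/s, with 72 or 288 cores
```

If the old part was already saturating the memory controllers, more cores buy you batch capacity, not per-stream speed.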