I really love Oxide to an unhealthy degree (it's become a bit of a meme among my colleagues), but sometimes I do wonder whether they went about their go-to-market the right way. They really tried to do everything at once: custom servers, custom router, custom rack, everything. Their accomplishments are technologically impressive but, to somebody who is in a position to make purchasing decisions, not economically attractive. They're 3x more expensive than our existing hardware, two generations behind (I'm aware they're on track for a refresh) and don't have any GPUs. What I would have loved to see, for example, is just an after-market BMC/NIC/firmware solution using their stack. Plug it into a cheap Gigabyte system (their BMC is pluggable and their NIC is OCP) and just have the control plane manage it as a whole box. I'd have easily paid several thousand dollars per server just for that. All the rack-scale integration, virtualization, migration, network storage, etc. is cool, but not everyone needs it. Get your foot in the door at customers, build up some volume for better deals with AMD, and then start building the custom rack stuff... Of course, it's easy to be a critic from the sidelines. As I said, I really do love what the Oxide folks are doing, I just really hope it'll become possible for me to buy their gear at some point.
Oxide are doing great work. Hoping they can probe the market a bit more for us out on the sidelines preparing to drop in and compete with some similar tech.
I'd also wish I could get to play around with a cheaper version of their tech, but they probably have enough customers that really want a large-scale solution that is completely customizable.
> When we started Oxide, the DC bus bar stood as one of the most glaring differences between the rack-scale machines at the hyperscalers and the rack-and-stack servers that the rest of the market was stuck with. That a relatively simple piece of copper was unavailable to commercial buyers
It seems that Oxide was founded in 2019, and the Open Compute Project had been specifying DC bus bars for 6 years at that point. People could purchase racks if they wanted, but it seems like, by and large, people didn't care enough to go whole hog on it.
Wonder if the economics have changed or if it's still just neat but won't move the needle.
Part of the issue is that you simply can't buy OCP hardware, not new anyway. What you're going to find is "OCP-inspired" hardware that has some overlap with the full OCP specification but is almost always meant to run on 240VAC in 19in racks, because nobody wants to invest money in something that can't be bought from CDW.
I remember the one time I had OCP hardware in a data center, and how it was essentially rumoured that it was better not to ask too much about how it got there. Not at the level of "fell off a truck", but there was some possibility it was ex-(big tech) equipment acquired through favours, or through some really insistent negotiating with Quanta until "to be sold to (big tech)" racks ended up with us.
It's normally incredibly difficult for employees to disrupt at massive companies that would be the type which runs a data center. Disruption usually enters the corp in a sales deck, much like the one Oxide would have.
Yes. I think as an engineer at this level you need to also have the patience to deal with the bean counters.
But as I’ve grown in my career I’ve actually found that line of thinking refreshing. Can you quantify benefit? If it requires too many assumptions it’s probably not worth it.
But then again, there's always the VP or the SVP who wants to "showcase his tower's innovative spirit", and there goes money that could be used for better things. The innovative spirit of the day is random LLM apps.
Things like -48VDC bus bars in the 'telco' world significantly predate the OCP, all the way back to like 1952 in the Bell system.
In general, the telco world concept hasn't changed much. You have AC grid power coming from your local utility into some BIG ASS RECTIFIERS which create -48VDC (and are responsible for charging your BIG ASS BATTERY BANK to float voltage), then various DC fuses/breakers going to distribution of -48VDC bus bars powering the equipment in a CO.
Re: Open Compute, the general concept of what they did was go to a bunch of 1U/2U server power supply manufacturers and get them to make a series of 48VDC-to-12VDC power supplies (which can be 92%+ efficient), and cut out the need for legacy 5VDC feed from power supply into ATX-derived-design x86-64 motherboards.
OCP hardware is only really accessible to hyperscalers. You can't go out and just buy a rack or two, the Taiwanese OEMs don't do direct deals that small. Even if they did, no integration is done for you. You would have to integrate the compute hardware from one company, the network fabric from another company, and then the OS and everything else from yet another. That's a lot of risk, a lot of engineering resources, a lot of procurement overhead, and a lot of different vendors pointing fingers at each other when something doesn't work.
If you're Amazon or Google, you can do this stuff yourself. If you're a normal company, you probably won't have the inhouse expertise.
On the other hand, Oxide sells a turnkey IaaS platform that you can just roll off the pallet, plug in and start using immediately. You only need to pay one company, and you have one company to yell at if something goes wrong.
You can buy a rack of 1-2U machines from Dell, HPE or Cisco with VMware or some other HCI platform, but you don't get that power efficiency or the really nice control plane Oxide have on their platform.
But isn't it a little surprising (I'm not an expert) that Dell or Supermicro or some firm like that hadn't already started offering approachable access to either OCP gear or a proprietary knockoff of it? Presumably that may still happen if Oxide is seen to have proven the market.
Azure tried this, not with their hyperscaler stuff, but with Azure Operator Nexus.
Basically an "opinionated" combination of Dell, Arista, and Pure storage with a special Azure AKS running on top and a metric ton of management and orchestration smarts. The target customer base was telcos who needed local capabilities in their data centers and who might otherwise have gone to OCP.
As far as I can surmise, it's dead, but not EOLed. Microsoft nuked the operator business unit earlier in the year, and judging by recent job postings from contract shops, AT&T might be the only customer.
I believe the telcos did DC power for years, so I don't think this is anything new. Any old hands out there want to school us on how it was done in the old days?
Every old telco technician had a story about dropping a wrench on a busbar or other bare piece of high powered transmission equipment and having to shut that center down, get out the heavy equipment, and cut it off because the wrench had been welded to the bus bars.
Note that the rack doesn't accept DC input, like lots of (e.g., NEBS certified) telco equipment. There's a bus bar, but it's enclosed within the rack itself. The rack takes single- or three-phase AC inputs to power the rectifiers, which are then attached to the internal bus bar.
huge gauge copper cables going around a central office (google "telcoflex IV")
big DC breaker/fuse panels
specialized dc fuse panels for power distribution at the top of racks, using little tiny fuses
100% overhead steel ladder rack type cable trays, since your typical telco CO was never a raised floor type environment (UNLIKE legacy 1960s/1970s mainframe computer rooms), so all the power was kept accessible by a team of people working on stepladders.
The same general thing continues today in serious telco/ISP operations, with tech features to bring it into the modern era. The rectifiers are modular now, and there's also rectiverters. Monitoring is much better. People are moving rapidly away from wet cell 2V lead acid battery banks and AGM sealed lead acid stuff to LiFePO4 battery systems.
DC fuse panels can come with network-based monitoring, ability to turn on/off devices remotely.
equipment is a whole lot less power hungry now, a telco CO that has decommed a 5ESS will find itself with a ton of empty thermal and power budget.
when I say serious telco stuff is a lot less power hungry, it's by huge margins. A randomly chosen example: radio transport equipment. Back in the day, a powerful, very expensive point-to-point microwave radio system might be a full 42U rack, an 800W load, with waveguide going out to antennas on a roof. It would carry one, two or three DS3s' worth of capacity (45 Mbps each).
now, that same telco might have a radio on its CO roof in the same microwave bands that is 1.3 Gbps FDD capacity, pure ethernet with a SFP+ fiber interface built into it, and the whole radio is a 40W electrical load. The radio is mounted directly on the antenna with some UV/IR resistant weatherproof 16 gauge DC power cable running down into the CO and plugged into a fuse panel.
Their tech may be more than adequate today. Bigger businesses may not buy from a small startup company. They expect a lot more. Illumos is a less popular OS. It wouldn't be the first choice for the OS I'd rely on. Who writes the security mitigations for speculative execution bugs? Who patches CVEs in the shipped software which doesn't use Rust?
The answer to "who does X" is Oxide. That's the point. You're not going to Dell who's integrating multiple vendors in the same box in a way that "should" work. You're getting a rack where everything is designed to work together from top to bottom.
The goal is that you can email Oxide and they'll be able to fix it regardless of where it is in the stack, even down to the processor ROM.
If you want on prem infra in exactly the shape and form Oxide delivers*
I've read and understood from Joyent and SmartOS that they believe fault-tolerant block devices/filesystems are the wrong abstraction: your software should handle losing storage.
We do not put the onus on customers to tolerate data loss. Our storage is redundant and spread through the rack so that if you lose drives or even an entire computer, your data is still safe.
https://oxide.computer/product/storage
And a big enough customer will evaluate Oxide's resources and consider for themselves whether they think Oxide can provide a quick enough turnaround for everything. That's what GP is talking about.
> Who writes the security mitigations for speculative execution bugs? Who patches CVEs in the shipped software which doesn't use Rust?
Oxide.
This is all a pre-canned solution: just use the API like you would an off-prem cloud. Do you worry about AWS patching stuff? And how many people purchasing 'traditional' servers from Dell/HPE/Lenovo worry about patching things like the LOM?
Further, all of Oxide's stuff is on Github, so you're in better shape for old stuff, whereas if the traditional server vendors EO(S)L something firmware-wise you have no recourse.
How much did Shopify buy? Sounds like from what the CEO is saying they bought 1 unit.
>We learned that Oxide has so far shipped “under 20 racks,” which illustrates the selective markets its powerful systems are aimed at.
>B&F understands most of those systems were deployed as single units at customer sites. Therefore, Oxide hopes these and new customers will scale up their operations in response to positive outcomes.
Yikes. If they sold 20 racks in July, how many are they up to now?
We write the security mitigations. We patch the CVEs. Oxide employs many, perhaps most, of the currently active illumos maintainers --- although I don't work on the illumos kernel personally, I talk to those folks every day.
A big part of what we're offering our customers is the promise that there's one vendor who's responsible for everything in the rack. We want to be the responsible party for all the software we ship, whether it's firmware, the host operating system, the hypervisor, and everything else. Arguably, the promise that there's one vendor you can yell at for everything is a more important differentiator for us than any particular technical aspect of our hardware or software.
Because they use such esoteric software that you'll forever be reliant on Oxide.
I'd rather they use more standardized open source software like Linux, Talos, k8s, Ceph, KubeVirt. Instead of rolling it all themselves on an OS that has a very small niche ecosystem.
Oxide is providing an x86 platform to run VMs/containers on. That's a commoditized market.
The value they're offering is that the rack-level consumption and management is improved over the competition, but you should be able to run whatever you want on the actual compute, k8s or whatnot.
This also means you'd not be forever reliant on Oxide.
> > The power shelf distributes DC power up and down the rack via a bus bar. This eliminates the 70 total AC power supplies found in an equivalent legacy server rack with 32 servers, two top-of-rack switches, and one out-of-band switch, each with two AC power supplies
This creates a single point of failure, trading robustness for efficiency. There's nothing wrong with that, but software/ops might have to accommodate by making the opposite tradeoff. In general, the cost savings advertised by cloud infrastructure should be more holistic.
>This creates a single point of failure, trading robustness for efficiency. There's nothing wrong with that, but software/ops might have to accommodate by making the opposite tradeoff.
I'll happily take a single high quality power supply (which may have internal redundancy FWIW) over 70 much more cheaply made power supplies that stress other parts of my datacenter via sheer inefficiency, and also cost more in aggregate. Nobody drives down the highway with 10 spare tires for their SUV.
A DC busbar can propagate a short circuit across the rack, and DC circuit protection is harder than AC. So of course each server now needs its own current limiter, or a cheap fuse.
But I’m not debating the merits of this engineering tradeoff - which seems fine, and pretty widely adopted - just its advertisement. The healthcare industry understands the importance of assessing clinical endpoints (like mortality) rather than surrogate measures (like lab results). Whenever we replace “legacy” with “cloud”, it’d be nice to estimate the change in TCO.
Let's say your high-quality supply's yearly failure rate is 100 times less than the cheap ones'.

The probability of at least one failure among the 70 cheap supplies is 1-(1-r)^70, which is quite high even without considering the higher quality of the single supply.

The probability of all 70 going down is r^70, which is absurdly low.

Let's say r = 0.05, i.e. one failure per 20 supplies in a year:

1-(1-r)^70 ≈ 97%

r^70 < 1E-91

The high-quality supply has r = 0.0005, in between no failure and all failing. If your code can handle node failure, many cheaper supplies appear to be more robust.
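The arithmetic above is easy to check numerically; a quick sketch in plain Python, using the same assumed failure rates (r = 0.05 for a cheap supply, 100x lower for the high-quality one):

```python
# Yearly failure probabilities for the redundancy argument above.
r_cheap = 0.05           # assumed yearly failure rate of one cheap supply
r_good = r_cheap / 100   # high-quality supply, 100x lower rate
n = 70                   # cheap AC supplies in a legacy rack

p_any_cheap_fails = 1 - (1 - r_cheap) ** n   # at least one of 70 fails
p_all_cheap_fail = r_cheap ** n              # total loss of all 70

print(f"P(at least one cheap PSU fails) = {p_any_cheap_fails:.2%}")  # ~97%
print(f"P(all 70 cheap PSUs fail)       = {p_all_cheap_fail:.1e}")   # ~8e-92
print(f"P(single good PSU fails)        = {r_good}")                 # 0.0005
```

So with the assumed rates you will almost certainly swap a cheap supply every year, but a whole-rack outage from PSUs alone is astronomically unlikely, whereas the single good supply takes the rack down with probability 0.0005.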
Yeah, but the failure rate of an analog piece of copper is pretty low; it'll keep being copper unless you do stupid things. And you'll have multiple power supplies providing power on the same piece of copper.
The big piece of copper is fed by redundant rectifiers. Each power shelf has six independent rectifiers which are 5+1 redundant if the rack is fully loaded with compute sleds, or 3+3 redundant if the rack is half-populated. Customers who want more redundancy can also have a second power shelf with six more rectifiers.
The bus bar itself is an SPoF, but it's also just dumb copper. That doesn't mean that nothing can go wrong, but it's pretty far into the tail of the failure distribution.
The power shelf that keeps the busbar fed will have multiple rectifiers, often with at least N+1 redundancy so that you can have a rectifier fail and swap it without the rack itself failing. Similar things apply to the battery shelves.
It's also plausible to have multiple power supplies feeding the same bus bar in parallel (if they're designed to support this) e.g. one at each end of a row.
This is how our rack works (Oxide employee). In each power shelf, there are 6 power supplies and only 5 need to be functional to run at full load. If you want even more redundancy, you can use both power shelves with independent power feeds to each so even if you lose a feed, the rack still has 5+1 redundant power supplies.
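For a rough feel of what 5+1 buys you: if each rectifier fails independently with some yearly probability p (p here is a made-up illustrative number, not an Oxide figure), the shelf only loses power when two or more of the six fail:

```python
from math import comb

def p_shelf_down(p, n=6, need=5):
    # Shelf fails when fewer than `need` of `n` rectifiers work,
    # i.e. when more than n - need of them have failed (binomial tail).
    max_tolerable = n - need
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(max_tolerable + 1, n + 1))

p = 0.02  # assumed yearly failure probability per rectifier (illustrative)
print(f"P(one given rectifier fails) = {p:.2%}")
print(f"P(5+1 shelf loses power)     = {p_shelf_down(p):.2e}")
```

With independent failures, requiring two simultaneous faults drops the outage probability by roughly another factor of p, and a second power shelf on an independent feed pushes it further out still.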
This isn't even remotely close. Unless all 32 servers have redundant AC power feeds present, you've traded one single point of failure for another single point of failure.
In the event that all 32 servers had redundant AC power feeds, you could just install a pair of redundant DC power feeds.
It's highly dependent on the individual server model and quite often how you spec it too. Most 1U Dell machines I worked with in the past only had a single slot for a PSU, whereas the beefier 2U (and above) machines generally came with 2 PSUs.
Rack servers have two PSUs because enterprise buyers are gullible and will buy anything. Generally what happens in the case of a single PSU failure is that the other PSU also fails, or it asserts PROCHOT, which means that instead of a cleanly hard-down server you have a slow server derping along at 400MHz, which is worse in every possible way.
The whole thing with eliminating 70 discrete 1U server size AC-to-DC power supplies is nothing new. It's the same general concept as the power distribution unit in the center of an open compute platform rack design from 10+ years ago.
Everyone who's doing serious datacenter stuff at scale knows that one of the absolute least efficient, labor intensive and cabling intensive/annoying ways of powering stuff is to have something like a 42U cabinet with 36 servers in it, each of them with dual power supplies, with power leads going to a pair of 208V 30A vertical PDUs in the rear of the cabinet. It gets ugly fast in terms of efficiency.
The single point of failure isn't really a problem as long as the software is architected to be tolerant of the disappearance of an entire node (mapping to a single motherboard that is a single or dual cpu socket config with a ton of DDR4 on it).
They do have a good point here. If you do the total power budget on a typical 1U (discrete chassis, not blade) server, which is packed with a wall of 40mm fans pushing air, the highest-speed screaming 40mm 12VDC fans can be a 20W electrical load each. It's easy to "spend" at least 120W at maximum heat from the CPUs, in a dual-socket system, just on the fans pulling air from the front/cold side of the server through to the rear heat exhaust.
Just going up to 60mm or 80mm standard-size DC fans can be a huge efficiency increase in watt-hours spent per cubic meter of air moved per hour.
I am extremely skeptical of the "12x" but using larger fans is more efficient.
from the URL linked:
> Bigger fans = bigger efficiency gains
Oxide server sleds are designed to a custom form factor to accommodate larger fans than legacy servers typically use. These fans can move more air more efficiently, cooling the systems using 12x less energy than legacy servers, which each contain as many as 7 fans, which must work much harder to move air over system components.
FWIW, we had to have the idle speed of our fans lowered because the usual idle of around 5k RPM was WAY too much cooling. We generally run our fans at around 2.5kRPM (barely above idle). This is due to not only the larger fans, but also the fact that we optimized and prioritized as little restriction on airflow as possible. If you’ve taken apart a current gen 1U/2U server and then compare that to how little our airflow is restricted and how little our fans have to work, the 12X reduction becomes a bit clearer.
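The standard fan affinity laws give a feel for why halving RPM helps so much: airflow scales roughly linearly with speed, but fan power scales roughly with its cube. A back-of-the-envelope sketch using the RPM figures above (rule-of-thumb physics, not Oxide's measurements):

```python
def fan_power_ratio(rpm_new, rpm_old):
    # Fan affinity laws: flow ~ N, pressure ~ N^2, power ~ N^3
    return (rpm_new / rpm_old) ** 3

# Running at 2.5k RPM instead of a typical 5k RPM idle
ratio = fan_power_ratio(2500, 5000)
print(f"Fan power vs. 5k RPM: {ratio:.3f} ({1/ratio:.0f}x less)")
```

That cube law alone gets you roughly 8x; combine it with bigger, inherently more efficient fans and a low-restriction airflow path and a double-digit multiplier becomes plausible.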
Not Oxide or Bluesky, but firstly I'd suggest that asking the company about their customers is unlikely to get a response, most companies don't disclose their customers. Secondly, Bluesky have been growing quickly, I can only assume their hardware is too, and that means long lead time products like an Oxide rack aren't going to work, especially when you can have an off the shelf machine from Dell delivered in a few days.
Oxide is very open, we are happy to talk about customers that allow us to talk about them. Some don’t want to, others are very happy to be mentioned, just like any other company.
> we are happy to talk about customers that allow us to talk about them
This is what I meant by "don't disclose". I didn't mean that Oxide was in any way secretive, but that usually this stuff doesn't get agreed, and it would make more sense to ask the customer rather than the company selling, as Oxide won't want to disclose unless there's already an agreement in place (formal or otherwise).
In my head I'm imagining an average landing page. They slap their customers on there like stickers. I doubt Bluesky would stay secretive about using Oxide if they did.
Those customers listed on the front page of companies are there as part of an agreement. Usually something like a discount. Certainly they are not listed without permission. 10x that if it is a case study.
I think they often are listed without permission, unfortunately, and often literally based on the email addresses of people signing up for a trial. I see my company's logo on the landing page of many products that we don't use or may even have a policy preventing our use of.
What I don't get is why they tied themselves to such an ancient platform. AMD Milan is what I run in my home lab. The new 9004 Epycs are so much better on power efficiency. I'm sure they've done their market research, but the gains from the newer parts are significant. We had a few petabytes and tens of thousands of cores almost ten years ago, and it's crazy how much higher data and compute density you can get with modern 30 TiB disks and Epyc 9654s. 100 such nodes and you have 10k cores and really fast data. I can't see myself running a 7003-series datacenter anymore unless the Oxide gains are that big.
They've built this a while ago. A hardware refresh takes time. The good news is that they may be able to upgrade the existing equipment with newer sleds.
I’m rooting for solutions like this as an alternative to the public cloud.
I do see that an org would rely on one company that theoretically can do a ‘Broadcom VMware’ on them but I don’t get this vibe from 0x1d3 at all.
But they target large orgs, I wish a solution like this would be accessible for smaller companies.
I wish I could throw their stack on my second-hand COTS hardware, rent a few U's in two colos for geo-redundancy, and cry tears of happiness each month realizing how much money we save on public cloud costs while still having cloud capabilities/benefits.
I'm amazed Apple don't have a rack mount version of their M series chips yet.
Even for their own internal use in their data centers they'd have to save an absolute boat load on power and cooling given their performance per watt compared to legacy stuff.
Oxide is not touching DLC systems in their post even with a 100ft barge pole.
Lenovo's DLC systems use 45 degrees C water to directly cool the power supplies and the servers themselves (water goes through them) for > 97% heat transfer to water. In cooler climates, you can just pump this to your drycoolers, and in winter you can freecool them with just air convection.
Yes, the TDP doesn't go down, but cooling costs drop and efficiency shoots up considerably, reducing PUE to 1.03 levels. You can put a tremendous amount of compute or GPU power in one rack and cool it efficiently.
Every chassis handles its own power, but IIRC all the chassis electricity is DC, and the PSUs are extremely efficient.
I don't think they'd admit much about it even if they had one internally, both because Apple isn't known for their openness about many things, and because they already exited the dedicated server hardware business years ago, so I think they're likely averse to re-entering it without very strong evidence that it would be beneficial for more than a brief period.
In particular, while I'd enjoy such a device, Apple's whole thing is their whole-system integration and charging a premium because of it, and I'm not sure the markets that want to sell people access to Apple CPUs will pay a premium for a 1U over shoving multiple Mac Minis in the same 1U footprint, especially if they've already been doing that for years at this point...
...I might also speculate that if they did this, they'd have a serious problem, because if they're buying exclusive access to all TSMC's newest fab for extended intervals to meet demand on their existing products, they'd have issues finding sources to meet a potentially substantial demand in people wanting their machines for dense compute. (They could always opt to lag the server platforms behind on a previous fab that's not as competed with, of course, but that feels like self-sabotage if they're already competing with people shoving Mac Minis in a rack, and now the Mac Minis get to be a generation ahead, too?)
I will add that consumer macOS is a piss-poor server OS.
At one point, for many years, it would just sometimes fail to `exec()` a process. This would manifest as a random failure on our build farm about once/twice a month. (This would manifest as "/bin/sh: fail to exec binary file" because the error type from the kernel would have the libc fall back to trying to run the binary as a script, as normal for a Unix, but it isn't a script)
This is likely stemming from their exiting the server business years ago, and focusing on consumer appeal more than robustness (see various terrible releases, security- and stability-wise).
(I'll grant that macOS has many features that would make it a great server OS, but it's just not polished enough in that direction)
As I recall, Apple advertised macOS as a Unix without such certification, got sued, and then scrambled to implement the required features to get certification as a result. Here's the story as told by the lead engineer of the project:
This comes up rather often, and on the last significant post about it I saw on HN someone pointed out that the certification is kind of meaningless[1]. macOS poll(2) is not Unix-compliant, hasn't been since forever, yet every new version of macOS gets certified regardless.
That's designed for the broadcast market, where they rack mount everything in the studio environment. It's not really a server, it has no out of band management, redundant power etc.
There are third party rack mounts available for the Mac Mini and Mac Studio also.
Companies buying massive cloud scale server hardware want to be able to choose from a dozen different Taiwanese motherboard manufacturers. Apple is in no way motivated to release or sell the M3/M4 CPUs as a product that major east asia motherboard manufacturers can design their own platform for. Apple is highly invested in tightly integrated ecosystems where everything is soldered down together in one package as a consumer product (take a look at a macbook air or pro motherboard for instance).
Maybe it becomes a big enough profit center to matter. Maybe. At the risk of taking focus away, splitting attention from the mission they're on today: building end user systems.
Maybe they build them for themselves. For what upside? Somewhat better compute efficiency, maybe, but I think if you have big workloads the huge massive AMD Turin super-chips are going to be incredibly hard to beat.
It's hard to overstate just how efficient AMD is, with 192 very high performance cores on a 350-500W chip.
> Maybe they build them for themselves. For what upside?
They do build it for themselves. From their security blog:
"The root of trust for Private Cloud Compute is our compute node: custom-built server hardware that brings the power and security of Apple silicon to the data center, with the same hardware security technologies used in iPhone, including the Secure Enclave and Secure Boot. We paired this hardware with a new operating system: a hardened subset of the foundations of iOS and macOS tailored to support Large Language Model (LLM) inference workloads while presenting an extremely narrow attack surface. This allows us to take advantage of iOS security technologies such as Code Signing and sandboxing."
This is such a narrow, narrow, tiny corner of computing needs. That has such serious need for ownership, no matter the cost. And has extremely fantastically chill as shit overall computing needs, as un-performance-sensitive as it gets.
I could not be less convinced by this information that this is a useful indicator for the other 99.999999999% of computing needs.
> How can organizations reduce power consumption and corresponding carbon emissions?
Stop running so much useless stuff.
Also maybe ARM over x86_64 and similar power-efficiency-oriented hardware.
Rack-level system design, or at least power & cooling design, is certainly also a reasonable thing to do. But standardization is probably important here, rather than some bespoke solution which only one provider/supplier offers.
> How can organizations keep pace with AI innovation as existing data centers run out of available power?
Current ARM servers actually generally offer "on par" (varies by workload) perf/Watt for generally worse absolute performance (varies by workload) i.e. require more other overhead to achieve the same total perf despite "on par" perf/Watt.
Need either Apple to get into the general market server business or someone to start designing CPUs as well as Apple (based on the comparison between different ARM cores I'm not sure it really matters if they do so using a specific architecture or not).
It's more a case of selection of optimization parameters and corresponding economy. It's not so much that apple towers over others in design (though they are absolutely no slouches and have wins there) but their design team is in position to coordinate with product directly and as such isn't as limited by "but will it sell in high enough numbers for the excel sheet at investor's desk?"
The real showstopper for years is that ARM servers are just not prepared to be a proper platform. U-Boot with grudgingly included FDT (after getting kicked out of the Linux kernel) does not make a proper platform, and often there's also no BMC, unique approaches to various parts making the server that one annoying weirdo in the data center, etc.
Cloud providers can spend the effort to backfill necessary features with custom parts, but doing so on your own on-prem is hard
Not sure what you mean wrt to Apple's uniqueness. AMD/Mediatek/Intel/Qualcomm/Samsung only make margin on how well they invest on their designs vs their competitors and they'd all love to be outshipping each other and Apple in any market. All, including Apple, also rely on the same manufacturer for their top products and the ones (Intel/Samsung) with alternatives have not been able to use that as an advantage for top performing products. Sure, Apple can work directly with their own product... but at the end of the day the goal and available customer pool to fight over is the same and they still ship fewer units than the others.
I'm not hands-on familiar with other serious ARM server market players but for several years now Ampere ARM server CPUs at least are nothing like you describe. Phoronix says it best in https://www.phoronix.com/review/linux-os-ampereone
> All the Linux distributions I attempted worked out effortlessly on this Supermicro AmpereOne server. Like with Ampere Altra and Ampere eMAG before that, it's a seamless AArch64 Linux experience. Thanks to supporting open standards like UEFI, Arm SBSA/SBBR and ACPI and not having to rely on DeviceTrees or other nuisances, installing an AArch64 Linux distribution on Ampere hardware is as easy as in the x86_64 space.
We don’t currently have GPUs in the product. The closed-ness of the GPU space is a bit of a cultural difference, but we’ll surely have something eventually. As a small company, we have to focus on our strengths, and there’s plenty of folks who don’t need GPUs right now.
For sure. It’s not just GPUs; given that we have one product with three SKUs, there’s a variety of workloads we won’t be appropriate for just yet. Just takes time to diversify the offering.
"If only they used DC from the wall socket, all those H100s would be green" is not, I think, the hill you want to die on.
But, yeah, my three 18MW/y racks agree that more power efficiency would be nice, it's just that Rewrite It In (Safe) Rust is unlikely to help with that...
It’s significantly more than that, but it’s also true that we include stuff in other languages where appropriate. CockroachDB is in Go, and illumos is in C, as two examples. But almost all new code we write is in Rust. That is the stuff you’re talking about, but also like, our control plane.
Pretty much everything Oxide publishes on github is either in rust or it's an sdk to a service in rust. Well, the web panel isn't in rust, so negative points for that; true evangelists would have used WASM.
But Oxide's reason to exist is to keep the memory of cool racks from Sun running Solaris alive forever.
(And for that matter, Oracle's proprietary Solaris seems better maintained than I ever expected, though in this context I think the open source fork is the relevant thing to look at.)
I really love Oxide to an unhealthy amount (it's become a bit of a meme among my colleagues), but sometimes I do wonder whether they went about their go-to-market the right way. They really tried to do everything at once - custom servers, custom router, custom rack, everything. Their accomplishments are technologically impressive, but, as somebody who is in a position to make purchasing decisions, not economically attractive. They're 3x more expensive than our existing hardware, two generations behind (I'm aware they're on track for a refresh) and don't have any GPUs. E.g. what I would have loved to see is just an after-market BMC/NIC/firmware solution using their stack. Plug it into a cheap Gigabyte system (their BMC is pluggable and NIC is OCP) and just have the control plane manage it as a whole box. I'd have easily paid several thousand $ per server just for that. All the rack scale integration, virtualization, migration, network storage, etc stuff is cool, but not everyone needs it. Get your foot in the door at customers, build up some volume for better deals with AMD, and then start building the custom rack stuff ... Of course it's easy to be a critic from the sidelines. As I said, I do really love what the Oxide folks are doing, I just really hope it'll become possible for me to buy their gear at some point.
Oxide are doing great work. Hoping they can probe the market a bit more for us out on the sidelines preparing to drop in and compete with some similar tech.
I'm curious what their burn rate is.
I'd also wish I could get to play around with a cheaper version of their tech, but they probably have enough customers that really want a large-scale solution that is completely customizable.
> When we started Oxide, the DC bus bar stood as one of the most glaring differences between the rack-scale machines at the hyperscalers and the rack-and-stack servers that the rest of the market was stuck with. That a relatively simple piece of copper was unavailable to commercial buyers
It seems that 0xide was founded in 2019, and Open Compute Project had been specifying DC bus bars for 6 years at that point. People could purchase racks if they wanted, but it seems like, by and large, people didn't care enough to go whole hog on it.
Wonder if the economics have changed or if it's still just neat but won't move the needle.
Part of the issue is that you simply can't buy OCP hardware, at least not new. What you're going to find is "OCP Inspired" hardware that has some overlap with the full OCP specification but is almost always meant to run on 240VAC in 19in racks, because nobody wants to invest the money in something that can't be bought from CDW.
I remember the one time I had OCP hardware in a data center, and how it was essentially rumoured that it was better not to ask too much about how it got there - not quite "fell off a truck" territory, but there was some possibility it was ex-(big tech) equipment acquired through favours, or some really insistent negotiating with Quanta until "to be sold to (big tech)" racks ended up with us.
It's normally incredibly difficult for employees to disrupt at massive companies that would be the type which runs a data center. Disruption usually enters the corp in a sales deck, much like the one Oxide would have.
It's stupid, but that's why we all have jobs.
I think engineers should be more forceful in leading their own visions instead of being led by accountants and lawyers.
After all, engineers have the power of implementation and de-implementation. They need to step into dirty politics and bend other people's views.
It's either theirs or ours. Win-win is a fallacy.
Being able to navigate this is what differentiates a very senior IC (principal, distinguished, etc) and random employees.
Yes. I think as an engineer at this level you need to also have the patience to deal with the bean counters.
But as I’ve grown in my career I’ve actually found that line of thinking refreshing. Can you quantify benefit? If it requires too many assumptions it’s probably not worth it.
But then again there's always the VP or the SVP who wants to "showcase his tower's innovative spirit" and then there goes money that could be used for better things. The innovative spirit of the day is random LLM apps.
Let me know how that works out for you!
Things like -48VDC bus bars in the 'telco' world significantly predate the OCP, all the way back to like 1952 in the Bell system.
In general, the telco world concept hasn't changed much. You have AC grid power coming from your local utility into some BIG ASS RECTIFIERS which create -48VDC (and are responsible for charging your BIG ASS BATTERY BANK to float voltage), then various DC fuses/breakers going to distribution of -48VDC bus bars powering the equipment in a CO.
Re: Open Compute, the general concept of what they did was go to a bunch of 1U/2U server power supply manufacturers and get them to make a series of 48VDC-to-12VDC power supplies (which can be 92%+ efficient), and cut out the need for legacy 5VDC feed from power supply into ATX-derived-design x86-64 motherboards.
OCP hardware is only really accessible to hyperscalers. You can't go out and just buy a rack or two, the Taiwanese OEMs don't do direct deals that small. Even if they did, no integration is done for you. You would have to integrate the compute hardware from one company, the network fabric from another company, and then the OS and everything else from yet another. That's a lot of risk, a lot of engineering resources, a lot of procurement overhead, and a lot of different vendors pointing fingers at each other when something doesn't work.
If you're Amazon or Google, you can do this stuff yourself. If you're a normal company, you probably won't have the inhouse expertise.
On the other hand, Oxide sells a turnkey IaaS platform that you can just roll off the pallet, plug in and start using immediately. You only need to pay one company, and you have one company to yell at if something goes wrong.
You can buy a rack of 1-2U machines from Dell, HPE or Cisco with VMware or some other HCI platform, but you don't get that power efficiency or the really nice control plane Oxide have on their platform.
But isn't it a little surprising (I'm not an expert) that Dell or Supermicro or some firm like that hadn't already started offering approachable access to either OCP gear or a proprietary knockoff of it? Presumably that may still happen if Oxide is seen to have proven the market.
Azure tried this, not with their hyperscaler stuff, but with Azure Operator Nexus.
Basically an "opinionated" combination of Dell, Arista, and Pure storage with a special Azure AKS running on top and a metric ton of management and orchestration smarts. The target customer base was telcos who needed local capabilities in their data centers and who might otherwise have gone to OCP.
As far as I can surmise, it's dead, but not EOLed. Microsoft nuked the operator business unit earlier in the year, and judging by recent job postings from contract shops, AT&T might be the only customer.
Supermicro does sell OCP racks.
https://www.supermicro.com/solutions/Solution-Brief-Supermic...
I recall them offering older versions of the specs but can't easily find a reference, so I might be wrong about how accessible they were.
One is the specs and the other is an actual implementation, what am I missing?
I really wish Oxide had homelab/prosumer grade stuff. I'd be sending them so much money.
I believe the telcos did DC power for years, so I don't think this is anything new. Any old hands out there want to school us on how it was done in the old days?
Every old telco technician had a story about dropping a wrench on a busbar or other bare piece of high powered transmission equipment and having to shut that center down, get out the heavy equipment, and cut it off because the wrench had been welded to the bus bars.
Note that the rack doesn't accept DC input, like lots of (e.g., NEBS certified) telco equipment. There's a bus bar, but it's enclosed within the rack itself. The rack takes single- or three-phase AC inputs to power the rectifiers, which are then attached to the internal bus bar.
big ass rectifiers
big ass solid copper busbars
huge gauge copper cables going around a central office (google "telcoflex IV")
big DC breaker/fuse panels
specialized dc fuse panels for power distribution at the top of racks, using little tiny fuses
100% overhead steel ladder rack type cable trays, since your typical telco CO was never a raised floor type environment (UNLIKE legacy 1960s/1970s mainframe computer rooms), so all the power was kept accessible by a team of people working on stepladders.
The same general thing continues today in serious telco/ISP operations, with tech features to bring it into the modern era. The rectifiers are modular now, and there's also rectiverters. Monitoring is much better. People are moving rapidly away from wet cell 2V lead acid battery banks and AGM sealed lead acid stuff to LiFePo4 battery systems.
DC fuse panels can come with network-based monitoring, ability to turn on/off devices remotely.
equipment is a whole lot less power hungry now, a telco CO that has decommed a 5ESS will find itself with a ton of empty thermal and power budget.
when I say serious telco stuff is a lot less power hungry, it's by huge margins. randomly chosen example of radio transport equipment. For instance back in the day a powerful, very expensive point to point microwave radio system might be a full 42U rack, 800W in load, with waveguide going out to antennas on a roof. It would carry one, two or three DS3 equivalent of capacity (45 Mbps each).
now, that same telco might have a radio on its CO roof in the same microwave bands that is 1.3 Gbps FDD capacity, pure ethernet with a SFP+ fiber interface built into it, and the whole radio is a 40W electrical load. The radio is mounted directly on the antenna with some UV/IR resistant weatherproof 16 gauge DC power cable running down into the CO and plugged into a fuse panel.
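The scale of that efficiency jump is easy to sanity-check with a bits-per-watt comparison, using only the figures quoted above (a rough sketch; the numbers are the comment's, rounded):

```python
# Rough bits-per-watt comparison of the two microwave radio generations
# described above. All figures come from the comment itself.
old_capacity_mbps = 3 * 45     # three DS3s at 45 Mbps each
old_power_w = 800              # full-rack legacy radio system
new_capacity_mbps = 1300       # modern FDD radio, SFP+ ethernet
new_power_w = 40               # whole radio, antenna-mounted

old_eff = old_capacity_mbps / old_power_w    # ~0.17 Mbps per watt
new_eff = new_capacity_mbps / new_power_w    # ~32.5 Mbps per watt
print(f"Improvement: ~{new_eff / old_eff:.0f}x more Mbps per watt")  # ~193x
```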
See perhaps "Oxide Cloud Computer Tour - Rear":
* https://www.youtube.com/watch?v=lJmw9OICH-4
Their tech may be more than adequate today. Bigger businesses may not buy from a small startup company. They expect a lot more. Illumos is a less popular OS. It wouldn't be the first choice for the OS I'd rely on. Who writes the security mitigations for speculative execution bugs? Who patches CVEs in the shipped software which doesn't use Rust?
The answer to "who does X" is Oxide. That's the point. You're not going to Dell who's integrating multiple vendors in the same box in a way that "should" work. You're getting a rack where everything is designed to work together from top to bottom.
The goal is that you can email Oxide and they'll be able to fix it regardless of where it is in the stack, even down to the processor ROM.
This. If you want on prem cloud infra without having to roll it yourself, Oxide is the solution.
(no affiliation, just a fan)
If you want on prem infra in exactly the shape and form Oxide delivers*
I've read and understood from Joyent and SmartOS that they believe fault-tolerant block devices / filesystems are the wrong abstraction; your software should handle losing storage.
We do not put the onus on customers to tolerate data loss. Our storage is redundant and spread through the rack so that if you lose drives or even an entire computer, your data is still safe. https://oxide.computer/product/storage
And a big enough customer will evaluate Oxide's resources and consider for themselves whether they think Oxide can provide a quick enough turnaround for everything. That's what GP is talking about.
> Bigger businesses may not buy from a small startup company.
What would you classify Shopify as?
> One existing Oxide user is e-commerce giant Shopify, which indicates the growth potential for the systems available.
* https://blocksandfiles.com/2024/07/04/oxide-ships-first-clou...
Their CEO has tweeted about it:
* https://twitter.com/tobi/status/1793798092212367669
> Who writes the security mitigations for speculative execution bugs? Who patches CVEs in the shipped software which doesn't use Rust?
Oxide.
This is all a pre-canned solution: just use the API like you would an off-prem cloud. Do you worry about AWS patching stuff? And how many people purchasing 'traditional' servers from Dell/HPe/Lenovo worry about patching things like the LOM?
Further, all of Oxide's stuff is on Github, so you're in better shape for old stuff, whereas if the traditional server vendors EO(S)L something firmware-wise you have no recourse.
How much did Shopify buy? Sounds like from what the CEO is saying they bought 1 unit.
>We learned that Oxide has so far shipped “under 20 racks,” which illustrates the selective markets its powerful systems are aimed at.
>B&F understands most of those systems were deployed as single units at customer sites. Therefore, Oxide hopes these and new customers will scale up their operations in response to positive outcomes.
Yikes. If they sold 20 racks in July, how many are they up to now?
Illumos is the OS for the hypervisor and core services, they don't expect their customers to run their code directly on that OS, but inside VMs.
> Bigger businesses may not buy from a small startup company.
Our early customers include government, finance, and places like Shopify.
You’re not wrong that some places may prefer older companies but that doesn’t mean they all do.
Illumos is not really directly relevant to the customer, it’s a non user facing implementation detail.
We provide security updates.
The illumos bare-metal OS is not directly visible to customers.
We write the security mitigations. We patch the CVEs. Oxide employs many, perhaps most, of the currently active illumos maintainers --- although I don't work on the illumos kernel personally, I talk to those folks every day.
A big part of what we're offering our customers is the promise that there's one vendor who's responsible for everything in the rack. We want to be the responsible party for all the software we ship, whether it's firmware, the host operating system, the hypervisor, and everything else. Arguably, the promise that there's one vendor you can yell at for everything is a more important differentiator for us than any particular technical aspect of our hardware or software.
How long before a VPS pops up running Oxide racks? Or, why wouldn't a VPS build on top of Oxide if they offer better efficiency and server management?
Someone could if they wanted to! We’ll see if anyone does.
Because they use such esoteric software that you'll forever be reliant on Oxide.
I'd rather they use more standardized open source software like Linux, Talos, k8s, Ceph, KubeVirt. Instead of rolling it all themselves on an OS that has a very small niche ecosystem.
Oxide is providing an x86 platform to run VMs/containers on. That's a commoditized market.
The value they're offering is that the rack-level consumption and management is improved over the competition, but you should be able to run whatever you want on the actual compute, k8s or whatnot.
This also means you'd not be forever reliant on Oxide.
> > The power shelf distributes DC power up and down the rack via a bus bar. This eliminates the 70 total AC power supplies found in an equivalent legacy server rack within 32 servers, two top-of-rack switches, and one out-of-band switch, each with two AC power supplies
This creates a single point of failure, trading robustness for efficiency. There's nothing wrong with that, but software/ops might have to accommodate by making the opposite tradeoff. In general, the cost savings advertised by cloud infrastructure should be more holistic.
>This creates a single point of failure, trading robustness for efficiency. There's nothing wrong with that, but software/ops might have to accommodate by making the opposite tradeoff.
I'll happily take a single high quality power supply (which may have internal redundancy FWIW) over 70 much more cheaply made power supplies that stress other parts of my datacenter via sheer inefficiency, and also cost more in aggregate. Nobody drives down the highway with 10 spare tires for their SUV.
A DC busbar can propagate a short circuit across the rack, and DC circuit protection is harder than AC. So of course each server now needs its own current limiter, or a cheap fuse.
But I’m not debating the merits of this engineering tradeoff - which seems fine, and pretty widely adopted - just its advertisement. The healthcare industry understands the importance of assessing clinical endpoints (like mortality) rather than surrogate measures (like lab results). Whenever we replace “legacy” with “cloud”, it’d be nice to estimate the change in TCO.
DC circuit protection is absolutely not harder than AC. DC has the advantage in current flowing in only one direction, not two
Which makes it much harder to break the circuit vs AC
At 48 volts arcing shorts aren't the concern.
No one drives down the highway with one tire either.
Careful, unicyclists are an unforgiving bunch.
Let's say your high-quality supply's yearly failure rate is 100 times lower than the cheap ones'.
The probability of at least a single failure among the 70 cheap supplies is 1-(1-r)^70. This is quite high even without considering the higher quality of the one supply.
The probability of all 70 going down is r^70, which is absurdly low.
Let's say r = 0.05, i.e. one failed supply in every 20 per year.
Then 1-(1-r)^70 = 97%, while r^70 < 1E-91.
The high-quality supply has r = 0.0005, which sits between those two extremes. If your code can handle node failure, very many cheaper supplies appear to be more robust.
(Assuming uncorrelated events. YMMV)
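The arithmetic above can be checked with a quick sketch (the failure rates are the comment's hypothetical numbers, not real field data):

```python
# Sanity-check of the failure-probability arithmetic above, assuming
# independent failures. r_cheap = 0.05/year is the comment's hypothetical rate.
r_cheap = 0.05   # assumed yearly failure rate of one cheap supply
n = 70           # supplies in the equivalent legacy rack

p_at_least_one = 1 - (1 - r_cheap) ** n   # some node loses power this year
p_all_fail = r_cheap ** n                 # every node loses power at once

print(f"P(at least one of 70 fails) = {p_at_least_one:.1%}")   # ~97%
print(f"P(all 70 fail together)     = {p_all_fail:.1e}")       # ~1e-91

r_good = r_cheap / 100   # the "100x better" single supply
print(f"P(single good supply fails) = {r_good:.2%}")
```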
Yeah, but the failure rate of an analog piece of copper is pretty low; it'll keep being copper unless you do stupid things. You'll have multiple power supplies providing power on the same piece of copper.
TL;DR: isn't there a single, shared DC supply that feeds said piece of copper? Presumably connected to mains?
Or are the running on SOFCs?
The big piece of copper is fed by redundant rectifiers. Each power shelf has six independent rectifiers which are 5+1 redundant if the rack is fully loaded with compute sleds, or 3+3 redundant if the rack is half-populated. Customers who want more redundancy can also have a second power shelf with six more rectifiers.
I'm going to assume this is on 3 phase power, but how is the ripple filtered?
Look very carefully at the picture of the rack at https://oxide.computer/ :) there are two power shelves in the middle, not one.
We're absolutely aware of the tradeoffs here and have made quite considered decisions!
The bus bar itself is an SPoF, but it's also just dumb copper. That doesn't mean that nothing can go wrong, but it's pretty far into the tail of the failure distribution.
The power shelf that keeps the busbar fed will have multiple rectifiers, often with at least N+1 redundancy so that you can have a rectifier fail and swap it without the rack itself failing. Similar things apply to the battery shelves.
It's also plausible to have multiple power supplies feeding the same bus bar in parallel (if they're designed to support this) e.g. one at each end of a row.
This is how our rack works (Oxide employee). In each power shelf, there are 6 power supplies and only 5 need to be functional to run at full load. If you want even more redundancy, you can use both power shelves with independent power feeds to each so even if you lose a feed, the rack still has 5+1 redundant power supplies.
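What N+1 redundancy buys can be sketched with a simple binomial calculation. The per-rectifier failure probability below is an illustrative assumption, not an Oxide figure; the 5-of-6 requirement is the one described above:

```python
# Rough sketch of 5+1 rectifier redundancy: at full load the shelf fails
# only if 2 or more of its 6 rectifiers fail in the same service window.
# p is an assumed per-rectifier failure probability, purely illustrative.
from math import comb

p = 0.01       # assumed per-rectifier failure probability per window
n = 6          # rectifiers per power shelf
k_needed = 5   # rectifiers required at full load

p_shelf_ok = sum(
    comb(n, k) * (1 - p) ** k * p ** (n - k)
    for k in range(k_needed, n + 1)
)
print(f"P(shelf keeps the busbar fed) = {p_shelf_ok:.6f}")  # ~0.9985

# With a second, independently fed shelf, both must fail at once:
p_rack_dark = (1 - p_shelf_ok) ** 2
print(f"P(rack loses power entirely)  = {p_rack_dark:.2e}")
```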
This isn't even remotely close. Unless all 32 servers have redundant AC power feeds present, you've traded one single point of failure for another single point of failure.
In the event that all 32 servers had redundant AC power feeds, you could just install a pair of redundant DC power feeds.
>Unless all 32 servers have redundant AC power feeds present, you've traded one single point of failure for another single point of failure.
Is this not standard? I vaguely remember that rack severs typically have two PSUs for this reason.
It's highly dependent on the individual server model and quite often how you spec it too. Most 1U Dell machines I worked with in the past only had a single slot for a PSU, whereas the beefier 2U (and above) machines generally came with 2 PSUs.
But 2 PSUs plugged into the same AC supply still have a single point of failure.
you could have 15 PSUs in a server. It doesn't mean they have redundant power feeds
Rack servers have two PSUs because enterprise buyers are gullible and will buy anything. Generally what happens in case of a single PSU failure is the other PSU also fails or it asserts PROCHOT which means instead of a clean hard down server you have a slow server derping along at 400MHz which is worse in every possible way.
The whole thing with eliminating 70 discrete 1U server size AC-to-DC power supplies is nothing new. It's the same general concept as the power distribution unit in the center of an open compute platform rack design from 10+ years ago.
Everyone who's doing serious datacenter stuff at scale knows that one of the absolute least efficient, labor intensive and cabling intensive/annoying ways of powering stuff is to have something like a 42U cabinet with 36 servers in it, each of them with dual power supplies, with power leads going to a pair of 208V 30A vertical PDUs in the rear of the cabinet. It gets ugly fast in terms of efficiency.
The single point of failure isn't really a problem as long as the software is architected to be tolerant of the disappearance of an entire node (mapping to a single motherboard that is a single or dual cpu socket config with a ton of DDR4 on it).
That’s one reason why 2U4N systems are kinda popular. 1/4 the cabling in legacy infrastructure.
PDUs are also very failure-prone and not worth the trouble.
> This creates a single point of failure,
Who told you there is only one PSU in the power shelf?
They do have a good point here. If you do the total power budget on a typical 1U server (discrete chassis, not blade) packed with a wall of 40mm fans pushing air, the highest-speed screaming 40mm 12VDC fans can be a 20W electrical load each. It's easy to "spend" at least 120W, at maximum heat from the CPUs in a dual-socket system, just on the fans pulling air from the front/cold side of the server through to the rear heat exhaust.
Just going up to 60mm or 80mm standard size DC fans can be a huge efficiency increase in watt-hours spent per cubic meters of air moved per hour.
I am extremely skeptical of the "12x" but using larger fans is more efficient.
from the URL linked:
> Bigger fans = bigger efficiency gains Oxide server sleds are designed to a custom form factor to accommodate larger fans than legacy servers typically use. These fans can move more air more efficiently, cooling the systems using 12x less energy than legacy servers, which each contain as many as 7 fans, which must work much harder to move air over system components.
FWIW, we had to have the idle speed of our fans lowered because the usual idle of around 5k RPM was WAY too much cooling. We generally run our fans at around 2.5kRPM (barely above idle). This is due to not only the larger fans, but also the fact that we optimized and prioritized as little restriction on airflow as possible. If you’ve taken apart a current gen 1U/2U server and then compare that to how little our airflow is restricted and how little our fans have to work, the 12X reduction becomes a bit clearer.
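The fan affinity laws make the "12x" figure less surprising: fan power scales roughly with the cube of rotational speed. A quick sketch using the RPM figures quoted above (the cube-law model is an approximation, not a measurement):

```python
# Back-of-the-envelope sketch of why larger, slower fans save so much power,
# using the fan affinity laws (power scales ~cubically with RPM).
# RPM figures are the ones quoted above; the model is an approximation.
typical_idle_rpm = 5000   # usual small-fan idle speed mentioned above
actual_rpm = 2500         # speed the larger fans actually run at

power_ratio = (typical_idle_rpm / actual_rpm) ** 3
print(f"Cube-law power ratio at half speed: {power_ratio:.0f}x")  # 8x
```

Combine that ~8x from speed alone with fewer, larger, inherently more efficient fans and an unrestricted airflow path, and a figure on the order of 12x becomes plausible.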
If any Oxide staff are here, I'm just curious, is BlueSky a customer? Seems like it would fit well with their on-prem setup.
Nope, but many of us (Oxide staff) are big fans of what Bluesky is doing!
One of the Bluesky team members posted about their requirements earlier this month, and why Oxide isn't a great fit for them at the moment:
https://bsky.app/profile/jaz.bsky.social/post/3laha2upw3k2z
> Also prices don't make sense for us.
Oof.
Not Oxide or Bluesky, but firstly I'd suggest that asking the company about their customers is unlikely to get a response, most companies don't disclose their customers. Secondly, Bluesky have been growing quickly, I can only assume their hardware is too, and that means long lead time products like an Oxide rack aren't going to work, especially when you can have an off the shelf machine from Dell delivered in a few days.
Oxide is very open, we are happy to talk about customers that allow us to talk about them. Some don’t want to, others are very happy to be mentioned, just like any other company.
> we are happy to talk about customers that allow us to talk about them
This is what I meant by "don't disclose", I didn't mean that Oxide was in any way secretive, but that usually this stuff doesn't get agreed, and that it would make more sense to ask the customer rather than the company selling as Oxide won't want to disclose unless there's already an agreement in place (formal or otherwise).
Gotcha. That totally makes sense, I wouldn't have thought about it that way.
> most companies don't disclose their customers
In my head I'm imagining an average landing page. They slap their customers on there like stickers. I doubt bluesky would stay secretive about using oxide if they did
Those customers listed on the front page of companies are there as part of an agreement. Usually something like a discount. Certainly they are not listed without permission. 10x that if it is a case study.
I think they often are listed without permission unfortunately, and often literally based on the email addresses of people signing up for a trial. I see my company's logo on the landing page of many products that we don't use or may even have a policy preventing our use of.
events.bsky appears to be hosted on OVH. Single-product SAAS companies less than a few years old are unlikely to be a major customer cohort for Oxide.
What I don't get is why tie to such an ancient platform. AMD Milan is my home lab. The new 9004 Epycs are so much better on power efficiency. I'm sure they've done their market research and the gains must be so significant. We used to have a few petabytes and tens of thousands of cores almost ten years ago and it's crazy how much higher data and compute density you can get with modern 30 TiB disks and Epyc 9654s. 100 such nodes and you have 10k cores and really fast data. I can't see myself running a 7003-series datacenter anymore unless the Oxide gains are that big.
They've built this a while ago. A hardware refresh takes time. The good news is that they may be able to upgrade the existing equipment with newer sleds.
Yes we're definitely building the next generation of equipment to fit into the existing racks!
I’m rooting for solutions like this as an alternative to the public cloud. I do see that an org would rely on one company that theoretically can do a ‘Broadcom VMware’ on them but I don’t get this vibe from 0x1d3 at all.
But they target large orgs; I wish a solution like this were accessible to smaller companies.
I wish I could throw their stack on my second-hand COTS hardware, rent a few U's in two colos for geo redundancy, and cry of happiness each month realizing how much money we save on public cloud cost, yet having cloud capabilities/benefits.
I'm amazed Apple don't have a rack mount version of their M series chips yet.
Even for their own internal use in their data centers they'd have to save an absolute boat load on power and cooling given their performance per watt compared to legacy stuff.
Oxide is not touching DLC systems in their post even with a 100ft barge pole.
Lenovo's DLC systems use 45 degrees C water to directly cool the power supplies and the servers themselves (water goes through them) for > 97% heat transfer to water. In cooler climates, you can just pump this to your drycoolers, and in winter you can freecool them with just air convection.
Yes, the TDP doesn't go down, but cooling costs drop and efficiency shoots up considerably, reducing PUE to 1.03 levels. You can put a tremendous amount of compute or GPU power in one rack and cool it efficiently.
Every chassis handles its own power, but IIRC all the chassis electricity is DC, and the PSUs are extremely efficient.
I don't think they'd admit much about it even if they had one internally, both because Apple isn't known for their openness about many things, and because they already exited the dedicated server hardware business years ago, so I think they're likely averse to re-entering it without very strong evidence that it would be beneficial for more than a brief period.
In particular, while I'd enjoy such a device, Apple's whole thing is their whole-system integration and charging a premium because of it, and I'm not sure the markets that want to sell people access to Apple CPUs will pay a premium for a 1U over shoving multiple Mac Minis in the same 1U footprint, especially if they've already been doing that for years at this point...
...I might also speculate that if they did this, they'd have a serious problem, because if they're buying exclusive access to all TSMC's newest fab for extended intervals to meet demand on their existing products, they'd have issues finding sources to meet a potentially substantial demand in people wanting their machines for dense compute. (They could always opt to lag the server platforms behind on a previous fab that's not as competed with, of course, but that feels like self-sabotage if they're already competing with people shoving Mac Minis in a rack, and now the Mac Minis get to be a generation ahead, too?)
I will add that consumer macOS is a piss-poor server OS.
At one point, for many years, it would just sometimes fail to `exec()` a process. This would manifest as a random failure on our build farm about once/twice a month. (This would manifest as "/bin/sh: fail to exec binary file" because the error type from the kernel would have the libc fall back to trying to run the binary as a script, as normal for a Unix, but it isn't a script)
This is likely stemming from their exiting the server business years ago, and focusing on consumer appeal more than robustness (see various terrible releases, security- and stability-wise).
(I'll grant that macOS has many features that would make it a great server OS, but it's just not polished enough in that direction)
> as normal for a Unix
veering offtopic, did you know macOS is a certified Unix?
https://www.opengroup.org/openbrand/register/brand3581.htm
As I recall, Apple advertised macOS as a Unix without such certification, got sued, and then scrambled to implement the required features to get certification as a result. Here's the story as told by the lead engineer of the project:
https://www.quora.com/What-goes-into-making-an-OS-to-be-Unix...
This comes up rather often, and on the last significant post about it I saw on HN someone pointed out that the certification is kind of meaningless[1]. macOS poll(2) is not Unix-compliant, hasn't been since forever, yet every new version of macOS gets certified regardless.
[1]: https://news.ycombinator.com/item?id=41823078
and Windows used to be certified for POSIX, but none of that matters these days if it's not bug-compatible with Linux
Did that ever get fixed? That...seems like a pretty critical problem.
Yes, it quietly stopped happening a few years ago, sometime since 2020.
> I will add that consumer macOS is a piss-poor server OS.
Windows is also abysmal but it hasn't stopped people from using it.
But yes, it is too much of a desktop OS.
There is a rack mount version of the Mac Pro you can buy
That's designed for the broadcast market, where they rack mount everything in the studio environment. It's not really a server; it has no out-of-band management, no redundant power, etc.
There are third party rack mounts available for the Mac Mini and Mac Studio also.
Rack mount models have LOM over MDM.
Companies buying massive cloud scale server hardware want to be able to choose from a dozen different Taiwanese motherboard manufacturers. Apple is in no way motivated to release or sell the M3/M4 CPUs as a product that major east asia motherboard manufacturers can design their own platform for. Apple is highly invested in tightly integrated ecosystems where everything is soldered down together in one package as a consumer product (take a look at a macbook air or pro motherboard for instance).
For who? How would this help their core mission?
Maybe it becomes a big enough profit center to matter. Maybe. At the risk of taking focus away, splitting attention from the mission they're on today: building end user systems.
Maybe they build them for themselves. For what upside? Somewhat better compute efficiency, maybe, but I think if you have big workloads the massive AMD Turin super-chips are going to be incredibly hard to beat.
It's hard to overstate just how efficient AMD is, with 192 very high performance cores on a 350-500W chip.
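Back-of-envelope, using the figures above:

```python
cores = 192
tdp_watts = (350, 500)  # socket TDP range quoted above
low, high = (w / cores for w in tdp_watts)
# Roughly 1.8-2.6 W per core at the socket level.
print(f"{low:.2f}-{high:.2f} W per core")
```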
> Maybe they build them for themselves. For what upside?
They do build it for themselves. From their security blog:
"The root of trust for Private Cloud Compute is our compute node: custom-built server hardware that brings the power and security of Apple silicon to the data center, with the same hardware security technologies used in iPhone, including the Secure Enclave and Secure Boot. We paired this hardware with a new operating system: a hardened subset of the foundations of iOS and macOS tailored to support Large Language Model (LLM) inference workloads while presenting an extremely narrow attack surface. This allows us to take advantage of iOS security technologies such as Code Signing and sandboxing."
<https://security.apple.com/blog/private-cloud-compute/>
This is such a narrow, tiny corner of computing needs. One with a serious need for ownership, no matter the cost, and with extremely chill overall computing demands; it's about as un-performance-sensitive as it gets.
I could not be less convinced by this information that this is a useful indicator for the other 99.999999999% of computing needs.
(some of?) their servers do run apple silicon: https://security.apple.com/blog/private-cloud-compute/
> How can organizations reduce power consumption and corresponding carbon emissions?
Stop running so much useless stuff.
Also maybe ARM over x86_64 and similar power-efficiency-oriented hardware.
Rack-level system design, or at least power & cooling design, is certainly also a reasonable thing to do. But standardization is probably important here, rather than some bespoke solution which only one provider/supplier offers.
> How can organizations keep pace with AI innovation as existing data centers run out of available power?
Waste less energy on LLM chatbots?
Current ARM servers actually generally offer "on par" perf/Watt (varies by workload) but generally worse absolute performance (also varies by workload), i.e. they require more overhead elsewhere to achieve the same total perf despite "on par" perf/Watt.
Need either Apple to get into the general market server business or someone to start designing CPUs as well as Apple (based on the comparison between different ARM cores I'm not sure it really matters if they do so using a specific architecture or not).
It's more a case of selection of optimization parameters and corresponding economy. It's not so much that apple towers over others in design (though they are absolutely no slouches and have wins there) but their design team is in position to coordinate with product directly and as such isn't as limited by "but will it sell in high enough numbers for the excel sheet at investor's desk?"
The real showstopper for years is that ARM servers are just not prepared to be a proper platform. U-Boot with grudgingly included FDT (after device trees got kicked out of the Linux kernel tree) does not make a proper platform, and often there's also no BMC, unique approaches to various parts making the server that one annoying weirdo in the data center, etc.
Cloud providers can spend the effort to backfill necessary features with custom parts, but doing so on your own on-prem is hard
Not sure what you mean wrt Apple's uniqueness. AMD/Mediatek/Intel/Qualcomm/Samsung only make margin on how well they invest in their designs vs their competitors, and they'd all love to be outshipping each other and Apple in any market. All, including Apple, also rely on the same manufacturer for their top products, and the ones with alternatives (Intel/Samsung) have not been able to use that as an advantage for top-performing products. Sure, Apple can work directly with their own product... but at the end of the day the goal and the available customer pool to fight over are the same, and they still ship fewer units than the others.
I'm not hands-on familiar with other serious ARM server market players but for several years now Ampere ARM server CPUs at least are nothing like you describe. Phoronix says it best in https://www.phoronix.com/review/linux-os-ampereone
> All the Linux distributions I attempted worked out effortlessly on this Supermicro AmpereOne server. Like with Ampere Altra and Ampere eMAG before that, it's a seamless AArch64 Linux experience. Thanks to supporting open standards like UEFI, Arm SBSA/SBBR and ACPI and not having to rely on DeviceTrees or other nuisances, installing an AArch64 Linux distribution on Ampere hardware is as easy as in the x86_64 space.
Where is the GPU?
We don’t currently have GPUs in the product. The closed-ness of the GPU space is a bit of a cultural difference, but we’ll surely have something eventually. As a small company, we have to focus on our strengths, and there’s plenty of folks who don’t need GPUs right now.
That's fine, just awkward because the GS report shows the TAM (or the problem, depending on your perspective) being accelerated computing.
For sure. It’s not just GPUs; given that we have one product with three SKUs, there’s a variety of workloads we won’t be appropriate for just yet. Just takes time to diversify the offering.
maybe the real GPU was the friends we made along the way
"If only they used DC from the wall socket, all those H100s would be green" is, not, I think, the hill you want to die on.
But, yeah, my three 18MW/y racks agree that more power efficiency would be nice, it's just that Rewrite It In (Safe) Rust is unlikely to help with that...
> it's just that Rewrite It In (Safe) Rust is unlikely to help with that...
I didn't see any mention of Rust in the article?
It's pretty much the raison d'être of Oxide. But carry on...
They wrote their own BMC and various other bits and pieces in Rust. That's an extremely tiny part of the whole picture.
It’s significantly more than that, but it’s also true that we include stuff in other languages where appropriate. CockroachDB is in Go, and illumos is in C, as two examples. But almost all new code we write is in Rust. That is the stuff you’re talking about, but also like, our control plane.
Oh and we write a lot of Typescript too.
OSS Rust in Rack trenchcoat.
That's an interesting take. What's your reasoning? Whats your evidence?
Pretty much everything Oxide publishes on GitHub is either in Rust or is an SDK to a service written in Rust. Well, the web panel isn't in Rust, so negative points for that; true evangelists would have used WASM.
But Oxide's reason to exist is to keep the memory of cool Sun racks running Solaris alive forever.
The raison d'être of Oxide isn't Rust, it's continuing to pretend that the bloated corpse of Solaris still has some signs of life.
https://github.com/illumos/illumos-gate/commits/master/ looks alive to me.
(And for that matter, Oracle's proprietary Solaris seems better maintained than I ever expected, though in this context I think the open source fork is the relevant thing to look at.)