We'll have to wait for third-party benchmarks, but they seem decent so far. A 4060 equivalent for $200-$250 isn't bad at all. I'm curious if we'll get a B750 or B770 and how they'll perform.
At the very least, it's nice to have some decent BUDGET cards now. The ~$200 segment has been totally dead for years. I have a feeling Intel is losing a fair chunk of $ on each card though, just to enter the market.
I quite enjoyed using the CUDA-to-oneAPI migration tool. It was a bit dodgy at times, but for the most part it helped me move a lot of stuff out of the NVIDIA walled garden.
I don't know the numbers, but manufacturing the chip and the cards can't be that expensive; the design was probably the far bigger cost. Hopefully they are at least breaking even, ideally making money. Nobody goes into business to lose money. Shareholders would be pissed!
I think a graphics card tailored for 2K gaming is actually great. 2K really is the Goldilocks zone between 4K and 1080p, before you start creeping into diminishing returns.
For sure, it's been a sweet spot for a very long time for budget-conscious gamers looking for the best balance of price and frame rates, but 1440p-optimized parts are nothing new. Both NVidia and AMD make parts that target 1440p display users too, and have done for years. Even previous Intel parts, you could argue, were tailored for 1080p/1440p use, given their comparative performance deficit at 4k etc.
Assuming they retail at the prices Intel is suggesting in the press releases, you maybe save 40-50 bucks over a roughly equivalent NVidia 4060.
I would also argue, like others here, that with tech like frame gen, DLSS etc., even the cheapest discrete NVidia 40xx parts are arguably 1440p-optimized now; it doesn't even need to be said in their marketing materials. I'm not as familiar with AMD's range right now, but I suspect virtually every discrete graphics card they sell is "2K optimized" by the standard Intel used here, and it also doesn't really warrant explicit mention.
I'm baffled that PC gamers have decided that 1440p is the endgame for graphics. When I look at a 27-inch 1440p display, I see pixel edges everywhere. It's right at the edge of losing the visibility of individual pixels: I can't perceive them at 27-inch 2160p, but 1440p isn't quite there at desktop distances.
Time marches on, and I become ever more separated from gaming PC enthusiasts.
Gaming at 2160p is just too expensive still, imo. You gotta pay more for your monitor, GPU and PSU. Then if you want side monitors that match in resolution, you're paying more for those as well.
You say PC gamers at the start of your comment and gaming PC enthusiasts at the end. These groups are not the same and I'd say the latter is largely doing ultrawide, 4k monitor or even 4k TV.
According to Steam, 56% are on 1080p, 20% on 1440p and 4% on 2160p.
So gamers as a whole are still settled on 1080p, actually. Not everyone is rich.
The major drawback for PC gaming at 4k that I never see mentioned is how much heat the panels generate. Many of them generate so much heat that they rely on active cooling! I bought a pair of high-refresh 4k displays and, combined with the PC, they raised my room to an uncomfortable temperature. I returned them for other reasons (hard to justify not returning them when I got laid off a week after purchasing them), but I've since made note of the wattage when scouting monitors.
That was earlier this year. I found a new job with a pay raise so it turned out alright. Still miss my old team though.. we've been scattered like straws in the wind.
I'm still using a 50" 1080p (plasma!) television in my living room. It's close to 15 years old now. I've seen newer and bigger TVs many times at my friends house, but it's just not better enough that I can be bothered to upgrade.
Doesn't plasma have deep blacks and color reproduction similar to OLED? They're still very good displays, and being 15 years old means it probably pre-dates the SmartTV era.
I recently upgraded my main monitor from 1440p at 144 Hz to 4K at 144 Hz (with lots of caveats) and I agree with your assessment. If I had not made significant compromises, it would have cost at least $500 to get a decent monitor, which most people are not willing to spend.
Even with this monitor, I'm barely able to run it with my (expensive, though older) graphics card, and the screen alarmingly flashes whenever I change any settings. It's stable, but this is not a simple plug-and-play configuration (mine requires two DP cables and fiddling with the menu + NVIDIA control panel).
Why do you need two DP cables? Is there not enough bandwidth in a single one? I use a 4k@60 display, which is the maximum my cheap Anker USB-C Hub can manage.
Reddit also seems to have some people who have managed to get 144 with FreeSync, but I've only managed 120.
Funnily enough while I was typing this Netflix caused both my monitors to blackscreen (some sort of NVIDIA reset I think) and then come back. It's not totally stable!
This is likely a cable issue. Certain cable types can't handle 4k. I had to switch from DisplayPort to HDMI with a properly rated cable to get past this at one point.
It works up until too many pixels change, basically.
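For rough numbers on why the margin is so thin at 4K/high refresh, here's a back-of-the-envelope link-budget check (a sketch: 8-bit RGB and a ~20% blanking overhead are assumptions, and exact figures depend on the timing mode):

```python
# Rough link-budget check: does 4K @ 144 Hz, 8-bit RGB fit in one DP 1.4 link?
# Assumed numbers: ~20% blanking overhead, HBR3 x4 lanes minus 8b/10b coding.
width, height, refresh_hz, bits_per_pixel = 3840, 2160, 144, 24
blanking_overhead = 1.20

required_gbps = width * height * refresh_hz * bits_per_pixel * blanking_overhead / 1e9
dp14_payload_gbps = 32.4 * 0.8

print(f"needed: {required_gbps:.1f} Gb/s, DP 1.4 payload: {dp14_payload_gbps:.1f} Gb/s")
# ~34.4 Gb/s needed vs ~25.9 Gb/s available -> a single DP 1.4 cable only works
# with DSC (or by dropping to 120 Hz / lower bit depth), hence the flaky behaviour.
```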
Had the same issue at 4k 60fps, it mostly worked but the screen flashed black from time to time. I used the thickest cable I had lying around and it has worked fine since.
Interesting. I've been running my 4K monitor at 240Hz with HDR enabled for months and haven't had any issues with Display Stream Compression on my 4080.
Not rich. Well within reach for Americans with disposable income. Mid-range 16" MacBook Pros are in the same price ballpark as 4k gaming rigs. Or put another way, it costs less than a vacation for two to a popular destination.
I don’t think that’s true anymore. I routinely find 4K/27” monitors for under $100 on Craigslist, and a 3080-equivalent is still good enough to play most games on med-high settings at 4K and ~90Hz, especially if DLSS is available.
Your hypothetical person has a 3080 but needs to crawl Craigslist for a sub-$100 monitor? I guess those people exist, but idk why you'd bother with a 3080 to then buy a low-refresh-rate, high-input-latency, probably TN, low-color-accuracy Craigslist runoff.
I used to be in the '4k or bust' camp, but then I realized that I needed 1.5x scaling on a 27" display to have my UI at a comfy size. That put me right back at 1440p screen real estate, and I still had to deal with fractional scaling issues.
Instead, I bought a good 27" 1440p monitor, and you know what? I am not the discerning connoisseur of pixels that I thought I was. Honestly, it's fine.
I will hold out with this setup until I can get an 8K 144 Hz monitor and a GPU to drive it for a reasonable price. I expect that will take another decade or so.
I find the scaling situation with KDE is better with the Xorg X11 server than it is with Wayland. Things like Zoom will properly be scaled for me with the former.
I have a 4K 43" TV on my desk and it is about perfect for me for desktop use without scaling. For gaming, I tend to turn it down to 1080p because I like frames and don't want to pay up.
At 4K, it's like having 4 21" 1080p monitors. Haven't maximized or minimized a window in years. The sprawl is real.
It's true but I don't run into this issue often since most games and Windows will offer UI/Menu scaling without changing individual windows or the game itself.
I think it's less that gamers have decided it's the "endgame" and more that current gen games at good framerates at 4k require significantly more money than 1440p does, and at least to my eyes just running at native 1440p on a 1440p monitor looks much better than running an internal resolution of 1440p upscaled to 4k, even with DLSS/FSR - so just upgrading piecemeal isn't really a desirable option.
Most people don't have enough disposable income to make spending that extra amount a reasonable tradeoff (and continuing to spend on upgrades to keep up with their monitor on new games).
This is a trade-off with frame rates and rendering quality. When having to choose, most gamers prefer higher frame rate and rendering quality. With 4K, that becomes very expensive, if not impossible. 4K is 2.25 times the pixels of 1440p, which roughly means you can get more than double the frame rate at 1440p using the same processing power and bandwidth.
In other words, the current tech just isn’t quite there yet, or not cheap enough.
Arguably 1440p is the sweet spot for gaming, but I love 4k monitors for the extra text sharpness. Fortunately DLSS and FSR upscaling are pretty good these days. At 4k, quality-mode upscaling gives you a native render resolution of about 1440p, with image quality a little better and performance a little worse than native 1440p.
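For concreteness, here's the arithmetic behind that (a rough sketch; the per-axis scale factors below are the commonly cited DLSS/FSR preset values and vary a bit by vendor and version):

```python
# Internal render resolution for common upscaler presets at a 4K output.
# Scale factors are approximate per-axis values, not vendor-exact numbers.
out_w, out_h = 3840, 2160
presets = {"Quality": 0.667, "Balanced": 0.58, "Performance": 0.50}

for name, s in presets.items():
    w, h = round(out_w * s), round(out_h * s)
    print(f"{name:>11}: {w}x{h} ({w*h/(out_w*out_h):.0%} of the output pixels)")

# Quality mode at 4K renders ~2560x1440, the same pixel count as native 1440p,
# which is why 4K + quality upscaling lands close to (a bit below) native 1440p perf.
print(f"4K / 1440p pixel ratio: {out_w*out_h/(2560*1440):.2f}x")  # 2.25x
```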
I don’t think it’s seen as the end game, it’s that if you want 120 fps (or 144, 165, or 240) without turning down your graphics settings you’re talking $1000+ GPUs plus a huge case and a couple hundred watts more from your power supply.
1440p hits a popular balance where it’s more pixels than 1080p but not so absurdly expensive or power hungry.
Eventually 4K might be reasonably affordable, but we’ll settle at 1440p for a while in the meantime like we did at 1080p (which is still plenty popular too).
That's more of a function of high end Nvidia gaming card prices and power consumption. PC gaming at large isn't about chasing high end graphics anyway, steam deck falls under that umbrella and so does a vast amount of multiplayer gaming that might have other priorities such as affordability or low latency/very high fps.
I find dual 24-inch 1440p a great compromise. Higher pixel density, a decent amount of screen real estate, and it's nice to have an auxiliary monitor when gaming.
I run the second monitor off the IGPU so it doesn't even tax the main GPU.
It's a nice compromise for semi competitive play. On 4k it'd be very expensive and most likely finicky to maintain high FPS.
Tbh, now that I think about it, I only really need resolution for general usage. For gaming I'm running everything but textures on low, with min or max FOV depending on the game, so it's not exactly aesthetic anyway. What I need more is physical screen size, so heads are physically larger without shoving my face into the monitor, plus refresh rate.
I don't directly see the pixels per se like on 1080p at 27-inch at desktop distances. But I see harsh edges in corners and text is not flawless like on 2160p.
Like I said, it's on the cusp of invisible pixels.
Gamers often use antialias settings to smooth out harsh edges, whereas an inconsistent frame rate will literally cost you a game victory in many fast-action games. Many esports professionals use low graphics settings for this reason.
I've not tried but I've heard that a butter-smooth 90, 120, or 300 FPS frame rate (that is also synchronized with the display) is really wonderful in many such games, and once you experience that you can't go back. On less powerful systems it then requires making a tradeoff with rendering quality and resolution.
Nvidia markets the 4060 as a 1080p card. Its design makes it worse at 1440p than past x60-class cards too. Intel has XeSS to compete with DLSS and is reportedly coming out with its own frame gen competitor. $40-50 is a decent savings in the budget market, especially if Intel's claims are to be believed and it's actually faster than the 4060.
Actual use is inconsistent. From https://en.wikipedia.org/wiki/2K_resolution: “In consumer products, 2560 × 1440 (1440p) is sometimes referred to as 2K, but it and similar formats are more traditionally categorized as 2.5K resolutions.”
“2K” is used to denote WQHD often enough, whereas 1080p is usually called that, if not “FHD”.
“2K” being used to denote resolutions lower than WQHD is really only a thing for the 2048 cinema resolutions, not for FHD.
“In consumer products, 2560 × 1440 (1440p) is sometimes referred to as 2K,[13] but it and similar formats are more traditionally categorized as 2.5K resolutions.”
It'd be pretty weird if it were called 2k. 1080p (1920 pixels wide) is closer to 2k pixels of width than 4k (3840 wide) is to 4k, whether in an absolute sense or as a relative "distance" (both are under, of course, but one is under by 80 pixels, the other by 160). It's got a much better claim to the label 2k than 1440p does, and arguably a somewhat better claim to 2k than 4k has to 4k.
[EDIT] I mean, of course, 1080p also isn't typically called 2k (the 2048-wide cinema resolution is), but labeling 1440p as 2k is especially far off.
You are misunderstanding. 1080p, 1440p, 2160p refer to the number of rows of pixels, and those terms come from broadcast television and computing (the p is progressive, vs i for interlaced). 4k, 2k refer to the number of columns of pixels, and those terms come from cinema and visual effects (and originally meant 4096 and 2048 pixels wide). That means 1920×1080 is both 2k and 1080p, 2560×1440 is both 2.5k and 1440p, and 3840×2160 is both 4k and 2160p.
> You are misunderstanding. 1080p, 1440p, 2160p refer to the number of rows of pixels
> (the p is progressive, vs i for interlaced)
> 4k, 2k refer to the number of columns of pixels
> 2560×1440 is both 2.5k and 1440p, and 3840×2160 is both 4k and 2160p.
These parts I did not misunderstand.
> and those terms come from cinema and visual effects (and originally means 4096 and 2048 pixels wide)
OK that part I didn't know, or at least had forgotten—which are effectively the same thing, either way.
> 1920×1080 is both 2k and 1080p
Wikipedia suggests that in this particular case (unlike with 4k) application of "2k" to resolutions other than the original cinema resolution (2048x1080) is unusual; moreover, I was responding to a commenter's usage of "2k" as synonymous with "1440p", which seemed especially odd to me.
I see what you're saying, but I also feel like ALL Nvidia cards are "2K" oriented cards because of DLSS, frame gen, etc. Resolution is less important now in general thanks to their upscaling tech.
12GB max is a non-starter for ML work now. Why not come out with a reasonably priced 24gb card even if it isn't the fastest and target it at the ML dev world? Am I missing something here?
I think a lot of replies to this post are missing that Intel's last graphics card wasn't received well by gamers due to poor drivers. The GT 730 from 2014 has more users than all Arc cards combined according to the latest Steam survey.[0] It's entirely possible that making a 24gb local inference card would do better since they can contribute patches for inference libraries directly like they did for llama.cpp, as opposed to a gaming card where the support surface is much larger. I wish Intel well in any case and hope their drivers (or driver emulators) improve enough to be considered broadly usable.
How big is NVIDIA now? You don't think breaking into that market is a good strategy? And, yes, I understand that this is targeted at gamers and not ML. That was the point of the comment I made. Maybe if they did target ML they would make money and open a path to the massive server market out there.
A video card that beats the 4060 for under $250 is very much going to be a problem for AMD and is going to eat the "low end" market if it is reasonably stable.
I have been trying to hold my slurs in reading this thread.
These ML AI Macbook people are legit insane.
Desktops and gaming is ugly and complex to them (because lego is hard and macbook look nice unga bunga), yet it is a mass market Intel wants to move in on.
People here complain because Intel is not making a cheap GPU to "make AI" on when that's a market of maybe 1000 people.
This Intel card is perfect for an esports gaming machine running CS2, Valorant, Rocket League and casual or older games like The Sims, GOG games etc. Market of 1 million+ right there; CS2 alone has 1 million people playing every day. Not people grinding leetcode on their Macs. Every real developer has a desktop, EPYC CPU, giga RAM and a nice GPU for downtime, and runs a real OS like Linux or even Windows (yes, the majority of devs run Windows)
Most devs use Windows (https://www.statista.com/statistics/869211/worldwide-softwar...). Reddit's r/LocalLLaMA alone has 250k users. Clearly the market is bigger than 1000 people.
Why are gamers and Linux people always so aggressively dismissive of other people’s interests?
I agree ML is about to hit (or has likely already hit) some serious constraints compared to breathless predictions of two years ago. I don't think there's anything equivalent to the AI winter on the horizon, though—LLMs even operated by people who have no clue how the underlying mechanism functions are still far more empowered than anything like the primitives of the 80s enabled.
I think there'll be a "financial" winter, or put another way, a bubble burst. The investment right now is simply unsustainable; how are these products going to be monetized?
Nvidia had revenue of $27 billion in 2023 - that's about $160 per person per year [0] for every working-age person in the USA. And it's predicted to more than double in 2024. If you reduce that to office workers (you know, the people who might actually get some benefit, as no AI is going to milk a cow or serve you Starbucks) that's more like $1450/year. Or again, more than double that for 2024.
How much value add is the current set of AI products going to give us? It's still mostly promise too.
Sure, like most bubbles there'll probably still be some winners, but there's no way the current market as a whole is sustainable.
The only way the "maximal AI" dream income is actually going to happen is if they functionally replace a significant proportion of the working population completely. And that probably would have large enough impacts to society that things like "Dollars In A Bank" or similar may not be so important.
While I'd agree monetisation seems to be a challenge in the long term (analogy: spreadsheets are used everywhere, but are so easy to make they're not themselves a revenue stream, only as part of a bigger package)…
> Nvidia had a revenue of $27billion in 2023 - that's about $160 per person per year [0] for every working age person in the USA
As a non-American, I'd like to point out we also earn money.
> as no AI is going to milk a cow or serve you starbucks
Some of the malls around here have food courts where robots bring out the meals. I assume they're no more sophisticated than robot vacuum cleaners, but they get the job done.
Transformer models seem to be generally pretty good at high-level robot control, though IIRC a different architecture is needed down at the level of actuators and stepper motors.
Sure, robotics help many jobs, and some level of the current deep-learning boom seems to have crossover in improving that - but how many of them are running LLMs that affect Nvidia's bottom line right now? There's some interesting research in that area, but it's certainly not the primary driving force. And is the control system even the limiting factor for many systems? It's probably relatively easy to get a machine today that makes a Starbucks coffee "as good as" a decently trained human, but the market doesn't seem to want that.
And I know restricting it to the US is a simplification, but so is restricting it to Nvidia, it's just to give a ballpark back-of-the-envelope "does this even make sense?" level calculation. And that's what I'm failing to see.
Machines that will make espresso, automatically, that I personally like better than what Starbucks serves are widely available. No AI needed, and they aren't even "robotic". They can use ordinary coffee beans, and you can get them for home use or for commercial use. You can also go to a mall and get a robot to make you coffee.
Nonetheless, Starbucks does not use these machines, and I don't see any reason that AI, on its current trajectory, will change that calculation any time soon.
It's pretty often discussed, it's just hard to put everything into a single comment (or thread).
I mean, Yudkowsky has basically spent the last decade screaming into the void about how AI will with high probability literally kill everyone, and even people like me who think that danger is much less likely still look at the industrial revolution and how slow we were to react to the harms of climate change and think "speed-running another one of these may be unwise, we should probably be careful".
Yeah, but automated milking robots like that have been in the market for more than a decade now IIRC?
Seems like a lot of CV solutions have seen fairly steady but small incremental advances over the past 10-15 years, quite unrelated to the current AI hype.
I use one in the kitchen, because it's easier than the ads and prose on most recipe websites, and it can adapt to whatever ingredients I actually have rather than being a fixed list.
Used them in the garden while weeding and in the garden store while planning what to plant, in both cases to identify plants by image and tell me about them — though I'd say image capable AI are no longer mere "large language models".
Used ChatGPT while shopping to help me locate products in the store I was in, when I couldn't find them just by wandering the aisles, by uploading a photo of the aisle I happened to be in at the point I gave up.
> even operated by people who have no clue how the underlying mechanism functions are still far more empowered than anything like the primitives of the 80s enabled.
I'm still not convinced about that. All the """studies""" show 30-60% boost in productivity but clearly this doesn't translate to anything meaningful in real life because no industry laid off 30-60% of their workforce and no industry progressed anywhere close to 30% since chat gpt was released.
It was released a whole 24 months ago; remember the talk about freeing us from work and curing cancer... Even investment funds, which are the biggest suckers for anything profitable, are more and more doubtful.
Yeah… I want to think of it like mining, where you’ve found an ore vein. You have to switch from prospecting to mining. There’s a lot of work to be done by integrating our LLMs and other tools with other systems, and I think the cost/benefit of making models bigger, Bigger, BIGGER is reaching a plateau.
Haven't people been saying that for the last decade? I mean, eventually they will be right, maybe "about" means next year, or maybe a decade later? They just have to stop making huge improvements for a few years and the investment will dry up.
I really wasn't interested in computer hardware anymore (they are fast enough!) until I discovered the world of running LLMs and other AI locally. Now I actually care about computer hardware again. It is weird, I wouldn't have even opened this HN thread a year ago.
Curious as I’m of the same mind - what’s your local AI setup? I’m looking to implement a local system that would ideally accommodate voice chat. I know the answer depends on my use case - mostly searching and analysis of personal documents - but would love to hear how you’ve implemented.
If you are just starting up, you can try out 'open-webui' as inspiration.
After that you can just use llama.cpp to build out your own things.
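For the "build your own things" route, here's a minimal sketch with the llama-cpp-python bindings (the model path is a placeholder, and the API shifts occasionally, so check the current docs):

```python
# Minimal local chat completion via llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder; any GGUF-quantized model works.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # hypothetical file name
    n_gpu_layers=-1,   # offload every layer to the GPU if the build supports it
    n_ctx=8192,        # context window
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what resizable BAR does."}],
    max_tokens=256,
)
print(resp["choices"][0]["message"]["content"])
```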
Hardware side, I just have a beefy server that acts as a router (mellanox card to provider fiber optic and local fiber network), firewall, wifi access point, zigbee coordinator, host to various services, camera video feed ingestion and processing, and so on...
Control and freedom. You can use unharmonious models and hacks to existing models. Also latency: you can actually use AI for a lot more applications when it's running locally.
The survivors of the AI winter are not the dinosaurs but the small mammals that can profit by dramatically reducing the cost of AI inference in a minimum Capex environment.
Launching a new SKU for $500-1000 with 48gb of RAM seems like a profitable idea. The GPU isn't top-of-the-line, but the RAM would be unmatched for running a lot of models locally.
It's not technically possible to just slap on more RAM. GDDR6 is point-to-point with an option for clamshell, and the largest chips in mass production are 16 Gbit on a 32-bit channel. So, for a 192-bit card, the best you can get is 192/32 × 16 Gbit × 2 = 24 GB.
To have more memory, you have to design a new die with a wider interface. The design+test+masks on leading edge silicon is tens of millions of NRE, and has to be paid well over a year before product launch. No-one is going to do that for a low-priced product with an unknown market.
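A quick sanity check of that ceiling, using the figures above (chip density and clamshell support are the assumptions):

```python
# Max GDDR6 capacity for a given memory bus, with one 16 Gbit chip per 32-bit channel.
def max_vram_gb(bus_width_bits, chip_gbit=16, clamshell=True):
    chips = bus_width_bits // 32          # one chip per 32-bit channel
    if clamshell:
        chips *= 2                        # two chips share each channel
    return chips * chip_gbit / 8          # Gbit -> GB

print(max_vram_gb(192))                   # 24.0 GB -- ceiling for a 192-bit design
print(max_vram_gb(192, clamshell=False))  # 12.0 GB -- what the B580 actually ships with
print(max_vram_gb(256))                   # 32.0 GB -- needs a wider, pricier die
```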
The savior of home inference is probably going to be AMD's Strix Halo. It's a laptop APU built to be a fairly low end gaming chip, but it has a 256-bit LPDDR5X interface. There are larger LPDDR5X packages available (thanks to the smartphone market), and Strix Halo should be eventually available with 128GB of unified ram, performance probably somewhere around a 4060.
You can’t just throw in more RAM without having the rest of the GPU architected for it. So there’s an R&D cost involved for such a design, and there may even be trade-offs on performance for the mass-market lower-tier models. I’m doubtful that the LLM enthusiast/tinkerer market is large enough for that to be obviously profitable.
That would depend on how they designed the memory controllers; GDDR6 only supports 1-2 GB modules at present (I believe GDDR6W supports 4 GB modules). If they were using twelve 1 GB modules, then increasing to 24 GB shouldn't be a very large change.
Honestly, Apple seems to be on the right track here. DDR5 is slower than GDDR6, but you can scale the amount of RAM far higher simply by swapping out the density.
Give me 48 GB with reasonable power consumption so I can dev locally and I will buy it in a heartbeat. Anyone who is fine-tuning would want a setup like that to test things before pushing to real GPUs. And in reality, if you can fine-tune on a card like that in two days instead of a few hours, it would totally be worth it.
The bigger point here is to ask why they aren't designing that in from the start. Same with AMD. RAM has been stalled and is critical. Start focusing on allowing a lot more of it, even at the cost of performance, and you have a real product. I have a 12GB 3060 as my dev box and the big limiter for it is RAM, not cuda cores. If it had 48GB but the same number of cores then I would be very happy with it, especially if it was power efficient.
Because designing a low end GPU with a very wide memory interface isn't useful for gaming, and that is where the vast majority of non-datacenter discrete GPU sales are right now.
There has been no hint or evidence (beyond hope) Intel will add a 900 class this generation.
The B770 was rumoured to match the 16 GB of the A770 (and to be the top-end offering for Battlemage), but it reportedly hasn't even been taped out yet, with rumours it may end up cancelled completely.
I.e., don't hold your breath for anything consumer from Intel this generation better for AI than the A770 you could have bought 2 years ago. Even if something slightly better is coming at all, there is no hint it will be soon.
NVIDIA and AMD don't even make new GPU silicon this low-end. Their smallest current-gen GPUs all debuted at higher price points, though the Radeon RX 7600 is now available at the same price that the B580 is launching at.
I was watching a few of the preliminary commentaries on the Battlemage cards, e.g. Linus Tech Tips and so on, and they said the same thing "focus on the low/mid-end". My conclusion is that what they are talking about is "low end, brand new, current gen. graphics card". It's a very special type of low end.
The Intel cards are getting more interesting for me as I'm questioning my continued use of macOS. Intel's focus on Linux support makes their options really interesting, though I don't see a need for something as powerful as these new cards.
Can you even do ML work with a GPU not compatible with CUDA? (genuine question)
A quick search showed me the equivalent of CUDA in the Intel world is oneAPI, but in practice, are the major Python libraries used for ML compatible with oneAPI? (Was also gonna ask if oneAPI can run inside Docker but apparently it does [1])
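(Partially answering my own question after more digging: PyTorch exposes Intel GPUs as an `xpu` device; recent releases bundle it, older ones need the intel_extension_for_pytorch package. A minimal sketch, assuming such a build is installed:)

```python
# Sketch: running a small PyTorch model on an Intel Arc GPU via the "xpu" device.
# Assumes a PyTorch build with Intel XPU/oneAPI support (recent releases include it;
# older setups need `import intel_extension_for_pytorch` first).
import torch

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 10),
).to(device)

x = torch.randn(32, 1024, device=device)
with torch.no_grad():
    y = model(x)
print(y.shape, y.device)
```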
I still don't understand why graphics cards haven't evolved to include SODIMM slots so that the VRAM can be upgraded by the end user. At this point memory requirements vary so much from gamer to scientist that it would make more sense to offer compute packages with user-supplied memory.
tl;dr GPUs need to transition from being add-in cards to being a sibling motherboard. A sisterboard? Not a daughter board.
One of the reasons GPUs can have multiples of CPU bandwidth is they avoid the difficulties of pluggable dimms - direct soldered can have much higher frequencies at lower power.
It's one of the reasons why ARM Macbooks get great performance/watt, memory being even "closer" than mainboard soldered RAM so getting more of those benefits, though naturally less flexibility.
This makes sense. Do we need to begin exploring different socket configurations? IE, using something that more closely resembles a CPU socket versus a traditional RAM slot.
Even DDR5 has this problem. Go look at what soldered DDR5 can do frequency-wise compared to DIMMs. It's one of the problems the new CAMM form factor aims to help solve, making it tractable to push memory frequency beyond what DIMMs can get you currently.
I have always wondered: would it be possible to put memory on the back side of the motherboard to get it closer to the CPU? And if it is, would it solve anything other than RAM clearance for CPU coolers?
I put an a360 card into an old machine I turned into a Plex server. It turned it into a transcoding powerhouse. I can do multiple independent streams now without it skipping a beat. The price-performance ratio was off the chart.
My 7950X3D's iGPU does 4k HDR (33 Mb/s) to 1080p at 40 fps (Proxmox, Jellyfin). If these GPUs supported SR-IOV I would grab one for transcoding and GPU-accelerated remote desktop.
Untouched video (star wars 8) 4k HDR (60Mb/s) to 1080p at 28fps
All first-gen Arc GPUs share the same video encoder/decoder, including the sub-$100 A310, which can handle four (I haven't tested more than two) simultaneous 4k HDR -> 1080p AV1 transcodes at high bitrate with tone mapping while using 12-15 W of power.
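For reference, a transcode along those lines looks roughly like this through ffmpeg's QSV path (a sketch, assuming an ffmpeg build with QSV/oneVPL enabled; the flags are illustrative, and Jellyfin constructs a much more elaborate filter chain for tone mapping):

```python
# Sketch of a hardware 4K HDR -> 1080p AV1 transcode on Intel Arc via ffmpeg's QSV path.
# Assumes ffmpeg was built with QSV/oneVPL support; file names are placeholders.
import subprocess

cmd = [
    "ffmpeg",
    "-hwaccel", "qsv", "-hwaccel_output_format", "qsv",
    "-i", "input_4k_hdr.mkv",
    "-vf", "scale_qsv=w=1920:h=1080",          # hardware downscale to 1080p
    "-c:v", "av1_qsv", "-global_quality", "28",  # hardware AV1 encode, quality-targeted
    "-c:a", "copy",
    "output_1080p_av1.mkv",
]
subprocess.run(cmd, check=True)
```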
Any idea how that compares to Apple Silicon for that job? I bought the $599 MacBook Air with M1 as my plex server for this reason. Transcodes 4k HEVC and doesn’t even need a fan. Sips watts.
Amazing. It is the first time I have plugged any gpu into my linux box and have it just work. I am never going back to anything else. My main computer uses an a750, and my jellyfin server uses an a310.
No issues with linux. The server did not like the a310, but that is because it is an old dell t430 and it is unsupported hardware. The only thing I had to do was to tweak the fan curve so that it stopped going full tilt.
A not inconsequential possibility is that both the iGPU and dGPU are sharing the transcoding workload, rather than the dGPU replacing the iGPU. It's a fairly forgotten feature of Intel Arc, but I don't blame anyone because the help articles are dusty to say the least.
Well-informed gamers know Intel's discrete GPU is hanging by a thread, so they're not hopping on that bandwagon.
Too small for ML.
The only people really happy seem to be the ones buying it for transcoding and I can't imagine there is a huge market of people going "I need to go buy a card for AV1 encoding".
Intel has earned a lot of credit in the Linux space.
Nvidia is trash tier in terms of support and only recently making serious steps to actually support the platform.
AMD went all in nearly a decade ago and it's working pretty well for them. They are mostly caught up to being Intel grade support in the kernel.
Meanwhile, Intel has been doing this since I was in college. I was running the i915 driver in Ubuntu 20 years ago. Sure their chips are super low power stuff, but what you can do with them and the level of software support you get is unmatched. Years before these other vendors were taking the platform seriously Intel was supporting and funding Mesa development.
The AMD driver has been great on my Framework 13, but the 6.10 series was completely busted. 6.11 worked fine. I can't remember a series where any of my Intel laptops didn't work for that long.
This is repeated often, but I have had very good support from Nvidia on Linux over the years. AMD on the other hand gives lousy support. File a bug report about a problem and expect to be ignored, especially if it has anything to do with emulation. Intel’s Linux support on the other hand has been very good for me too.
If it works well on Linux there's a market for that. AMD are hinting that they will be focusing on iGPUs going forward (all power to them, their iGPUs are unmatched and NVIDIA is dominating dGPU). Intel might be the savior we need. Well, Intel and possibly NVK.
Had this been available a few weeks ago I would have gone through the pain of early adoption. Sadly it wasn't just an upgrade build for me, so I didn't have the luxury of waiting.
>Well-informed gamers know Intel's discrete GPU is hanging by a thread, so they're not hopping on that bandwagon.
If Intel's stats are anything to go by, League runs way better than it did on the last generation, and it's the only game still running on DX9 that had issues on the last gen; CS:GO was another notable one, but CS2 has launched since and the game has moved to DX12/VK. This was, literally, the biggest issue they had. Drivers were also wonky, but they seem to have ironed that out as well.
The game devs are going to spend all their time & effort targeting amd/nvidia. Custom code paths etc.
It's not a one size fits all world. OpenCL etc abstraction are good at covering up differences, but not that good. So if you're the player with <10% market share you're going to have an uphill battle to just be on par.
Intel's customers are third-party PC assemblers like Dell & HP. Many corporate bulk buyers only care whether 1-2 of the apps they use are supported. The lack of wider support isn't a concern.
something like 60% of the GPUs by volume on Steam are on the mid-to-low end. like a non-trivial number still running 750 Ti's. there are a lot of gamers, and most aren't well-off tech bros.
there is a niche for this, though it remains to be seen if it'll be profitable enough for a large org like Intel
If you go on the intel arc subreddit people are hyped about intel GPUs. Not sure what the price is but the previous gen was cheap and the extra competition is welcomed
In particular, intel just needs to support vfio and it’ll be huge for homelabs.
Would it though? How many people are running inference at home? Outside of enthusiasts I don't know anyone. Even companies don't self-host models and prefer to use APIs. Not that I wouldn't like a consumer GPU with tons of VRAM, but I think that the market for it is quite small for companies to invest building it. If you bother to look at Steam's hardware stats you'll notice that only a small percentage is using high-end cards.
This is the weird part, I saw the same comments in other threads. People keep saying how everyone yearns for local LLMs… but other than hardcore enthusiasts it just sounds like a bad investment? Like it’s a smaller market than gaming GPUs. And by the time anyone runs them locally, you’ll have bigger/better models and GPUs coming out, so you won’t even be able to make use of them. Maybe the whole “indoctrinate users to be a part of Intel ecosystem, so when they go work for big companies they would vouch for it” would have merit… if others weren’t innovating and making their products better (like NVIDIA).
This makes sense in some ways technologically, but just having a "centralized compute box" seems like a lot more complexity than many/most would want in their homes.
I mean, everything could have been already working that way for a lot of years right? One big shared compute box in your house and everything else is a dumb screen? But few people roll that way, even nerds, so I don't see that becoming a thing for offloaded AI compute.
I also think that the future of consumer AI is going to be models trained/refined on your own data and habits, not just a box in your basement running stock ollama models. So I have some latency/bandwidth/storage/privacy questions when it comes to wirelessly and transparently offloading it to a magic AI box that sits next to my wireless router or w/e, versus running those same tasks on-device. To say nothing of consumer appetite for AI stuff that only works (or only works best) when you're on your home network.
Intel sold their GPUs at negative margin, which is part of why the stock fell off a cliff. If they could double the VRAM they could raise the price into the green; even selling thousands of units, likely closer to 100k, would be far better than what they're doing now. The problem is Intel is run by incompetent people who guard their market segments as tribal fiefs instead of solving for the customer.
that's a dumb management "cart before the horse" problem. I understand a few bugs in the driver but they really should have gotten the driver working decently well before production. Would have even given them more time tweaking the GPU. This is exactly why Intel is failing and will continue to fail with that type of management
Intel management is just brain dead. They could have sold the cards for mining when there was a massive GPU shortage and called it the developer edition but no. It's hard to develop a driver for games when you have no silicon.
I think you're massively underestimating the development cost, and overestimating the number of people who would actually purchase a higher-VRAM card at a higher price.
You'd need hundreds of thousands of units to really make much of a difference.
Well, IIUC it's a bit more "having more than 12GB of RAM and raising the price will let it run bigger LLMs on consumer hardware and that'll drive premium-ness / market share / revenue, without subsidizing the price"
I don't know where this idea is coming from, although it's all over these threads.
For context, I write a local LLM inference engine and have 0 idea why this would shift anyone's purchase intent. The models big enough to need more than 12GB VRAM are also slow enough on consumer GPUs that they'd be absurd to run. Like less than 2 tkns/s. And I have 64 GB of M2 Max VRAM and a 24 GB 3090ti.
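The rough arithmetic behind that: single-stream decode is mostly memory-bandwidth bound, so tokens/s is approximately bandwidth divided by the bytes touched per token. Ballpark figures below, not measurements:

```python
# Ballpark only: single-stream decode re-reads (roughly) all of the weights per token,
# so tokens/s ~= effective memory bandwidth / model size. Numbers are illustrative.
def tok_per_s(model_gbytes, bandwidth_gbytes_per_s):
    return bandwidth_gbytes_per_s / model_gbytes

# ~8 GB quantized model resident in VRAM on a ~456 GB/s card (B580-class bandwidth):
print(f"{tok_per_s(8, 456):.0f} tok/s")   # ~57
# ~20 GB quantized model spilling into PCIe/system RAM at ~50 GB/s effective:
print(f"{tok_per_s(20, 50):.1f} tok/s")   # ~2.5 -- why models that overflow VRAM crawl
```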
The enthusiast/prosumer/etc. market generally still carries a lot of weight in most markets even if the revenue is limited. E.g., if hobbyists/students/developers start using Intel GPUs, in a few years the enterprise market might become much less averse to buying Intel's datacenter chips.
> Would it though? How many people are running inference at home?
I don't know how to quantify it, but it certainly seems like a lot of people are buying consumer nVidia GPUs for compute and the relatively paltry amounts of RAM on those cards seems to be the number one complaint.
So I would say that Intel's potential market is "everybody who is currently buying nVidia GPUs for compute."
nVidia's stingy consumer RAM choices also seem to be a fairly transparent ploy to create a protective moat around their insanely-high-profit-margin datacenter GPUs. So that just seems like kind of an obvious thing for Intel or AMD to consider tackling.
(Although, it has to be said, a lot of commenters have pointed out that it's not as easy as just slapping more RAM chips onto the GPU boards; you need wider data busses as well etc.)
It's a chicken and egg scenario. The main problem with running inference at home is the lack of hardware. If the hardware was there more people would do it. And it's not a problem if "enthusiasts" are the only ones using it because that's to be expected at this stage of the tech cycle. If the market is small just charge more, the enthusiasts will pay it. Once more enthusiasts are running inference at home, then the late adopters will eventually come along.
100% - this could be Intel's ticket to capture the hearts of developers and then everything else that flows downstream. They have nothing to lose here -- just do it Intel!
You can get that on a Mac mini and it will probably cost you less than an equivalent PC setup. It should also perform better than a low-end Intel GPU and be better supported. It will use less power as well.
My 7800X says not really. Compared to my 3070 it feels so incredibly slow that it gets in the way of productivity.
Specifically, waiting ~2 seconds vs ~20 for a code snippet is much more detrimental to my productivity than the time difference would suggest. In ~2 seconds I don't get distracted, in ~20 seconds my mind starts wandering and then I have to spend time refocusing.
Make a GPU that is 50% slower than a two-generations-older mid-range GPU (in tokens/s) but on bigger models and I would gladly shell out $1000+.
So much so that I am considering getting a 5090 if nVidia actually fixes the connector mess they made with the 4090s, or even a used V100.
Maybe that's not too bad for someone who wants to use pre-existing models. Their AI Playground examples require at minimum an Intel Core Ultra H CPU, which is quite low-powered compared to even these dedicated GPUs: https://github.com/intel/AI-Playground
I don't know a single person in real life that has any desire to run local LLMs. Even amongst my colleagues and tech friends, not very many use LLMs period. It's still very niche outside AI enthusiasts. GPT is better than anything I can run locally anyway. It's not as popular as you think it is.
I run a 12GB model on my 3060 and use it to help answer healthcare questions. I'm currently doing a medical residency. (No I don't use it to diagnose). It helps comply with any HIPAA style regulations. I sometimes use it to fix up my emails. Not sure why people are longing for a 128GB card, just download a quantized model and run with LM Studio (https://lmstudio.ai/). At least two of my colleagues are using ChatGPT on a regular basis. LLMs are being used in the ER department. LLMs and speech models are being used in psychiatry visits.
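If you'd rather script against it than use the chat UI, LM Studio also exposes an OpenAI-compatible local server; a minimal sketch, assuming its usual default port:

```python
# Talking to LM Studio's OpenAI-compatible local server (pip install openai).
# localhost:1234 is LM Studio's usual default; nothing leaves the machine.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local-model",  # LM Studio serves whatever model is currently loaded
    messages=[{"role": "user", "content": "Fix the tone of this follow-up email: ..."}],
    temperature=0.3,
)
print(resp.choices[0].message.content)
```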
This is what I'm fiddling with. My 2080Ti is not quite enough to make it viable. I find the small models fail too often, so need larger Whisper and LLM models.
Like the 4060 Ti would have been a nice fit if it hadn't been for the narrow memory bus, which makes it slower than my 2080 Ti for LLM inference.
A more expensive card has the downside of not being cheap enough to justify idling in my server, and my gaming card is at times busy gaming.
absolutely wrong -- if you're not clever enough to think of any other reason to run an LLM locally then don't condemn the rest of the world to "well they're just using it for porno!"
>Battlemage is still treated to fully open-source graphics driver support on Linux.
I am hoping these are open in such a manner that they can be used in OpenBSD. Right now I avoid all hardware with a Nvidia GPU. That makes for somewhat slim pickings.
If the firmware is acceptable to the OpenBSD folks, then I will happily use these.
For me, the most important feature is Linux support. Even if I'm not a gamer, I might want to use the GPU for compute and buggy proprietary drivers are much more than just an inconvenience.
Sure, but open drivers have been AMD's selling point for a decade, and even nVidia is finally showing signs of opening up. So it's a bit dubious whether these new Intels really can compete on this front, at least for very long.
I welcome a new competitor. Sucks to really only have one valid option on Linux atm. My 6600 is a little long in the tooth. I only have it because it is dead silent and runs a 5K display without issue - but I would definitely like to upgrade to something that can hold its own with ML.
> Sucks to really only have one valid option on Linux atm.
I don't think that's a super fair shake? Intel iGPUs have been around for a while if you had a laptop chip or iGPU-enabled desktop chip. They've supported Linux just fine for ages, and will fill any non-3D application you might have.
And Nvidia chips are quite good on Linux nowadays - Wayland has been very usable since the 535-series drivers and nearly flawless since 550. You're right to be apprehensive about proprietary GPU hardware but I think there are plenty of options on the table right now.
The iGPU in my 13900K cannot run my 5K display with decent performance (in a desktop environment). I chalked it up to hardware issues but it could be drivers. I am on Debian Linux.
Those numbers are identical to the A770, and don't match the numbers from the preview[0], so I think that's a copy paste error.
If we use the numbers from the preview:
| |Arc A770|Arc B580|RTX 4060|
|--------|--------|--------|--------|
|Process |N6 |N5 |N5 |
|Die Size|406mm^2 |272mm^2 |159mm^2 |
|Trans. |21.7B |19.6B |18.9B |
|Mem Bus |256 bit |192 bit |128 bit |
|TDP |225W |190W |115W |
|~Perf |90% |110% |100% |
In terms of performance per die area it's a big improvement over A770 but still far behind Nvidia. It's interesting that the transistor density is so much lower than the 4060 despite having the same (or at least similar) process node. Speculating about why that may be:
- Nvidia has better layout.
- Intel is using higher performance (larger) transistor libraries or layout in order to hit the higher boost frequencies (2800 vs 2460).
- Wider bus interface takes up more space.
- The B580 may have 1 render slice and 64-bits of memory bus disabled, and they're not including those transistors in the count, but they still take up area.
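Putting rough numbers on that density gap, straight from the table above:

```python
# Transistor density implied by the table above (billions of transistors / die area).
dies = {"Arc A770": (21.7, 406), "Arc B580": (19.6, 272), "RTX 4060": (18.9, 159)}

for name, (billions, mm2) in dies.items():
    print(f"{name}: {billions * 1000 / mm2:.0f} MTr/mm^2")
# ~53 for A770 (N6), ~72 for B580 vs ~119 for the 4060 on a similar N5-class node
```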
Those numbers suggest that they have caught Nvidia in performance per transistor. As for the die area being larger, I suspect that the larger memory bus might be partly responsible. The transistors used for IO stopped shrinking on new nodes a while ago, so they use plenty of die area.
> Both the Arc B580 and B570 are based on the "BMG-G21" a new monolithic silicon built on the TSMC 5 nm EUV process node. The silicon has a die-area of 272 mm², and a transistor count of 19.6 billion
I wonder what happened to my brain writing the above. There is a "He" in there that makes no sense. Flawless support for "different" screens? I of course mean "many screens".
I'm really curious to see if these still rely heavily on Resizable BAR. Putting these in old computers on Linux without ReBAR support makes the driver crash under literally any load, rendering the cards completely unusable.
It's a real shame; the single-slot A380 is a great performance-for-price light-gaming and general-use card for small machines.
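If you're unsure whether a given box exposes ReBAR for the card, lspci reports the capability; a quick heuristic check (the exact strings vary between lspci versions, and you may need root for the full capability dump):

```python
# Heuristic check for the Resizable BAR capability on Linux GPUs via lspci.
# Capability strings vary between lspci versions; run as root for the full dump.
import subprocess

out = subprocess.run(["lspci", "-vv"], capture_output=True, text=True).stdout

for block in out.split("\n\n"):
    if "VGA compatible controller" in block or "Display controller" in block:
        header = block.splitlines()[0]
        rebar = "Resizable BAR" in block
        print(f"{header}\n  Resizable BAR capability: {'yes' if rebar else 'not reported'}")
```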
What is the newest platform that lacks resizable BAR? It was standardized in 2006. Is 4060-level graphics performance useful in whatever old computer has that problem?
The newest platform is probably POWER10. ReBar is not supported on any POWER platform, most likely including the upcoming POWER11.
Also, I don't think you'll find many mainboards from 2006 supporting it. It may have been standardized in 2006, but a quick online search leads me to think that even on x86 mainboards it didn't become commonly available until at least 2020.
If the firmware is open source, perhaps it could be retrofitted. For amd64 motherboards that do not have native UEFI support, retrofits are possible through this:
Congrats on a pretty niche reply. I wonder if literally anyone has tried to put an ARC dGPU in a POWER system. Maybe someone from Libre-SOC will chime in.
Sandy Bridge (2011) is still a very usable CPU with a modern GPU. In theory Sandy Bridge supported Resizable BAR, but in practice it didn't. I think the problem was BIOSes.
On paper, any PCIe 2.0 motherboard can receive a BIOS update adding ReBAR support (it arrived with PCIe 2.1), but the reality is that you pretty much have to get a PCIe 3.0 motherboard to have any chance of having it, or of modding it in yourself.
Another issue is that not every GPU actually supports ReBAR, I'm reasonably certain the Nvidia drivers turn it off for some titles, and pretty much the only vendor that reliably wants ReBAR on at all times is Intel Arc.
I also personally wouldn't say that Sandy Bridge is very usable with a modern GPU without also specifying what kind of CPU or GPU. Or context in how it's being used.
My old Ice Lake CPU was very much a bottleneck in lots of games in 2018 when I finally replaced it. It was a noticeable improvement across the board making the jump to a Zen+ CPU at the time, even with the same GPU.
Oh wow. That's older than I thought. This is definitely less of an issue than folks make out of it.
I cling onto my old hardware to limit e-waste where I can. I still gave up on my old Sandy Bridge machine once it hit about a decade old. Not only would the CPU have trouble keeping up, it's mostly only PCIe 2.0. A few had 3.0. You wouldn't get the full potential even out of the cheapest one of these Intel cards. If you are putting a GPU in a system like that, I can't imagine even buying new. Just get something used off eBay.
There were a lot of generations after Sandy Bridge which didn't have it; Sandy Bridge was just one generation that didn't really support it on the consumer side.
Consumer boards and CPUs didn't really support it well until after 2018. I upgraded away from a Zen+ system because it didn't support it.
While "standardized" many implementations were so buggy to be unusable - we needed >4gb pcie mappings for a development board, and finding motherboards that actually worked was a PITA well into 2012 (when I left that project).
ReBAR was standardized in 2006 but consumer motherboards didn't start shipping with an option to enable it until much later, and didn't start turning it on by default until a few years ago.
I wanted to have alternative choices than Nvidia for high power GPUs. Then the more I thought about it, the more it made sense to rent cloud services for AI/ML workloads and lesser powered ones for gaming. The only use cases I could come up with for wanting high-end cards are 4k gaming (a luxury I can't justify for infrequent use) or for PC VR which may still be valid if/when a decent OLED (or mini-OLED) headset is available--the Sony PSVR2 with PC adapter is pretty close. The Bigscreen Beyond is also a milestone/benchmark.
Don't rent a GPU for gaming, unless you're doing something like a full-on game streaming service. +10ms isn't much for some games, but would be noticeable on plenty.
IMO you want those frames getting rendered as close to the monitor as possible, and you'd probably have a better time with lower fidelity graphics rendered locally. You'd also get to keep gaming during a network outage.
I don't even think network latency is the real problem, it's all the buffering needed to encode a game's output to a video stream and keep it v-synced with a network-attached display.
I've tried game streaming under the best possible conditions (<1ms network latency) and it still feels a little off. Especially shooters and 2D platformers.
Yeah - there's no way to play something like Overwatch/Fortnite on a streaming service and have a good time. The only things that seem to be OK are turn-based games or platformers.
I haven't decided/pulled-the-trigger but the Intel ARC series are giving the AMD parts a good run for the money.
The only concern is how well the new Intel drivers (with full support for DX12) work with older titles; DX11, 10, and 9 support is continuously being improved, some of it via emulation.
There's likely some deep discounting of Intel cards because of how bad the drivers were at launch and the prices may not stay so low once things are working much better.
Intel isn't going anywhere for at least a couple of hardware generations. Buying a GPU is also not "investing" in anything. In 2 years' time you can replace it with whatever is best value for money at that time.
> There's no way you're going to maintain and develop the intel linux driver as a solo dev.
I agree entirely.
My point was that even if Intel disappeared tomorrow, there's a good chance that Linux developer community would take over maintenance of those drivers.
In contrast to, e.g., 10-years-ago nvidia, where IIUC it was very difficult for outsiders to obtain the documentation needed to write proper drivers for their GPUs.
I can't speak from experience with their GPUs on Linux, but I know on Windows most of their problems stem from supporting pre-DX12 Direct3D titles. Nvidia and AMD have spent many years polishing up their Direct3D support and putting in driver-side hacks that paper over badly programmed Direct3D games.
These are obviously Windows-specific issues that don't come up at all in Linux, where all that Direct3D headache is taken care of by DXVK. Amusingly a big part of Intel's efforts to improve D3D performance on Windows has been to use DXVK for many titles.
Anyone using Intel graphics cards? Aside from specs, drivers and support can make or break the value prop of a gfx card. Would be curious what actually using these is like.
I use an A770 LE for PC gaming. Windows drivers have improved substantially in the last two years. There's a driver update every month or so, although the Intel Arc control GUI hasn't improved in a while. Popular newer titles have generally run well; I've played some Metaphor, Final Fantasy 16, Elden Ring, Spider-Man Remastered, Horizon Zero Dawn, Overwatch, Jedi Survivor, Forza Horizon 4, Monster Hunter Sunbreak, etc. without major issues. Older games sometimes struggle; a 6 year old Need for Speed doesn't display terrain, some 10+ year old indie games crash. Usually fixed by dropping dxvk.dll in the game directory. This fix cannot be used with older Windows Store games. One problematic newer title was Starfield, which at launch had massive frame pacing and hard crashing issues exclusive to Intel Arc.
I've had a small sound latency issue forever; most visible with YouTube videos, the first half-second of every video is silent.
I picked this card up for about $120 less than the RTX 4060. Wasn't a terrible decision.
Appreciate the detail! I've always found the xx60 class of Nvidia cards to be the entry point for me when I'm putting together a gaming rig so might investigate Intel next time I'm putting a build together.
I'm not a gamer and there is not enough memory in this thing for me to care to use it for AI applications so that leaves just one thing I care about: hardware accelerated video encoding and decoding. Let's see some performance metrics both in speed and visual quality
From what I have gathered, the Alchemist AV1 encoder is about the same or sliiiightly worse than current NVENC. My A750 does about 1400 fps for DVD encoding on the quality preset. I haven't had the opportunity to try 1080p or 4k though.
Bit disappointed there's no 16 gig (or more) version. But absolutely thrilled the rumours of Intel discrete graphics' demise were wildly exaggerated (looking at you, Moore's Law is Dead...).
Very happy with my A770. Godsend for people like me who want plenty VRAM to play with neural nets, but don't have the money for workstation GPUs or massively overpriced Nvidia flagships. Works painlessly with linux, gaming performance is fine, price was the first time I haven't felt fleeced buying a GPU in many years. Not having CUDA does lead to some friction, but I think nVidia's CUDA moat is a temporary situation.
Prolly sit this one out unless they release another SKU with 16G or more ram. But if Intel survives long enough to release Celestial, I'll happily buy one.
> upscaling is absolutely vital for a reasonable experience on some games
This strikes me as a bit of a sad state of affairs. We've moved beyond a Parkinson's law of computational resources (usage by games expands to fill the available resources) to usage expanding to fill the resources available on the highest-end machines, unavailable for less than a few thousand dollars... and then using that to train a model that simulates, via upscaling, higher quality or performance on lower-end machines.
A counterargument would be that this makes high-end experiences available to more people, and while that may hold in the individual case, I don't buy that that's where the incentives it creates are driving the entire industry.
To put a finer point on it: at what percentage of budget is too much money being spent on producing assets?
Isn't it insane to think that rendering triangles for the visuals in games has gotten so demanding that we need an artificially intelligent system embedded in our graphics cards to paint pixels that look like high definition geometry?
What a time to be alive. Our most advanced technology is used to cheat on homework and play video games.
> Isn't it insane to think that rendering triangles for the visuals in games has gotten so demanding that we need an artificially intelligent system embedded in our graphics cards to paint pixels that look like high definition geometry?
That's not _quite_ how temporal upscaling works in practice. It's more of a blend between existing pixels, not generating entire pixels from scratch.
The technique has existed since before ML upscalers became common. It's just turned out that ML is really good at determining how much to blend by each frame, compared to hand written and tweaked per-game heuristics.
---
For some history, DLSS 1 _did_ try to generate pixels entirely from scratch each frame. Needless to say, the quality was crap, and that was after a very expensive and time-consuming process to train the model for each individual game (and forget about using it as you develop the game; imagine having to retrain the AI model as you implement the graphics).
DLSS 2 moved to having the model predict blend weights fed into an existing TAAU pipeline, which is much more generalizable and has way better quality.
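To make the blend-weights point concrete, here's a toy sketch of the kind of temporal accumulation being described (NumPy; motion-vector reprojection is omitted and the per-pixel weights are made up — in a DLSS 2-style upscaler those weights are what the network predicts):

```python
import numpy as np

def temporal_accumulate(history, current, alpha):
    """Blend the (already reprojected) history frame with the current frame.

    history, current: (H, W, 3) float arrays.
    alpha: (H, W) per-pixel blend weight in [0, 1]; this is the part a
    DLSS 2-style network predicts, where older TAA used hand-tuned heuristics.
    """
    a = alpha[..., None]                  # broadcast over the color channels
    return a * current + (1.0 - a) * history

# Toy usage: mostly trust history, except where it's deemed unreliable.
h, w = 4, 4
history = np.zeros((h, w, 3), dtype=np.float32)
current = np.ones((h, w, 3), dtype=np.float32)
alpha = np.full((h, w), 0.1, dtype=np.float32)   # hypothetical predicted weights
blended = temporal_accumulate(history, current, alpha)
print(blended[0, 0])                              # -> [0.1 0.1 0.1]
```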
It is. And it strikes me as evidence we've lost the plot and a measure has ceased to be a good measure upon being a target.
It used to be that more computational power was desirable because it would allow for developers to more fully realize creative visions that weren't previously possible.
Now, it seems that the goal is simply visual fidelity and asset complexity... and the rest of the experience is not only secondary, but compromised in pursuit of the former.
Thinking back on recent games that felt like something new and painstakingly crafted... they're almost all 2D (or look like it), lean on excellent art/music (and even haptics!) direction, have a well-crafted core gameplay loop or set of systems, and have relatively low actual system requirements (which in turn means they are exceptionally smooth without any AI tricks).
Off the top of my head, from the past few years: Hades, Balatro, Animal Well, Cruelty Squad[0], Spelunky, Pizza Tower, Papers Please, etc. Most of these could just as easily have been made a decade ago.
That's not to say we haven't had many games that are gorgeous and fun. But while the latter is necessary and sufficient, the former is neither.
It's just icing: it doesn't matter if the cake tastes like crap.
[0] a mission statement if there ever was one for how much fun something can be while not just being ugly but being actively antagonistic to the senses and any notion of good taste.
You have a long time to go, unless you want to play the latest and greatest AAA games, but that will not be your card's fault; it's the game studios not optimizing their games.
I’ll pick up a B580 to see how it works with Jellyfin transcoding, OBS streaming using AV1, and, with some luck, Davinci Resolve. Maybe a little Blender?
Other exciting tests will include things like fan control, since that’s still an issue with Arc GPUs.
Looking forward to it! Would appreciate it if you post a link here when you do. I'm excited to see how the B series compares to the A series for media encoding/decoding (assuming Battlemage is the same across all GPUs the same way Alchemist was).
Too late, and it has a bad rep. This effort from Intel to sell discrete GPUs is just inertia from old aspirations; it won't noticeably help save the company, as there is not much money in it. Most probably the whole Intel Arc effort will be mothballed, and probably many other efforts will be too.
I think it's the right call, since there isn't much competition in the GPU industry anyway. Sure, Intel is far behind. But they need to start somewhere in order to break ground.
Speaking strictly strategically, my intuition is that they will learn from this, course correct, and then start making progress.
The idea of another competitive GPU manufacturer is nice. But it is hard to bring into existence. Intel is not in a position to invest lots of money and sustained effort into products for which the market is captured and controlled by a much bigger and more competent company on top of its game. Not even AMD can get more market share, and they are much more competent in the GPU technology. Unless NVIDIA and AMD make serious mistakes, Intel GPUs will remain a 3rd rate product.
> "They need to start somewhere in order to break ground"
Intel has big problems and it's not clear they should occupy themselves with this. They should stabilize, and the most plausible way to do that is to cut the weak parts, and get back to what they were good at - performant secure x86_64 CPUs, maybe some new innovative CPUs with low consumption, maybe memory/solid state drives.
That's a very low-margin and cyclical market, since memory/SSDs are basically commodities. I don't think Intel would have any chance of surviving in such a market; they just have way too much bloat/R&D spending. Which is not a bad thing, as long as you can produce better products than the competition.
No reviews and when you click on the reseller links in the press announcement they're still selling A750s with no B-Series in sight. Strong paper launch.
These are pretty interesting, but I'm curious about the side-by-side screenshot with the slider: why does ray tracing need to be enabled to see the yellow stoplight? That seems like a weird oversight.
Drivers were very rough at launch. Some games didn't run at all, some basic functionality and configuration either crashed or failed to work, some things ran very poorly, etc. However, it was essentially all ironed out over many months of work.
They likely won't need to do the same discovery and fixing for B-series as they've already dealt with it.
Yes, I understand that. I'm saying it doesn't read as easily IMO as (modern) NVIDIA/AMD model numbers. Most numbers I deal with are base-10, not base-36.
On the other hand, considering GeForce is on its third loop through base 10, maybe it is not so bad...
Radeon, on the other hand, is an absolute mess... going back the same 20 years.
I like Intel's aggressive pricing against entry/mid level GPUs, which hopefully puts downward pressure on all GPUs. Overall, their biggest concern is software support. We've had reports of certain DX11/12 games failing to run properly on Proton, and the actual performance of the A series varied greatly between games even on Windows. I suspect we'll see the same issues when the B580 gets proper third party benchmarking.
Their dedication to Linux support, combined with their good pricing, makes this a potential buy for me in future versions. To be frank, I won't be replacing my 7900 XTX with this. Intel needs to provide more raw power in their cards, and third parties need to improve their software support, before this captures my business.
How does this connect to Gelsinger's retirement, announced yesterday? The comments on that news were all doom and gloom, so I had expected more negative news today. Not a product launch. But I'm just some guy on HN, what do I know?
Based on scaling by XMX/engine clock napkin math, the B580 should have 230 FP16 TFLOPS and 456 GB/s MBW theoretical. At similar efficiency to LNL Xe2, that should be about pp512 ~4700 t/s and tg128 ~77 t/s for a 7B class model. This would be about 75% of a 3090 for pp and 50% for tg (and of course, 50% of memory). For $250, that's not too bad.
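For anyone who wants to redo the napkin math, here is roughly how the bandwidth/compute ceilings fall out (a sketch: the ~4-bit quantization and the 2-FLOPs-per-parameter rule of thumb are my assumptions, and real-world efficiency lands well below these ceilings, as the pp/tg estimates above reflect):

```python
# Illustrative ceilings only, not performance predictions.
fp16_tflops = 230        # theoretical FP16 throughput from the estimate above
mem_bw_gbs = 456         # theoretical memory bandwidth from the estimate above

model_params = 7e9       # "7B class" model
bytes_per_param = 0.5    # assume ~4-bit quantization (a llama.cpp-style default)

# Token generation streams roughly all weights once per token, so it is
# memory-bandwidth bound: bandwidth / model size gives an upper bound.
tg_ceiling = (mem_bw_gbs * 1e9) / (model_params * bytes_per_param)

# Prompt processing is compute bound: roughly 2 FLOPs per parameter per token.
pp_ceiling = (fp16_tflops * 1e12) / (2 * model_params)

print(f"token generation ceiling ~{tg_ceiling:.0f} t/s")    # ~130
print(f"prompt processing ceiling ~{pp_ceiling:.0f} t/s")   # ~16400
```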
I do want to note a couple things from my poking around. The IPEX-LLM [1] was very responsive, and was able to address an issue I had w/ llama.cpp within days. They are doing weekly update releases, so that's great. The IPEX stands for Intel Extension for PyTorch [2] and it is a mostly drop-in for PyTorch: "Intel® Extension for PyTorch* extends PyTorch* with up-to-date features optimizations for an extra performance boost on Intel hardware. Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* xpu device."
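For anyone wondering what the "mostly drop-in" part looks like, this is the general usage pattern as I understand it (untested sketch based on the documented flow; the toy model and shapes are made up):

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device with PyTorch

# A made-up toy model, just to show the flow.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
).to("xpu").eval()

# Apply Intel-specific kernel/layout optimizations (fp32 by default).
model = ipex.optimize(model)

x = torch.randn(32, 1024, device="xpu")
with torch.no_grad():
    y = model(x)
print(y.shape, y.device)
```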
All of this depends on the Intel oneAPI Base Kit [3], which has easy Linux (and presumably Windows) support. I am normally an AUR guy on my Arch Linux workstation, but those packages are basically broken, and I had much more success installing the oneAPI Base Kit directly (without issues) on Arch Linux. Sadly, this is also where there are issues: some of the code is either dependent on older versions of the oneAPI Base Kit that are no longer available (vLLM requires oneAPI Base Toolkit 2024.1, which is not available for download from the Intel site anymore) or in dependency hell (GPU whisper simply will not work, ipex-llm[xpu] has internal conflicts from the get-go), so it's not all sunshine. On average, ROCm w/ RDNA3 is much more mature (while not always the fastest, most basic things do just work now).
Well, hopefully they'll release an A770 replacement as well. I guess they rushed these cards to get ahead of Nvidia and AMD, who should release their next gen early next year.
From the gaming side of things, I'm disappointed that Intel and AMD are focusing on the midrange market going forwards. I'm on Linux with a 6900XT and wasn't going to upgrade until there's a compatible option with acceptable raytracing performance (and when HDR is finally sorted out). The 4090 and other high tier cards are absurdly expensive, would be good to have competition in that segment.
I wanted Intel to do well so I purchased an ARC card. The problem is not the hardware. For some games, it worked fine, but in others, it kept crashing left and right. After updates to drivers, crashing was reduced, but it still happened. Driver software is not easy to develop thoroughly. Even AMD had problems when compared to Nvidia when AMD really started to enter the GPU game after buying ATI. AMD has long since solved their driver woes, but years after ARC's launch, Intel still has not.
It's also a hardware problem. For example, Alchemist's EUs are SIMD8 while games expect SIMD16, so work has to be dispatched to two EUs in lockstep; and the lack of support for the Execute Indirect instruction commonly used in UE5 games, which is currently emulated in software, makes game compatibility very hit-or-miss.
Battlemage is supposed to fix all these architectural issues. The EU in Xe2 is now SIMD16 (which is why the number of EUs per Xe2 core is halved from that of Xe1), and they've added all the previously software-emulated instructions, including Execute Indirect, so in theory Battlemage should be in a much better position on the game compatibility side of things.
On the Linux side of things, the lack of sparse residency support in i915 also hurts game compatibility[1] (though this is now available under Mesa 24). This is something the new xe driver is supposed to fix, but there's still a long way to go until it's actually usable.
Intel can't compete head to head with Nvidia on performance.
But surely it's easy enough to compete on video ram - why not load their GPUs to the max with video ram?
And also video encoder cores - Intel has a great video encoder core and these vary little across high end to low end GPUs - so they could make it a standout feature to have, for example, 8 video encoder cores instead of 2.
It's no wonder Nvidia is the king because AMD and Intel just don't seem willing to fight.
Twitch did in the past (mostly), but it is trying to pivot at the moment: take tight control over the encoding settings on the client side and just pass the already encoded stream through the CDN. https://help.twitch.tv/s/article/multiple-encodes
Also having different encoding settings for different purposes is desired (e.g. high quality local recording for an edit later while live streaming to different services at the same time [Twitch, Youtube, ...]).
That said, I'm not aware of Intel limiting the number of encoding streams, so I don't know where the number 2 originates.
GP is asking what measure of FPS. The most likely value when unspecified is usually "mean FPS" but, being a marketing graph, it doesn't explicitly say.
As long as all bars show the same measure, is there much difference between “mean” and “minimum”, provided the settings are sensible? Keep in mind this is one benchmark (two, really), you won’t precisely recreate the testing conditions, and it doesn’t matter because your use case is not running the tests.
Why don't they just release a basic GPU with 128GB RAM and eat NVidia's local generative AI lunch? The network effect of all devs porting their LLMs etc. to that card would instantly make them a major CUDA threat. But the beancounters running the company would never get such an idea...
Disclosure: HPC admin who works with NVIDIA cards here.
Because, no. It's not as simple as that.
NVIDIA has a complete ecosystem now. They have cards. They have cards of cards (platforms), which they produce, validate and sell. They have NVLink crossbars and switches which connect these cards on their cards of cards with very high speeds and low latency.
For inter-server communication they have libraries which coordinate cards, workloads and computations.
They bought Mellanox, but that can be used by anyone, so there's no lock-in for now.
As a tangent, NVIDIA has a whole set of standards for pumping tremendous amounts of data in and out of these meshes of cards, be it GPUDirect Storage or specialized daemons which handle data transfers on and off the cards.
If you think that you can connect n cards on PCIe bus and just send workloads to them and solve problems magically, you'll hurt yourself a lot, both performance and psychology wise.
You have to build a stack which can perform these things with maximum possible performance to be able to compete with NVIDIA. It's not just about emulating CUDA now, especially on the high end of the AI spectrum (GenAI, multi-card, multi-system, etc.).
For other, lower-end, multi-tenant scenarios, they have card virtualization, MIG, etc. for card sharing. You have to compete on that, too, for cloud and smaller applications.
I have been hacking on local llama 3 inference software (for the CPU, but I have been thinking about how I would port it to a GPU) and would like to do a rebuttal:
Inference workloads are easy to parallelize to N cards with minimal connectivity between them. The Nvlink crossbars and switches just are not needed.
In particular, inference can be divided into two distinct phases, which are input processing (prompt processing) and output generation (token generation). They are remarkably different in their requirements. Input processing is compute bound via GEMM operations while output generation is memory bandwidth bound via GEMV operations. Technically, you can do the input processing via GEMV too by processing 1 token at a time, but that is slow, so you do not want to do that. Anyway, these phases can be further subdivided into the model’s layers. You can have 1 GPU per layer with the activations passing from GPU to GPU in a pipeline. The GPUs just need the layer’s weights and the key-value cache for all of the tokens in that layer in memory to be able to work effectively. For llama 3.1 405B, there are 126 layers, so that is up to 126 GPUs.
That is of course slightly slower than if you just had 1 GPU with an incredible amount of VRAM, but you can always have more than one query in flight to get better than 1 GPU’s worth of performance from this pipeline approach. There are other ways of doing parallelization too, such as having output processing use GEMM to do multiple queries in parallel. This would be what others call batching, although I am only interested in doing 1 query at a time right now, so I have not touched it.
In essence, you can connect n cards on PCIe and have them solve inferencing problems magically, with the right software. Training is a different matter and I cannot comment on it as I have not studied it yet.
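A toy sketch of that layer-wise pipeline split, with plain Python lists standing in for devices (real transformer blocks, KV caches, and the actual PCIe/network hand-off are omitted; the point is just that only one small activation vector crosses between devices per token per split):

```python
import numpy as np

# Kept deliberately small so the sketch runs instantly; llama 3 8B would be
# d_model=4096 with 32 real transformer blocks and a KV cache per layer.
d_model, n_layers, n_devices = 512, 32, 4

layers = [np.random.rand(d_model, d_model).astype(np.float32) * 0.01
          for _ in range(n_layers)]

# Give each "device" a contiguous slice of layers (8 per device here).
per_dev = n_layers // n_devices
devices = [layers[i * per_dev:(i + 1) * per_dev] for i in range(n_devices)]

def run_slice(hidden, device_layers):
    """Run one device's share of the model on a single activation vector."""
    for W in device_layers:
        hidden = np.tanh(hidden @ W)     # stand-in for a transformer block
    return hidden

hidden = np.random.rand(d_model).astype(np.float32)
for dev in devices:
    # In a real multi-GPU setup this hand-off is the only inter-device
    # traffic: one (d_model,)-sized vector per token per split.
    hidden = run_slice(hidden, dev)
print(hidden.shape)
```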
I presume the counterargument is that inference hosting is commoditized (sort of like how stateless CPU-based containerized workload hosts are commoditized); there’s no margin in that business, because it is parallelizable, and arbitrarily schedulable, and able to be spread across heterogenous hardware pretty easily (just route individual requests to sub-cluster A or B), preventing any kind of lock-in and thus any kind of rent-extraction by the vendor.
Which therefore means that cards that can only do inference, are fungible. You don’t want to spend CapEx on getting into a new LOB just to sell something fungible.
All the gigantic GPU clusters that you can sell a million at a time to a bigcorp under a high-margin service contract, meanwhile, are training clusters. Nvidia’s market cap right now is fundamentally built on the model-training “space race” going on between the world’s ~15 big AI companies. That’s the non-fungible market.
For Intel to see any benefit (in stock price terms) from an ML-accelerator-card LOB, it’d have to be a card that competes in that space. And that’s a much taller order.
Coincidentally, it has 128GB of RAM. However, it is not a GPU, is designed to do training too and uses expensive HBM.
Modern GPUs can do more than inference/training and the original poster asked about a GPU with 128GB of RAM, not a card that can only do inferencing as you described. Interestingly, Qualcomm made its own card targeted at only inferencing with 128GB of RAM without using HBM:
They do not sell it through PC parts channels so I do not know the price, but it is exactly what you described and it has been built. Presumably, a GPU with the same memory configuration would be of interest to the original poster.
Facebook did a technical paper where they described their training cluster and the sheer amount of complexity is staggering. That said, the original poster was interested in inferencing, not training.
Training is close to traditional HPC in many ways. Inference is far simpler since it's a simple forward-going pipeline of a relatively small working set.
What kind of bandwidth/latency between GPUs would one need in that setup to not be a bottleneck? What you're describing sounds quite forgiving. Is it forgiving enough that we could potentially connect those GPUs over a LAN, or even a remote decentralized cloud of host computers?
From my understanding, that's certainly possible to do without the latency hurting much, with large batching between inference layers.
The amount of data you need to transfer per split depends on:
* the model dimension
* how many bits per variable are used by your quantization
* how many tokens are being processed per step (input processing can do all N input tokens simultaneously and output processing can only do 1 at a time when doing a single query)
* how many times you split the model layers across multiple GPUs
The model dimensions are:
* 4096 for llama 3/3.1 8B.
* 8192 for llama 3/3.1 70B.
* 16384 for llama 3.1 405B.
The model layers are:
* 32 for llama 3/3.1 8B
* 80 for llama 3/3.1 70B
* 126 for llama 3.1 405B
The amount of data that needs to be transferred for each split is surprisingly small. Each time you move the calculation of a subsequent layer to a different GPU, you need to transfer an array that is of size model_dimension * num_tokens * bits_per_variable. Then this reduces to a classic network transfer time problem, where you consider both time until the first byte arrives and the transfer time until the last byte arrives. Reality will likely be longer than that idealized scenario, especially since you need to send a signal saying to begin computing.
Input processing can tackle so many tokens simultaneously that it probably is not worth thinking too much about this penalty there. Output processing is where the penalty is more significant, since you will incur these costs for every token. Let’s say we are doing fp16 or bf16 on llama 3 8B. Then we need to transfer 8KB every time we move the calculation for another layer to another GPU. If you use RDMA and do this over 10GbE, the transfer time would be 6.4 microseconds. If we assume the time to first byte and the time to signal the next GPU to begin processing is 3.6 microseconds combined (chosen to round things up), then we get a penalty of 10 microseconds per split, per token. If you are doing 60 tokens per second and split things across 4 GPUs over the network, you have a penalty of 30 microseconds per token. It will run about 0.2% slower and you are not going to notice this at all. Assuming 10GbE with RDMA is somewhat idealized, although I needed to pick something to give some numbers.
In any case, the slowdown factor is baseline time per token / (baseline time per token + transfer-and-trigger overhead per token). So in a less ideal situation where the penalty is 5 milliseconds per token on a baseline of 1 second per token, the calculation will run at ~0.995 times the speed of a hypothetical single GPU that has all of the VRAM needed but otherwise the same specifications. That penalty is also not very noticeable.
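Here is the same napkin math wrapped up so you can plug in your own link speed and split count (the default numbers are the ones from above; the 3.6 microsecond fixed latency is still just a rounding assumption):

```python
def split_overhead(d_model=4096, bytes_per_value=2, n_gpus=4,
                   link_gbit_s=10.0, fixed_latency_s=3.6e-6,
                   baseline_tok_s=60.0):
    """Estimate the token-generation slowdown from splitting layers across GPUs.

    Each of the (n_gpus - 1) hand-offs moves one d_model-sized activation
    vector per generated token, as described above.
    """
    transfer_bytes = d_model * bytes_per_value              # 8 KB for fp16, llama 3 8B
    transfer_s = transfer_bytes * 8 / (link_gbit_s * 1e9)   # ~6.6 us on 10GbE
    per_split_s = transfer_s + fixed_latency_s              # ~10 us
    per_token_s = per_split_s * (n_gpus - 1)                # ~30 us
    baseline_s = 1.0 / baseline_tok_s
    speed_factor = baseline_s / (baseline_s + per_token_s)  # fraction of original speed
    return per_token_s, speed_factor

per_token_s, speed_factor = split_overhead()
print(f"~{per_token_s * 1e6:.0f} us/token overhead, "
      f"running at {speed_factor:.2%} of single-GPU speed")  # ~30 us, ~99.8%
```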
In conclusion, you are right. I actually recall seeing a random YouTube video talking about a software project that does clustered inferencing, so people are already doing this. Unfortunately, I do not remember the project name, channel name or video name.
On servers you're right: but for local LLM inference, I think you're wrong. For local LLMs most people are bottlenecked by not having enough VRAM: pretty much no one is running a 70b model on Nvidia GPUs locally, just due to the expense. You don't need maximum performance: you need it to run at all, which most people can't do for the good models — at least, not without heavy quantization that pretty badly lobotomizes them.
Apple is the king right now of local LLM inference, just because of their unified memory architecture meaning that people can get large amounts of "VRAM" (since all RAM is VRAM). They're not as fast as Nvidia — not even close to an H100, for example. But they don't need to be. No consumer can afford an H100, but they can afford a Mac.
I think he's mostly referring to inference and not training, which I entirely agree with - a 4x version of this card for workstations would do really well - even some basic interconnect between the cards a la nvlink would really drive this home.
The training can come after, with some inference and runtime optimizations on the software stack.
Most of the above infra is predicated on limited RAM, which is why you need so much communication between cards. Bump the RAM up and you could do single-card inference, and all those connections become overhead that could have gone to more RAM. For training there is an argument still, but even there, the more RAM you have, the less all that connectivity gains you. RAM has been used to sell cards and servers for a long time now; it is time to open the floodgates.
Correct for inference - the main use of the interconnect is RDMA requests between GPUs to fit models that wouldn't otherwise fit.
Not really correct for training - training has a lot of all-to-all problems, so hierarchical reduction is useful but doesn't really solve the incast problem - Nvlink _bandwidth_ is less of an issue than perhaps the SHARP functions in the NVLink switch ASICs.
You use at least half of this stack for desktop setups. You need copying daemons, the ecosystem support (docker-nvidia, etc.), some of the libraries, etc. even when you're on a single system.
If you're doing inference on a server; MIG comes into play. If you're doing inference on a larger cloud, GPU-direct storage comes into play.
It's possible you're underestimating the open source community.
If there's a competing platform that hobbyists can tinker with, the ecosystem can improve quite rapidly, especially when the competing platform is completely closed and hobbyists basically are locked out and have no alternative.
> It's possible you're underestimating the open source community.
On the contrary. You really don't know how I love and prefer open source and love a more leveling playing field.
> If there's a competing platform that hobbyists can tinker with...
AMD's cards are better from a hardware and software architecture standpoint, but the performance is not there yet. Plus, ROCm libraries are not that mature, but they're getting there. Developing high performance, high quality code is deceptively expensive, because it's very heavy in theory, and you fly very close to the metal. I did that in my Ph.D., so I know what it entails. So it requires more than a couple (hundred) hobbyists to pull off (see the development of the Eigen linear algebra library, or any high-end math library).
Some big guns are pouring money into AMD to implement good ROCm libraries, and it started paying off (Debian has a ton of ROCm packages now, too). However, you need to be able to pull it off in the datacenter to be able to pull it off on the desktop.
AMD also needs to be able to enable ROCm on desktop properly, so people can start hacking it at home.
> especially when the competing platform is completely closed...
NVIDIA gives a lot of support to universities, researchers and institutions who play with their cards. Big cards may not be free, but know-how, support and first steps are always within reach. Plus, their researchers dogfood their own cards, and write papers with them.
So, as long as papers get published, researchers do their research, and something gets invented, many people don't care about how open source the ecosystem is. This upsets me a ton. Closed source AI companies, and researchers who leave crucial details out of their papers so that what they did can't be reproduced, don't care about open source, because they think like NVIDIA: "My research, my secrets, my fame, my money".
It's not about sharing. It's about winning, and it's ugly in some aspects.
That said, for hobbyist inference on large pretrained models, I think there is an interesting set of possibilities here: maybe a number of operations aren't optimized, and it takes 10x as long to load the model into memory... but all that might not matter if AMD were to be the first to market for 128GB+ VRAM cards that are the only things that can run next-generation open-weight models in a desktop environment, particularly those generating video and images. The hobbyists don't need to optimize all the linear algebra operations that researchers need to be able to experiment with when training; they just need to implement the ones used by the open-weight models.
But of course this is all just wishful thinking, because as others have pointed out, any developments in this direction would require a level of foresight that AMD simply hasn't historically shown.
IDK, I found a post that's 2 years old with links to running llama and SD on an Arc [0] (although it might be Linux only). I feel like a cheap huge-RAM card would create a 'critical mass' as far as being able to start optimizing, and then over the longer term Intel could promise and deliver on 'scale up' improvements.
It would be a huge shift for them: to go from chasing some (sometimes not quite reached) metric to, perhaps rightly, playing the 'reformed underdog'. Commoditize big-memory, ML-capable GPUs, even if they aren't quite as competitive as the top players at first.
Will the other players respond? Yes. But it would ruin their margins. I know that sounds cutthroat[1], but hey, I'm trying to hypothetically sell this to whomever is taking the reins after Pat G.
> NVIDIA gives a lot of support to universities, researchers and institutions who play with their cards. Big cards may not be free, but know-how, support and first steps are always within reach. Plus, their researchers dogfood their own cards, and write papers with them.
Ideally they need to do that too. Ideally they have some 'high powered' prototypes (e.g. let's say they decide a 2-GPU-per-card design with an interlink is feasible for some reason) to share as well. This may not be entirely ethical[1] in this example of how a corp could play it out; again, it's a thought experiment, since Intel has NOT announced or hinted at a larger-memory card anyway.
> AMD also needs to be able to enable ROCm on desktop properly, so people can start hacking it at home
AMD's driver story has always been a hot mess. My desktop won't behave with both my onboard video and 4060 enabled, and every AMD card I've had winds up with some weird firmware quirk one way or another... I guess I'm saying their general level of driver quality doesn't give much hope they'll fix dev tools any time soon...
ROCm doesn't really matter when the hardware is almost the same as Nvidia's cards. AMD is not selling a "cheaper" card with a lot of RAM, which is what the original poster was asking for (and a reason why people who like to tinker with large models are using Macs).
You're writing as if AMD cares about open source. If they would only actually open source their driver the community would have made their cards better than nvidia ones long ago.
I'm one of those academics. You've got it all wrong. So many people care about open source. So many people carefully release their code and make everything reproducible.
We desperately just want AMD to open up. They just refuse. There's nothing secret going on and there's no conspiracy. There's just a company that for some inexplicable reason doesn't want to make boatloads of money for free.
AMD is the worst possible situation. They're hostile to us and they refuse to invest to make their stuff work.
> If they would only actually open source their driver the community would have made their cards better than nvidia ones long ago.
Software wise, maybe. But you can't change AMD's hardware with a magic wand, and that's where a lot of CUDA's optimizations come from. AMD's GPU architecture is optimized for raster compute, and it's been that way for decades.
I can assure you that AMD does not have a magic button to press that would make their systems competitive for AI. If that was possible it would have been done years ago, with or without their consent. The problem is deeper and extends to design decisions and disagreement over the complexity of GPU designs. If you compare AMD's cards to Nvidia on "fair ground" (eg. no CUDA, only OpenCL) the GPGPU performance still leans in Nvidia's favor.
That would require competently produced documentation. Intel can't do that for any of their side projects because their MBAs don't get a bonus if the tech writers are treated as a valuable asset.
No. I've been reading up. I'm planning to run Flux 12b on my AMD 5700G with 64GB RAM. The CPU will take 5-10 minutes per image, which will be fine for me tinkering while writing code. Maybe I'll be able to get the GPU going on it too.
Point of the OP is this is entirely possible with even an iGPU if only we have the RAM. nVidia should be irrelevant for local inference.
Copying daemons (gdrcopy) is about pumping data in and out of a single card. docker-nvidia and rest of the stack is enablement for using cards.
GPU-Direct is about pumping data from storage devices to cards, esp. from high speed storage systems across networks.
MIG actually shares a single card to multiple instances, so many processes or VMs can use a single card for smaller tasks.
Nothing I have written in my previous comment is related to inter-card or inter-server communication; all of it is related to disk-GPU, CPU-GPU or RAM-GPU communication.
Edit: I mean, it's not OK to talk about downvoting, and downvote as you like but, I install and enable these cards for researchers. I know what I'm installing and what it does. C'mon now. :D
I've run inference on Intel Arc and it works just fine, so I am not sure what you're talking about. I certainly didn't need Docker! I've never tried to do anything on AMD yet.
I had the 16GB Arc, and it was able to run inference at the speed I expected, but with twice as many per batch as my 8GB card, which I think is about what you'd expect.
Once the model is on the card, there's no "disk" anymore, so having more VRAM to load the model and the tokenizer and whatever else onto means the disk stops mattering, and realistically, when I am running loads on my 24GB 3090, the CPU is maybe 4% over idle usage. My bottleneck, as it stands, to running large models is VRAM, not anything else.
If I needed to train (from scratch or whatever) I'd just rent time somewhere, even with a 128GB card locally, because obviously more tensors is better.
And you're getting downvoted because there's literally LM Studio and llama.cpp and sd-webui that run just fine for inference on our non-DC, non-NVLink, 1/15th-the-cost GPUs.
As a preface, precompute_input_logits() is really just a generalized version of the forward() function that can operate on multiple input tokens at a time to do faster input processing, although it can be used in place of the forward() function for output generation just by passing only a single token at a time.
Also, my apologies for the code being a bit messy. matrix_multiply() and batched_matrix_multiply() are wrappers for GEMM, which I ended up having to use directly anyway when I needed to do strided access. Then matmul() is a wrapper for GEMV, which is really just a special case of GEMM. This is a work in progress personal R&D project that is based on prior work others did (as it spared me from having to do the legwork to implement the less interesting parts of inferencing), so it is not meant to be pretty.
Anyway, my purpose in providing that link is to show what is needed to do inferencing (on llama 3). You have a bunch of matrix weights, plus a lookup table for vectors that represent tokens, in memory. Then your operations are:
* memcpy()
* memset()
* GEMM (GEMV is a special case of GEMM)
* sinf()
* cosf()
* expf()
* sqrtf()
* rmsnorm (see the C function for the definition)
* softmax (see the C function for the definition)
* Addition, subtraction, multiplication and division.
I specify rmsnorm and softmax for completeness, but they can be implemented in terms of the other operations.
If you can do those, you can do inferencing. You don’t really need very specialized things. Over 95% of time will be spent in GEMM too.
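For completeness, here's roughly what those last two primitives boil down to in NumPy (a sketch; the epsilon is a conventional choice and not necessarily what the code linked above uses):

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-5):
    """Scale x to unit root-mean-square, then apply the learned gains."""
    rms = np.sqrt(np.mean(x * x) + eps)
    return (x / rms) * weight

def softmax(x):
    """Numerically stable softmax: shift by the max before exponentiating."""
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

x = np.random.rand(4096).astype(np.float32)
w = np.ones(4096, dtype=np.float32)
print(rmsnorm(x, w)[:3], softmax(x[:5]))
```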
My next steps likely will be to figure out how to implement fast GEMM kernels on my CPU. While my own SGEMV code outperforms the Intel MKL SGEMV code on my CPU (Ryzen 7 5800X where 1 core can use all memory bandwidth), my initial attempts at implementing SGEMM have not fared quite so well, but I will likely figure it out eventually. After I do, I can try adapting this to FP16 and then memory usage will finally be low enough that I can port it to a GPU with 24GB of VRAM. That would enable me to do what I say is possible rather than just saying it as I do here.
By the way, the llama.cpp project has already figured all of this out and has things running on both GPUs and CPUs using just about every major quantization. I am rolling my own to teach myself how things work. By happy coincidence, I am somehow outperforming llama.cpp in prompt processing on my CPU but sadly, the secrets of how I am doing it are in Intel’s proprietary cblas_sgemm_batch() function. However, since I know it is possible for the hardware to perform like that, I can keep trying ideas for my own implementation until I get something that performs at the same level or better.
I am favoriting this comment for reference later when I start poking around in the base level stuff. I find it pretty funny how simple this stuff can get. Have you messed with ternary computing inference yet? I imagine that shrinks the list even further - or at least reduces the compute requirements in favor of brute force addition. https://arxiv.org/html/2410.00907
No. I still have a long list of things to try doing with llama 3 (and possibly later llama 3.1) on more normal formats like fp32 and in the future, fp16. When I get things running on a GPU, I plan to try using bf16 and maybe fp8 if I get hardware that supports it. Low bit quantizations hurt model quality, so I am not very interested in them. Maybe that will change if good quality models trained to use them become available.
My plan is to make more attempts at rolling my own. Reverse engineering things is not something that I do, since that would prevent me from publishing the result as OSS.
Rather than tackling the entire market at once, they could start with one section and build from there. NVIDIA didn't get to where it was in a year, it took many strategic acquisitions. (All the networking and other HPC-specialized stuff I was buying a decade ago has seemingly been bought by NVIDIA).
Start by being a "second vendor" for huge customers of NVIDIA that want to foster competition, as well as a few others willing to take risks, and build from there.
Intel has already bought and killed everything they need to compete here. They seem incapable of sticking to any market that isn’t x86. Likely because when they were making those acquisitions they were drunk on margin and didn’t want to focus anywhere else.
You are of course right WRT datacenter use of GPUs. The OP spoke about local generation though. It is of course a smaller market, but a market nevertheless, and the more amateurs and students are using your product, the more of them would consider applying it in more professional settings.
Sun (Sparc) and HP (PA-RISC) used to own most of the server market in 1990, but lost most of it to x86 by 2000. Few people had a Sun box with Solaris, but tons of people had access to a PC with Linux, which was inferior in many ways, but well-known and much less locked-up.
But all the tricks of developing a solid software stack to support the HW are already out there, no? The basic principles are known; to my understanding, the main challenges are this development not being done in tandem with the HW people, plus the requirement to support older legacy devices, which is what makes it harder for AMD, for example, to compete. The only challenges, which Intel is prepped to face, are logistics and fabs.
On a separate note, projects like JAX aim to circumvent the abstraction layer CUDA adds on top of NVIDIA, so having decent hardware competition is definitely an option. Just some time ago, vLLM fully supported AMD GPUs! We need more competition.
I agree. Just for my PC, I want something that'd enable small devs to create interesting foundation model apps that deploy to users using these local AI cards to run the new apps.
There might be a chicken-and-egg problem if the apps end up requiring a 128GB AI accelerator card. You only get the card if there are apps to run, and you only develop the apps if the cards are widespread. With so much RAM, the cards will not be let's-throw-them-into-a-cheap-build cheap.
I think there have to be a couple of killer apps that run "OK" with CPU or GPU, but would run tremendously better with such a card.
I have a question for you, since I'm somewhat entering the HPC world. In the EU, the EuroHPC-JU is building what they call AI factories; afaict these are just batch-processing (Slurm, I think) clusters with GPUs in the nodes. So I wonder where you'd place those cards of cards. Are you saying there are other, perhaps better ways to use massive amounts of these cards? Or is that still in the "super powerful workstation" domain? Thanks in advance.
View it as Raspberry Pi for AI workloads. Initial stage is for enthusiasts that would develop the infra, figure out what is possible and spread the word. Then the next phase will be SME industry adoption, making it commercially interesting, while bypassing Nvidia completely. At some point it would live its own life and big players jump in. Classical disrupt strategy via low cost unique offerings.
Pretty sure they are talking about inference in the post you are responding to. Training the model obviously needs far more compute, but running them locally is what most devs are interested in.
When this walled garden is the only way to use GPUs with high efficiency and everybody is using this stack, and NVIDIA controls the supply of these "platform boards" to OEMs, they don't just make money, they literally print it.
However, AMD is coming for them because a couple of high profile supercomputer centers (LUMI, Livermore, etc.) are using Instinct cards and pouring money to AMD to improve their cards and stack.
I have not used their (Instinct) cards yet, but their Linux driver architecture is way better than NVIDIA's.
HPC is used in research, which is often not expected to make money. The hope is that the research will result in something that makes money. One example would be drug discovery. Another would be weather prediction, which is not so much a way to make money, but to minimize losses.
It's like the most profitable set of products in tech. You have companies like Meta, MSFT, Amazon, Google etc spending $5B every few years buying this hardware.
So what you're saying is Intel, or any other would-be NVIDIA competitor, needs to put out fast interconnects, not just compute cards. This is true.
I'm not sure your argument stands when it comes to OP's idea of a single card with 128GB VRAM. This would be enough to run ~180B models with reasonable quantization --we're not near maxing out the capability of 180B yet (see the latest 32B models performing near public SOTA).
This indeed would push rapid and wide adoption and be quite disruptive. But sure, it wouldn't instantly enable competitive training of 405B models.
Just how "basic" do you think a GPU can be while having the capability to interface with that much DRAM? Getting there with GDDR6 would require a really wide memory bus even if you could get it to operate with multiple ranks. Getting to 128GB with LPDDR5x would be possible with the 256-bit bus width they used on the top parts of the last generation, but would result in having half the bandwidth of an already mediocre card. "Just add more RAM" doesn't work the way you wish it could.
M3/M4 Max MacBooks with 128GB RAM are already way better than an A6000 for very large local LLMs. So even if the GPU is as slow as the one in M3/M4 Max (<3070), and using some basic RAM like LPDDR5x it would still be way faster than anything from NVidia.
The M4 Max needs an enormous 512bit memory bus to extract enough bandwidth out of those LPDDR5x chips, while the GPUs that Intel just launched are 192/160bit and even flagships rarely exceed 384bit. They can't just slap more memory on the board, they would need to dedicate significantly more silicon area to memory IO and drive up the cost of the part, assuming their architecture would even scale that wide without hitting weird bottlenecks.
The memory controller would be bigger, and the cost would be higher, but not radically higher. It would be an attractive product for local inference even at triple the current price and the development expense would be 100% justified if it helped Intel get any kind of foothold in the ML market.
Why not? It doesn't have to be balanced. RAM is cheap. You would get an affordable card that can hold a large model and still do inference e.g. 4x faster than a CPU. The 128GB card doesn't have to do inference on a 128GB model as fast as a 16GB card does on a 16GB model, it can be slower than that and still faster than any cost-competitive alternative at that size.
The extra RAM also lets you do things like load a sparse mixture of experts model entirely into the GPU, which will perform well even on lower end GPUs with less bandwidth because you don't have to stream the whole model for each token, but you do need enough RAM for the whole model because you don't know ahead of time which parts you'll need.
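Some illustrative arithmetic for the sparse-MoE point, using roughly Mixtral-8x7B-shaped numbers (~47B total parameters, ~13B active per token; the quantization and bandwidth figures are just assumptions to show the shape of the trade-off):

```python
# Hold everything, stream only the active experts: why big VRAM with modest
# bandwidth still works for sparse mixture-of-experts inference.
total_params = 47e9       # ~Mixtral 8x7B total parameters
active_params = 13e9      # ~parameters touched per token (2 of 8 experts)
bytes_per_param = 0.5     # ~4-bit quantization

vram_needed_gb = total_params * bytes_per_param / 1e9   # ~24 GB just to hold it
bytes_per_token = active_params * bytes_per_param       # ~6.5 GB read per token

for bw_gbs in (250, 500, 1000):                         # hypothetical card bandwidths
    ceiling = bw_gbs * 1e9 / bytes_per_token
    print(f"{bw_gbs} GB/s -> up to ~{ceiling:.0f} tokens/s, "
          f"with ~{vram_needed_gb:.0f} GB of weights resident")
```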
To get 128GB of RAM on a GPU you'd need at least a 1024-bit bus. GDDR6X is 16Gbit with 32 pins, so you'd need 64 GDDR6X chips; good luck even trying to fit that around the GPU die, since traces need to be the same length and you want to keep them as short as possible. There's also a good chance you can't run a clamshell setup, so you'd have to double the bus width to 2048 bits, because 32 GDDR6X chips would kick off way too much heat to be cooled on the back of a GPU. Such a ridiculous setup would obviously be extremely expensive and would use way too much power.
A more sensible alternative would be going with HBM, except good luck getting any capacity for that, since it's all being used for the extremely high-margin data center GPUs. HBM is also extremely expensive, both in terms of the cost of buying the chips and due to its advanced packaging requirements.
You do not need a 1024-bit bus to put 128GB of some DDR variant on a GPU. You could do a 512-bit bus with dual rank memory. The 3090 had a 384-bit bus with dual rank memory and going to 512-bit from that is not much of a leap.
This assumes you use 32Gbit chips, which will likely be available in the near future. Interestingly, the GDDR7 specification allows for 64Gbit chips:
> the GDDR7 standard officially adds support for 64Gbit DRAM devices, twice the 32Gbit max capacity of GDDR6/GDDR6X
Yeah, the idea that you're limited by bus width is kind of silly. If you're using ordinary DDR5 then consider that desktops can handle 192GB of memory with a 128-bit memory bus, implying that you get 576GB with a 384-bit bus and 768GB at 512-bit. That's before you even consider using registered memory, which is "more expensive" but not that much more expensive.
And if you want to have some real fun, cause "registered GDDR" to be a thing.
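The capacity arithmetic behind the last few comments, parameterized so you can see where 24GB, 48GB and 128GB come from (a sketch; chip density and chips-per-channel are exactly the variables being argued over above):

```python
def gddr_capacity_gb(bus_bits, gbit_per_chip=16, chips_per_32bit_channel=1):
    """Max VRAM for a GDDR-style bus: one chip per 32-bit channel, optionally
    doubled via clamshell / dual-rank placement."""
    n_chips = (bus_bits // 32) * chips_per_32bit_channel
    return n_chips * gbit_per_chip / 8

print(gddr_capacity_gb(384))                                # 24 GB (16Gbit chips)
print(gddr_capacity_gb(384, chips_per_32bit_channel=2))     # 48 GB clamshell
print(gddr_capacity_gb(512, gbit_per_chip=32,
                       chips_per_32bit_channel=2))          # 128 GB, needs 32Gbit parts
```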
> They can't just slap more memory on the board, they would need to dedicate significantly more silicon area to memory IO and drive up the cost of the part,
In the pedantic sense of just literally slapping more on existing boards? No, they might have one empty spot for an extra BGA VRAM chip, but not enough for the gains we're talking about. But this is absolutely possible, trivially so for someone like Intel/AMD/NVidia that has full control over the architectural and design process. Is it a switch they flip at the factory 3 days before shipping? No, obviously not. But if they had intended this to be the case ~2 years ago when this was just a product on the drawing board? Absolutely. There is zero technical/hardware/manufacturing reason they couldn't do this. And considering the "entry level" competitor product is the M4 Max, which starts at at least $3,000 (for a 128GB-equipped one), the margin on pricing more than exists to cover a few hundred extra in RAM and the extra overhead of higher-layer, more populated PCBs.
The real impediment is what you landed on at the end there combined with the greater ecosystem not having support for it. Intel could drop a card that is, by all rights, far better performing hardware than a competing Nvidia GPU, but Nvidia's dominance in API's, CUDA, Networking, Fabric-switches (NVLink, mellanox, bluefield), etc etc for that past 10+ years and all of the skilled labor that is familiar with it would largely render a 128GB Arc GPU a dud on delivery, even if it was priced as a steal. Same thing happened with the Radeon VII. Killer compute card that no one used because while the card itself was phenomenal, the rest of the ecosystem just wasn't there.
Now, if intel committed to that card, and poured their considerable resources into that ecosystem, and continued to iterate on that card/family, then now we're talking, but yeah, you can't just 10X VRAM on a card that's currently a non-player in the GPGPU market and expect anyone in the industry to really give a damn. Raise an eyebrow or make a note to check back in a year? Sure. But raise the issue to get a greenlight on the corpo credit line? Fat chance.
> The M4 Max needs an enormous 512bit memory bus to extract enough bandwidth out of those LPDDR5x chips
Does M4 Max have 64-byte cache lines?
If they can fetch or flush an entire cache line in a single memory-bus transaction, I wonder if that opens up any additional hardware / performance optimizations.
Because Apple isn't playing the same game as everyone else. They have the money and clout to buy out TSMCs bleeding-edge processes and leave everyone else with the scraps, and their silicon is only sold in machines with extremely fat margins that can easily absorb the BOM cost of making huge chips on the most expensive processes money can buy.
Bleeding edge processes is what Intel specializes in. Unlike Apple, they don’t need TSMC. This should have been a huge advantage for Intel. Maybe that’s why Gelsinger got the boot.
> Bleeding edge processes is what Intel specializes in. Unlike Apple, they don’t need TSMC.
Intel literally outsourced their Arrow Lake manufacturing to TSMC because they couldn't fabricate the parts themselves - their 20A (2nm) process node never reached a production-ready state, and was eventually cancelled about a month ago.
Intel is maybe a year or two behind TSMC right now. They might or might not catch up since it is a moving target, but I don't think there is anything TSMC is doing today that Intel won't be doing in the near future.
These days, Intel merely specializes in bleeding processes. They spent far too many years believing the unrealistic promises from their fab division, and in the past few years they've been suffering the consequences as the problems are too big to be covered up by the cost savings of vertical integration.
Intel's foundry side has been floundering so hard that they've resorted to using TSMC themselves in an attempt to keep up with AMD. Their recently launched CPUs are a mix of Intel-made and TSMC-made chiplets, but the latter accounts for most of the die area.
I'm not certain this is quite as damning as it sounds. My understanding is that the foundry business was intentionally walled off from the product business, and that the latter wasn't going to be treated as a privileged customer.
No, in fact, it sounds even more damning, because the client side was able to pick whatever was best on the market, and it wasn't Intel. The client side could learn and customize their designs to use another company's processes (this is an extremely hard thing to do, by the way) faster than Intel foundry could even get its pants up in the morning.
Intel foundry screwed up so badly that Nokia's server division was almost shut down because of Intel Foundry's failure (imagine being so bad at your job that your clients go out of business). If Intel's client side had chosen to use Foundry, there just wouldn't be any chips to sell.
Transistor IO logic scaling died a while ago, which is what prompted AMD to go with a chiplet architecture. Being on a more advanced process does not make implementing an 512-bit memory bus any easier for Apple. If anything, it makes it more expensive for Apple than it would be for Intel.
Everyone else wants configurable RAM that scales both down (to 16GB) and up (to 2TB), to cover smaller laptops and bigger servers.
GPUs with soldered-on RAM have 500GB/sec bandwidths, far in excess of Apple's chips. So the 8GB or 16GB offered by NVidia or AMD is just far superior at video game graphics (where textures are the priority).
> GPUs with soldered-on RAM have 500GB/sec bandwidths, far in excess of Apple's chips.
Apple is doing 800GB/sec on the M2 Ultra and should reach about 1TB/sec with the M4 Ultra, but that's still lagging behind GPUs. The 4090 was already at the 1TB/sec mark two years ago, the 5090 is supposedly aiming for 1.5TB/sec, and the H200 is doing 5TB/sec.
HBM is kind of not fair, lol. But a 4096-bit bus is gonna have more bandwidth than any competitor.
It's pretty expensive though.
The 500GB/sec number is for a more ordinary GPU like the B580 Battlemage in the $250ish price range. Obviously the $2000ish 4090 will be better, but I don't expect the typical consumer to be using those.
But an on-package memory bus has some of the advantages of HBM, just to a lesser extent, so it's arguably comparable as an "intermediate stage" between RAM chips and HBM. Distances are shorter (so voltage drop and capacitance are lower, so can be driven at lower power), routing is more complex but can be worked around by more layers, which increases cost but on a significantly smaller area than required for dimms, and the dimms connections themselves can hurt performance (reflection from poor contacts, optional termination makes things more complex, and the expectations of mix-and-match for dimm vendors and products likely reduce fine tuning possibilities).
There's pretty much an inverse relationship between flexibility and performance: DIMMs > soldered RAM > on-package RAM > die interconnects.
The question is why Intel GPUs, which already have soldered memory, aren't sold with more of it. The market here isn't something that can beat enterprise GPUs at training, it's something that can beat desktop CPUs at inference with enough VRAM to fit large models at an affordable price.
If their memory IO supports multiple ranks like the RTX 3090 (it used dual rank) did, they could do a new PCB layout and then add more memory chips to it. No additional silicon area would be necessary.
It doesn't matter if the "cost is driven up". Nvidia has proven that we're all lil pay pigs for them. The 5090 will be $3000 for 32GB of VRAM. Screenshot this now, it will age well.
You are absolutely correct, and even my non-prophetic ass echoed exactly the first sentence of the top comment in this HN thread ("Why don't they just release a basic GPU with 128GB RAM and eat NVidia's local generative AI lunch?").
Yes, yes, it's not trivial to have a GPU with 128GB of memory, with cache tags and so on, but is that really in the same universe of complexity as taking on Nvidia and their CUDA / AI moat any other way? Did Intel ever give the impression they don't know how to design a cache? There really has to be a GOOD reason for this, otherwise everyone involved with this launch is just plain stupid or getting paid off not to pursue this.
Saying all this with infinite love and 100% commercial support of OpenCL since version 1.0, a great enjoyer of A770 with 16GB of memory, I live to laugh in the face of people who claimed for over 10 years that OpenCL is deprecated on MacOS (which I cannot stand and will never use, yet the hardware it runs on...) and still routinely crushes powerful desktop GPUs, in reality and practice today.
Both Intel and AMD produce server chips with 12-channel memory these days (that's 12x64 bits, for a 768-bit bus), which combined with DDR5 can push effective socket bandwidth beyond 800GB/s, well into the territory occupied by single GPUs these days.
You can even find some attractive deals on motherboard/ram/cpu bundles built around grey market engineering sample CPUs on aliexpress with good reports about usability under Linux.
Building a whole new system like this is not exactly as simple as just plugging a GPU into an existing system, but you also benefit from upgradeability of the memory, and not having to use anything like CUDA. llamafile, as an example, really benefits from AVX-512 available in recent CPUs. LLMs are memory bandwidth bound, so it doesn't take many CPU cores to keep the memory bus full.
Another benefit is that you can get a large amount of usable high bandwidth memory with a relatively low total system power usage. Some of AMD's parts with 12 channel memory can fit in a 200W system power budget. Less than a single high end GPU.
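The peak-bandwidth arithmetic behind the 12-channel claim, for anyone who wants to plug in their own numbers (theoretical peak only; sustained bandwidth is lower, and the higher figure assumes faster MRDIMM-class memory):

```python
def peak_socket_bw_gbs(channels=12, bits_per_channel=64, mt_per_s=6400):
    """Peak theoretical DRAM bandwidth for one CPU socket, in GB/s."""
    return channels * (bits_per_channel / 8) * mt_per_s / 1000

print(peak_socket_bw_gbs())                 # ~614 GB/s at DDR5-6400
print(peak_socket_bw_gbs(mt_per_s=8800))    # ~845 GB/s with faster memory
```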
My desktop machine has had 128gb since 2018, but for the AI workloads currently commanding almost infinite market value, it really needs the 1TB/s bandwidth and teraflops that only a bona fide GPU can provide. An early AMD GPU with these characteristics is the Radeon VII with 16gb HBM, which I bought for 500 eur back in 2019 (!!!).
I'm a rendering guy, not an AI guy, so I really just want the teraflops, but all GPU users urgently need a 3rd market player.
That 128GB is hanging off a dual-channel memory bus that's only 128 bits wide, which is why you need the GPU. The Epyc and Xeon CPUs I'm discussing have 6x the memory bandwidth, and will trade blows with that GPU.
At a mere 20x the cost or something, to say nothing about the motherboard etc :( 500 eur for 16GB of 1TB/s with tons of fp32 (and even fp64! The main reason I bought it) back in 2019 is no joke.
Believe me, as a lifelong hobbyist-HPC kind of person, I am absolutely dying for such a HBM/fp64 deal again.
Isn't 2666 MHz ECC RAM obscenely slow? 32 cores without the fast AVX-512 of Zen5 isn't what anyone is looking for in terms of floating point throughput (ask me about electricity prices in Germany), and for that money I'd rather just take a 4090 with 24GB memory and do my own software fixed point or floating point (which is exactly what I do personally and professionally).
This is exactly what I meant about Intel's recent launch. Imagine if they went full ALU-heavy on latest TSMC process and packaged 128GB with it, for like, 2-3k Eur. Nvidia would be whipping their lawyers to try to do something about that, not just their engineers.
My experience is that input processing (prompt processing) is compute bottlenecked in GEMM. AVX-512 would help there, although my CPU’s Zen 3 cores do not support it and the memory bandwidth does not matter very much. For output generation (token generation), memory bandwidth is a bottleneck and AVX-512 would not help at all.
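A toy NumPy illustration of that prefill-vs-decode split: the same weight matrix used as a matrix-matrix product (GEMM, compute bound) during prompt processing versus a matrix-vector product (GEMV, bandwidth bound) during token generation (shapes are illustrative):

```python
import numpy as np

d_model = 4096                                            # llama 3 8B model dimension
W = (np.random.rand(d_model, d_model).astype(np.float32) - 0.5) * 0.01

# Prompt processing: all prompt tokens at once -> matrix-matrix product (GEMM).
# Plenty of arithmetic per byte of W loaded, so it is compute bound.
prompt = np.random.rand(128, d_model).astype(np.float32)  # 128 prompt tokens
prefill_out = prompt @ W                                   # (128, d_model)

# Token generation: one token per step -> matrix-vector product (GEMV).
# All of W is read to produce a single vector, so it is bandwidth bound.
token = np.random.rand(d_model).astype(np.float32)
decode_out = token @ W                                     # (d_model,)
print(prefill_out.shape, decode_out.shape)
```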
12 channel DDR5 is actually 12x32-bit. JEDEC in its wisdom decided to split the 64-bit channels of earlier versions of DDR into 2x 32-bit channels per DIMM. Reaching 768-bit memory buses with DDR5 requires 24 channels.
Whenever I see DDR5 memory channels discussed, I am never sure if the speaker is accounting for the 2x 32-bit channels per DIMM or not.
The question is whether there's enough overall demand for a GPU architecture with 4x the VRAM of a 5090 but only about 1/3rd of the bandwidth. At that point it would only really be good for AI inferencing, so why not make specialized inferencing silicon instead?
Intel and Qualcomm are doing this, although Intel uses HBM and their hardware is designed to do both inference and training while Qualcomm uses more conventional memory and their hardware is only designed to do inference:
They did not put it into the PC parts supply chain for reasons known only to them. That said, it would be awesome if Intel made high memory variants of their Arc graphics cards for sale through the PC parts supply chains.
That would basically mean Intel doubling the size of their current GPU die, with a different memory PHY. They're clearly not ready to make that an affordable card. Maybe when they get around to making a chiplet-based GPU.
Are you suggesting that Intel 'just' release a GPU at the same price point as an M4 Max SOC? And that there would be a large market for it if they did so? Seems like an extremely niche product that would be demanding to manufacture. The M4 Max makes sense because it's a complete system they can sell to Apple's price-insensitive audience, Intel doesn't have a captive market like that to sell bespoke LLM accelerator cards to yet.
If this hypothetical 128GB LLM accelerator was also a capable GPU that would be more interesting but Intel hasn't proven an ability to execute on that level yet.
Nothing in my comment says anything about pricing it at the M4 Max level. Apple charges that much because they can (typing this on an $8000 M3 Max). 128GB of LPDDR5 is dirt cheap these days; Apple just adds its premium because they like to. Nothing prevents Intel from releasing a basic GPU with that much RAM for under $1k.
You're asking for a GPU die at least as large as NVIDIA's TU102 that was $1k in 2018 when paired with only 11GB of RAM (because $1k couldn't get you a fully-enabled die to use 12GB of RAM). I think you're off by at least a factor of two in your cost estimates.
Though Intel should also identify, say, the top 100 fine-tuners and just send it to them for free, on the down low. That would create some market pressure.
Intel has Xeon Phi, which was a spin-off of their first attempt at a GPU, so they already have a lot of tech in place they can reuse. They don't need to go with GDDRx/HBMx designs that require large dies.
I don't want to further this discussion, but maybe you don't realise that some of the people who replied to you either design hardware for a living or have been in the hardware industry for longer than 20 years.
It would be interesting if those saying that a regular GPU with 128GB of VRAM cannot be made would explain how Qualcomm was able to make this card. It is not a big stretch to imagine a GPU with the same memory configuration. Note that Qualcomm did not use HBM for this.
Somehow Apple did it with the M3/M4 Max, likely designed by folks who are also on HN. The question is how many of the years spent designing HW were also spent educating oneself on the latest and best ways to do it.
Even LPDDR requires a large die. It only takes things out of the realm of technologically impossible to merely economically impractical. A 512-bit bus is still very inconveniently large for a single die.
Thank you, Wtallis. Somewhere along the line, this basic "knowledge" of hardware has been completely lost. I don't expect this would have needed explaining in any comment section on the old AnandTech. It seems hardware enthusiasm has mostly disappeared; I guess that is also why AnandTech closed. We now live in a world where most sites are just BS rumours.
It is possible to have multiple memory ranks to reduce the bus width requirements for a given amount of memory. Nvidia has demonstrated that this is doable with GDDR6X on the RTX 3090. The RTX 3090 has a 384-bit bus with 24 memory ICs, despite only needing 12 to reach 384-bit. That means it has every two chips sharing one 32-bit interface, which is a dual rank configuration. If you look at the history of computer memory, you can find many examples of multi-rank configurations. I also recall LR-DIMMs as being another way of achieving this.
Achieving 128GB VRAM with a 256-bit bus (which seems like a reasonable bus width) would mean some multiple of 8 chips. If Micron, Samsung or SK Hynix made 128Gb GDDR7 chips, then 8 would suffice. The best right now seems to be 24Gb, although 32Gb seems likely to follow (and it would likely come sooner if a large customer such as Intel asked for it), so they would just need 32 chips in a quad-rank configuration to achieve 128GB.
This assumes that there is no limit in the GDDR7 specification that prevents quad rank configurations. If there is and it still supports dual rank like GDDR6X did, then a 512-bit bus could be done. It would likely be extremely pricy and require a new chip tape out that has much more IO logic transistors to handle the additional bus width (and IO logic transistor scaling is dead, so the die area would be huge), but it is hypothetically possible. Given how much people are willing to pay for more VRAM, it could make business sense to do.
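A rough sketch of the capacity math described above (the helper is mine; the chip densities and bus widths are the ones from these comments):

    # How many GDDR chips, channels, and ranks to reach a target capacity on a given bus
    def gddr_config(capacity_GB: int, chip_Gb: int, bus_bits: int):
        chips = capacity_GB * 8 // chip_Gb          # chips needed at this density
        channels = bus_bits // 32                   # one 32-bit channel per bus slice
        ranks = chips / channels                    # chips sharing each channel
        return chips, channels, ranks

    print(gddr_config(128, 32, 256))  # (32, 8, 4.0)  -> quad-rank on a 256-bit bus
    print(gddr_config(128, 32, 512))  # (32, 16, 2.0) -> dual-rank on a 512-bit bus
    print(gddr_config(24, 8, 384))    # (24, 12, 2.0) -> the RTX 3090 layout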
Even if there is no limit in the GDDR7 specification that prevents quadrank, their memory IO logic would need to support it and if it does not, they would need to redesign that and do a new chip tape out in addition to a new board design. This would also be very expensive, although not as expensive as going to a 512-bit memory interface.
In summary, adding more memory would cost more to do and it would not improve competitiveness in the target market for these cards, which I imagine is the main reason that they do not do it.
By the way, the reason that Nvidia implemented support for 2 chips per channel is because they wanted to be able to reach 48GB VRAM on the workstation variant of the 3090 that is known as the RTX A6000 (non-Ada). I do not know why they used 24x 8Gb chips rather than 12x 16Gb on the 3090, although if I had to guess, it had something to do with rank interleaving.
Having four chips per channel is exactly why this is implausible. DDR5 can barely operate with four ranks per channel, at severely reduced speeds. Pulling that off with GDDR6 or GDDR7 is not something we can presume to be possible without specific evidence. The highest-density configurations possible for LPDDR5x are dual-rank and byte mode (one chip per 8 bits of the memory bus, so two chips ganged together to populate a 16-bit channel) — and that still operates at less than half the speed of GDDR6.
I've not seen any proposals for buffering LPDDR or GDDR, so an analog to LRDIMMs is not a readily-available technology.
GDDR is the memory technology that operates at the edge of what's possible for per-pin bandwidth. Loading that memory bus down with many ranks is not something we can expect to be achievable by just putting down more pads on the PCB.
> DDR5 can barely operate with four ranks per channel, at severely reduced speeds.
That is objectively false. See, for instance, V-color's Threadripper RAM[0]. If 96GB quad-rank modules @ 6000MHz in octo-channel counts as "barely operating", maybe we have different definitions of operating requirements.
As a side note, their quad-channel 3-rank RAM [1] hits 8000MHz, out of the box. Admittedly only 24GB modules, but still.
In that case, we need a 512-bit memory bus to do this using the 32Gbit GDDR7 chips that should be on the market in the near future. This would be very expensive, but it should be possible, or do you see a reason why that cannot be done either?
That said, I am not an electrical engineer (although I work alongside one and have had a minor role in picking low end components for custom PCBs), but I think that if Intel were to make a GPU with 128GB VRAM using GDDR7 in the next year or two, the engineer who does the trace routing to make it possible should set up a GoFundMe page for people to send beer money.
I think the goalposts may have shifted a bit, from why hasn't Intel made such a card to why is Intel not (publicly) working on such a card to be released in a year or two.
In terms of what would have been feasible for Intel to bring to market in 2024, the cheapest option for 128GB capacity would probably have been ~8.5Gb/s LPDDR5x on a 256-bit bus, but to at least match the bandwidth of the chip they just launched, it would have made more sense to use a 512-bit bus and bump the die size back up to ~half the reticle limit like their previous generation die with a 256-bit bus. So they would have had a quite slow but high-capacity GPU with a manufacturing cost equal to at least an RTX 4080, before adding in the cost of all that DRAM. And if they had started working on that chip as soon as LLaMA went public, they might have been able to deliver it by now.
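A rough comparison of the options being weighed; the 19Gb/s-per-pin-on-192-bit figure for the just-launched card is my own assumption from public specs, so treat these as ballpark numbers:

    # Bandwidth = bus width in bits * per-pin data rate / 8 bits per byte
    def bw_gbs(bus_bits: int, gbps_per_pin: float) -> float:
        return bus_bits * gbps_per_pin / 8

    print(bw_gbs(192, 19.0))   # ~456 GB/s - roughly the newly launched card (assumed specs)
    print(bw_gbs(256, 8.533))  # ~273 GB/s - 8.5Gb/s LPDDR5x on a 256-bit bus
    print(bw_gbs(512, 8.533))  # ~546 GB/s - same LPDDR5x on a 512-bit bus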
It's no surprise at all that such a risky niche product did not emerge from a division of Intel that is lucky to not have been liquidated yet.
In hindsight, I misread you as saying that 128GB of RAM on a “basic GPU” is not technically feasible. My reply was to say it is feasible.
Intel is rumored to have a B770 GPU in development, but it was running late and was then delayed to next year since it had yet to tape out, so for now they are launching their B580 and B570 graphics cards, which had been ready to go for a while. That is why the bus size appears to have dropped across generations. Presumably, if they made a 512-bit bus version, it would be a 9-series card. They certainly left room for it in their lineup, but as far as leaks are concerned, there is not much hope for one. I do not expect them to use anything other than GDDR7 on their Battlemage cards.
As for a high memory ARC card, I am of the opinion that such a product would sell well among the local llama community. There might even be more sales of a high memory ARC card for inference than of the regular ARC cards for gaming given that their discrete graphics sales peaked at 250,000 in Q1 2023 before collapsing, which can be confirmed using the data here:
The market for high memory GPUs is surely bigger than that. That said, Intel is likely pricing their ARC GPUs at a loss after R&D costs are considered. This is likely intended to help them break into a new market, although it has not been going well for them so far. I would guess that they are at least a generation away from profitability.
Intel intends for its Gaudi 3 accelerators to be used for this rather than the ARC line. Those coincidentally have 128GB of RAM, but they use HBM rather than a DDR variant. Qualcomm on the other hand made its own accelerator with 128GB of LPDDR4x RAM:
If my math is right, Qualcomm went with a 1024-bit memory bus and some incorrect rounding (rounding 137.5 to 138 before multiplying by 4) to reach their stated bandwidth figure. Qualcomm is not selling it through the PC parts supply chain, so I have no clue how much it costs, but I assume that it is expensive. I assume that they used LPDDR4x to be able to build a product since they were too late in securing HBM supply and even if they did, they would not be able to scale production to meet demand growth since Nvidia is buying all of the HBM that it can.
GPU inference is always a balancing act, trying to avoid bottlenecks on memory bandwidth (loading data from the GPU's global memory/VRAM to the much smaller internal shared memory, where it can be used for calculations) and compute (once the values are loaded).
Splitting the model up between several GPUs would add a third much worse bottleneck – memory bandwidth between the GPUs. No matter how well you connect them, it'll be slower than transfer within a single GPU.
Still, the fact that you can fit an 8× larger GPU might be worth it to you. It's a trade-off that's almost universally made while training LLMs (sometimes even with the model split down both its width and length), but is much less attractive for inference.
At least for LLMs and transformers this isn't relevant. Having 8x the chips and 8x the memory bandwidth is always better. Interchip communication for matrix multiplication against a constant left matrix with a tiny right matrix isn't bandwidth bound, only latency bound.
Last I've heard, the architecture makes that difficult. But my information may be outdated, and even if it isn't, I'm not a hardware designer and may have just misunderstood the limits I hear others discuss.
K80 used to be two glued K40 but their interconnect was barely faster than PCIe so it didn't have much benefit as one had to move stuff between two internal GPUs anyway.
Inference workloads likely won’t care very much. For llama 3.1 405B with bf16 when you split the workload across GPUs by layer, you need to do a 32KB memory copy before the next GPU can begin processing. That can be done incredibly quickly over PCI-E.
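A quick sanity check of that 32KB figure, assuming Llama 3.1 405B's model dimension of 16384, bf16 activations, and a nominal ~32GB/s for PCIe 4.0 x16 (the numbers here are my assumptions):

    hidden_size = 16384               # Llama 3.1 405B model dimension
    bytes_per_value = 2               # bf16
    transfer_bytes = hidden_size * bytes_per_value
    print(transfer_bytes / 1024)      # 32.0 KiB handed to the next GPU per token

    pcie4_x16_bytes_per_s = 32e9      # ~32 GB/s usable, roughly
    print(transfer_bytes / pcie4_x16_bytes_per_s * 1e6)  # ~1 microsecond per handoff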
How do architectural bottlenecks due to modified Von Neumann architectures' debuggable instruction pipelines limit computational performance when scaling to larger amounts of off-chip RAM?
Tomasulo's algorithm also centralizes on a common data bus (the CPU-RAM data bus) which is a bottleneck that must scale with the amount of RAM.
Can in-RAM computation solve for error correction without redundant computation and consensus algorithms?
> The term "von Neumann architecture" has evolved to refer to any stored-program computer in which an instruction fetch and a data operation cannot occur at the same time (since they share a common bus). This is referred to as the von Neumann bottleneck, which often limits the performance of the corresponding system. [4]
> The von Neumann architecture is simpler than the Harvard architecture (which has one dedicated set of address and data buses for reading and writing to memory and another set of address and data buses to fetch instructions).
For whatever reason Hynix hasn't turned their PIM into a usable product. LPDDR based PIM is insanely effective for inference. I can't stress this enough. An NPU+LPDDR6 PIM would kill GPUs for inference.
> The simplest method therefore would be to use TOPS/W for digital approaches in future, but to use TOPS-B/W for analogue in-memory computing approaches!
> [ { Frequency, NNs employed, Precision, Sparsity and Pruning, Process node, Memory and Power Consumption, utilization} for more representative variants of TOPS/W metric ]
GDDR isn't like the RAM that connects to the CPU; it's much more difficult and expensive to add more. You can get up to 48GB with some expensive stacked GDDR, but if you wanted to add more stacks you'd need to solve some serious signal timing related headaches that most users wouldn't benefit from.
I think the high memory local inference stuff is going to come from "AI enabled" CPUs that share the memory in your computer. Apple is doing this now, but cheaper options are on the way. As a shape it's just suboptimal for graphics, so it doesn't make sense for any of the GPU vendors to do it.
As someone else said - I don't think you have to have GDDR; surely there are other options. Apple does a great job of it on their APUs with up to 192GB, and even an old AMD Threadripper chip can do quite well with its DDR4/5 performance.
For AI inference you definitely have other options, but for low end graphics? The LPDDR that Apple (and Nvidia in Grace) use would be super expensive to get to a comparable bandwidth (think $3+/GB, and to get 500GB/sec you need at least 128GB).
And that 500GB/sec is pretty low for a GPU; it's like a 4070, but the memory alone would add $500+ to the cost of the inputs, not even counting the advanced packaging (getting those bandwidths out of LPDDR needs an organic substrate).
It's not that you can't, just when you start doing this it stops being like a graphics card and becomes like a cpu.
They can use LPDDR5x, it would still massively accelerate inference of large local LLMs that need more than 48GB RAM. Any tensor swapping between CPU RAM and GPU RAM kills the performance.
I think we don't really disagree; I just think that this shape isn't really a GPU, it's just a CPU, because it isn't very good for graphics at that point.
That's why I said "basic GPU". It doesn't have to be too fast, but it should still be way faster than a regular CPU. Intel already has Xeon Phi, so a lot of things were developed already (like memory controllers, heavily parallel dies, etc.)
I wonder if at that point you'd just be better served by a CPU with 4 channels of RAM. If my math is right, 4 channels of DDR5-8000 would get you 256GB/s. Not as much bandwidth as a typical discrete GPU, but it would be trivial to get many hundreds of GB of RAM and it would be expandable.
Unfortunately I don't think either Intel or AMD makes a CPU that supports quad channel RAM at a decent price.
4 channels of DDR5-8000 would give you 128GB/sec. DDR5 has 2x32-bit channels per DIMM rather than 1x64-bit channel like DDR4 did. You would need 8 channels of DDR5-8000 to reach 256GB/sec.
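The arithmetic, for reference (the 32-bit channel width is the DDR5 detail being pointed out):

    # DDR5 channels are 32 bits wide, i.e. 4 bytes per transfer
    def ddr5_bw_gbs(channels_32bit: int, mt_per_s: int) -> float:
        return channels_32bit * 4 * mt_per_s / 1000  # MT/s * bytes / 1000 = GB/s

    print(ddr5_bw_gbs(4, 8000))  # 128 GB/s - 4 x 32-bit channels (2 DIMMs)
    print(ddr5_bw_gbs(8, 8000))  # 256 GB/s - 8 x 32-bit channels (4 DIMMs)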
I think all the people saying "just use a CPU" massively underestimate the speed difference between current CPUs and current GPUs. There's like four orders of magnitude. It's not even in the same zip code. Say you have a 64-core CPU at 2GHz with 512-bit 1-cycle FP16 instructions. That gives you 32 ops per cycle, 2048 across the entire package, so 4TFlops.
My 7900 XTX does 120TFlops.
To match that, you would need to scale that CPU up to either 2048 cores, 2KB per register (still one-cycle!) or 64GHz.
I guess if you had 1024-bit registers and 8GHz, you could get away with only 240 cores. Good luck dissipating the heat from that, btw. To reverse an opinion I'm seeing in this thread, at that point your CPU starts looking more like a GPU by necessity.
Usually, you can do 2 AVX-512 operations per cycle and using FMADD (fused multiply-add) instructions, you can do two floating point operations for the price of one. That would be 128 operations per cycle per core. The result would be 16TFlops on a 2GHz 64 core CPU, not 4 TFlops. This would give a 1 order of magnitude difference, rather than 4 orders of magnitude.
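A sketch of that estimate as a formula, idealized peak and ignoring AVX-512 clock throttling (the helper and its defaults are mine):

    def cpu_peak_tflops(cores, ghz, vector_bits, element_bits, fma=2, ports=2):
        lanes = vector_bits // element_bits      # SIMD lanes per instruction
        # ops = lanes * 2 (fused multiply-add) * 2 (execution ports), per core per cycle
        return cores * ghz * lanes * fma * ports / 1000

    print(cpu_peak_tflops(64, 2.0, 512, 16))                  # ~16.4 TFLOPS fp16
    print(cpu_peak_tflops(64, 2.0, 512, 16, fma=1, ports=1))  # ~4.1 TFLOPS, the earlier estimate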
For inference, prompt processing is compute intensive, while token generation is memory bandwidth bound. The differences in memory bandwidth between CPUs and GPUs tend to be more profound than the difference in compute.
That's fair. On the other hand, there's like exactly one CPU with FP16 AVX512 anyways, and 64core aren't exactly commonplace either. And even with all those advantages, using a datacenter CPU, you're still a factor of 10 off from a GPU that isn't even consumer top-end. With a normal processor, say 16 cores, 16 float ops, even with fused ops and dispatching two ops per cycle you're still only at 2T and ~50x. In consumer spaces, I'm more optimistic about dedicated coprocessors. Maybe even iTPU?
This is only relevant for the flash attention part of the transformer, but a NPU is an equally suitable replacement for a GPU for flash attention.
Once you have offloaded flash attention, you're back to GEMV having a memory bottleneck. GEMV does a single multiplication and addition per parameter. You can add as many EXAFLOPs as you want, it won't get faster than your memory.
I guess it's hard to know how well this would compete with integrated GPUs, especially at a reasonable price point. If you wanted to spend $4000+ on it, it could be very competitive and might look something like Nvidia's Grace Hopper superchip, but if you want the product to be under $1k I think it might be better just to buy separate cards for your graphics and AI stuff.
It is not stacked. It is multirank. Stacking means putting multiple layers on the same chip. They are already doing it for HBM. They will likely do it for other forms of DRAM in the future. Samsung reportedly will begin doing it in the 2030s:
I am not sure why they can already do stacking for HBM, but not GDDR and DDR. My guess is that it is cost related. I have heard that HBM reportedly costs 3 times more than DDR. Whatever they are doing to stack it now that is likely much more expensive than their planned 3D fabrication node.
HBM3E memory is at least 3x the price of DDR5 (it requires 3x the wafer as DDR5) and capacity is sold out for all of 2025 already... that's the price and production bottleneck.
High speed, low latency server grade DDR5 is around $800-$1600 for 128GB. Triple that for $2400 - $4800 just for the memory. Still need the GPUs/APUs, card, VRMs, etc.
Even the nVidia H100 with "only" 94GB starts at $30k...
Nvidia's $30,000 is a 90% margin product at scale. They could charge 1/3 that and still be very profitable. There has rarely been such a profitable large corporation in terms of the combo of profit & margin.
Their last quarter was $35b in sales and $26b in gross profit ($21.8b op income; 62% op income margin vs sales).
Visa is notorious for their extreme margin (66% op income margin vs sales) due to being basically a brand + transaction network. So the fact that a hardware manufacturer is hitting those levels is truly remarkable.
It's very clear that either AMD or Intel could accept far lower margins to go after them. And indeed that's exactly what will be required for any serious attempt to cut into their monopoly position.
Visa doesn't actually make a ton of money off each transaction, if you divide out their revenue against their payment volume (napkin math)...
They processed $12T in payments last year (almost a billion payments per day), with a net revenue of $32B. That's a gross transaction margin of 0.26% and their GAAP net income was half that, about 0.14%. [1]
They're just a transaction network, unlike say Amex which is both an issuer and a network. Being just the network is more operationally efficient.
That’s a weird way to account for their business size. There isn’t a significant marginal cost per transaction. They didn’t sell $12T in products. They facilitated that much in payments. Their profits are fantastic.
If you have no clue how profit margins are calculated then you're better off staying quiet.
It's quite simple. Divide revenue minus costs by revenue. Transaction volume isn't revenue. Visa only gets the transaction fee.
Even if I give you the benefit of the doubt and do a proper interpretation of the number you've arrived at, its meaning is quite different and quite off topic from this discussion. What you have calculated is the total share of costs that Visa represents in that 12 trillion dollar part of the economy. It is like saying Visa's share of GDP is 0.1%.
> And indeed that's exactly what will be required for any serious attempt to cut into their monopoly position.
You misunderstand why and how Nvidia is a monopoly. Many companies make GPUs, and all those GPUs can be used for computation if you develop compute shaders for them. This part is not the problem, you can already go buy cheaper hardware that outperforms Nvidia if price is your only concern.
Software is the issue. That's it - it's CUDA and nothing else. You cannot assail Nvidia's position, and moreover their hardware's value, without a really solid reason for datacenters to own them. Datacenters do not want to own GPUs because once the AI bubble pops they'll be bagholders for Intel and AMD's deprecated software. Nvidia hardware can at least crypto mine, or be leased out to industrial customers that have their own remote CUDA applications. The demand for generic GPU compute is basically nonexistent, the reason this market exists at all is because CUDA exists, and you cannot turn over Nvidia's foothold without accepting that fact.
The only way the entire industry can fuck over Nvidia is if they choose to invest in a complete CUDA replacement like OpenCL. That is the only way that Nvidia's value can be actually deposed without any path of recourse for their business, and it will never happen because every single one of Nvidia's competitors hate each other's guts and would rather watch each other die in gladiatorial combat than help each other fight the monster. And Jensen Huang probably revels in it, CUDA is a hedged bet against the industry ever working together for common good.
I feel people are exaggerating the impossibility of replacing CUDA. Adopting CUDA is convenient right now because, yes, it is difficult to replace. The barrier to entry for orgs that can do that is very high. But it has been done. Google has the TPU, for example.
The TPU is not a GPU nor is it commercially available. It is a chip optimized around a limited featureset with a limited software layer on top of it. It's an impressive demonstration on Google's behalf to be sure, but it's also not a shot across the bow at Nvidia's business. Nvidia has the TSMC relations, a refined and complex streaming multiprocessor architecture and actual software support their customers can go use today. TPUs haven't quite taken over like people anticipated anyways.
I don't personally think CUDA is impossible to replace - but I do think that everyone capable of replacing CUDA has been ignoring it recently. Nvidia's role as the GPGPU compute people is secure for the foreseeable future. Apple wants to design simpler GPUs, AMD wants to design cheaper GPUs, and Intel wants to pretend like they can compete with AMD. Every stakeholder with the capacity to turn this ship around is pretending like Nvidia doesn't exist and whistling until they go away.
I don’t disagree with what you are saying but I want to point out that the fact that the TPU is not a GPU is not really relevant. In the end what matters most is whether or not it can accelerate PyTorch.
They're not exaggerating it. The more things change, the more they stay the same. Nvidia and AMD had the exact same relationship 15 years ago that they do today. The AMD crowd going on about their better efficiencies, and the Nvidia crowd having grossly superior drivers/firmware/hardware, including unique PhysX stuff that STILL has not been matched since 2012 (remember Planetside 2 or Borderlands 2 physics? Pepperidge Farm remembers...)
So many billions of dollars and no one is even 1% close to displacing CUDA in any meaningful way. ZLUDA is dead. ROCm is a meme, Scale is a meme. Either you use CUDA or you don't do meaningful AI work.
CUDA is not the issue. AMD have already reimplemented like 80% of it, and honestly that part of it mostly works fine. Pytorch supports it, (almost) all the big frameworks support it, if you're not doing really arcane things it just works. It's the drivers! They took like two years after the release of their flagship card to stop randomly crashing. Everything geohot has ever said about AMD drivers is 100% true. They just cannot stop shooting themselves in the foot.
Geohot (temporarily) giving up. https://github.com/ROCm/ROCm/issues/2198#issuecomment-157438... Sadly most of the real spicy Twitter messages are gone since he deleted all his content, but there was a really fun one where he went off on a beautifully cryptic commit message in the driver. He also begged AMD to opensource the firmware so he could debug it. Sadly, AMD promised to do it and then nothing happened, as is typical for AMD promises. That's why tinygrad nowadays is aiming to just bypass the driver and firmware entirely.
> The only way the entire industry can fuck over Nvidia is if they choose to invest in a complete CUDA replacement like OpenCL. That is the only way that Nvidia's value can be actually deposed without any path of recourse for their business, and it will never happen because every single one of Nvidia's competitors hate each other's guts and would rather watch each other die
Intel seems to have thrown their weight behind SYCL, which is an open standard intended to compete with CUDA. It's not clear there has been much interest from other hardware vendors though.
I do not misunderstand why Nvidia has a monopoly. You jumped drastically beyond anything I was discussing and incorrectly assumed ignorance on my part. I never said why I thought they had one. I never brought up matters of performance or software or moats at all. I matter-of-factly stated they had a monopoly; you assumed the rest.
It's impossible to assail their monopoly without utilizing far lower prices, coming up under their extreme margin products. It's how it is almost always done competitively in tech (see: ARM, or Office (dramatically undercut Lotus with a cheaper inferior product), or Linux, or Huawei, or Chromebooks, or Internet Explorer, or just about anything).
Note: I never said lower prices is all you'd need. Who would think that? The implication is that I'm ignorant of the entire history of tech, it's a poor approach to discussion with another person on HN frankly.
Nvidia's monopoly is pretty much detached from price at this point. That's the entire reason why they can charge insane margins - nobody cares! There is not a single business squaring Nvidia up with serious intent to take down CUDA. It's been this way for nearly two decades at this point, with not a single spark of hope to show for it.
In the case of ARM, Office, Linux, Huawei, and ChromeOS, these were all actual alternatives to the incumbent tools people were familiar with. You can directly compare Office and Lotus because they are fundamentally similar products - ARM had a real chance against x86 because it wasn't a complex ISA to unseat. Nvidia is not analogous to these businesses because they occupy a league of their own as the provider of CUDA. It's not an exaggeration to say that they have completely seceded from the market of GPUs and can sustain themselves on demand from crypto miners and AI pundits alone.
AMD, Intel and even Apple have bigger things to worry about than hitting an arbitrary price point, if they want Nvidia in their crosshairs. All of them have already solved the "sell consumer tech at attractive prices" problem but not the "make it complex, standardize it and scale it up" problem.
It is cheaper to pay Nvidia than it is to roll your own solution and no one else is competitive. That is the reason Nvidia can charge so much per card.
Thank you for laying it out. It's so silly to see people in the comments act like Intel or Nvidia can't EASILY add more VRAM to their cards. Every single argument against it is all hogwash.
Meta comment: the "why don't they just" phrasing usually indicates significant ignorance about a subject; it's better to learn a little bit before dispensing criticism about beancounters or whatnot.
In this case, the die's I/O limits preclude more than a reasonable number of DDR channels.
Because you can't stack that much RAM on a GPU without sufficient channels to do so. You could probably do 64GB on GDDR6, but you can't do 128GB on GDDR6 without more memory channels. 2GB per chip per channel is the current limit for GDDR6; this is why HBM was invented.
That is why you only see GPUs with at most 24GB of memory at the moment.
HBM2 can handle 64GB ( 8 x 8GB stacks ) ( Total capacity 128GB )
HBM3 can handle 192GB ( 8 x 24GB stacks ) ( Total capacity 384GB )
Look at the RTX 3090 and RTX A6000 (the non-Ada one). They both have 24 memory chips with a 384-bit memory bus, but one has 24GB of VRAM and the other has 48GB of VRAM. They both have two chips per channel. This breaks the 24GB VRAM limit that you claim to exist.
After carefully reviewing all of the other comments explaining the many technical and organizational reasons why they should not "just do that," I have come to the conclusion that it was a big missed opportunity by Intel.
This GPU has a 192-bit memory bus. At 32 bits of GDDR bus width per chip (well, with GDDR6 it's 2x16-bit data channels per chip), that means you have 6 channels. With regular GDDR6, the largest size produced is 16Gb (2GB), so 12GB is what you get. You could double that up with a beefed-up PCB, if the memory controller supports it, to get up to 24GB (the way workstation cards like the W7900 and A6000 do).
Beyond that, you'd have to move to GDDR7 (which has 24Gb/3GB chips incoming) or to HBM stacks, but at that point you're well beyond a "basic GPU". I think the only way you could get to 128GB would be either using regular (LP)DDR or HBM.
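To make the constraint concrete (the 192-bit bus is from the comment above; the chip densities are just the GDDR6/GDDR7 parts mentioned in this thread):

    # VRAM capacity from bus width, chip density, and chips per 32-bit channel
    def vram_GB(bus_bits: int, chip_Gb: int, chips_per_channel: int = 1) -> float:
        channels = bus_bits // 32
        return channels * chips_per_channel * chip_Gb / 8

    print(vram_GB(192, 16))     # 12 GB - 16Gb GDDR6, the launch configuration
    print(vram_GB(192, 16, 2))  # 24 GB - clamshell/dual-rank, workstation style
    print(vram_GB(192, 24))     # 18 GB - 24Gb GDDR7, when those chips arrive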
Note, Apple M chips have weak GPUs with decent MBW and large memory capacities (up to 192GB @ 800GB/s for an M2 Ultra, launched mid 2023) and have not been a major CUDA threat so I don't think your hypothesis actually stands up.
48 GB is at the tail end of what's reasonable for normal GPUs. The IO requires a lot of die space. And Intel's architecture is not very space efficient right now compared to Nvidia's.
And even if you spend a lot of die space on memory controllers, you can only fit so many GDDR chips around the GPU core while maintaining signal integrity. HBM sidesteps that issue but it's still too expensive for anything but the highest end accelerators, and the ordinary LPDDR that Apple uses is lacking in bandwidth compared to GDDR, so they have to compensate with ginormous amounts of IO silicon. The M4 Ultra is expected to have similar bandwidth to a 4090 but the former will need a 1024bit bus to get there while the latter is only 384bit.
Going off of how the 4090 and 7900 XTX are arranged, I think you could maybe fit one or two more chips around the die over their 12, but that's still a far cry from 128. That would probably just need a shared bus like normal DDR, as you're not fitting that much with 16Gbit density.
Look at the 3090, which uses 24 chips (12 on one side and 12 on another). Pushing it to 32 is doable. 32 is all you need to reach 128GB VRAM with the 32Gbit GDDR7 chips that should be on the market in the near future.
Where would you route the connection to the additional 4 groups of chips around the die? The PCIe connection needs to be there too, and they also may not like power delivery going through them
What if we did what others suggested was the practical limit - 48GB. Then just put 2-3 cards in the system and maybe had a little bridge over a separate bus for them to communicate?
I believe that would need some software work from Intel where they're lacking a bit now with their delayed start. Not sure how the frameworks themselves split up the inference work to avoid crossing GPUs as the bandwidth is horrible there.
If we're being reasonable and say that you're not using a modern HEDT CPU that costs a couple thousand, the best a consumer motherboard can get right now would be two PCIe gen 5 x8 slots at 32GB/s and one chipset PCIe gen 4 x8 at 16GB/s. I'm not sure if a motherboard like that actually exists, but Intel's chipset should allow it; AMD only does x4 to the chipset, so the third slot is limited by that.
Totally agree. Someone needs to exploit the lack of available gpu memory in graphics cards for model runners. Even training tensors tends to run against memory issues with the current cards.
armchair hardware enthusiast opinion: because silicon in the high-end is expensive, and not just a matter of slapping things together.
besides the clear limitation of the memory technology they are using compared to the nvidia's enterprise solution, for such large GPU chips that could really make use of such memory, they need to make binning possible by selling cut-down versions of them as well.
nvidia can pull this off because they can sell lower-end chips at the same time. intel is barely making a dent in sales, and making a high-end chip will only be very risky, at the cost of potentially benefiting a niche crowd.
> put them as a major CUDA threat.
that is a software/ecosystem problem, which hardware alone cannot solve. for all the devs that use Macs, even in AI it is only about inference at the moment. nobody is coming at CUDA for training for the near future. amd tried and failed plenty already.
Is anybody actually making anything close to a profit in the AI space? When your entire business model is propped up by VC money, making the bill of materials that you use to make negative profit cheaper is probably not very high on your list of priorities.
I think a better idea would be an NPU with slower memory, or tie it to the system DDR. I don't think consumer inference (possibly even training) applications would need the memory bandwidth offered by GDDR/HBM. Inference on my 7950x is already stupid fast (all things considered).
The deeper problem is that the market for this is probably incredibly niche.
The size of the local inference market is too small. Maybe a couple of thousand LLM enthusiasts? It's not enough to make a profit or even breakeven on the development costs for the hardware.
For now. This might very well change once the general public realizes they can be movie directors (or generative world gamers) just by downloading some model and plugging in an eGPU. The potential inference market is huge
The reason AMD and Nvidia don't is that they don't want to cannibalize their high end AI market. Intel doesn't have a high end AI market to protect.
NVidia and AMD make $$$ on datacenter GPUs so it makes sense they don't want to discount their own high-end. Intel has nothing there so they can happily go for commodization of AI hardware like what Meta did when releasing LLaMA to the wild.
Isn’t that setting just a historical thing, and an integrated GPU is able to access any system memory that is mapped by the IOMMU? I assume this is how it works for people using the NVIDIA Jetson AGX Orin 64GB Developer Kit to do inference. I do not know why it would be different for AMD APUs.
I remember somebody complaining about it on reddit, unable to overcome some BIOS limitation on an AMD G processor. Even on M3 Max one had to issue a special command to enable GPU to access more memory.
AMD would be selling it at a loss. Given that HBM costs 3x the price of desktop DRAM and a 192GB kit costs $600 at Newegg, the memory alone would cost 90% of the price. The GPU die, PCB, power circuitry, etc likely costs more than $200 to make.
This does not consider that the board of directors would crucify Lisa Su if she authorized the use of HBM on a consumer product while it is supply constrained and there is enterprise demand for products using it. AMD can only get a limited amount of it and what they do get is not enough for enterprise demand where AMD has extremely healthy margins.
Even if they by some miracle turned a profit on a $2000 consumer card with 192GB HBM, every sale would have a massive opportunity cost and effectively would be a loss in the eyes of the board of directors.
Meanwhile, Nvidia would be unaffected because AMD could not produce very many of these.
NVidia would be dramatically affected, just not overnight.
If Intel or AMD sold a niche product with 48GB RAM even at a loss, but hit high-end consumer pricing, there would be a flood of people doing various AI work to buy it. The end result would be that parts of NVidia's moat would start draining rather quickly, and AMD / Intel would be in a stronger position for AI products.
I use NVidia because when I bought AMD during the GPU shortage, ROCm simply didn't work for AI. This was a few years back, but I was burned badly enough that I'm unlikely to risk AMD again for a long, long time. Unused code sits broken, and no ecosystem gets built up. A few years later, things are gradually improving for AMD for the kinds of things I wanted to do years ago, but all my code is already built around NVidia, and all my computers have NVidia cards. It's a project with users, and all those users are buying NVidia as well (even if just for surface dependencies, like dev-ops scripts which install CUDA). That, times thousands of projects, is part of NVidia's moat.
If I could build a cheap system with around 200GB, that would be incentive for me to move the relatively surface dependencies to work on a different platform. I can buy a motherboard with four PCI slots, and plug in four 48GB cards to get there. I'd build things around Intel or AMD instead.
The alternative is NVidia would start shipping competitive cards. If they did that, their high-end profit margins would dissolve.
The breakpoints for inference functionality are really at around 16GB, 48GB, and 200GB, for various historical reasons.
> If I could build a cheap system with around 200GB,
Even if AMD dropped the price to $2000, you would not be able to build a system with one of these cards. You cannot buy these cards at their current 5 digit pricing. The idea that you could buy one if they dropped the price to $2000 is a fantasy, since others would purchase the supply long before you had a chance to purchase one, just like they do now.
AMD is already selling out at the current 5 digit pricing and Nvidia is not affected, since Nvidia is selling millions of cards per year and still cannot meet demand while AMD is selling around 100,000. AMD dropping the price to $2000 would not harm Nvidia in the slightest. It would harm AMD by turning a significant money maker into a loss leader. It would also likely result in Lisa Su being fired.
By the way, the CUDA moat is overrated since people already implement support for alternatives. llama.cpp for example supports at least 3. PyTorch supports alternatives too. None of this harms Nvidia unless Nvidia stops innovating and that is unlikely to happen. A price drop to $2000 would not change this.
Let’s say for the sake of argument that you could build such a card and sell it for less than $5k. Why would you do it? You know there’s huge demand, in the tens of billions per quarter, for high end cards. Why undercut that market so heavily? To overthrow NVidia? You'd end up with a way lower profit margin and then your shareholders would eat you alive.
I think plenty of enthusiastic open source devs would jump at it and fix their software if the software was reasonably open. The same effect as what happened when Meta released LLaMA.
AMD GPUs aren't very attractive to ML folks because they don't outshine Nvidia in any single aspect. Blasting lots of RAM onto a GPU would make it attractive immediately with lots of attention from devs occupied with more interesting things.
Does the 7900 XT outperform the 3090 Ti? If so, there's already a market, because those are the same price. I don't mean, in theory, are there any workloads that the 7900 XT can do better. Even if they're practically equal in performance, you get a warranty and support with your new 7900 XT.
Problem with MI300x is the price. Problem with 7900XTX is that it's at best as good as Nvidia with the same RAM for a similar price. If 7900XTX had e.g. 64GB of RAM, was 2x slower than 4080, and kept its price, it would sell like crazy.
I have a 7900 XTX. Honestly I regret it. It took two years for the driver to stop randomly crashing with very pedestrian ROCm loads. And there's no future in AMD support now they're getting out of the high-end dual-use GPU game anyways. I should have gone with NVidia.
Honestly, I don't think we would need "this type of RAM." The confused part of this discussion is the belief that we need obscene bandwidth.
If I need 300GB/s memory bandwidth for my workload, that can be accomplished with:
* One RAM chip with 300GB/s
* Two RAM chips with 150GB/s each
* Four RAM chips with 75GB/s each
Etc.
Stepping up from 16GB to 196GB, the bandwidth requirements for each chip go down 10-fold, and you can use much cheaper RAM as a result. And all the signalling requirements relax too.
Much of this discussion presumes a 200GB card would individually need the same bandwidth to each RAM chip as a 12GB card. This is just false. An A770 or 4060-grade card couldn't keep up with that much data. And if I'm using a small model, I can get the same bandwidth by properly distributing it among RAM chips (which most hardware does automatically).
An A770 or 4060-grade card, with the same total memory bandwidth as we have today, but 200GB RAM, would allow us to run high-quality LLMs locally or do high-resolution renders. That wouldn't have the same performance as a $200k card, obviously, but for many inference uses, that's just not very important.
If I were buying for my own uses, I'd want 12x 32GB DDR4-3200 DIMMs for a total of 384GB RAM at $600 for the RAM (say $2k total), with an individual throughput of 25GB/sec and a total throughput of 300GB/sec. I'd be okay with 4060-grade performance. My own uses are a bit niche, and I think for most other people's uses, something with a little more throughput and a little less capacity (48-196GB) might make more sense. But you definitely don't need the same throughput as existing GPU RAM.
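The trade being described, in numbers (assuming ~25.6GB/s per 64-bit DDR4-3200 DIMM and perfectly even interleaving across DIMMs):

    dimms = 12
    per_dimm_GB = 32
    per_dimm_gbs = 25.6                 # DDR4-3200, 64-bit DIMM
    print(dimms * per_dimm_GB)          # 384 GB total capacity
    print(dimms * per_dimm_gbs)         # ~307 GB/s aggregate, if accesses are spread evenly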
Pretty obvious stuff, right? I mean, you don't even need HBM for that, you just need a TON of memory channels. Sure, that kind of setup would only be efficient for highly coalesced reads/writes, but that's what you need these days for inference - highly coalesced reads and writes. You could even get by with 64GB of DDR5. DDR5-4800 (rather modest) is 38.4GB/s per channel. To get 1TB/s you'd only need 26 channels. With the more expensive DDR5-6400 you'd only need 20. That doesn't at all sound insurmountable for a company of Intel's caliber. Heck, break up the dies (and the channels) across several chiplets even, if the interconnect is decent it'll still run really well.
I do think one challenge is, AFAIK with most GDDR5/6 there's a density issue that requires either larger memory bus paths or other additional complexity to support large sizes.
That said, the lack of even a 16GB variant is sus.
I'll take some copium in that maybe they're trying to solve the 'size' issue somehow and are just making sure whatever system they use isn't gonna be an i820 MTH debacle before they pull the trigger on announcing it.
Does CUDA even matter than much for LLMs? Especially inference? I don't think software would be the limiting factor for this hypothetical GPU. Afterall it would be competing with Apple's M chips not with the 4090 or Nvidia's enterprise GPUs.
It's the only thing that matters. Folks act like AMD support is there because suddenly you can run the most basic LLM workload. Try doing anything actually interesting (i.e., try running anything cool in the mechanistic interpretability or representation/attention engineering world) with AMD and suddenly everything is broken, nothing works, and you have to spend millions worth of AI engineer developer time to try to salvage a working solution.
This is the most script kiddy comment I've seen in a while.
llama.cpp is just inference, not training, and the CUDA backend is still the fastest one by far. No one is even close to matching CUDA on either training or inference. The closest is AMD with ROCm, but there's likely a decade of work to be done to be competitive.
The funny thing about Cerebras is that it doesn't scale well at all for inference and if you talk to them in person, they are currently making all their money on training workloads.
Inference is still a lot faster on CUDA than on CPU. It's fine if you run it at home or on your laptop for privacy, but if you're serving those models at any scale, you're going to be using GPUs with CUDA.
Inference is also a much smaller market right now, but will likely overtake training later as more people use the models than compete to train the best one.
Inference on very large LLMs where model + backprop exceed 48GB is already way faster on a 128GB MacBook than on NVidia unless you have one of those monstrous Hx00s with lots of RAM which most devs don't.
No one is running LLMs on consumer NVidia GPUs or Apple MacBooks.
A dev, if they want to run local models, probably runs something which just fits on a proper GPU. For everything else, everyone uses an API key from whatever, because it's fundamentally faster.
Whether an affordable Intel GPU would be relevantly faster for inferencing is not clear at all.
A 4090 is at least double the speed of Apple's GPU.
4090 is 5x faster than M3 Max 128GB according to my tests but it can't even inference LLaMA-30B. The moment you hit that memory limit the inference is suddenly 30x slower than M3 Max. So a basic GPU with 128GB RAM would trash 4090 on those larger LLMs.
Quantized 30B models should run in 24GB VRAM. A quick search found people doing that with good speed: [1]
I have a 4090, PCIe 3x16, DDR4 RAM.
oobabooga/text-generation-webui
using exllama
I can load 30B 4bit GPTQ models and use full 2048 context
I get 30-40 tokens/s
Quantized, sure, but there is some loss of variability in the output that one can notice quickly with 30B models. If you want to use the fp16 version you are out of luck.
I ran some variation of llama.cpp that could handle large models by running a portion of them on the GPU and, if too large, the rest on the CPU, and those were the results. Maybe I can dig it up from some computer at home, but it was almost a year ago when I got the M3 Max with 128GB RAM.
My comment was about Intel having a starter project, getting an enthusiastic response from devs, building network effects, and iterating from there. They need a way to threaten Nvidia, and just focusing on what they can't do won't bring them there. There is one route where they can disturb Nvidia's high end over time, and that's a cheap basic GPU with lots of RAM. Like 1st-gen Ryzen, whose single-core performance was two generations behind Intel's, which trashed Intel by providing 2x as many cores for cheap.
That's a question M3 Max with its internal GPU already answered. It's not like I didn't do any HPC or CUDA work in the past to be completely clueless about how GPUs work though I haven't created those libraries myself.
You're not wrong, but technically llama.cpp does have training (both raw model and fine tuning). And it's been around for a long time. Back around the ggml->gguf switch I used llama.cpp to train a tiny 0.9B llama 1 through the early fast parts of the loss reduction on 3GB of IRC logs with 64 tokens of context over about a month. It eventually produced some gpt2-like IRC lines within its very short context.
Would anyone choose llama.cpp's training tools to do serious work? No. Do they exist and work, yes.
Yep. Any large GenAI image model (beyond SD 1.5) is hideously slow on Macs irrespective of how much RAM you cram in - whereas I can spit out a 1024x1024 image from the Flux.1 Dev model in ~15 seconds on an RTX 4090.
A 4080 won't do video due to low RAM. The GPU doesn't have to be as fast there; it can be 5x slower, which is still way faster than a CPU. And Intel can iterate from there.
My hunch is the path forward for Intel on both the CPU and the GPU end is to release a series of consumer chipsets with a large number of PCIe 5.0 lanes, and keep iterating on this. This would cannibalize some of the datacenter server side revenue, but that's a reboot... get the hackers raving about Intel value for the money instead of EPYC. Or do a skunkworks ARM64 M1-like processor; there's a market for this as a datacenter part...
I'd love to see their GPGPU software support under Linux.
The keywords you're looking for are the Intel oneAPI Base Toolkit (basekit), oneAPI, and IPEX (the Intel Extension for PyTorch).
https://christianjmills.com/posts/intel-pytorch-extension-tu...
https://chsasank.com/intel-arc-gpu-driver-oneapi-installatio...
Lots of interesting things packaged into OneAPI for developers to check out: https://www.intel.com/content/www/us/en/developer/tools/onea...
>hard to justify not returning them when I got laid off a week after purchasing them
Ouch, had something similar happen to me before when I bought a VR headset and had to return it. Wishing you the best on your job search!
I'm still using a 50" 1080p (plasma!) television in my living room. It's close to 15 years old now. I've seen newer and bigger TVs many times at my friends house, but it's just not better enough that I can be bothered to upgrade.
Doesn't plasma have deep blacks and color reproduction similar to OLED? They're still very good displays, and being 15 years old means it probably pre-dates the SmartTV era.
Yes exactly, the color is really good and it's just a stupid monitor that has an old basic Chromecast and a Nintendo Switch plugged in.
It does have HDMI-CEC, so I haven't even used the remote control in several years.
Classic Plasma TVs are no joke. I’ve got a 720p Plasma TV that still gets the job done.
If you’re ok with the resolution, then the only downside is significant power consumption and lack of HDR support.
I recently upgraded my main monitor from 1440p x 144Hz to 4K x 144Hz (with lots of caveats) and I agree with your assessment. If I had not made significant compromises, it would have cost at least $500 to get a decent monitor, which most people are not willing to spend.
Even with this monitor, I'm barely able to run it with my (expensive, though older) graphics card, and the screen alarmingly flashes whenever I change any settings. It's stable, but this is not a simple plug-and-play configuration (mine requires two DP cables and fiddling with the menu + NVIDIA control panel).
Why do you need two DP cables? Is there not enough bandwidth in a single one? I use a 4k@60 display, which is the maximum my cheap Anker USB-C Hub can manage.
I'm not sure, but there's an in-depth exploration of the monitor here: https://tftcentral.co.uk/reviews/acer_nitro_xv273k.htm
Reddit also seems to have some people who have managed to get 144 with FreeSync, but I've only managed 120.
Funnily enough while I was typing this Netflix caused both my monitors to blackscreen (some sort of NVIDIA reset I think) and then come back. It's not totally stable!
This is likely a cable issue. Certain cables can't handle 4k. In the past I had to switch from DisplayPort to HDMI with a properly rated cable to get past this.
It works up until too many pixels change, basically.
Had the same issue at 4k 60fps, it mostly worked but the screen flashed black from time to time. I used the thickest cable I had lying around and it has worked fine since.
If you're on an Nvidia 4000-series card, DisplayPort is limited to 1.4, which is ~26 Gigabit/s.
https://linustechtips.com/topic/729232-guide-to-display-cabl... has a calculator for display bandwidth; 4K@144 with HDR needs ~40 Gigabit/s. You can do better with compression, but I find Nvidia cards have issues with compression enabled.
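As a rough sanity check on those figures, the arithmetic looks like this (a back-of-the-envelope sketch in Python; the 30 bits per pixel for 10-bit HDR and the ~20% blanking overhead are assumptions, real CVT-R2 timings differ slightly):

    # Rough display bandwidth: pixels x refresh x bits-per-pixel, padded for blanking.
    def display_gbps(width, height, refresh_hz, bits_per_pixel=30, blanking_overhead=1.2):
        return width * height * refresh_hz * bits_per_pixel * blanking_overhead / 1e9

    print(f"4K@144 10-bit: ~{display_gbps(3840, 2160, 144):.0f} Gbit/s")  # ~43 Gbit/s
    print(f"4K@120 10-bit: ~{display_gbps(3840, 2160, 120):.0f} Gbit/s")  # ~36 Gbit/s
    # DisplayPort 1.4 (HBR3) carries roughly 25.9 Gbit/s of payload after encoding,
    # which is why 4K@144 HDR needs DSC or a second cable on those cards.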
Interesting. I've been running my 4K monitor at 240Hz with HDR enabled for months and haven't had any issues with Display Stream Compression on my 4080.
Not rich. Well within reach for Americans with disposable income. Mid-range 16" MacBook Pros are in the same price ballpark as 4k gaming rigs. Put another way, it costs less than a vacation for two to a popular destination.
I don’t think that’s true anymore. I routinely find 4K/27” monitors for under $100 on Craigslist, and a 3080-equivalent is still good enough to play most games on med-high settings at 4K and ~90Hz, especially if DLSS is available.
Your hypothetical person has a 3080 but needs to crawl Craigslist for a sub-$100 monitor? I guess those people exist, but idk why you'd bother with a 3080 and then buy a low refresh rate, high input latency, probably TN, low color accuracy Craigslist runoff.
Could just get a 3060 and a nice 1440p monitor.
3080-equivalent performance can be had for a few hundred bucks these days, no?
> You say PC gamers at the start of your comment and gaming PC enthusiasts at the end. These groups are not the same
Prove to me those aren't synonyms.
Prove to me they are.
I used to be in the '4k or bust' camp, but then I realized that I needed 1.5x scaling on a 27" display to have my UI at a comfy size. That put me right back at 1440p screen real estate and you had to deal with fractional scaling issues.
Instead, I bought a good 27" 1440p monitor, and you know what? I am not the discerning connoisseur of pixels that I thought I was. Honestly, it's fine.
I will hold out with this setup until I can get an 8k 144Hz monitor and a GPU to drive it for a reasonable price. I expect that will take another decade or so.
> That put me right back at 1440p screen real estate and you had to deal with fractional scaling issues.
Get a 5K 27".
No fractional scaling, same real estate, much better picture.
MacOS and Windows have very good fractional scaling now. Linux... still no, but it's better than the X days.
I find the scaling situation with KDE is better with the Xorg X11 server than it is with Wayland. Things like Zoom will properly be scaled for me with the former.
Are you using fractional scaling, or 2x?
2x
I have a 4K 43" TV on my desk and it is about perfect for me for desktop use without scaling. For gaming, I tend to turn it down to 1080p because I like frames and don't want to pay up.
At 4K, it's like having 4 21" 1080p monitors. Haven't maximized or minimized a window in years. The sprawl is real.
It's true but I don't run into this issue often since most games and Windows will offer UI/Menu scaling without changing individual windows or the game itself.
I think it's less that gamers have decided it's the "endgame" and more that current gen games at good framerates at 4k require significantly more money than 1440p does, and at least to my eyes just running at native 1440p on a 1440p monitor looks much better than running an internal resolution of 1440p upscaled to 4k, even with DLSS/FSR - so just upgrading piecemeal isn't really a desirable option.
Most people don't have enough disposable income to make spending that extra amount a reasonable tradeoff (and continuing to spend on upgrades to keep up with their monitor on new games).
This is a trade-off with frame rates and rendering quality. When having to choose, most gamers prefer higher frame rate and rendering quality. With 4K, that becomes very expensive, if not impossible: 4K is 2.25 times the pixels of 1440p, which means, all else equal, you can get more than double the frame rate at 1440p with the same processing power and bandwidth.
In other words, the current tech just isn’t quite there yet, or not cheap enough.
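For reference, the raw pixel counts behind that 2.25x figure (a quick sketch):

    # Pixels pushed per frame at common gaming resolutions.
    resolutions = {"1080p": (1920, 1080), "1440p": (2560, 1440), "2160p": (3840, 2160)}
    pixels = {name: w * h for name, (w, h) in resolutions.items()}

    print(pixels["2160p"] / pixels["1440p"])  # 2.25: 4K draws 2.25x the pixels of 1440p
    print(pixels["1440p"] / pixels["1080p"])  # ~1.78: 1440p vs 1080p is a smaller jump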
Arguably 1440p is the sweet spot for gaming, but I love 4k monitors for the extra text sharpness. Fortunately DLSS and FSR upscaling are pretty good these days. At 4k, quality-mode upscaling gives you a native render resolution of about 1440p, with image quality a little better and performance a little worse.
It’s a great way to have my cake and eat it too.
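If anyone wants the numbers: the commonly cited upscaler presets map to these internal render resolutions at a 4K output (a sketch; the scale factors are approximate and can vary by game and version):

    # Approximate upscaling presets and the internal render resolution at 4K output.
    presets = {"Quality": 2 / 3, "Balanced": 0.58, "Performance": 0.5, "Ultra Performance": 1 / 3}
    out_w, out_h = 3840, 2160
    for name, scale in presets.items():
        print(f"{name:17s} -> {round(out_w * scale)}x{round(out_h * scale)}")
    # Quality mode at 4K renders internally at 2560x1440, i.e. native 1440p.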
I don’t think it’s seen as the end game; it’s that if you want 120 fps (or 144, 165, or 240) without turning down your graphics settings, you’re talking $1000+ GPUs, plus a huge case and a couple hundred watts more on your power supply.
1440p hits a popular balance where it’s more pixels than 1080p but not so absurdly expensive or power hungry.
Eventually 4K might be reasonably affordable, but we’ll settle at 1440p for a while in the meantime like we did at 1080p (which is still plenty popular too).
You incorrectly presume that all gamers care about things such as "pixel edges". I think 1080p is fine. Game mechanics always trump fidelity.
That's more of a function of high end Nvidia gaming card prices and power consumption. PC gaming at large isn't about chasing high end graphics anyway, steam deck falls under that umbrella and so does a vast amount of multiplayer gaming that might have other priorities such as affordability or low latency/very high fps.
How far away are you sitting from that monitor? Mine is about 3-4 feet away from my face and I have not had the same experience.
I don't see anybody thinking 1440p is "endgame," as opposed to a pretty nice compromise at the moment.
I find dual 24-inch 1440p a great compromise. Higher pixel density, decent amount of screen real estate, and nice to have an auxiliary monitor when gaming.
I run the second monitor off the IGPU so it doesn't even tax the main GPU.
It's a nice compromise for semi competitive play. On 4k it'd be very expensive and most likely finicky to maintain high FPS.
Tbh now that I think about it I only really need resolution for general usage. For gaming I'm running everything but textures on low with min or max FOV depending on the game so it's not exactly aesthetic anyway. I more so need physical screen size so the heads are physically larger without shoving my face in it and refresh rate.
4k gaming (2160p) is like watching 8k video on your TV.
It's doable, the tech is there. But the cost is WAY too high compared to what you get from it in the end.
if you can see the pixels on a 27 inch 1440p display, you're just sitting too close to the screen lol
I don't directly see the pixels per se like on 1080p at 27-inch at desktop distances. But I see harsh edges in corners and text is not flawless like on 2160p.
Like I said, it's on the cusp of invisible pixels.
Gamers often use antialias settings to smooth out harsh edges, whereas an inconsistent frame rate will literally cost you a game victory in many fast-action games. Many esports professionals use low graphics settings for this reason.
I've not tried but I've heard that a butter-smooth 90, 120, or 300 FPS frame rate (that is also synchronized with the display) is really wonderful in many such games, and once you experience that you can't go back. On less powerful systems it then requires making a tradeoff with rendering quality and resolution.
Nvidia markets the 4060 as a 1080p card. Its design makes it worse at 1440p than past X060 cards too. Intel has XeSS to compete with DLSS and is reportedly coming out with its own frame-gen competitor. $40-50 is a decent savings in the budget market, especially if Intel's claims are to be believed and it's actually faster than the 4060.
2k usually refers to 1080p no? The k is the approximate horizontal resolution, so 1920x1080 is definitely 2k enough.
Actual use is inconsistent. From https://en.wikipedia.org/wiki/2K_resolution: “In consumer products, 2560 × 1440 (1440p) is sometimes referred to as 2K, but it and similar formats are more traditionally categorized as 2.5K resolutions.”
“2K” is used to denote WQHD often enough, whereas 1080p is usually called that, if not “FHD”.
“2K” being used to denote resolutions lower than WQHD is really only a thing for the 2048 cinema resolutions, not for FHD.
TIL
2k usually refers to 2560x1440.
1920x1080 is 1080p.
It doesn't make a whole lot of sense, but that's how it is.
https://en.wikipedia.org/wiki/2K_resolution
That's amusing because I think almost everyone I know confuses it with 1440p. I've never heard of 2k being used for 1080p before.
“In consumer products, 2560 × 1440 (1440p) is sometimes referred to as 2K,[13] but it and similar formats are more traditionally categorized as 2.5K resolutions.”
1440p is colloquially referred to as 2.5K, not 2K.
It'd be pretty weird if 1440p were called 2k. 1080p (1920 pixels wide) falls only 80 pixels short of 2k, while consumer 4k (3840 wide) falls 160 pixels short of 4k. So 1080p has a much better claim to the label 2k than 1440p does, and arguably a somewhat better claim to 2k than 4k has to 4k.
[EDIT] I mean, of course, 1080p's also not typically called that, yet another resolution is, but labeling 1440p 2k is especially far off.
You are misunderstanding. 1080p, 1440p, 2160p refer to the number of rows of pixels, and those terms come from broadcast television and computing (the p is progressive, vs i for interlaced). 4k, 2k refer to the number of columns of pixels, and those terms come from cinema and visual effects (and originally means 4096 and 2048 pixels wide). That means 1920×1080 is both 2k and 1080p, 2560×1440 is both 2.5k and 1440p, and 3840×2160 is both 4k and 2160p.
> You are misunderstanding. 1080p, 1440p, 2160p refer to the number of rows of pixels
> (the p is progressive, vs i for interlaced)
> 4k, 2k refer to the number of columns of pixels
> 2560×1440 is both 2.5k and 1440p, and 3840×2160 is both 4k and 2160p.
These parts I did not misunderstand.
> and those terms come from cinema and visual effects (and originally means 4096 and 2048 pixels wide)
OK that part I didn't know, or at least had forgotten—which are effectively the same thing, either way.
> 1920×1080 is both 2k and 1080p
Wikipedia suggests that in this particular case (unlike with 4k) application of "2k" to resolutions other than the original cinema resolution (2048x1080) is unusual; moreover, I was responding to a commenter's usage of "2k" as synonymous with "1440p", which seemed especially odd to me.
I have never seen 2.5k used in the wild (gamer forums etc) so it can't be that colloquial.
I've never seen "2.5K" used colloquially and I see "2K" used everywhere all the time.
Can it compete with the massive used GPU market though? Why buy a new Intel card when I can get a used Nvidia card that I know will work well?
Warranty. Plus, do you want a 2nd hand GPU that was used for cryptomining 24/7?
>Plus, do you want a 2nd hand GPU that was used for cryptomining 24/7?
...which will be most likely in better condition than anything that was used for, let's say, gaming...
Throw a new cooler on it and then yeah sure.
To some, buying used never crosses their mind.
Please say 1440p and not 2k. Ignoring arguments about what 2k should mean, there’s enough use either way that it’s confusing.
I see what you're saying, but I also feel like ALL Nvidia cards are "2K" oriented cards because of DLSS, frame gen, etc. Resolution is less important now in general thanks to their upscaling tech.
12GB max is a non-starter for ML work now. Why not come out with a reasonably priced 24gb card even if it isn't the fastest and target it at the ML dev world? Am I missing something here?
I think a lot of replies to this post are missing that Intel's last graphics card wasn't received well by gamers due to poor drivers. The GT 730 from 2014 has more users than all Arc cards combined according to the latest Steam survey.[0] It's entirely possible that making a 24gb local inference card would do better since they can contribute patches for inference libraries directly like they did for llama.cpp, as opposed to a gaming card where the support surface is much larger. I wish Intel well in any case and hope their drivers (or driver emulators) improve enough to be considered broadly usable.
[0] https://store.steampowered.com/hwsurvey/videocard/ - 0.19% share
> Am I missing something here?
Video games
It's insane how out of touch people can be here, lol
How big is NVIDIA now? You don't think breaking into that market is a good strategy? And, yes, I understand that this is targeted at gamers and not ML. That was the point of the comment I made. Maybe if they did target ML they would make money and open a path to the massive server market out there.
A video card that beats the 4060 for under $250 is very much going to be a problem for AMD and is going to eat the "low end" market if it is reasonably stable.
I have been trying to hold my slurs in reading this thread.
These ML AI Macbook people are legit insane.
Desktops and gaming is ugly and complex to them (because lego is hard and macbook look nice unga bunga), yet it is a mass market Intel wants to move in on.
People here complain because Intel is not making a cheap GPU to "make AI" on when that's a market of maybe 1000 people.
This Intel card is perfect for an esports gaming machine running CS2, Valorant, Rocket League and casual or older games like The Sims, GOG games etc. Market of 1 million+ right there, CS2 alone is 1mil people playing every day. Not people grinding leetcode on their macs. Every real developer has a desktop, epyc cpu, giga ram and a nice GPU for downtime and run a real OS like Linux or even Windows (yes majority of devs run Windows)
Most devs use Windows (https://www.statista.com/statistics/869211/worldwide-softwar...). Reddit's r/LocalLLaMA alone has 250k users. Clearly the market is bigger than 1000 people. Why are gamers and Linux people always so aggressively dismissive of other people's interests?
>Why are gamers and Linux people always so aggressively dismissive of other people's interests?
Both groups have a high autism %
We love to be "technically correct" and we often are. So we get frustrated when people claim things that are wrong.
Intel GPUs don't sell well to gamers. They've been on the market for years now.
>market of maybe 1000 people
The market of people interested in local ai inference is in the millions. If it's cheap enough the data center market is at least 10 million.
> Intel GPUs don't sell well to gamers. They've been on the market for years now.
Intel has only had discrete GPUs on the market for 2 years. I guess that is a plural number of years, but only barely.
Yes, Intel cards have sucked. But they are trying again!
Yes, everyone knows the best way of making a product not suck is to give up and never try again.
/s
ML is about to hit another winter. Maybe Intel is ahead of the industry.
Or we can keep asking high computers questions about programming.
> ML is about to hit another winter.
I agree ML is about to hit (or has likely already hit) some serious constraints compared to breathless predictions of two years ago. I don't think there's anything equivalent to the AI winter on the horizon, though—LLMs even operated by people who have no clue how the underlying mechanism functions are still far more empowered than anything like the primitives of the 80s enabled.
I think there'll be a "financial" winter - or, put another way, a bubble bursting - the investment right now is simply unsustainable; how are these products going to be monetized?
Nvidia had a revenue of $27billion in 2023 - that's about $160 per person per year [0] for every working age person in the USA. And it's predicted to more than double in 2024. If you reduce that to office workers (you know, the people who might actually get some benefit, as no AI is going to milk a cow or serve you starbucks) that's more like $1450/year. Or again more than double that for 2024.
How much value add is the current set of AI products going to give us? It's still mostly promise too.
Sure, like most bubbles there'll probably still be some winners, but there's no way the current market as a whole is sustainable.
The only way the "maximal AI" dream income is actually going to happen is if they functionally replace a significant proportion of the working population completely. And that probably would have large enough impacts to society that things like "Dollars In A Bank" or similar may not be so important.
[0] Using the stat of "169.8 million people worked at some point in 2022" https://www.bls.gov/news.release/pdf/work.pdf
[1] 18.5 million office workers according to https://www.bls.gov/news.release/ocwage.nr0.htm
While I'd agree monetisation seems to be a challenge in the long term (analogy: spreadsheets are used everywhere, but are so easy to make they're not themselves a revenue stream, only as part of a bigger package)…
> Nvidia had a revenue of $27billion in 2023 - that's about $160 per person per year [0] for every working age person in the USA
As a non-American, I'd like to point out we also earn money.
> as no AI is going to milk a cow or serve you starbucks
Cows have been getting the robots for a while now, here's a recent article: https://modernfarmer.com/2023/05/for-years-farmers-milked-co...
Robots serve coffee as well as the office parts of the coffee business: https://www.techopedia.com/ai-coffee-makers-robot-baristas-a...
Some of the malls around here have food courts where robots bring out the meals. I assume they're no more sophisticated than robot vacuum cleaners, but they get the job done.
Transformer models seem to be generally pretty good at high-level robot control, though IIRC a different architecture is needed down at the level of actuators and stepper motors.
Sure, robotics help many jobs, and some of the current deep learning boom seems to have crossover in improving that - but how many of them are running LLMs that affect Nvidia's bottom line right now? There's some interesting research in that area, but it's certainly not the primary driving force. And then, is the control system even the limiting factor for many systems? It's probably relatively easy to get a machine today that makes a Starbucks coffee "as good as" a decently trained human. But the market doesn't seem to want that.
And I know restricting it to the US is a simplification, but so is restricting it to Nvidia, it's just to give a ballpark back-of-the-envelope "does this even make sense?" level calculation. And that's what I'm failing to see.
Machines that will make espresso, automatically, that I personally like better than what Starbucks serves are widely available. No AI needed, and they aren't even "robotic". They can use ordinary coffee beans, and you can get them for home use or for commercial use. You can also go to a mall and get a robot to make you coffee.
Nonetheless, Starbucks does not use these machines, and I don't see any reason that AI, on its current trajectory, will change that calculation any time soon.
I love how the fact that we might not want AI/robots everywhere in our lives isn't even discussed.
They could serve us a plate of shit and we'd debate if pepper or salt is better to complement it
It's pretty often discussed, it's just hard to put everything into a single comment (or thread).
I mean, Yudkowsky has basically spent the last decade screaming into the void about how AI will with high probability literally kill everyone, and even people like me who think that danger is much less likely still look at the industrial revolution and how slow we were to react to the harms of climate change and think "speed-running another one of these may be unwise, we should probably be careful".
Well, "AI" is milking cows. Not LLM's though. Our milking robot uses image recognition to find the cow's teats to put the milking cup on.
Yeah, but automated milking robots like that have been in the market for more than a decade now IIRC?
Seems like a lot of CV solutions have seen fairly steady but small incremental advances over the past 10-15 years, quite unrelated to the current AI hype.
Improving capabilities of AI isn't at odds with expecting an "AI Winter" - just the current drive is more hype than sustainable, provable progress.
We've been through multiple AI Winters, as a new technique is developed, it does increase the capabilities. Just not as much as the hype suggested.
To say there won't be a bust implies this boom will last forever, into whatever singularity that implies.
I think the more accurate denominator would be the world population. People are seeing benefits to LLMs even outside of the office.
How do LLMs make money though?
> I think the more accurate denominator would be the world population. People are seeing benefits to LLMs even outside of the office.
For example ?
(besides deep fakes)
I use one in the kitchen, because it's easier than the ads and prose on most recipe websites, and it can adapt to whatever ingredients I actually have rather than being a fixed list.
Used them in the garden while weeding and in the garden store while planning what to plant, in both cases to identify plants by image and tell me about them — though I'd say image capable AI are no longer mere "large language models".
Used ChatGPT while shopping to help me locate products in the store I was in, when I couldn't find them just by wandering the aisles, by uploading a photo of the aisle I happened to be in at the point I gave up.
> even operated by people who have no clue how the underlying mechanism functions are still far more empowered than anything like the primitives of the 80s enabled.
I'm still not convinced about that. All the """studies""" show 30-60% boost in productivity, but clearly this doesn't translate to anything meaningful in real life, because no industry laid off 30-60% of its workforce and no industry progressed anywhere close to 30% since ChatGPT was released.
It was released a whole 24 months ago; remember the talks about freeing us from work and curing cancer... Even investment funds, which are the biggest suckers for anything profitable, are more and more doubtful.
> All the """studies""" show 30-60% boost in productivity
The services that provide serious productivity boosts aren't being heavily used or marketed yet. They:
1. Attempt to do narrow tasks with high proficiency.
2. Replace specific job titles.
3. Are high-value enough to be slower to engage in layoffs.
What I'm suggesting is not a fungible solution, but it is one that will be highly profitable and productive.
What we had in the 80s was barely able to perform spell-check, free downloadable LLMs today are mind-blowing even in comparison to GPT-2.
I think the only good thing that came out of the 80s was the 90s. I’d leave that decade alone so we can forget about it.
Yeah… I want to think of it like mining, where you’ve found an ore vein. You have to switch from prospecting to mining. There’s a lot of work to be done by integrating our LLMs and other tools with other systems, and I think the cost/benefit of making models bigger, Bigger, BIGGER is reaching a plateau.
Haven't people been saying that for the last decade? I mean, eventually they will be right, maybe "about" means next year, or maybe a decade later? They just have to stop making huge improvements for a few years and the investment will dry up.
I really wasn't interested in computer hardware anymore (they are fast enough!) until I discovered the world of running LLMs and other AI locally. Now I actually care about computer hardware again. It is weird, I wouldn't have even opened this HN thread a year ago.
What makes local AI interesting to you vs larger remote models like ChatGPT and Claude?
Not OP but for me a big thing is privacy, I can feed it personal documents and expect those to not leak.
It has zero cost, hardware is already there. I'm not captive to some remote company.
I can fiddle and integrate with other home sensors / automation as I want.
Curious as I’m of the same mind - what’s your local AI setup? I’m looking to implement a local system that would ideally accommodate voice chat. I know the answer depends on my use case - mostly searching and analysis of personal documents - but would love to hear how you’ve implemented.
If you are just starting up, you can try out 'open-webui' as inspiration. After that you can just use llama.cpp to build out your own things.
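If you want to skip the UI layer entirely, a minimal sketch with the llama-cpp-python bindings looks roughly like this (the model path and parameters are placeholders, not recommendations, and this exact snippet is untested here):

    # Minimal local-inference sketch via llama-cpp-python; the GGUF path is a placeholder.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/some-instruct-q4_k_m.gguf",  # any quantized GGUF you already have
        n_gpu_layers=-1,  # offload every layer to the GPU if it fits in VRAM
        n_ctx=4096,       # context window; raise it if you have memory to spare
    )

    reply = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize the attached notes: ..."}]
    )
    print(reply["choices"][0]["message"]["content"])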
Hardware side, I just have a beefy server that acts as a router (Mellanox card to the provider fiber and the local fiber network), firewall, wifi access point, Zigbee coordinator, host to various services, camera video feed ingestion and processing, and so on...
define beefy server
llama.cpp and time seems to be the general answer to this question.
Control and freedom. You can use unharmonious models and hacks to existing models. Also latency: you can actually use AI for a lot more applications when it is running locally.
Lack of ideological capture of the public models.
The survivors of the AI winter are not the dinosaurs but the small mammals that can profit by dramatically reducing the cost of AI inference in a minimum Capex environment.
Selling cheap products that are worse than the competition is a valid strategy during downturns as businesses look to cut costs
The ML dev world isn’t a consumer mass market like PC gaming is.
Launching a new SKU for $500-1000 with 48gb of RAM seems like a profitable idea. The GPU isn't top-of-the-line, but the RAM would be unmatched for running a lot of models locally.
It's not technically possible to just slap on more RAM. GDDR6 is point-to-point with option for clamshell, and the largest chips in mass production are 16Gbit/32 bit. So, for a 192bit card, the best you can get is 192/32×16Gbit×2 = 24GB.
To have more memory, you have to design a new die with a wider interface. The design+test+masks on leading edge silicon is tens of millions of NRE, and has to be paid well over a year before product launch. No-one is going to do that for a low-priced product with an unknown market.
The savior of home inference is probably going to be AMD's Strix Halo. It's a laptop APU built to be a fairly low end gaming chip, but it has a 256-bit LPDDR5X interface. There are larger LPDDR5X packages available (thanks to the smartphone market), and Strix Halo should be eventually available with 128GB of unified ram, performance probably somewhere around a 4060.
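To put that capacity ceiling in code form (the same arithmetic as above, just parameterized; chip density and bus width are the only knobs):

    # Max VRAM for a GDDR6 card given bus width and the densest chips in mass production.
    def max_vram_gb(bus_width_bits, chip_density_gbit=16, chip_width_bits=32, clamshell=True):
        chips = bus_width_bits // chip_width_bits
        total_gbit = chips * chip_density_gbit * (2 if clamshell else 1)
        return total_gbit / 8  # gigabits -> gigabytes

    print(max_vram_gb(192))                   # 24.0 GB: the ceiling for a 192-bit design
    print(max_vram_gb(192, clamshell=False))  # 12.0 GB: what the B580 actually ships with
    print(max_vram_gb(384))                   # 48.0 GB: needs a much wider, pricier die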
Interesting, will have to try it out when it is released.
You can’t just throw in more RAM without having the rest of the GPU architected for it. So there’s an R&D cost involved for such a design, and there may even be trade-offs on performance for the mass-market lower-tier models. I’m doubtful that the LLM enthusiast/tinkerer market is large enough for that to be obviously profitable.
That would depend on how they designed the memory controllers. GDDR6 only supporting 1-2gb modules at present (I believe GDDR6W supports 4gb modules). If they were using 12 1gb modules, then increasing to 24gb shouldn't be a very large change.
Honestly, Apple seems to be on the right track here. DDR5 is slower than GDDR6, but you can scale the amount of RAM far higher simply by swapping out the density.
It's a 192 bit interface, so 6 16gbit chips.
Of course you can just add more RAM. Double the capacity of every chip and you get twice the RAM without ever asking an engineer.
People did it with the RTX3070. https://www.tomshardware.com/news/3070-16gb-mod
Can you find me a 32Gbit GDDR6 chip?
give me 48gb with reasonable power consumption so I can dev locally and I will buy it in a heartbeat. Anyone that is fine-tuning would want a setup like that to test things before pushing to real GPUs. And in reality if you can fine-tune on a card like that in two days instead of a few hours it would totally be worth it.
> give me 48gb with reasonable power consumption so I can dev locally and I will buy it in a heartbeat
https://a.co/d/1LMNatf
I would love to, but you can't just add the chips, you need the bus too.
The bigger point here is to ask why they aren't designing that in from the start. Same with AMD. RAM has been stalled and is critical. Start focusing on allowing a lot more of it, even at the cost of performance, and you have a real product. I have a 12GB 3060 as my dev box and the big limiter for it is RAM, not cuda cores. If it had 48GB but the same number of cores then I would be very happy with it, especially if it was power efficient.
Because designing a low end GPU with a very wide memory interface isn't useful for gaming, and that is where the vast majority of non-datacenter discrete GPU sales are right now.
These are $200 low end cards, the B5X0 cards. Presumably they have B7X0 and perhaps even B9X0 cards in the pipeline as well.
There has been no hint or evidence (beyond hope) Intel will add a 900 class this generation.
B770 was rumoured to match the 16 GB of the A770 (and to be the top end offering for Battlemage) but it is said to not have even been taped out yet with rumour it may end up having been cancelled completely.
I.e. don't hold your breath for anything consumer from Intel this generation better for AI than the A770 you could have bought 2 years ago. Even if something slightly better is coming at all, there is no hint it will be soon.
> These are $200 low end cards
Hm, I wouldn't consider $200 low end.
There isn't a cheaper card that's worth buying over using the iGPU you already have so yeah, that's the low end.
NVIDIA and AMD don't even make new GPU silicon this low-end. Their smallest current-gen GPUs all debuted at higher price points, though the Radeon RX 7600 is now available at the same price that the B580 is launching at.
I was watching a few of the preliminary commentaries on the Battlemage cards, e.g. Linus Tech Tips and so on, and they said the same thing "focus on the low/mid-end". My conclusion is that what they are talking about is "low end, brand new, current gen. graphics card". It's a very special type of low end.
The Intel cards are getting more interesting for me as I'm questioning my continued use of macOS. Intels focus on Linux support makes their options really interesting, though I don't see a need for something as powerful as these new cards.
It is when the high end is $1000 or more.
> 12GB max is a non-starter for ML work now.
Can you even do ML work with a GPU not compatible with CUDA? (genuine question)
A quick search showed me the equivalence to CUDA in the Intel world is oneAPI, but in practice, are the major Python libraries used for ML compatible with oneAPI? (Was also gonna ask if oneAPI can run inside Docker but apparently it does [1])
[1] https://hub.docker.com/r/intel/oneapi
There is ROCm and Vulkan compute.
Vulkan is especially appealing because you don't need any special GPGPU drivers and it runs on any card which supports Vulkan.
https://github.com/intel/intel-extension-for-pytorch
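That extension exposes Intel GPUs to PyTorch as an "xpu" device; the flow is roughly the following (a sketch based on the project's documented usage, not tested here):

    # Rough sketch: running PyTorch on an Intel Arc GPU via intel-extension-for-pytorch.
    import torch
    import intel_extension_for_pytorch as ipex  # registers the "xpu" device

    model = torch.nn.Linear(1024, 1024).eval().to("xpu")
    model = ipex.optimize(model)  # optional kernel/layout optimizations

    x = torch.randn(8, 1024, device="xpu")
    with torch.no_grad():
        y = model(x)
    print(y.shape, y.device)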
These are the entry-level cards; I imagine the coming higher-end variants will have the option of much more RAM.
I was wondering the same thing. Seems crazy to keep pumping out 12gb cards in 2025.
I still don't understand why graphics cards haven't evolved to include sodimm slots so that the vram can be upgraded by the end user. At this point memory requirements vary so much from gamer to scientist so it would make more sense to offer compute packages with user-supplied memory.
tl;dr GPUs need to transition from being add-in cards to being a sibling motherboard. A sisterboard? Not a daughter board.
One of the reasons GPUs can have multiples of CPU bandwidth is they avoid the difficulties of pluggable dimms - direct soldered can have much higher frequencies at lower power.
It's one of the reasons why ARM Macbooks get great performance/watt, memory being even "closer" than mainboard soldered RAM so getting more of those benefits, though naturally less flexibility.
This makes sense. Do we need to begin exploring different socket configurations? IE, using something that more closely resembles a CPU socket versus a traditional RAM slot.
Even DDR5 has this problem. Go look at what soldered DDR5 can do frequency-wise compared to DIMMs. It's one of the problems the new CAMM form factor aims to help solve, making it tractable to push the memory frequency beyond what DIMMs can currently get you.
I have always wondered: would it be possible to put memory on the back side of the motherboard to get it closer to the CPU? And if it is, would it solve anything else than RAM clearance for CPU coolers?
GDDR does not exist in sodimm form factor.
Intel and AMD internal GPUs can use normal computer RAM. But they are slower for that reason and many others.
Who cares?
Yes exactly!!
This is not an ML card... this is a gaming card... Why are you people like this?
> Am I missing something here?
This is a graphics card.
Sir, this is a Wendy's
I put an A360 card into an old machine I turned into a Plex server. It turned it into a transcoding powerhouse. I can do multiple independent streams now without it skipping a beat. The price-performance ratio was off the chart.
Intel has been a beast at transcoding for years, it’s a relatively niche application though.
My 7950X3D's integrated GPU does 4k HDR (33Mb/s) to 1080p at 40fps (Proxmox, Jellyfin). If these GPUs supported SR-IOV I would grab one for transcoding and GPU-accelerated remote desktop.
Untouched video (star wars 8) 4k HDR (60Mb/s) to 1080p at 28fps
All first gen arc gpus share the same video encoder/decoder, including the sub-$100 A310, that can handle four (I haven't tested more than two) simultaneous 4k HDR -> 1080p AV1 transcodes at high bitrate with tone mapping while using 12-15W of power.
No SR-IOV.
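For anyone curious what that pipeline looks like in practice, a single transcode is roughly this (a sketch wrapping ffmpeg's QSV path; it assumes an ffmpeg build with QSV/oneVPL support, uses placeholder filenames, and leaves out HDR tone mapping, which needs an extra filter):

    # Sketch: hardware 4K -> 1080p AV1 transcode on an Arc card through ffmpeg's QSV path.
    import subprocess

    cmd = [
        "ffmpeg",
        "-hwaccel", "qsv",                 # decode on the GPU
        "-i", "input_4k_hdr.mkv",          # placeholder input file
        "-vf", "scale_qsv=w=1920:h=1080",  # GPU-side downscale
        "-c:v", "av1_qsv",                 # Arc's AV1 hardware encoder
        "-b:v", "6M",
        "-c:a", "copy",
        "output_1080p_av1.mkv",
    ]
    subprocess.run(cmd, check=True)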
Any idea how that compares to Apple Silicon for that job? I bought the $599 MacBook Air with M1 as my plex server for this reason. Transcodes 4k HEVC and doesn’t even need a fan. Sips watts.
Apple Silicon still doesn't support AV1 encoding, but it's good enough for a simple Jellyfin server; I'm using one myself.
All Intel Arc cards, even the $99 A310, have hardware-accelerated H.265 and AV1 encoding.
Apple's hardware encode/decode for AV1 is quite literally, shit.
Good to know. I'm still waiting for UNRaid 7.0 for proper Arc support to pull the trigger on one.
How's the Linux compatibility? I was tempted to do the same for my CentOS Stream Plex box.
Amazing. It is the first time I have plugged any gpu into my linux box and have it just work. I am never going back to anything else. My main computer uses an a750, and my jellyfin server uses an a310.
No issues with linux. The server did not like the a310, but that is because it is an old dell t430 and it is unsupported hardware. The only thing I had to do was to tweak the fan curve so that it stopped going full tilt.
Interesting application. Was this a machine lacking an iGPU, or does the Intel GPU-on-a-stick have more quicksync power than the iGPU?
A not inconsequential possibility is that both the iGPU and dGPU are sharing the transcoding workload, rather than the dGPU replacing the iGPU. It's a fairly forgotten feature of Intel Arc, but I don't blame anyone because the help articles are dusty to say the least.
Who is the target audience for this?
Well informed gamers know Intel's discrete GPU is hanging by a thread, so they're not hopping on that bandwagon.
Too small for ML.
The only people really happy seem to be the ones buying it for transcoding and I can't imagine there is a huge market of people going "I need to go buy a card for AV1 encoding".
Intel has earned a lot of credit in the Linux space.
Nvidia is trash tier in terms of support and only recently making serious steps to actually support the platform.
AMD went all in nearly a decade ago and it's working pretty well for them. They are mostly caught up to being Intel grade support in the kernel.
Meanwhile, Intel has been doing this since I was in college. I was running the i915 driver in Ubuntu 20 years ago. Sure their chips are super low power stuff, but what you can do with them and the level of software support you get is unmatched. Years before these other vendors were taking the platform seriously Intel was supporting and funding Mesa development.
The AMD driver has been great on my Framework 13, but the 6.10 series was completely busted. 6.11 worked fine. I can't remember a series where any of my Intel laptops didn't work for that long.
This is repeated often, but I have had very good support from Nvidia on Linux over the years. AMD on the other hand gives lousy support. File a bug report about a problem and expect to be ignored, especially if it has anything to do with emulation. Intel’s Linux support on the other hand has been very good for me too.
If it works well on Linux there's a market for that. AMD are hinting that they will be focusing on iGPUs going forward (all power to them, their iGPUs are unmatched and NVIDIA is dominating dGPU). Intel might be the savior we need. Well, Intel and possibly NVK.
Had this been available a few weeks ago I would have gone through the pain of early adoption. Sadly it wasn't just an upgrade build for me, so I didn't have the luxury of waiting.
AMD has some great iGPUs but it seems like they're still planning to compete in the dGPU space just not at the high end of the market.
> Too small for ML.
What do you mean by this - I assume you mean too small for SoTA LLMs? There are many ML applications where 12GB is more than enough.
Even w.r.t. LLMs, not everyone requires the latest & biggest LLM models. Some "small", distilled and/or quantized LLMs are perfectly usable with <24GB
If you're aiming for usable, then sure that works. The gains in model ability from doubling size is quite noticable at that scale though.
Still...tangibly cheaper than even a 2nd hand 3090 so there is perhaps a market for it
>Well informed gamers know Intel's discrete GPU is hanging by a thread, so they're not hopping on that bandwagon.
If Intel's stats are anything to go by, League runs way better than it did on the last generation, and it's the only game still running on DX9 that had issues on last-gen cards. CS:GO was another notable one, but CS2 has launched since and the game has moved to DX12/VK. This was, literally, the biggest issue they had; drivers were also wonky, but they seem to have ironed that out as well.
I'm using an Intel card right now. With Wayland. It just works.
Ubuntu 24.04 couldn't even boot to a tty with the Nvidia Quadro thing that came with this major-brand PC workstation, still under warranty.
> Intel's discrete GPU is hanging by a thread, so they're not hopping on that bandwagon
Why would that matter? You buy one GPU, in a few years you buy another GPU. It's not a life decision.
>Why would that matter?
The game devs are going to spend all their time & effort targeting amd/nvidia. Custom code paths etc.
It's not a one size fits all world. OpenCL etc abstraction are good at covering up differences, but not that good. So if you're the player with <10% market share you're going to have an uphill battle to just be on par.
> The game devs are going to spend all their time & effort targeting amd/nvidia. Custom code paths etc.
From my experience they target NVidia and Consoles. AMD might get a look at the code just before release if they notice any big problems.
I'd be surprised if many Gamedevs even pick up the phone for an Intel GPU developer.
Consoles use AMD GPUs for years now.
Cheap gaming rigs.
They do well compared to AMD/Nvidia at that price point.
Is it a market worth chasing at all?
Doubt.
It's for the low end gaming market which Nvidia and AMD have been neglecting for years.
all-in-1 machines.
Intel's customers are third-party PC assemblers like Dell & HP. Many corporate bulk buyers only care whether the 1-2 apps they use are supported. The lack of wider support isn't a concern.
something like 60% of the GPUs by volume on Steam are on the mid-to-low end. like a non-trivial number still running 750 Ti's. there are a lot of gamers, and most aren't well-off tech bros.
there is a niche for this, though it remains to be seen if it'll be profitable enough for a large org like Intel
If you go on the intel arc subreddit people are hyped about intel GPUs. Not sure what the price is but the previous gen was cheap and the extra competition is welcomed
In particular, intel just needs to support vfio and it’ll be huge for homelabs.
It's cheap, plenty of market when the others have forgotten the segment.
12GB memory
-.-
I feel like _anyone_ who can pump out GPUs with 24GB+ of memory that are usable for py-stuff would benefit greatly.
Even if it's not as performant as the NVIDIA options - just to be able to get the models to run, at whatever speed.
They would fly off the shelves.
Would it though? How many people are running inference at home? Outside of enthusiasts I don't know anyone. Even companies don't self-host models and prefer to use APIs. Not that I wouldn't like a consumer GPU with tons of VRAM, but I think that the market for it is quite small for companies to invest building it. If you bother to look at Steam's hardware stats you'll notice that only a small percentage is using high-end cards.
This is the weird part, I saw the same comments in other threads. People keep saying how everyone yearns for local LLMs… but other than hardcore enthusiasts it just sounds like a bad investment? Like it’s a smaller market than gaming GPUs. And by the time anyone runs them locally, you’ll have bigger/better models and GPUs coming out, so you won’t even be able to make use of them. Maybe the whole “indoctrinate users to be a part of Intel ecosystem, so when they go work for big companies they would vouch for it” would have merit… if others weren’t innovating and making their products better (like NVIDIA).
The future of local LLMs is not people running it on their PCs.
It's going to be a HomePod/AppleTV/Echo/Google Home -style box you set up in a corner and forget about it.
Then your devices in the ecosystem can offload some LLM tasks to that local system for inference, without having to do everything on-device.
This makes sense in some ways technologically, but just having a "centralized compute box" seems like a lot more complexity than many/most would want in their homes.
I mean, everything could have been already working that way for a lot of years right? One big shared compute box in your house and everything else is a dumb screen? But few people roll that way, even nerds, so I don't see that becoming a thing for offloaded AI compute.
I also think that the future of consumer AI is going to be models trained/refined on your own data and habits, not just a box in your basement running stock ollama models. So I have some latency/bandwidth/storage/privacy questions when it comes to wirelessly and transparently offloading it to a magic AI box that sits next to my wireless router or w/e, versus running those same tasks on-device. To say nothing of consumer appetite for AI stuff that only works (or only works best) when you're on your home network.
Intel sold their GPUs at negative margin which is part of why the stock fell off a cliff. If they could double the vram they could raise the price into the green even selling thousands, likely closer to 100k, would be far better than what they're doing now. The problem is Intel is run by incompetent people who guard their market segments as tribal fiefs instead of solving for the customer.
> which is part of why the stock fell off a cliff
Was it? Their GPUs sales were insignificantly low so I doubt that had a huge effect on their net income.
They spent billions at TSMC making Alchemist dies that sat in a warehouse for a year or two as they tried to fix the drivers.
that's a dumb management "cart before the horse" problem. I understand a few bugs in the driver but they really should have gotten the driver working decently well before production. Would have even given them more time tweaking the GPU. This is exactly why Intel is failing and will continue to fail with that type of management
Intel management is just brain dead. They could have sold the cards for mining when there was a massive GPU shortage and called it the developer edition but no. It's hard to develop a driver for games when you have no silicon.
By subsidizing it more they'll lose less money?
Increasing VRAM would differentiate intel GPUs and allow driving higher ASPs, into the green.
I think you're massively underestimating the development cost, of the number of people who would actually purchase a higher vram card at a higher price.
You'd need hundreds of thousands of units to really make much of a difference.
Well, IIUC it's a bit more "having more than 12GB of RAM and raising the price will let it run bigger LLMs on consumer hardware and that'll drive premium-ness / market share / revenue, without subsidizing the price"
I don't know where this idea is coming from, although it's all over these threads.
For context, I write a local LLM inference engine and have 0 idea why this would shift anyone's purchase intent. The models big enough to need more than 12GB VRAM are also slow enough on consumer GPUs that they'd be absurd to run. Like less than 2 tkns/s. And I have 64 GB of M2 Max VRAM and a 24 GB 3090ti.
The enthusiast/prosumer/etc. market is generally still highly influential in most markets even if the revenue is limited. E.g. if hobbyists/students/developers start using Intel GPUs, in a few years the enterprise market might become much less averse to buying Intel's datacenter chips.
So I would say that Intel's potential market is "everybody who is currently buying nVidia GPUs for compute."
nVidia's stingy consumer RAM choices also seem to be a fairly transparent ploy to create a protective moat around their insanely-high-profit-margin datacenter GPUs. So that just seems like kind of an obvious thing for Intel or AMD to consider tackling.
(Although, it has to be said, a lot of commenters have pointed out that it's not as easy as just slapping more RAM chips onto the GPU boards; you need wider data busses as well etc.)
It's a chicken and egg scenario. The main problem with running inference at home is the lack of hardware. If the hardware was there more people would do it. And it's not a problem if "enthusiasts" are the only ones using it because that's to be expected at this stage of the tech cycle. If the market is small just charge more, the enthusiasts will pay it. Once more enthusiasts are running inference at home, then the late adopters will eventually come along.
Mac minis are great for this. They're cheap-ish and they can run quite large models at a decent speed if you run it with an MLX backend.
mini _Pro_ are great for this, ones with large RAM upgrades.
If you get the base 16GB mini, it will have more or less the same VRAM but way worse performance than an Arc.
If you already have a PC, it makes sense to go for the cheapest 12GB card instead of a base mac mini.
100% - this could be Intel's ticket to capture the hearts of developers and then everything else that flows downstream. They have nothing to lose here -- just do it Intel!
They could lose a lot of money?
They already do... google $INTC and stare in disbelief at the "Financials" on the right side.
At some point they should make a stand, that's the whole meta-topic of this thread.
Sorry, you're right, they could lose a lot more money.
You can get that on mac mini and it will probably cost you less than equivalent PC setup. Should also perform better than low end Intel GPU and be better supported. Will use less power as well.
You can just use a CPU in that case, no? You can run most ML inference on vectorized operations on modern CPUs at a fraction of the price.
My 7800x says not really. Compared to my 3070 it feels so incredibly slow that gets in the way of productivity.
Specifically, waiting ~2 seconds vs ~20 for a code snippet is much more detrimental to my productivity than the time difference would suggest. In ~2 seconds I don't get distracted, in ~20 seconds my mind starts wandering and then I have to spend time refocusing.
Make a GPU that is 50% slower than a two-generations-older mid-range GPU (in tokens/s) but runs bigger models, and I would gladly shell out $1000+.
So much so that I am considering getting a 5090 if Nvidia actually fixes the connector mess they made with the 4090s, or even a used V100.
I'm running a codeseeker 13B model on my MacBook with no perf issues, and I get a response within a few seconds.
Running a specialist model makes more sense on small devices.
I don't understand, make it slower so it's faster?
My 2080Ti at half speed would still beat the crap out of my 5900X CPU for inference, as long as the model fits in VRAM.
I think that's what GP was alluding to.
Maybe that's not too bad for someone who wants to use pre-existing models. Their AI Playground examples require at minimum an Intel Core Ultra H CPU, which is quite low-powered compared to even these dedicated GPUs: https://github.com/intel/AI-Playground
I don't know a single person in real life that has any desire to run local LLMs. Even amongst my colleagues and tech friends, not very many use LLMs period. It's still very niche outside AI enthusiasts. GPT is better than anything I can run locally anyway. It's not as popular as you think it is.
I run a 12GB model on my 3060 and use it to help answer healthcare questions. I'm currently doing a medical residency. (No I don't use it to diagnose). It helps comply with any HIPAA style regulations. I sometimes use it to fix up my emails. Not sure why people are longing for a 128GB card, just download a quantized model and run with LM Studio (https://lmstudio.ai/). At least two of my colleagues are using ChatGPT on a regular basis. LLMs are being used in the ER department. LLMs and speech models are being used in psychiatry visits.
A 128GB card could run Llama 3.1 70B at FP8. It is a huge increase in quality over what the current 24GB cards can do.
The only consumer demand for local AI models is for generating pornography
How about running your intelligent home with a voice assistant on your own computer? In privacy-oriented countries (Germany) that would be massive.
This is what I'm fiddling with. My 2080Ti is not quite enough to make it viable. I find the small models fail too often, so need larger Whisper and LLM models.
Like the 4060 Ti would have been a nice fit if it hadn't been for the narrow memory bus, which makes it slower than my 2080 Ti for LLM inference.
A more expensive card has the downside of not being cheap enough to justify idling in my server, and my gaming card is at times busy gaming.
absolutely wrong -- if you're not clever enough to think of any other reason to run an LLM locally then don't condemn the rest of the world to "well they're just using it for porno!"
so you're saying that a huge market?!
I want local copilot. I would pay for this.
>Battlemage is still treated to fully open-source graphics driver support on Linux.
I am hoping these are open in such a manner that they can be used in OpenBSD. Right now I avoid all hardware with a Nvidia GPU. That makes for somewhat slim pickings.
If the firmware is acceptable to the OpenBSD folks, then I will happily use these.
They are promising good Linux support, which kind of implies, at least, that everything but opaque blobs are open.
For me, the most important feature is Linux support. Even if I'm not a gamer, I might want to use the GPU for compute and buggy proprietary drivers are much more than just an inconvenience.
Sure, but open drivers have been AMDs selling point for a decade, and even nVidia is finally showing signs of opening up. So it's bit dubious if these new Intels really can compete on this front, at least for very long.
I welcome a new competitor. Sucks to really only have one valid option on Linux atm. My 6600 is a little long in the tooth. I only have it because it is dead silent and runs a 5K display without issue - but I would definitely like to upgrade it for something that can hold its own with ML.
> Sucks to really only have one valid option on Linux atm.
I don't think that's a super fair shake? Intel iGPUs have been around for a while if you had a laptop chip or iGPU-enabled desktop chip. They've supported Linux just fine for ages, and will fill any non-3D application you might have.
And Nvidia chips are quite good on Linux nowadays - Wayland has been very usable since the 535-series drivers and nearly flawless since 550. You're right to be apprehensive about proprietary GPU hardware but I think there are plenty of options on the table right now.
The iGPU in my 13900K cannot run my 5K display with decent performance (in a desktop environment). I chalked it up to hardware issues but it could be drivers. I am on Debian Linux.
I wonder how many transistors it has and what the chip size it is.
For power, it's 190W compared to 4060's 115 W.
EDIT: from [1]: B580 has 21.7 billion transistors at 406 mm² die area, compared to 4060's 18.9 billion and 146 mm². That's a big die.
[1] https://www.techpowerup.com/gpu-specs/arc-b580.c4244
Those numbers are identical to the A770, and don't match the numbers from the preview[0], so I think that's a copy paste error.
If we use the numbers from the preview:
In terms of performance per die area it's a big improvement over A770 but still far behind Nvidia. It's interesting that the transistor density is so much lower than the 4060 despite having the same (or at least similar) process node. Speculating about why that may be:
- Nvidia has better layout.
- Intel is using higher performance (larger) transistor libraries or layout in order to hit the higher boost frequencies (2800 vs 2460).
- Wider bus interface takes up more space.
- The B580 may have 1 render slice and 64-bits of memory bus disabled, and they're not including those transistors in the count, but they still take up area.
[0]: https://www.techpowerup.com/review/intel-arc-b580-battlemage...
Those numbers suggest that they have caught Nvidia in performance per transistor. As for the die area being larger, I suspect that the larger memory bus might be partly responsible. The transistors used for IO stopped shrinking on new nodes a while ago, so they use plenty of die area.
> Both the Arc B580 and B570 are based on the "BMG-G21" a new monolithic silicon built on the TSMC 5 nm EUV process node. The silicon has a die-area of 272 mm², and a transistor count of 19.6 billion
https://www.techpowerup.com/review/intel-arc-b580-battlemage...
These numbers seem a bit more believable.
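Using those figures, the density gap is easy to quantify (a quick sketch; the 4060 numbers are the ones quoted earlier in the thread):

    # Transistor density from the figures quoted above (millions of transistors per mm^2).
    chips = {
        "Arc B580 (BMG-G21)": (19.6e9, 272),  # transistors, die area in mm^2
        "RTX 4060 (AD107)": (18.9e9, 146),
    }
    for name, (transistors, area_mm2) in chips.items():
        print(f"{name}: {transistors / area_mm2 / 1e6:.0f} M transistors/mm^2")
    # Roughly 72 vs 129 M/mm^2 on similar 5 nm-class nodes.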
Official page: https://www.intel.com/content/www/us/en/products/docs/discre...
I love my a750. Works fantastic out of the box in Linux. He encoding and decoding for every format I use. Flawless support for different screens.
I haven't regretted the purchase at all.
I wonder what happened to my brain writing the above. There is a "He" in there that makes no sense. Flawless support for "different" screens? I of course mean "many screens".
I'm really curious to see if these still rely heavily on resizable BAR. Putting these in old computers in linux without reBAR support makes the driver crash with literally any load rendering the cards completely unusable.
It's a real shame, the single slot a380 is a great performance for price light gaming and general use card for small machines.
Yes, it even says so as a requirement on the box.
What I'm also interested in is whether they'll require a PCIE 4.0 motherboard or if a 3.0 x16 slot is fine.
Also whether their software properly supports VR now...
Yes, the slide deck mentions reBAR is still required.
What is the newest platform that lacks resizable BAR? It was standardized in 2006. Is 4060-level graphics performance useful in whatever old computer has that problem?
The newest platform is probably POWER10. ReBar is not supported on any POWER platform, most likely including the upcoming POWER11.
Also, I don't think you'll find many mainboards from 2006 supporting it. It may have been standardized in 2006, but a quick online search leads me to think that even on x86 mainboards it didn't become commonly available until at least 2020.
If the firmware is open source, perhaps it could be retrofitted. For amd64 motherboards that do not have native UEFI support, retrofits are possible through this:
https://github.com/xCuri0/ReBarUEFI
Congrats on a pretty niche reply. I wonder if literally anyone has tried to put an ARC dGPU in a POWER system. Maybe someone from Libre-SOC will chime in.
do you have a reference for power rebar support? just curious, I couldn't find anything with a quick look
Oh no... my POWER gaming rig... no..
Ryzen 2000 series processors don't support AMD's "Smart Access Memory" which is pretty much resizable BAR. That's 2018.
Coffee Lake also didn't really support ReBAR either, also 2018.
Sandy Bridge (2011) is still a very usable CPU with a modern GPU. In theory Sandy Bridge supported Resizable BAR, but in practice boards didn't; I think the problem was the BIOSes.
ReBARUEFI was enough to get it working on an ASUS P8P67-something something when I tried it in January-ish.
[1] https://github.com/xCuri0/ReBarUEFI
On paper, any PCIe 2.0 motherboard can receive a BIOS update adding ReBAR support (it arrived with PCIe 2.1), but the reality is that you pretty much have to get a PCIe 3.0 motherboard to have any chance of having it or modding it in yourself.
Another issue is that not every GPU actually supports ReBAR, I'm reasonably certain the Nvidia drivers turn it off for some titles, and pretty much the only vendor that reliably wants ReBAR on at all times is Intel Arc.
I also personally wouldn't say that Sandy Bridge is very usable with a modern GPU without also specifying what kind of CPU or GPU. Or context in how it's being used.
My old Ice Lake CPU was very much a bottleneck in lots of games in 2018 when I finally replaced it. It was a noticeable improvement across the board making the jump to a Zen+ CPU at the time, even with the same GPU.
Oh wow. That's older than I thought. This is definitely less of an issue than folks make out of it.
I cling onto my old hardware to limit e-waste where I can, but I still gave up on my old Sandy Bridge machine once it hit about a decade old. Not only would the CPU have trouble keeping up, it's mostly only PCIe 2.0 (a few boards had 3.0). You wouldn't get the full potential out of even the cheapest one of these Intel cards. If you are putting a GPU in a system like that, I can't imagine buying new anyway; just get something used off eBay.
There were a lot of generations after Sandy Bridge which didn't have it; Sandy Bridge was just one generation that didn't really support it on the consumer side.
Consumer boards and CPUs didn't really support it well until after 2018. I upgraded away from a Zen+ system because it didn't support it.
While "standardized" many implementations were so buggy to be unusable - we needed >4gb pcie mappings for a development board, and finding motherboards that actually worked was a PITA well into 2012 (when I left that project).
ReBAR was standardized in 2006 but consumer motherboards didn't start shipping with an option to enable it until much later, and didn't start turning it on by default until a few years ago.
I wanted to have alternative choices than Nvidia for high power GPUs. Then the more I thought about it, the more it made sense to rent cloud services for AI/ML workloads and lesser powered ones for gaming. The only use cases I could come up with for wanting high-end cards are 4k gaming (a luxury I can't justify for infrequent use) or for PC VR which may still be valid if/when a decent OLED (or mini-OLED) headset is available--the Sony PSVR2 with PC adapter is pretty close. The Bigscreen Beyond is also a milestone/benchmark.
Don't rent a GPU for gaming, unless you're doing something like a full-on game streaming service. +10ms isn't much for some games, but would be noticeable on plenty.
IMO you want those frames getting rendered as close to the monitor as possible, and you'd probably have a better time with lower fidelity graphics rendered locally. You'd also get to keep gaming during a network outage.
Absolutely. By "and lesser powered ones for gaming" I meant purchase.
I don't even think network latency is the real problem, it's all the buffering needed to encode a game's output to a video stream and keep it v-synced with a network-attached display.
I've tried game streaming under the best possible conditions (<1ms network latency) and it still feels a little off. Especially shooters and 2D platformers.
Yeah - there's no way to play something like Overwatch/Fortnite on a streaming service and have a good time. The only things that seem to be OK are turn-based games or platformers.
Which video card are you using for PSVR?
I haven't decided/pulled-the-trigger but the Intel ARC series are giving the AMD parts a good run for the money.
The only concern is how well the new Intel drivers (full support for DX12) work with older titles; support for DX11, DX10, and some DX9 titles (via emulation) is continuously being improved.
There's likely some deep discounting of Intel cards because of how bad the drivers were at launch and the prices may not stay so low once things are working much better.
Given Intel's recent troubles, I'm trying to decide how risky it is to invest in their platform. Especially discrete GPUs for Linux gaming
Fortunately, having their Linux drivers be (mostly?) open source makes a purchase seem less risky.
Intel isn't going anywhere for at least a couple of hardware generations. Buying a GPU is also not "investing" in anything. In 2 years' time you can replace it with whatever is best value for money at that time.
> Buying a GPU is also not "investing" in anything.
It is in the (minor) sense that I'd rely on Intel for warranty support, driver updates (if closed source), and firmware fixes.
But I agree with your main point that the worst-case downside isn't that big of a deal.
There's no way you're going to maintain and develop the intel linux driver as a solo dev.
> There's no way you're going to maintain and develop the intel linux driver as a solo dev.
I agree entirely.
My point was that even if Intel disappeared tomorrow, there's a good chance that Linux developer community would take over maintenance of those drivers.
In contrast to, e.g., 10-years-ago nvidia, where IIUC it was very difficult for outsiders to obtain the documentation needed to write proper drivers for their GPUs.
I can't speak from experience with their GPUs on Linux, but I know on Windows most of their problems stem from supporting pre-DX12 Direct3D titles. Nvidia and AMD have spent many years polishing up their Direct3D support and putting in driver-side hacks that paper over badly programmed Direct3D games.
These are obviously Windows-specific issues that don't come up at all in Linux, where all that Direct3D headache is taken care of by DXVK. Amusingly a big part of Intel's efforts to improve D3D performance on Windows has been to use DXVK for many titles.
SR-IOV is supported on their iGPUs, and outside of that it's exclusive to their enterprise offering. Give it to me on desktop and I'll buy.
Intel is allergic to competition.
There is no competition. People get old workstation/server cards
Intel has a huge problem with business units not competing with each other.
Do they still lock out ECC support on Core processors? I'm still running an ancient i3 for ECC support in my home server.
ECC is supported on a random subset of processors (one of each performance tier) but the motherboard still costs $500.
ECC requires workstation/server motherboards
"Arc B"?
Presumably graphics cards optimised for hairdressers and telephone sanitisers?
It took me sooooo long to recognize this reference.
Anyone using Intel graphics cards? Aside from specs, drivers and support can make or break the value prop of a gfx card. Would be curious what actually using these is like.
I use an A770 LE for PC gaming. Windows drivers have improved substantially in the last two years. There's a driver update every month or so, although the Intel Arc control GUI hasn't improved in a while. Popular newer titles have generally run well; I've played some Metaphor, Final Fantasy 16, Elden Ring, Spider-Man Remastered, Horizon Zero Dawn, Overwatch, Jedi Survivor, Forza Horizon 4, Monster Hunter Sunbreak, etc. without major issues. Older games sometimes struggle; a 6 year old Need for Speed doesn't display terrain, some 10+ year old indie games crash. Usually fixed by dropping dxvk.dll in the game directory. This fix cannot be used with older Windows Store games. One problematic newer title was Starfield, which at launch had massive frame pacing and hard crashing issues exclusive to Intel Arc.
I've had a small sound latency issue forever; most visible with YouTube videos, the first half-second of every video is silent.
I picked this card up for about $120 less than the RTX 4060. Wasn't a terrible decision.
Appreciate the detail! I've always found the xx60 class of Nvidia cards to be the entry point for me when I'm putting together a gaming rig so might investigate Intel next time I'm putting a build together.
I put an Arc card in my daughter's machine last month. Seems to work fine.
What OS?
Windows 11
What about sharing a GPU across multiple VMs? Hasn't Nvidia walled this feature behind unreasonably high-priced products?
I'm not a gamer and there is not enough memory in this thing for me to care to use it for AI applications so that leaves just one thing I care about: hardware accelerated video encoding and decoding. Let's see some performance metrics both in speed and visual quality
From what I have gathered, Alchemist's AV1 encoder is about the same or sliiiightly worse than current NVENC. My A750 does about 1400 fps for DVD encoding on the quality preset. I haven't had the opportunity to try 1080p or 4K though.
Bit disappointed there's no 16 gig (or more) version. But absolutely thrilled the rumours of Intel discrete graphics' demise were wildly exaggerated (looking at you, Moore's Law is Dead...).
Very happy with my A770. Godsend for people like me who want plenty of VRAM to play with neural nets, but don't have the money for workstation GPUs or massively overpriced Nvidia flagships. Works painlessly with Linux, gaming performance is fine, and the price was the first time I haven't felt fleeced buying a GPU in many years. Not having CUDA does lead to some friction, but I think Nvidia's CUDA moat is a temporary situation.
Prolly sit this one out unless they release another SKU with 16G or more ram. But if Intel survives long enough to release Celestial, I'll happily buy one.
have you tested llama.cpp with this card on Linux? when i tested about a year ago, it was a nightmare.
A few months ago, yeah. Had to set an environment variable(added to the ollama systemd unit file), but otherwise it worked just fine.
Seems to feature ray tracing (kind of obvious), but also upscaling.
My experience on WH40K DT has taught me that upscaling is absolutely vital for a reasonable experience on some games.
> upscaling is absolutely vital for a reasonable experience on some games
This strikes me as a bit of a sad state of affairs. We've moved beyond a Parkinson's law of computational resources (usage by games expands to fill the available resources) to usage expanding to fill the resources available on the highest-end machines, which cost a few thousand dollars... and then using that to train a model that simulates, via upscaling, higher quality or performance on lower-end machines.
A counterargument would be that this makes high-end experiences available to more people, and while that may be true in the individual case, I don't buy that that's where the incentives it creates are driving the entire industry.
To put a finer point on it: at what percentage of budget is too much money being spent on producing assets?
Isn't it insane to think that rendering triangles for the visuals in games has gotten so demanding that we need an artificially intelligent system embedded in our graphics cards to paint pixels that look like high definition geometry?
What a time to be alive. Our most advanced technology is used to cheat on homework and play video games.
> Isn't it insane to think that rendering triangles for the visuals in games has gotten so demanding that we need an artificially intelligent system embedded in our graphics cards to paint pixels that look like high definition geometry?
That's not _quite_ how temporal upscaling works in practice. It's more of a blend between existing pixels, not generating entire pixels from scratch.
The technique has existed since before ML upscalers became common. It's just turned out that ML is really good at determining how much to blend by each frame, compared to hand written and tweaked per-game heuristics.
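As a toy illustration of that blending (not any vendor's actual algorithm): TAA-style accumulation is essentially an exponential moving average per pixel, and the per-pixel blend weight is the quantity the ML upscalers learn to predict. A minimal sketch:

    import numpy as np

    def temporal_blend(history, current, alpha):
        # Mix the reprojected history with this frame's raw samples.
        # alpha near 0 trusts history (stable, blurry); near 1 trusts the new frame.
        return (1.0 - alpha) * history + alpha * current

    history = np.zeros((4, 4), dtype=np.float32)   # accumulated signal so far
    current = np.ones((4, 4), dtype=np.float32)    # this frame's (noisy) samples
    for _ in range(10):
        history = temporal_blend(history, current, alpha=0.1)
    print(history[0, 0])   # converges toward the current signal over several frames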
---
For some history, DLSS 1 _did_ try to generate pixels entirely from scratch each frame. Needless to say, the quality was crap, and that was after a very expensive and time-consuming process to train the model for each individual game (and forget about using it as you develop the game; imagine having to retrain the AI model as you implement the graphics).
DLSS 2 moved to having the model predict blend weights fed into an existing TAAU pipeline, which is much more generalizable and has way better quality.
It is. And it strikes me as evidence we've lost the plot and a measure has ceased to be a good measure upon being a target.
It used to be that more computational power was desirable because it would allow for developers to more fully realize creative visions that weren't previously possible.
Now, it seems that the goal is simply visual fidelity and asset complexity... and the rest of the experience is not only secondary, but compromised in pursuit of the former.
Thinking back on recent games that felt like something new and painstakingly crafted... they're almost all 2D (or look like it), lean on excellent art/music (and even haptics!) direction, have a well-crafted core gameplay loop or set of systems, and have relatively low actual system requirements (which in turn means they are exceptionally smooth without any AI tricks).
Off the top of my head from the last few years: Hades, Balatro, Animal Well, Cruelty Squad[0], Spelunky, Pizza Tower, Papers Please, etc. Most of these could just as easily have been made a decade ago.
That's not to say we haven't had many games that are gorgeous and fun. But while the latter is necessary and sufficient, the former is neither.
It's just icing: it doesn't matter if the cake tastes like crap.
[0] a mission statement if there ever was one for how much fun something can be while not just being ugly but being actively antagonistic to the senses and any notion of good taste.
Probably would jump to Intel once my 3060 gets too old
You have a long time to go, unless you want to play the latest and greatest AAA games, but that will not be your card's fault; it's the game studios not optimizing their games.
I’ll pick up a B580 to see how it works with Jellyfin transcoding, OBS streaming using AV1, and, with some luck, Davinci Resolve. Maybe a little Blender?
Other exciting tests will include things like fan control, since that’s still an issue with Arc GPUs.
Should make for a fun blog post.
Looking forward to it! Would appreciate if you post a link here when you do, I'm excited to see how B series compares to A series for media encoding/decoding (assuming Battlemage is the same across all gpus the same way Alchemist was).
I actually really like the Arc 770.
However, this is going to go on clearance within 6 months. Good for consumers, bad for Intel.
Also keep in mind for any ML task Nvidia has the best ecosystem around. AMD and Intel are both like 5 years behind to be charitable...
Too late, and it has a bad rep. This effort from Intel to sell discrete GPUs is just inertia from old aspirations; it won't do much to save the company, as there is not much money in it. Most probably the whole Intel Arc effort will be mothballed, and probably many more efforts will be too.
What's the alternative?
I think it's the right call since there isn't much competition in GPU industry anyway. Sure, Intel is far behind. But they need to start somewhere in order to break ground.
Strictly speaking strategically, my intuition is that they will learn from this, course-correct, and then start making progress.
The idea of another competitive GPU manufacturer is nice. But it is hard to bring into existence. Intel is not in a position to invest lots of money and sustained effort into products for which the market is captured and controlled by a much bigger and more competent company on top of its game. Not even AMD can get more market share, and they are much more competent in the GPU technology. Unless NVIDIA and AMD make serious mistakes, Intel GPUs will remain a 3rd rate product.
> "They need to start somewhere in order to break ground"
Intel has big problems and it's not clear they should occupy themselves with this. They should stabilize, and the most plausible way to do that is to cut the weak parts, and get back to what they were good at - performant secure x86_64 CPUs, maybe some new innovative CPUs with low consumption, maybe memory/solid state drives.
> maybe memory/solid state drives
That's a very low-margin and cyclical market since memory/SSDs are basically commodities. I don't think Intel would have any chance surviving in such a market; they just have way too much bloat/R&D spending. Which is not a bad thing as long as you can produce better products than the competition.
No reviews and when you click on the reseller links in the press announcement they're still selling A750s with no B-Series in sight. Strong paper launch.
The fine article states reviews are still embargoed, and sales start next week.
The mods have thankfully changed this to a Phoronix article instead of the Intel page and the title has been reworked to not include 'launch'.
"old aspirations"
"there is not much money in it"?
WTF?
These are pretty interesting, but I'm curious about the side-by-side screenshot with the slider: why does ray tracing need to be enabled to see the yellow stoplight? That seems like a weird oversight.
It's possible that the capture wasn't taken at the exact same frame, or that the state of the light isn't deterministic in the benchmark.
They say the best predictor for the future is the past.
How was driver support for their A-series?
Drivers were very rough at launch. Some games didn't run at all, some basic functionality and configuration either crashed or failed to work, some things ran very poorly, etc. However, it was essentially all ironed out over many months of work.
They likely won't need to do the same discovery and fixing for B-series as they've already dealt with it.
> Intel with their Windows benchmarks are promoting the Arc B580 as being 24% faster than the Intel Arc A750
Not a huge fan of the numbering system they've used. B > A doesn't parse as easily as 5xxx > 4xxx to me.
They're going in alphabetical order:
A - Alchemist
B - Battlemage
C - Celestial (Future gen)
D - Druid (Future gen)
Hey we complained about all the numbers in their product names. Getting names from the D&D PHB is… actually very cool, no complaints.
Yes, I understand that. I'm saying it doesn't read as easily IMO as (modern) NVIDIA/AMD model numbers. Most numbers I deal with are base-10, not base-36.
The naming scheme they are using is easier to parse for me so all in the eye of the beholder.
On the other hand, considering GeForce is on its third loop through base 10, maybe it is not so bad... Radeon, on the other hand, has been a pure, absolute mess going back the same 20 years.
I kinda like the idea of Intel.
You aren’t using excel or sheets I see?
what's the current status of using cuda on non-gpu chips?
IIRC that was one of the original goals of geohot's tinybox project, though I'm not sure exactly where that evolved
I like Intel's aggressive pricing against entry/mid level GPUs, which hopefully puts downward pressure on all GPUs. Overall, their biggest concern is software support. We've had reports of certain DX11/12 games failing to run properly on Proton, and the actual performance of the A series varied greatly between games even on Windows. I suspect we'll see the same issues when the B580 gets proper third party benchmarking.
Their dedication to Linux Support, combined with their good pricing makes this a potential buy for me in future versions. To be frank, I won't be replacing my 7900 XTX with this. Intel needs to provide more raw power in their cards and third parties need to improve their software support before this captures my business.
I'm considering getting one to replace my 8 year old NVIDIA card but why are there 2 SKUs almost identical in price?
Binning.
https://en.wikipedia.org/wiki/Product_binning#Core_unlocking
None of the store links work. Weird. Is this not supposed to be a public page yet?
Must be an announcement rather than a launch I guess?
Why, though? Intel's strategy seems puzzling, to say the least.
Hard to get subsidies if you’re not releasing new lines of products.
How does this connect to Gelsinger's retirement, announced yesterday? The comments on that news were all doom and gloom, so I had expected more negative news today. Not a product launch. But I'm just some guy on HN, what do I know?
I don't see any connection. This is a very minor product for Intel.
the new intel battlemage cards look sweet. if they can extend displays on linux, then i'll definitely be buying one
Recently I did some testing of the IPEX-LLM llama.cpp backend on LNL's Xe2: https://www.reddit.com/r/LocalLLaMA/comments/1gheslj/testing...
Based on scaling by XMX/engine clock napkin math, the B580 should have 230 FP16 TFLOPS and 456 GB/s MBW theoretical. At similar efficiency to LNL Xe2, that should be about pp512 ~4700 t/s and tg128 ~77 t/s for a 7B class model. This would be about 75% of a 3090 for pp and 50% for tg (and of course, 50% of memory). For $250, that's not too bad.
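To make the napkin math explicit, here is roughly the scaling being applied; the Lunar Lake reference values below are placeholders chosen to illustrate the method and land near the quoted estimates, not official or measured numbers:

    # Scale LNL Xe2 results to the B580 by peak compute (prompt processing is
    # compute-bound) and by memory bandwidth (token generation is bandwidth-bound).
    # The four LNL reference values are assumptions, not measurements.
    lnl_fp16_tflops, lnl_mbw_gbs = 32.0, 136.5
    lnl_pp512_tps, lnl_tg128_tps = 650.0, 23.0

    b580_fp16_tflops, b580_mbw_gbs = 230.0, 456.0   # theoretical figures from above

    est_pp512 = lnl_pp512_tps * b580_fp16_tflops / lnl_fp16_tflops
    est_tg128 = lnl_tg128_tps * b580_mbw_gbs / lnl_mbw_gbs
    print(f"pp512 ~{est_pp512:.0f} t/s, tg128 ~{est_tg128:.0f} t/s")
    # Lands near the ~4700 / ~77 quoted above for a 7B class model.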
I do want to note a couple things from my poking around. The IPEX-LLM [1] was very responsive, and was able to address an issue I had w/ llama.cpp within days. They are doing weekly update releases, so that's great. The IPEX stands for Intel Extension for PyTorch [2] and it is a mostly drop-in for PyTorch: "Intel® Extension for PyTorch* extends PyTorch* with up-to-date features optimizations for an extra performance boost on Intel hardware. Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* xpu device."
All of this depends on Intel oneAPI Base Kit [3], which has easy Linux (and presumably Windows) support. I am normally an AUR guy on my Arch Linux workstation, but those packages are basically broken, and I had much more success installing oneAPI Base Kit (w/o issues) directly in Arch Linux. Sadly, this is also where there are issues: some of the code is either dependent on older versions of oneAPI Base Kit that are no longer available (vLLM requires oneAPI Base Toolkit 2024.1 - this is not available for download from the Intel site anymore) or in dependency hell (GPU whisper simply will not work, ipex-llm[xpu] has internal conflicts from the get-go), so it's not all sunshine. On average, ROCm w/ RDNA3 is much more mature (while not always the fastest, most basic things do just work now).
[1] https://github.com/intel-analytics/ipex-llm
[2] https://github.com/intel/intel-extension-for-pytorch
[3] https://www.intel.com/content/www/us/en/developer/tools/onea...
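For anyone who hasn't seen it, the "mostly drop-in" usage looks roughly like this minimal sketch (assuming intel-extension-for-pytorch is installed and the Arc GPU is exposed as the "xpu" device; treat it as illustrative rather than a vetted recipe):

    import torch
    import intel_extension_for_pytorch as ipex  # registers the "xpu" device

    model = torch.nn.Linear(4096, 4096).eval().to("xpu")
    model = ipex.optimize(model)                # apply IPEX inference optimizations

    x = torch.randn(8, 4096, device="xpu")
    with torch.no_grad():
        y = model(x)
    print(y.shape, y.device)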
12GB of vRAM? What a wasted opportunity.
I wonder if their partners can release 24GB cards.
Well hopefully they'll release a A770 replacement as well. I guess they rushed these cards to get ahead of Nvidia and AMD who should release their next gen early next year.
For lowest end GPU? (and 2k gaming?) It is plenty even for most 4k games.
Gaming sure, but not for GPU compute
You most likely would buy 700x series for compute
From the gaming side of things, I'm disappointed that Intel and AMD are focusing on the midrange market going forwards. I'm on Linux with a 6900XT and wasn't going to upgrade until there's a compatible option with acceptable raytracing performance (and when HDR is finally sorted out). The 4090 and other high tier cards are absurdly expensive, would be good to have competition in that segment.
But can it AI?
I wanted Intel to do well so I purchased an ARC card. The problem is not the hardware. For some games, it worked fine, but in others, it kept crashing left and right. After updates to drivers, crashing was reduced, but it still happened. Driver software is not easy to develop thoroughly. Even AMD had problems when compared to Nvidia when AMD really started to enter the GPU game after buying ATI. AMD has long since solved their driver woes, but years after ARC's launch, Intel still has not.
It's also a hardware problem. For example, Alchemist's EUs are SIMD8 while games typically require SIMD16, so work has to be dispatched to two EUs in lockstep; and the lack of support for the Execute Indirect instruction commonly used in UE5 games, which is currently emulated in software, makes game compatibility very hit-or-miss.
Battlemage is supposed to fix all these architectural issues. EU in Xe2 is now SIMD16 (which is why the number of EUs per Xe2 core is halved from that of Xe1), and they've added all the previously software-emulated instructions, including Execute Indirect, so in theory Battlemage should be in a much better position in game compatibility side of things.
On the Linux side of things, the lack of sparse residency support in i915 also hurts game compatibility[1] (though this is now available under Mesa 24). This is something the new xe driver is supposed to fix, but it's still a long way from being actually usable.
[1]: https://www.phoronix.com/news/Intel-Vukan-Sparse-TR-TT
I haven't experienced many crashing issues on Windows 11. What games are you seeing this in?
Do you mean on Linux and are those problems with anv? Radv seems to be developed faster these days with anv being slightly more behind.
Intel over there with two spears in the knees looking puzzled and in pain.
tell it to my intc stock price
Typical, they release this on the day I pick up a brand new A770 16gb to toy with LLM stuff.
ah well. pretty sure it'll do for my needs.
"B-series", huh?
I'm guessing their marketing department isn't known as the "A-team".
Intel can't compete head to head with Nvidia on performance.
But surely it's easy enough to compete on video ram - why not load their GPUs to the max with video ram?
And also video encoder cores - Intel has a great video encoder core and these vary little across high end to low end GPUs - so they could make it a standout feature to have, for example, 8 video encoder cores instead of 2.
It's no wonder Nvidia is the king because AMD and Intel just don't seem willing to fight.
Which market segment wants to encode 8 streams at once for cheap, and how big is it?
Streaming and streamers want to do multi resolution streaming.
Aaah. I assumed that for that kind of thing Twitch did the transcoding.
Twitch did in the past (mostly), but it is trying to pivot at the moment: take tight control over the encoding settings on the client side and just pass the already-encoded stream through the CDN. https://help.twitch.tv/s/article/multiple-encodes
Also having different encoding settings for different purposes is desired (e.g. high quality local recording for an edit later while live streaming to different services at the same time [Twitch, Youtube, ...]).
That said, I'm not aware of Intel limiting the number of encoding streams, so I don't know where the number 2 originates.
Unlabelled graphs are infuriating. Are the charts average framerate? Mean framerate? Maximum framerate?
The two graphs on the page show FPS.
GP is asking what measure of FPS. The most likely value when unspecified is usually "mean FPS" but, being a marketing graph, it doesn't explicitly say.
As long as all bars show the same thing, is there much difference between “mean” and “minimum”, as long as the settings are sensible? Keep in mind this is one benchmark (two, really) and you won’t precisely recreate the testing conditions, and it doesn’t matter because your use case is not running the tests.
Why don't they just release a basic GPU with 128GB RAM and eat NVidia's local generative AI lunch? The network effect of all devs porting their LLMs etc. to that card would instantly make them a major CUDA threat. But the beancounters running the company would never get such an idea...
Disclosure: HPC admin who works with NVIDIA cards here.
Because, no. It's not as simple as that.
NVIDIA has a complete ecosystem now. They have cards. They have cards of cards (platforms), which they produce, validate and sell. They have NVLink crossbars and switches which connects these cards on their card of cards with very high speeds and low latency.
For inter-server communication they have libraries which coordinate cards, workloads and computations.
They bought Mellanox, but that can be used by anyone, so there's no lock-in for now.
As a tangent, NVIDIA has a whole set of standards for pumping tremendous amount of data in and out of these mesh of cards. Let it be GPU-Direct storage or specialized daemons which handle data transfers on and off cards.
If you think that you can connect n cards to a PCIe bus and just send workloads to them and solve problems magically, you'll hurt yourself a lot, both performance- and psychology-wise.
You have to build a stack which can perform these things with maximum possible performance to be able to compete with NVIDIA. It's not just emulating CUDA now, esp. on the high end of the AI spectrum (GenAI, multi-card, multi-system, etc.).
For other lower-end, multi-tenant scenarios, they have card virtualization, MIG, etc. for card sharing. You have to compete on that, too, for cloud and smaller applications.
I have been hacking on local llama 3 inference software (for the CPU, but I have been thinking about how I would port it to a GPU) and would like to do a rebuttal:
https://github.com/ryao/llama3.c
Inference workloads are easy to parallelize to N cards with minimal connectivity between them. The Nvlink crossbars and switches just are not needed.
In particular, inference can be divided into two distinct phases, which are input processing (prompt processing) and output generation (token generation). They are remarkably different in their requirements. Input processing is compute bound via GEMM operations while output generation is memory bandwidth bound via GEMV operations. Technically, you can do the input processing via GEMV too by processing 1 token at a time, but that is slow, so you do not want to do that. Anyway, these phases can be further subdivided into the model’s layers. You can have 1 GPU per layer with the logits passing from GPU to GPU in a pipeline. The GPUs just need the layer’s weights and the key-value cache for all of the tokens in that layer in memory to be able to work effectively. For llama 3.1 405B, there are 126 layers, so that is up to 126 GPUs.
That is of course slightly slower than if you just had 1 GPU with an incredible amount of VRAM, but you can always have more than one query in flight to get better than 1 GPU’s worth of performance from this pipeline approach. There are other ways of doing parallelization too, such as having output processing use GEMM to do multiple queries in parallel. This would be what others call batching, although I am only interested in doing 1 query at a time right now, so I have not touched it.
In essence, you can connect n cards on PCIe and have them solve inferencing problems magically, with the right software. Training is a different matter and I cannot comment on it as I have not studied it yet.
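A minimal PyTorch-style sketch of the layer-pipelining idea described above (the layer list and devices are stand-ins; a real implementation would also keep each layer's slice of the KV cache on its device):

    import torch

    def place_layers(layers, devices):
        # Assign contiguous chunks of decoder layers to devices, one chunk per GPU.
        chunk = (len(layers) + len(devices) - 1) // len(devices)
        return [(layer.to(devices[i // chunk]), devices[i // chunk])
                for i, layer in enumerate(layers)]

    def pipelined_forward(hidden, placement):
        # hidden: activations for the current token(s), shape (num_tokens, model_dim).
        # The only inter-GPU traffic is moving this small tensor between hops.
        for layer, device in placement:
            hidden = hidden.to(device)
            hidden = layer(hidden)
        return hidden

    # Toy usage with stand-in layers; swap in real decoder blocks and real devices.
    layers = [torch.nn.Linear(64, 64) for _ in range(8)]
    placement = place_layers(layers, ["cpu", "cpu"])
    print(pipelined_forward(torch.randn(1, 64), placement).shape)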
I presume the counterargument is that inference hosting is commoditized (sort of like how stateless CPU-based containerized workload hosts are commoditized); there’s no margin in that business, because it is parallelizable, and arbitrarily schedulable, and able to be spread across heterogeneous hardware pretty easily (just route individual requests to sub-cluster A or B), preventing any kind of lock-in and thus any kind of rent-extraction by the vendor.
Which therefore means that cards that can only do inference, are fungible. You don’t want to spend CapEx on getting into a new LOB just to sell something fungible.
All the gigantic GPU clusters that you can sell a million at a time to a bigcorp under a high-margin service contract, meanwhile, are training clusters. Nvidia’s market cap right now is fundamentally built on the model-training “space race” going on between the world’s ~15 big AI companies. That’s the non-fungible market.
For Intel to see any benefit (in stock price terms) from an ML-accelerator-card LOB, it’d have to be a card that competes in that space. And that’s a much taller order.
Intel does make cards aimed at this space too:
https://www.intel.com/content/www/us/en/products/details/pro...
Coincidentally, it has 128GB of RAM. However, it is not a GPU, is designed to do training too and uses expensive HBM.
Modern GPUs can do more than inference/training and the original poster asked about a GPU with 128GB of RAM, not a card that can only do inferencing as you described. Interestingly, Qualcomm made its own card targeted at only inferencing with 128GB of RAM without using HBM:
https://www.qualcomm.com/news/onq/2023/11/introducing-qualco...
They do not sell it through PC parts channels so I do not know the price, but it is exactly what you described and it has been built. Presumably, a GPU with the same memory configuration would be of interest to the original poster.
Back in January, someone on Reddit claimed the list price was $16k.
It's competing against Nvidia H100s, which cost $25k. It's cheap, at least by the norms of the space.
You are both correct. AI inference is a comparatively easy problem from the perspective of parallelization when compared to most HPC problems.
Facebook did a technical paper where they described their training cluster and the sheer amount of complexity is staggering. That said, the original poster was interested in inferencing, not training.
https://arxiv.org/pdf/2407.21783
Training is close to traditional HPC in many ways. Inference is far simpler since it's a simple forward-going pipeline of a relatively small working set.
what kind of bandwidth/latency between GPUs would one need in that setup to not be bottlenecking? What you're describing sounds quite forgiving. Is it forgiving enough that we could potentially connect those GPUs over a LAN, or even a remote decentralized cloud of host computers?
From my understanding that's certainly possible to do without the latency hurting much with large batching between inference layers
This depends on:
- the model dimensions
- the model layers

The amount of data that needs to be transferred for each split is surprisingly small. Each time you move the calculation of a subsequent layer to a different GPU, you need to transfer an array that is of size model_dimension * num_tokens * bits_per_variable. Then this reduces to a classic network transfer time problem, where you consider both the time until the first byte arrives and the transfer time until the last byte arrives. Reality will likely be longer than that idealized scenario, especially since you need to send a signal saying to begin computing.

Input processing can tackle so many tokens simultaneously that it probably is not worth thinking too much about this penalty there. Output processing is where the penalty is more significant, since you will incur these costs for every token. Let’s say we are doing fp16 or bf16 on llama 3 8B. Then we need to transfer 8KB every time we move the calculation for another layer to another GPU. If you use RDMA and do this over 10GbE, the transfer time would be 6.4 microseconds. If we assume the time to first byte and the time to signal the start of processing is 3.6 microseconds combined (chosen to round things up), then we get a penalty of 10 microseconds per split, per token. If you are doing 60 tokens per second and split things across 4 GPUs over the network, you have a penalty of 30 microseconds per token. It will run about 0.003% slower and you are not going to notice this at all. Assuming 10GbE with RDMA is somewhat idealized, although I needed to pick something to give some numbers.
In any case, the equation for figuring what factor slower it would be is 1 / (1 + time to do transfers and trigger processing per each token in seconds). That would mean under a less ideal situation where the penalty is 5 milliseconds per token, the calculation will be ~0.99502487562 times what it would have been had it been done in a hypothetical single GPU that has all of the VRAM needed, but otherwise the same specifications. This penalty is also not very noticeable.
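The same per-hop arithmetic as a tiny script, using the assumptions stated above (llama 3 8B's 4096 hidden size, fp16 activations, RDMA over 10GbE, and a combined 3.6 microsecond overhead); exact figures shift slightly depending on whether 8KB means 8000 or 8192 bytes:

    # Per-hop cost of handing activations to the next GPU in the pipeline.
    model_dim       = 4096      # llama 3 8B hidden size
    bytes_per_value = 2         # fp16/bf16
    num_tokens      = 1         # token generation handles one token per step
    link_bits_per_s = 10e9      # assumed 10GbE
    overhead_s      = 3.6e-6    # assumed first-byte latency + "start compute" signal

    payload_bytes = model_dim * num_tokens * bytes_per_value    # ~8 KB
    transfer_s    = payload_bytes * 8 / link_bits_per_s         # ~6.5 microseconds
    per_hop_s     = transfer_s + overhead_s                     # ~10 microseconds per split
    print(payload_bytes, f"{transfer_s * 1e6:.2f} us", f"{per_hop_s * 1e6:.2f} us")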
In conclusion, you are right. I actually recall seeing a random YouTube video talking about a software project that does clustered interferencing, so people are already doing this. Unfortunately, I do not remember the project name, channel name or video name.
On servers you're right: but for local LLM inference, I think you're wrong. For local LLMs most people are bottlenecked by not having enough VRAM: pretty much no one is running a 70b model on Nvidia GPUs locally, just due to the expense. You don't need maximum performance: you need it to run at all, which most people can't do for the good models — at least, not without heavy quantization that pretty badly lobotomizes them.
Apple is the king right now of local LLM inference, just because of their unified memory architecture meaning that people can get large amounts of "VRAM" (since all RAM is VRAM). They're not as fast as Nvidia — not even close to an H100, for example. But they don't need to be. No consumer can afford an H100, but they can afford a Mac.
I think he's mostly referring to inference and not training, which I entirely agree with - a 4x version of this card for workstations would do really well - even some basic interconnect between the cards a la nvlink would really drive this home.
The training can come after, with some inference and runtime optimizations on the software stack.
Most of the above infra is predicated on limited RAM forcing so much communication between cards. Bump the RAM up and you could do single-card inference, and all those connections become overhead that could have gone to more RAM. For training there is an argument still, but even there, the more RAM you have, the less all that connectivity gains you. RAM has been used to sell cards and servers for a long time now; it is time to open the floodgates.
Correct for inference - the main use of the interconnect is RDMA requests between GPUs to fit models that wouldn't otherwise fit.
Not really correct for training - training has a lot of all-to-all problems, so hierarchical reduction is useful but doesn't really solve the incast problem - Nvlink _bandwidth_ is less of an issue than perhaps the SHARP functions in the NVLink switch ASICs.
All of that is highly relevant for training but what the poster was asking for is a desktop inference card.
You use at least half of this stack for desktop setups. You need copying daemons, the ecosystem support (docker-nvidia, etc.), some of the libraries, etc. even when you're on a single system.
If you're doing inference on a server; MIG comes into play. If you're doing inference on a larger cloud, GPU-direct storage comes into play.
It's all modular.
It's possible you're underestimating the open source community.
If there's a competing platform that hobbyists can tinker with, the ecosystem can improve quite rapidly, especially when the competing platform is completely closed and hobbyists basically are locked out and have no alternative.
> It's possible you're underestimating the open source community.
On the contrary. You really don't know how I love and prefer open source and love a more leveling playing field.
> If there's a competing platform that hobbyists can tinker with...
AMD's cards are better from a hardware and software architecture standpoint, but the performance is not there yet. Plus, the ROCm libraries are not that mature, but they're getting there. Developing high performance, high quality code is deceptively expensive, because it's very heavy in theory, and you fly very close to the metal. I did that in my Ph.D., so I know what it entails. So it requires more than a couple (hundred) hobbyists to pull off (see the development of the Eigen linear algebra library, or any high end math library).
Some big guns are pouring money into AMD to implement good ROCm libraries, and it started paying off (Debian has a ton of ROCm packages now, too). However, you need to be able to pull it off in the datacenter to be able to pull it off on the desktop.
AMD also needs to be able to enable ROCm on desktop properly, so people can start hacking it at home.
> especially when the competing platform is completely closed...
NVIDIA gives a lot of support to universities, researchers and institutions who play with their cards. Big cards may not be free, but know-how, support and first steps are always within reach. Plus, their researchers dogfood their own cards, and write papers with them.
So, as long as papers get published, researchers do their research, and something gets invented, many people don't care about how open source the ecosystem is. This upsets me a ton, but closed source AI companies, and researchers who forget to add crucial details to their papers so what they did can't be reproduced, don't care about open source, because they think like NVIDIA: "My research, my secrets, my fame, my money".
It's not about sharing. It's about winning, and it's ugly in some aspects.
Yep, this thread has a good compilation of ROCm woes: https://news.ycombinator.com/item?id=34832660
That said, for hobbyist inference on large pretrained models, I think there is an interesting set of possibilities here: maybe a number of operations aren't optimized, and it takes 10x as long to load the model into memory... but all that might not matter if AMD were to be the first to market for 128GB+ VRAM cards that are the only things that can run next-generation open-weight models in a desktop environment, particularly those generating video and images. The hobbyists don't need to optimize all the linear algebra operations that researchers need to be able to experiment with when training; they just need to implement the ones used by the open-weight models.
But of course this is all just wishful thinking, because as others have pointed out, any developments in this direction would require a level of foresight that AMD simply hasn't historically shown.
IDK, I found a post that's 2 years old with links to running llama and SD on an Arc [0] (although it might be Linux-only). I feel like a cheap, huge-RAM card would create a 'critical mass' for optimization work to start, and then over the longer term Intel could promise and deliver on 'scale up' improvements.
It would be a huge shift for them: to go from chasing some (sometimes not quite reached) metric to, perhaps rightly, playing the 'reformed underdog'. Commoditize big-memory, ML-capable GPUs, even if they aren't quite as competitive as the top players at first.
Will the other players respond? Yes. But ruin their margin. I know that sounds cutthroat[1] but hey, I'm trying to hypothetically sell this to whoever is taking the reins after Pat G.
> NVIDIA gives a lot of support to universities, researchers and institutions who play with their cards. Big cards may not be free, but know-how, support and first steps are always within reach. Plus, their researchers dogfood their own cards, and write papers with them.
Ideally they need to do that too. Ideally they have some 'high powered' prototypes (e.g., let's say they decide a 2-GPU-per-card design with an interlink is feasible for some reason) to share as well. This may not be entirely ethical[1] in this example of how a corp could play it out; again, it's a thought experiment, since Intel has NOT announced or hinted at a larger memory card anyway.
> AMD also needs to be able to enable ROCm on desktop properly, so people can start hacking it at home
AMD's driver story has always been a hot mess. My desktop won't behave with both my onboard video and my 4060 enabled, and every AMD card I've had winds up with some weird firmware quirk one way or another... I guess I'm saying their general level of driver quality doesn't lend much hope that they'll fix the dev tools any time soon...
[0] - https://old.reddit.com/r/LocalLLaMA/comments/12khkka/running...
[1] - As you said, it's about winning and it can get ugly.
ROCm doesn't really matter when the hardware is almost the same as Nvidia's cards. AMD is not selling a "cheaper" card with a lot of RAM, which is what the original poster was asking for (and a reason why people who like to tinker with large models are using Macs).
You're writing as if AMD cares about open source. If they would only actually open source their driver the community would have made their cards better than nvidia ones long ago.
I'm one of those academics. You've got it all wrong. So many people care about open source. So many people carefully release their code and make everything reproducible.
We desperately just want AMD to open up. They just refuse. There's nothing secret going on and there's no conspiracy. There's just a company that for some inexplicable reason doesn't want to make boatloads of money for free.
AMD is the worst possible situation. They're hostile to us and they refuse to invest to make their stuff work.
> If they would only actually open source their driver the community would have made their cards better than nvidia ones long ago.
Software wise, maybe. But you can't change AMD's hardware with a magic wand, and that's where a lot of CUDA's optimizations come from. AMD's GPU architecture is optimized for raster compute, and it's been that way for decades.
I can assure you that AMD does not have a magic button to press that would make their systems competitive for AI. If that was possible it would have been done years ago, with or without their consent. The problem is deeper and extends to design decisions and disagreement over the complexity of GPU designs. If you compare AMD's cards to Nvidia on "fair ground" (eg. no CUDA, only OpenCL) the GPGPU performance still leans in Nvidia's favor.
That would require competently produced documentation. Intel can't do that for any of their side projects because their MBAs don't get a bonus if the tech writers are treated as a valuable asset.
Innovation is a bottom up process. If they sell the hardware the community will spring up to take advantage.
No. I've been reading up. I'm planning to run Flux 12b on my AMD 5700G with 64GB RAM. The CPU will take 5-10 minutes per image, which will be fine for me tinkering while writing code. Maybe I'll be able to get the GPU going on it too.
Point of the OP is this is entirely possible with even an iGPU if only we have the RAM. nVidia should be irrelevant for local inference.
The Ryzen 5700G is one of the APUs tested on the Debian ROCm CI [1]. It works quite well with the Debian / Ubuntu system packages.
[1]: http://ci.rocm.debian.net/
No, you don't need much bandwidth between cards for inference.
Copying daemons (gdrcopy) is about pumping data in and out of a single card. docker-nvidia and rest of the stack is enablement for using cards.
GPU-Direct is about pumping data from storage devices to cards, esp. from high speed storage systems across networks.
MIG actually shares a single card to multiple instances, so many processes or VMs can use a single card for smaller tasks.
Nothing I have written in my previous comment is related to inter-card, inter-server communication, but all are related to disk-GPU, CPU-GPU or RAM-CPU communication.
Edit: I know it's not OK to talk about downvoting, and downvote as you like, but I install and enable these cards for researchers. I know what I'm installing and what it does. C'mon now. :D
Mostly, I think, we don’t really understand your argument that Intel couldn’t easily replicate the parts needed only for inference.
Yeah, for example llama.cpp runs on Intel GPUs via Vulkan or SYCL. The latter is actively being maintained by Intel developers.
Obviously that is only one piece of software, but it's certainly a useful one if you are using one of the many LLMs it supports.
I've run inference on Intel Arc and it works just fine, so I am not sure what you're talking about. I certainly didn't need Docker! I've never tried to do anything on AMD yet.
I had the 16GB Arc, and it was able to run inference at the speed I expected, with twice as many per batch as my 8GB card, which I think is about what you'd expect.
Once the model is on the card, there's no "disk" anymore, so having more VRAM to load the model and the tokenizer and whatever else onto means the disk stops mattering, and realistically, when I am running loads on my 24GB 3090, the CPU is maybe 4% over idle usage. My bottleneck, as it stands, to running large models is VRAM, not anything else.
If I needed to train (from scratch or whatever) I'd just rent time somewhere, even with a 128GB card locally, because obviously more tensors is better.
And you're getting downvoted because there's literally LM Studio and llama.cpp and sd-webui that run just fine for inference on our non-DC, non-NVLink, 1/15th-the-cost GPUs.
Inferencing is much more simple than you think:
See the precompute_input_logits() and forward() functions here:
https://github.com/ryao/llama3.c/blob/master/run.c#L520
As a preface, precompute_input_logits() is really just a generalized version of the forward() function that can operate on multiple input tokens at a time to do faster input processing, although it can be used in place of the forward() function for output generation just by passing only a single token at a time.
Also, my apologies for the code being a bit messy. matrix_multiply() and batched_matrix_multiply() are wrappers for GEMM, which I ended up having to use directly anyway when I needed to do strided access. Then matmul() is a wrapper for GEMV, which is really just a special case of GEMM. This is a work in progress personal R&D project that is based on prior work others did (as it spared me from having to do the legwork to implement the less interesting parts of inferencing), so it is not meant to be pretty.
Anyway, my purpose in providing that link is to show what is needed to do inferencing (on llama 3). You have a bunch of matrix weights, plus a lookup table for vectors that represent tokens, in memory. Then your operations are:
I specify rmsnorm and softmax for completeness, but they can be implemented in terms of the other operations. If you can do those, you can do inferencing. You don’t really need very specialized things. Over 95% of time will be spent in GEMM too.
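As a tiny illustration of that point (in NumPy rather than the C of run.c): GEMV is just GEMM with a single-column right-hand side, so a fast GEMM plus a few small elementwise kernels is essentially the whole inference engine:

    import numpy as np

    W = np.random.rand(8, 16).astype(np.float32)   # a weight matrix
    x = np.random.rand(16).astype(np.float32)      # an activation vector

    gemv = W @ x                    # matrix-vector product (GEMV)
    gemm = W @ x.reshape(-1, 1)     # the same computation phrased as GEMM
    assert np.allclose(gemv, gemm.ravel())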
My next steps likely will be to figure out how to implement fast GEMM kernels on my CPU. While my own SGEMV code outperforms the Intel MKL SGEMV code on my CPU (Ryzen 7 5800X where 1 core can use all memory bandwidth), my initial attempts at implementing SGEMM have not fared quite so well, but I will likely figure it out eventually. After I do, I can try adapting this to FP16 and then memory usage will finally be low enough that I can port it to a GPU with 24GB of VRAM. That would enable me to do what I say is possible rather than just saying it as I do here.
By the way, the llama.cpp project has already figured all of this out and has things running on both GPUs and CPUs using just about every major quantization. I am rolling my own to teach myself how things work. By happy coincidence, I am somehow outperforming llama.cpp in prompt processing on my CPU but sadly, the secrets of how I am doing it are in Intel’s proprietary cblas_sgemm_batch() function. However, since I know it is possible for the hardware to perform like that, I can keep trying ideas for my own implementation until I get something that performs at the same level or better.
I am favoriting this comment for reference later when I start poking around in the base level stuff. I find it pretty funny how simple this stuff can get. Have you messed with ternary computing inference yet? I imagine that shrinks the list even further - or at least reduces the compute requirements in favor of brute force addition. https://arxiv.org/html/2410.00907
No. I still have a long list of things to try doing with llama 3 (and possibly later llama 3.1) on more normal formats like fp32 and in the future, fp16. When I get things running on a GPU, I plan to try using bf16 and maybe fp8 if I get hardware that supports it. Low bit quantizations hurt model quality, so I am not very interested in them. Maybe that will change if good quality models trained to use them become available.
> but sadly, the secrets of how I am doing it are in Intel’s proprietary cblas_sgemm_batch() function.
Perhaps you can reverse engineer it?
My plan is to make more attempts at rolling my own. Reverse engineering things is not something that I do, since that would prevent me from publishing the result as OSS.
Rather than tackling the entire market at once, they could start with one section and build from there. NVIDIA didn't get to where it was in a year, it took many strategic acquisitions. (All the networking and other HPC-specialized stuff I was buying a decade ago has seemingly been bought by NVIDIA).
Start by being a "second vendor" for huge customers of NVIDIA that want to foster competition, as well as a few others willing to take risks, and build from there.
Intel has already bought and killed everything they need to compete here. They seem incapable of sticking to any market that isn’t x86. Likely because when they were making those acquisitions they were drunk on margin and didn’t want to focus anywhere else.
> They seem incapable of sticking to any market that isn’t x86
Even within that they seem to have difficulty expanding to new areas, remember edison?
Killing QLogic’s infiniband business after buying it was a major loss for the industry.
> NVIDIA has a complete ecosystem now. They have cards. They have cards of cards
Nvidia is starting to sound like a house of cards to me.
You are of course right WRT datacenter use of GPUs. The OP spoke about local generation though. It is of course a smaller market, but a market nevertheless, and the more amateurs and students are using your product, the more of them would consider applying it in more professional settings.
Sun (Sparc) and HP (PA-RISC) used to own most of the server market in 1990, but lost most of it to x86 by 2000. Few people had a Sun box with Solaris, but tons of people had access to a PC with Linux, which was inferior in many ways, but well-known and much less locked-up.
But all the tricks for developing a solid software stack to support the HW are already out there, no? The basic principles are known; to my understanding the main challenge is not doing this development in tandem with the HW people, but the requirement to support older legacy devices, which makes it harder, for example, for AMD to compete. The only challenges, which Intel is prepped to face, are logistics and fabs. On a separate note, projects like JAX are aiming to circumvent the abstraction layer CUDA adds on top of NVIDIA hardware, so having decent hardware competition is definitely an option. Just some time ago, vLLM fully supported AMD GPUs! We need more competition.
Lets see how quickly that changes if intel releases cards with massive amounts of ram for a fraction of the cost.
Yes, but if Intel provided the base, people would flock to it and build software for it. Intel doesn't even need to be involved, but they should be.
> ... local generative AI lunch
I agree. Just for my PC, something that'd enable small devs to create interesting foundation model apps that'd deploy to users using these local AI cards to run the new apps.
There might be a chicken-and-egg problem if the apps end up requiring a 128GB AI accelerator card. You only get the card if there are apps to run, and you only develop the apps if the cards are widespread. With so much RAM, the cards will not be let's-throw-them-into-a-cheap-build cheap.
I think there have to be a couple of killer apps that run "OK" with CPU or GPU, but would run tremendously better with such a card.
I have a question for you, since I’m somewhat entering the HPC world. In the EU the EuroHPC-JU is building what they call AI factories, afaict these are just batch processing (Slurm I think) clusters with GPUs in the nodes. So I wonder where you’d place those cards of cards. Are you saying there is another, perhaps better ways to use massive amounts of these cards? Or is that still in the “super powerful workstation” domain? Thanx in advance.
View it as Raspberry Pi for AI workloads. Initial stage is for enthusiasts that would develop the infra, figure out what is possible and spread the word. Then the next phase will be SME industry adoption, making it commercially interesting, while bypassing Nvidia completely. At some point it would live its own life and big players jump in. Classical disrupt strategy via low cost unique offerings.
Pretty sure they are talking about inference in the post you are responding to. Training the model obviously needs far more compute, but running them locally is what most devs are interested in.
Off the topic: I think, in the long term, inference should be done along with some kind of training.
How does any of this make money?
When this walled garden is the only way to use GPUs with high efficiency, everybody is using this stack, and NVIDIA controls the supply of these "platform boards" to OEMs, they don't just make money, they literally print it.
However, AMD is coming for them, because a couple of high-profile supercomputer centers (LUMI, Livermore, etc.) are using Instinct cards and pouring money into AMD to improve their cards and stack.
I have not used their (Instinct) cards yet, but their Linux driver architecture is way better than NVIDIA's.
I think he was asking how HPC makes money. The answer as far as I know is that it often does not for anyone other than the vendors.
I wonder where you got your information on AMD’s “Linux driver architecture”. It is reportedly a mess:
https://news.ycombinator.com/item?id=34832660
So far, I have been very happy with Nvidia’s Linux drivers.
HPC is used in research, which is often not expected to make money. The hope is that the research will result in something that makes money. One example would be drug discovery. Another would be weather prediction, which is not so much a way to make money, but to minimize losses.
Having the complete ecosystem affords them significant margins.
Against what?
As of today they have SaaS company margins as a hardware company which is practically unheard of.
I mean, do they?
What?
It's like the most profitable set of products in tech. You have companies like Meta, MSFT, Amazon, Google etc spending $5B every few years buying this hardware.
Stale money is moving around. Nothing changed.
What is stale money?
Hmm. There is a lot of money that exists, doing nothing. I consider that stale money.
Edit: I can’t sort this out. Where did all the money go?
Have you looked under your sofa?
Sadly it’s just stale goldfish and magnetites.
So what you're saying is Intel, or any other would-be NVIDIA competitor, needs to put out fast interconnects, not just compute cards. This is true.
I'm not sure your argument stands when it comes to OP's idea of a single card with 128GB VRAM. This would be enough to run ~180B models with reasonable quantization --we're not near maxing out the capability of 180B yet (see the latest 32B models performing near public SOTA).
This indeed would push rapid and wide adoption and be quite disruptive. But sure, it wouldn't instantly enable competitive training of 405B models.
> Disclosure: HPC admin who works with NVIDIA cards here.
> Because, no. It's not as simple as that.
Wow what he said is way above your head! Please reread what he wrote.
Just how "basic" do you think a GPU can be while having the capability to interface with that much DRAM? Getting there with GDDR6 would require a really wide memory bus even if you could get it to operate with multiple ranks. Getting to 128GB with LPDDR5x would be possible with the 256-bit bus width they used on the top parts of the last generation, but would result in having half the bandwidth of an already mediocre card. "Just add more RAM" doesn't work the way you wish it could.
M3/M4 Max MacBooks with 128GB RAM are already way better than an A6000 for very large local LLMs. So even if the GPU is as slow as the one in M3/M4 Max (<3070), and using some basic RAM like LPDDR5x it would still be way faster than anything from NVidia.
The M4 Max needs an enormous 512bit memory bus to extract enough bandwidth out of those LPDDR5x chips, while the GPUs that Intel just launched are 192/160bit and even flagships rarely exceed 384bit. They can't just slap more memory on the board, they would need to dedicate significantly more silicon area to memory IO and drive up the cost of the part, assuming their architecture would even scale that wide without hitting weird bottlenecks.
The memory controller would be bigger, and the cost would be higher, but not radically higher. It would be an attractive product for local inference even at triple the current price and the development expense would be 100% justified if it helped Intel get any kind of foothold in the ML market.
> They can't just slap more memory on the board
Why not? It doesn't have to be balanced. RAM is cheap. You would get an affordable card that can hold a large model and still do inference e.g. 4x faster than a CPU. The 128GB card doesn't have to do inference on a 128GB model as fast as a 16GB card does on a 16GB model, it can be slower than that and still faster than any cost-competitive alternative at that size.
The extra RAM also lets you do things like load a sparse mixture of experts model entirely into the GPU, which will perform well even on lower end GPUs with less bandwidth because you don't have to stream the whole model for each token, but you do need enough RAM for the whole model because you don't know ahead of time which parts you'll need.
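A back-of-the-envelope sketch of that point (all of the expert counts and sizes below are made up for illustration, not any particular model):

    # Hypothetical MoE sizing sketch: resident VRAM vs. bytes actually read per token.
    def moe_footprint(n_experts, active_experts, expert_params, shared_params, bytes_per_param):
        total_bytes = (n_experts * expert_params + shared_params) * bytes_per_param
        read_bytes = (active_experts * expert_params + shared_params) * bytes_per_param
        return total_bytes, read_bytes

    # Made-up example: 32 experts of 3.5B params each, 2 active per token, 4B shared params, 8-bit weights.
    total, per_token = moe_footprint(32, 2, 3.5e9, 4e9, 1)
    print(f"resident: {total/1e9:.0f} GB, streamed per token: {per_token/1e9:.0f} GB")
    # ~116 GB must sit in VRAM, but only ~11 GB is read per token, so a high-capacity,
    # modest-bandwidth card can still decode at a usable rate.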
To get 128GB of RAM on a GPU you'd need at least a 1024-bit bus. GDDR6X is 16Gbit with 32 pins, so you'd need 64 GDDR6X chips, and good luck even trying to fit those around the GPU die, since traces need to be the same length and you want to keep them as short as possible. There's also a good chance you can't run a clamshell setup, because 32 GDDR6X chips would kick off way too much heat to be cooled on the back of a GPU, so you'd have to double the bus width to 2048 bits. Such a ridiculous setup would obviously be extremely expensive and would use way too much power.
A more sensible alternative would be going with HBM, except good luck getting any capacity for that, since it's all being used for the extremely high-margin data center GPUs. HBM is also extremely expensive, both in terms of the cost of buying the chips and due to its advanced packaging requirements.
You do not need a 1024-bit bus to put 128GB of some DDR variant on a GPU. You could do a 512-bit bus with dual rank memory. The 3090 had a 384-bit bus with dual rank memory and going to 512-bit from that is not much of a leap.
This assumes you use 32Gbit chips, which will likely be available in the near future. Interestingly, the GDDR7 specification allows for 64Gbit chips:
> the GDDR7 standard officially adds support for 64Gbit DRAM devices, twice the 32Gbit max capacity of GDDR6/GDDR6X
https://www.anandtech.com/show/21287/jedec-publishes-gddr7-s...
Yeah, the idea that you're limited by bus width is kind of silly. If you're using ordinary DDR5 then consider that desktops can handle 192GB of memory with a 128-bit memory bus, implying that you get 576GB with a 384-bit bus and 768GB at 512-bit. That's before you even consider using registered memory, which is "more expensive" but not that much more expensive.
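The scaling there is just linear in bus width; a quick sketch using the desktop figure above as the baseline:

    # Capacity scales linearly with bus width at fixed DRAM density and rank count.
    baseline_gb, baseline_bus = 192, 128   # desktop DDR5: 192GB on a 128-bit bus
    for bus in (256, 384, 512):
        print(f"{bus}-bit bus -> {baseline_gb * bus // baseline_bus} GB")
    # 384 GB, 576 GB, 768 GB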
And if you want to have some real fun, cause "registered GDDR" to be a thing.
Man, I'm old enough to remember when 512-bit was a thing for consumer cards, back when we had 4-8GB of memory.
Sure, that was only GDDR5 and not GDDR6 or LPDDR5, but I would have bet we'd be back up to 512-bit 10 years down the line.
(I mean, supposedly HBM3 has done 1024-2048-bit buses, but that seems to be more for research or super-high-end cards, not consumer.)
Rumor is the 5090 will be bringing back the 512bit bus, for a whopping 1.5TB/sec bandwidth.
> They can't just slap more memory on the board, they would need to dedicate significantly more silicon area to memory IO and drive up the cost of the part,
In the pedantic sense of just literally slapping more on existing boards? No, they might have one empty spot for an extra BGA VRAM chip, but not enough for the gains we're talking about. But this is absolutely possible, trivially so for someone like Intel/AMD/NVidia that has full control over the architecture and design process. Is it a switch they flip at the factory 3 days before shipping? No, obviously not. But if they had intended this ~2 years ago when it was just a product on the drawing board? Absolutely. There is zero technical/hardware/manufacturing reason they couldn't do this. And considering the "entry level" competitor product is the M4 Max, which starts at at least $3,000 (for a 128GB-equipped one), the pricing margin more than exists to cover a few hundred extra in RAM and the extra overhead of higher-layer, more densely populated PCBs.
The real impediment is what you landed on at the end there, combined with the greater ecosystem not having support for it. Intel could drop a card that is, by all rights, far better performing hardware than a competing Nvidia GPU, but Nvidia's dominance in APIs, CUDA, networking, fabric switches (NVLink, Mellanox, BlueField), etc. for the past 10+ years, and all of the skilled labor that is familiar with it, would largely render a 128GB Arc GPU a dud on delivery, even if it was priced as a steal. The same thing happened with the Radeon VII: a killer compute card that no one used because, while the card itself was phenomenal, the rest of the ecosystem just wasn't there.
Now, if intel committed to that card, and poured their considerable resources into that ecosystem, and continued to iterate on that card/family, then now we're talking, but yeah, you can't just 10X VRAM on a card that's currently a non-player in the GPGPU market and expect anyone in the industry to really give a damn. Raise an eyebrow or make a note to check back in a year? Sure. But raise the issue to get a greenlight on the corpo credit line? Fat chance.
A 128GB VRAM Intel Arc card at a low price would be an OSS developer’s dream come true. It would be the WRT54G of inference.
> The M4 Max needs an enormous 512bit memory bus to extract enough bandwidth out of those LPDDR5x chips
Does M4 Max have 64-byte cache lines?
If they can fetch or flush an entire cache line in a single memory-bus transaction, I wonder if that opens up any additional hardware / performance optimizations.
> Does M4 Max have 64-byte cache lines?
on the CPU side: 64 bytes at L1, 128 byte cachelines at L2
A single memory transaction is almost always a 16n burst for LPDDR5x.
Apple could do it. Why can’t Intel?
Because Apple isn't playing the same game as everyone else. They have the money and clout to buy out TSMCs bleeding-edge processes and leave everyone else with the scraps, and their silicon is only sold in machines with extremely fat margins that can easily absorb the BOM cost of making huge chips on the most expensive processes money can buy.
Bleeding edge processes is what Intel specializes in. Unlike Apple, they don’t need TSMC. This should have been a huge advantage for Intel. Maybe that’s why Gelsinger got the boot.
> Bleeding edge processes is what Intel specializes in. Unlike Apple, they don’t need TSMC.
Intel literally outsourced their Arrow Lake manufacturing to TSMC because they couldn't fabricate the parts themselves - their 20A (2nm) process node never reached a production-ready state, and was eventually cancelled about a month ago.
OK, so the question becomes: TSMC could do it. Why can’t Intel?
Intel is maybe a year or two behind TSMC right now. They might or might not catch up since it is a moving target, but I dont think there is anything TSMC is doing today that Intel wont be doing in the near future.
They have been trying… for like 10 years.
Wasn't it cancelled in favor of 18A?
It was, but that only puts them further away from shipping product.
These days, Intel merely specializes in bleeding processes. They spent far too many years believing the unrealistic promises from their fab division, and in the past few years they've been suffering the consequences as the problems are too big to be covered up by the cost savings of vertical integration.
Intel's foundry side has been floundering so hard that they've resorted to using TSMC themselves in an attempt to keep up with AMD. Their recently launched CPUs are a mix of Intel-made and TSMC-made chiplets, but the latter accounts for most of the die area.
I'm not certain this is quite as damning as it sounds. My understanding is that the foundry business was intentionally walled off from the product business, and that the latter wasn't going to be treated as a privileged customer.
No, in fact, it sounds even more damning, because the client side was able to pick whatever was best on the market, and it wasn't Intel. The client side could learn and customize their designs to use another company's processes (an extremely hard thing to do, by the way) faster than Intel Foundry could even get its pants on in the morning.
Intel Foundry screwed up so badly that Nokia's server division was almost shut down because of its failures. (Imagine being so bad at your job that your clients nearly go out of business.) If Intel's client side had chosen to use Foundry, there just wouldn't have been any chips to sell.
Intel Arc hardware is manufactured by TSMC, specifically on N6 and N5 for this latest announcement.
Intel doesn't currently have nodes competitive with TSMC or excess capacity in their better processes.
Serious question, why don't they have excess capacity? They aren't producing many CPUs...that people want.
Hard to have excess capacity when your latest fabs aren't able to produce anything reliably.
They don't even have competitive capacity for all their CPU needs. They have negative spare capacity overall.
Because their production capacity is much smaller than TSMC, it's declined over the past few years, and their newer nodes are having yield issues.
If you think Intel has bleeding-edge processes: that hasn't been the case for over 6 years...
> and their silicon is only sold in machines with extremely fat margins
Like the brand new Mac Mini that costs 600 USD and went to 500 during Black Week.
The $600 Mini is on the base M4 with the anemic 10-core GPU. This is for grandma or your 10-year-old.
The good one, which is still slower than the M4 Max, is $2200.
If you want the Max you need at least a MacBook Pro starting at $3200, and if you want the better one with 128GB RAM it starts at about $5k.
No matter how few GPU cores the M4 has, it is still an extremely potent product on the whole.
Better than most of the PCs out there.
Do you mean subjectively, because you find the Mac pleasant to use? The $600 Mini certainly isn't more performant than 85% of desktops.
> The $600 mini certainly isn't more performant than 85% of desktops.
The average desktop isn't exactly a gaming machine; it's a corporate box or a low-end home desktop with a Core i5 using the iGPU.
Transistor IO logic scaling died a while ago, which is what prompted AMD to go with a chiplet architecture. Being on a more advanced process does not make implementing a 512-bit memory bus any easier for Apple. If anything, it makes it more expensive for Apple than it would be for Intel.
Because LPDDR5x is soldered-on RAM.
Everyone else wants configurable RAM that scales both down (to 16GB) and up (to 2TB), to cover smaller laptops and bigger servers.
GPUs with soldered-on RAM have 500GB/sec bandwidths, far in excess of Apple's chips. So the 8GB or 16GB offered by NVidia or AMD is just far superior for video game graphics (where textures are the priority).
> GPUs with soldered on RAM has 500GB/sec bandwidths, far in excess of Apples chips.
Apple is doing 800GB/sec on the M2 Ultra and should reach about 1TB/sec with the M4 Ultra, but that's still lagging behind GPUs. The 4090 was already at the 1TB/sec mark two years ago, the 5090 is supposedly aiming for 1.5TB/sec, and the H200 is doing 5TB/sec.
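Those figures all fall out of bus width times per-pin data rate; rough sketch (the data rates are the commonly quoted ones, so treat the outputs as approximate):

    def bandwidth_gbs(bus_bits, megatransfers_per_s):
        return bus_bits / 8 * megatransfers_per_s / 1000  # GB/s

    print(bandwidth_gbs(1024, 6400))    # M2 Ultra: 1024-bit LPDDR5-6400    -> ~819 GB/s
    print(bandwidth_gbs(512, 8533))     # M4 Max:   512-bit LPDDR5x-8533    -> ~546 GB/s
    print(bandwidth_gbs(384, 21000))    # RTX 4090: 384-bit GDDR6X @ 21Gb/s -> ~1008 GB/s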
HBM is kind of not fair, lol. But a 4096-line bus is gonna have more bandwidth than any competitor.
It's pretty expensive though.
The 500GB/sec number is for a more ordinary GPU like the B580 Battlemage in the $250ish price range. Obviously the $2000ish 4090 will be better, but I don't expect the typical consumer to be using those.
But an on-package memory bus has some of the advantages of HBM, just to a lesser extent, so it's arguably comparable as an "intermediate stage" between RAM chips and HBM. Distances are shorter (so voltage drop and capacitance are lower, so it can be driven at lower power); routing is more complex, but that can be worked around with more layers, which increases cost, though over a significantly smaller area than DIMMs require; and the DIMM connections themselves can hurt performance (reflection from poor contacts, optional termination makes things more complex, and the expectation of mix-and-match between DIMM vendors and products likely reduces the room for fine tuning).
There's pretty much an inverse scaling between flexibility and performance: DIMMs > soldered RAM > on-package RAM > die interconnects.
The question is why Intel GPUs, which already have soldered memory, aren't sold with more of it. The market here isn't something that can beat enterprise GPUs at training, it's something that can beat desktop CPUs at inference with enough VRAM to fit large models at an affordable price.
Intel also has lunar lake CPUs with on package RAM. They could have added more memory channels like Apple did.
If their memory IO supports multiple ranks like the RTX 3090 (it used dual rank) did, they could do a new PCB layout and then add more memory chips to it. No additional silicon area would be necessary.
It doesn't matter if the "cost is driven up". Nvidia has proven that we're all lil pay pigs for them. The 5090 will be $3000 for 32GB of VRAM. Screenshot this now, it will age well.
We'd be happy to pay $5000 for 128GB from Intel.
You are absolutely correct, and even my non-prophetic ass echoed exactly the first sentence of the top comment in this HN thread ("Why don't they just release a basic GPU with 128GB RAM and eat NVidia's local generative AI lunch?").
Yes, yes, it's not trivial to have a GPU with 128GB of memory, with cache tags and so on, but is that really in the same universe of complexity as taking on Nvidia and their CUDA/AI moat any other way? Did Intel ever give the impression they don't know how to design a cache? There really has to be a GOOD reason for this, otherwise everyone involved with this launch is just plain stupid or getting paid off not to pursue this.
Saying all this with infinite love and 100% commercial support for OpenCL since version 1.0. I'm a great enjoyer of the A770 with 16GB of memory, and I live to laugh in the face of people who have claimed for over 10 years that OpenCL is deprecated on macOS (which I cannot stand and will never use, yet the hardware it runs on...), while it still routinely crushes powerful desktop GPUs in reality and practice today.
Both Intel and AMD produce server chips with 12 channel memory these days (that's 12x64bit for 768bit) which combined with DDR5 can push effective socket bandwidth beyond 800GB/s, which is well into the area occupied by single GPUs these days.
You can even find some attractive deals on motherboard/ram/cpu bundles built around grey market engineering sample CPUs on aliexpress with good reports about usability under Linux.
Building a whole new system like this is not exactly as simple as just plugging a GPU into an existing system, but you also benefit from upgradeability of the memory, and not having to use anything like CUDA. llamafile, as an example, really benefits from AVX-512 available in recent CPUs. LLMs are memory bandwidth bound, so it doesn't take many CPU cores to keep the memory bus full.
Another benefit is that you can get a large amount of usable high bandwidth memory with a relatively low total system power usage. Some of AMD's parts with 12 channel memory can fit in a 200W system power budget. Less than a single high end GPU.
My desktop machine has had 128GB since 2018, but for the AI workloads currently commanding almost infinite market value, it really needs the 1TB/s bandwidth and teraflops that only a bona fide GPU can provide. An early AMD GPU with these characteristics is the Radeon VII with 16GB HBM, which I bought for 500 eur back in 2019 (!!!).
I'm a rendering guy, not an AI guy, so I really just want the teraflops, but all GPU users urgently need a 3rd market player.
That 128GB is hanging off a dual-channel memory bus with only 128 total bits of width. Which is why you need the GPU. The Epyc and Xeon CPUs I'm discussing have 6x the memory bandwidth, and will trade blows with that GPU.
At a mere 20x the cost or something, to say nothing about the motherboard etc :( 500 eur for 16GB of 1TB/s with tons of fp32 (and even fp64! The main reason I bought it) back in 2019 is no joke.
Believe me, as a lifelong hobbyist-HPC kind of person, I am absolutely dying for such a HBM/fp64 deal again.
$1,961.19: H13SSL-N Motherboard And EPYC 9334 QS CPU + DDR5 4*128GB 2666MHZ REG ECC RAM Server motherboard kit
https://www.aliexpress.us/item/3256807766813460.html
Doesn't seem like 20x to me. I'm sure spending more than 30 seconds searching could find even better deals.
Isn't 2666 MHz ECC RAM obscenely slow? 32 cores without the fast AVX-512 of Zen5 isn't what anyone is looking for in terms of floating point throughput (ask me about electricity prices in Germany), and for that money I'd rather just take a 4090 with 24GB memory and do my own software fixed point or floating point (which is exactly what I do personally and professionally).
This is exactly what I meant about Intel's recent launch. Imagine if they went full ALU-heavy on latest TSMC process and packaged 128GB with it, for like, 2-3k Eur. Nvidia would be whipping their lawyers to try to do something about that, not just their engineers.
Yes and no. I have been developing some local llama 3 inference software on a machine with 3200MT/s ECC RAM and a Ryzen 7 5800X:
https://github.com/ryao/llama3.c
My experience is that input processing (prompt processing) is compute bottlenecked in GEMM. AVX-512 would help there, although my CPU’s Zen 3 cores do not support it and the memory bandwidth does not matter very much. For output generation (token generation), memory bandwidth is a bottleneck and AVX-512 would not help at all.
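A rough way to see why the two phases bottleneck differently (illustrative numbers: an 8B-parameter model in bf16):

    # Arithmetic intensity sketch: FLOPs per byte of weights read from memory.
    params = 8e9
    bytes_per_weight = 2            # bf16
    flops_per_token = 2 * params    # one multiply + one add per weight

    def flops_per_byte(tokens_per_pass):
        bytes_read = params * bytes_per_weight   # weights read once per pass over the model
        return flops_per_token * tokens_per_pass / bytes_read

    print(flops_per_byte(512))  # prompt processing, 512 tokens per pass -> 512 FLOPs/byte: compute bound
    print(flops_per_byte(1))    # token generation, 1 token per pass     ->   1 FLOP/byte: bandwidth bound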
I don't think anyone's stopping you, buddy. Great chat. I hope you have a nice evening.
12 channel DDR5 is actually 12x32-bit. JEDEC in its wisdom decided to split the 64-bit channels of earlier versions of DDR into 2x 32-bit channels per DIMM. Reaching 768-bit memory buses with DDR5 requires 24 channels.
Whenever I see DDR5 memory channels discussed, I am never sure if the speaker is accounting for the 2x 32-bit channels per DIMM or not.
The question is whether there's enough overall demand for a GPU architecture with 4x the VRAM of a 5090 but only about 1/3rd of the bandwidth. At that point it would only really be good for AI inferencing, so why not make specialized inferencing silicon instead?
I genuinely wonder why no one is doing this? Why can't I buy this specialized AI inference silicon with plenty of VRAM?
Intel and Qualcomm are doing this, although Intel uses HBM and their hardware is designed to do both inference and training, while Qualcomm uses more conventional memory and their hardware is only designed to do inference:
https://www.intel.com/content/www/us/en/products/details/pro...
https://www.qualcomm.com/news/onq/2023/11/introducing-qualco...
They did not put it into the PC parts supply chain for reasons known only to them. That said, it would be awesome if Intel made high memory variants of their Arc graphics cards for sale through the PC parts supply chains.
I guess that would be an NPU combined with LPDDR. Basically any Windows Copilot Plus approved device.
Me too, probably 2x. They'd sell like hot cakes.
That would basically mean Intel doubling the size of their current GPU die, with a different memory PHY. They're clearly not ready to make that an affordable card. Maybe when they get around to making a chiplet-based GPU.
Are you suggesting that Intel 'just' release a GPU at the same price point as an M4 Max SOC? And that there would be a large market for it if they did so? Seems like an extremely niche product that would be demanding to manufacture. The M4 Max makes sense because it's a complete system they can sell to Apple's price-insensitive audience, Intel doesn't have a captive market like that to sell bespoke LLM accelerator cards to yet.
If this hypothetical 128GB LLM accelerator was also a capable GPU that would be more interesting but Intel hasn't proven an ability to execute on that level yet.
Nothing in my comment says about pricing it at the M4 Max level. Apple charges as much because they can (typing this on an $8000 M3 Max). 128GB LPDDR5 is dirt cheap these days just Apple adds its premium because they like to. Nothing prevents Intel from releasing a basic GPU with that much RAM for under $1k.
You're asking for a GPU die at least as large as NVIDIA's TU102 that was $1k in 2018 when paired with only 11GB of RAM (because $1k couldn't get you a fully-enabled die to use 12GB of RAM). I think you're off by at least a factor of two in your cost estimates.
If Intel came out with an ARC GPU with 128GB VRAM at a $2000 price point, I and many others would likely buy it immediately.
Though Intel should also identify say the top-100 finetuners and just send it to them for free, on the down low. That would create some market pressure.
Intel has Xeon Phi which was a spin-off of their first attempt at GPU so they have a lot of tech in place they can reuse already. They don't need to go with GDDRx/HBMx designs that require large dies.
I don't want to further this discussion, but maybe you don't realise that some of the people who replied to you either design hardware for a living or have been in the hardware industry for longer than 20 years.
While it is not a GPU, Qualcomm already made an inferencing card with 128GB RAM:
https://www.qualcomm.com/news/onq/2023/11/introducing-qualco...
It would be interesting if those saying that a regular GPU with 128GB of VRAM cannot be made would explain how Qualcomm was able to make this card. It is not a big stretch to imagine a GPU with the same memory configuration. Note that Qualcomm did not use HBM for this.
For some reason Apple did it with the M3/M4 Max, likely designed by folks who are also on HN. The question is how many of the years spent designing HW were also spent educating oneself on the latest and best ways to do it.
>For some reason.....
They already replied with an answer.
Even LPDDR requires a large die. It only takes things out of the realm of technologically impossible to merely economically impractical. A 512-bit bus is still very inconveniently large for a single die.
> release a GPU at the same price point as an M4 Max SOC
Why would it need to be introduced at Apple's high-margin pricing?
It's also impossible and it would need to be a CPU.
CPUs and GPUs access memory very differently.
Thank you, wtallis. Somewhere along the line, this basic "knowledge" of hardware got completely lost. I wouldn't have expected this to need explaining in any comment section on old AnandTech. It seems hardware enthusiasm has mostly disappeared; I guess that is also why AnandTech closed. We now live in a world where most sites are just BS rumours.
Qualcomm made an AI inferencing card with 128GB RAM without using HBM:
https://www.qualcomm.com/news/onq/2023/11/introducing-qualco...
Would someone with “basic ‘knowledge’ of hardware” explain why a GPU cannot be made with the same memory configuration?
That's because Anand Lal Shimpi is a CompE by training.
Not too many hardware enthusiast site editors have that academic background.
And while fervor can sometimes substitute for education... probably not in microprocessor / system design.
The Real World Technologies forum is still an absolute gold mine for hardware discussion.
As for articles, IMO, Chips and Cheese is the closest thing we have to Anandtech or RWT in their peak.
It is possible to have multiple memory ranks to reduce the bus width requirements for a given amount of memory. Nvidia has demonstrated that this is doable with GDDR6X on the RTX 3090. The RTX 3090 has a 384-bit bus with 24 memory ICs, despite only needing 12 to reach 384-bit. That means it has every two chips sharing one 32-bit interface, which is a dual rank configuration. If you look at the history of computer memory, you can find many examples of multi-rank configurations. I also recall LR-DIMMs as being another way of achieving this.
Achieving 128GB VRAM with a 256-bit bus (which seems like a reasonable bus width) would mean some multiple of 8 chips. If Micron, Samsung or SK Hynix made 128Gb GDDR7 chips, then 8 would suffice. The best right now seems 24Gb, although 32Gb seems likely to follow (and it would likely come sooner if a large customer such as Intel asked for it), so they would just need to have 32 chips in a quad rank configuration to achieve 128GB.
This assumes that there is no limit in the GDDR7 specification that prevents quad rank configurations. If there is and it still supports dual rank like GDDR6X did, then a 512-bit bus could be done. It would likely be extremely pricy and require a new chip tape out that has much more IO logic transistors to handle the additional bus width (and IO logic transistor scaling is dead, so the die area would be huge), but it is hypothetically possible. Given how much people are willing to pay for more VRAM, it could make business sense to do.
Even if there is no limit in the GDDR7 specification that prevents quadrank, their memory IO logic would need to support it and if it does not, they would need to redesign that and do a new chip tape out in addition to a new board design. This would also be very expensive, although not as expensive as going to a 512-bit memory interface.
In summary, adding more memory would cost more to do and it would not improve competitiveness in the target market for these cards, which I imagine is the main reason that they do not do it.
By the way, the reason that Nvidia implemented support for 2 chips per channel is because they wanted to be able to reach 48GB VRAM on the workstation variant of the 3090 that is known as the RTX A6000 (non-Ada). I do not know why they used 24x 8Gb chips rather than 12x 16Gb on the 3090, although if I had to guess, it had something to do with rank interleaving.
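For what it's worth, the capacity arithmetic above works out like this (a sketch; it assumes 32Gbit GDDR7 parts actually ship and that the rank counts are electrically viable, which is the open question):

    # VRAM = channels * ranks * per-chip capacity, with one 32-bit chip per channel per rank.
    def vram_gb(bus_bits, ranks, chip_gbit):
        channels = bus_bits // 32
        return channels * ranks * chip_gbit / 8

    print(vram_gb(256, 4, 32))  # 256-bit, quad rank, 32Gbit chips -> 128 GB (32 chips)
    print(vram_gb(512, 2, 32))  # 512-bit, dual rank, 32Gbit chips -> 128 GB (32 chips)
    print(vram_gb(384, 2, 8))   # RTX 3090: 384-bit, dual rank, 8Gbit -> 24 GB (24 chips)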
Having four chips per channel is exactly why this is implausible. DDR5 can barely operate with four ranks per channel, at severely reduced speeds. Pulling that off with GDDR6 or GDDR7 is not something we can presume to be possible without specific evidence. The highest-density configurations possible for LPDDR5x are dual-rank and byte mode (one chip per 8 bits of the memory bus, so two chips ganged together to populate a 16-bit channel) — and that still operates at less than half the speed of GDDR6.
I've not seen any proposals for buffering LPDDR or GDDR, so an analog to LRDIMMs is not a readily-available technology.
GDDR is the memory technology that operates at the edge of what's possible for per-pin bandwidth. Loading that memory bus down with many ranks is not something we can expect to be achievable by just putting down more pads on the PCB.
> DDR5 can barely operate with four ranks per channel, at severely reduced speeds.
That is objectively false. See, for instance, V-color’s threadripper RAM[0]. If 96GB quad-rank modules @ 6000Mhz in octo-channel counts as “barely operating” maybe we have different definitions of operation requirements.
As a side note, their quad-channel 3-rank RAM [1] hits 8000MHz, out of the box. Admittedly only 24GB modules, but still.
[0] https://v-color.net/products/ddr5-ocrdimm-amd-wrx90-workstat... [1] https://v-color.net/products/ddr5-oc-rdimm-amd-trx50-worksta...
You linked to registered/buffered memory modules. I already addressed that case; it doesn't apply to LPDDR or GDDR.
In that case, we need a 512-bit memory bus to do this using the 32Gbit GDDR7 chips that should be on the market in the near future. This would be very expensive, but it should be possible, or do you see a reason why that cannot be done either?
That said, I am not an electrical engineer (although I work alongside one and have had a minor role in picking low-end components for custom PCBs), but I think that if Intel were to make a GPU with 128GB VRAM using GDDR7 in the next year or two, the engineer who does the trace routing to make it possible should set up a GoFundMe page for people to send beer money.
I think the goalposts may have shifted a bit, from why hasn't Intel made such a card to why is Intel not (publicly) working on such a card to be released in a year or two.
In terms of what would have been feasible for Intel to bring to market in 2024, the cheapest option for 128GB capacity would probably have been ~8.5Gb/s LPDDR5x on a 256-bit bus, but to at least match the bandwidth of the chip they just launched, it would have made more sense to use a 512-bit bus and bump the die size back up to ~half the reticle limit like their previous generation die with a 256-bit bus. So they would have had a quite slow but high-capacity GPU with a manufacturing cost equal to at least an RTX 4080, before adding in the cost of all that DRAM. And if they had started working on that chip as soon as LLaMA went public, they might have been able to deliver it by now.
It's no surprise at all that such a risky niche product did not emerge from a division of Intel that is lucky to not have been liquidated yet.
In hindsight, I misread you as saying that 128GB of RAM on a “basic GPU” is not technically feasible. My reply was to say it is feasible.
Intel is rumored to have a B770 GPU in development, but it was running late and then was delayed to next year since it had yet to tape out, so they are launching their B580 and B570 graphics cards, which had been ready to go for a while, now. That is why the bus size appears to have dropped across generations. Presumably, if they made a 512-bit bus version, it would be a 9 series card. They certainly left room for it in their lineup, but as far as leaks have been concerned, there is not much hope for one. I do not expect them to use anything other than GDDR7 on their battlemage cards.
As for a high memory ARC card, I am of the opinion that such a product would sell well among the local llama community. There might even be more sales of a high memory ARC card for inference than of the regular ARC cards for gaming given that their discrete graphics sales peaked at 250,000 in Q1 2023 before collapsing, which can be confirmed using the data here:
https://www.tomshardware.com/pc-components/gpus/discrete-gpu...
The market for high memory GPUs is surely bigger than that. That said, Intel is likely pricing their ARC GPUs at a loss after R&D costs are considered. This is likely intended to help them break into a new market, although it has not been going well for them so far. I would guess that they are at least a generation away from profitability.
Intel intends for its Gaudi 3 accelerators to be used for this rather than the ARC line. Those coincidentally have 128GB of RAM, but they use HBM rather than a DDR variant. Qualcomm on the other hand made its own accelerator with 128GB of LPDDR4x RAM:
https://www.qualcomm.com/news/onq/2023/11/introducing-qualco...
If my math is right, Qualcomm went with a 1024-bit memory bus and some incorrect rounding (rounding 137.5 to 138 before multiplying by 4) to reach their stated bandwidth figure. Qualcomm is not selling it through the PC parts supply chain, so I have no clue how much it costs, but I assume that it is expensive. I assume that they used LPDDR4x to be able to build a product since they were too late in securing HBM supply and even if they did, they would not be able to scale production to meet demand growth since Nvidia is buying all of the HBM that it can.
What if they put 8 identical GPUs in the package, each with 1/8 the memory? Would that be a useful configuration for a modern LLM?
GPU inference is always a balancing act, trying to avoid bottlenecks on memory bandwidth (loading data from the GPU's global memory/VRAM to the much smaller internal shared memory, where it can be used for calculations) and compute (once the values are loaded).
Splitting the model up between several GPUs would add a third much worse bottleneck – memory bandwidth between the GPUs. No matter how well you connect them, it'll be slower than transfer within a single GPU.
Still, the fact that you can fit an 8× larger GPU might be worth it to you. It's a trade-off that's almost universally made while training LLMs (sometimes even with the model split down both its width and length), but is much less attractive for inference.
> Splitting the model up between several GPUs would add a third much worse bottleneck – memory bandwidth between the GPUs.
What if you allowed the system to only have a shared memory between every neighboring pair of GPUs?
Would that make sense for an LLM?
At least for LLMs and transformers this isn't relevant. Having 8x the chips and 8x the memory bandwidth is always better. Interchip communication for matrix multiplication against a constant left matrix with a tiny right matrix isn't bandwidth bound, only latency bound.
Yes:
https://news.ycombinator.com/item?id=42313615
Last I've heard, the architecture makes that difficult. But my information may be outdated, and even if it isn't, I'm not a hardware designer and may have just misunderstood the limits I hear others discuss.
K80 used to be two glued K40 but their interconnect was barely faster than PCIe so it didn't have much benefit as one had to move stuff between two internal GPUs anyway.
Inference workloads likely won’t care very much. For llama 3.1 405B with bf16 when you split the workload across GPUs by layer, you need to do a 32KB memory copy before the next GPU can begin processing. That can be done incredibly quickly over PCI-E.
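That 32KB is just the hidden-state vector crossing the layer boundary; sketch of the arithmetic (assuming the published 16384 model dimension for llama 3.1 405B):

    # Per-token activation that hops between GPUs when a model is split by layers.
    hidden_size = 16384          # llama 3.1 405B model dimension
    bytes_per_value = 2          # bf16
    transfer_bytes = hidden_size * bytes_per_value
    print(transfer_bytes)        # 32768 bytes = 32 KB per token per pipeline boundary

    # Even at a conservative 10 GB/s of effective PCIe throughput this is ~3 microseconds,
    # which is noise next to the time spent computing the token itself.
    print(transfer_bytes / 10e9 * 1e6, "us")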
It could work, but would it be cost-competitive?
Also, cooling.
> "Just add more RAM" doesn't work the way you wish it could.
Re: Tomasulo's algorithm the other day: https://news.ycombinator.com/item?id=42231284
Cerebras WSE-3 has 44 GB of on-chip SRAM per chip and it's faster than HBM. https://news.ycombinator.com/item?id=41702789#41706409
Intel has HBM2e off-chip RAM in Xeon CPU Max series and GPU Max;
What is the difference between DDR, HBM, and Cerebras' 44GB of on-chip SRAM?
How do architectural bottlenecks due to modified Von Neumann architectures' debuggable instruction pipelines limit computational performance when scaling to larger amounts of off-chip RAM?
Tomasulo's algorithm also centralizes on a common data bus (the CPU-RAM data bus) which is a bottleneck that must scale with the amount of RAM.
Can in-RAM computation solve for error correction without redundant computation and consensus algorithms?
Can on-chip SRAM be built at lower cost?
Von Neumann architecture: https://en.wikipedia.org/wiki/Von_Neumann_architecture#Von_N... :
> The term "von Neumann architecture" has evolved to refer to any stored-program computer in which an instruction fetch and a data operation cannot occur at the same time (since they share a common bus). This is referred to as the von Neumann bottleneck, which often limits the performance of the corresponding system. [4]
> The von Neumann architecture is simpler than the Harvard architecture (which has one dedicated set of address and data buses for reading and writing to memory and another set of address and data buses to fetch instructions).
Modified Harvard architecture > Comparisons: https://en.wikipedia.org/wiki/Modified_Harvard_architecture
C-RAM: Computational RAM > DRAM-based PIM Taxonomy, See also: https://en.wikipedia.org/wiki/Computational_RAM
SRAM: Static random-access memory https://en.wikipedia.org/wiki/Static_random-access_memory :
> Typically, SRAM is used for the cache and internal registers of a CPU while DRAM is used for a computer's main memory.
For whatever reason Hynix hasn't turned their PIM into a usable product. LPDDR based PIM is insanely effective for inference. I can't stress this enough. An NPU+LPDDR6 PIM would kill GPUs for inference.
Is this fast enough for DDR or SRAM RAM? "Breakthrough in avalanche-based amorphization reduces data storage energy 1e-9" (2024) https://news.ycombinator.com/item?id=42318944
How many TOPS/W and TFLOPS/W? (T [Float] Operations Per Second per Watt (hour *?))
/? TOPS/W and FLOPS/W: https://www.google.com/search?q=TOPS%2FW+and+FLOPS%2FW :
- "Why TOPS/W is a bad unit to benchmark next-gen AI chips" (2020) https://medium.com/@aron.kirschen/why-tops-w-is-a-bad-unit-t... :
> The simplest method therefore would be to use TOPS/W for digital approaches in future, but to use TOPS-B/W for analogue in-memory computing approaches!
> TOPS-8/W
> [ IEEE should spec this benchmark metric ]
- "A guide to AI TOPS and NPU performance metrics" (2024) https://www.qualcomm.com/news/onq/2024/04/a-guide-to-ai-tops... :
> TOPS = 2 × MAC unit count × Frequency / 1 trillion
- "Looking Beyond TOPS/W: How To Really Compare NPU Performance" (2023) https://semiengineering.com/looking-beyond-tops-w-how-to-rea... :
> TOPS = MACs * Frequency * 2
> [ { Frequency, NNs employed, Precision, Sparsity and Pruning, Process node, Memory and Power Consumption, utilization} for more representative variants of TOPS/W metric ]
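Plugging the quoted formula into a concrete (entirely made-up) NPU shows how sensitive the headline number is to the assumed frequency, precision and power:

    # TOPS = 2 * MAC units * frequency, per the formula quoted above.
    def tops(mac_units, freq_hz):
        return 2 * mac_units * freq_hz / 1e12

    peak = tops(16_384, 1.5e9)                 # hypothetical NPU: 16384 INT8 MACs at 1.5 GHz
    print(f"{peak:.1f} TOPS peak")             # ~49 TOPS
    print(f"{peak / 15:.1f} TOPS/W at 15 W")   # headline efficiency, ignoring utilization and sparsity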
GDDR isn't like the RAM that connects to the CPU; it's much more difficult and expensive to add more. You can get up to 48GB with some expensive stacked GDDR, but if you wanted to add more stacks you'd need to solve some serious signal-timing headaches that most users wouldn't benefit from.
I think the high-memory local inference stuff is going to come from "AI enabled" CPUs that share the memory in your computer. Apple is doing this now, but cheaper options are on the way. As a shape it's just suboptimal for graphics, so it doesn't make sense for any of the GPU vendors to do it.
As someone else said, I don't think you have to have GDDR; surely there are other options. Apple does a great job of it on their APUs with up to 192GB, and even an old AMD Threadripper chip can do quite well with its DDR4/5 performance.
For AI inference you definitely have other options, but for low-end graphics? The LPDDR that Apple (and Nvidia in Grace) use would be super expensive for comparable bandwidth (think $3+/GB, and to get 500GB/sec you need at least 128GB).
And that 500GB/sec is pretty low for a GPU; it's like a 4070, but the memory alone would add $500+ to the cost of the inputs, not even counting the advanced packaging (getting those bandwidths out of LPDDR needs an organic substrate).
It's not that you can't, just when you start doing this it stops being like a graphics card and becomes like a cpu.
They can use LPDDR5x, it would still massively accelerate inference of large local LLMs that need more than 48GB RAM. Any tensor swapping between CPU RAM and GPU RAM kills the performance.
I think we don't really disagree; I just think that this shape isn't really a GPU, it's just a CPU, because it isn't very good for graphics at that point.
That's why I said "basic GPU". It doesn't have to be too fast, but it should still be way faster than a regular CPU. Intel already has Xeon Phi, so a lot of the pieces were developed already (memory controllers, heavily parallel dies, etc.).
I wonder at that point you'd just be better served by CPU with 4 channels of RAM. If my math is right 4 channels of DDR5-8000 would get you 256GB/s. Not as much bandwidth as a typical discrete GPU, but it would be trivial to get many hundreds of GB of RAM and would be expandable.
Unfortunately I don't think either Intel or AMD makes a CPU that supports quad channel RAM at a decent price.
4 channels of DDR5-8000 would give you 128GB/sec. DDR5 has 2x32-bit channels per DIMM rather than 1x64-bit channel like DDR4 did. You would need 8 channels of DDR5-8000 to reach 256GB/sec.
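Sketch of that correction, counting DDR5's 32-bit subchannels explicitly:

    # DDR5: each channel is 32 bits wide, two channels per DIMM.
    def ddr5_bandwidth_gbs(subchannels, megatransfers_per_s):
        return subchannels * 32 / 8 * megatransfers_per_s / 1000

    print(ddr5_bandwidth_gbs(4, 8000))   # 4 subchannels (2 DIMMs) of DDR5-8000 -> 128 GB/s
    print(ddr5_bandwidth_gbs(8, 8000))   # 8 subchannels (4 DIMMs)              -> 256 GB/s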
I think all the people saying "just use a CPU" massively underestimate the speed difference between current CPUs and current GPUs. There's like four orders of magnitude. It's not even in the same zip code. Say you have a 64-core CPU at 2GHz with 512-bit 1-cycle FP16 instructions. That gives you 32 ops per cycle, 2048 across the entire package, so 4TFlops.
My 7900 XTX does 120TFlops.
To match that, you would need to scale that CPU up to either 2048 cores, 2KB per register (still one-cycle!) or 64GHz.
I guess if you had 1024-bit registers and 8GHz, you could get away with only 240 cores. Good luck dissipating that heat, btw. To reverse an opinion I'm seeing in this thread, at that point your CPU starts looking more like a GPU by necessity.
Usually, you can do 2 AVX-512 operations per cycle and using FMADD (fused multiply-add) instructions, you can do two floating point operations for the price of one. That would be 128 operations per cycle per core. The result would be 16TFlops on a 2GHz 64 core CPU, not 4 TFlops. This would give a 1 order of magnitude difference, rather than 4 orders of magnitude.
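Spelled out, that per-core math looks like this (assuming fp16 vectors and two FMA pipes per core, which not every AVX-512 implementation has):

    # Peak fp16 throughput of a hypothetical 64-core, 2 GHz CPU with 2 AVX-512 FMA pipes per core.
    lanes = 512 // 16              # 32 fp16 lanes per vector register
    flops_per_fma = 2              # multiply + add
    pipes = 2                      # FMA issue ports per core
    per_core_per_cycle = lanes * flops_per_fma * pipes   # 128 FLOPs per cycle
    cores, freq_hz = 64, 2e9
    print(per_core_per_cycle * cores * freq_hz / 1e12, "TFLOPS")   # ~16.4 TFLOPS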
For inference, prompt processing is compute intensive, while token generation is memory bandwidth bound. The differences in memory bandwidth between CPUs and GPUs tend to be more profound than the difference in compute.
That's fair. On the other hand, there's like exactly one CPU with FP16 AVX512 anyways, and 64core aren't exactly commonplace either. And even with all those advantages, using a datacenter CPU, you're still a factor of 10 off from a GPU that isn't even consumer top-end. With a normal processor, say 16 cores, 16 float ops, even with fused ops and dispatching two ops per cycle you're still only at 2T and ~50x. In consumer spaces, I'm more optimistic about dedicated coprocessors. Maybe even iTPU?
This is only relevant for the flash attention part of the transformer, but an NPU is an equally suitable replacement for a GPU for flash attention.
Once you have offloaded flash attention, you're back to GEMV having a memory bottleneck. GEMV does a single multiplication and addition per parameter. You can add as many EXAFLOPs as you want, it won't get faster than your memory.
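The resulting ceiling is easy to estimate (a sketch; model size, quantization and bandwidths are round illustrative numbers):

    # Upper bound on decode speed when every weight is read once per generated token (dense GEMV).
    def max_tokens_per_s(params, bytes_per_param, bandwidth_gbs):
        return bandwidth_gbs * 1e9 / (params * bytes_per_param)

    print(max_tokens_per_s(70e9, 1, 456))   # 70B model, 8-bit weights, ~456 GB/s GPU -> ~6.5 tok/s
    print(max_tokens_per_s(70e9, 1, 100))   # same model on ~100 GB/s CPU memory      -> ~1.4 tok/s
    # Adding FLOPs does not move these numbers; only more bandwidth or fewer bytes per weight does.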
Out of interest, how does that look for diffusion?
I guess it's hard to know how well this would compete with integrated GPUs, especially at a reasonable price point. If you wanted to spend $4000+ on it, it could be very competitive and might look something like Nvidia's Grace Hopper superchip, but if you want the product to be under $1k I think it might be better just to buy separate cards for your graphics and your AI stuff.
It is not stacked. It is multirank. Stacking means putting multiple layers on the same chip. They are already doing it for HBM. They will likely do it for other forms of DRAM in the future. Samsung reportedly will begin doing it in the 2030s:
https://www.tomshardware.com/pc-components/dram/samsung-outl...
I am not sure why they can already do stacking for HBM, but not GDDR and DDR. My guess is that it is cost related. I have heard that HBM reportedly costs 3 times more than DDR. Whatever they are doing to stack it now that is likely much more expensive than their planned 3D fabrication node.
HBM3E memory is at least 3x the price of DDR5 (it requires 3x the wafer area of DDR5), and capacity is sold out for all of 2025 already... that's the price and production bottleneck.
High speed, low latency server grade DDR5 is around $800-$1600 for 128GB. Triple that for $2400 - $4800 just for the memory. Still need the GPUs/APUs, card, VRMs, etc.
Even the nVidia H100 with "only" 94GB starts at $30k...
Nvidia's $30,000 is a 90% margin product at scale. They could charge 1/3 that and still be very profitable. There has rarely been such a profitable large corporation in terms of the combo of profit & margin.
Their last quarter was $35b in sales and $26b in gross profit ($21.8b op income; 62% op income margin vs sales).
Visa is notorious for their extreme margin (66% op income margin vs sales) due to being basically a brand + transaction network. So the fact that a hardware manufacturer is hitting those levels is truly remarkable.
It's very clear that either AMD or Intel could accept far lower margins to go after them. And indeed that's exactly what will be required for any serious attempt to cut into their monopoly position.
Visa doesn't actually make a ton of money off each transaction, if you divide out their revenue against their payment volume (napkin math)...
They processed $12T in payments last year (almost a billion payments per day), with a net revenue of $32B. That's a gross transaction margin of 0.26% and their GAAP net income was half that, about 0.14%. [1]
They're just a transaction network, unlike say Amex which is both an issuer and a network. Being just the network is more operationally efficient.
[1] https://annualreport.visa.com/financials/default.aspx
That’s a weird way to account for their business size. There isn’t a significant marginal cost per transaction. They didn’t sell $12T in products. They facilitated that much in payments. Their profits are fantastic.
If you have no clue how profit margins are calculated then you're better off staying quiet.
It's quite simple. Divide revenue minus costs by revenue. Transaction volume isn't revenue. Visa only gets the transaction fee.
Even if I give you the benefit of the doubt and do a proper interpretation of the number you've arrived at, its meaning is quite different and quite off topic from this discussion. What you have calculated is the total share of costs that Visa represents in that 12 trillion dollar part of the economy. It is like saying Visa's share of GDP is 0.1%.
> And indeed that's exactly what will be required for any serious attempt to cut into their monopoly position.
You misunderstand why and how Nvidia is a monopoly. Many companies make GPUs, and all those GPUs can be used for computation if you develop compute shaders for them. This part is not the problem, you can already go buy cheaper hardware that outperforms Nvidia if price is your only concern.
Software is the issue. That's it - it's CUDA and nothing else. You cannot assail Nvidia's position, and moreover their hardware's value, without a really solid reason for datacenters to own them. Datacenters do not want to own GPUs because once the AI bubble pops they'll be bagholders for Intel and AMD's depreciated software. Nvidia hardware can at least crypto mine, or be leased out to industrial customers that have their own remote CUDA applications. The demand for generic GPU compute is basically nonexistent, the reason this market exists at all is because CUDA exists, and you cannot turn over Nvidia's foothold without accepting that fact.
The only way the entire industry can fuck over Nvidia is if they choose to invest in a complete CUDA replacement like OpenCL. That is the only way that Nvidia's value can be actually deposed without any path of recourse for their business, and it will never happen because every single one of Nvidia's competitors hate each other's guts and would rather watch each other die in gladiatorial combat than help each other fight the monster. And Jensen Huang probably revels in it, CUDA is a hedged bet against the industry ever working together for common good.
I feel people are exaggerating the impossibility of replacing CUDA. Adopting CUDA is convenient right now because yes it is difficult to replace it. Barrier to entry for orgs that can do that is very high. But it has been done. Google has the TPU for example.
The TPU is not a GPU nor is it commercially available. It is a chip optimized around a limited featureset with a limited software layer on top of it. It's an impressive demonstration on Google's behalf to be sure, but it's also not a shot across the bow at Nvidia's business. Nvidia has the TSMC relations, a refined and complex streaming multiprocessor architecture and actual software support their customers can go use today. TPUs haven't quite taken over like people anticipated anyways.
I don't personally think CUDA is impossible to replace - but I do think that everyone capable of replacing CUDA has been ignoring it recently. Nvidia's role as the GPGPU compute people is secure for the foreseeable future. Apple wants to design simpler GPUs, AMD wants to design cheaper GPUs, and Intel wants to pretend like they can compete with AMD. Every stakeholder with the capacity to turn this ship around is pretending like Nvidia doesn't exist and whistling until they go away.
I don’t disagree with what you are saying but I want to point out that the fact that the TPU is not a GPU is not really relevant. In the end what matters most is whether or not it can accelerate PyTorch.
They're not exaggerating it. The more things change, the more they stay the same. Nvidia and AMD had the exact same relationship 15 years ago that they do today: the AMD crowd crowing about their better efficiencies, and the Nvidia crowd having grossly superior drivers/firmware/hardware, including unique PhysX stuff that STILL has not been matched since 2012 (remember Planetside 2 or Borderlands 2 physics? Pepperidge Farm remembers...).
So many billions of dollars and no one is even 1% close to displacing CUDA in any meaningful way. ZLUDA is dead, ROCm is a meme, SCALE is a meme. Either you use CUDA or you don't do meaningful AI work.
CUDA is not the issue. AMD have already reimplemented like 80% of it, and honestly that part of it mostly works fine. Pytorch supports it, (almost) all the big frameworks support it, if you're not doing really arcane things it just works. It's the drivers! They took like two years after the release of their flagship card to stop randomly crashing. Everything geohot has ever said about AMD drivers is 100% true. They just cannot stop shooting themselves in the foot.
What did he say?
Geohot (temporarily) giving up. https://github.com/ROCm/ROCm/issues/2198#issuecomment-157438... Sadly most of the real spicy Twitter messages are gone since he deleted all his content, but there was a really fun one where he went off on a beautifully cryptic commit message in the driver. He also begged AMD to opensource the firmware so he could debug it. Sadly, AMD promised to do it and then nothing happened, as is typical for AMD promises. That's why tinygrad nowadays is aiming to just bypass the driver and firmware entirely.
> The only way the entire industry can fuck over Nvidia is if they choose to invest in a complete CUDA replacement like OpenCL. That is the only way that Nvidia's value can be actually deposed without any path of recourse for their business, and it will never happen because every single one of Nvidia's competitors hate each other's guts and would rather watch each other die
Intel seems to have thrown their weight behind SYCL, which is an open standard intended to compete with CUDA. It's not clear there has been much interest from other hardware vendors, though.
I do not misunderstand why Nvidia has a monopoly. You jumped drastically beyond anything I was discussing and incorrectly assumed ignorance on my part. I never said why I thought they had one. I never brought up matters of performance or software or moats at all. I matter-of-factly stated they had a monopoly; you assumed the rest.
It's impossible to assail their monopoly without utilizing far lower prices, coming up under their extreme margin products. It's how it is almost always done competitively in tech (see: ARM, or Office (dramatically undercut Lotus with a cheaper inferior product), or Linux, or Huawei, or Chromebooks, or Internet Explorer, or just about anything).
Note: I never said lower prices are all you'd need. Who would think that? The implication is that I'm ignorant of the entire history of tech; it's a poor approach to discussion with another person on HN, frankly.
Nvidia's monopoly is pretty much detached from price at this point. That's the entire reason why they can charge insane margins - nobody cares! There is not a single business squaring Nvidia up with serious intent to take down CUDA. It's been this way for nearly two decades at this point, with not a single spark of hope to show for it.
In the case of ARM, Office, Linux, Huawei, and ChromeOS, these were all actual alternatives to the incumbent tools people were familiar with. You can directly compare Office and Lotus because they are fundamentally similar products, and ARM had a real chance against x86 because it wasn't a complex ISA to unseat. Nvidia is not analogous to these businesses because they occupy a league of their own as the provider of CUDA. It's not an exaggeration to say that they have completely seceded from the market of GPUs and can sustain themselves on demand from crypto miners and AI pundits alone.
AMD, Intel and even Apple have bigger things to worry about than hitting an arbitrary price point, if they want Nvidia in their crosshairs. All of them have already solved the "sell consumer tech at attractive prices" problem but not the "make it complex, standardize it and scale it up" problem.
It is cheaper to pay Nvidia than it is to roll your own solution and no one else is competitive. That is the reason Nvidia can charge so much per card.
Thank you for laying it out. It's so silly to see people in the comments act like Intel or Nvidia can't EASILY add more VRAM to their cards. Every single argument against it is all hogwash.
Meta comment: the phrase "why don't they just" usually indicates significant ignorance about a subject; it's better to learn a little before dispensing criticism about beancounters or whatnot.
In this case, the die's I/O limits preclude more than a reasonable number of DDR channels.
Or literally "just" take the "just" out of the question and 90% of the time the tone becomes inquisitive instead of rhetorical.
The OP asked a question and got a bunch of answers about why they couldn't do just that. I think that's a win.
Because you can't stack that much RAM on a GPU without enough channels to feed it. You could probably do 64GB on GDDR6, but you can't do 128GB on GDDR6 without more memory channels. 2GB per chip per channel is the current limit for GDDR6; this is why HBM was invented.
It's why you only see GPUs with 24GB of memory at the moment.
HBM2 can handle 64GB (4 x 8GB stacks, 128GB total capacity).
HBM3 can handle 192GB (4 x 24GB stacks, 384GB total capacity).
You cannot do this with GDDR6.
Look at the RTX 3090 and RTX A6000 (the non-Ada one). They both have 24 memory chips with a 384-bit memory bus, but one has 24GB of VRAM and the other has 48GB of VRAM. They both have two chips per channel. This breaks the 24GB VRAM limit that you claim to exist.
Even 24 or 32 GB for an accessible price would sell out fast. NVIDIA wants $2000 for the 5090 to get 32.
After carefully reviewing all of the other comments explaining the many technical and organizational reasons why they should not “just do that,” I have come to the conclusion that it was a big missed opportunity by Intel.
This GPU has a 192-bit memory bus. At 32 bits of GDDR bus width per chip (with GDDR6, that's two 16-bit data channels per chip), that means you have 6 channels. With regular GDDR6, the largest size in production is 16Gb (2GB), so 12GB is what you get. You could double that with a clamshell layout on a beefed-up PCB, if the memory controller supports it, to get up to 24GB (the way workstation cards like the W7900 and A6000 do).
Beyond that, you'd have to move to GDDR7 (which has 24Gb/3GB chips incoming) or to HBM stacks, but at that point you're well beyond a "basic GPU". I think the only way you could get to 128GB would be either using regular (LP)DDR or HBM.
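To make that concrete, here's a back-of-the-envelope sketch in Python; the 32-bit-per-chip channel width and clamshell doubling are the same assumptions as above, and the chip densities are just illustrative inputs:

    # Rough GDDR capacity math: chips = bus width / 32 bits per chip,
    # optionally doubled with a clamshell layout (two chips per channel).
    def max_vram_gb(bus_width_bits, chip_density_gbit, clamshell=False):
        chips = bus_width_bits // 32
        if clamshell:
            chips *= 2
        return chips * chip_density_gbit / 8  # Gbit per chip -> GB total

    print(max_vram_gb(192, 16))                  # 12.0 -> 192-bit card, 16Gb GDDR6
    print(max_vram_gb(192, 16, clamshell=True))  # 24.0 -> same bus, clamshell
    print(max_vram_gb(384, 16, clamshell=True))  # 48.0 -> A6000-style 384-bit clamshell
    print(max_vram_gb(384, 32, clamshell=True))  # 96.0 -> hypothetical 32Gb GDDR7 clamshell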
Note, Apple M chips have weak GPUs with decent MBW and large memory capacities (up to 192GB @ 800GB/s for an M2 Ultra, launched mid 2023) and have not been a major CUDA threat so I don't think your hypothesis actually stands up.
48 GB is at the tail end of what's reasonable for normal GPUs. The IO requires a lot of die space. And Intel's architecture is not very space-efficient right now compared to Nvidia's.
> The IO requires a lot of die space.
And even if you spend a lot of die space on memory controllers, you can only fit so many GDDR chips around the GPU core while maintaining signal integrity. HBM sidesteps that issue but it's still too expensive for anything but the highest-end accelerators, and the ordinary LPDDR that Apple uses is lacking in bandwidth compared to GDDR, so they have to compensate with ginormous amounts of IO silicon. The M4 Ultra is expected to have similar bandwidth to a 4090, but the former will need a 1024-bit bus to get there while the latter is only 384-bit.
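Rough numbers behind that comparison (aggregate bandwidth is just bus width times per-pin data rate; the 4090 and M2 Ultra figures are the published ones, the LPDDR5X-8533 line is speculative):

    def bandwidth_gb_s(bus_width_bits, gbps_per_pin):
        # aggregate bandwidth = bus width (bits) * per-pin data rate / 8
        return bus_width_bits * gbps_per_pin / 8

    print(bandwidth_gb_s(384, 21))      # ~1008 GB/s -> RTX 4090, GDDR6X at 21 Gbps
    print(bandwidth_gb_s(1024, 6.4))    # ~819 GB/s  -> M2 Ultra, LPDDR5-6400
    print(bandwidth_gb_s(1024, 8.533))  # ~1092 GB/s -> speculative LPDDR5X-8533, 1024-bit bus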
Going off of how the 4090 and 7900 XTX are arranged, I think you could maybe fit one or two more chips around the die beyond their 12, but that's still a far cry from 128. That would probably need a shared bus like normal DDR, since you're not fitting that much with 16Gbit density.
Look at the 3090, which uses 24 chips (12 on one side and 12 on another). Pushing it to 32 is doable. 32 is all you need to reach 128GB VRAM with the 32Gbit GDDR7 chips that should be on the market in the near future.
Where would you route the connection to the additional 4 groups of chips around the die? The PCIe connection needs to be there too, and they also may not like power delivery going through them
What if we did what others suggested was the practical limit - 48GB. Then just put 2-3 cards in the system and maybe had a little bridge over a separate bus for them to communicate?
I believe that would need some software work from Intel where they're lacking a bit now with their delayed start. Not sure how the frameworks themselves split up the inference work to avoid crossing GPUs as the bandwidth is horrible there.
If we're being reasonable and say you're not using a modern HEDT CPU that costs a couple thousand dollars, the best a consumer motherboard can get right now is 2x PCIe 5.0 x8 at 32GB/s each, plus one chipset PCIe 4.0 x8 at 16GB/s. I'm not sure a motherboard like that actually exists, but Intel's chipset should allow it; AMD only does x4 to the chipset, so the third slot is limited by that.
Totally agree. Someone needs to exploit the lack of available GPU memory in graphics cards for model runners. Even training tends to run into memory issues with the current cards.
armchair hardware enthusiast opinion: because silicon in the high-end is expensive, and not just a matter of slapping things together.
besides the clear limitation of the memory technology they are using compared to nvidia's enterprise solution, for such large GPU chips that could really make use of such memory, they need to make binning possible by selling cut-down versions of them as well.
nvidia can pull this off because they can sell lower-end chips at the same time. intel is barely making a dent in sales, and making a high-end chip would be very risky, for the benefit of a niche crowd at best.
> put them as a major CUDA threat
That is a software/ecosystem problem, which hardware alone cannot solve. For all the devs that use Macs, even in AI it is only about inference at the moment. Nobody is coming at CUDA for training in the near future. AMD tried and failed plenty already.
They don't need to do 128GB; 48GB+ would eat Nvidia's lunch. Intel and AMD are sleeping.
Qualcomm built a card designed to do inferencing with 128GB of RAM:
https://www.qualcomm.com/news/onq/2023/11/introducing-qualco...
I have no idea how much it costs. They do not sell it via PC parts channels.
Isn't nobody actually making anything close to a profit in the AI space? When your entire business model is propped up by VC money, making the bill of materials you use to generate negative profit cheaper is probably not very high on your list of priorities.
I think a better idea would be an NPU with slower memory, or tie it to the system DDR. I don't think consumer inference (possibly even training) applications would need the memory bandwidth offered by GDDR/HBM. Inference on my 7950x is already stupid fast (all things considered).
The deeper problem is that the market for this is probably incredibly niche.
Judging by the number of 16 GB laptops I see around, 128 GB of RAM would probably cost a bajillion dollars
One of the great things about having a desktop is being able to get that much for under $300 instead of the price of a second laptop.
Desktop PC RAM doesn't have the bandwidth or latency required for running AI.
Not the laptop RAM. It costs pennies; Apple is just charging $200 for 12GB because they can. It's way too slow, though.
And Nvidia doesn't want to cannibalize its high-end chips by putting more memory into consumer ones.
The size of the local inference market is too small. Maybe a couple of thousand LLM enthusiasts? It's not enough to make a profit or even breakeven on the development costs for the hardware.
For now. This might very well change once the general public realizes they can be movie directors (or generative world gamers) just by downloading some model and plugging in an eGPU. The potential inference market is huge
The question is: is there a real market for this? I do think it could bootstrap from local inference enthusiasts, but it is not clear-cut.
Rather than go all in with 128GB, they could test the waters easily with a cheap 32GB offering and take it from there.
They are probably held back by the same reason that's preventing AMD and Nvidia from doing it.
The reason AMD and Nvidia don't is that they don't want to cannibalize their high-end AI market. Intel doesn't have a high-end AI market to protect.
There are products like this one: https://www.intel.com/content/www/us/en/products/sku/232592/...
As far as I understand it, it gives you 64 GiB of HBM per socket.
That's a CPU. We are talking about GPUs here, for highly parallel matrix multiplication problems. Two different beasts.
NVidia and AMD make $$$ on datacenter GPUs, so it makes sense they don't want to discount their own high end. Intel has nothing there, so they can happily go for commoditization of AI hardware, like what Meta did when releasing LLaMA to the wild.
Is Nvidia or AMD offering 128GB cards in any configuration?
They aren't "cards" but MI300x has 192GB and MI325x has 256GB.
You can run an AMD APU with 128GB of shared RAM.
It's too slow and not very compatible. Most BIOSes also don't allow sharing that much memory with GPU (max like 16GB).
Isn’t that setting just a historical thing, and an integrated GPU is able to access any system memory that is mapped by the IOMMU? I assume this is how it works for people using the NVIDIA Jetson AGX Orin 64GB Developer Kit to do inference. I do not know why it would be different for AMD APUs.
I remember somebody complaining about it on Reddit, unable to overcome some BIOS limitation on an AMD G-series processor. Even on an M3 Max, one had to issue a special command to let the GPU access more memory.
You can do that with Nvidia too, but it takes you from 6 tok/s to 6 s/token or worse (not even exaggerating).
AMD has a 192GB GPU. I don’t see them eating NVidia’s lunch with it.
They are charging as much as Nvidia for it. Now imagine they offered such a card for $2k. Would that allow them to eat Nvidia's lunch?
AMD would be selling it at a loss. Given that HBM costs 3x the price of desktop DRAM and a 192GB kit costs $600 at Newegg, the memory alone would cost 90% of the price. The GPU die, PCB, power circuitry, etc likely costs more than $200 to make.
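A minimal sketch of that estimate; the 3x multiplier and the $2000 card price are the assumptions from the comment above, not quoted figures:

    desktop_dram_192gb = 600   # ~$600 for a 192GB desktop DRAM kit at retail
    hbm_multiplier     = 3     # assumption: HBM costs ~3x desktop DRAM per GB
    card_price         = 2000  # hypothetical consumer price

    hbm_cost = desktop_dram_192gb * hbm_multiplier
    print(hbm_cost, hbm_cost / card_price)  # 1800 0.9 -> memory alone eats ~90% of the price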
This does not consider that the board of directors would crucify Lisa Su if she authorized the use of HBM on a consumer product while it is supply constrained and there is enterprise demand for products using it. AMD can only get a limited amount of it and what they do get is not enough for enterprise demand where AMD has extremely healthy margins.
Even if they by some miracle turned a profit on a $2000 consumer card with 192GB HBM, every sale would have a massive opportunity cost and effectively would be a loss in the eyes of the board of directors.
Meanwhile, Nvidia would be unaffected because AMD could not produce very many of these.
NVidia would be dramatically affected, just not overnight.
If Intel or AMD sold a niche product with 48GB RAM even at a loss, but hit high-end consumer pricing, there would be a flood of people doing various AI work to buy it. The end result would be that parts of NVidia's moat would start draining rather quickly, and AMD / Intel would be in a stronger position for AI products.
I use NVidia because when I bought AMD during the GPU shortage, ROCm simply didn't work for AI. This was a few years back, but I was burned badly enough that I'm unlikely to risk AMD again for a long, long time. Unused code sits broken, and no ecosystem gets built up. A few years later, things are gradually improving for AMD for the kinds of things I wanted to do years ago, but all my code is already built around NVidia, and all my computers have NVidia cards. It's a project with users, and all those users are buying NVidia as well (even if just for surface dependencies, like dev-ops scripts which install CUDA). That, times thousands of projects, is part of NVidia's moat.
If I could build a cheap system with around 200GB, that would be incentive for me to move the relatively surface dependencies to work on a different platform. I can buy a motherboard with four PCI slots, and plug in four 48GB cards to get there. I'd build things around Intel or AMD instead.
The alternative is NVidia would start shipping competitive cards. If they did that, their high-end profit margins would dissolve.
The breakpoints for inference functionality are really at around 16GB, 48GB, and 200GB, for various historical reasons.
> If I could build a cheap system with around 200GB,
Even if AMD dropped the price to $2000, you would not be able to build a system with one of these cards. You cannot buy these cards at their current 5-digit pricing. The idea that you could buy one if they dropped the price to $2000 is a fantasy, since others would purchase the supply long before you had a chance, just like they do now.
AMD is already selling out at the current 5 digit pricing and Nvidia is not affected, since Nvidia is selling millions of cards per year and still cannot meet demand while AMD is selling around 100,000. AMD dropping the price to $2000 would not harm Nvidia in the slightest. It would harm AMD by turning a significant money maker into a loss leader. It would also likely result in Lisa Su being fired.
By the way, the CUDA moat is overrated since people already implement support for alternatives. llama.cpp for example supports at least 3. PyTorch supports alternatives too. None of this harms Nvidia unless Nvidia stops innovating and that is unlikely to happen. A price drop to $2000 would not change this.
Let’s say for the sake of argument that you could build such a card and sell it for less than $5k. Why would you do it? You know there’s huge demand, in the tens of billions per quarter, for high-end cards. Why undercut that market so heavily? To overthrow NVidia? You'd end up with way lower profit margins, and then your shareholders would eat you alive.
If you want to load up 405B @ FP_16 into a single H100 box, how do you do it? You get two boxes. 2x the price.
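The arithmetic, roughly (weights only, ignoring KV cache and activations):

    import math

    params          = 405e9
    bytes_per_param = 2            # FP16
    weights_gb      = params * bytes_per_param / 1e9

    h100_box_gb = 8 * 80           # one 8-GPU H100 box, 80GB per GPU
    print(weights_gb)                           # 810.0 GB of weights alone
    print(math.ceil(weights_gb / h100_box_gb))  # 2 -> you need two boxes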
Models are getting larger, not smaller. This is why H200 has more memory, but the same exact compute. MI300x vs. MI325x... more memory, same compute.
We would also need to imagine AMD fixing their software.
I think plenty of enthusiastic open source devs would jump at it and fix their software if the software was reasonably open. The same effect as what happened when Meta released LLaMA.
It is open and they regularly merge PRs.
https://github.com/ROCm/ROCm/pulls?q=is%3Apr+is%3Aclosed
AMD GPUs aren't very attractive to ML folks because they don't outshine Nvidia in any single aspect. Blasting lots of RAM onto a GPU would make it attractive immediately with lots of attention from devs occupied with more interesting things.
Does the 7900 XT outperform the 3090 Ti? If so, there's already a market, because those are the same price. I don't mean "are there, in theory, any workloads the 7900 XT can do better?" Even if they're practically equal in performance, you get a warranty and support with your new 7900 XT.
Also, I didn't know there was a 192GB AMD GPU.
MI300X already leads in VRAM as it has 192 GB.
For local inference, 7900 XTX has 24 GB of VRAM for less than $1000.
At what threshold of VRAM would you start being interested in MI?
Problem with MI300x is the price. Problem with 7900XTX is that it's at best as good as Nvidia with the same RAM for a similar price. If 7900XTX had e.g. 64GB of RAM, was 2x slower than 4080, and kept its price, it would sell like crazy.
I have a 7900 XTX. Honestly I regret it. It took two years for the driver to stop randomly crashing with very pedestrian ROCm loads. And there's no future in AMD support now they're getting out of the high-end dual-use GPU game anyways. I should have gone with NVidia.
Who manufactures the type of RAM and can they buy enough capacity? I know nVidia bought up the high bandwidth memory supply for years to come.
Honestly, I don't think we would need "this type of RAM." The confused part of this discussion is the belief that we need obscene bandwidth.
If I need 300GB/s memory bandwidth for my workload, that can be accomplished with:
* One RAM chip with 300GB/s
* Two RAM chips with 150GB/s each
* Four RAM chips with 75GB/s each
Etc.
Stepping up from 16GB to 196GB, the per-chip bandwidth requirement goes down roughly 10-fold, and you can use much cheaper RAM as a result. And all the signalling requirements relax too.
Much of this discussion presumes a 200GB card would individually need the same capacity to each RAM chip as a 12GB card. This is just false. An A770 or 4060-grade card couldn't keep up with that much data. And if I'm using a small model, I can get the same bandwidth by properly distributing it among RAM chips (which most hardware does automatically).
An A770 or 4060-grade card, with the same total memory bandwidth as we have today but 200GB of RAM, would allow us to run high-quality LLMs locally or do high-resolution renders. That wouldn't have the same performance as a $200k card, obviously, but for many inference uses, that's just not very important.
If I were buying for my own uses, I'd want 12x 32GB DDR4-3200 DIMMs, for a total of 384GB of RAM at around $600 for the memory (say $2k total), with an individual throughput of ~25GB/sec and an aggregate throughput of ~300GB/sec. I'd be okay with 4060-grade performance. My own uses are a bit niche, and I think for most other people's uses, something with a little more throughput and a little less capacity (48-196GB) might make more sense. But you definitely don't need the same throughput as existing GPU RAM.
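A minimal sketch of that aggregation argument; the 25.6GB/s per DIMM figure assumes DDR4-3200 with one channel per DIMM:

    # If reads are striped evenly across chips/DIMMs, each one only has to
    # supply its share of the total bandwidth.
    def per_device_bw(total_bw_gb_s, num_devices):
        return total_bw_gb_s / num_devices

    print(per_device_bw(300, 6))   # 50.0 -> 6 GDDR chips on a 12GB card
    print(per_device_bw(300, 12))  # 25.0 -> 12 DIMMs, i.e. plain DDR4-3200 territory

    # The 384GB example: 12 channels of DDR4-3200 (25.6 GB/s each)
    print(12 * 32, "GB,", 12 * 25.6, "GB/s aggregate")  # 384 GB, ~307 GB/s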
Pretty obvious stuff, right? I mean, you don't even need HBM for that, you just need a TON of memory channels. Sure, that kind of setup would only be efficient for highly coalesced reads/writes, but that's exactly what inference needs these days: highly coalesced reads and writes. You could even get by with 64GB of DDR5. DDR5-4800 (rather modest) is 38.4GB/s per channel; to get 1TB/s you'd only need 26 channels. With the more expensive DDR5-6400 you'd only need 20. That doesn't at all sound insurmountable for a company of Intel's caliber. Heck, break up the dies (and the channels) across several chiplets even; if the interconnect is decent, it'll still run really well.
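Channel-count math from that comment, as a quick check:

    ddr5_4800 = 38.4   # GB/s per 64-bit DDR5-4800 channel
    ddr5_6400 = 51.2   # GB/s per DDR5-6400 channel
    target    = 1000   # ~1 TB/s aggregate

    print(target / ddr5_4800)  # ~26 channels
    print(target / ddr5_6400)  # ~20 channels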
I've said this for a while...
I do think one challenge is that, AFAIK, with most GDDR5/6 there's a density limit that requires either wider memory buses or other added complexity to support large sizes.
That said, the lack of even a 16GB variant is sus.
I'll take some copium in that maybe they're trying to solve the 'size' issue somehow and are just making sure whatever system they use isn't gonna be an i820 MTH debacle before they pull the trigger on announcing it.
I would think someday we can use AI to port the software. Maybe even use AI to design the cards.
This is a gaming card. Look at benchmarks.
Because they can’t
Because if they could just do that and it would rival what NVidia has, they would just do it.
But obviously they don't.
And for good reason: NVidia has worked on CUDA for ages. Do you believe anyone can just replace that whole thing in no time?
Does CUDA even matter that much for LLMs? Especially inference? I don't think software would be the limiting factor for this hypothetical GPU. After all, it would be competing with Apple's M chips, not with the 4090 or Nvidia's enterprise GPUs.
It's the only thing that matters. Folks act like AMD support is there because suddenly you can run the most basic LLM workload. Try doing anything actually interesting (i.e., try running anything cool in the mechanistic interpretability or representation engineering world) with AMD and suddenly everything is broken, nothing works, and you have to spend millions of dollars' worth of AI engineer time to try to salvage a working solution.
Or you can just buy Nvidia.
llama.cpp and its derivatives say yes.
This is the most script kiddy comment I've seen in a while.
llama.cpp is just inference, not training, and the CUDA backend is still the fastest one by far. No one is even close to matching CUDA on either training or inference. The closest is AMD with ROCm, but there's likely a decade of work to be done to be competitive.
Yes, and inference is a huge market in itself and potentially larger than training (gut feeling, haven't run the numbers).
Keep NVIDIA for training and Intel/AMD/Cerebras/… for inference.
The funny thing about Cerebras is that it doesn't scale well at all for inference and if you talk to them in person, they are currently making all their money on training workloads.
Inference is still a lot faster on CUDA than on CPU. It's fine if you run it at home or on your laptop for privacy, but if you're serving those models at any scale, you're going to be using GPUs with CUDA.
Inference is also a much smaller market right now, but will likely be overtaken later as we have more people using the models than competing to train the best one.
NVidia Blackwell is not just a GPU. It's a rack with an interconnect built on a custom Nvidia network.
And it needs liquid cooling.
You don't just plug in Intel cards 'out of the box'.
Inference on very large LLMs, where the model plus working state exceeds 48GB, is already way faster on a 128GB MacBook than on NVidia, unless you have one of those monstrous Hx00s with lots of RAM, which most devs don't.
No one is running LLMs at scale on consumer NVidia GPUs or Apple MacBooks.
A dev who wants to run local models probably runs something that just fits on a proper GPU. For everything else, everyone uses an API key from whatever provider, because it's fundamentally faster.
Whether an affordable Intel GPU would be relevantly faster for inference is not clear at all.
A 4090 is at least double the speed of Apple's GPU.
The 4090 is 5x faster than an M3 Max 128GB according to my tests, but it can't even run inference on LLaMA-30B. The moment you hit that memory limit, inference is suddenly 30x slower than the M3 Max. So a basic GPU with 128GB RAM would trash the 4090 on those larger LLMs.
Quantized 30B models should run in 24GB VRAM. A quick search found people doing that with good speed: [1]
[1] https://old.reddit.com/r/LocalLLaMA/comments/14gdsxe/optimal...
Quantized, sure, but there is some loss of output variability that one notices quickly with 30B models. If you want to use the fp16 version, you are out of luck.
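The memory math behind "a quantized 30B fits in 24GB but fp16 doesn't", weights only and ignoring KV cache:

    def weights_gb(params_billion, bits_per_weight):
        return params_billion * bits_per_weight / 8  # GB, weights only

    print(weights_gb(30, 4))   # ~15 GB -> a ~4-bit quant of a 30B model fits in 24GB VRAM
    print(weights_gb(30, 16))  # ~60 GB -> fp16 needs far more than any consumer card has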
Do you have the code for that test?
I ran some variation of llama.cpp that could handle large models by running a portion of them on the GPU and, if too large, the rest on the CPU, and those were the results. Maybe I can dig it up from some computer at home, but it was almost a year ago when I got the M3 Max with 128GB RAM.
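For reference, that GPU/CPU split is the standard layer-offload knob in llama.cpp; a minimal sketch via the llama-cpp-python bindings (the model path and layer count are placeholders, and GPU offload only applies if the package was built with GPU support):

    from llama_cpp import Llama  # pip install llama-cpp-python

    llm = Llama(
        model_path="llama-30b.Q4_K_M.gguf",  # placeholder: any local GGUF model
        n_gpu_layers=40,  # layers that fit in VRAM run on the GPU; the rest stay on the CPU
        n_ctx=4096,
    )

    out = llm("Explain GDDR clamshell mode in one sentence.", max_tokens=128)
    print(out["choices"][0]["text"])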
Because the CPU has to load the model in parts for every cycle, so you're spending a lot of time on I/O and it offsets the processing.
You're talking about completely different things here.
It's fine if you're doing a few requests at home, but if you're actually serving AI models, CUDA is the only reasonable choice other than ASICs.
My comment was about Intel having a starter project, getting an enthusiastic response from devs, building network effects, and iterating from there. They need a way to threaten Nvidia, and just focusing on what they can't do won't get them there. There is one route where they can disturb Nvidia's high end over time, and that's a cheap basic GPU with lots of RAM. Like first-gen Ryzen, whose single-core performance was two generations behind Intel's, yet it trashed Intel by providing 2x as many cores for cheap.
It would be a good idea to start with some basic understanding of GPUs and why this can't easily be done.
That's a question M3 Max with its internal GPU already answered. It's not like I didn't do any HPC or CUDA work in the past to be completely clueless about how GPUs work though I haven't created those libraries myself.
What have you implemented in CUDA?
You're not wrong, but technically llama.cpp does have training (both raw model and fine-tuning). And it's been around for a long time. Back around the ggml->gguf switch I used llama.cpp to train a tiny 0.9B llama 1 through the early, fast parts of the loss reduction on 3GB of IRC logs, with 64 tokens of context, over about a month. It eventually produced some gpt2-like IRC lines within its very short context.
Would anyone choose llama.cpp's training tools to do serious work? No. Do they exist and work, yes.
A fraction of CUDA capabilities.
Sufficient for LLMs and image/video gen.
A fraction of what a GPU is used for.
FLUX.1 D generation is about a minute at 20 steps on a 4080, but takes 35 minutes on the CPU.
Yep. Any large GenAI image model (beyond SD 1.5) is hideously slow on Macs irrespective of how much RAM you cram in, whereas I can spit out a 1024x1024 image from the Flux.1 Dev model in ~15 seconds on an RTX 4090.
A 4080 won't do video due to low RAM. The GPU doesn't have to be as fast there; it can be 5x slower, which is still way faster than a CPU. And Intel can iterate from there.
It won't be 5x slower; it would be 20-50x slower if you implemented it as you said.
You can't just "add more ram" to GPUs and have them work the same way. Memory access is completely different than on CPUs.
Not even close. llama.cpp isn't even close to a production-ready LLM inference engine, and it runs overwhelmingly faster when using CUDA.
If they were serious about AI they would have published TOPS stats for at least float32 and bfloat16.
The lack of quantified stats on the marketing pages tells me Intel is way behind.
My hunch is the path forward for Intel, on both the CPU and the GPU end, is to release a series of consumer chipsets with a large number of PCIe 5.0 lanes, and keep iterating on this. It would cannibalize some of the datacenter server-side revenue, but that's a reboot... get the hackers raving about Intel value for the money instead of EPYC. Or do a skunkworks ARM64 M1-like processor; there's a market for that as a datacenter part...