Frontier AI companies are selling at a loss.
Setting aside everything else that u/bastawhiz said[0], the obvious fact here is that Claude, OpenAI, Gemini et al. are quite literally burning through hundreds of billions of dollars and selling it back to you for pennies on the dollar in the hopes that they get to be the only one left.
If I spend $10 growing oranges and sell them to you for $1, then of course it's more expensive for you to do the growing yourself.
I feel like I'm taking crazy pills. These models will become more expensive over time; it's functionally impossible for them not to. They just want to capture the market before they have to stop selling at a huge loss.
[0]: https://news.ycombinator.com/item?id=48168433
Well, I'd be surprised if non-R&D inference providers were selling at a loss. There are plenty to choose from, and competition is quite healthy. Will they keep providing cheap tokens while the labs raise their prices? Probably, but then I don't see how prices could be raised in the first place. And what timescale are you talking about? A couple of years? It is reasonable to assume inference will become more efficient over time. If you raise your prices, you are going to be outcompeted before it's profitable (if you assume it is unprofitable), which would be negligent. I don't see how this makes sense.
This isn't a good analysis, and it's because it keeps rounding everything up. He rounds up the cost of electricity by 10%. He has a range of power use, takes the high end (which is 2x the low end) and multiplies it by the inflated electricity cost.
But then they talk about using a newly purchased Mac to do the inference, running at full capacity, 24/7. Why would you do that? Apple silicon is fast, but as the author points out, you're only getting 10-40 tokens per second. It's not bad, but it's not meant for this!
It's comparing apples to oranges. Yeah, data centers don't pay residential electricity rates. Data centers use chips that are power efficient. Data centers use chips that aren't designed to also be a Mac.
Apple silicon works out pretty well if you're not burning tokens 24/7/365 and you're not buying hardware specifically to do it. I use my Mac Studio a few times a week for things I need it for, but I can run ollama on it over the tailnet "for free". The economics work when I'm not trying to make my Mac Studio behave like an H100 cluster with liquid cooling. Which should come as no surprise to anyone: more tokens per watt on multi-tenant hardware with cheap electricity will pretty much always win.
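For anyone curious what that setup looks like in practice, here's a minimal sketch of hitting a Mac Studio's ollama server from another machine on the same tailnet; the hostname, model tag, and prompt are placeholder assumptions, not a specific recommendation:

```python
# Minimal sketch: query an ollama server running on a Mac Studio from
# another machine on the same tailnet. Hostname and model are placeholders.
import requests

OLLAMA_URL = "http://mac-studio:11434/api/generate"  # assumed tailnet hostname

resp = requests.post(
    OLLAMA_URL,
    json={
        "model": "gemma3:27b",  # whatever model the Studio has pulled
        "prompt": "Summarize the tradeoffs of local vs hosted inference.",
        "stream": False,        # ask for a single JSON response, not a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```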
Rounding everything down in the most optimistic setting got me to $0.40 per million tokens, and OpenRouter has the same model at $0.38/Mtok.
I'll keep my data local over a $0.02/Mtok difference.
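For what it's worth, the arithmetic behind numbers like these is simple enough to redo yourself; here's a rough sketch where the power draw, electricity price, token rate, hardware price, and lifespan are all assumptions to swap for your own values, not measurements:

```python
# Back-of-envelope local cost per million output tokens.
# Every input is an assumption to tweak, not a measured value.
def local_cost_per_mtok(
    watts: float = 50.0,           # sustained power draw during inference
    usd_per_kwh: float = 0.18,     # residential electricity price
    tokens_per_sec: float = 40.0,  # generation speed for the model in question
    hardware_usd: float = 0.0,     # set > 0 to amortize the machine itself
    lifespan_years: float = 5.0,   # assumed useful life for amortization
    duty_cycle: float = 1.0,       # fraction of that lifespan spent generating
) -> float:
    seconds = 1_000_000 / tokens_per_sec
    electricity = (watts / 1000) * (seconds / 3600) * usd_per_kwh
    lifetime_tokens = tokens_per_sec * lifespan_years * 365 * 24 * 3600 * duty_cycle
    hardware = (hardware_usd / lifetime_tokens * 1_000_000) if hardware_usd else 0.0
    return electricity + hardware

# Electricity only, optimistic speed and power: a few cents per million tokens
print(round(local_cost_per_mtok(), 2))
# Pessimistic power/speed plus a $6k machine amortized over 3 years of 24/7 use
print(round(local_cost_per_mtok(watts=100, tokens_per_sec=10,
                                hardware_usd=6000, lifespan_years=3), 2))
```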
It’s more than just data locality. OpenRouter is faster, no? I have an M4 pro, and anything but the smallest dumbest models are unusably slow for interactive use. I personally haven’t yet found a good use case for offline/non-interactive LLM work locally.
The article makes no sense. I can't use OpenRouter as a general purpose computing device. Why are we comparing a whole computer to a single purpose SaaS?
I think it's because there are a lot of people writing articles about the benefits of running local models. I think it's fair to say that there are daily threads on HN singing the praises of local inference. I also see people buying new hardware where the main trigger is the ability to run local models.
using it 24/7 brings the average cost down, not up.
the less you use a local LLM, the less sense it makes, since you paid a lot for hardware you don't use.
That's the point: why would you buy a device that's specifically not optimized to be used for 24/7 inference? It's expensive hardware that's not designed to be used in that situation! The power use for inference isn't especially good and you're not getting even a fraction of the benefit from the hardware that you're paying for.
The hardware has multiple uses for the same cost. The pay-per-use server does not.
nothing about the current data center craze looks efficient.
Whether or not you think building data centers is a good idea, it's inarguable that the per-token efficiency (power, hardware, etc.) is FAR higher in a data center. That's literally what it's designed for.
Unless I'm misunderstanding, this is counting the entire laptop in the cost of generating tokens. The calculation seems to omit that, in addition to receiving LLM output, you have also received a laptop in exchange for your money. If you intend to put this machine in a dark corner and run it solely as a token-munching server, a laptop would be an exceptionally poor choice of technology for this purpose. But if you intend to use the laptop as a laptop, having a laptop is a pretty big benefit over not having a laptop.
You also get the benefit of privacy, freedom from censorship, and control over the model used (i.e. it will not be rugpulled on you in three months after you've built a workflow around a specific model's idiosyncrasies).
Yeah, a better metric might be the difference in cost between the laptop you need to run local models and the laptop you would have bought anyway.
> control over the model used
But you lose access to the most capable models; you can run only the small ones.
For me, the appeal of local compute is first and foremost confidentiality, and the ability to run my 200K documents through an LLM just to see what happens without having to consider the cost.
A lot of comments here are about the issues with the analysis in OP's post, but many of them are "a distinction without a difference" with respect to the broader conclusion. When we look purely at cost and performance (setting aside privacy), it's better for individual devs to pay for hosted than to self-host. Employers are paying for tokens on the job, and most devs find $PREFERRED_PROVIDER's $20/$100/$200/month subscription sufficient outside of work. Most devs don't fall into the conditions under which running local models makes sense purely on the basis of cost vs performance.
More critically, in practice, setting up local models seems more like a hobby, an educational exercise, or an act of privacy control than a way to cut costs or boost productivity.
Mmmm, nope, not if you do the smart thing. A MacBook M5 Max with 128GB is a premium laptop at $6k, but with it you can do many things, and it's a good main driver for the day. It can also run DeepSeek V4 flash and perform non-trivial tasks locally, without censorship or limitations, even without an internet connection and on very privacy-sensitive data. That's a good deal. If you spend $25k on a dual Mac Studio 512GB setup to abandon OpenAI and company, you are going to be disappointed by both performance and cost.
Yeah, my M4 Max with 128GB has ended up making a lot of sense for me. I do video editing, I train ML models, I run large open-weight models, I do 3D modeling, rendering, and CAD work. I never do all of this 100% of the time: I'll set up an ML training run to go overnight and check results in the morning, during work I'll set it up as a server and run local models, and on my own time I'll edit video and work on 3D modeling. It's an incredibly versatile machine, and all of this is done while keeping your data on your device and giving you full control over your workflows.
The author only compared output token costs -- but for typical agentic workloads, input tokens dominate the costs by a large margin. Running inference locally, input tokens are, to first order, free. (They only generate implicit costs through higher time-to-first-token, higher power use, and lower token output speed).
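To make that concrete, here's a rough sketch of how an agentic session's hosted bill splits between input and output tokens; the per-token prices and the token mix below are illustrative assumptions, not any particular provider's rates:

```python
# Illustrative split of a hosted bill for a context-heavy agentic session.
# Prices and token counts are made-up assumptions to show the shape of the
# math, not any provider's actual pricing.
input_tokens = 4_000_000      # repeated context: files, tool output, history
output_tokens = 150_000       # the model's actual responses
usd_per_mtok_in = 3.00        # hypothetical input price
usd_per_mtok_out = 15.00      # hypothetical output price

cost_in = input_tokens / 1e6 * usd_per_mtok_in
cost_out = output_tokens / 1e6 * usd_per_mtok_out
print(f"input: ${cost_in:.2f}, output: ${cost_out:.2f}")
# With a mix like this, input tokens dominate the hosted bill, while a local
# setup pays for them mainly in prefill time and watts.
```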
OP is comparing against Gemma everywhere but concludes paying Anthropic makes more sense. Anthropic is $15 per million output tokens, which is 30-35x more expensive, even on OpenRouter.
This is like comparing an e-bike at home with an e-bike rental and concluding we therefore need to rent a Toyota since it can go at similar speeds. Getting tired of bad posts getting this much attention.
How much does your data privacy cost?
As stated in the analysis, thousands of dollars. That said, the smart thing to do is target smaller models (a few billion parameters) for the privacy-sensitive work and use larger hosted models for non-privacy tasks.
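A minimal sketch of that split, routing privacy-sensitive requests to a small local model and everything else to a hosted one; the endpoints, model names, and the "sensitive" flag are all placeholder assumptions:

```python
# Sketch: send privacy-sensitive prompts to a small local model, everything
# else to a hosted one. Model names and the sensitivity check are placeholders.
import requests

LOCAL_URL = "http://localhost:11434/api/generate"            # e.g. ollama
HOSTED_URL = "https://openrouter.ai/api/v1/chat/completions"  # OpenRouter

def complete(prompt: str, sensitive: bool, api_key: str = "") -> str:
    if sensitive:
        r = requests.post(LOCAL_URL, json={
            "model": "gemma3:4b",   # a few-billion-parameter local model (assumed)
            "prompt": prompt,
            "stream": False,
        }, timeout=300)
        return r.json()["response"]
    r = requests.post(HOSTED_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "anthropic/claude-sonnet-4",  # placeholder hosted model id
            "messages": [{"role": "user", "content": prompt}],
        }, timeout=300)
    return r.json()["choices"][0]["message"]["content"]
```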
I simply can't go back to cloud AI. Privacy and full control are more important to me than speed and SOTA models.
Also predictability, resilience, sovereignty. I'm not worried about other people's outages, that unexpected demand will impact me at an inconvenient time, that someone's watering down my model, that my costs will change unpredictably, or that some unforeseen error will lead to a huge bill.
It's in the same category as rooftop solar for me. It doesn't have to make strict economic sense if you're the particular type of person who gets peace of mind from control of infrastructure / reduced dependency.
Slightly different slice into a very similar situation (local vs OpenRouter AI inference).
But in _every_ metric other than privacy it was better to run via OpenRouter than a local model, and not by a small amount.
Direct link to the comparison charts:
https://sendcheckit.com/blog/ai-powered-subject-line-alterna...
Consider DeepSeek as well. About 50 cents per 1M tokens, for a >1T-parameter model.
I don't hear people debating which is cheaper, local or cloud-run models. The conversation, at least what I hear, is that a lot of the time users aren't burning through an awful lot of tokens, yet those providers still get paid even if you never use them. 80-90% of the work my team and I are doing with AI is grunt work: write tests for this, implement an FFT here, write the DB query for X. Nothing exhausting. Those who are using AI for whole-cloth "vibe coded" applications and services are definitely better suited to cloud. If a work laptop can run my local models and deliver the performance my work needs for development, why wouldn't I as a company prefer that?
Add to that the privacy improvements and data protection, and potentially further use-case-specific inference if needed, and it's a no-brainer.
Again, AI is a tool, and it's about the right tool for the job. I would wager, with no evidence looked up, that the majority of devs would be happy with 10-30 tokens per second locally.
"Accelerated depreciation (if any) from shortening the lifespan of the device will be more expensive than the electricity"
Shortening the lifespan?
There's a lot of FUD here, and the notion that hardware depreciates in this manner is widely held. I blame Michael Burry of The Big Short, who is perpetuating these lies to the investor community today.
There's a bunch of retro hardware out there that should make people pause and realize it's stupid to assume hardware slows down on average even 5% 20 years later (it's probably closer to 2%, and I'm being generous).
HVAC/power delivery and generation are the major factors, and if you didn't skimp or get defective parts and you replace failed moving parts (usually fans), your hardware is basically the same 20 years down the line as it is today.
Also using LLMs locally doesn't even induce sustained 100% GPU usage over significant periods of time for most real (agentic coding in OpenCode) use-cases.
Local LLMs aren’t about cost, but control.
I like that the numbers were crunched, but the answer to these is always a bit of a foregone conclusion.
* Industrial power pricing
* Wholesale hardware pricing
* Utilization density of shared API
means API always wins a cost shootout.
Privacy & tinkering is cool too though
So I did the India-specific analysis for a tier-3 city. Here, electricity costs one-third of the US rate, and you also get a solar subsidy up to a certain amount.
https://shorturl.at/q6gRE
tldr;
Hardware depreciation costs are the major factor.
But if we assume ZERO hardware depreciation (not realistic), then local inference becomes super cheap... roughly 90%+ cheaper.
Third case: the break-even happens only if we can get, at the very least, 8.7 years of useful hardware life. A more realistic number, however, when running 8 hrs/day instead of 24 hrs/day, is around 25 years.
So, for now, local inference is preferable if you deeply care about privacy. From a cost perspective, it's still not there.
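For anyone who wants to redo this break-even for their own electricity rate and hardware, the structure looks roughly like the sketch below; the hardware price, token rate, API price, and daily usage hours are placeholders, not the numbers from my analysis:

```python
# Rough break-even: how many years of local use before the hardware pays for
# itself versus a hosted per-token price. All inputs are placeholder assumptions.
def breakeven_years(
    hardware_usd: float = 6000.0,       # up-front machine cost
    api_usd_per_mtok: float = 0.50,     # hosted price for a comparable model
    local_elec_per_mtok: float = 0.02,  # local electricity cost per Mtok (cheap power)
    tokens_per_sec: float = 40.0,
    hours_per_day: float = 8.0,
) -> float:
    mtok_per_year = tokens_per_sec * 3600 * hours_per_day * 365 / 1e6
    savings_per_year = (api_usd_per_mtok - local_elec_per_mtok) * mtok_per_year
    return hardware_usd / savings_per_year

print(round(breakeven_years(), 1))                  # using it 8 hrs/day
print(round(breakeven_years(hours_per_day=24), 1))  # using it 24/7
```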
Bizarre. Running local models has nothing to do with cost. It's about privacy first and foremost.
I'm even surprised people are ignorantly talking about the advantages of buying a very expensive device, running it only sometimes, and aiming to beat the cloud vendors.
If a small model is great, it will be hosted somewhere with good electricity costs and utilized 24/7.
Isn't this the 2+2 of economics?
CPUs are a commodity, and we are still buying CPU and RAM from vendors for the same reason.
Put a cost on sending your intellectual property to a SaaS provider, who knows where. Maybe you are building yet another HTML page nobody really cares about, but not everyone is.
Will this cost structure always be this way and are there other benefits to not running your LLM on the cloud?
E.g.:
* Privacy
* Uptime
* Future cost structure controls
This is a field that has moved very quickly. And it has moved in a direction that tries to trap users into certain habits. But these habits might not align with what best benefits end users today or at some time in the future.
OpenRouter and other LLM platforms are being subsidized by VC investment to charge less than it costs them to run inference; the MacBook Pro is not.
When the AI bubble inevitably pops, the author will find a new way to skew results in favor of cloud LLMs. Like including the price of a desk and a chair in the local token cost.
I really wanted the laptop to look better cost-wise, but it doesn't.
I mean if you’re buying it just as an LLM inference server it’s not, but most people already have laptops, in which case it’s practically free
I've dug into this previously for one simple reason: Nvidia segments the market by capping VRAM, and Apple silicon uses a shared memory model that could challenge that, but currently it doesn't. And I really wonder if Apple realizes the potential of what they have, or if they even care.
So, for comparison, a 5090 has 32GB of VRAM and you can get one for ~$3000 maybe. To go beyond that memory with current generation (ie Blackwell) GPUs, you have to go to the RTX 6000 Pro w/ 96GB of VRAM and that's almost $10,000 for the GPU by itself. Beyond that you're in the H100/H200 GPUs and you're talking much bigger money.
Part of the problem here is that the author is looking at laptops. That's the only place you'll find the M5 Max currently. The real problem is that the Mac Studios haven't been updated in almost 2 years. There were configs of those with 256/512GB of RAM, but they've been discontinued, possibly because of the RAM shortage and possibly because they're reaching EOL. Apple hasn't said why. They never do.
Many expect M5 Ultra Mac Studios in Q3 and the M5 Ultra may well have >1TB/s of memory bandwidth (for comparison, the 5090 is 1.8TB/s). Memory bandwidth isn't the only issue. A 5090 will still have more compute power (most likely) but being able to run large models without going to a $10k+ GPU could be huge.
But yes, it's hard to compete with the scale and discounted electricity of a data center. Even H200 compute hours are kinda cheap if you consider the capital cost of what you're using.
I've looked into getting a 128GB M5 Max 16" MBP. That retails for $6k. You might be able to get it for $5400. But I don't think the value proposition is quite there yet. It's close though.
I think Apple really does care, and knows that Moore's law is likely to position them as major winners in this race in 3-7 years' time.
This. The M5's massive speedup in prefill is a good sign.
Apple isn’t expecting wholesale adoption of on-device models this year or next. But all of their design and iteration suggests they see it coming.
> Let's round up to $0.20 per kWh.
Next paragraph
> At ~50-100 watts and $0.18/kWh that's $0.009 or $0.018 per hour. $0.02 per hour. $0.48 cents per day for the electricity to be running inference at 100%.
lol
The true advantage of locally self-hostable, open weight models isn't about monetary cost at all, it's about the CIA triad.
Running locally, you get the confidentiality of knowing your tokens are only ever being processed by your own hardware. You get the integrity of knowing your model isn't being secretly or silently quantized differently behind the scenes, or having its weights updated in ways you don't want. And you get the availability of never having to worry about an API outage, or even an internet outage, taking out your local inference capacity.
And this isn't even starting to address the whole added world of features and tunability you get when you control the inference stack. Sampling parameters, caching mechanisms, etc.
OpenRouter may be cheaper than frontier labs, but you still lose all of these benefits from open weight models the moment you decide to rely on someone else's hardware for your processing.
Wouldn’t a Mac Mini be a better comparison?
Also after a few years you can sell and upgrade.
A 2022 Mac Studio w/ M1 Ultra and 128gb was ~$5200 new and I see them selling for over $4k on eBay.
Can’t sell your used tokens…
Yes, or Mac Studio. Laptops with screens aren't made to run 24/7 heavy workloads.
Your laptop AI costs too much? Speculative investors can help!
OpenRouter doesn't cost money per say; it depends on the provider's pricing.
> OpenRouter has Gemma4 31b at ~38-50 cents per million tokens. This means that on the optimistic side (50 watts, 40 tokens per second, and 10 years) the pro max is as cheap as openrouter. On the pessimistic side (100 watts and 3 years at 10 tokens per second) the pro max is 10x the cost. I think ~3x the cost per million tokens is likely the right number for local inference on the pro max from an accounting perspective.
Apart from that, as detailed in the article, pricing for local compute also depends on electricity prices.
By the way, I don't want to snark about it, my English is not very good, but it's "per se", not "per say". Just commenting on this petty thing because it seems to be a common misspelling, and it always trips me up a bit. Makes me wonder about another supposed meaning like "from hearsay".
They do take a cut of 5.5% (as they should).
Local isn’t (just) about cost, it’s control and trust.
OpenRouter doesn't expose all the LLM sampling parameters/research that llamacpp, vllm, sglang, et al expose (so no high temperature/highly diverse outputs). Also OpenRouter doesn't let you use steering vectors or LoRA or other personalization techniques per-request. Also no true guarantees of ZDR/privacy/data sovereignty.
Oh, and the author didn't mention anything at all related to inference optimization, so no idea if they even know about or enabled things like speculative decoding, optimized attention backends, quantization, etc.
At least AI slop would have hit on far more of the things I listed above. This is worse-than-AI.
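As one concrete example of the kind of control being described: with a local llama.cpp or vLLM server exposing an OpenAI-compatible endpoint, you can push sampling settings well past what most hosted routers expose. The port, model name, and parameter values below are illustrative assumptions, not a recommended configuration:

```python
# Sketch: hitting a local OpenAI-compatible server (llama.cpp's llama-server,
# vLLM, etc.) with non-default sampling settings. Port, model name, and
# parameter values are illustrative placeholders.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local-model",  # whatever the server was launched with
        "messages": [{"role": "user", "content": "Give me ten unusual project names."}],
        "temperature": 1.6,      # deliberately high for diverse outputs
        "top_p": 0.95,
        "max_tokens": 256,
        "seed": 42,              # reproducible runs on your own hardware
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```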