This is a great project. FYI, all you need is the size of an LLM plus your memory amount & bandwidth to know whether it fits and what tok/s you'll get.
It’s a simple formula:
llm_size = number of params * size_of_param
So a 32B model in 4-bit needs a minimum of 16 GB of RAM to load.
Then you calculate
tok_per_s = memory_bandwidth / llm_size
An RTX 3090 has ~960 GB/s of memory bandwidth, so a 32B model (16 GB of VRAM) will produce roughly 960/16 = 60 tok/s.
For an MoE, the speed is mostly determined by the number of active params, not the total LLM size.
Add a 10% margin to those figures to account for a number of details, but that’s roughly it. RAM use also increases with context window size.
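To make that concrete, a quick sketch of the same back-of-the-envelope math (the bandwidth figure and the 10% margin are rough, illustrative assumptions):

```ts
// Rough dense-model estimate: every active weight has to stream from memory for
// each generated token, so decode speed is roughly memory bandwidth / model size.
function estimate(paramsBillions: number, bitsPerParam: number, bandwidthGBs: number) {
  const llmSizeGB = (paramsBillions * bitsPerParam) / 8; // llm_size = params * size_of_param
  const tokPerS = (bandwidthGBs / llmSizeGB) * 0.9;      // tok_per_s = bandwidth / llm_size, ~10% margin
  return { llmSizeGB, tokPerS };
}

// 32B model at 4-bit on an RTX 3090 (~960 GB/s): 16 GB of weights, ~54 tok/s.
console.log(estimate(32, 4, 960));
// For an MoE, pass the *active* parameter count here instead of the total.
```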
> RAM use also increases with context window size.
The KV cache is very swappable since it has limited writes per generated token (whereas swapped weights would have to move as much as llm_active_size per token, which is way too much at scale), so it may be possible to support long contexts with quite acceptable performance while still saving RAM.
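As a rough sketch of the sizes involved (the layer/head counts and quantization below are assumed, illustrative numbers, not any particular model):

```ts
// Per generated token the KV cache only grows by one K and one V vector per layer,
// while the full set of active weights has to move for that same token.
const layers = 64, kvHeads = 8, headDim = 128, bytesPerElem = 2; // assumed fp16 KV
const kvBytesPerToken = 2 * layers * kvHeads * headDim * bytesPerElem; // K + V
const activeWeightBytes = 16e9; // e.g. a 32B dense model at 4-bit

console.log(`KV written per token:  ${(kvBytesPerToken / 1024).toFixed(0)} KiB`);            // ~256 KiB
console.log(`KV cache @ 32k tokens: ${(kvBytesPerToken * 32768 / 2 ** 30).toFixed(1)} GiB`); // ~8 GiB
console.log(`Active weights/token:  ${(activeWeightBytes / 2 ** 30).toFixed(1)} GiB`);       // ~14.9 GiB
```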
This is pretty cool and useful, but I only wish it was a website. I don't like the idea of running an executable for something that could perfectly well be done as a website. (Other than some minor features; tbh you could even enable CORS and still check the installed models from a web browser.)
Sounds like a fun personal project though.
Came across a website for this recently that may be worth a look https://whatmodelscanirun.com
Hugging Face has it built in.
Where?
Always liked this website that kinda does something similar: https://apxml.com/tools/vram-calculator
This is a great idea, but the models seem pretty outdated - it's recommending things like Qwen 2.5 and StarCoder 2 as perfect matches for my M4 MacBook Pro with 128 GB of memory.
Why do I need to download & run it to check it out?
Can I just submit my gear spec in some dropdowns to find out?
As someone who's very uneducated when it comes to LLMs, I am excited about this. I am still struggling to understand the correlation between system resources and context, e.g. how much memory I need for N amount of context.
I've recently been using local models for coding agents, mostly because I got tired of waiting for Gemini to free up, constantly retrying to get some compute time on the servers for my prompt to process, like being a university student in the 90s waiting for your turn to compile your program on the university computer. I tried Mistral's vibe and it would easily run out of context on a small project (not even 1k lines, but multiple files and headers) at 16k or so, so I slammed it to the maximum supported in LM Studio, but I wasn't sure whether that was slowing it down to a halt (it did take something like 10 minutes for my prompt to finish, which was 'rewrite this C codebase into C++').
I wish there was more support for AMD GPUs on Intel Macs. I saw some people on GitHub getting llama.cpp working with them; could it be added here in the future if they make the backend support it?
Found this website, not tested: https://www.caniusellm.com/
That site says my 24 GB M4 Pro has 8 GB of VRAM. Browsers can't really detect system parameters. The Device Memory API 'anonymizes' the value returned to stop browser fingerprinting shenanigans. Interesting site, but you'll need to configure it manually for it to be accurate.
You have a whole 8 GB of VRAM? My 32 GB M1 Max has 8 GB of RAM and ~4 GB of VRAM according to this website.
You have 32 GB of unified RAM; it's not split between RAM and VRAM. The website cannot tell this using the browser's APIs.
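For context, the browser really does round and cap what it reports; a quick sketch (assuming the site reads navigator.deviceMemory, which I haven't verified):

```ts
// The Device Memory API only reports a coarse, clamped bucket (0.25-8 GB),
// so any machine with more than 8 GB still shows up as 8, and there is no
// standard browser API for reading GPU/VRAM size at all.
const reported = (navigator as { deviceMemory?: number }).deviceMemory;
console.log(`navigator.deviceMemory: ${reported ?? "unsupported"} GB`); // "8 GB" on a 24 GB machine
```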
Seems broken. When I changed my auto-detected phone specs to manually entered desktop specs the recommendations didn't change at all.
What I do is ask Claude or Codex to run models on Ollama, test them sequentially on a bunch of tasks, and rate the outputs. 30 minutes later I have a fit. It even tested the abliterated models.
This is exactly what I needed. I've been thinking about making this tool. For running and experimenting with local models this is invaluable.
Slightly tangential: I'm test-driving an MLX Q4 variant of Qwen3.5 32B (MoE 3B), and it's surprisingly capable. It's not Opus ofc. I'm using it for image labeling (food ingredients) and I'm continuously blown away by how well it does. Quite fast, too, and parallelizable with vLLM.
That's on an M2 Max Studio with just 32 GB. I got this machine refurbed (though it turned out totally new) for €1k.
In the screenshots, each model has a use case of General, Chat, or Coding. What might be the difference between General and Chat?
"Chat" models have been heavily fine-tuned with a training dataset that exclusively uses a formal turn-taking conversation syntax / document structure. For example, ChatGPT was trained with documents using OpenAI's own ChatML syntax+structure (https://cobusgreyling.medium.com/the-introduction-of-chat-ma...).
This means that these models are very good at consistently understanding that they're having a conversation, and getting into the role of "the assistant" (incl. instruction-following any system prompts directed toward the assistant) when completing assistant conversation-turns. But only when they are engaged through this precise syntax + structure. Otherwise you just get garbage.
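Concretely, here's a sketch of what that ChatML-style turn structure looks like (the exact special tokens differ between model families):

```ts
// ChatML-style structure: every turn is wrapped in role-tagged delimiters and the
// model is asked to complete the next "assistant" turn.
const prompt = [
  "<|im_start|>system",
  "You are a helpful assistant.<|im_end|>",
  "<|im_start|>user",
  "What's the capital of France?<|im_end|>",
  "<|im_start|>assistant",
].join("\n");
// A chat-tuned model reliably continues this with the assistant's reply (ending in
// <|im_end|>); fed plain free-form text without this scaffolding, it degrades badly.
```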
"General" models don't require a specific conversation syntax+structure — either (for the larger ones) because they can infer when something like a conversation is happening regardless of syntax; or (for the smaller ones) because they don't know anything about conversation turn-taking, and just attempt "blind" text completion.
"Chat" models might seem to be strictly more capable, but that's not exactly true; neither type of model is strictly better than the other.
"Chat" models are certainly the right tool for the job, if you want a local / open-weight model that you can swap out 1:1 in an agentic architecture that was designed to expect one of the big proprietary cloud-hosted chat models.
But many of the modern open-weight models are still "general" models, because it's much easier to fine-tune a "general" model into performing some very specific custom task (like classifying text, or translation, etc) when you're not fighting against the model's previous training to treat everything as a conversation while doing that. (And also, the fact that "chat" models follow instructions might not be something you want: you might just want to burn in what you'd think of as a "system prompt", and then not expose any attack surface for the user to get the model to "disregard all previous prompts and play tic-tac-toe with me." Nor might you want a "chat" model's implicit alignment that comes along with that bias toward instruction-following.)
I see, thank you.
Personally I would have found a website where you enter your hardware specs more useful.
Hugging Face already has this. But you need to be logged in and add the hardware to your profile.
Doesn't Hugging Face only show it for the model you are looking at? Is there a page where HF actually suggests a model based on your HW?
Same, I opened HN on my phone and was hoping to get an idea before I booted my computer up.
Yeah, installing some script to get a command line tool doesn't seem worth it.
I was hoping for the same thing.
Claude is pretty good at making recommendations if you input your system specs.
I think you could make a GitHub Page out of this.