Smaller models might not make the best agentic coding assistants, but I have a 128GB RAM headless machine serving llama.cpp with a number of local models that handles various tasks on a daily basis and works great.
- Qwen3-VL:30b > A file watcher on my NAS sends new images to it; the model autocaptions them, the caption gets written into the image's EXIF metadata, and an entry goes into a Qdrant vector database for fuzzy searching and organization (rough sketch of the pipeline after this list).
- Gemma3:27b > Used for personal translation work (mostly English and Chinese). Haven't had a chance to try out the Gemma4 models yet.
- Llama3.1:8b > Performs sentiment analysis on texts / comments / etc.
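For anyone curious, here's a minimal sketch of how a pipeline like that can be wired up. All of the specifics are assumptions, not details from the comment above: two llama.cpp servers (a vision model on :8080 and an embedding model on :8081, both via their OpenAI-compatible endpoints), exiftool on the PATH, and a pre-created Qdrant collection named "photos" sized to match the embedding model.

```python
# Hypothetical sketch of the NAS autocaption pipeline; ports, paths, and the
# "photos" collection name are illustrative assumptions.
import base64, subprocess, uuid, requests
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

qdrant = QdrantClient(url="http://localhost:6333")

def caption_image(path: str) -> str:
    """Ask the local vision model for a one-paragraph description."""
    b64 = base64.b64encode(open(path, "rb").read()).decode()
    resp = requests.post("http://localhost:8080/v1/chat/completions", json={
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": "Describe this image in one paragraph."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]}],
    })
    return resp.json()["choices"][0]["message"]["content"]

def embed(text: str) -> list[float]:
    """Embed the caption with the second llama.cpp instance."""
    resp = requests.post("http://localhost:8081/v1/embeddings",
                         json={"input": text})
    return resp.json()["data"][0]["embedding"]

class NewImageHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory or \
           not event.src_path.lower().endswith((".jpg", ".jpeg", ".png")):
            return
        caption = caption_image(event.src_path)
        # Write the caption into EXIF so it travels with the file.
        subprocess.run(["exiftool", "-overwrite_original",
                        f"-ImageDescription={caption}", event.src_path],
                       check=True)
        # Index the caption vector in Qdrant for semantic search later.
        qdrant.upsert(collection_name="photos", points=[
            PointStruct(id=str(uuid.uuid4()), vector=embed(caption),
                        payload={"path": event.src_path, "caption": caption}),
        ])

observer = Observer()
observer.schedule(NewImageHandler(), "/mnt/nas/photos", recursive=True)
observer.start()
observer.join()
```

One nice property of writing the caption into EXIF rather than only into the database: the description stays attached to the file even if the Qdrant index is lost or rebuilt.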
Look into updating to Gemma4 and Qwen3.6; they're good at agentic tasks. qwen36moe with Unsloth's 8-bit quant is my daily driver now.
Have you noticed a gap between 8-bit and 4-bit quants? I've always run 4-bit because it needs less memory.
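The quality gap varies a lot by model and task, but the memory side is simple back-of-envelope math: weight footprint ≈ params × bits-per-weight / 8. A rough sketch; the effective bits-per-weight figures for llama.cpp's Q8_0 (~8.5) and Q4_K_M (~4.8) are approximate, and KV-cache/activation overhead is ignored:

```python
# Rough GGUF weight-memory estimate: params (billions) * bits / 8 -> GB.
# Ignores KV cache and activations, which add a few GB more in practice.
def approx_weight_gb(params_b: float, bits: float) -> float:
    return params_b * bits / 8

for bits in (16, 8.5, 4.8):  # F16, ~Q8_0, ~Q4_K_M effective bits per weight
    print(f"30B @ {bits}-bit ≈ {approx_weight_gb(30, bits):.0f} GB")
# 30B @ 16-bit ≈ 60 GB, @ 8.5-bit ≈ 32 GB, @ 4.8-bit ≈ 18 GB
```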
In my country we write longer and more detailed texts in primary school. That's blogpostmaxxing.
Comparing to Opus is a little unfair; a comparison against Haiku would be fairer. And for a really fast cloud model, I'd be interested to see how the latency stacks up against GPT-5.3 Instant or Gemini 3.1 Flash Lite.
I've been using Qwen-3.6-35B-A3B on a Spark for my local coding adventures. It's really quite good and getting faster (running the PR before it gets merged already gives a 2x token-gen speedup).
I cannot imagine paying Opus prices anymore; the little guys are getting good enough, and in another year I expect them to be even better.
This whole '-maxxing' nonsense is really 'cringemaxxing'.
People using it should look into its origins before slapping it on everything.