2 comments

  • EnthrallingEmil 4 hours ago

    Also on a 3090. Using the Q4_K_XL quant with a reduced max context size and a 100k-token prompt, I get 2520 tok/s for prompt processing and 68 tok/s for generation:

      llama-server \
        --model /mnt/ubuntu/models/llama-cpp-qwen/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
        --ctx-size 150000 \
        --n-gpu-layers 99 \
        --cache-type-k q8_0 \
        --cache-type-v q8_0 \
        --parallel 3 \
        --kv-unified \
        --ctx-checkpoints 32 \
        --checkpoint-every-n-tokens 8192 \
        --checkpoint-min-tokens 64 \
        --flash-attn on \
        --batch-size 4096 \
        --ubatch-size 1024 \
        --reasoning on \
        --temp 0.6 \
        --top-p 0.95 \
        --top-k 20
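
    With --cache-type-k/v q8_0 at --ctx-size 150000, the KV cache is roughly half the size it would be at f16. A back-of-the-envelope sizing sketch (the model dimensions below are placeholder assumptions for illustration, not read from the GGUF in the command):

    ```python
    # Rough KV-cache footprint for ctx-size 150000 with q8_0 K and V.
    n_layers   = 48      # hypothetical layer count
    n_kv_heads = 4       # hypothetical GQA KV-head count
    head_dim   = 128     # hypothetical head dimension
    ctx        = 150_000

    # llama.cpp q8_0 blocks: 32 int8 values + one fp16 scale = 34 bytes
    Q8_0_BYTES_PER_ELEM = 34 / 32

    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V
    kv_gib = elems_per_token * ctx * Q8_0_BYTES_PER_ELEM / 2**30
    print(f"{kv_gib:.1f} GiB")
    ```

    With those assumed dimensions that comes out around 7.3 GiB; at f16 it would be nearly double, which is why the q8_0 cache types matter for fitting 150k context on a 24 GB card.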
    I was wondering if turboquant is worth the effort right now, but I'm not seeing it speed-wise yet.

    --checkpoint-min-tokens comes from a local patch I carry so that small background tasks don't wreck my checkpoint cache.
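
    For context, the reported rates translate into wall-clock time like this (the 1000-token reply length is just an assumed example, not from the benchmark above):

    ```shell
    # Rough time-to-first-token and reply time at the reported rates
    prompt_tokens=100000
    pp_rate=2520    # prompt processing, tok/s (reported)
    tg_rate=68      # generation, tok/s (reported)
    gen_tokens=1000 # assumed reply length

    pp_secs=$(( prompt_tokens / pp_rate ))  # seconds before first token
    tg_secs=$(( gen_tokens / tg_rate ))     # seconds for the reply
    echo "prompt: ${pp_secs}s, generation: ${tg_secs}s"
    ```

    So roughly 39 s of prompt processing before the first token, then about 14 s for a 1000-token reply.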

  • freakynit 4 hours ago

    Update: spot terminated