2 comments

  • EnthrallingEmil 4 hours ago

    Also on a 3090. Using the Q4_K_XL quant with a reduced max context size and a 100k-token prompt, I get 2520 tok/s for prompt processing and 68 tok/s for generation:

      llama-server \
        --model /mnt/ubuntu/models/llama-cpp-qwen/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
        --ctx-size 150000 \
        --n-gpu-layers 99 \
        --cache-type-k q8_0 \
        --cache-type-v q8_0 \
        --parallel 3 \
        --kv-unified \
        --ctx-checkpoints 32 \
        --checkpoint-every-n-tokens 8192 \
        --checkpoint-min-tokens 64 \
        --flash-attn on \
        --batch-size 4096 \
        --ubatch-size 1024 \
        --reasoning on \
        --temp 0.6 \
        --top-p 0.95 \
        --top-k 20
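
    With --cache-type-k/v q8_0 at --ctx-size 150000, the KV cache is roughly half the size it would be at f16. A back-of-the-envelope sizing sketch (the model dimensions below are placeholder assumptions for illustration, not read from the GGUF in the command):

    ```python
    # Rough KV-cache footprint for ctx-size 150000 with q8_0 K and V.
    n_layers   = 48      # hypothetical layer count
    n_kv_heads = 4       # hypothetical GQA KV-head count
    head_dim   = 128     # hypothetical head dimension
    ctx        = 150_000

    # llama.cpp q8_0 blocks: 32 int8 values + one fp16 scale = 34 bytes
    Q8_0_BYTES_PER_ELEM = 34 / 32

    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V
    kv_gib = elems_per_token * ctx * Q8_0_BYTES_PER_ELEM / 2**30
    print(f"{kv_gib:.1f} GiB")
    ```

    With those assumed dimensions that comes out around 7.3 GiB; at f16 it would be nearly double, which is why the q8_0 cache types matter for fitting 150k context on a 24 GB card.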
    I was wondering if turboquant is worth the effort right now, but I'm not seeing it speed-wise yet.

    --checkpoint-min-tokens comes from a local patch I carry so that small background tasks don't wreck my checkpoint cache.
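
    For context, the reported rates translate into wall-clock time like this (the 1000-token reply length is just an assumed example, not from the benchmark above):

    ```shell
    # Rough time-to-first-token and reply time at the reported rates
    prompt_tokens=100000
    pp_rate=2520    # prompt processing, tok/s (reported)
    tg_rate=68      # generation, tok/s (reported)
    gen_tokens=1000 # assumed reply length

    pp_secs=$(( prompt_tokens / pp_rate ))  # seconds before first token
    tg_secs=$(( gen_tokens / tg_rate ))     # seconds for the reply
    echo "prompt: ${pp_secs}s, generation: ${tg_secs}s"
    ```

    So roughly 39 s of prompt processing before the first token, then about 14 s for a 1000-token reply.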

  • freakynit 4 hours ago

    Update: spot terminated