I built a proof-of-concept that streams LLM tokens as Huffman-compressed binary over WebSocket instead of JSON text over SSE.
The Problem: Current LLM APIs (OpenAI, Anthropic, self-hosted) send decoded text wrapped in JSON. For every token, you get something like: `data: {"choices":[{"delta":{"content":"hello"}}]}`. This is verbose, wastes bandwidth, and forces the server to decode tokens to text (CPU cost).
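To make the framing cost concrete, here's a quick byte count for that example frame (illustrative only; exact overhead varies by provider and event fields):

```typescript
// Rough byte accounting for one SSE frame carrying the token "hello".
// Mirrors the example frame above; numbers are illustrative.
const content = "hello";
const frame =
  "data: " +
  JSON.stringify({ choices: [{ delta: { content } }] }) +
  "\n\n"; // SSE events are terminated by a blank line
const overhead = frame.length - content.length;
// The JSON/SSE wrapper adds ~46 bytes of framing around a 5-byte token.
```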
The Solution: Stream raw token IDs as binary. The server sends Huffman-compressed token IDs over WebSocket, and the client decodes them locally using WASM. This offloads token decoding from server to client.
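The core win comes from Huffman coding: frequent token IDs get short bit codes, rare ones get long codes. A minimal sketch of the idea (the frequency table here is made up; in the real project it would presumably come from the tokenizer's corpus statistics at build time):

```typescript
// Minimal Huffman-coding sketch over token IDs, using a sorted-array
// "priority queue". Frequencies below are invented for illustration.
type HNode = { id?: number; freq: number; left?: HNode; right?: HNode };

function buildCodes(freqs: Map<number, number>): Map<number, string> {
  let nodes: HNode[] = [...freqs].map(([id, freq]) => ({ id, freq }));
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.freq - b.freq);
    const [left, right] = nodes.splice(0, 2); // merge two rarest
    nodes.push({ freq: left.freq + right.freq, left, right });
  }
  const codes = new Map<number, string>();
  const walk = (n: HNode, prefix: string) => {
    if (n.id !== undefined) codes.set(n.id, prefix || "0");
    else { walk(n.left!, prefix + "0"); walk(n.right!, prefix + "1"); }
  };
  walk(nodes[0], "");
  return codes;
}

// Hypothetical token-ID frequencies: 464 is very common, 995 is rare.
const freqs = new Map([[464, 1000], [11, 500], [31373, 10], [995, 1]]);
const codes = buildCodes(freqs);

const encode = (ids: number[]) => ids.map((id) => codes.get(id)!).join("");
const bits = encode([464, 11, 464, 31373]);
// A common token costs ~1 bit here vs 16+ bits for a fixed-width ID.
```

The client holds the same tree (shipped in the WASM bundle) and walks it bit by bit to recover token IDs before running the tokenizer's detokenize step locally.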
Results from mock benchmarks:
- 30% faster for inline completions (the critical vibecoding use case)
- 25% faster for small completions (100 tokens)
- 12% faster overall average
- ~60% bandwidth savings (3 bytes/token vs 8 bytes/token)
- Client-side decoding means servers can handle more concurrent users
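The ~60% bandwidth figure follows directly from the per-token byte counts quoted above:

```typescript
// Per-token sizes from the benchmark summary above.
const binaryBytesPerToken = 3; // Huffman-coded token ID over WebSocket
const jsonBytesPerToken = 8;   // JSON/SSE framing amortized per token
const savings = 1 - binaryBytesPerToken / jsonBytesPerToken;
// 1 - 3/8 = 0.625, i.e. roughly 60% less bandwidth
```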
Architecture:
LLM → Token IDs → Huffman encode → WebSocket (binary) → WASM decode → Text
vs.
LLM → Token IDs → Decode to text → JSON → SSE (HTTP) → Parse → Text
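For contrast, the baseline path's client-side work looks roughly like this (hypothetical parser, not from the repo; field names follow the OpenAI-style frames shown above). The binary path replaces all of this string splitting and JSON parsing with a single WASM decode call per frame:

```typescript
// Hypothetical sketch of the baseline: parsing OpenAI-style SSE chunks.
function parseSSE(chunk: string): string {
  let text = "";
  for (const line of chunk.split("\n")) {
    if (!line.startsWith("data: ")) continue;
    const payload = line.slice("data: ".length);
    if (payload === "[DONE]") break; // end-of-stream sentinel
    text += JSON.parse(payload).choices[0].delta.content ?? "";
  }
  return text;
}

const chunk =
  'data: {"choices":[{"delta":{"content":"hel"}}]}\n' +
  'data: {"choices":[{"delta":{"content":"lo"}}]}\n';
const result = parseSSE(chunk); // reassembles "hello"
```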
Tech Stack:
Rust (WASM for encoder/decoder), TypeScript (test harness), Node.js (mock servers). Includes comprehensive benchmarks comparing both protocols on identical workloads.
Limitations:
- Requires modifying the LLM server to expose token IDs (standard APIs don't do this)
- Tokenizer is baked in at build time (`./build.sh <tokenizer_name>`); models can't be switched dynamically
- Mock server only - no real LLM integration yet
- VS Code extension is non-functional (command registration issues)
- Best for self-hosted deployments where you control the stack
The VS Code extension code is included but doesn't work. Benchmarks and Node.js examples demonstrate the approach.
Why it matters:
- Protocol-level thinking for LLM APIs (not just server scaling)
- Shows that binary protocols + client-side decoding can beat traditional HTTP/JSON
- Opens discussion about whether LLM APIs should expose token IDs
Built this in ~3K LOC. Fully open source (MIT).
Try it: https://github.com/vidur2/token_entropy_encoder
Looking for feedback on the approach, potential issues, and whether this is worth pursuing further!