We need LLM query routing at the OS level, like mobile data.
I know it will sound crazy, but hear me out. I think about AI inference as infrastructure. I do not want to pay for it separately in every app I use it in. I do not think "I have to pay for YouTube's mobile data, and WhatsApp's mobile data, etc." I pay for mobile data infrastructure and let my device route it appropriately. In fact, if we ever go the local LLM route, you could have LLM capabilities without access to the internet (or the local LAN), and your OS/computer is the only thing capable of doing that routing for you.
It doesn't sound crazy at all, this seems almost obvious. The OS should provide a chat completions server and the user should be able to select the underlying LLM's server. This should be just like selecting a default search engine or browser.
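To make the "default provider" idea concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `Provider` and `SystemLLMSettings` names, the two registry entries, and both URLs are made up for illustration, and no real OS exposes this API. It only shows the shape of the idea: apps ask the system where to send chat completions, and the user's setting decides the answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    base_url: str  # hypothetical endpoint, not a real service


# Hypothetical registry of backends the user could pick from in system settings.
PROVIDERS = {
    "local": Provider("local", "http://localhost:8080/v1"),
    "cloud": Provider("cloud", "https://llm.example.com/v1"),
}


class SystemLLMSettings:
    """Holds the user's default LLM provider, like a default browser setting."""

    def __init__(self, default: str = "local") -> None:
        self._default = self._validate(default)

    @staticmethod
    def _validate(name: str) -> str:
        if name not in PROVIDERS:
            raise ValueError(f"unknown provider: {name}")
        return name

    def set_default(self, name: str) -> None:
        self._default = self._validate(name)

    def resolve(self) -> str:
        """Return the chat completions URL that app requests are forwarded to."""
        return PROVIDERS[self._default].base_url + "/chat/completions"


settings = SystemLLMSettings()
local_url = settings.resolve()
settings.set_default("cloud")
cloud_url = settings.resolve()
```

Apps would never see which provider is behind the endpoint, which is the point of the mobile data comparison.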
Hopefully the EU forces US tech giants to do this. God knows Apple and Google won't do this on their own. They gotta get that sweet default provider revenue.
I'm not sure I understand what this is trying to solve?
If a prompt I give routes to one model, and then another prompt to another model, how does one tie the context together such that the next model knows what's going on?
Otherwise this would only be useful for one-off prompts as far as I can tell.
And if it did keep a context to be passed around, it would always land cold (not in the cache).
Every turn of a conversation with an LLM is getting the whole conversation. Caching complicates the picture, but not by a huge amount. That's why a short question at the end of a long conversation chews tokens faster than it would in a fresh session.
So a conversation that's ongoing with one model and then switches to another would presumably send the whole conversation plus the new question, which defeats the purpose of splitting traffic. You're not wrong to question how this actually improves things for anything other than short sessions, and for a small problem you could just choose your own model.
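The arithmetic behind that, as a small Python sketch. It assumes no caching at all and treats each turn's size as the tokens that turn adds to the history (the previous reply plus the new user message); the function name and the numbers are made up for illustration.

```python
def input_tokens_per_turn(turn_sizes: list[int]) -> list[int]:
    """Input tokens sent on each turn when the full history is resent.

    turn_sizes[i] is the number of tokens turn i adds to the conversation.
    Assumes no prompt caching, which is a simplification.
    """
    sent = []
    history = 0
    for size in turn_sizes:
        history += size       # the history grows by this turn's tokens
        sent.append(history)  # and the whole history goes over the wire
    return sent


# Three long turns followed by a 20-token question.
per_turn = input_tokens_per_turn([1000, 1000, 1000, 20])
# per_turn == [1000, 2000, 3000, 3020]
```

The 20-token question costs 3020 input tokens in the long session versus 20 in a fresh one. Switching models for that last turn does not shrink the 3020; the new model still has to read all of it.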
Here's a use case: You want to extend the GPT 5.5 quota in your Codex subscription by routing some % of requests to DeepSeek V4 Pro. A router needs to figure out which requests to route where, for the appropriate difficulty level.
Another use case: You have two models on your local device. One is large and fairly powerful but slow; the other is smaller, faster, and good at tool calls and chat, but not great for writing and reviewing code. If you route between them per request, you can get a better developer experience with preserved performance.
The linked repo aims to help you achieve these things, as do I with the role-model router and protocol that I linked in another comment.
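For illustration only, here is a toy version of per-request routing in Python. This is not how the linked repo or role-model does it; the keyword list, the scores, the threshold, and the "large"/"small" labels are all assumptions chosen to keep the sketch short.

```python
# Hypothetical hints that a request involves writing or reviewing code.
CODE_HINTS = ("refactor", "review", "implement", "debug", "```")


def estimate_difficulty(prompt: str) -> float:
    """Crude difficulty score in [0, 1] based on keywords and length."""
    score = 0.0
    lowered = prompt.lower()
    if any(hint in lowered for hint in CODE_HINTS):
        score += 0.6  # code work goes to the stronger model
    if len(prompt) > 500:
        score += 0.4  # long prompts are treated as harder
    return min(score, 1.0)


def route(prompt: str, threshold: float = 0.5) -> str:
    """Return "large" for hard requests and "small" for everything else."""
    return "large" if estimate_difficulty(prompt) >= threshold else "small"


easy = route("fix my typo")
hard = route("Please review this function and refactor it")
```

A real router would need a much better difficulty signal than keywords, but the interface is the same: one request in, one model name out.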
There are so many proxies like this now, but I can tell you from first-hand experience that this is not going to work. You cannot just route away from a situation at such a high level, especially when we are talking about models that are quite different in behaviour, with different context windows, and tuned for different tool use. The harness is doing all kinds of funky things to compensate for issues (like tool call truncation), and a proxy that routes dynamically like this will work against the very strategies that make the harness work.
Interesting concept, works in theory, but I cannot see this being part of a larger system.
This is not choosing between different models, really. You should check the (interesting, yet sadly very slop-padded) readme. It’s about trying to make a binary decision: Is this a hard or easy question, and about making that decision extremely fast. They suggest putting another router that chooses the model behind it. I’m not sure how well it would work, but the idea is interesting and different than other routers.
We are developing many applications at my company, some of them safety critical. A natural way to route is by development phase, with git as the interface. One agent works on branch A and is responsible for brainstorming, planning, and specs, and the other is responsible for code and tests. The first agent creates tickets for the second one, and the second one consumes them. This works with today’s standard harness.
Slight tangent, but “Wayfinder sits behind whatever OpenAI-compatible client you already use” reminds me that descriptions of where proxies sit in the information flow always seem so arbitrary to me:
- “after the client”
- “reverse proxy” (in front of servers)
- “proxy” (in front of client)
I always have to look this up, surely there must be a standardized way to describe this?
Love to see local/cloud routing explicitly supported.
I'm building another router for routing between local and remote models, Show HN coming up later today. Here's a sneak preview of the GitHub repo: https://github.com/try-works/role-model
I'm not sure whether the output of easy commands like "summarize this" is added back to the context. I always assumed it lives in a separate UI layer?
It's funny how much that first paragraph is Claude's voice. I don't know how it got trained so hard to use, "the shape of" for everything.
Loads of Ed Sheeran in the training data?
Do you want the honest answer?
"After the client" and "in front of the client" can mean the same thing depending on your viewpoint.
Exactly, that’s my point
Posted my ShowHN: https://news.ycombinator.com/item?id=48706181
I do this manually with a desktop app called BoltAI that lets you continue the whole conversation at your LLM of choice.
It'd be nice to just have a command prefix e.g.
/local fix my typo
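A minimal sketch of that prefix idea in Python. The function name, the set of known prefixes, and the "cloud" default are assumptions for illustration, not anything BoltAI or another tool actually implements.

```python
def parse_prefix(
    message: str,
    known: tuple[str, ...] = ("local", "cloud"),
    default: str = "cloud",
) -> tuple[str, str]:
    """Split "/target rest of prompt" into (target, prompt).

    Messages without a recognised prefix go to the default target unchanged.
    """
    if message.startswith("/"):
        head, _, rest = message[1:].partition(" ")
        if head in known:
            return head, rest.strip()
    return default, message


target, prompt = parse_prefix("/local fix my typo")
# target == "local", prompt == "fix my typo"
```

Unknown prefixes are passed through untouched, so a message like "/etc/hosts is broken" would not be misread as a routing command.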
That’s what I did with Pi, super simple :)
Can you send to multiple LLMs to compare responses? From that, create a heuristic for which LLM gets what.
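One way that could work, sketched in Python: send the same prompt to several models, note which answer you preferred per kind of request, and let the tally pick the model next time. The `PreferenceTable` class, the category names, and the model labels are hypothetical, and how the winner gets judged (by hand or automatically) is left out entirely.

```python
from collections import Counter, defaultdict


class PreferenceTable:
    """Tallies which model won side-by-side comparisons, per request category."""

    def __init__(self) -> None:
        self._wins: defaultdict[str, Counter] = defaultdict(Counter)

    def record(self, category: str, winner: str) -> None:
        """Note that `winner` gave the preferred answer for this category."""
        self._wins[category][winner] += 1

    def best_for(self, category: str, fallback: str) -> str:
        """Return the model with the most wins, or the fallback if no data."""
        counts = self._wins.get(category)
        if not counts:
            return fallback
        return counts.most_common(1)[0][0]


table = PreferenceTable()
table.record("code", "large")
table.record("code", "large")
table.record("code", "small")
code_choice = table.best_for("code", fallback="small")
chat_choice = table.best_for("chat", fallback="small")
```

The comparison phase costs several times the tokens of a single request, so this only pays off if the learned table then handles many more requests than it took to build.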
This is the way!
I like to think so!