I appreciate Mistral (and others) releasing their weights for free. But given how much llama.cpp underpins the programs that let users run open-weight models, it is a little frustrating to see companies that brag about releasing models to the community then leave that community to its own devices to slowly work out how to actually run them.
I hear the reason for this is that llama.cpp keeps breaking basic things, so it has become an unreliable partner. That seems to be what Ollama is trying to address by loosening its ties to llama.cpp and working directly with the companies training these models so releases can land simultaneously (e.g. GPT-OSS).
There are many different inference libraries, and IMO it's not yet clear which ones a small company like Mistral should back.
They do release high-quality inference code, i.e. https://github.com/mistralai/mistral-inference
There's more to it, though. The inference code you linked to is Python. Unless my software is written in Python, I have to ship a CPython runtime to run that code and then wire it up (or port it, if you're feeling spicy).
Ollama brings value by exposing a plain HTTP API (served on a local port) with many client SDKs. You don't even need an SDK to use it effectively. If you're writing Node or PHP or Elixir or ClojureScript or whatever else you enjoy, you're probably covered.
It also means that you can swap models trivially, since you're essentially using the same API for each one. You never need to worry about dependency hell or the issues involved in hosting more than one model at a time.
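To make that concrete, here is a minimal sketch of what "same API for every model" looks like in practice: calling Ollama's HTTP endpoint from Node/TypeScript with no SDK at all. The model names ("mistral", "llama3") and the default localhost port are assumptions; use whatever you've actually pulled locally.

    // Minimal sketch: call a locally running Ollama server from Node 18+ (global fetch),
    // using only its HTTP API on the default port 11434. No SDK, no Python runtime.
    async function generate(model: string, prompt: string): Promise<string> {
      const res = await fetch("http://localhost:11434/api/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model, prompt, stream: false }),
      });
      if (!res.ok) throw new Error(`Ollama returned ${res.status}`);
      const data = await res.json();
      return data.response; // non-streaming responses carry the full text here
    }

    // Swapping models is just a different string; the request shape never changes.
    // (Assumes both models have already been pulled locally, e.g. via "ollama pull mistral".)
    console.log(await generate("mistral", "Summarize what a GGUF file is."));
    console.log(await generate("llama3", "Summarize what a GGUF file is."));

That is roughly the whole integration story, which is why the language your application happens to be written in stops mattering.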
As far as I know, Ollama is really the only solution that does this. Or at the very least, it's the most mature.
Wow, I never realized how "disconnected" Mistral was from the ecosystem.