Voxtral – Frontier open source speech understanding models

(mistral.ai)

106 points | by meetpateltech a day ago

23 comments

homarp 17 hours ago

Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16.

Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.

I’m pretty excited to play around with this. I’ve worked with Whisper quite a bit, and it’s awesome to have another model in the same class, and from Mistral, who tend to be very open. I’m sure Unsloth is already working on some GGUF quants. I will probably spin it up tomorrow and try it on some audio.

ipsum2 13 hours ago

24B is crazy expensive for speech transcription. Conspicuously, there is no comparison with Parakeet, a 600M-parameter model that's currently dominating leaderboards (but only for English).

azinman2 7 hours ago

But it also includes world knowledge, can do tool calls, etc. It’s an omnimodel.

sheerun 9 hours ago

In the demo they mention that the Polish pronunciation is pretty bad, spoken as if by a native English speaker using it as a second language. I wonder if it's the same for other languages. On the other hand, the whispered English is hilariously good, especially the different emotions.

Raed667 6 hours ago

It is insane how good the "French man speaking English" demo is. It captures a lot of subtleties.

danelski 20 hours ago

They claim to undercut competitors of similar quality by half for both models, yet they released both under Apache 2.0 instead of following the "smaller is open, larger is closed" strategy used for their last releases. What's different here?

halJordan 13 hours ago

They didn't release a Voxtral Large, so your question doesn't really make sense.

danelski 6 hours ago

It's about what their top offering is at the moment, not about having Large in the name. Mistral Medium 3 is notably not Mistral Large 3, but it was released as API-only.

wmf 14 hours ago

They're working on a bunch of features so maybe those will be closed. I guess they're feeling generous on the base model.

Havoc 14 hours ago

Probably not looking to compete directly in the transcription space.

GaggiX 18 hours ago

There is also a Voxtral Small 24B model available for download: https://huggingface.co/mistralai/Voxtral-Small-24B-2507

lostmsu 16 hours ago

Does it support realtime transcription? What is the approximate latency?

rolisz an hour ago

Unlikely. The small model is much larger than Whisper (which is already hard to use for realtime).

homarp 19 hours ago

Weights: https://huggingface.co/mistralai/Voxtral-Mini-3B-2507 and https://huggingface.co/mistralai/Voxtral-Small-24B-2507

homarp 19 hours ago

Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16.

Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
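
For a rough sense of where those figures come from, here is a back-of-the-envelope sketch. It assumes the nominal parameter counts in the model names (3B and 24B; the real counts, including the audio encoder, may differ) and 2 bytes per parameter for bf16/fp16. Whatever is left over from the quoted figures would presumably go to activations, KV cache, and runtime overhead, but that split is a guess, not something stated by Mistral.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (10^9 bytes).

    bytes_per_param defaults to 2 because bf16 and fp16 are both 16-bit.
    """
    return num_params * bytes_per_param / 1e9


# Parameter counts are ASSUMED from the model names, not measured.
# The quoted totals are the figures given in the comment above.
models = [
    ("Voxtral-Mini-3B-2507", 3e9, 9.5),
    ("Voxtral-Small-24B-2507", 24e9, 55.0),
]

for name, params, quoted_gb in models:
    weights = weight_memory_gb(params)
    remainder = quoted_gb - weights
    print(f"{name}: ~{weights:.0f} GB for weights, "
          f"~{remainder:.1f} GB of the quoted {quoted_gb} GB left for everything else")
```

Under the same assumption, a 4-bit quant would need roughly a quarter of the weight memory, which is why GGUF quants matter for running the 24B model on a single consumer GPU.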

lostmsu 16 hours ago

My Whisper v3 Large Turbo is $0.001/min, so their price comparison is not exactly perfect.

ImageXav 16 hours ago

How did you achieve that? I was looking into it and $0.006/min is quoted everywhere.
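
To put the two per-minute prices mentioned here side by side, a quick sketch. The 100-hour workload is a hypothetical number picked for illustration; the two rates are the ones quoted in this thread.

```python
def transcription_cost(minutes: float, price_per_minute: float) -> float:
    """Total cost in dollars for transcribing `minutes` of audio."""
    return minutes * price_per_minute


hours = 100  # hypothetical workload, not from the thread
minutes = hours * 60

# Per-minute rates as quoted in the comments above.
rates = {
    "$0.006/min": 0.006,
    "$0.001/min": 0.001,
}

for label, rate in rates.items():
    print(f"{hours} h at {label}: ${transcription_cost(minutes, rate):.2f}")
```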

lostmsu 16 hours ago

Harvesting idle compute. https://borgcloud.org/speech-to-text

4b11b4 8 hours ago

This is your service?

lostmsu 2 hours ago

Yes

BetterWhisper 14 hours ago

Do you support speaker recognition?

lostmsu 13 hours ago

No. I found the models that do that unreliable when there are many speakers.