14 comments

  • mythz 2 hours ago

    Big fan of Salvatore's voxtral.c and flux2.c projects - hope they continue to get optimized, as it'd be great to have lean options without external deps. Unfortunately it's currently too slow for real-world use (tested on an AMD 7800X3D with BLAS) when adding Voice Input support to llms-py [1].

    In the end Omarchy's new support for voxtype.io provided the nicest UX, followed by Whisper.cpp, and despite being slower, OpenAI's Whisper is still a solid local transcription option.

    Also very impressed with both the performance and price of Mistral's new Voxtral Transcription API [2] - effectively instant and really cheap ($0.003/min); IMO it's the best option in CPU- or disk-constrained environments.

    [1] https://llmspy.org/docs/features/voice-input

    [2] https://docs.mistral.ai/models/voxtral-mini-transcribe-26-02

    • mijoharas 2 hours ago

      One thing I keep looking for is transcribing while I'm talking. I feel like I need that visual feedback. Does voxtype support that?

      (I wasn't able to find anything at a glance.)

      Handy claims to have an overlay, but it seems to not work on my system.

      • mythz 2 hours ago

        Not sure how it works on other OSes, but in Omarchy [1] you hold down `Super + Ctrl + X` to start recording and release it to stop. While it's recording you'll see a red voice-recording icon in the top bar, so it's always clear when it's recording.

        As llms-py is a local web app, though, I had to build my own visual indicator [2], which displays a red microphone next to the prompt when it's recording. It supports both tap-on/off and hold-down recording modes. When using voxtype I'm just using the tool for transcription (i.e. not Omarchy's OS-wide dictation feature), like:

        $ voxtype transcribe /path/to/audio.wav

        If you're interested, the Python source code supporting multiple voice transcription backends is at: [3]

        [1] https://learn.omacom.io/2/the-omarchy-manual/107/ai

        [2] https://llmspy.org/docs/features/voice-input

        [3] https://github.com/ServiceStack/llms/blob/main/llms/extensio...
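        The general shape is just a registry of transcription backends keyed by name, each shelling out to its CLI, with a fallback order based on what's installed. A rough Python sketch (hypothetical names, not the actual llms-py code):

```python
import shutil
import subprocess

# Hypothetical registry: backend name -> (required binary, transcribe fn).
BACKENDS = {}

def register(name, binary):
    """Register a transcription backend under a name."""
    def wrap(fn):
        BACKENDS[name] = (binary, fn)
        return fn
    return wrap

@register("voxtype", "voxtype")
def _voxtype(path):
    # `voxtype transcribe <file>` prints the transcript to stdout.
    return subprocess.check_output(
        ["voxtype", "transcribe", path], text=True).strip()

def transcribe(path, preferred=("voxtype",)):
    """Use the first preferred backend whose binary is installed."""
    for name in preferred:
        if name in BACKENDS and shutil.which(BACKENDS[name][0]):
            return BACKENDS[name][1](path)
    raise RuntimeError("no transcription backend available")
```

        Swapping backends (voxtype, Whisper.cpp, a hosted API, ...) then only means registering another entry.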

  • Curiositry 8 hours ago

    This was a breeze to install on Linux. However, I haven't managed to get realtime transcription working yet, à la Whisper.cpp stream or Moonshine.

    --from-mic only supports macOS. I'm able to capture audio with ffmpeg, but adapting the ffmpeg example to use mic capture hasn't worked yet:

    ffmpeg -f pulse -channels 1 -i 1 -f s16le - 2>/dev/null | ./voxtral -d voxtral-model --stdin
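    One thing worth double-checking (an assumption on my part - the voxtral.c README should confirm the exact expected input format): a raw PCM pipe only works if the sample rate and channel count match what the model expects, and PulseAudio capture often defaults to 44.1 kHz, while speech models typically want 16 kHz mono. A small Python helper that builds an explicitly-resampled capture command:

```python
def mic_capture_cmd(source="default", rate=16000):
    # Builds the ffmpeg side of the pipe, resampling explicitly rather
    # than relying on the capture device's default. The 16 kHz mono
    # target is an assumption - check voxtral.c's docs for the format
    # its --stdin mode actually expects.
    return [
        "ffmpeg",
        "-f", "pulse",      # PulseAudio capture
        "-i", source,       # "default" = the default input device
        "-ac", "1",         # downmix to mono
        "-ar", str(rate),   # resample to 16 kHz
        "-f", "s16le",      # raw signed 16-bit little-endian PCM
        "-",                # write to stdout for piping into voxtral
    ]
```

    e.g. feed `subprocess.Popen(mic_capture_cmd(), stdout=subprocess.PIPE)` into voxtral's stdin, or just add `-ac 1 -ar 16000` to the shell pipeline above.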

    It's possible my system is simply under spec for the default model.

    I'd like to be able to use this with the voxtral-q4.gguf quantized model from here: https://huggingface.co/TrevorJS/voxtral-mini-realtime-gguf

    • jwrallie 6 hours ago

      I am interested in a way to capture audio not only from the mic but also from one of the monitor ports, so you could pipe whatever audio you're hearing (e.g. from the web) directly into one of these solutions for real-time transcription. Did anyone manage to do that?

      I can, for example, capture that audio with Audacity or OBS Studio and transcribe it later, so it should be possible in real time too, assuming my machine can keep up.
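      On PulseAudio/PipeWire this should be doable, because every output sink exposes a matching capture source named `<sink>.monitor` - recording from it captures whatever you're currently hearing. A sketch (helper names are mine, and the 16 kHz mono format is an assumption about what these models expect):

```python
import subprocess

def default_monitor_source():
    # PulseAudio/PipeWire expose each output sink as a capture source
    # named "<sink>.monitor"; recording from it captures what you hear.
    sink = subprocess.check_output(
        ["pactl", "get-default-sink"], text=True).strip()
    return sink + ".monitor"

def monitor_capture_cmd(source, rate=16000):
    # Same shape as a mic capture, just pointed at the monitor source.
    return ["ffmpeg", "-f", "pulse", "-i", source,
            "-ac", "1", "-ar", str(rate), "-f", "s16le", "-"]
```

      Then pipe the output into whichever transcriber you're using, exactly as with mic input.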

    • yjftsjthsd-h 7 hours ago

      Does it work if you use ffmpeg to feed it audio from a file? I personally would try file->ffmpeg->voxtral then mic->ffmpeg->file, and then try to glue together mic->ffmpeg->voxtral.

      (But take this with a grain of salt; I haven't tried it yet.)

      • Curiositry 5 hours ago

        Recording audio with FFmpeg, and transcribing a file piped from FFmpeg, both work.

        Given that it took 19.64 minutes to transcribe the 11-second sample WAV, it’s possible I just didn’t wait long enough :)

        • yjftsjthsd-h 4 hours ago

          Ah. In that case... Yeah. Is it using GPU, and does the whole model fit in your (V)RAM?

          • ekianjo 3 hours ago

            This is a CPU implementation only.

  • hrpnk 2 hours ago

    There is also an MLX implementation: https://github.com/awni/voxmlx

  • written-beyond 4 hours ago

    Funny, this and the Rust runtime implementation are neck and neck on the frontpage right now.

    Cool project!

  • sgt 4 hours ago

    I'm very interested in speech-to-text - especially tricky dialects and specialized terminology - but I'm still confused about where best to start in order to train models on the huge database of voice samples I own.

    Any ideas from the HN crowd currently involved in speech-to-text models?

  • sylware an hour ago

    Finally, a plain and simple C lib to run open LLM weights?