Ask HN: Could we skip speech to text using vector databases?

3 points | by andrewoodleyjr 2 years ago

2 comments

Imanari 2 years ago

Interesting idea. This made me think of these audio illusions[0] where what you hear depends on what you expect to hear. I wonder if this would present challenges for the proposed approach.

[0] https://www.youtube.com/watch?v=8FXQ38-ZQK0 (Sorry for the fast-food-tier video; it was the best I could find that was not a short.)

minimaxir 2 years ago

It is possible to create audio/speech embeddings using a model like CLAP: https://huggingface.co/laion/larger_clap_music_and_speech

The results aren't good for nearest neighbor vector lookup, however.
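
For anyone who wants to try this, here is a minimal sketch of the lookup step only. It assumes the audio clips have already been turned into embedding vectors (for example by a CLAP model) and uses brute-force cosine similarity instead of a real vector database. The function names, labels, and 3-dimensional vectors are made up for illustration and are not real CLAP output.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, index, k=1):
    """Return the k (label, score) pairs most similar to the query.

    `index` is a list of (label, vector) pairs, scanned by brute force.
    """
    scored = [(label, cosine_similarity(query, vec)) for label, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings" standing in for stored utterances.
toy_index = [
    ("turn on the lights", [0.9, 0.1, 0.0]),
    ("play some music", [0.1, 0.9, 0.1]),
    ("what's the weather", [0.0, 0.2, 0.9]),
]

# Hypothetical embedding of a new, unlabeled audio clip.
query_vec = [0.8, 0.2, 0.1]
best_label, best_score = nearest_neighbors(query_vec, toy_index, k=1)[0]
print(best_label, round(best_score, 3))
```

Whether the top match is actually the right utterance depends entirely on how well the embedding model separates spoken content, which is the part reported above as not working well.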