3 points | by andrewoodleyjr 2 years ago
2 comments
Interesting idea. This made me think of these audio illusions[0] where what you hear depends on what you expect to hear. I wonder if this would present challenges for the proposed approach.
[0] https://www.youtube.com/watch?v=8FXQ38-ZQK0 Sorry for the fast-food-tier video; it was the best I could find that was not a short.
It is possible to create audio/speech embeddings using a model like CLAP: https://huggingface.co/laion/larger_clap_music_and_speech
The results aren't good for nearest neighbor vector lookup, however.
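For anyone who wants to check this themselves, here is a minimal sketch of what nearest neighbor lookup over embeddings means, assuming cosine similarity as the metric. The vectors and names (`toy_index`, `clip_a`, and so on) are made up for illustration; real vectors would come from an audio embedding model such as the CLAP checkpoint linked above, and this does not reproduce how any particular library does the search.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, index, k=1):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional "embeddings"; real audio embeddings are far
# higher-dimensional and would be produced by the model, not typed by hand.
toy_index = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.0, 1.0, 0.0],
    "clip_c": [0.7, 0.7, 0.0],
}

query = [0.9, 0.1, 0.0]
top = nearest_neighbors(query, toy_index, k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

Brute-force search like this is fine for a few thousand clips. Whether the neighbors it returns are meaningful depends entirely on the embedding model, which is the problem described above.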