While it is impressive, and I like following the advancements in this field, it is incredibly frustrating to listen to. I can't put my finger on why exactly. It's definitely closer to human-sounding, but the uncanny valley is so deep here that I find myself thinking "I just want the point, not the fake personality that comes with it". I can't make it through a 30-second demo.
We're used to hearing some kind of identity behind voices -- we unconsciously sense clusters of vocabulary, intonation patterns, tics, frequent interruption vs. quiet patience, silence tolerance, response patterns to various triggers, etc. that communicate a coherent person of some kind.
We may not know that a given speaker is a Gen X Methodist from Wisconsin who grew up at skate parks in the suburbs, but we hear clusters of speech behavior that let our brain go "yeah, I'm used to things fitting together in this way sometimes".
These don't have that.
Instead, they seem to mostly smudge together behaviors that are generally common in aggregate across the training data. The speakers all voice interrupting acknowledgements eagerly, they all use a bright, enunciated podcaster tone, they all draw on similar word choice, etc. -- they distinguish gender and each have a stable overall vocal tone, but no identity.
I don't doubt that this'll improve quickly though, by training specific "AI celebrity" voices narrowed to sound more coherent, natural, identifiable, and consistent. (And then, probably, leasing out those voices for $$$.)
As a tech demo for "render some vague sense of life behind this generated dialog" this is pretty good, though.
Agreed. To me it sounds like bad voice-over actors reading from a script. So the natural parts of a conversation, where you might say the wrong thing and step back to correct yourself, are all gone. Impressive for sure.
It's probably because it's trained on "professional audio" -- ads, movies, audiobooks -- and not "normal people talking". Like the effect when diffusion models were mostly trained on stock photos.
Totally agree. Maybe it’s just the clips they chose, but it feels overfit on the weird conversational elements that make it impressive? Like the “oh yeahs” from the other person when someone is speaking. It is cool to see that natural flow in a conversation generated by a model, but there’s waaaay too much of it in these examples to sound natural.
And I say all that completely slackjawed that this is possible.
I'd love to see stats on disfluency rate in conversation, podcasts, and this sample to get an idea of where it lies. It seems like they could have cranked it up, but there's also the chance that it's just the frequency illusion because we were primed to pay attention to it.
I love the technology, but I really don't want AI to sound like this.
Imagine being stuck on a call with this.
> "Hey, so like, is there anything I can help you with today?"
> "Talk to a person."
> "Oh wow, right. (chuckle) You got it. Well, before I connect you, can you maybe tell me a little bit more about what problem you're having? For example, maybe it's something to do with..."
That's how the DJ feature of Spotify talks and it's pretty jarring.
"How's it going. We're gonna start by taking you back to your 2022 favorites, starting with the sweet sounds of XYZ". There's very little you can tweak about it, the suggestions kinda suck, but you're getting a fake friend to introduce them to you. Yay, I guess...
> Like the “oh yeahs” from the other person when someone is speaking.
I bet that if you select a British accent you will get fewer of them.
I'm hoping it will be a lot of Ok Guv'ner and right you ares in the style of Dick Van Dyke.
Gor blimey lad, that's the problem now innit???
Right mate
It's like their training set was made up entirely of awkward podcaster banter.
At least 83% Leo Laporte.
Agreed. To be fair, I also get annoyed by fake/exaggerated expression from human podcasters.
> Example of a multi-speaker dialogue generated by NotebookLM Audio Overview, based on a few potato-related documents.
Listening to this at 1.75x speed is excellent. I think the generated speaking speed is kept slow for audio quality, because it'd be much harder to slow down the generated audio while retaining quality than vice versa.
It sounds like every sentence is an ad read.
Yeah... it isn't that it doesn't sound like human speech... it just sounds like how humans speak when they're uncomfortable, or reading prepared remarks when they aren't good at it.
That could just be the context though. Listening to a clip that's a demo of what the model can produce is very different to listening to a YouTube video that's using the model to generate speech about something you'd actually want to watch a video of.
I suppose it doesn't matter whether it's a human or a bot delivering the message, if the message is boring.
Probably because you're expecting it and looking at a demo page. Put these voices behind a real video or advertisement and I would imagine most people wouldn't be able to tell that it's AI generated at all.
It'd be annoying to me whether it was AI or human. The faux-excitement and pseudo-bonhomie is grating. They should focus on how people actually talk, not on copying the vocal intonation of coked-up public radio presenters just back from a positive affirmation seminar.
They all sound like Valley people, complete with the raspy voice and everything.
It's due to the histrionic mental epidemic that we are going through.
A lot of people are just like that IRL.
They cannot just say "the food was fine", it's usually some crap like "What on earth! These are the best cheese sticks I've had IN MY EN TI R E LIFE!".
Try it out in the demo https://cloud.google.com/text-to-speech/?hl=en and in the API https://cloud.google.com/text-to-speech/docs/create-dialogue...
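To sketch what calling the dialogue endpoint might look like from Python -- a rough, untested outline assuming the `google-cloud-texttospeech` package's v1beta1 `MultiSpeakerMarkup` API and the `en-US-Studio-MultiSpeaker` voice described in the linked docs (treat the exact names as assumptions):

```python
def parse_script(script):
    """Parse "R: hello" style lines into (speaker, text) turns.

    Google's multi-speaker docs use single-letter speaker labels
    (R/S/T/U); this helper is just plain string parsing.
    """
    turns = []
    for line in script.strip().splitlines():
        speaker, _, text = line.partition(":")
        if text:
            turns.append((speaker.strip(), text.strip()))
    return turns


def synthesize_dialogue(script, out_path="dialogue.mp3"):
    """Send a multi-speaker script to Cloud TTS (needs GCP credentials)."""
    # pip install google-cloud-texttospeech -- field and voice names below
    # are assumptions based on the create-dialogue docs linked above.
    from google.cloud import texttospeech_v1beta1 as tts

    client = tts.TextToSpeechClient()
    markup = tts.MultiSpeakerMarkup(
        turns=[tts.MultiSpeakerMarkup.Turn(speaker=s, text=t)
               for s, t in parse_script(script)]
    )
    response = client.synthesize_speech(
        input=tts.SynthesisInput(multi_speaker_markup=markup),
        voice=tts.VoiceSelectionParams(
            language_code="en-US", name="en-US-Studio-MultiSpeaker"),
        audio_config=tts.AudioConfig(audio_encoding=tts.AudioEncoding.MP3),
    )
    with open(out_path, "wb") as f:
        f.write(response.audio_content)
```

The parsing half runs anywhere; the API half obviously needs a GCP project with the Text-to-Speech API enabled.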
If I change the language in the demo, it removes all my text and replaces it with a template text. That's bad.
It looks like a lot of progress has been made lately in audio generation and audio understanding (everything related to speech, I mean).
Is this related to LLM, or is this a completely different branch of AI, and is it just a coincidence? I am curious.
Is there a free (ad supported?) online tool without login that reads text that you paste into it?
I often would like to listen to a blog post instead of reading it, but haven't found an easy, quick solution yet.
I tried piping text through OpenAI's tts-1-hd model, and it is the first one I've found that is human-like enough for me to enjoy listening to. So I could write a tool for my own use case that pipes the text to tts-1-hd and plays the audio. But maybe there is already something with a public web interface out there?
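In case it helps, here's roughly what that pipe could look like as a small script. It assumes the official `openai` package, the documented ~4096-character input limit, and the stock `alloy` voice; the chunking helper is my own invention:

```python
def chunk_text(text, limit=4096):
    """Split text into chunks under the API's input limit,
    preferring to break at sentence-ish boundaries."""
    chunks, current = [], ""
    for piece in text.replace("\n", " ").split(". "):
        if not piece.strip():
            continue
        sentence = piece if piece.endswith(".") else piece + ". "
        if len(current) + len(sentence) > limit and current:
            chunks.append(current.strip())
            current = ""
        current += sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks


def read_aloud(text):
    """Synthesize each chunk with tts-1-hd, saving numbered MP3 parts."""
    from openai import OpenAI  # pip install openai; needs OPENAI_API_KEY set
    client = OpenAI()
    for i, chunk in enumerate(chunk_text(text)):
        resp = client.audio.speech.create(
            model="tts-1-hd", voice="alloy", input=chunk)
        # write_to_file matches current openai-python; adjust per your version
        resp.write_to_file(f"part_{i:03d}.mp3")
```

Joining the MP3 parts back together (e.g. with ffmpeg) and actually playing them is left out.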
Both Windows and macOS (the operating systems) have this built in under accessibility. It's worth a try; I use it sometimes when I want to have something read to me while cooking.
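If you'd rather script it than click through the accessibility UI, the same built-in engines are reachable from the command line; a sketch (macOS's `say` and speech-dispatcher's `spd-say` are real commands, and the Windows branch leans on the stock System.Speech assembly -- quoting details untested):

```python
import platform
import subprocess


def tts_command(system, text):
    """Build the OS-native text-to-speech command for a platform name."""
    if system == "Darwin":                       # macOS built-in `say`
        return ["say", text]
    if system == "Windows":                      # stock System.Speech via PowerShell
        script = ("Add-Type -AssemblyName System.Speech; "
                  "(New-Object System.Speech.Synthesis.SpeechSynthesizer)"
                  ".Speak([Console]::In.ReadToEnd())")
        return ["powershell", "-Command", script]
    return ["spd-say", "--wait", text]           # Linux speech-dispatcher


def speak(text):
    cmd = tts_command(platform.system(), text)
    if platform.system() == "Windows":
        # Feed the text on stdin to sidestep PowerShell quoting issues.
        subprocess.run(cmd, input=text, text=True, check=True)
    else:
        subprocess.run(cmd, check=True)
```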
I use MS Edge for this exact use case. It works well enough on any platform.
There is on iOS, with no ads: "Reader" by Eleven Labs. I haven't used it that much, but I have listened to some white papers and blogs (some of which were like 45 minutes long) and it "just worked". It even lets you click the text you want to jump to.
And it's Eleven Labs quality -- which, unless I've fallen behind the times, is the highest-quality TTS by a margin.
Reader is on a pretty good path to a monthly subscription model. Great audio quality, large selection of voices, and support for long-form input text.
There's also the built-in "Speak Selection" feature you can enable in the accessibility settings.
The voices are impressive (I can't tell the difference as a non native speaker) but their "personality" sounds extremely annoying lmao
> This means it generates audio over 40-times faster than real time.
Astounding
YouTube videos are already infested with insufferable AI elevator background "music". Even some channels that were previously good are using it.
On the bright side, you can stop watching these channels and have more time for serious things.
> AI elevator background "music".
What are some examples? I haven't encountered this.