The more I listen to NotebookLM “episodes”, the more I am convinced that Google has trained a two-speaker “podcast discussion” model that directly generates the podcast off the back of an existing multimodal backbone. The two speakers interrupt and speak over each other in an uncannily humanlike manner. I wonder whether they basically fine-tuned against a huge library of actual podcasts along with the podcast transcripts, and perhaps generated synthetic “input material” from the transcripts to feed in as training samples.
In other words, take an episode of The Daily and have one language model write a hypothetical article that would summarize what the podcast was about. And then pass that article into the two-speaker model, transcribe the output, and see how well that transcript aligns with the article fed in as input.
I am sure I’m missing essential details, but the natural sound of these podcasts cannot possibly be coming from a text transcript.
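The alignment check the commenter hypothesizes could be approximated with a simple token-overlap score. This is a toy stand-in (a real pipeline would presumably use something like ROUGE or an embedding similarity); the article and transcript below are made up for illustration:

```python
# Toy sketch of the hypothesized training-data check: given an input
# "article" and the transcript of the generated two-speaker audio,
# score how much of the article's content survives. Unigram recall
# here is purely illustrative, not anything Google has described.
import re

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def content_recall(article: str, transcript: str) -> float:
    """Fraction of distinct article tokens that appear in the transcript."""
    art = set(tokenize(article))
    tra = set(tokenize(transcript))
    return len(art & tra) / len(art) if art else 0.0

article = "The new model generates overlapping two-speaker audio."
transcript = ("Host A: So the new model, it generates audio with two speakers. "
              "Host B: Right, and they even overlap!")
print(round(content_recall(article, transcript), 2))  # 0.75
```

A high score would suggest the synthetic article is a faithful proxy for the episode, which is exactly the property you'd want in a (transcript, input-material) training pair.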
> the more I am convinced that Google has trained a two-speaker “podcast discussion” model that directly generates the podcast off the back of an existing multimodal backbone.
I have good and bad news for you - they did not! We were the first podcast to interview the audio engineer who led the audio model:
https://www.latent.space/p/notebooklm
TL;DR: they did confirm that the transcript and the audio are generated separately, but yes, the TTS model is trained far beyond anything available in OSS or commercially.
Soundstorm is probably the TTS https://google-research.github.io/seanet/soundstorm/examples...
Thank you swyx. How did I miss this episode?
I feel similarly about NotebookLM, but have noticed one odd thing - occasionally Host A will be speaking, and suddenly Host B will complete their sentence. And usually when this happens, it's in a way that doesn't make sense, because Host A was just explaining something to or answering a question of Host B.
I'm actually not sure what to make of that, but it's interesting to note
It's speaker diarisation: the quality of the resulting labelling and of the speaker-end marker tokens is what influences the rhythm of the conversation. (Or the input data just has many podcast hosts completing each other's... sandwiches?)
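One way to picture the artifact: if training transcripts are segmented into turns by speaker-change markers, a marker that lands mid-sentence reassigns the rest of that sentence to the other host. A toy illustration, assuming a hypothetical `<turn>` token and a strict two-host alternation (neither is a known NotebookLM internal):

```python
# Toy illustration: diarisation output as a flat token stream with
# <turn> markers. A marker placed one token early hands the end of
# the sentence to the other speaker -- producing the "finishing each
# other's sentences" effect. The <turn> token is hypothetical.
def to_turns(tokens: list[str]) -> list[tuple[str, str]]:
    """Alternate Host A / Host B at every <turn> marker."""
    hosts, turns, current, i = ("Host A", "Host B"), [], [], 0
    for tok in tokens:
        if tok == "<turn>":
            turns.append((hosts[i % 2], " ".join(current)))
            current, i = [], i + 1
        else:
            current.append(tok)
    turns.append((hosts[i % 2], " ".join(current)))
    return turns

clean = "so the model streams audio <turn> oh nice".split()
noisy = "so the model streams <turn> audio oh nice".split()  # marker one token early
print(to_turns(noisy)[1])  # ('Host B', 'audio oh nice') -- B "completes" A's sentence
```

If enough training examples carry this kind of labelling noise, a model could plausibly learn that completing the other speaker's sentence is normal conversational behaviour.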
That's the annoying part about NLM. It ruins the illusion of having one person explaining it to the other person.
Following up on swyx, the TTS is probably Google finally releasing Soundstorm from the basement.
https://google-research.github.io/seanet/soundstorm/examples...
Great to see this: Fellow tech-geeks, ignore the NotebookLM thing at your peril.
NotebookLM, far and away, has been the "AI Killer App" for the VAST MAJORITY of bright-but-not-particularly-techy people I know. My 70ish parents and my 8 year old kid are both just blown away by this thing and can't stop playing with it.
Edit: As someone pointed out below, I absolutely mean just the "podcast" thing.
As someone who doesn’t listen to podcasts what perils will I suffer from not making podcasts in notebookLM?
Are we talking about NotebookLM generally or specifically the podcast stunt?
Good question: I absolutely mean the podcast stunt.
Idk if I’d call it a killer app.
The podcasts are grating to listen to and usually only contain very surface information I could gain from a paper’s abstract.
It’s a wildly impressive technical achievement though.
Pretty weird choice of TTS engines. None of them are anywhere near state of the art as far as open TTS system goes. XTTSv2 or the new F5-TTS would have been much better choices.
You can always update the code to use those. Meta releasing stuff on GitHub is not about shipping the best system; it's a proof of concept. The licenses of those TTS systems matter; it's not enough for them to be open. If this were a product for their users, they would definitely have better TTS.
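For anyone who wants to try the swap, the seam is small: keep the script-splitting step and point the synthesis loop at XTTS-v2. A hedged sketch (the "Speaker 1:"/"Speaker 2:" script format and the drop-in fit with NotebookLlama's pipeline are assumptions; the Coqui model id is the published one):

```python
# Hedged sketch of replacing the TTS stage with Coqui XTTS-v2.
# Assumes the dialogue script uses "Speaker 1:"/"Speaker 2:" prefixes.
SPEAKER_PREFIXES = ("Speaker 1:", "Speaker 2:")

def split_script(script: str) -> list[tuple[int, str]]:
    """Split a two-speaker script into (speaker_index, line) pairs."""
    turns = []
    for line in script.strip().splitlines():
        for idx, prefix in enumerate(SPEAKER_PREFIXES):
            if line.startswith(prefix):
                turns.append((idx, line[len(prefix):].strip()))
    return turns

def synthesize(script: str, ref_wavs: list[str], out_dir: str = "out") -> None:
    """Voice-clone each turn with XTTS-v2 (one reference wav per speaker)."""
    from TTS.api import TTS  # pip install TTS (Coqui)
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    for n, (speaker, text) in enumerate(split_script(script)):
        tts.tts_to_file(text=text, speaker_wav=ref_wavs[speaker],
                        language="en", file_path=f"{out_dir}/{n:03d}.wav")

print(split_script("Speaker 1: Welcome back.\nSpeaker 2: Glad to be here."))
# [(0, 'Welcome back.'), (1, 'Glad to be here.')]
```

Each turn is synthesized to its own wav here; stitching them into one file (and adding any overlap between hosts) would be a separate post-processing step.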
The sample output is very poor. Cool demo, but really just emphasizes how much of a hit product the NotebookLM team has managed to come up with, ostensibly with more or less the same foundation models already available.
I wonder how soon they'll release this in other languages and with different accents, especially SE-Asian accents.
I'm not so sure this is an open-source NotebookLM so much as a few experiments in an iPython notebook. What NotebookLM does at an LLM level is not particularly novel; it's the packaging as a product, in a different way than what others are doing, that I think is interesting. Also, the "podcast" bit is really just an intro/overview of a large corpus; far more useful is being able to discuss that corpus with the bot and get cited references.
What this does however demonstrate is that prototyping with LLMs is very fast. I'd encourage anyone who hasn't had a play around with APIs to give it a go.
> What NotebookLM does at an LLM level is not particularly novel, it's the packaging as a product...
Disagreed. NLM is novel in how the two hosts interrupt and overlap each other. No other OSS solution does that, they just take turns talking.
Fair point, although to me the "audio overviews" are a minor feature of the product.
It only creates the podcasts right?
I am more interested in the other features of NotebookLM. The podcasts are fun but gimmicky.
If we could have this running locally on a mobile phone, that would be pretty cool. Imagine receiving a work document (for example, a product requirements document) and having this turn it into a podcast to play while I'm driving. I think my productivity would be through the roof, and I wouldn't need to worry about compliance issues.
I wish ChatGPT or Claude would make an Android Auto app that I could use while driving.
Man.. the sample is pretty rough
I’d love to hear the output if anyone has used this.
There’s an example output linked on the github page
Page title: NotebookLlama: An Open Source version of NotebookLM
Fixed. Thanks! (Submitted title was "Meta's Open Source NotebookLM")
"Please use the original title, unless it is misleading or linkbait; don't editorialize." - https://news.ycombinator.com/newsguidelines.html