The headline feature isn’t the 25 MB footprint alone. It’s that KittenTTS is Apache-2.0. That combo means you can embed a fully offline voice in Pi Zero-class hardware or even battery-powered toys without worrying about GPUs, cloud calls, or restrictive licenses. In one stroke it turns voice everywhere from a hardware/licensing problem into a packaging problem. Quality tweaks can come later; unlocking that deployment tier is the real game-changer.
Festival's English model, festvox-kallpc16k, is about 6 MB, and that is the larger one; festvox-kallpc8k is about 3.5 MB.
eSpeak NG's data files take about 12 MB (multi-lingual).
I guess this one may generate more natural-sounding speech, but older or lower-end computers were capable of decent speech synthesis previously as well.
Custom voices could be added, but the speed was more important to some users.
$ ls -lh /usr/bin/flite
Listed as 27K last I checked.
I recall some Blind users were able to decode Gordon 8-bit dialogue at speeds most people found incomprehensible. =3
> It’s that KittenTTS is Apache-2.0
Have you seen the code[1] in the repo? It uses phonemizer[2] which is GPL-3.0 licensed. In its current state, it's effectively GPL licensed.
[1]: https://github.com/KittenML/KittenTTS/blob/main/kittentts/on...
[2]: https://github.com/bootphon/phonemizer
I'm playing around with an NVIDIA Jetson Orin Nano Super right now and it's actually pretty usable with gemma3:4b, and quite fast - even image processing is done in 10-20 seconds, but that is with GPU support. When something is not working and Ollama is not using the GPU, these calls take ages because the CPU is just bad.
I am curious how fast this is with CPU only.
> KittenTTS is Apache-2.0
What about the training data? Is everyone 100% confident that models are not a derived work of the training inputs now, even if they can reproduce input exactly?
yeah, we are super excited to build tiny ai models that are super high quality. local voice interfaces are inevitable and we want to power those in the future. btw, this model is just a preview, and the full release next week will be of much higher quality, along w another ~80M model ;)
It depends on espeak-ng which is GPLv3
On another machine the Python version is too new, and the package/dependencies don't want to install.
There are still people who use machine-wide Python installs instead of environments? Python dependency hell was already bad years ago, but today it's completely impractical to do it that way. Even on raspberries.
Debian pretty much "solved" this by making pip refuse to install packages if you are not in a venv.
Python man
We are working to fix that. Thanks
"Fixing python packaging" is somewhat harder than AGI.
Have you considered offering a uvx command to run to get people going quickly?
Though I think you would still need to have the Python build dependencies installed for that to work.
ok
Microsoft's and some of Google's TTS models make the simplest mistakes. For instance, they sometimes read "i.e." as "for example." This is a problem if you have low vision and use TTS for, say, proofreading your emails.
Why does it happen? I'm genuinely curious.
Well, speech synthesizers are pretty much famous for speaking all sorts of things wrong. But what I find very concerning about LLM-based TTS is that some of them can't really speak numbers greater than 100. They try, but fail a lot. At least tts-1-hd was doing this for almost every 3- or 4-digit number. Especially noticeable when it is supposed to read a year.
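One workaround is to normalize the text before it ever reaches the model; a small sketch using the num2words package (the abbreviation table and the year heuristic are only examples, not anything these engines actually do):

# Sketch: expand abbreviations and spell out numbers before handing text to a
# TTS engine, so the model never has to guess. Uses the real `num2words`
# package; the abbreviation map here is illustrative only.
import re
from num2words import num2words

ABBREVIATIONS = {"i.e.": "that is", "e.g.": "for example", "etc.": "et cetera"}

def normalize(text: str) -> str:
    for abbr, spoken in ABBREVIATIONS.items():
        text = text.replace(abbr, spoken)
    # Read 4-digit numbers in a plausible year range as years ("nineteen eighty-four").
    def spell_number(m: re.Match) -> str:
        n = int(m.group())
        style = "year" if 1000 <= n <= 2099 else "cardinal"
        return num2words(n, to=style)
    return re.sub(r"\d+", spell_number, text)

print(normalize("The paper, i.e. the 1984 draft, cites 250 sources."))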
Reddit post with generated audio sample: https://www.reddit.com/r/LocalLLaMA/comments/1mhyzp7/kitten_...
was it cross trained on futurama voices?
The reddit video is awesome. I don't understand how people are calling it an OK model. Under 25MB and cpu only for this quality is amazing.
Sounds slow, and like something from an anime.
Speech speed is always a tunable parameter and not something intrinsic to the model.
The comparison to make is expressiveness and correct intonation for long sentences vs something like espeak. It actually sounds amazing for the size. The closest thing is probably KokoroTTS at 82M params and ~300MB.
I tried it. Not bad for the size (of the model) and speed. Once you install all the massive number of libraries and things needed we are a far cry away from 25MB though. Cool project nonetheless.
It mentions ONNX, so I imagine an ONNX model is or will be available.
ONNX runtime is a single library, with C#'s package being ~115MB compressed.
Not tiny, but usually only a few lines to actually run and only a single dependency.
We will try to get rid of dependencies.
Usually pulling in lots of libraries helps develop/iterate faster. Then can be removed later once the whole thing starts to take shape.
Web version: https://clowerweb.github.io/kitten-tts-web-demo/
It sounds ok, but impressive for the size.
Does anybody find it funny that sci-fi movies have to heavily distort "robot voices" to make them sound "convincingly robotic"? A robotic, explicitly non-natural voice would be perfectly acceptable, and even desirable, in many situations. I don't expect a smart toaster to talk like a BBC host; it'd be enough if the speech is easy to recognize.
A robotic, explicitly non-natural voice would be perfectly acceptable, and even desirable, in many situations[...]it'd be enough if the speech is easy to recognize.
We've had formant synths for several decades, and they're perfectly understandable and require a tiny amount of computing power, but people tend not to want to listen to them:
https://en.wikipedia.org/wiki/Software_Automatic_Mouth
https://simulationcorner.net/index.php?page=sam (try it yourself to hear what it sounds like)
SAM and the way it works is not what people typically associate with the term "formant synthesizer."
DECtalk[1,2] would be a much better example, that's as formant as you get.
[1] https://en.wikipedia.org/wiki/DECtalk [2] https://webspeak.terminal.ink
Well, this one is a bit too jarring to the ears.
But there is no latency, as opposed to KittenTTS, so it certainly has its applications too.
I think it's charming
Try this demo, which has more knobs:
https://discordier.github.io/sam/
Yeah blind people love eloquence
I remember that the novelization of The Fifth Element describes the cops being taught to speak as robotically as possible when using speakers, for some reason. Always found the idea weird that someone would _want_ that.
This one is at least an interesting idea: https://genderlessvoice.com/
Meet Q, a Genderless Voice - https://news.ycombinator.com/item?id=19505835 - March 2019 (235 comments)
The voice sounds great! I find it quite aesthetically pleasing, but it's far from genderless.
Interesting concept, but why is that site filled with Top X blogspam?
It doesn't sound genderless.
Depends on the movie. Ash and Bishop in the Alien franchise sound human until there's a dramatic reason to sound more 'robotic'.
I agree with your wider point. I use Google TTS with Moon+Reader all the time (I tried audio books read by real humans but I prefer the consistency of TTS)
Slightly different there because it's important in both cases that Ripley (and we) can't tell they're androids until it's explicitly uncovered. The whole point is that they're not presented as artificial. Same in Blade Runner: "more human than human". You don't have a film without the ambiguity there.
> I don't expect a smart toaster to talk like a BBC host;
Well sure, the BBC have already established that it's supposed to sound like a brit doing an impersonation of an American: https://www.youtube.com/watch?v=LRq_SAuQDec
I got an error when I tried the demo with 6 sentences, but it worked great when I reduced the text to 3 sentences. Is the length limit due to the model or just a limitation for the demo?
Currently we don't have chunking enabled yet. We will add it soon. That will remove the length limitations.
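Until then, chunking can be done on the caller's side; a rough sketch, assuming the generate()-style API and 24 kHz output shown in the project README plus the soundfile package (exact class and voice names may differ):

# Split on sentence boundaries, synthesize each piece, concatenate the audio.
import re
import numpy as np
import soundfile as sf
from kittentts import KittenTTS  # API per the README; treat names as assumptions

model = KittenTTS("KittenML/kitten-tts-nano-0.1")
text = "This first Book proposes, first in brief, the whole Subject..."

chunks = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
audio = np.concatenate([model.generate(c, voice="expr-voice-2-f") for c in chunks])
sf.write("output.wav", audio, 24000)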
Perhaps a length limit? I tried this:
"This first Book proposes, first in brief, the whole Subject, Mans disobedience, and the loss thereupon of Paradise wherein he was plac't: Then touches the prime cause of his fall, the Serpent, or rather Satan in the Serpent; who revolting from God, and drawing to his side many Legions of Angels, was by the command of God driven out of Heaven with all his Crew into the great Deep."
It takes a while until it starts generating sound on my i7 cores but it kind of works.
This also works:
"blah. bleh. blih. bloh. blyh. bluh."
So I don't think it's a limit on punctuation. Voice quality is quite bad though, not as far from the old school C64 SAM (https://discordier.github.io/sam/) of the eighties as I expected.
I tried to replicate their demo text but it doesn't sound as good for some reason.
If anyone else wants to try:
> Kitten TTS is an open-source series of tiny and expressive text-to-speech models for on-device applications. Our smallest model is less than 25 megabytes.
Is the demo using a model other than the smallest one?
Doesn't work here. Backend module returns a 404:
https://clowerweb.github.io/node_modules/onnxruntime-web/dis...
Looks like this commit 15 minutes ago broke it https://github.com/clowerweb/kitten-tts-web-demo/commit/6b5c...
(seems reverted now)
On PC it's a python dependency hell but someone managed to package it in self contained JS code that works offline once it loaded the model? How is that done?
ONNXRuntime makes it fairly easy, you just need to provide a path to the ONNX file, give it inputs in the correct format, and use the outputs. The ONNXRuntime library handles the rest. You can see this in the main.js file: https://github.com/clowerweb/kitten-tts-web-demo/blob/main/m...
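The Python onnxruntime package follows the same three steps; the input/output names below are placeholders, since they depend on how the model was exported:

# 1. point at the ONNX file, 2. feed inputs in the right format, 3. use the outputs.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("kitten_tts_nano.onnx")           # path to the ONNX file
print([i.name for i in sess.get_inputs()])                    # inspect the expected input names
feeds = {"input_ids": np.array([[1, 2, 3]], dtype=np.int64)}  # placeholder name and data
outputs = sess.run(None, feeds)                               # the runtime handles the rest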
Plus, Python software are dependency hell in general, while webpages have to be self-contained by their nature (thank god we no longer have Silverlight and Java applets...)
Thanks, I was looking for that. While the reddit demo sounds OK, though only at a level we reached a couple of years ago, all the TTS samples I tried were barely understandable at all.
This is just an early checkpoint. We hope that the quality will improve in the future.
> Error generating speech: failed to call OrtRun(). ERROR_CODE: 2, ERROR_MESSAGE: Non-zero status code returned while running Expand node. Name:'/bert/Expand' Status Message: invalid expand shape
Doesn't seem to work with Thai.
You can also try on https://clowerweb.github.io/node_modules/onnxruntime-web/dis...
yeah, this is just a preview model from an early checkpoint. the full model release will be next week which includes a 15M model and an 80M model, both of which will have much higher quality than this preview.
This doesn’t seem to work on Safari. Works great on Chrome, though
Hmm, we will look into it.
You should post on the NVDA email list: https://nvda.groups.io/g/nvda Or the screen reader list: https://winaccess.groups.io/g/winaccess FYI, blind people do not like any lag when reading; that is why so many still use Eloquence and eSpeak.
Try https://github.com/Picovoice/orca It's about 7MB all included
Not open source. "You will need internet connectivity to validate your AccessKey with Picovoice license servers ... If you wish to increase your limits, you can purchase a subscription plan." https://github.com/Picovoice/orca#accesskey
The guy is just spamming the project in a lot of comments.
Going online is a dealbreaker, but if you really need it you could use Ghidra to fix that. I had tried to find a conversion of their model to ONNX (making their proprietary pipeline useless) but failed.
Hopefully open source will render them irrelevant in the future.
Does an APK exist for Android for replacing its text-to-speech engine? I tried sherpa-onnx but it seemed too slow for real-time usage, and especially so for audiobooks when sped up.
https://github.com/Picovoice/orca/tree/main/demo%2Fandroid
I can't test this out right now, is this just a demo or is it actually an apk for replacing the engine? Because those are two different things, the latter can be used any time you want to read something aloud on the page for example. This is the sherpa-onnx one I'm talking about.
https://k2-fsa.github.io/sherpa/onnx/tts/apk-engine.html
I hope this is the future. Offline, small ML models, running inference on ubiquitous, inexpensive hardware. Models that are easy to integrate into other things, into devices and apps, and even to drive from other models maybe.
Dedicated single-purpose hardware with models would be even less energy-intensive. It's theoretically possible to design chips which run neural networks and the like using just resistors (rather than transistors).
Such hardware is not general-purpose, and upgrading the model would not be possible, but there's plenty of use-cases where this is reasonable.
That is our vision too!
yeah totally. the quality of these tiny models is only going to go up.
I don't mind the size in MB so much, nor the fact that it's pure CPU, nor the quality; what I do mind, however, is the latency. I hope it's fast.
Aside: Are there any models for understanding voice to text, fully offline, without training?
I will be very impressed when we are able to have a conversation with an AI at a natural rate and not "probe, space, response".
NVIDIA's Parakeet https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 appears to be state of the art for English: 10x faster than Whisper.
My mid-range AMD CPU is multiple times faster than realtime with parakeet.
Voice to text fully offline can be done with whisper. A few apps offer it for dictation or transcription.
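For reference, the openai-whisper package needs only a couple of lines; the audio filename here is just a placeholder:

import whisper

model = whisper.load_model("tiny")        # downloads once, then runs fully offline
result = model.transcribe("meeting.wav")  # returns a dict with the full text and segments
print(result["text"])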
"The brown fox jumps over the lazy dog.."
Average duration per generation: 1.28 seconds
Characters processed per second: 30.35
--
"Um"
Average duration per generation: 0.22 seconds
Characters processed per second: 9.23
--
"The brown fox jumps over the lazy dog.. The brown fox jumps over the lazy dog.."
Average duration per generation: 2.25 seconds
Characters processed per second: 35.04
--
processor : 0
vendor_id : AuthenticAMD
cpu family : 25
model : 80
model name : AMD Ryzen 7 5800H with Radeon Graphics
stepping : 0
microcode : 0xa50000c
cpu MHz : 1397.397
cache size : 512 KB
assuming most answers will be more than a sentence, 2.25 seconds is already too long once you factor in the token generation in between... and imagine with reasoning!... We're not there yet.
Hmm, that actually seems extremely slow; Piper can crank out a sentence almost instantly on a Pi 4, which is like a sloth compared to that Ryzen, and the speech quality seems about the same at first glance.
I suppose it would make sense if you want to include it on top of an LLM that's already occupying most of a GPU and this could run in the limited VRAM that's left.
>Aside: Are there any models for understanding voice to text, fully offline, without training?
OpenAI's whisper is a few years old and pretty solid.
https://github.com/openai/whisper
Any idea what factors play into latency in TTS models?
Mostly model size and input size. Some models which use attention are O(N^2) in the input length.
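If you want to see which regime a given engine is in, a rough sketch; the synthesize() placeholder stands for whatever engine you're measuring (e.g. a KittenTTS generate() call or an ONNX session run):

import time

def synthesize(text: str):
    ...  # plug in the TTS call you want to benchmark

sentence = "The brown fox jumps over the lazy dog. "
for n in (1, 2, 4, 8):
    start = time.perf_counter()
    synthesize(sentence * n)
    elapsed = time.perf_counter() - start
    print(f"{n} sentences: {elapsed:.2f}s")  # roughly linear vs. super-linear growth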
Wow, amazing and good work, I hope to see more amazing models running on CPUs!
thanks, we're going to release many more models in the future, that can run on just CPUs.
I'm curious why smallish TTS models have metallic voice quality.
The pronunciation sounds about right - I thought that was the hard part, and the model does it well. But voice timbre should be simpler to fix? Like, a simple FIR filter might improve it?
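For what it's worth, a minimal sketch of that kind of FIR post-filter with SciPy; whether it actually helps the metallic character is an open question, and the cutoff is arbitrary:

# Linear-phase FIR low-pass applied to a generated (mono) waveform to soften
# harsh high frequencies; filename and cutoff are placeholders.
import soundfile as sf
from scipy.signal import firwin, lfilter

audio, sr = sf.read("output.wav")                 # assumes mono TTS output
taps = firwin(numtaps=101, cutoff=6000, fs=sr)    # 101-tap FIR, 6 kHz low-pass
filtered = lfilter(taps, 1.0, audio)
sf.write("output_filtered.wav", filtered, sr)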
amazing! can't wait to integrate it into https://desktop.with.audio I'm already using KokorosTTS without a GPU. It works fairly well on Apple Silicon.
Foundational tools like this open up the possibility of one-time-payment or even free tools.
would love to see how that turns out. the full model release next week will be more expressive and higher quality than this one so we're excited to see you try that out.
Hmm the quality is not so impressive. I'm looking for a really natural-sounding model. Not very happy with Piper/Kokoro; XTTS was a bit complex to set up.
For STT, Whisper is really amazing. But I miss a good TTS. And I don't mind throwing GPU power at it. But anyway, this isn't it either; this sounds worse than Kokoro.
The best open one I've found so far is Dia - https://github.com/nari-labs/dia - it has some limitations, but i think it's really impressive and I can run it on my laptop.
Chatterbox is also worth a try.
You should give https://pinokio.co/ a try
> Hmm the quality is not so impressive. [...] And I don't mind throwing GPU power at it.
This isn't for you, then. You should evaluate quality here based on the fact you don't need a GPU.
Back in the pre-Tacotron2 days, I was running slim TTS and vocoder models like GlowTTS and MelGAN on Digital Ocean droplets. No GPU to speak of. It cost next to nothing to run.
Since then, the trend has been to scale up. We need more models to scale down.
In the future we'll see small models living on-device. Embedded within toys and tools that don't need or want a network connection. Deployed with Raspberry Pi.
Edge AI will be huge for robotics, toys and consumer products, and gaming (ie. world models).
Try https://github.com/Picovoice/orca
Okay, lots of detailed information and example code, great. But skimming through, I didn't see any audio samples to judge the quality?
They posted a demo on reddit[0]. It sounds amazing given the tiny size.
[0] https://old.reddit.com/r/LocalLLaMA/comments/1mhyzp7/kitten_...
Thanks! Yeah. It definitely isn’t the absolute best in quality but it trounces the default TTS options on macOS (as third party developers are locked out of the Siri voices). And for less than the size of many modern web pages…
It is not the best TTS but it is freaking amazing it can be done by such a small model and it is good enough for so many use cases.
thanks, but keep in mind that this model is just a preview checkpoint that is only 10% trained. the full release next week will be of much higher quality and it will include a 15M model and an 80M model.
I wonder what it would take to extend it with a custom voice?
Most of these comments were originally posted to a different thread (https://news.ycombinator.com/item?id=44806543). I've moved them hither because on HN we always prefer to give the project creators credit for their work.
(it does however explain how many of these comments are older than the thread they are now children of)
Someone please port this to ONNX so we don't need to do all this ass tooling
Cool.
While I think this is indeed impressive and has a specific use case (e.g. in the embedded sector), I'm not totally convinced that the quality is good enough to replace bigger models.
With fish-speech[1] and f5-tts[2] there are at least 2 open source models pushing the quality limits of offline text-to-speech. I tested F5-TTS with an old NVIDIA 1660 (6 GB VRAM) and it worked OK-ish, so running it on slightly more modern hardware will not cost you a fortune and will produce MUCH higher quality, with multi-language and zero-shot support.
For Android there is SherpaTTS[3], which plays pretty well with most TTS Applications.
1: https://github.com/fishaudio/fish-speech
2: https://github.com/SWivid/F5-TTS
3: https://github.com/woheller69/ttsengine
We have released just a preview of the model. We hope to make the model much better in future releases.
Hi. Will the training and fine-tuning code also be released?
It would be great if the training data were released too!
What's a good one in reverse; speech to text?
Whisper and the many variants. Here's a good implementation.
https://github.com/ggml-org/whisper.cpp
This one is a whisper-based Python package
https://github.com/primaprashant/hns
Is the name a joke on "If the Emperor had a Text-to-Speech Device"? It's funny.
Chrome does TTS too.
https://codepen.io/logicalmadboy/pen/RwpqMRV
Where does the training data come for the models? Is there an openly available dataset the people use?
Can Coqui run on CPU only?
Yes, XTTS2 has been reasonably performant for me and the cloning is acceptable.
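For anyone who found XTTS complex to set up, a minimal CPU-only sketch with Coqui's TTS package; the model id is the usual XTTS v2 one and the reference clip path is just a placeholder:

# pip install TTS; a few seconds of reference audio is enough for the cloning.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=False)
tts.tts_to_file(
    text="Kitten TTS is a tiny on-device text-to-speech model.",
    speaker_wav="my_voice.wav",   # reference clip to clone (placeholder path)
    language="en",
    file_path="xtts_out.wav",
)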
Very good model, thanks for the open source
thanks a lot, this model is just a preview checkpoint. the full release next week will be of much higher quality.
The sample rate does more than change the quality.
Is this english only?
If you're looking for other languages, Piper has been around in this scene for much longer and they have open-source training code and a lot of models (they're ~60MB instead of 25MB but whatever...) https://huggingface.co/rhasspy/piper-voices/tree/main
Or use https://github.com/Picovoice/orca which is about 7MB and supports 8 languages
you need an API key and internet access to run locally? lol. Classic .NET project.
Actually I found it irritating that the readme does not mention the language at all. I think it is not good practice to deduce it from the language of the readme itself. I would not like to have German language tts models with only a German readme...
I tried it on some Japanese just for kicks; it reads... "Chinese letter chinese letter japanese letter chinese letter..." :D
But yeah, if it's like any of the others we'll likely see a different "model" per language down the line based on the same techniques
Yes. The FAQ says that multilingual capabilities are in the works.
say is only 193K on macOS
That's not a fair comparison. Say just calls the speech synthesis APIs that have been around since at least Mac OS 8.
That being said, the ‘classical’ (pre-AI) speech synthesisers are much smaller than kitten, so you’re not wrong per se, just for the wrong reason.
The linked repository at the top-level here has several gigabytes of dependencies, too.
SAM on Commodore 64 was only 6K:
https://project64.c64.org/Software/SAM10.TXT
Obviously it's not fair to compare these with ML models.
And what dynamic libraries is it linked to? And what other data are they pulling in?
Tried that on the macOS 26 beta, and the default voice sounds a lot smoother than it used to.
Running `man say` reveals that "this tool uses the Speech Synthesis manager", so I'm guessing the Apple Intelligence stuff is kicking in.
Nothing to do with Apple Intelligence. The speech synthesiser manager (the term manager was used for OS components in Classic Mac OS) has been around since the mid 90s or so. The change you’re hearing is probably a new/modified default voice.
`say` sounds terrible compared to modern neural network based text to speech engines.
Sounds about the same as Kitten TTS.
To me it sounds worse, especially on the construction of certain more complex sentences or words.
Good TTS feels like it is something that should be natively built into every consumer device. So the user can decide if they want to read or listen to the text at hand.
I'm surprised that phone manufacturers do not include good TTS models in their browser APIs for example. So that websites can build good audio interfaces.
I for one would love to build a text editor that the user can use completely via audio. Text input might already be feasible via the "speak to type" feature both Android and iOS offer.
But there seems to be no good way to output spoken text without doing round-trips to a server and generate the audio there.
The interface I would like would offer a way to talk to write and then commands like "Ok editor, read the last paragraph" or "Ok editor, delete the last sentence".
It could be cool to do writing this way while walking. Just with a headset connected to a phone that sits in one's pocket.
Can't most people read faster than they can hear? Isn't this why phone menus are so awful?
> But there seems to be no good way to output spoken text without doing round-trips to a server and generate the audio there
As people have been pointing out, we've had mediocre TTS since the 80s. If it was a real benefit people would be using even the inadequate version.
On Mac OS you can "speak" a text in almost every app, using built in voice (like the Siri voice or some older voices). All offline, and even from the terminal with "say".
I am blind and use NVDA with a synth. How is this news? I don't get it! My synth is called Eloquence and is 4089 KB.
Does your Eloquence installation include multiple languages? The one I have is only 1876 KB for US English only. And classic DECtalk is even smaller; I have here a version that's only 638 KB (again, US English only).
it would be great if there were TypeScript support in the future
Yup, it runs in the web browser. https://clowerweb.github.io/kitten-tts-web-demo/
https://huggingface.co/KittenML/kitten-tts-nano-0.1
https://github.com/KittenML/KittenTTS
This is the model and GitHub page; the blog post looks very much AI-generated.
Can you run it in reverse for speech recognition?
no, but whisper has a 39M model: https://github.com/openai/whisper
We will release an STT model as well.
Kudos guys!
Thanks
♥
"please join our DISCORD!"...