148 comments

  • MutedEstate45 5 hours ago

    The headline feature isn’t the 25 MB footprint alone. It’s that KittenTTS is Apache-2.0. That combo means you can embed a fully offline voice in Pi Zero-class hardware or even battery-powered toys without worrying about GPUs, cloud calls, or restrictive licenses. In one stroke it turns voice everywhere from a hardware/licensing problem into a packaging problem. Quality tweaks can come later; unlocking that deployment tier is the real game-changer.

    • defanor 3 hours ago

      Festival's English model, festvox-kallpc16k, is about 6 MB, and that is a large model; festvox-kallpc8k is about 3.5 MB.

      eSpeak NG's data files take about 12 MB (multi-lingual).

      I guess this one may generate more natural-sounding speech, but older or lower-end computers were capable of decent speech synthesis previously as well.

      • Joel_Mckay 2 hours ago

        Custom voices could be added, but the speed was more important to some users.

        $ ls -lh /usr/bin/flite

        Listed as 27K last I checked.

        I recall some Blind users were able to decode Gordon 8-bit dialogue at speeds most people found incomprehensible. =3

    • woadwarrior01 21 minutes ago

      > It’s that KittenTTS is Apache-2.0

      Have you seen the code[1] in the repo? It uses phonemizer[2], which is GPL-3.0 licensed. In its current state, it's effectively GPL-licensed.

      [1]: https://github.com/KittenML/KittenTTS/blob/main/kittentts/on...

      [2]: https://github.com/bootphon/phonemizer

    • entropie 35 minutes ago

      I'm playing around with an NVIDIA Jetson Orin Nano Super right now, and it's actually pretty usable with gemma3:4b and quite fast - even image processing is done in like 10-20 seconds, but this is with GPU support. When something is not working and ollama is not using the GPU, these calls take ages because the CPU is just bad.

      I'm curious how fast this is with CPU only.

    • pjc50 42 minutes ago

      > KittenTTS is Apache-2.0

      What about the training data? Is everyone 100% confident that models are not a derived work of the training inputs now, even if they can reproduce input exactly?

    • rohan_joshi 2 hours ago

      yeah, we are super excited to build tiny ai models that are super high quality. local voice interfaces are inevitable and we want to power them in the future. btw, this model is just a preview, and the full release next week will be of much higher quality, along with another ~80M model ;)

    • phh 44 minutes ago

      It depends on espeak-ng, which is GPLv3.

  • antisol 2 hours ago

      System Requirements
      Works literally everywhere
    
    Haha, on one of my machines my python version is too old, and the package/dependencies don't want to install.

    On another machine the python version is too new, and the package/dependencies don't want to install.

    • sigmoid10 an hour ago

      There are still people who use machine-wide Python installs instead of environments? Python dependency hell was already bad years ago, but today it's completely impractical to do it this way. Even on Raspberry Pis.

      • lynx97 19 minutes ago

        Debian pretty much "solved" this by making pip refuse to install packages if you are not in a venv.

    • hahn-kev an hour ago

      Python man

    • divamgupta 2 hours ago

      We are working to fix that. Thanks

      • pjc50 an hour ago

        "Fixing python packaging" is somewhat harder than AGI.

      • raybb an hour ago

        Have you considered offering a uvx command to run to get people going quickly?

        • zelphirkalt 9 minutes ago

          Though I think you would still need to have the Python build dependencies installed for that to work.

    • Tatiana343 2 hours ago

      ok

  • dr_kiszonka 27 minutes ago

    Microsoft's and some of Google's TTS models make the simplest mistakes. For instance, they sometimes read "i.e." as "for example." This is a problem if you have low vision and use TTS for, say, proofreading your emails.

    Why does it happen? I'm genuinely curious.

    • lynx97 21 minutes ago

      Well, speech synthesizers are pretty much famous for mispronouncing all sorts of things. But what I find very concerning about LLM-based TTS is that some of them can't really speak numbers greater than 100. They try, but fail a lot. At least tts-1-hd was doing this for almost every 3- or 4-digit number. It's especially noticeable when it is supposed to read a year.
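
      A common workaround is to normalize numbers into words before synthesis, rather than trusting the model. A minimal sketch with the num2words library (my suggestion, not something these models do internally):

        import re
        from num2words import num2words  # pip install num2words

        def normalize_numbers(text: str) -> str:
            # Spell out digit runs so the model never sees raw numbers;
            # num2words(n, to="year") gives "nineteen eighty-four" style instead.
            return re.sub(r"\d+", lambda m: num2words(int(m.group())), text)

        print(normalize_numbers("It was published in 1984."))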

  • mlboss 7 hours ago

    • KaiserPro a few seconds ago

      Was it cross-trained on Futurama voices?

    • smusamashah an hour ago

      The Reddit video is awesome. I don't understand how people are calling it an OK model. Under 25MB and CPU-only for this quality is amazing.

    • tapper an hour ago

      Sounds slow, and like something from an anime.

      • ricardobeat an hour ago

        Speech speed is always a tunable parameter and not something intrinsic to the model.

        The comparison to make is expressiveness and correct intonation for long sentences vs something like espeak. It actually sounds amazing for the size. The closest thing is probably KokoroTTS at 82M params and ~300MB.

  • klipklop 2 hours ago

    I tried it. Not bad for the size (of the model) and speed. Once you install the massive number of libraries and things needed, we are a far cry from 25MB though. Cool project nonetheless.

    • Dayshine 2 hours ago

      It mentions ONNX, so I imagine an ONNX model is or will be available.

      ONNX Runtime is a single library, with C#'s package being ~115MB compressed.

      Not tiny, but usually only a few lines to actually run and only a single dependency.
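
      For illustration, driving an ONNX model from Python is only a few lines; the file and input names below are hypothetical, not KittenTTS's actual interface:

        import numpy as np
        import onnxruntime as ort

        # One library, no ML framework: load the exported graph and run on CPU.
        sess = ort.InferenceSession("kitten_tts.onnx",
                                    providers=["CPUExecutionProvider"])

        # Placeholder token IDs; check sess.get_inputs() for the real names/shapes.
        tokens = np.array([[12, 47, 5, 33]], dtype=np.int64)
        outputs = sess.run(None, {"input_ids": tokens})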

      • divamgupta 2 hours ago

        We will try to get rid of dependencies.

    • WhyNotHugo an hour ago

      Usually pulling in lots of libraries helps you develop/iterate faster. They can be removed later, once the whole thing starts to take shape.

  • blopker 7 hours ago

    Web version: https://clowerweb.github.io/kitten-tts-web-demo/

    It sounds ok, but impressive for the size.

    • nine_k 7 hours ago

      Does anybody find it funny that sci-fi movies have to heavily distort "robot voices" to make them sound "convincingly robotic"? A robotic, explicitly non-natural voice would be perfectly acceptable, and even desirable, in many situations. I don't expect a smart toaster to talk like a BBC host; it'd be enough if the speech is easy to recognize.

      • userbinator 4 hours ago

        > A robotic, explicitly non-natural voice would be perfectly acceptable, and even desirable, in many situations [...] it'd be enough if the speech is easy to recognize.

        We've had formant synths for several decades, and they're perfectly understandable and require a tiny amount of computing power, but people tend not to want to listen to them:

        https://en.wikipedia.org/wiki/Software_Automatic_Mouth

        https://simulationcorner.net/index.php?page=sam (try it yourself to hear what it sounds like)

      • looperhacks 40 minutes ago

        I remember that the novelization of The Fifth Element describes the cops being taught to speak as robotically as possible when using speakers, for some reason. I always found it weird that someone would _want_ that.

      • roywiggins 7 hours ago

        This one is at least an interesting idea: https://genderlessvoice.com/

        • dang 3 hours ago

          Meet Q, a Genderless Voice - https://news.ycombinator.com/item?id=19505835 - March 2019 (235 comments)

        • cosmojg 6 hours ago

          The voice sounds great! I find it quite aesthetically pleasing, but it's far from genderless.

        • degamad 5 hours ago

          Interesting concept, but why is that site filled with Top X blogspam?

        • cyberax 2 hours ago

          It doesn't sound genderless.

      • incone123 2 hours ago

        Depends on the movie. Ash and Bishop in the Alien franchise sound human until there's a dramatic reason to sound more 'robotic'.

        I agree with your wider point. I use Google TTS with Moon+Reader all the time (I tried audio books read by real humans but I prefer the consistency of TTS)

        • regularfry 22 minutes ago

          Slightly different there because it's important in both cases that Ripley (and we) can't tell they're androids until it's explicitly uncovered. The whole point is that they're not presented as artificial. Same in Blade Runner: "more human than human". You don't have a film without the ambiguity there.

      • Twirrim 4 hours ago

        > I don't expect a smart toaster to talk like a BBC host;

        Well sure, the BBC have already established that it's supposed to sound like a brit doing an impersonation of an American: https://www.youtube.com/watch?v=LRq_SAuQDec

    • bkyan 3 hours ago

      I got an error when I tried the demo with 6 sentences, but it worked great when I reduced the text to 3 sentences. Is the length limit due to the model or just a limitation for the demo?

      • divamgupta 2 hours ago

        We don't have chunking enabled yet. We will add it soon; that will remove the length limitation.
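
        In the meantime, chunking can be approximated client-side; a rough sketch, assuming the KittenTTS Python API shown in the repo README:

          import re
          import numpy as np
          from kittentts import KittenTTS  # API as shown in the repo README

          m = KittenTTS("KittenML/kitten-tts-nano-0.1")

          def speak_long(text, voice="expr-voice-2-f"):
              # Naive sentence splitter; real chunking should handle
              # abbreviations, quotes, etc.
              sentences = re.split(r"(?<=[.!?])\s+", text.strip())
              return np.concatenate([m.generate(s, voice=voice)
                                     for s in sentences if s])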

      • cess11 3 hours ago

        Perhaps a length limit? I tried this:

        "This first Book proposes, first in brief, the whole Subject, Mans disobedience, and the loss thereupon of Paradise wherein he was plac't: Then touches the prime cause of his fall, the Serpent, or rather Satan in the Serpent; who revolting from God, and drawing to his side many Legions of Angels, was by the command of God driven out of Heaven with all his Crew into the great Deep."

        It takes a while until it starts generating sound on my i7 cores but it kind of works.

        This also works:

        "blah. bleh. blih. bloh. blyh. bluh."

        So I don't think it's a limit on punctuation. Voice quality is quite bad though, not as far from the old school C64 SAM (https://discordier.github.io/sam/) of the eighties as I expected.

    • Retr0id 6 hours ago

      I tried to replicate their demo text but it doesn't sound as good for some reason.

      If anyone else wants to try:

      > Kitten TTS is an open-source series of tiny and expressive text-to-speech models for on-device applications. Our smallest model is less than 25 megabytes.

      • cortesoft 5 hours ago

        Is the demo using something other than the smallest model?

    • quantummagic 7 hours ago

      Doesn't work here. Backend module returns 404:

      https://clowerweb.github.io/node_modules/onnxruntime-web/dis...

    • Aardwolf an hour ago

      On PC it's Python dependency hell, but someone managed to package it in self-contained JS code that works offline once it has loaded the model? How is that done?

      • a2128 an hour ago

        ONNXRuntime makes it fairly easy, you just need to provide a path to the ONNX file, give it inputs in the correct format, and use the outputs. The ONNXRuntime library handles the rest. You can see this in the main.js file: https://github.com/clowerweb/kitten-tts-web-demo/blob/main/m...

        Plus, Python software is dependency hell in general, while webpages have to be self-contained by their nature (thank god we no longer have Silverlight and Java applets...)

    • nxnsxnbx 3 hours ago

      Thanks, I was looking for that. While the Reddit demo sounds OK, albeit at a level we reached a couple of years ago, all the TTS samples I tried were barely understandable.

      • divamgupta 2 hours ago

        This is just an early checkpoint. We hope that the quality will improve in the future.

    • itake 5 hours ago

      > Error generating speech: failed to call OrtRun(). ERROR_CODE: 2, ERROR_MESSAGE: Non-zero status code returned while running Expand node. Name:'/bert/Expand' Status Message: invalid expand shape

      Doesn't seem to work with Thai.

    • rohan_joshi 2 hours ago

      yeah, this is just a preview model from an early checkpoint. the full model release will be next week which includes a 15M model and an 80M model, both of which will have much higher quality than this preview.

    • belchiorb 2 hours ago

      This doesn’t seem to work on Safari. Works great on Chrome, though

    • kenarsa 7 hours ago

      Try https://github.com/Picovoice/orca. It's about 7MB all included.

      • gary_0 6 hours ago

        Not open source. "You will need internet connectivity to validate your AccessKey with Picovoice license servers ... If you wish to increase your limits, you can purchase a subscription plan." https://github.com/Picovoice/orca#accesskey

        • papichulo2023 2 hours ago

          The guy is just spamming the project in a lot of comments.

        • cakealert 3 hours ago

          Going online is a dealbreaker, but if you really need it you could use Ghidra to fix that. I tried to find a conversion of their model to ONNX (which would make their proprietary pipeline useless) but failed.

          Hopefully open source will render them irrelevant in the future.

      • satvikpendem 7 hours ago

        Does an APK exist for replacing Android's text-to-speech engine? I tried sherpa-onnx, but it seemed too slow for real-time usage, especially for audiobooks when sped up.

  • nine_k 8 hours ago

    I hope this is the future. Offline, small ML models, running inference on ubiquitous, inexpensive hardware. Models that are easy to integrate into other things, into devices and apps, and even to drive from other models maybe.

    • WhyNotHugo an hour ago

      Dedicated single-purpose hardware running the model would be even less energy-intensive. It's theoretically possible to design chips that run neural networks and the like using just resistors (rather than transistors).

      Such hardware is not general-purpose, and upgrading the model would not be possible, but there's plenty of use-cases where this is reasonable.

    • divamgupta 2 hours ago

      That is our vision too!

    • rohan_joshi 2 hours ago

      yeah totally. the quality of these tiny models is only going to go up.

  • keyle 6 hours ago

    I don't mind so much the size in MB, the fact that it's pure CPU, or the quality; what I do mind, however, is the latency. I hope it's fast.

    Aside: are there any models for voice-to-text, fully offline, without training?

    I will be very impressed when we can have a conversation with an AI at a natural rate, and not "probe, space, response".

    • Dayshine 2 hours ago

      Nvidia's Parakeet https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 appears to be state of the art for English: 10x faster than Whisper.

      My mid-range AMD CPU is multiple times faster than realtime with parakeet.

    • jiehong 3 hours ago

      Voice-to-text fully offline can be done with Whisper. A few apps offer it for dictation or transcription.

    • blensor 3 hours ago

      "The brown fox jumps over the lazy dog.."

      Average duration per generation: 1.28 seconds

      Characters processed per second: 30.35

      --

      "Um"

      Average duration per generation: 0.22 seconds

      Characters processed per second: 9.23

      --

      "The brown fox jumps over the lazy dog.. The brown fox jumps over the lazy dog.."

      Average duration per generation: 2.25 seconds

      Characters processed per second: 35.04

      --

      processor  : 0
      vendor_id  : AuthenticAMD
      cpu family : 25
      model      : 80
      model name : AMD Ryzen 7 5800H with Radeon Graphics
      stepping   : 0
      microcode  : 0xa50000c
      cpu MHz    : 1397.397
      cache size : 512 KB
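
      For anyone who wants to reproduce numbers like these, a minimal timing harness (assuming the KittenTTS Python API shown in the repo README):

        import time
        from kittentts import KittenTTS  # API as shown in the repo README

        m = KittenTTS("KittenML/kitten-tts-nano-0.1")
        text = "The brown fox jumps over the lazy dog.."

        runs = 10
        start = time.perf_counter()
        for _ in range(runs):
            m.generate(text)
        elapsed = time.perf_counter() - start

        print(f"Average duration per generation: {elapsed / runs:.2f} seconds")
        print(f"Characters processed per second: {runs * len(text) / elapsed:.2f}")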

      • keyle 3 hours ago

        Assuming most answers will be more than a sentence, 2.25 seconds is already too long once you factor in the token generation in between... and imagine with reasoning! We're not there yet.

      • moffkalast 3 hours ago

        Hmm, that actually seems extremely slow. Piper can crank out a sentence almost instantly on a Pi 4, which is like a sloth compared to that Ryzen, and the speech quality seems about the same at first glance.

        I suppose it would make sense if you want to include it on top of an LLM that's already occupying most of a GPU and this could run in the limited VRAM that's left.

    • colechristensen 3 hours ago

      >Aside: Are there any models for understanding voice to text, fully offline, without training?

      OpenAI's Whisper is a few years old and pretty solid.

      https://github.com/openai/whisper
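
      Minimal offline usage, per the project README (the weights download once, then everything runs locally):

        import whisper  # pip install openai-whisper

        model = whisper.load_model("base")          # tiny/base/small/medium/large
        result = model.transcribe("recording.wav")  # path to your audio file
        print(result["text"])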

    • Teever 4 hours ago

      Any idea what factors play into latency in TTS models?

      • divamgupta 2 hours ago

        Mostly model size and input size. Some models that use attention are O(N^2) in the input length.

  • toisanji 7 hours ago

    Wow, amazing and good work. I hope to see more amazing models running on CPUs!

    • rohan_joshi 2 hours ago

      thanks, we're going to release many more models in the future that can run on just CPUs.

  • killerstorm an hour ago

    I'm curious why smallish TTS models have a metallic voice quality.

    The pronunciation sounds about right - I thought that was the hard part, and the model does it well. But voice timbre should be simpler to fix? Like, a simple FIR might improve it?
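
    For what it's worth, a post-hoc FIR pass is cheap to try with SciPy, though my guess is the metallic quality comes from vocoder artifacts that a linear filter won't fix:

      import numpy as np
      from scipy.signal import firwin, lfilter

      def soften(audio: np.ndarray, sr: int = 24000) -> np.ndarray:
          # Gentle low-pass FIR to tame harsh highs; 101 taps and the
          # 8 kHz cutoff are arbitrary starting points, not tuned values.
          taps = firwin(numtaps=101, cutoff=8000.0, fs=sr)
          return lfilter(taps, 1.0, audio)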

  • vahid4m 5 hours ago

    Amazing! I can't wait to integrate it into https://desktop.with.audio - I'm already using KokoroTTS without a GPU, and it works fairly well on Apple Silicon.

    Foundational tools like this open up the possibility of one-time-payment or even free tools.

    • rohan_joshi 2 hours ago

      would love to see how that turns out. the full model release next week will be more expressive and higher quality than this one so we're excited to see you try that out.

  • wkat4242 7 hours ago

    Hmm, the quality is not so impressive. I'm looking for a really natural-sounding model. I'm not very happy with piper/kokoro, and XTTS was a bit complex to set up.

    For STT, Whisper is really amazing, but I still miss a good TTS. And I don't mind throwing GPU power at it. Anyway, this isn't it either; it sounds worse than Kokoro.

    • kamranjon 6 hours ago

      The best open one I've found so far is Dia - https://github.com/nari-labs/dia - it has some limitations, but I think it's really impressive, and I can run it on my laptop.

    • guskel 6 hours ago

      Chatterbox is also worth a try.

    • jainilprajapati 5 hours ago

      You should give https://pinokio.co/ a try.

    • echelon 7 hours ago

      > Hmm the quality is not so impressive. [...] And I don't mind throwing GPU power at it.

      This isn't for you, then. You should evaluate quality here based on the fact you don't need a GPU.

      Back in the pre-Tacotron2 days, I was running slim TTS and vocoder models like GlowTTS and MelGAN on Digital Ocean droplets. No GPU to speak of. It cost next to nothing to run.

      Since then, the trend has been to scale up. We need more models to scale down.

      In the future we'll see small models living on-device. Embedded within toys and tools that don't need or want a network connection. Deployed with Raspberry Pi.

      Edge AI will be huge for robotics, toys and consumer products, and gaming (ie. world models).

  • onair4you 7 hours ago

    Okay, lots of detailed information and example code, great. But skimming through, I didn't see any audio samples to judge the quality.

  • victorbjorklund 3 hours ago

    It's not the best TTS, but it's freaking amazing that this can be done by such a small model, and it's good enough for so many use cases.

    • rohan_joshi 2 hours ago

      thanks, but keep in mind that this model is just a preview checkpoint that is only 10% trained. the full release next week will be of much higher quality and it will include a 15M model and an 80M model.

  • BenGosub 27 minutes ago

    I wonder what it would take to extend it with a custom voice.

  • dang 4 hours ago

    Most of these comments were originally posted to a different thread (https://news.ycombinator.com/item?id=44806543). I've moved them hither because on HN we always prefer to give the project creators credit for their work.

    (It does, however, explain why many of these comments are older than the thread they are now children of.)

  • babycommando 3 hours ago

    Someone please port this to ONNX so we don't need to do all this ass tooling

  • sandreas 6 hours ago

    Cool.

    While I think this is indeed impressive and has specific use cases (e.g. in the embedded sector), I'm not totally convinced that the quality is good enough to replace bigger models.

    With fish-speech[1] and F5-TTS[2] there are at least two open-source models pushing the quality limits of offline text-to-speech. I tested F5-TTS on an old NVIDIA 1660 (6GB VRAM) and it worked OK-ish, so running it on slightly more modern hardware will not cost you a fortune and will produce MUCH higher quality output, with multi-language and zero-shot support.

    For Android there is SherpaTTS[3], which plays pretty well with most TTS Applications.

    1: https://github.com/fishaudio/fish-speech

    2: https://github.com/SWivid/F5-TTS

    3: https://github.com/woheller69/ttsengine

    • divamgupta 2 hours ago

      We have released just a preview of the model. We hope to make the model much better in future releases.

  • maxloh 5 hours ago

    Hi. Will the training and fine-tuning code also be released?

    It would be great if the training data were released too!

  • RobKohr 7 hours ago

    What's a good one in reverse: speech-to-text?

  • Perz1val an hour ago

    Is the name a joke on "If the Emperor Had a Text-to-Speech Device"? It's funny.

  • pkaye 7 hours ago

    Where does the training data for these models come from? Is there an openly available dataset that people use?

  • indigodaddy 4 hours ago

    Can Coqui run CPU-only?

    • palmfacehn 3 hours ago

      Yes, XTTS2 has been reasonably performant for me and the cloning is acceptable.

  • countfeng 2 hours ago

    Very good model; thanks for open-sourcing it.

    • rohan_joshi 2 hours ago

      thanks a lot, this model is just a preview checkpoint. the full release next week will be of much higher quality.

  • righthand 4 hours ago

    The sample rate does more than change the quality.

  • mayli 7 hours ago

    Is this English only?

    • a2128 7 hours ago

      If you're looking for other languages, Piper has been around in this scene for much longer and they have open-source training code and a lot of models (they're ~60MB instead of 25MB but whatever...) https://huggingface.co/rhasspy/piper-voices/tree/main

    • riedel an hour ago

      Actually, I find it irritating that the README does not mention the language at all. It is not good practice to have to deduce it from the language of the README itself. I would not want German-language TTS models with only a German README...

    • evgpbfhnr 7 hours ago

      I tried it on some Japanese for kicks; it reads "Chinese letter, Chinese letter, Japanese letter, Chinese letter..." :D

      But yeah, if it's like any of the others, we'll likely see a different "model" per language down the line, based on the same techniques.

    • g7r 7 hours ago

      Yes. The FAQ says that multilingual capabilities are in the works.

  • wewewedxfgdf 7 hours ago

    say is only 193K on macOS:

      ls -lah /usr/bin/say
      -rwxr-xr-x  1 root  wheel   193K 15 Nov  2024 /usr/bin/say
    
    Usage:

      M1-Mac-mini ~ % say "hello world this is the kitten TTS model speaking"
    • dented42 7 hours ago

      That’s not a far comparison. Say just calls the speech synthesis APIs that have been around since at least Mac OS 8.

      That being said, the 'classical' (pre-AI) speech synthesisers are much smaller than Kitten, so you're not wrong per se, just for the wrong reason.

      • deathanatos 5 hours ago

        The linked repository at the top-level here has several gigabytes of dependencies, too.

    • selcuka 7 hours ago

      SAM on Commodore 64 was only 6K:

      https://project64.c64.org/Software/SAM10.TXT

      Obviously it's not fair to compare these with ML models.

    • wnoise 7 hours ago

      And what dynamic libraries is it linked to? And what other data are they pulling in?

    • tonypapousek 5 hours ago

      Tried that on the 26 beta, and the default voice sounds a lot smoother than it used to.

      Running `man say` reveals that "this tool uses the Speech Synthesis manager", so I'm guessing the Apple Intelligence stuff is kicking in.

      • dented42 an hour ago

        Nothing to do with Apple Intelligence. The Speech Synthesis Manager (the term "manager" was used for OS components in classic Mac OS) has been around since the mid-90s or so. The change you're hearing is probably a new or modified default voice.

    • satvikpendem 7 hours ago

      `say` sounds terrible compared to modern neural network based text to speech engines.

      • wewewedxfgdf 6 hours ago

        Sounds about the same as Kitten TTS.

        • satvikpendem 6 hours ago

          To me it sounds worse, especially on certain more complex sentences or words.

  • mg 4 hours ago

    Good TTS feels like something that should be natively built into every consumer device, so the user can decide whether they want to read or listen to the text at hand.

    I'm surprised that phone manufacturers don't include good TTS models in their browser APIs, for example, so that websites can build good audio interfaces.

    I for one would love to build a text editor that the user can use completely via audio. Text input might already be feasible via the "speak to type" feature, both Android and iOS offer.

    But there seems to be no good way to output spoken text without doing round-trips to a server and generate the audio there.

    The interface I would like would offer a way to talk to write and then commands like "Ok editor, read the last paragraph" or "Ok editor, delete the last sentence".

    It could be cool to write this way while walking, just with a headset connected to a phone that sits in one's pocket.

    • pjc50 43 minutes ago

      Can't most people read faster than they can hear? Isn't this why phone menus are so awful?

      > But there seems to be no good way to output spoken text without doing round-trips to a server and generate the audio there

      As people have been pointing out, we've had mediocre TTS since the 80s. If it were a real benefit, people would be using even the inadequate version.

    • jiehong 3 hours ago

      On macOS you can "speak" text in almost every app, using built-in voices (like the Siri voice or some older ones). All offline, and even from the terminal with "say".

  • tapper an hour ago

    I am blind and use NVDA with a synth. How is this news? I don't get it! My synth is called Eloquence and is 4089KB.

    • mwcampbell an hour ago

      Does your Eloquence installation include multiple languages? The one I have is only 1876 KB for US English only. And classic DECtalk is even smaller; I have here a version that's only 638 KB (again, US English only).

  • android521 3 hours ago

    It would be great if there were TypeScript support in the future.

  • GaggiX 8 hours ago

    https://huggingface.co/KittenML/kitten-tts-nano-0.1

    https://github.com/KittenML/KittenTTS

    This is the model and GitHub page; the blog post itself looks very much AI-generated.

  • andai 7 hours ago

    Can you run it in reverse for speech recognition?

  • glietu 5 hours ago

    Kudos guys!

  • khanan 2 hours ago

    "please join our DISCORD!"...