The headline feature isn’t the 25 MB footprint alone. It’s that KittenTTS is Apache-2.0. That combo means you can embed a fully offline voice in Pi Zero-class hardware or even battery-powered toys without worrying about GPUs, cloud calls, or restrictive licenses. In one stroke it turns voice everywhere from a hardware/licensing problem into a packaging problem. Quality tweaks can come later; unlocking that deployment tier is the real game-changer.
Festival's English model, festvox-kallpc16k, is about 6 MB, and that is the larger one; festvox-kallpc8k is about 3.5 MB.
eSpeak NG's data files take about 12 MB (multi-lingual).
I guess this one may generate more natural-sounding speech, but older or lower-end computers were capable of decent speech synthesis previously as well.
Custom voices could be added, but the speed was more important to some users.
$ ls -lh /usr/bin/flite
Listed as 27K last I checked.
I recall some Blind users were able to decode Gordon 8-bit dialogue at speeds most people found incomprehensible. =3
> It’s that KittenTTS is Apache-2.0
Have you seen the code[1] in the repo? It uses phonemizer[2] which is GPL-3.0 licensed. In its current state, it's effectively GPL licensed.
[1]: https://github.com/KittenML/KittenTTS/blob/main/kittentts/on...
[2]: https://github.com/bootphon/phonemizer
I'm playing around with an NVIDIA Jetson Orin Nano Super right now and it's actually pretty usable with gemma3:4b, and quite fast - even image processing is done in 10-20 seconds, but that is with GPU support. When something is not working and Ollama is not using the GPU, these calls take ages because the CPU is just bad.
I am curious how fast this is with CPU only.
> KittenTTS is Apache-2.0
What about the training data? Is everyone 100% confident that models are not a derived work of the training inputs now, even if they can reproduce input exactly?
yeah, we are super excited to build tiny ai models that are super high quality. local voice interfaces are inevitable and we want to power those in the future. btw, this model is just a preview, and the full release next week will be of much higher quality, along w another ~80M model ;)
It depends on espeak-ng which is GPLv3
On another machine the Python version is too new, and the package/dependencies don't want to install.
There are still people who use machine-wide Python installs instead of environments? Python dependency hell was already bad years ago, but today it's completely impractical to do it that way. Even on raspberries.
Debian pretty much "solved" this by making pip refuse to install packages if you are not in a venv.
Python man
We are working to fix that. Thanks
"Fixing python packaging" is somewhat harder than AGI.
Have you considered offering a uvx command to run to get people going quickly?
Though I think you would still need to have the Python build dependencies installed for that to work.
ok
Microsoft's and some of Google's TTS models make the simplest mistakes. For instance, they sometimes read "i.e." as "for example." This is a problem if you have low vision and use TTS for, say, proofreading your emails.
Why does it happen? I'm genuinely curious.
Well, speech synthesizers are pretty much famous for speaking all sorts of things wrong. But what I find very concerning about LLM-based TTS is that some of them can't really speak numbers greater than 100. They try, but fail a lot. At least tts-1-hd was doing this for almost every 3- or 4-digit number. Especially noticeable when it is supposed to read a year.
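One workaround is to normalize the text before it ever reaches the model; a small sketch using the num2words package (the abbreviation table and the year heuristic are only examples, not anything these engines actually do):

# Sketch: expand abbreviations and spell out numbers before handing text to a
# TTS engine, so the model never has to guess. Uses the real `num2words`
# package; the abbreviation map here is illustrative only.
import re
from num2words import num2words

ABBREVIATIONS = {"i.e.": "that is", "e.g.": "for example", "etc.": "et cetera"}

def normalize(text: str) -> str:
    for abbr, spoken in ABBREVIATIONS.items():
        text = text.replace(abbr, spoken)
    # Read 4-digit numbers in a plausible year range as years ("nineteen eighty-four").
    def spell_number(m: re.Match) -> str:
        n = int(m.group())
        style = "year" if 1000 <= n <= 2099 else "cardinal"
        return num2words(n, to=style)
    return re.sub(r"\d+", spell_number, text)

print(normalize("The paper, i.e. the 1984 draft, cites 250 sources."))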
Reddit post with generated audio sample: https://www.reddit.com/r/LocalLLaMA/comments/1mhyzp7/kitten_...
was it cross trained on futurama voices?
The reddit video is awesome. I don't understand how people are calling it an OK model. Under 25MB and cpu only for this quality is amazing.
Sounds slow, and like something from an anime.
Speech speed is always a tunable parameter and not something intrinsic to the model.
The comparison to make is expressiveness and correct intonation for long sentences vs something like espeak. It actually sounds amazing for the size. The closest thing is probably KokoroTTS at 82M params and ~300MB.
I tried it. Not bad for the size (of the model) and speed. Once you install all the massive number of libraries and things needed we are a far cry away from 25MB though. Cool project nonetheless.
It mentions ONNX, so I imagine an ONNX model is or will be available.
ONNX runtime is a single library, with C#'s package being ~115MB compressed.
Not tiny, but usually only a few lines to actually run and only a single dependency.
We will try to get rid of dependencies.
Usually pulling in lots of libraries helps develop/iterate faster. Then can be removed later once the whole thing starts to take shape.
Web version: https://clowerweb.github.io/kitten-tts-web-demo/
It sounds ok, but impressive for the size.
Does anybody find it funny that sci-fi movies have to heavily distort "robot voices" to make them sound "convincingly robotic"? A robotic, explicitly non-natural voice would be perfectly acceptable, and even desirable, in many situations. I don't expect a smart toaster to talk like a BBC host; it'd be enough if the speech is easy to recognize.
A robotic, explicitly non-natural voice would be perfectly acceptable, and even desirable, in many situations[...]it'd be enough if the speech is easy to recognize.
We've had formant synths for several decades, and they're perfectly understandable and require a tiny amount of computing power, but people tend not to want to listen to them:
https://en.wikipedia.org/wiki/Software_Automatic_Mouth
https://simulationcorner.net/index.php?page=sam (try it yourself to hear what it sounds like)
SAM and the way it works is not what people typically associate with the term "formant synthesizer."
DECtalk[1,2] would be a much better example, that's as formant as you get.
[1] https://en.wikipedia.org/wiki/DECtalk [2] https://webspeak.terminal.ink
Well, this one is a bit too jarring to the ears.
But there is no latency, as opposed to KittenTTS, so it certainly has its applications too.
I think it's charming
Try this demo, which has more knobs:
https://discordier.github.io/sam/
Yeah blind people love eloquence
I remember that the novelization of The Fifth Element describes the cops being taught to speak as robotically as possible when using speakers, for some reason. Always found the idea weird that someone would _want_ that.
This one is at least an interesting idea: https://genderlessvoice.com/
Meet Q, a Genderless Voice - https://news.ycombinator.com/item?id=19505835 - March 2019 (235 comments)
The voice sounds great! I find it quite aesthetically pleasing, but it's far from genderless.
Interesting concept, but why is that site filled with Top X blogspam?
It doesn't sound genderless.
Depends on the movie. Ash and Bishop in the Alien franchise sound human until there's a dramatic reason to sound more 'robotic'.
I agree with your wider point. I use Google TTS with Moon+Reader all the time (I tried audio books read by real humans but I prefer the consistency of TTS)
Slightly different there because it's important in both cases that Ripley (and we) can't tell they're androids until it's explicitly uncovered. The whole point is that they're not presented as artificial. Same in Blade Runner: "more human than human". You don't have a film without the ambiguity there.
> I don't expect a smart toaster to talk like a BBC host;
Well sure, the BBC have already established that it's supposed to sound like a brit doing an impersonation of an American: https://www.youtube.com/watch?v=LRq_SAuQDec
I got an error when I tried the demo with 6 sentences, but it worked great when I reduced the text to 3 sentences. Is the length limit due to the model or just a limitation for the demo?
Currently we don't have chunking enabled yet. We will add it soon. That will remove the length limitations.
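Until then, chunking can be done on the caller's side; a rough sketch, assuming the generate()-style API and 24 kHz output shown in the project README plus the soundfile package (exact class and voice names may differ):

# Split on sentence boundaries, synthesize each piece, concatenate the audio.
import re
import numpy as np
import soundfile as sf
from kittentts import KittenTTS  # API per the README; treat names as assumptions

model = KittenTTS("KittenML/kitten-tts-nano-0.1")
text = "This first Book proposes, first in brief, the whole Subject..."

chunks = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
audio = np.concatenate([model.generate(c, voice="expr-voice-2-f") for c in chunks])
sf.write("output.wav", audio, 24000)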
Perhaps a length limit? I tried this:
"This first Book proposes, first in brief, the whole Subject, Mans disobedience, and the loss thereupon of Paradise wherein he was plac't: Then touches the prime cause of his fall, the Serpent, or rather Satan in the Serpent; who revolting from God, and drawing to his side many Legions of Angels, was by the command of God driven out of Heaven with all his Crew into the great Deep."
It takes a while until it starts generating sound on my i7 cores but it kind of works.
This also works:
"blah. bleh. blih. bloh. blyh. bluh."
So I don't think it's a limit on punctuation. Voice quality is quite bad though, not as far from the old school C64 SAM (https://discordier.github.io/sam/) of the eighties as I expected.
I tried to replicate their demo text but it doesn't sound as good for some reason.
If anyone else wants to try:
> Kitten TTS is an open-source series of tiny and expressive text-to-speech models for on-device applications. Our smallest model is less than 25 megabytes.
Is the demo using a model other than the smallest one?
Doesn't work here. Backend module returns a 404:
https://clowerweb.github.io/node_modules/onnxruntime-web/dis...
Looks like this commit 15 minutes ago broke it https://github.com/clowerweb/kitten-tts-web-demo/commit/6b5c...
(seems reverted now)
On PC it's a python dependency hell but someone managed to package it in self contained JS code that works offline once it loaded the model? How is that done?
ONNXRuntime makes it fairly easy, you just need to provide a path to the ONNX file, give it inputs in the correct format, and use the outputs. The ONNXRuntime library handles the rest. You can see this in the main.js file: https://github.com/clowerweb/kitten-tts-web-demo/blob/main/m...
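The Python onnxruntime package follows the same three steps; the input/output names below are placeholders, since they depend on how the model was exported:

# 1. point at the ONNX file, 2. feed inputs in the right format, 3. use the outputs.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("kitten_tts_nano.onnx")           # path to the ONNX file
print([i.name for i in sess.get_inputs()])                    # inspect the expected input names
feeds = {"input_ids": np.array([[1, 2, 3]], dtype=np.int64)}  # placeholder name and data
outputs = sess.run(None, feeds)                               # the runtime handles the rest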
Plus, Python software are dependency hell in general, while webpages have to be self-contained by their nature (thank god we no longer have Silverlight and Java applets...)
Thanks, I was looking for that. While the reddit demo sounds OK, though only at a level we reached a couple of years ago, all the TTS samples I tried were barely understandable at all.
This is just an early checkpoint. We hope that the quality will improve in the future.
> Error generating speech: failed to call OrtRun(). ERROR_CODE: 2, ERROR_MESSAGE: Non-zero status code returned while running Expand node. Name:'/bert/Expand' Status Message: invalid expand shape
Doesn't seem to work with Thai.
You can also try on https://clowerweb.github.io/node_modules/onnxruntime-web/dis...
yeah, this is just a preview model from an early checkpoint. the full model release will be next week which includes a 15M model and an 80M model, both of which will have much higher quality than this preview.
This doesn’t seem to work on Safari. Works great on Chrome, though
Hmm, we will look into it.
You should post on the NVDA email list: https://nvda.groups.io/g/nvda Or the screen reader list: https://winaccess.groups.io/g/winaccess FYI, blind people do not like any lag when reading; that is why so many still use Eloquence and eSpeak.
Try https://github.com/Picovoice/orca It's about 7MB all included
Not open source. "You will need internet connectivity to validate your AccessKey with Picovoice license servers ... If you wish to increase your limits, you can purchase a subscription plan." https://github.com/Picovoice/orca#accesskey
The guy is just spamming the project in a lot of comments.
Going online is a dealbreaker, but if you really need it you could use Ghidra to fix that. I had tried to find a conversion of their model to ONNX (making their proprietary pipeline useless) but failed.
Hopefully open source will render them irrelevant in the future.
Does an APK exist for Android for replacing its text-to-speech engine? I tried sherpa-onnx but it seemed too slow for real-time usage, and especially so for audiobooks when sped up.
https://github.com/Picovoice/orca/tree/main/demo%2Fandroid
I can't test this out right now, is this just a demo or is it actually an apk for replacing the engine? Because those are two different things, the latter can be used any time you want to read something aloud on the page for example. This is the sherpa-onnx one I'm talking about.
https://k2-fsa.github.io/sherpa/onnx/tts/apk-engine.html
I hope this is the future. Offline, small ML models, running inference on ubiquitous, inexpensive hardware. Models that are easy to integrate into other things, into devices and apps, and even to drive from other models maybe.
Dedicated single-purpose hardware with models would be even less energy-intensive. It's theoretically possible to design chips which run neural networks and the like using just resistors (rather than transistors).
Such hardware is not general-purpose, and upgrading the model would not be possible, but there's plenty of use-cases where this is reasonable.
That is our vision too!
yeah totally. the quality of these tiny models is only going to go up.
I don't mind the size in MB so much, nor the fact that it's pure CPU, nor the quality; what I do mind, however, is the latency. I hope it's fast.
Aside: Are there any models for understanding voice to text, fully offline, without training?
I will be very impressed when we are able to have a conversation with an AI at a natural rate and not "probe, space, response".
NVIDIA's Parakeet https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 appears to be state of the art for English: 10x faster than Whisper.
My mid-range AMD CPU is multiple times faster than realtime with parakeet.
Voice to text fully offline can be done with whisper. A few apps offer it for dictation or transcription.
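For reference, the openai-whisper package needs only a couple of lines; the audio filename here is just a placeholder:

import whisper

model = whisper.load_model("tiny")        # downloads once, then runs fully offline
result = model.transcribe("meeting.wav")  # returns a dict with the full text and segments
print(result["text"])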
"The brown fox jumps over the lazy dog.."
Average duration per generation: 1.28 seconds
Characters processed per second: 30.35
--
"Um"
Average duration per generation: 0.22 seconds
Characters processed per second: 9.23
--
"The brown fox jumps over the lazy dog.. The brown fox jumps over the lazy dog.."
Average duration per generation: 2.25 seconds
Characters processed per second: 35.04
--
processor : 0
vendor_id : AuthenticAMD
cpu family : 25
model : 80
model name : AMD Ryzen 7 5800H with Radeon Graphics
stepping : 0
microcode : 0xa50000c
cpu MHz : 1397.397
cache size : 512 KB
assuming most answers will be more than a sentence, 2.25 seconds is already too long once you factor in the token generation in between... and imagine with reasoning!... We're not there yet.
Hmm, that actually seems extremely slow; Piper can crank out a sentence almost instantly on a Pi 4, which is like a sloth compared to that Ryzen, and the speech quality seems about the same at first glance.
I suppose it would make sense if you want to include it on top of an LLM that's already occupying most of a GPU and this could run in the limited VRAM that's left.
>Aside: Are there any models for understanding voice to text, fully offline, without training?
OpenAI's whisper is a few years old and pretty solid.
https://github.com/openai/whisper
Any idea what factors play into latency in TTS models?
Mostly model size and input size. Some models which use attention are O(N^2) in the input length.
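If you want to see which regime a given engine is in, a rough sketch; the synthesize() placeholder stands for whatever engine you're measuring (e.g. a KittenTTS generate() call or an ONNX session run):

import time

def synthesize(text: str):
    ...  # plug in the TTS call you want to benchmark

sentence = "The brown fox jumps over the lazy dog. "
for n in (1, 2, 4, 8):
    start = time.perf_counter()
    synthesize(sentence * n)
    elapsed = time.perf_counter() - start
    print(f"{n} sentences: {elapsed:.2f}s")  # roughly linear vs. super-linear growth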
Wow, amazing and good work, I hope to see more amazing models running on CPUs!
thanks, we're going to release many more models in the future, that can run on just CPUs.
I'm curious why smallish TTS models have metallic voice quality.
The pronunciation sounds about right - I thought that was the hard part, and the model does it well. But voice timbre should be simpler to fix? Like, a simple FIR filter might improve it?
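For what it's worth, a minimal sketch of that kind of FIR post-filter with SciPy; whether it actually helps the metallic character is an open question, and the cutoff is arbitrary:

# Linear-phase FIR low-pass applied to a generated (mono) waveform to soften
# harsh high frequencies; filename and cutoff are placeholders.
import soundfile as sf
from scipy.signal import firwin, lfilter

audio, sr = sf.read("output.wav")                 # assumes mono TTS output
taps = firwin(numtaps=101, cutoff=6000, fs=sr)    # 101-tap FIR, 6 kHz low-pass
filtered = lfilter(taps, 1.0, audio)
sf.write("output_filtered.wav", filtered, sr)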
amazing! can't wait to integrate it into https://desktop.with.audio I'm already using KokorosTTS without a GPU. It works fairly well on Apple Silicon.
Foundational tools like this open up the possibility of one-time-payment or even free tools.
would love to see how that turns out. the full model release next week will be more expressive and higher quality than this one so we're excited to see you try that out.
Hmm the quality is not so impressive. I'm looking for a really natural-sounding model. Not very happy with Piper/Kokoro; XTTS was a bit complex to set up.
For STT, Whisper is really amazing. But I miss a good TTS. And I don't mind throwing GPU power at it. But anyway, this isn't it either; this sounds worse than Kokoro.
The best open one I've found so far is Dia - https://github.com/nari-labs/dia - it has some limitations, but i think it's really impressive and I can run it on my laptop.
Chatterbox is also worth a try.
You should give https://pinokio.co/ a try
> Hmm the quality is not so impressive. [...] And I don't mind throwing GPU power at it.
This isn't for you, then. You should evaluate quality here based on the fact you don't need a GPU.
Back in the pre-Tacotron2 days, I was running slim TTS and vocoder models like GlowTTS and MelGAN on Digital Ocean droplets. No GPU to speak of. It cost next to nothing to run.
Since then, the trend has been to scale up. We need more models to scale down.
In the future we'll see small models living on-device. Embedded within toys and tools that don't need or want a network connection. Deployed with Raspberry Pi.
Edge AI will be huge for robotics, toys and consumer products, and gaming (ie. world models).
Try https://github.com/Picovoice/orca
Okay, lots of detailed information and example code, great. But skimming through, I didn't see any audio samples to judge the quality?
They posted a demo on reddit[0]. It sounds amazing given the tiny size.
[0] https://old.reddit.com/r/LocalLLaMA/comments/1mhyzp7/kitten_...
Thanks! Yeah. It definitely isn’t the absolute best in quality but it trounces the default TTS options on macOS (as third party developers are locked out of the Siri voices). And for less than the size of many modern web pages…
It is not the best TTS but it is freaking amazing it can be done by such a small model and it is good enough for so many use cases.
thanks, but keep in mind that this model is just a preview checkpoint that is only 10% trained. the full release next week will be of much higher quality and it will include a 15M model and an 80M model.
I wonder what it would take to extend it with a custom voice?
Most of these comments were originally posted to a different thread (https://news.ycombinator.com/item?id=44806543). I've moved them hither because on HN we always prefer to give the project creators credit for their work.
(it does however explain how many of these comments are older than the thread they are now children of)
Someone please port this to ONNX so we don't need to do all this ass tooling
Cool.
While I think this is indeed impressive and has a specific use case (e.g. in the embedded sector), I'm not totally convinced that the quality is good enough to replace bigger models.
With fish-speech[1] and f5-tts[2] there are at least 2 open source models pushing the quality limits of offline text-to-speech. I tested F5-TTS with an old NVIDIA 1660 (6 GB VRAM) and it worked OK-ish, so running it on slightly more modern hardware will not cost you a fortune and will produce MUCH higher quality, with multi-language and zero-shot support.
For Android there is SherpaTTS[3], which plays pretty well with most TTS Applications.
1: https://github.com/fishaudio/fish-speech
2: https://github.com/SWivid/F5-TTS
3: https://github.com/woheller69/ttsengine
We have released just a preview of the model. We hope to make the model much better in future releases.
Hi. Will the training and fine-tuning code also be released?
It would be great if the training data were released too!
What's a good one in reverse; speech to text?
Whisper and the many variants. Here's a good implementation.
https://github.com/ggml-org/whisper.cpp
This one is a whisper-based Python package
https://github.com/primaprashant/hns
Is the name a joke on "If the Emperor had a Text-to-Speech Device"? It's funny.
Chrome does TTS too.
https://codepen.io/logicalmadboy/pen/RwpqMRV
Where does the training data come for the models? Is there an openly available dataset the people use?
Can Coqui run on CPU only?
Yes, XTTS2 has been reasonably performant for me and the cloning is acceptable.
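For anyone who found XTTS complex to set up, a minimal CPU-only sketch with Coqui's TTS package; the model id is the usual XTTS v2 one and the reference clip path is just a placeholder:

# pip install TTS; a few seconds of reference audio is enough for the cloning.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=False)
tts.tts_to_file(
    text="Kitten TTS is a tiny on-device text-to-speech model.",
    speaker_wav="my_voice.wav",   # reference clip to clone (placeholder path)
    language="en",
    file_path="xtts_out.wav",
)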
Very good model, thanks for the open source
thanks a lot, this model is just a preview checkpoint. the full release next week will be of much higher quality.
The sample rate does more than change the quality.
Is this english only?
If you're looking for other languages, Piper has been around in this scene for much longer and they have open-source training code and a lot of models (they're ~60MB instead of 25MB but whatever...) https://huggingface.co/rhasspy/piper-voices/tree/main
Or use https://github.com/Picovoice/orca which is about 7MB and supports 8 languages
you need an API key and internet access to run locally? lol. Classic .NET project.
Actually I found it irritating that the readme does not mention the language at all. I think it is not good practice to deduce it from the language of the readme itself. I would not like to have German language tts models with only a German readme...
I tried it on some Japanese just for kicks; it reads... "Chinese letter chinese letter japanese letter chinese letter..." :D
But yeah, if it's like any of the others we'll likely see a different "model" per language down the line based on the same techniques
Yes. The FAQ says that multilingual capabilities are in the works.
say is only 193K on macOS
That's not a fair comparison. Say just calls the speech synthesis APIs that have been around since at least Mac OS 8.
That being said, the ‘classical’ (pre-AI) speech synthesisers are much smaller than kitten, so you’re not wrong per se, just for the wrong reason.
The linked repository at the top-level here has several gigabytes of dependencies, too.
SAM on Commodore 64 was only 6K:
https://project64.c64.org/Software/SAM10.TXT
Obviously it's not fair to compare these with ML models.
And what dynamic libraries is it linked to? And what other data are they pulling in?
Tried that on the macOS 26 beta, and the default voice sounds a lot smoother than it used to.
Running `man say` reveals that "this tool uses the Speech Synthesis manager", so I'm guessing the Apple Intelligence stuff is kicking in.
Nothing to do with Apple Intelligence. The speech synthesiser manager (the term manager was used for OS components in Classic Mac OS) has been around since the mid 90s or so. The change you’re hearing is probably a new/modified default voice.
`say` sounds terrible compared to modern neural network based text to speech engines.
Sounds about the same as Kitten TTS.
To me it sounds worse, especially on the construction of certain more complex sentences or words.
Good TTS feels like it is something that should be natively built into every consumer device. So the user can decide if they want to read or listen to the text at hand.
I'm surprised that phone manufacturers do not include good TTS models in their browser APIs for example. So that websites can build good audio interfaces.
I for one would love to build a text editor that the user can use completely via audio. Text input might already be feasible via the "speak to type" feature both Android and iOS offer.
But there seems to be no good way to output spoken text without doing round-trips to a server and generate the audio there.
The interface I would like would offer a way to talk to write and then commands like "Ok editor, read the last paragraph" or "Ok editor, delete the last sentence".
It could be cool to do writing this way while walking. Just with a headset connected to a phone that sits in one's pocket.
Can't most people read faster than they can hear? Isn't this why phone menus are so awful?
> But there seems to be no good way to output spoken text without doing round-trips to a server and generate the audio there
As people have been pointing out, we've had mediocre TTS since the 80s. If it was a real benefit people would be using even the inadequate version.
On Mac OS you can "speak" a text in almost every app, using built in voice (like the Siri voice or some older voices). All offline, and even from the terminal with "say".
I am blind and use NVDA with a synth. How is this news? I don't get it! My synth is called Eloquence and is 4089 KB.
Does your Eloquence installation include multiple languages? The one I have is only 1876 KB for US English only. And classic DECtalk is even smaller; I have here a version that's only 638 KB (again, US English only).
it would be great if there were TypeScript support in the future
Yup, it runs in the web browser. https://clowerweb.github.io/kitten-tts-web-demo/
https://huggingface.co/KittenML/kitten-tts-nano-0.1
https://github.com/KittenML/KittenTTS
This is the model and GitHub page; the blog post looks very much AI-generated.
Can you run it in reverse for speech recognition?
no, but whisper has a 39M model: https://github.com/openai/whisper
We will release an STT model as well.
Kudos guys!
Thanks
♥
"please join our DISCORD!"...