Llasa: Llama-Based Speech Synthesis

(llasatts.github.io)

160 points | by CalmStorm a day ago

19 comments

  • ks2048 19 hours ago

    Odd that the page doesn't seem to link to either of these:

    paper: https://arxiv.org/abs/2502.04128

    github: https://github.com/zhenye234/LLaSA_training

    • thot_experiment 16 hours ago

      Interesting that there isn't a mention of Orpheus as prior art either, since it's the exact same thing.

      (https://github.com/canopyai/Orpheus-TTS)

      • gapeleon 12 hours ago

        > Interesting that there isn't a mention of Orpheus as prior art either

        Llasa-3b (https://huggingface.co/HKUSTAudio/Llasa-3B) came out before Orpheus (https://huggingface.co/canopylabs/orpheus-3b-0.1-ft).

        > it's the exact same thing.

        They're very similar, but they're not the exact same thing.

        Llasa uses xcodec2, a much simpler, lossless 16kHz WAV codec. This makes it superior for one-shot voice cloning.

        Orpheus' 24kHz SNAC codec is lossy, which makes it difficult to use for zero-shot cloning, as the reference audio gets degraded during tokenization. You can test this here: https://huggingface.co/spaces/Gapeleon/snac_test

        But when finetuned on 50+ audio samples, it produces much cleaner 24kHz audio than Llasa, and the SNAC model is much easier to run on consumer hardware than xcodec2 (87 tok/s for realtime speech, which an RTX 3080 can manage, for example).
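
        A rough way to hear the round-trip degradation yourself, assuming the snac Python package's SNAC.from_pretrained / encode / decode API (the model id and tensor shapes below are assumptions from its README, not Orpheus' exact setup):

            # Minimal sketch: round-trip a reference clip through the SNAC codec
            # to hear how much tokenization degrades it before any cloning happens.
            import torch
            import torchaudio
            from snac import SNAC

            model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

            wav, sr = torchaudio.load("reference.wav")             # (channels, samples)
            wav = torchaudio.functional.resample(wav, sr, 24000)   # codec expects 24 kHz
            wav = wav.mean(dim=0, keepdim=True).unsqueeze(0)       # -> (1, 1, samples), mono

            with torch.inference_mode():
                codes = model.encode(wav)      # lists of discrete codes per scale
                wav_hat = model.decode(codes)  # reconstructed audio

            torchaudio.save("reference_roundtrip.wav", wav_hat.squeeze(0), 24000)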

        • oezi 8 hours ago

          Do you happen to know why Orpheus and Llasa use finetuning for voice cloning?

          Zonos uses 128-float embeddings for voices, which seems so much nicer, because you can just mix and match voices without changing the model.
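
          The mixing itself is just arithmetic on the embedding vector. A toy sketch (the 128-float size comes from Zonos; the function name at the end is hypothetical, not a real API):

              import numpy as np

              # Toy "mix and match": interpolate two 128-float voice embeddings
              # and hand the result to the model, no finetuning involved.
              emb_a = np.random.randn(128).astype(np.float32)  # speaker A
              emb_b = np.random.randn(128).astype(np.float32)  # speaker B

              alpha = 0.3                                      # 30% A, 70% B
              mixed = alpha * emb_a + (1.0 - alpha) * emb_b

              # audio = tts(text="Hello", speaker_embedding=mixed)  # hypothetical call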

        • oezi 8 hours ago

          Isn't xcodec2 also lossy? I thought it was also just another neural codec (50 tok/s, single codebook).

          What are people using to upsample back to 44.1 or 48 kHz? Anything fancy?
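
          Naive resampling (e.g. with torchaudio, as sketched below) only interpolates and doesn't restore the missing high frequencies, hence the question:

              import torchaudio

              # Plain (non-fancy) upsampling of 16 kHz codec output to 48 kHz.
              # This only resamples; it does not recover high-frequency content.
              wav, sr = torchaudio.load("llasa_output_16k.wav")
              wav_48k = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=48000)
              torchaudio.save("llasa_output_48k.wav", wav_48k, 48000)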

          • woodson an hour ago

            They’re both lossy. They use a VAE-VQ type architecture trained with a combination of losses/discriminators. The differences are mainly the encoder/decoder architecture, the type of bottleneck quantization (RVQ, FSQ, etc.) and of course the training data.

  • CalmStorm a day ago

    LLaSA is a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as LLaMA.
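
    In other words, the codec turns speech into discrete tokens that simply extend the LLM's vocabulary, and TTS becomes next-token prediction. A rough sketch of that idea (the base model, token names, and counts below are illustrative, not LLaSA's exact configuration):

        from transformers import AutoModelForCausalLM, AutoTokenizer

        # Sketch: append speech-codec tokens to a LLaMA-style vocabulary so the
        # same Transformer predicts text tokens followed by speech tokens.
        base = "meta-llama/Llama-3.2-1B"   # stand-in base model
        tok = AutoTokenizer.from_pretrained(base)
        model = AutoModelForCausalLM.from_pretrained(base)

        n_speech_tokens = 65536            # one per codec codebook entry (size is illustrative)
        tok.add_tokens([f"<|s_{i}|>" for i in range(n_speech_tokens)])
        model.resize_token_embeddings(len(tok))

        # Training/decoding then looks like ordinary causal LM next-token prediction:
        #   "<text> Hello world </text> <|s_412|> <|s_7|> ... <|s_901|>"
        # with the generated speech tokens mapped back to a waveform by the codec.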

    • WastedCucumber 21 hours ago

      Probably the title should have the correct capitalization then, because I was fully expecting a speech synthesis tool that sounded like llamas talking human language, and now I'm bummed out!

  • StevenNunez 21 hours ago

    I can't wait to see this integrated into Open WebUI! These sound amazing.

    • gapeleon 12 hours ago

      You can run an OpenAI-compatible endpoint and point Open WebUI at it if you want this. I had to add a function to filter out markdown lists, code, etc., as the model was choking on them.
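
      Something along these lines works; the exact patterns are just an example, tune to taste:

          import re

          def strip_for_tts(text: str) -> str:
              """Rough example of stripping markdown the TTS model chokes on."""
              text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)               # fenced code blocks
              text = re.sub(r"`[^`]+`", " ", text)                                  # inline code
              text = re.sub(r"^\s*([-*+]|\d+\.)\s+", "", text, flags=re.MULTILINE)  # list markers
              text = re.sub(r"[#>*_]+", " ", text)                                  # headers, quotes, emphasis
              return re.sub(r"\s+", " ", text).strip()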

  • mring33621 20 hours ago

    the long 'uuuuhhhhhhh' from some of the lesser models is killing me.

    • gapeleon 12 hours ago

      This finetune seems pretty stable (1B Llasa): https://huggingface.co/spaces/HKUST-Audio/Llasa-1B-multi-spe...

      1B is actually huge for a TTS model. Here's an 82M model with probably the most stable/coherent output of all the open-weights TTS models I've tested: https://huggingface.co/spaces/hexgrad/Kokoro-TTS

      But if you mean zero-shot cloning, yeah they all seem to have those slurred speech artefacts from time to time.

    • nialv7 2 hours ago

      the mispronunciation of 行 and 行 in the Chinese sample is killing me too XD

    • jszymborski 19 hours ago

      Based on the samples, it really seems like anything smaller than 3B is pretty useless.

      • hadlock 18 hours ago

        If you're doing a home lab voice assistant, 1B is nice, because on a 12 GB GPU you can run a moderately competent 7B LLM and two 1B models, one for speech-to-text and one for text-to-speech, plus a bit extra for the wake-word monitor. Maybe in a couple of years we can combine all this into a single ~8B model that runs efficiently on a 12 GB GPU. Nvidia doesn't seem very incentivized right now to sell consumer GPUs that can run all this on a single consumer-grade chip when they're making so much money selling commercial-grade 48 GB cards.
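
        The back-of-envelope VRAM math works out roughly like this (the quantization choices and overhead figure are assumptions):

            # Rough VRAM budget for the 12 GB setup described above.
            GiB = 1024**3

            llm_7b   = 7e9 * 0.5 / GiB   # ~3.3 GiB at 4-bit quantization
            tts_1b   = 1e9 * 2.0 / GiB   # ~1.9 GiB at fp16
            stt_1b   = 1e9 * 2.0 / GiB   # ~1.9 GiB at fp16
            overhead = 2.0               # KV cache, codec, wake-word model, CUDA context (guess)

            print(f"~{llm_7b + tts_1b + stt_1b + overhead:.1f} GiB of 12 GiB")  # ~9.0 GiB, fits with headroom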

  • dheera 19 hours ago

    > employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align

    I really wish that when new models were released, they would draw a diagram of all the layers and the tensor input and output sizes at each layer, with zoom in/out capabilities using D3.js or whatever visualization framework if needed. Every single layer should be on there with its input and output sizes.

    These one-sentence descriptions, and approximate block diagrams with arrows pointing at each other are never enough to understand how something is actually implemented.
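
    Short of the authors publishing that, the closest you can get today is registering forward hooks on the released checkpoint and dumping every layer's output shape yourself; a sketch (using the Llasa-3B checkpoint linked upthread, assuming it loads as a standard causal LM):

        import torch
        from transformers import AutoModelForCausalLM

        model = AutoModelForCausalLM.from_pretrained("HKUSTAudio/Llasa-3B")

        def report(name):
            def hook(module, inputs, output):
                if isinstance(output, torch.Tensor):
                    print(f"{name:60s} {type(module).__name__:20s} {tuple(output.shape)}")
            return hook

        # Print every submodule's output tensor shape during one dummy forward pass.
        for name, module in model.named_modules():
            module.register_forward_hook(report(name))

        with torch.no_grad():
            model(input_ids=torch.randint(0, 1000, (1, 16)))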

    • dr_kiszonka 12 hours ago

      That might be intentional.

    • imtringued 7 hours ago

      This already exists in Transformer Lab and ONNX (not recommended for transformers).

      You can also build a custom version of llama.cpp that writes out the ggml compute graph. What's irritating is that Hugging Face didn't add it to their GGUF file viewer.

    • exe34 18 hours ago

      Sounds like a solid SaaS business plan!