MisoTTS

Project Url: MisoLabsAI/MisoTTS
Introduction: Miso TTS is an 8-billion-parameter, highly emotive text-to-speech model.
Miso TTS 8B

State-of-the-Art Text-to-Speech Model


Quickstart | Model Introduction | Model Summary | Usage | Safety


Quickstart

To quickly try the model, you can use the demo hosted on our landing page at misolabs.ai. To try it locally, follow the instructions below.

If you do not have uv installed yet:

curl -LsSf https://astral.sh/uv/install.sh | sh

Then clone the repository and create the environment:

git clone https://github.com/MisoLabsAI/MisoTTS.git
cd MisoTTS
uv sync --python 3.10
source .venv/bin/activate

Then run the example conversation. By default, run_misotts.py loads the public model from MisoLabs/MisoTTS and downloads it into the Hugging Face cache if it is not already present on your machine:

uv run python run_misotts.py

The script writes full_conversation.wav in the repository root.

With pip instead of uv:

python3.10 -m venv .venv
source .venv/bin/activate
pip install -e .
python run_misotts.py

Model Introduction

Miso TTS 8B is a text-to-dialogue RVQ Transformer inspired by the Sesame CSM architecture. It generates Mimi audio codes from text and optional audio context, using a large Llama 3.2-style backbone and a smaller autoregressive audio decoder. To find out more about the architecture, read our blog post.

The model is designed for high-quality conversational speech generation. This repository contains the inference code, model definition, and setup instructions for running Miso TTS locally.

Language support: Miso TTS 8B currently supports English only.


Model Summary

Item                 Value
Model                Miso TTS 8B
Organization         Miso Labs
Task                 Text-to-speech
Architecture         RVQ Transformer
Backbone             llama-8B
Audio decoder        llama-300M
Text vocabulary      128,256
Audio vocabulary     2,051
Audio codebooks      32
Audio tokenizer      Mimi
Max sequence length  2,048
Languages            English only
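
To get a rough feel for what the sequence length implies, the sketch below estimates how much audio fits in the context window. It assumes Mimi's usual 12.5 Hz frame rate (80 ms per frame) and that each audio frame takes one backbone position, as in Sesame CSM. Neither detail is stated in this README, and both helper functions are hypothetical.

```python
MIMI_FRAME_RATE_HZ = 12.5  # assumption: standard Mimi frame rate, not stated in this README
MAX_SEQ_LEN = 2048         # from the model summary


def frames_for_ms(duration_ms: float) -> int:
    """Number of whole Mimi frames that fit in duration_ms of audio."""
    frame_ms = 1000 / MIMI_FRAME_RATE_HZ  # 80 ms per frame
    return int(duration_ms // frame_ms)


def audio_budget_seconds(text_tokens: int, context_audio_ms: float = 0.0) -> float:
    """Seconds of new audio left once text tokens and context audio are counted.

    Assumes one backbone position per text token and per audio frame.
    """
    used = text_tokens + frames_for_ms(context_audio_ms)
    remaining = max(MAX_SEQ_LEN - used, 0)
    return remaining / MIMI_FRAME_RATE_HZ
```

Under these assumptions, a request with max_audio_length_ms=10_000 corresponds to frames_for_ms(10_000) frames, and long prompts or long context audio reduce the room left for generated speech.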

Architecture

Miso TTS 8B uses two transformer components:

  • A large backbone transformer that consumes text/audio-frame embeddings.
  • A smaller decoder transformer that autoregressively predicts higher-order audio codebooks within each frame.

The backbone accepts interleaved text and audio tokens, allowing it to condition its generations on the conversation history.
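
The division of labor between the two transformers can be illustrated with a toy loop. This is only a sketch with random stand-ins for both networks: backbone_step and decoder_step are hypothetical names, and the assumption that the backbone emits the first codebook of each frame follows the Sesame CSM design rather than anything stated here.

```python
import random

NUM_CODEBOOKS = 32  # from the model summary
AUDIO_VOCAB = 2051  # from the model summary


def backbone_step(history, rng):
    """Stand-in for the backbone: returns a hidden state and the first codebook token."""
    hidden = len(history)  # placeholder for a real hidden-state vector
    return hidden, rng.randrange(AUDIO_VOCAB)


def decoder_step(hidden, codes_so_far, rng):
    """Stand-in for the audio decoder: predicts the next codebook token in the frame."""
    return rng.randrange(AUDIO_VOCAB)


def generate_frames(num_frames, seed=0):
    """Produce num_frames frames, each holding one token per codebook."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_frames):
        # One backbone pass per frame, conditioned on all earlier frames.
        hidden, first_code = backbone_step(history, rng)
        frame = [first_code]
        # The decoder fills in the remaining codebooks one at a time.
        while len(frame) < NUM_CODEBOOKS:
            frame.append(decoder_step(hidden, frame, rng))
        history.append(frame)
    return history
```

The point of the split is that the expensive backbone runs once per frame, while the small decoder handles the many per-codebook steps inside it.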


Usage

Python

import torch
import torchaudio

from generator import load_miso_8b

device = "cuda" if torch.cuda.is_available() else "cpu"

generator = load_miso_8b(
    device=device,
    model_path_or_repo_id="MisoLabs/MisoTTS",
)

audio = generator.generate(
    text="Hello from Miso.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

torchaudio.save("miso.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)

Prompted generation

Miso TTS can condition on prior audio for voice cloning. This is optional; the quickstart example above runs without prompt audio.

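import torchaudio

from generator import Segment, load_miso_8b

generator = load_miso_8b(device="cuda")

# Load the prompt clip, downmix it to mono, and resample to the model's rate.
prompt_audio, sample_rate = torchaudio.load("prompt.wav")
prompt_audio = torchaudio.functional.resample(
    prompt_audio.mean(dim=0),
    orig_freq=sample_rate,
    new_freq=generator.sample_rate,
)

# Pair the prompt audio with its transcript and speaker ID.
context = [
    Segment(
        speaker=0,
        text="This is the transcript for the prompt audio.",
        audio=prompt_audio,
    )
]

# Reuse the same speaker ID so the new sentence follows the prompt voice.
audio = generator.generate(
    text="This is the next sentence to synthesize.",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)

torchaudio.save("prompted.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)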
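
The same context mechanism can carry a whole conversation: each generated turn is appended to the context so that later turns are conditioned on it. The helper below is a minimal sketch of that loop. synthesize_dialogue is a hypothetical name, and the generator and segment constructor are passed in as arguments so the function makes no assumptions beyond the generate(...) call and Segment fields shown above.

```python
def synthesize_dialogue(generator, make_segment, turns, max_audio_length_ms=10_000):
    """Generate each turn in order, feeding earlier turns back as context.

    generator    : object exposing the generate(...) method used in this README
    make_segment : callable that builds a context entry, e.g. the Segment class
    turns        : list of (speaker_id, text) pairs
    """
    context = []
    clips = []
    for speaker, text in turns:
        audio = generator.generate(
            text=text,
            speaker=speaker,
            context=list(context),  # pass a copy so later appends do not alter it
            max_audio_length_ms=max_audio_length_ms,
        )
        clips.append(audio)
        # Store the new turn so the next one is conditioned on it.
        context.append(make_segment(speaker=speaker, text=text, audio=audio))
    return clips
```

A call might look like synthesize_dialogue(generator, Segment, [(0, "Hi there."), (1, "Hello!")]). Because the context grows with every turn, long conversations will eventually reach the model's maximum sequence length, so older turns may need to be dropped.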

Weights

The model weights are hosted publicly on Hugging Face. Running the example script fetches them:

uv run python run_misotts.py

The default model repository is MisoLabs/MisoTTS. The first run downloads the model automatically through Hugging Face Hub; later runs reuse the cached copy.

The first run also downloads the SilentCipher watermarking model from sony/silentcipher. If that separate download times out, rerun the command; the Hugging Face cache resumes from files that already completed.
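
If timeouts are frequent, the cache can be warmed ahead of time with a small retry loop. The sketch below uses snapshot_download from the huggingface_hub package. with_retries and prefetch_models are hypothetical helpers, and it is an assumption that downloading both repositories in full matches what run_misotts.py fetches on its own.

```python
import time


def with_retries(fn, attempts=3, delay_s=5.0, sleep=time.sleep):
    """Call fn() until it succeeds or attempts run out, then re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay_s)  # brief pause before trying again


def prefetch_models():
    """Warm the Hugging Face cache for the model and the watermarking model."""
    # Imported lazily so with_retries can be used without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    for repo_id in ("MisoLabs/MisoTTS", "sony/silentcipher"):
        # Files that already completed are reused from the cache on each retry.
        with_retries(lambda: snapshot_download(repo_id=repo_id))
```

Calling prefetch_models() once before the first run should leave later runs to load from the local cache.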


Deployment Notes

Miso TTS 8B is a large model. For best results, use a CUDA GPU with sufficient VRAM for the checkpoint precision you are loading. The default inference path uses torch.bfloat16.
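
As a back-of-the-envelope check on VRAM, weight memory is roughly the parameter count times the bytes per parameter. The sketch below uses the nominal sizes implied by the model summary (8B backbone plus 300M decoder), which is an assumption; the real checkpoint may differ, and activations, caches, the Mimi tokenizer, and the watermarking model all need memory on top of this.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

# Assumption: nominal sizes taken from the names "llama-8B" and "llama-300M".
NOMINAL_PARAMS = 8e9 + 300e6


def weight_memory_gib(num_params: float, dtype: str = "bfloat16") -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in ("bfloat16", "float32"):
        print(f"{dtype}: {weight_memory_gib(NOMINAL_PARAMS, dtype):.1f} GiB")
```

With these nominal figures, bfloat16 weights come to about 15.5 GiB, and float32 would take twice that.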


Safety

Miso TTS is a speech generation model. Do not use it to impersonate people, create deceptive audio, commit fraud, or generate harmful content.

Generated audio is watermarked by default. If you deploy this model in another application, use your own private watermark key and keep it secret.
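
One way to keep a private key out of source control is to generate it once and load it from the environment at startup. The sketch below assumes the key is a list of five integers in the range 0 to 255, which matches SilentCipher's 40-bit message format but is not confirmed by this README. The environment variable name MISO_WATERMARK_KEY and both helpers are hypothetical, so check the repository's watermarking code for the actual format and where the key is passed in.

```python
import os
import secrets

KEY_LENGTH = 5  # assumption: five 8-bit values, not confirmed by this README


def new_watermark_key(length=KEY_LENGTH):
    """Generate a random key using a cryptographically secure source."""
    return [secrets.randbelow(256) for _ in range(length)]


def load_watermark_key(env_var="MISO_WATERMARK_KEY", length=KEY_LENGTH):
    """Read a comma-separated key such as "12,34,56,78,90" from the environment."""
    raw = os.environ.get(env_var)
    if raw is None:
        raise RuntimeError(f"{env_var} is not set")
    key = [int(part) for part in raw.split(",")]
    if len(key) != length or any(not 0 <= value <= 255 for value in key):
        raise ValueError(f"{env_var} must hold {length} integers between 0 and 255")
    return key
```

Generate the key once with new_watermark_key(), store it in a secret manager or deployment environment, and avoid committing it to the repository.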

