Update README.md
README.md
CHANGED
@@ -1,3 +1,143 @@
|
|
1 |
-
---
|
2 |
-
|
3 |
-
|
1 |
+
---
|
2 |
+
language:
|
3 |
+
- en
|
4 |
+
license: mit
|
5 |
+
tags:
|
6 |
+
- text-to-speech
|
7 |
+
- tts
|
8 |
+
- voice-synthesis
|
9 |
+
- voice-cloning
|
10 |
+
- zero-shot
|
11 |
+
- emotion-control
|
12 |
+
library_name: chatterbox-tts
|
13 |
+
pipeline_tag: text-to-speech
|
14 |
+
---
|
15 |
+
|
16 |
+
# Chatterbox TTS
|
17 |
+
|
18 |
+
<img width="1200" alt="cb-big2" src="https://github.com/user-attachments/assets/bd8c5f03-e91d-4ee5-b680-57355da204d1" />
|
19 |
+
|
20 |
+
## Model Description
|
21 |
+
|
22 |
+
Chatterbox is Resemble AI's first production-grade open-source text-to-speech (TTS) model. It is built on a 0.5B-parameter Llama backbone and trained on 0.5M hours of cleaned speech data. In side-by-side listening evaluations (see the Podonos benchmarks under Links), its zero-shot output is consistently preferred over leading closed-source systems such as ElevenLabs.
|
23 |
+
|
24 |
+
### Key Features
|
25 |
+
|
26 |
+
- **State-of-the-art zero-shot TTS**: Generate natural-sounding speech without fine-tuning
|
27 |
+
- **Emotion exaggeration control**: First open-source TTS model with adjustable emotional intensity
|
28 |
+
- **Ultra-stable generation**: Alignment-informed inference for consistent outputs
|
29 |
+
- **Voice cloning**: Zero-shot cloning from a reference audio prompt (`audio_prompt_path`)
|
30 |
+
- **Built-in watermarking**: PerTh (Perceptual Threshold) watermarking for responsible AI
|
31 |
+
- **Production-ready**: Sub-200ms latency suitable for real-time applications
|
32 |
+
|
33 |
+
## Intended Uses & Limitations
|
34 |
+
|
35 |
+
### Intended Uses
|
36 |
+
|
37 |
+
- Content creation (videos, memes, games)
|
38 |
+
- AI agents and voice assistants
|
39 |
+
- Interactive media and applications
|
40 |
+
- Educational content
|
41 |
+
- Accessibility tools
|
42 |
+
- Creative projects requiring expressive speech
|
43 |
+
|
44 |
+
### Limitations
|
45 |
+
|
46 |
+
- Currently supports English only
|
47 |
+
- Requires CUDA-capable GPU for optimal performance
|
48 |
+
- Output includes imperceptible watermarks for traceability
|
49 |
+
|
50 |
+
### Ethical Considerations
|
51 |
+
|
52 |
+
- All generated audio includes Resemble AI's PerTh watermarking for responsible use tracking
|
53 |
+
- Users must comply with applicable laws and ethical guidelines
|
54 |
+
- Not intended for creating deceptive or harmful content
|
55 |
+
- Please review the disclaimer section before use
|
56 |
+
|
57 |
+
## How to Use
|
58 |
+
|
59 |
+
### Installation
|
60 |
+
|
61 |
+
```bash
|
62 |
+
pip install chatterbox-tts
|
63 |
+
```
|
64 |
+
|
65 |
+
### Basic Usage
|
66 |
+
|
67 |
+
```python
|
68 |
+
import torchaudio as ta
|
69 |
+
from chatterbox.tts import ChatterboxTTS
|
70 |
+
|
71 |
+
# Initialize model
|
72 |
+
model = ChatterboxTTS.from_pretrained(device="cuda")
|
73 |
+
|
74 |
+
# Generate speech from text
|
75 |
+
text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
|
76 |
+
wav = model.generate(text)
|
77 |
+
ta.save("output.wav", wav, model.sr)
|
78 |
+
|
79 |
+
# Generate with custom voice
|
80 |
+
AUDIO_PROMPT_PATH = "path/to/voice/sample.wav"
|
81 |
+
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
|
82 |
+
ta.save("output_custom_voice.wav", wav, model.sr)
|
83 |
+
```
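
The example above hard-codes `device="cuda"`, while the Limitations section only says a CUDA GPU is needed for optimal performance. A minimal sketch of a fallback follows. `pick_device` is a hypothetical helper, not part of chatterbox, and whether CPU inference works out of the box may depend on the installed `chatterbox-tts` version.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is missing, so there is nothing to query
    import torch  # imported lazily so this file loads without PyTorch

    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    device = pick_device()
    print(f"Selected device: {device}")
    # Assumed usage, mirroring the example above:
    # model = ChatterboxTTS.from_pretrained(device=device)
```

Expect noticeably slower generation on CPU than on a GPU.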
|
84 |
+
|
85 |
+
### Advanced Usage Tips
|
86 |
+
|
87 |
+
#### General Use (TTS and Voice Agents)
|
88 |
+
- Default settings (`exaggeration=0.5`, `cfg_weight=0.5`) work well for most prompts
|
89 |
+
- For fast-speaking reference voices, lower `cfg_weight` to ~0.3 for better pacing
|
90 |
+
|
91 |
+
#### Expressive or Dramatic Speech
|
92 |
+
- Use lower `cfg_weight` values (~0.3) with higher `exaggeration` (≥0.7)
|
93 |
+
- Higher `exaggeration` tends to speed up speech; lowering `cfg_weight` compensates with slower, more deliberate pacing
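
The tips above can be captured as a few named presets. This is a sketch under stated assumptions: the preset names and the `generation_kwargs` and `synthesize` helpers are hypothetical, the CFG setting from the tips is assumed to be the `cfg_weight` keyword of `model.generate`, and 0.7 is simply the lower bound of the "≥0.7" range.

```python
# Hypothetical preset names; the numeric values come from the tips above.
PRESETS = {
    "general": {"exaggeration": 0.5, "cfg_weight": 0.5},
    "fast_reference": {"exaggeration": 0.5, "cfg_weight": 0.3},
    "expressive": {"exaggeration": 0.7, "cfg_weight": 0.3},
}


def generation_kwargs(style: str = "general") -> dict:
    """Return a copy of the keyword arguments for the named preset."""
    if style not in PRESETS:
        raise ValueError(f"unknown style {style!r}; choose from {sorted(PRESETS)}")
    return dict(PRESETS[style])


def synthesize(model, text: str, style: str = "general", audio_prompt_path=None):
    """Call model.generate with a preset; `model` is any object with .generate()."""
    kwargs = generation_kwargs(style)
    if audio_prompt_path is not None:
        kwargs["audio_prompt_path"] = audio_prompt_path
    return model.generate(text, **kwargs)
```

With a loaded model, `synthesize(model, text, style="expressive")` would return the waveform to pass to `ta.save`, as in the basic example.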
|
94 |
+
|
95 |
+
## Model Details
|
96 |
+
|
97 |
+
### Architecture
|
98 |
+
- **Backbone**: 0.5B parameter Llama-based architecture
|
99 |
+
- **Training Data**: 0.5M hours of cleaned speech data
|
100 |
+
- **Special Features**: Alignment-informed inference for stability
|
101 |
+
|
102 |
+
### Performance
|
103 |
+
- Consistently preferred over ElevenLabs in side-by-side evaluations
|
104 |
+
- Ultra-low latency (<200ms) suitable for production use
|
105 |
+
- Stable generation with minimal artifacts
|
106 |
+
|
107 |
+
## Citation
|
108 |
+
|
109 |
+
If you use Chatterbox in your research or projects, please cite:
|
110 |
+
|
111 |
+
```bibtex
|
112 |
+
@software{chatterbox2025,
|
113 |
+
title = {Chatterbox TTS},
|
114 |
+
author = {{Resemble AI}},
|
115 |
+
year = {2025},
|
116 |
+
url = {https://github.com/resemble-ai/chatterbox}
|
117 |
+
}
|
118 |
+
```
|
119 |
+
|
120 |
+
## Acknowledgments
|
121 |
+
|
122 |
+
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
|
123 |
+
- [HiFT-GAN](https://github.com/yl4579/HiFTNet)
|
124 |
+
- [Llama 3](https://github.com/meta-llama/llama3)
|
125 |
+
|
126 |
+
## Links
|
127 |
+
|
128 |
+
- 🎧 [Listen to demo samples](https://resemble-ai.github.io/chatterbox_demopage/)
|
129 |
+
- 🤗 [Try it on Hugging Face Spaces](https://huggingface.co/spaces/ResembleAI/Chatterbox)
|
130 |
+
- 📊 [View benchmarks on Podonos](https://podonos.com/resembleai/chatterbox)
|
131 |
+
- 🏢 [Resemble AI TTS Service](https://resemble.ai) (for scaled production use)
|
132 |
+
|
133 |
+
## Disclaimer
|
134 |
+
|
135 |
+
This model should not be used for creating deceptive, harmful, or malicious content. Users are responsible for ensuring their use complies with all applicable laws and ethical guidelines. All generated audio includes imperceptible watermarks for responsible AI tracking.
|
136 |
+
|
137 |
+
## License
|
138 |
+
|
139 |
+
This model is licensed under the MIT License. See the LICENSE file for details.
|
140 |
+
|
141 |
+
---
|
142 |
+
|
143 |
+
*Made with ♥️ by [Resemble AI](https://resemble.ai)*
|