erichartford committed on
Commit
0d07681
·
verified ·
1 Parent(s): 641e6f9

Update README.md

  1. README.md +143 -3
README.md CHANGED
@@ -1,3 +1,143 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
+ tags:
6
+ - text-to-speech
7
+ - tts
8
+ - voice-synthesis
9
+ - voice-cloning
10
+ - zero-shot
11
+ - emotion-control
12
+ library_name: chatterbox-tts
13
+ pipeline_tag: text-to-speech
14
+ ---
15
+
16
+ # Chatterbox TTS
17
+
18
+ <img width="1200" alt="cb-big2" src="https://github.com/user-attachments/assets/bd8c5f03-e91d-4ee5-b680-57355da204d1" />
19
+
20
+ ## Model Description
21
+
22
+ Chatterbox is Resemble AI's first production-grade open-source text-to-speech (TTS) model. It is built on a 0.5B-parameter Llama backbone and trained on 0.5M hours of cleaned speech data. In side-by-side evaluations of zero-shot TTS, it was consistently preferred over leading closed-source systems such as ElevenLabs.
23
+
24
+ ### Key Features
25
+
26
+ - **State-of-the-art zero-shot TTS**: Generate natural-sounding speech without fine-tuning
27
+ - **Emotion exaggeration control**: First open-source TTS model with adjustable emotional intensity
28
+ - **Ultra-stable generation**: Alignment-informed inference for consistent outputs
29
+ - **Voice cloning**: Easy voice conversion with audio prompts
30
+ - **Built-in watermarking**: PerTh (Perceptual Threshold) watermarking for responsible AI
31
+ - **Production-ready**: Sub-200ms latency suitable for real-time applications
32
+
33
+ ## Intended Uses & Limitations
34
+
35
+ ### Intended Uses
36
+
37
+ - Content creation (videos, memes, games)
38
+ - AI agents and voice assistants
39
+ - Interactive media and applications
40
+ - Educational content
41
+ - Accessibility tools
42
+ - Creative projects requiring expressive speech
43
+
44
+ ### Limitations
45
+
46
+ - Currently supports English only
47
+ - Requires a CUDA-capable GPU for optimal performance
48
+ - Output includes imperceptible watermarks for traceability
49
+
50
+ ### Ethical Considerations
51
+
52
+ - All generated audio includes Resemble AI's PerTh watermarking for responsible use tracking
53
+ - Users must comply with applicable laws and ethical guidelines
54
+ - Not intended for creating deceptive or harmful content
55
+ - Please review the disclaimer section before use
56
+
57
+ ## How to Use
58
+
59
+ ### Installation
60
+
61
+ ```bash
62
+ pip install chatterbox-tts
63
+ ```
64
+
65
+ ### Basic Usage
66
+
67
+ ```python
68
+ import torchaudio as ta
69
+ from chatterbox.tts import ChatterboxTTS
70
+
71
+ # Initialize model
72
+ model = ChatterboxTTS.from_pretrained(device="cuda")
73
+
74
+ # Generate speech from text
75
+ text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
76
+ wav = model.generate(text)
77
+ ta.save("output.wav", wav, model.sr)
78
+
79
+ # Generate with custom voice
80
+ AUDIO_PROMPT_PATH = "path/to/voice/sample.wav"
81
+ wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
82
+ ta.save("output_custom_voice.wav", wav, model.sr)
83
+ ```
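Because the example above hardcodes `device="cuda"`, it fails on machines without an NVIDIA GPU. The sketch below shows one way to choose a device string before loading the model. `pick_device` is a hypothetical helper, not part of `chatterbox-tts`, and it assumes that `ChatterboxTTS.from_pretrained` also accepts `"mps"` and `"cpu"`, which the README does not state.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no accelerator to query; report CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists on recent PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    device = pick_device()
    print(f"Selected device: {device}")
    # Assumption: from_pretrained accepts device strings other than "cuda".
    # from chatterbox.tts import ChatterboxTTS
    # model = ChatterboxTTS.from_pretrained(device=device)
```

Per the Limitations section, expect generation to be noticeably slower when this falls back to CPU.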
84
+
85
+ ### Advanced Usage Tips
86
+
87
+ #### General Use (TTS and Voice Agents)
88
+ - Default settings (`exaggeration=0.5`, `cfg_weight=0.5`) work well for most prompts
89
+ - For fast-speaking reference voices, lower `cfg_weight` to ~0.3 for better pacing
90
+
91
+ #### Expressive or Dramatic Speech
92
+ - Use lower `cfg_weight` values (~0.3) with higher `exaggeration` (≥0.7)
93
+ - Higher `exaggeration` tends to speed up speech; lowering `cfg_weight` compensates with slower, more deliberate pacing
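The tips above can be collected into named presets. This is a minimal sketch: `GenerationPreset`, `PRESETS`, and `generation_kwargs` are hypothetical names introduced here for illustration, and the sketch assumes that `generate` takes the two settings as the keyword arguments `exaggeration` and `cfg_weight` (the classifier-free guidance weight).

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GenerationPreset:
    """Hypothetical container for the two settings discussed in the tips."""
    exaggeration: float
    cfg_weight: float


# Values come from the tips; the preset names are illustrative only.
PRESETS = {
    "general": GenerationPreset(exaggeration=0.5, cfg_weight=0.5),
    "fast_reference": GenerationPreset(exaggeration=0.5, cfg_weight=0.3),
    "expressive": GenerationPreset(exaggeration=0.7, cfg_weight=0.3),
}


def generation_kwargs(style: str) -> dict:
    """Return keyword arguments for a preset, failing loudly on unknown names."""
    if style not in PRESETS:
        raise ValueError(f"unknown style {style!r}; choose from {sorted(PRESETS)}")
    return asdict(PRESETS[style])


if __name__ == "__main__":
    print(generation_kwargs("expressive"))
    # Assumption: generate() accepts these as keyword arguments.
    # wav = model.generate(text, **generation_kwargs("expressive"))
```

Keeping the values in one table makes it easier to adjust them per reference voice than scattering literals across call sites.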
94
+
95
+ ## Model Details
96
+
97
+ ### Architecture
98
+ - **Backbone**: 0.5B parameter Llama-based architecture
99
+ - **Training Data**: 0.5M hours of cleaned speech data
100
+ - **Special Features**: Alignment-informed inference for stability
101
+
102
+ ### Performance
103
+ - Consistently preferred over ElevenLabs in side-by-side evaluations
104
+ - Ultra-low latency (<200ms) suitable for production use
105
+ - Stable generation with minimal artifacts
106
+
107
+ ## Citation
108
+
109
+ If you use Chatterbox in your research or projects, please cite:
110
+
111
+ ```bibtex
112
+ @software{chatterbox2025,
113
+ title = {Chatterbox TTS},
114
+ author = {Resemble AI},
115
+ year = {2025},
116
+ url = {https://github.com/resemble-ai/chatterbox}
117
+ }
118
+ ```
119
+
120
+ ## Acknowledgments
121
+
122
+ - [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
123
+ - [HiFT-GAN](https://github.com/yl4579/HiFTNet)
124
+ - [Llama 3](https://github.com/meta-llama/llama3)
125
+
126
+ ## Links
127
+
128
+ - 🎧 [Listen to demo samples](https://resemble-ai.github.io/chatterbox_demopage/)
129
+ - 🤗 [Try it on Hugging Face Spaces](https://huggingface.co/spaces/ResembleAI/Chatterbox)
130
+ - 📊 [View benchmarks on Podonos](https://podonos.com/resembleai/chatterbox)
131
+ - 🏢 [Resemble AI TTS Service](https://resemble.ai) (for scaled production use)
132
+
133
+ ## Disclaimer
134
+
135
+ This model should not be used for creating deceptive, harmful, or malicious content. Users are responsible for ensuring their use complies with all applicable laws and ethical guidelines. All generated audio includes imperceptible watermarks for responsible AI tracking.
136
+
137
+ ## License
138
+
139
+ This model is licensed under the MIT License. See the LICENSE file for details.
140
+
141
+ ---
142
+
143
+ *Made with ♥️ by [Resemble AI](https://resemble.ai)*