---
datasets:
- pkufool/libriheavy
language:
- en
pipeline_tag: text-to-speech
library_name: transformers
---
# SLED-TTS: Efficient Speech Language Modeling via Energy Distance in Continuous Space
[🤗 SLED-TTS Collection](https://huggingface.co/collections/ICTNLP/sled-tts-680253e19c889010a1a376ac) | [WeChat](https://www.wechat.com) | [ICT/CAS](https://ict.cas.cn)
## Code: https://github.com/ictnlp/SLED-TTS
## Key features
- **Autoregressive Continuous Modeling**: SLED models speech in a continuous latent space, using a special type of maximum mean discrepancy (the energy distance, sketched after this list) as the training objective.
- **Streaming Synthesis**: SLED supports streaming synthesis, enabling speech generation to start as soon as the text stream begins.
- **Voice Cloning**: Generates speech conditioned on a 3-second speech prefix or a reference utterance used as a prompt.
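
For reference, the (squared) energy distance between a model distribution P and the data distribution Q is the MMD instance referred to above; SLED optimizes a conditional, per-step variant of it over latent vectors (see the paper for the exact estimator used in training):

$$
D^2(P, Q) = 2\,\mathbb{E}\,\lVert X - Y \rVert - \mathbb{E}\,\lVert X - X' \rVert - \mathbb{E}\,\lVert Y - Y' \rVert ,
$$

where X, X' are independent samples from P and Y, Y' are independent samples from Q.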
## Demo
You can see SLED in action by exploring the [demo page](https://sled-demo.github.io/).
<div style="display: flex;">
<img src="https://github.com/user-attachments/assets/0f6ee8a0-4258-48a2-a670-5556672dbc18" width="200" style="margin-right: 20px;"/>
<img src="https://github.com/user-attachments/assets/f48848b0-58d9-403a-86d1-80683565a4d7" width="500"/>
</div>
## Available Models on Hugging Face
We have made SLED available on [Hugging Face](https://huggingface.co/collections/ICTNLP/sled-tts-680253e19c889010a1a376ac), currently offering two distinct English models for different use cases:
1. **[SLED-TTS-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Libriheavy)**: This model is trained on the Libriheavy dataset and provides high-quality text-to-speech synthesis.
2. **[SLED-TTS-Streaming-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Streaming-Libriheavy)**: This variant supports **streaming decoding**, which generates a 0.6-second speech chunk for every 5 text tokens received. It’s ideal for applications requiring low-latency audio generation.
Mandarin models are on the way! Alternatively, you can train your own SLED-TTS models by following the guidelines below.
## Usage
**We provide the training and inference code for SLED-TTS.**
### Installation
``` sh
git clone https://github.com/ictnlp/SLED-TTS.git
cd SLED-TTS
pip install -e ./
```
We currently use the sum of the first 8 embedding vectors from [Encodec_24khz](https://huggingface.co/facebook/encodec_24khz) as the continuous latent vector. Before proceeding, make sure [Encodec_24khz](https://huggingface.co/facebook/encodec_24khz) has been downloaded and cached in your Hugging Face cache directory.
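
The snippet below is a minimal, hypothetical sketch (not code from this repository) of how such a latent can be obtained with the `transformers` Encodec API, assuming the 8 vectors in question are the embeddings of the first 8 residual codebooks; running it also downloads and caches the Encodec checkpoint. The repository's data-processing scripts remain the authoritative reference.

```python
import numpy as np
import torch
from transformers import AutoProcessor, EncodecModel

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# Placeholder: one second of quiet noise at 24 kHz; substitute a real waveform.
wav = (0.01 * np.random.randn(24_000)).astype(np.float32)

inputs = processor(raw_audio=wav, sampling_rate=24_000, return_tensors="pt")
with torch.no_grad():
    # bandwidth=6.0 kbps selects exactly 8 codebooks for encodec_24khz.
    encoded = model.encode(inputs["input_values"], inputs.get("padding_mask"), bandwidth=6.0)
    codes = encoded.audio_codes[0]                           # (batch, 8, frames)
    # Sum of the 8 codebook embeddings -> continuous latent described above.
    latent = model.quantizer.decode(codes.transpose(0, 1))   # (batch, 128, frames)
print(latent.shape)
```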
### Inference
- Set the `CHECKPOINT` variable to the path of the cached **[SLED-TTS-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Libriheavy)** or **[SLED-TTS-Streaming-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Streaming-Libriheavy)** model.
- Diverse generation results can be obtained by varying the `SEED` variable.
``` sh
CHECKPOINT=/path/to/checkpoint
CFG=2.0
SEED=0
```
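
If you have not downloaded a checkpoint yet, one way to fetch it and obtain a local path for `CHECKPOINT` (using `huggingface_hub` directly, outside the repo's own scripts) is:

```python
from huggingface_hub import snapshot_download

# Downloads the offline model (or reuses the cached copy) and prints its local path;
# swap in ICTNLP/SLED-TTS-Streaming-Libriheavy for the streaming variant.
print(snapshot_download("ICTNLP/SLED-TTS-Libriheavy"))
```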
***Offline Inference***
``` sh
python scripts/run_offline.py \
--model_name_or_path ${CHECKPOINT} \
--cfg ${CFG} \
--input "My remark pleases him, but I soon prove to him that it is not the right way to speak. However perfect may have been the language of that ancient writer." \
--seed ${SEED}
```
***Streaming Inference***
``` sh
python scripts/run_stream.py \
--model_name_or_path ${CHECKPOINT} \
--cfg ${CFG} \
--input "My remark pleases him, but I soon prove to him that it is not the right way to speak. However perfect may have been the language of that ancient writer." \
--seed ${SEED}
# Note: run_stream.py simulates generation in a streaming environment so that streaming
# quality can be evaluated; the current code does not expose an actual streaming API.
```
***Voice Clone***
You can adjust the prompt speech by setting `--prompt_text` and `--prompt_audio`.
``` sh
python scripts/run_voice_clone.py \
--prompt_text "Were I in the warm room with all the splendor and magnificence!" \
--prompt_audio "example_prompt.flac" \
--model_name_or_path ${CHECKPOINT} \
--cfg ${CFG} \
--input "Perhaps the other trees from the forest will come to look at me!" \
--seed ${SEED}
```
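
The prompt audio is encoded with Encodec_24khz, which operates on 24 kHz mono audio. Whether `run_voice_clone.py` resamples internally is not covered here, but you can prepare a reference recording up front with a short, hypothetical `torchaudio` snippet like this:

```python
import torchaudio

# Load an arbitrary recording, mix down to mono, resample to Encodec's 24 kHz,
# and save it as the prompt file referenced by --prompt_audio above.
wav, sr = torchaudio.load("my_recording.wav")        # hypothetical input file
wav = wav.mean(dim=0, keepdim=True)                  # (1, samples), mono
wav = torchaudio.functional.resample(wav, sr, 24_000)
torchaudio.save("example_prompt.flac", wav, 24_000)
```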
### Training
***Data Processing***
#TODO
***Training Offline Model***
``` sh
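# Assumes a torchrun launch: set WORLD_SIZE, RANK, MASTER_ADDR and MASTER_PORT
# to match your cluster before running (single node: WORLD_SIZE=1, RANK=0).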
OUTPUT_DIR=./runs/libriheavy
mkdir -p $OUTPUT_DIR
LOG_FILE=${OUTPUT_DIR}/log
BATCH_SIZE=8
UPDATE_FREQ=8
# assuming 8 processes per node, pick values so that WORLD_SIZE * 8 * BATCH_SIZE * UPDATE_FREQ == 512 (global effective batch size)
torchrun --nnodes ${WORLD_SIZE} --node_rank ${RANK} --nproc_per_node 8 --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} \
./scripts/train_libriheavy.py \
--training_cfg 0.1 \
--num_hidden_layers 12 --diffloss_d 6 --noise_channels 128 \
--dataloader_num_workers 8 \
--dataloader_pin_memory True \
--remove_unused_columns False \
--label_names audio_inputs \
--group_by_speech_length \
--do_train \
--do_eval \
--eval_strategy steps \
--eval_steps 10000 \
--prediction_loss_only \
--per_device_train_batch_size ${BATCH_SIZE} \
--per_device_eval_batch_size 24 \
--gradient_accumulation_steps ${UPDATE_FREQ} \
--bf16 \
--learning_rate 5e-4 \
--weight_decay 0.01 \
--adam_beta1 0.9 \
--adam_beta2 0.999 \
--adam_epsilon 1e-8 \
--max_grad_norm 1.0 \
--max_steps 300000 \
--lr_scheduler_type "linear" \
--warmup_steps 32000 \
--logging_first_step \
--logging_steps 100 \
--save_steps 10000 \
--save_total_limit 10 \
--output_dir ${OUTPUT_DIR} \
--report_to tensorboard \
--disable_tqdm True \
--ddp_timeout 3600 --overwrite_output_dir
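# Checkpoints land in ${OUTPUT_DIR}/checkpoint-<step> (HF Trainer convention); the final
# checkpoint-300000 is what the streaming recipe below fine-tunes from via --finetune_path.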
```
***Training Streaming Model***
``` sh
OUTPUT_DIR=./runs/libriheavy_stream
mkdir -p $OUTPUT_DIR
LOG_FILE=${OUTPUT_DIR}/log
BATCH_SIZE=8
UPDATE_FREQ=8
# assuming 8 processes per node, pick values so that WORLD_SIZE * 8 * BATCH_SIZE * UPDATE_FREQ == 512 (global effective batch size)
torchrun --nnodes ${WORLD_SIZE} --node_rank ${RANK} --nproc_per_node 8 --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} \
./scripts/train_libriheavy_stream.py \
--finetune_path ./runs/libriheavy/checkpoint-300000/model.safetensors \
--stream_n 5 --stream_m 45 \
--training_cfg 0.1 \
--num_hidden_layers 12 --diffloss_d 6 --noise_channels 128 \
--dataloader_num_workers 8 \
--dataloader_pin_memory True \
--remove_unused_columns False \
--label_names audio_inputs \
--group_by_speech_length \
--do_train \
--do_eval \
--eval_strategy steps \
--eval_steps 10000 \
--prediction_loss_only \
--per_device_train_batch_size ${BATCH_SIZE} \
--per_device_eval_batch_size 24 \
--gradient_accumulation_steps ${UPDATE_FREQ} \
--bf16 \
--learning_rate 3e-4 \
--weight_decay 0.01 \
--adam_beta1 0.9 \
--adam_beta2 0.999 \
--adam_epsilon 1e-8 \
--max_grad_norm 1.0 \
--max_steps 100000 \
--lr_scheduler_type "linear" \
--warmup_steps 10000 \
--logging_first_step \
--logging_steps 100 \
--save_steps 10000 \
--save_total_limit 10 \
--output_dir ${OUTPUT_DIR} \
--report_to tensorboard \
--disable_tqdm True \
--ddp_timeout 3600 --overwrite_output_dir
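# Note: --stream_n 5 --stream_m 45 presumably realizes the 5-text-token / 0.6-second
# chunking described above (45 latent frames at Encodec's 75 Hz frame rate = 0.6 s).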
```
## Code Contributors
- [Zhengrui Ma](https://scholar.google.com/citations?user=dUgq6tEAAAAJ)
- [Chenze Shao](https://scholar.google.com/citations?user=LH_rZf8AAAAJ)
## Acknowledgement
This work is inspired by the following great works:
- Continuous Visual Autoregressive Generation via Score Maximization
- Autoregressive Image Generation without Vector Quantization
- A Spectral Energy Distance for Parallel Speech Synthesis
## Citation
```
@misc{ma2025efficientspeechlanguagemodeling,
title={Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space},
author={Zhengrui Ma and Yang Feng and Chenze Shao and Fandong Meng and Jie Zhou and Min Zhang},
year={2025},
eprint={2505.13181},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.13181},
}
```