|
--- |
|
datasets: |
|
- pkufool/libriheavy |
|
language: |
|
- en |
|
pipeline_tag: text-to-speech |
|
library_name: transformers |
|
--- |
|
|
|
# SLED-TTS: Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space
|
[🤗 SLED-TTS Collection](https://huggingface.co/collections/ICTNLP/sled-tts-680253e19c889010a1a376ac)

[WeChat](https://www.wechat.com)

[ICT/CAS](https://ict.cas.cn)
|
|
|
|
|
## Code: https://github.com/ictnlp/SLED-TTS
|
|
|
## Key features |
|
- **Autoregressive Continuous Modeling**: SLED models speech autoregressively in a continuous latent space, using a special type of maximum mean discrepancy (an energy-distance-style objective; see the sketch after this list) as the training objective.
|
- **Streaming Synthesis**: SLED supports streaming synthesis, enabling speech generation to start as soon as the text stream begins. |
|
- **Voice Cloning**: Capable of generating speech conditioned on a 3-second speech prefix or a reference utterance used as a prompt.
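As a rough illustration of the first point, the snippet below sketches an energy-distance-style objective between model samples and the target latent. This is our illustration only; `energy_distance_loss` and its shapes are hypothetical, and the actual SLED objective is implemented in the repo.

``` python
import torch

def energy_distance_loss(target: torch.Tensor,
                         sample_a: torch.Tensor,
                         sample_b: torch.Tensor) -> torch.Tensor:
    """Illustrative energy-distance-style loss (not the exact SLED objective).

    `sample_a` and `sample_b` are two independent samples from the model's
    conditional distribution at the same step; `target` is the ground-truth
    continuous latent. All tensors have shape (batch, latent_dim).
    """
    # Attraction terms pull model samples toward the data ...
    attraction = (sample_a - target).norm(dim=-1) + (sample_b - target).norm(dim=-1)
    # ... while the repulsion term keeps the model distribution from collapsing.
    repulsion = (sample_a - sample_b).norm(dim=-1)
    return (attraction - repulsion).mean()
```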
|
|
|
|
|
|
|
## Demo |
|
You can see SLED in action by exploring the [demo page](https://sled-demo.github.io/).
|
<div style="display: flex;"> |
|
<img src="https://github.com/user-attachments/assets/0f6ee8a0-4258-48a2-a670-5556672dbc18" width="200" style="margin-right: 20px;"/> |
|
<img src="https://github.com/user-attachments/assets/f48848b0-58d9-403a-86d1-80683565a4d7" width="500"/> |
|
</div> |
|
|
|
## Available Models on Hugging Face |
|
|
|
We have made SLED available on [Hugging Face](https://huggingface.co/collections/ICTNLP/sled-tts-680253e19c889010a1a376ac), currently offering two distinct English models for different use cases: |
|
|
|
1. **[SLED-TTS-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Libriheavy)**: This model is trained on the Libriheavy dataset and provides high-quality text-to-speech synthesis. |
|
|
|
2. **[SLED-TTS-Streaming-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Streaming-Libriheavy)**: This variant supports **streaming decoding**, generating a 0.6-second speech chunk for every 5 text tokens received (see the sketch below). It's ideal for applications requiring low-latency audio generation.
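For intuition, the 5-token/0.6-second pattern matches the `--stream_n 5 --stream_m 45` flags used in the streaming training script below: 45 latent frames at Encodec's 75 Hz frame rate is 0.6 seconds. Here is a hypothetical sketch of the resulting read/write schedule (`interleave_schedule` is ours, not a repo API):

``` python
def interleave_schedule(num_text_tokens: int, n: int = 5, m: int = 45):
    """Alternate reading n text tokens with emitting m latent frames.

    At Encodec's 75 Hz frame rate, m=45 frames correspond to 0.6 s of speech.
    Returns a list of ("read", count) / ("write", count) segments.
    """
    schedule = []
    remaining = num_text_tokens
    while remaining > 0:
        chunk = min(n, remaining)
        schedule.append(("read", chunk))
        schedule.append(("write", m))
        remaining -= chunk
    return schedule

# 12 text tokens -> [("read", 5), ("write", 45), ("read", 5), ("write", 45),
#                    ("read", 2), ("write", 45)]
```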
|
|
|
|
|
The Mandarin models are on the way! Alternatively, you can train your own SLED-TTS models by following the guidelines below. |
|
|
|
## Usage |
|
**We provide the training and inference code for SLED-TTS.** |
|
|
|
### Installation |
|
``` sh |
|
git clone https://github.com/ictnlp/SLED-TTS.git |
|
cd SLED-TTS |
|
pip install -e ./ |
|
``` |
|
|
|
We currently use the sum of the first 8 codebook embedding vectors from [Encodec_24khz](https://huggingface.co/facebook/encodec_24khz) as the continuous latent vector. Before proceeding, make sure [Encodec_24khz](https://huggingface.co/facebook/encodec_24khz) is downloaded and cached in your Hugging Face cache directory.
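If you want to inspect these latents yourself, the sketch below shows one way to recover them with the `transformers` Encodec API: encode at 6 kbps (which selects 8 residual codebooks for `encodec_24khz`) and sum the codebook embeddings. This is our sketch, not code from this repo, and tensor shapes may vary across `transformers` versions.

``` python
import torch
from transformers import AutoProcessor, EncodecModel

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

wav = torch.randn(24000)  # placeholder: 1 second of 24 kHz mono audio
inputs = processor(raw_audio=wav.numpy(), sampling_rate=24000, return_tensors="pt")

with torch.no_grad():
    # bandwidth=6.0 corresponds to 8 residual codebooks for encodec_24khz
    encoded = model.encode(inputs["input_values"], inputs["padding_mask"], bandwidth=6.0)
    codes = encoded.audio_codes[0]  # (batch, 8, frames); one chunk at 24 kHz
    # quantizer.decode sums the embeddings of all provided codebook levels
    latents = model.quantizer.decode(codes.transpose(0, 1))  # (batch, 128, frames)
```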
|
|
|
### Inference |
|
- Set the `CHECKPOINT` variable to the path of the cached **[SLED-TTS-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Libriheavy)** or **[SLED-TTS-Streaming-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Streaming-Libriheavy)** model. |
|
- Diverse generation results can be obtained by varying the `SEED` variable. |
|
``` sh |
|
CHECKPOINT=/path/to/checkpoint |
|
CFG=2.0 |
|
SEED=0 |
|
``` |
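`CFG` is the classifier-free guidance scale; the training scripts below drop the text condition with probability 0.1 (`--training_cfg 0.1`) to make such guidance possible. As a generic illustration (not SLED's exact implementation), guidance extrapolates between conditional and unconditional predictions:

``` python
import torch

def apply_cfg(pred_cond: torch.Tensor,
              pred_uncond: torch.Tensor,
              cfg: float = 2.0) -> torch.Tensor:
    """Generic classifier-free guidance, shown for intuition only.

    cfg = 1.0 recovers the conditional prediction; larger values push the
    output further toward the text condition.
    """
    return pred_uncond + cfg * (pred_cond - pred_uncond)
```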
|
***Offline Inference*** |
|
``` sh |
|
python scripts/run_offline.py \ |
|
--model_name_or_path ${CHECKPOINT} \ |
|
--cfg ${CFG} \ |
|
--input "My remark pleases him, but I soon prove to him that it is not the right way to speak. However perfect may have been the language of that ancient writer." \ |
|
--seed ${SEED} |
|
``` |
|
***Streaming Inference*** |
|
``` sh |
|
python scripts/run_stream.py \ |
|
--model_name_or_path ${CHECKPOINT} \ |
|
--cfg ${CFG} \ |
|
--input "My remark pleases him, but I soon prove to him that it is not the right way to speak. However perfect may have been the language of that ancient writer." \ |
|
--seed ${SEED} |
|
# Note: run_stream.py simulates generation in a streaming environment so that
# streaming quality can be evaluated; it does not yet expose a real streaming API.
|
``` |
|
***Voice Cloning***
|
|
|
You can change the speech prompt by setting `--prompt_text` and `--prompt_audio`.
|
``` sh |
|
python scripts/run_voice_clone.py \ |
|
--prompt_text "Were I in the warm room with all the splendor and magnificence!" \ |
|
--prompt_audio "example_prompt.flac" \ |
|
--model_name_or_path ${CHECKPOINT} \ |
|
--cfg ${CFG} \ |
|
--input "Perhaps the other trees from the forest will come to look at me!" \ |
|
--seed ${SEED} |
|
``` |
|
|
|
### Training |
|
|
|
***Data Processing*** |
|
#TODO |
|
|
|
***Training Offline Model*** |
|
``` sh |
|
OUTPUT_DIR=./runs/libriheavy |
|
mkdir -p $OUTPUT_DIR |
|
LOG_FILE=${OUTPUT_DIR}/log |
|
|
|
BATCH_SIZE=8 |
|
UPDATE_FREQ=8 |
|
# assuming 8 processes per node, the effective batch size is WORLD_SIZE * 8 * BATCH_SIZE * UPDATE_FREQ == 512 sequences
|
|
|
torchrun --nnodes ${WORLD_SIZE} --node_rank ${RANK} --nproc_per_node 8 --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} \ |
|
./scripts/train_libriheavy.py \ |
|
--training_cfg 0.1 \ |
|
--num_hidden_layers 12 --diffloss_d 6 --noise_channels 128 \ |
|
--dataloader_num_workers 8 \ |
|
--dataloader_pin_memory True \ |
|
--remove_unused_columns False \ |
|
--label_names audio_inputs \ |
|
--group_by_speech_length \ |
|
--do_train \ |
|
--do_eval \ |
|
--eval_strategy steps \ |
|
--eval_steps 10000 \ |
|
--prediction_loss_only \ |
|
--per_device_train_batch_size ${BATCH_SIZE} \ |
|
--per_device_eval_batch_size 24 \ |
|
--gradient_accumulation_steps ${UPDATE_FREQ} \ |
|
--bf16 \ |
|
--learning_rate 5e-4 \ |
|
--weight_decay 0.01 \ |
|
--adam_beta1 0.9 \ |
|
--adam_beta2 0.999 \ |
|
--adam_epsilon 1e-8 \ |
|
--max_grad_norm 1.0 \ |
|
--max_steps 300000 \ |
|
--lr_scheduler_type "linear" \ |
|
--warmup_steps 32000 \ |
|
--logging_first_step \ |
|
--logging_steps 100 \ |
|
--save_steps 10000 \ |
|
--save_total_limit 10 \ |
|
--output_dir ${OUTPUT_DIR} \ |
|
--report_to tensorboard \ |
|
--disable_tqdm True \ |
|
--ddp_timeout 3600 --overwrite_output_dir |
|
|
|
``` |
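The launch command assumes a multi-node setup; whatever values you pick, the effective batch size per optimizer step should come out to 512 sequences. A quick sanity check (hypothetical helper, not part of the repo):

``` python
def effective_batch(world_size: int, procs_per_node: int = 8,
                    batch_size: int = 8, update_freq: int = 8) -> int:
    """Sequences consumed per optimizer step across all workers."""
    return world_size * procs_per_node * batch_size * update_freq

assert effective_batch(world_size=1) == 512  # e.g. one node with 8 GPUs
```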
|
|
|
***Training Streaming Model*** |
|
``` sh |
|
OUTPUT_DIR=./runs/libriheavy_stream |
|
mkdir -p $OUTPUT_DIR |
|
LOG_FILE=${OUTPUT_DIR}/log |
|
|
|
BATCH_SIZE=8 |
|
UPDATE_FREQ=8 |
|
# assuming 8 processes per node, the effective batch size is WORLD_SIZE * 8 * BATCH_SIZE * UPDATE_FREQ == 512 sequences
|
|
|
torchrun --nnodes ${WORLD_SIZE} --node_rank ${RANK} --nproc_per_node 8 --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} \ |
|
./scripts/train_libriheavy_stream.py \ |
|
--finetune_path ./runs/libriheavy/checkpoint-300000/model.safetensors \ |
|
--stream_n 5 --stream_m 45 \ |
|
--training_cfg 0.1 \ |
|
--num_hidden_layers 12 --diffloss_d 6 --noise_channels 128 \ |
|
--dataloader_num_workers 8 \ |
|
--dataloader_pin_memory True \ |
|
--remove_unused_columns False \ |
|
--label_names audio_inputs \ |
|
--group_by_speech_length \ |
|
--do_train \ |
|
--do_eval \ |
|
--eval_strategy steps \ |
|
--eval_steps 10000 \ |
|
--prediction_loss_only \ |
|
--per_device_train_batch_size ${BATCH_SIZE} \ |
|
--per_device_eval_batch_size 24 \ |
|
--gradient_accumulation_steps ${UPDATE_FREQ} \ |
|
--bf16 \ |
|
--learning_rate 3e-4 \ |
|
--weight_decay 0.01 \ |
|
--adam_beta1 0.9 \ |
|
--adam_beta2 0.999 \ |
|
--adam_epsilon 1e-8 \ |
|
--max_grad_norm 1.0 \ |
|
--max_steps 100000 \ |
|
--lr_scheduler_type "linear" \ |
|
--warmup_steps 10000 \ |
|
--logging_first_step \ |
|
--logging_steps 100 \ |
|
--save_steps 10000 \ |
|
--save_total_limit 10 \ |
|
--output_dir ${OUTPUT_DIR} \ |
|
--report_to tensorboard \ |
|
--disable_tqdm True \ |
|
--ddp_timeout 3600 --overwrite_output_dir |
|
``` |
|
|
|
|
|
## Code Contributors |
|
|
|
- [Zhengrui Ma](https://scholar.google.com/citations?user=dUgq6tEAAAAJ) |
|
- [Chenze Shao](https://scholar.google.com/citations?user=LH_rZf8AAAAJ) |
|
|
|
|
|
|
|
## Acknowledgement

This work is inspired by the following great works:
|
- Continuous Visual Autoregressive Generation via Score Maximization |
|
- Autoregressive Image Generation without Vector Quantization |
|
- A Spectral Energy Distance for Parallel Speech Synthesis |
|
|
|
## Citation |
|
``` |
|
@misc{ma2025efficientspeechlanguagemodeling, |
|
title={Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space}, |
|
author={Zhengrui Ma and Yang Feng and Chenze Shao and Fandong Meng and Jie Zhou and Min Zhang}, |
|
year={2025}, |
|
eprint={2505.13181}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2505.13181}, |
|
} |
|
``` |