F5-TTS Vietnamese Model
A Vietnamese text-to-speech model based on the F5-TTS architecture.
Model Details
- Base Model: F5-TTS
- Language: Vietnamese
- Training Steps: 71,000
- Sample Rate: 24 kHz
- Mel Channels: 100
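The sample rate and mel channel count determine the shape of the spectrogram the model works on. As a rough sketch, assuming the hop length of 256 samples that appears in the `mel_spec_kwargs` under Usage (the helper name `mel_frames` is illustrative and not part of F5-TTS), the frame count for a clip can be estimated like this:

```python
SAMPLE_RATE = 24_000    # Hz, from Model Details
HOP_LENGTH = 256        # samples per mel frame, assumed from mel_spec_kwargs in Usage
N_MEL_CHANNELS = 100    # from Model Details


def mel_frames(duration_seconds: float) -> int:
    """Approximate number of mel frames for a clip of the given length."""
    return int(duration_seconds * SAMPLE_RATE) // HOP_LENGTH


# 24000 / 256 = 93.75 mel frames per second of audio
frames_per_second = SAMPLE_RATE / HOP_LENGTH

# Approximate spectrogram shape (frames, channels) for a 10-second clip
shape_10s = (mel_frames(10.0), N_MEL_CHANNELS)
print(frames_per_second, shape_10s)
```

The exact frame count produced by a given mel implementation may differ by one frame depending on padding, so treat the result as an estimate.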
Usage
import soundfile as sf
import torch
from f5_tts.model import CFM, DiT
from f5_tts.model.utils import convert_char_to_pinyin, get_tokenizer
from huggingface_hub import hf_hub_download
from vocos import Vocos

# Select the device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the checkpoint and vocabulary (replace YOUR_USERNAME with the repository owner)
checkpoint_path = hf_hub_download(repo_id="YOUR_USERNAME/F5-TTS-Vietnamese", filename="model_71000.pt")
vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/F5-TTS-Vietnamese", filename="vocab.txt")
# Load vocab
vocab_char_map, vocab_size = get_tokenizer(vocab_path, tokenizer="custom")
# Initialize model
model = CFM(
    transformer=DiT(
        dim=1024,
        depth=22,
        heads=16,
        ff_mult=2,
        text_dim=512,
        conv_layers=4,
        text_num_embeds=vocab_size,
        mel_dim=100,
    ),
    mel_spec_kwargs=dict(
        n_fft=1024,
        hop_length=256,
        win_length=1024,
        n_mel_channels=100,
        target_sample_rate=24000,
        mel_spec_type="vocos",
    ),
    odeint_kwargs=dict(method="euler"),
    vocab_char_map=vocab_char_map,
).to(device)
# Load the EMA weights: strip the "ema_model." prefix and skip the EMA bookkeeping entries
checkpoint = torch.load(checkpoint_path, map_location=device)
state_dict = {
    k.replace("ema_model.", ""): v
    for k, v in checkpoint["ema_model_state_dict"].items()
    if k not in ["initted", "step"]
}
model.load_state_dict(state_dict)
model.eval()
# Load vocoder
vocoder = Vocos.from_pretrained("charactr/vocos-mel-24khz").to(device)
# Inference
ref_audio = "reference.wav"  # Path to your reference audio
ref_text = "Đây là văn bản tham chiếu"
gen_text = "Đây là văn bản cần tạo giọng nói"
# ... (see full example in repository)
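Before sampling, the model needs a target length in mel frames that covers both the reference clip and the text to generate. The sketch below shows one common heuristic: scale the reference length by the ratio of the two text lengths, measured in UTF-8 bytes. This is an assumption for illustration, not necessarily what the repository's full example does, and the names `estimate_total_frames` and `speed` are hypothetical.

```python
HOP_LENGTH = 256  # samples per mel frame, matching mel_spec_kwargs in the usage code


def estimate_total_frames(
    ref_num_samples: int,
    ref_text: str,
    gen_text: str,
    speed: float = 1.0,
) -> int:
    """Estimate total mel frames (reference + generated) for one utterance.

    Assumes the generated speech keeps roughly the same speaking rate as the
    reference, so its length scales with the ratio of the text lengths.
    """
    ref_frames = ref_num_samples // HOP_LENGTH
    ref_len = len(ref_text.encode("utf-8"))
    gen_len = len(gen_text.encode("utf-8"))
    if ref_len == 0:
        raise ValueError("ref_text must not be empty")
    if speed <= 0:
        raise ValueError("speed must be positive")
    return ref_frames + int(ref_frames / ref_len * gen_len / speed)


# Example with the texts from the usage code and a hypothetical 5-second reference at 24 kHz
total_frames = estimate_total_frames(
    ref_num_samples=5 * 24_000,
    ref_text="Đây là văn bản tham chiếu",
    gen_text="Đây là văn bản cần tạo giọng nói",
)
print(total_frames)
```

Byte length is only a proxy for spoken length, so very short or punctuation-heavy reference texts can skew the estimate.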
Training Details
- Dataset: Vietnamese speech dataset
- Optimizer: AdamW
- Scheduler: Linear warmup + decay
- Batch size: Dynamic (frame-based)
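To illustrate the last two items, here is a minimal sketch of frame-based batching and of a linear warmup followed by linear decay. It is a generic illustration and not the training code used for this model; the function names, the frame budget, and the step counts are all hypothetical.

```python
from typing import List


def frame_based_batches(frame_lengths: List[int], max_frames_per_batch: int) -> List[List[int]]:
    """Group sample indices so each batch's total frame count fits the budget."""
    # Sorting by length keeps similarly sized clips together and reduces padding.
    order = sorted(range(len(frame_lengths)), key=lambda i: frame_lengths[i])
    batches: List[List[int]] = []
    current: List[int] = []
    current_frames = 0
    for idx in order:
        n = frame_lengths[idx]
        if n > max_frames_per_batch:
            continue  # a clip longer than the budget cannot fit in any batch
        if current and current_frames + n > max_frames_per_batch:
            batches.append(current)
            current, current_frames = [], 0
        current.append(idx)
        current_frames += n
    if current:
        batches.append(current)
    return batches


def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int, peak_lr: float) -> float:
    """Learning rate that rises linearly to peak_lr, then falls linearly to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


# Hypothetical clip lengths (in mel frames) and a hypothetical budget of 600 frames
print(frame_based_batches([300, 100, 500, 200], max_frames_per_batch=600))
print([linear_warmup_decay(s, 10, 100, 1.0) for s in (0, 9, 55, 100)])
```

With a frame budget, batches of short clips contain more samples than batches of long clips, which keeps memory use per step roughly constant.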
Limitations
- Quality is best when the reference audio is 3 to 15 seconds long
- Supports Vietnamese only
- Requires good-quality reference audio
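A quick way to check the first limitation before running inference is to measure the reference clip's duration. The sketch below uses Python's standard-library `wave` module, which reads uncompressed PCM WAV files only; the function names and the threshold constants are illustrative and simply encode the 3 to 15 second range stated above.

```python
import wave

MIN_SECONDS = 3.0   # lower bound from the Limitations list
MAX_SECONDS = 15.0  # upper bound from the Limitations list


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_usable_reference(path: str) -> bool:
    """True if the clip length falls within the recommended range."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

For example, calling `is_usable_reference("reference.wav")` before inference lets you trim or replace a clip that falls outside the range. This checks length only and says nothing about noise or recording quality.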
Citation
@article{chen2024f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Chen, Yushen and others},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}
License
CC-BY-NC-4.0