F5-TTS Vietnamese Model

A Vietnamese text-to-speech (TTS) model based on the F5-TTS architecture.

Model Details

  • Base Model: F5-TTS
  • Language: Vietnamese
  • Training Steps: 71,000
  • Sample Rate: 24 kHz
  • Mel Channels: 100

Usage

import torch
import soundfile as sf
from huggingface_hub import hf_hub_download
from f5_tts.model import CFM, DiT
from f5_tts.model.utils import get_tokenizer, convert_char_to_pinyin
from vocos import Vocos

# Select device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Download checkpoint and vocabulary from the Hugging Face Hub
checkpoint_path = hf_hub_download(repo_id="YOUR_USERNAME/F5-TTS-Vietnamese", filename="model_71000.pt")
vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/F5-TTS-Vietnamese", filename="vocab.txt")

# Load vocab
vocab_char_map, vocab_size = get_tokenizer(vocab_path, tokenizer="custom")

# Initialize model
model = CFM(
    transformer=DiT(
        dim=1024,
        depth=22,
        heads=16,
        ff_mult=2,
        text_dim=512,
        conv_layers=4,
        text_num_embeds=vocab_size,
        mel_dim=100
    ),
    mel_spec_kwargs=dict(
        n_fft=1024,
        hop_length=256,
        win_length=1024,
        n_mel_channels=100,
        target_sample_rate=24000,
        mel_spec_type="vocos",
    ),
    odeint_kwargs=dict(method="euler"),
    vocab_char_map=vocab_char_map,
).to(device)

# Load checkpoint
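# weights_only=True restricts unpickling to tensors and basic Python types
checkpoint = torch.load(checkpoint_path, map_location=device, weights_only=True)
# Keep only the EMA weights and strip the "ema_model." prefix from their keys
state_dict = {
    k.replace("ema_model.", ""): v
    for k, v in checkpoint["ema_model_state_dict"].items()
    if k not in ["initted", "step"]
}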
model.load_state_dict(state_dict)
model.eval()

# Load vocoder
vocoder = Vocos.from_pretrained("charactr/vocos-mel-24khz").to(device)

# Inference
ref_audio = "reference.wav"  # Your reference audio
ref_text = "Đây là văn bản tham chiếu"
gen_text = "Đây là văn bản cần tạo giọng nói"

# ... (see full example in repository)

Training Details

  • Dataset: Vietnamese speech dataset
  • Optimizer: AdamW
  • Scheduler: Linear warmup + decay
  • Batch size: Dynamic (frame-based)
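
To illustrate what frame-based dynamic batching means, here is a minimal sketch. It is not the training code used for this model: the function name `batch_by_frames`, the frame budget of 2000, and the utterance lengths are all hypothetical. The idea is that each batch is capped by its total number of mel frames rather than by a fixed number of utterances.

```python
from typing import List, Sequence


def batch_by_frames(frame_lengths: Sequence[int], max_frames: int) -> List[List[int]]:
    """Group utterance indices so each batch holds at most `max_frames` mel frames."""
    # Sort indices by length so utterances of similar duration share a batch,
    # which keeps padding low.
    order = sorted(range(len(frame_lengths)), key=lambda i: frame_lengths[i])

    batches: List[List[int]] = []
    current: List[int] = []
    current_frames = 0
    for idx in order:
        n = frame_lengths[idx]
        if n > max_frames:
            # A single utterance longer than the budget cannot fit in any batch; skip it.
            continue
        if current and current_frames + n > max_frames:
            # Adding this utterance would exceed the budget, so close the batch.
            batches.append(current)
            current, current_frames = [], 0
        current.append(idx)
        current_frames += n
    if current:
        batches.append(current)
    return batches


# Hypothetical utterance lengths, in mel frames.
lengths = [300, 1200, 450, 800, 5000, 250]
batches = batch_by_frames(lengths, max_frames=2000)
print(batches)
```

With this scheme, a batch of short utterances contains many items and a batch of long utterances contains few, so memory use per step stays roughly constant.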

Limitations

  • Works best with a reference audio clip of 3-15 seconds
  • Supports Vietnamese only
  • Requires good-quality reference audio
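
A quick way to check a reference clip against the 3-15 second range is to work from its sample count. The sketch below assumes the 24 kHz sample rate and the hop length of 256 from the usage code. The helper names `mel_frames` and `check_reference` are hypothetical, and the frame count is approximate because the exact value depends on how the mel spectrogram pads the signal.

```python
SAMPLE_RATE = 24_000  # sample rate from the model details
HOP_LENGTH = 256      # hop_length from mel_spec_kwargs in the usage code
MIN_SECONDS, MAX_SECONDS = 3.0, 15.0  # recommended reference length


def mel_frames(num_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Approximate number of mel frames for a clip of `num_samples` samples."""
    return num_samples // hop_length


def check_reference(num_samples: int, sample_rate: int = SAMPLE_RATE) -> str:
    """Classify a clip as "too short", "too long", or "ok" by its duration."""
    seconds = num_samples / sample_rate
    if seconds < MIN_SECONDS:
        return "too short"
    if seconds > MAX_SECONDS:
        return "too long"
    return "ok"


print(mel_frames(3 * SAMPLE_RATE))        # 72000 // 256 = 281
print(mel_frames(15 * SAMPLE_RATE))       # 360000 // 256 = 1406
print(check_reference(10 * SAMPLE_RATE))  # ok
```

For a real file, the sample count and sample rate can be read with `soundfile.info("reference.wav")`, which exposes `frames` and `samplerate`. A clip recorded at a different rate should be resampled to 24 kHz before use.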

Citation

@article{chen2024f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Chen, Yushen and others},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}

License

CC-BY-NC-4.0
