Fine-Tuned SpeechT5 Model
This repository contains a version of SpeechT5 fine-tuned for text-to-speech (TTS) generation. It was trained on approximately 60 minutes of audio of a voice I found on YouTube (the recordings may themselves be AI-generated).
Model Overview
The goal of this model is to replicate the tone, rhythm, and delivery style of Andrew Tate's speeches using the SpeechT5 architecture.
It performs well on short speech synthesis tasks but still has a slightly metallic sound, likely because of the limited training data.
Training Configuration
| Parameter | Value |
|---|---|
| Batch Size | 8 |
| Learning Rate | 8e-5 |
| Optimizer | AdamW |
| Scheduler | Linear |
| Training Steps | 7000 |
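The "Linear" scheduler in the table most likely means the learning rate decays linearly to zero over the run, which in Hugging Face Transformers is what `get_linear_schedule_with_warmup` does. The card does not state a warmup length, so the minimal sketch below treats `warmup_steps` as a parameter defaulting to 0 (an assumption) and computes the learning rate at any step from the table's values. The `epochs_seen` helper is hypothetical as well: the card does not give the number of training clips, so it converts 7000 steps at batch size 8 into passes over a clip count you supply.

```python
# Sketch of the "Linear" schedule from the table: optional linear warmup, then
# linear decay to 0. Assumption: the card does not state a warmup length, so
# warmup_steps defaults to 0 here.

PEAK_LR = 8e-5       # Learning Rate (table)
TOTAL_STEPS = 7000   # Training Steps (table)
BATCH_SIZE = 8       # Batch Size (table)


def linear_lr(step: int, peak_lr: float = PEAK_LR,
              total_steps: int = TOTAL_STEPS, warmup_steps: int = 0) -> float:
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from peak_lr to 0 at total_steps, then stay at 0.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


def epochs_seen(num_clips: int, batch_size: int = BATCH_SIZE,
                total_steps: int = TOTAL_STEPS) -> float:
    """Hypothetical helper: passes over a dataset of `num_clips` training clips
    (the clip count is not given in the card), assuming no gradient accumulation."""
    return total_steps * batch_size / num_clips
```

With these defaults the rate is halved (4e-5) at step 3,500 and reaches 0 at step 7,000. If the data were split into, say, 1,000 clips, `epochs_seen(1000)` would give 56 passes.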
Dataset
- Duration: ~1 h 18 min of clean audio
- Sampling Rate: 16 kHz
- Format: WAV
- Text Source: Manual transcriptions
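SpeechT5 works with 16 kHz audio, so clips at another rate should be resampled before fine-tuning. The following is a minimal sketch, using only Python's standard `wave` module, that checks every WAV file in a folder against the spec above and totals its duration. It assumes uncompressed PCM WAV and mono audio (the card does not state the channel count); `clip_info` and `check_dataset` are illustrative names, not part of any training script from this repository.

```python
# Check WAV clips against the dataset spec above (16 kHz WAV) and total their
# duration. Assumptions: uncompressed PCM WAV; mono (not stated in the card).
import wave
from pathlib import Path

EXPECTED_RATE = 16_000  # Sampling Rate from the dataset list


def clip_info(path) -> tuple[int, int, float]:
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / rate
    return rate, channels, duration


def check_dataset(folder, expected_rate: int = EXPECTED_RATE):
    """Return (total_seconds, problems) for every *.wav file under `folder`."""
    total = 0.0
    problems = []
    for path in sorted(Path(folder).rglob("*.wav")):
        rate, channels, duration = clip_info(path)
        if rate != expected_rate:
            problems.append(f"{path.name}: {rate} Hz (expected {expected_rate})")
        if channels != 1:
            problems.append(f"{path.name}: {channels} channels (expected mono)")
        total += duration
    return total, problems
```

Dividing the returned total by 60 gives the dataset length in minutes, which is the figure the recommendations below are stated in.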
Results
- The model produces clear and expressive speech aligned with Andrew Tate's vocal tone.
- Some metallic artifacts are still audible, likely due to the dataset size and limited training steps.
- Further training and data augmentation could improve naturalness.
Recommendations for Improvement
- Increase total training audio to 2 to 3 hours for better voice consistency.
Model Architecture
- Base Model: microsoft/speecht5_tts
- Fine-Tuning Framework: Hugging Face Transformers
- Optimizer: AdamW
Example
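A minimal inference sketch follows. It assumes that the checkpoint `bakhil-aissa/speecht5_stoic_voice` loads with Transformers' `SpeechT5ForTextToSpeech`, that the base `microsoft/speecht5_tts` processor and the `microsoft/speecht5_hifigan` vocoder are compatible with it, and it takes the speaker embedding as an argument because the card does not say which one was used during fine-tuning (the Transformers SpeechT5 examples use a 512-dimensional x-vector from the `Matthijs/cmu-arctic-xvectors` dataset). Since the model is reported to work best on short inputs, the hypothetical `split_into_chunks` helper feeds long text to the model a few sentences at a time.

```python
# Minimal inference sketch. Assumptions (not confirmed by the card): the repo id
# below loads with SpeechT5ForTextToSpeech, the base model's processor and the
# microsoft/speecht5_hifigan vocoder are compatible, and the caller supplies a
# 512-dim speaker embedding because the one used in fine-tuning is undocumented.
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Hypothetical helper: group sentences into chunks of at most ~max_chars.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the current chunk
            current = sentence       # start a new one with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize(text: str, speaker_embedding, out_path: str = "speech.wav") -> None:
    """Synthesize `text` chunk by chunk and write a 16 kHz WAV file.

    speaker_embedding: torch tensor of shape (1, 512), e.g. an x-vector.
    """
    # Heavy dependencies are imported lazily so the helper above runs without them.
    import soundfile as sf
    import torch
    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("bakhil-aissa/speecht5_stoic_voice")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    pieces = []
    for chunk in split_into_chunks(text):
        inputs = processor(text=chunk, return_tensors="pt")
        with torch.no_grad():
            # With a vocoder, generate_speech returns a 1-D waveform tensor.
            waveform = model.generate_speech(inputs["input_ids"], speaker_embedding,
                                             vocoder=vocoder)
        pieces.append(waveform)
    if not pieces:
        raise ValueError("no text to synthesize")
    sf.write(out_path, torch.cat(pieces).numpy(), samplerate=16000)
```

For instance, `synthesize("The quick brown fox jumps over the lazy dog.", speaker_embedding)` would write `speech.wav`; the 16 kHz rate matches the output of the SpeechT5 HiFi-GAN vocoder. Chunking trades some prosodic continuity across sentence boundaries for more stable output on each short piece.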
Model tree for bakhil-aissa/speecht5_stoic_voice
- Base model: microsoft/speecht5_tts