🗣️ Fine-Tuned SpeechT5 Model

This repository contains a version of SpeechT5 fine-tuned for text-to-speech (TTS) generation on approximately 60 minutes of audio from a voice found on YouTube (the source recordings may themselves be AI-generated).

🧠 Model Overview

The goal of this model is to replicate the tone, rhythm, and delivery style of Andrew Tate’s speeches using the SpeechT5 architecture.
It performs well for short speech synthesis tasks but still exhibits a slightly metallic sound due to limited training data.

⚙️ Training Configuration

Parameter	Value
Batch Size	8
Learning Rate	8e-5
Optimizer	AdamW
Scheduler	Linear
Training Steps	7000
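
To make the table concrete, here is a minimal sketch of what a linear schedule implies for the learning rate over the 7000 steps. It assumes zero warmup steps and no gradient accumulation, neither of which is stated in this card, so treat the numbers as illustrative rather than as the exact training setup.

```python
# Values taken from the training configuration table.
BATCH_SIZE = 8
PEAK_LR = 8e-5
TOTAL_STEPS = 7000

# Assumption: the card does not mention warmup, so none is modeled here.
WARMUP_STEPS = 0


def linear_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS, warmup_steps=WARMUP_STEPS):
    """Learning rate at `step` for a linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Assumption: no gradient accumulation, so each step consumes one batch.
samples_seen = BATCH_SIZE * TOTAL_STEPS

if __name__ == "__main__":
    for step in (0, 1750, 3500, 5250, 7000):
        print(f"step {step:>5}: lr = {linear_lr(step):.2e}")
    print(f"training samples seen: {samples_seen}")
```

Under these assumptions the learning rate falls from 8e-5 to zero in a straight line, and the model sees 56,000 training samples in total, meaning a small dataset is revisited many times.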

🗂️ Dataset

Duration: ~1 h 18 min of clean audio
Sampling Rate: 16 kHz
Format: WAV
Text Source: Manual transcriptions
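
A quick way to check a dataset against these properties is to scan the WAV files and confirm the sampling rate and total duration. The sketch below uses only the Python standard library. The folder layout (a flat directory of `.wav` files) and the function name `summarize_wavs` are assumptions for illustration, not part of this repository.

```python
import wave
from pathlib import Path

EXPECTED_RATE = 16_000  # sampling rate listed in the dataset section


def summarize_wavs(folder):
    """Return (total duration in seconds, names of files not sampled at 16 kHz)."""
    total_seconds = 0.0
    wrong_rate = []
    for path in sorted(Path(folder).glob("*.wav")):
        # The standard-library wave module reads uncompressed PCM WAV files.
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            if rate != EXPECTED_RATE:
                wrong_rate.append(path.name)
            total_seconds += wav.getnframes() / rate
    return total_seconds, wrong_rate


if __name__ == "__main__":
    # "data/wavs" is a hypothetical path; point it at your own clips.
    seconds, bad = summarize_wavs("data/wavs")
    print(f"total audio: {seconds / 60:.1f} min")
    print(f"files needing resampling: {bad}")
```

Any file reported in the second list would need resampling to 16 kHz before training, since SpeechT5 expects 16 kHz audio.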

🎧 Results

The model produces clear and expressive speech aligned with Andrew Tate’s vocal tone.
Some metallic artifacts are still audible, likely due to the dataset size and limited training steps.
Further training and data augmentation could improve naturalness.

🚀 Recommendations for Improvement

Increase total training audio to 2–3 hours for better voice consistency.

🧩 Model Architecture

Base Model: microsoft/speecht5_tts
Fine-Tuning Framework: Hugging Face Transformers
Optimizer: AdamW

Example
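
The following is a minimal inference sketch using the Hugging Face Transformers SpeechT5 classes. Several things here are assumptions rather than documented facts about this model: that the repository ships processor files (if not, the processor from microsoft/speecht5_tts may work), that the standard microsoft/speecht5_hifigan vocoder is appropriate, and that you supply a 512-dimensional speaker embedding as a torch tensor of shape (1, 512). This card does not say which speaker embedding was used during fine-tuning.

```python
import struct
import wave

REPO_ID = "bakhil-aissa/speecht5_stoic_voice"  # repository name shown on the model page
VOCODER_ID = "microsoft/speecht5_hifigan"      # assumption: standard SpeechT5 vocoder
SAMPLE_RATE = 16_000                           # SpeechT5 generates 16 kHz audio


def to_pcm16(samples):
    """Convert float samples in [-1, 1] to little-endian 16-bit PCM bytes."""
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def synthesize(text, speaker_embedding, out_path="output.wav"):
    """Generate speech for `text` and write it to a mono 16-bit WAV file."""
    # Heavy imports stay local so the helper above works without them installed.
    import torch
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    processor = SpeechT5Processor.from_pretrained(REPO_ID)  # assumption: files exist
    model = SpeechT5ForTextToSpeech.from_pretrained(REPO_ID)
    vocoder = SpeechT5HifiGan.from_pretrained(VOCODER_ID)

    inputs = processor(text=text, return_tensors="pt")
    with torch.no_grad():
        speech = model.generate_speech(
            inputs["input_ids"], speaker_embedding, vocoder=vocoder
        )

    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(to_pcm16(speech.tolist()))
    return out_path
```

Calling `synthesize("Discipline is the key to freedom.", my_embedding)` would write `output.wav`, where `my_embedding` is a hypothetical speaker embedding you provide. Short inputs are the safer choice, since the model is reported to work best on short synthesis tasks.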

Model size: 0.1B params
Tensor type: F32
Model repository: bakhil-aissa/speecht5_stoic_voice
Base model: microsoft/speecht5_tts