Ijazah_Palsu_V1
Ijazah_Palsu_V1 is a fine-tuned version of the F5-TTS text-to-speech model, trained specifically on Indonesian voice data. The goal of this project is to explore and improve the expressiveness and pronunciation accuracy of Indonesian TTS models, particularly in real-world, varied-speaker conditions.
⚠️ Status: This model is in a Beta/Experimental phase and should be used for research or evaluation purposes only.
Model Details
- Base Model: SWivid/F5-TTS - F5TTS_v1_Base
- Fine-tuned By: PapaRazi
- Language: 🇮🇩 Indonesian
- Vocabulary Size: 2,564 tokens
- Model Size: 5.02 GB
- Training Hardware: Single GPU, NVIDIA RTX 3060 (12GB VRAM)
Training Details
- Training Audio: ~50 hours (50:27:14)
- Total Samples: 66,233 (≈ 8.15 GB)
- Vocabulary Size: 2,564
- Hardware: NVIDIA RTX 3060 12GB
- Precision: FP16 Mixed Precision
Training Config (excerpt from config.json):
{
  "learning_rate": 1e-05,
  "batch_size_per_gpu": 1600,
  "batch_size_type": "frame",
  "epochs": 28,
  "save_per_updates": 20000,
  "keep_last_n_checkpoints": 6,
  "last_per_updates": 10000,
  "tokenizer_type": "pinyin",
  "mixed_precision": "fp16"
}
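As a rough sanity check on the step count, the config above can be related to the dataset size. The sketch below assumes F5-TTS's default mel settings (24 kHz audio, hop length 256), no gradient accumulation, and perfectly packed batches. None of these are stated in this model card, so treat the result as an order-of-magnitude estimate only.

```python
# Rough estimate only. SAMPLE_RATE and HOP_LENGTH are ASSUMED defaults,
# not values taken from this model's config.json excerpt.
SAMPLE_RATE = 24_000       # assumed audio sample rate in Hz
HOP_LENGTH = 256           # assumed mel hop length in samples
FRAMES_PER_BATCH = 1600    # "batch_size_per_gpu" with "batch_size_type": "frame"
EPOCHS = 28                # "epochs"


def hms_to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' duration string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


audio_seconds = hms_to_seconds("50:27:14")                       # dataset duration
seconds_per_batch = FRAMES_PER_BATCH * HOP_LENGTH / SAMPLE_RATE  # audio per update
updates_per_epoch = audio_seconds / seconds_per_batch            # ideal packing
total_updates = updates_per_epoch * EPOCHS

print(f"audio per batch  : {seconds_per_batch:.2f} s")
print(f"updates per epoch: {updates_per_epoch:,.0f}")
print(f"updates in total : {total_updates:,.0f}")
```

Under those assumptions the estimate comes out a little under 300,000 updates for 28 epochs. Real batches are never perfectly packed, so the actual count would be somewhat higher.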
The training dataset covers roughly 10 or more unique speakers and both formal and conversational speech. It also includes manually added synthetic samples (e.g., number reading generated with gTTS).
Training Curves
Loss Over Time
Learning Rate Schedule
Training was stopped manually after approximately 300,000 steps, by which point the learning rate had decayed to near zero. At this stage, the loss curve was unstable, fluctuating without consistent downward progress. Based on qualitative evaluation of generated samples, the model was deemed sufficiently trained for a Beta release.
Performance is already usable for Indonesian TTS tasks, but further fine-tuning is still planned for:
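A near-zero learning rate at the end of a run is what a decay-to-zero schedule produces by design. The following is a minimal sketch, assuming a linear warmup followed by a linear decay over the planned number of updates. The `warmup_steps` and `total_steps` defaults are hypothetical values chosen for illustration, not taken from this run's config.

```python
def lr_at(
    step: int,
    peak_lr: float = 1e-5,        # "learning_rate" from the config excerpt
    warmup_steps: int = 1_000,    # HYPOTHETICAL warmup length
    total_steps: int = 300_000,   # HYPOTHETICAL planned number of updates
) -> float:
    """Learning rate at a given update for linear warmup + linear decay."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak rate.
        return peak_lr * step / warmup_steps
    # Decay linearly from the peak rate down to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 1_000, 150_000, 299_000, 300_000):
    print(f"step {step:>7}: lr = {lr_at(step):.2e}")
```

Once the rate is this small, further updates barely change the weights, so fluctuations in the loss at that stage likely reflect batch-to-batch variation more than learning.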
- Improving number and currency pronunciation
- Enhancing long-form sentence fluency
- Reducing jitter in expressive speech samples
⚠️ Known Limitations
This model currently struggles with pronouncing numbers and numerical formats accurately (e.g., years, large numbers, currency values).
This is a common challenge in early-stage fine-tuning and can be attributed to:
- Limited exposure to numerical utterances in the training dataset.
- Variability in how numbers can be pronounced in Indonesian.
A dedicated sub-dataset for numerals and structured numeric expressions is being prepared and will be used in future fine-tuning phases.
Planned Improvement: In future versions, the model will be fine-tuned further using curated audio-text pairs focused specifically on number reading, dates, and currency values to enhance numerical pronunciation accuracy.
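Until such fine-tuning is available, one possible workaround is to spell out digits as words before passing text to the model. The sketch below is a hypothetical preprocessing helper, not part of this model or of F5-TTS. `terbilang` and `normalize_numbers` are illustrative names, and the helper only handles plain non-negative digit runs (no decimals, thousands separators, dates, or currency symbols).

```python
import re

# Indonesian words for 0..11; larger values are composed from these.
UNITS = [
    "nol", "satu", "dua", "tiga", "empat", "lima",
    "enam", "tujuh", "delapan", "sembilan", "sepuluh", "sebelas",
]

# Scale words, largest first.
SCALES = ((10**12, "triliun"), (10**9, "miliar"), (10**6, "juta"), (10**3, "ribu"))


def _join(head: str, rest: int) -> str:
    """Append the spelled-out remainder, if there is one."""
    return head + (" " + terbilang(rest) if rest else "")


def terbilang(n: int) -> str:
    """Spell out a non-negative integer in Indonesian words."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n < 12:
        return UNITS[n]
    if n < 20:
        return UNITS[n - 10] + " belas"
    if n < 100:
        tens, rest = divmod(n, 10)
        return _join(UNITS[tens] + " puluh", rest)
    if n < 200:
        return _join("seratus", n - 100)   # 100 is "seratus", not "satu ratus"
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return _join(UNITS[hundreds] + " ratus", rest)
    if n < 2000:
        return _join("seribu", n - 1000)   # 1000 is "seribu", not "satu ribu"
    for value, name in SCALES:
        if n >= value:
            head, rest = divmod(n, value)
            return _join(terbilang(head) + " " + name, rest)
    raise AssertionError("unreachable")


def normalize_numbers(text: str) -> str:
    """Replace every run of digits in the text with Indonesian words."""
    return re.sub(r"\d+", lambda m: terbilang(int(m.group())), text)


print(normalize_numbers("Harganya 15000 rupiah pada tahun 2025."))
```

This reads every number as a cardinal, which is only one of the ways Indonesian speakers pronounce years and similar values, so it does not address the pronunciation variability noted above.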
Sample Audio
Example inference output using the Ijazah_Palsu_V1 model:
Text Input:
Caranya adalah cari kata kunci yang paling populer di situ, yaitu...
Text Input:
Kan Judi juga namanya Dewa, kemudian yang di Kamboja itu juga namanya Dewa.
Text Input:
kamarnya, lemarinya, rumahnya, dan lain sebagainya, dan mereka membuang barang-barang yang sudah saatnya dibuang, disitulah pentingnya di-clutter.
Text Input:
Ini adalah model TTS pertama saya. Kalau ada kekurangan, mohon dimaafkan.
Usage
Manual Download & Usage
To use this model manually, download the .pt checkpoint and the corresponding vocab.txt file, and place them inside your F5-TTS checkpoint folder.
Inference via F5-TTS CLI:
f5-tts_infer-cli \
--model "PapaRazi/Ijazah_Palsu_V1" \
--ref_audio "ref.wav" \
--ref_text "reference text" \
--gen_text "generated Indonesian text"