Ijazah_Palsu_V1

Ijazah_Palsu_V1 is a fine-tuned version of the F5-TTS text-to-speech model, trained specifically on Indonesian voice data. The goal of this project is to explore and improve the expressiveness and pronunciation accuracy of Indonesian TTS models, particularly in real-world, varied-speaker conditions.

⚠️ Status: This model is in a Beta/Experimental phase and should be used for research or evaluation purposes only.


🔍 Model Details

  • Base Model: SWivid/F5-TTS - F5TTS_v1_Base
  • Fine-tuned By: PapaRazi
  • Language: 🇮🇩 Indonesian
  • Vocabulary Size: 2,564 tokens
  • Model Size: 5.02 GB
  • Training Hardware: Single GPU, NVIDIA RTX 3060 (12GB VRAM)

Training Details

  • Training Data Duration: ~50 hours of audio (50:27:14)
  • Total Samples: 66,233 (≈ 8.15 GB)
  • Vocabulary Size: 2,564
  • Hardware: NVIDIA RTX 3060 12GB
  • Precision: FP16 Mixed Precision
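
As a quick sanity check on the dataset figures above, the following sketch derives the average clip length and the average size per sample. It uses only the numbers listed in this card; treating 1 GB as 10^9 bytes is an assumption made here for illustration.

```python
# Dataset figures copied from the list above.
TOTAL_DURATION = "50:27:14"   # hours:minutes:seconds of audio
TOTAL_SAMPLES = 66_233
TOTAL_SIZE_GB = 8.15          # assumption: 1 GB = 10**9 bytes


def duration_to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


total_seconds = duration_to_seconds(TOTAL_DURATION)
avg_clip_seconds = total_seconds / TOTAL_SAMPLES
avg_clip_kb = TOTAL_SIZE_GB * 1e9 / TOTAL_SAMPLES / 1e3

print(f"total audio: {total_seconds} s")
print(f"average clip length: {avg_clip_seconds:.2f} s")
print(f"average clip size: {avg_clip_kb:.0f} kB")
```

With these figures the average clip comes out at roughly 2.7 seconds, which suggests the dataset is made up mostly of short utterances.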

Training Config (excerpt from config.json):

{
  "learning_rate": 1e-05,
  "batch_size_per_gpu": 1600,
  "batch_size_type": "frame",
  "epochs": 28,
  "save_per_updates": 20000,
  "keep_last_n_checkpoints": 6,
  "last_per_updates": 10000,
  "tokenizer_type": "pinyin",
  "mixed_precision": "fp16"
}
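
With "batch_size_type" set to "frame", "batch_size_per_gpu" counts mel-spectrogram frames rather than utterances. The sketch below converts that frame budget into seconds of audio. The 24 kHz sample rate and hop length of 256 are assumptions based on the F5-TTS base configuration as commonly published; they are not stated in this excerpt, so check them against your own config before relying on the result.

```python
# Assumed mel-spectrogram settings (NOT part of the config excerpt above).
SAMPLE_RATE_HZ = 24_000
HOP_LENGTH = 256

# Frame budget per GPU, taken from the config excerpt.
BATCH_SIZE_PER_GPU = 1600


def frames_to_seconds(frames: int,
                      sample_rate: int = SAMPLE_RATE_HZ,
                      hop_length: int = HOP_LENGTH) -> float:
    """Duration of audio covered by a given number of mel frames."""
    return frames * hop_length / sample_rate


batch_seconds = frames_to_seconds(BATCH_SIZE_PER_GPU)
print(f"{BATCH_SIZE_PER_GPU} frames ~ {batch_seconds:.1f} s of audio per batch")
```

Under these assumptions each batch holds about 17 seconds of audio, a small budget that fits a 12GB GPU.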

The training dataset contains voices from roughly 10 or more unique speakers and covers both formal and conversational speech. It also includes manually added synthetic samples (e.g., number reading generated with gTTS).


📉 Training Curves

🔺 Loss Over Time

[Figure: Loss Graph]

🔻 Learning Rate Schedule

[Figure: Learning Rate Graph]

Training was manually stopped after approximately 300,000 steps, by which point the learning rate had decayed to near zero. At this stage, the loss curve was unstable and fluctuated without consistent downward progress. Based on qualitative evaluation of generated samples, the model was deemed sufficiently trained for a Beta release.
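
As a rough cross-check of the step count, one can estimate how many updates the configured 28 epochs would take given the dataset duration and the frame-based batch size. This is a back-of-the-envelope sketch, not a reconstruction of the actual training run: the 24 kHz sample rate, hop length of 256, perfectly packed batches, and absence of gradient accumulation are all assumptions that this card does not confirm.

```python
# Figures from this card.
TOTAL_AUDIO_SECONDS = 50 * 3600 + 27 * 60 + 14  # 50:27:14 of audio
FRAMES_PER_BATCH = 1600                         # batch_size_per_gpu, single GPU
EPOCHS = 28

# Assumptions (not stated in this card): 24 kHz audio, hop length 256,
# perfectly packed batches, and no gradient accumulation.
FRAMES_PER_SECOND = 24_000 / 256

total_frames = TOTAL_AUDIO_SECONDS * FRAMES_PER_SECOND
updates_per_epoch = total_frames / FRAMES_PER_BATCH
estimated_total_updates = updates_per_epoch * EPOCHS

print(f"updates per epoch: {updates_per_epoch:,.0f}")
print(f"updates for {EPOCHS} epochs: {estimated_total_updates:,.0f}")
```

Under these assumptions the estimate lands just under 300,000 updates, in the same range as the reported stopping point. Real batches are rarely perfectly packed, so the true number of updates per epoch would likely be somewhat higher.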

Although performance is already usable for Indonesian TTS tasks, further fine-tuning is planned for:

  • Improving number and currency pronunciation
  • Enhancing long-form sentence fluency
  • Reducing jitter in expressive speech samples

⚠️ Known Limitations

This model currently struggles with pronouncing numbers and numerical formats accurately (e.g., years, large numbers, currency values).

This is a common challenge in early-stage fine-tuning and can be attributed to:

  • Limited exposure to numerical utterances in the training dataset.
  • Variability in how numbers can be pronounced in Indonesian.

Planned Improvement: A dedicated sub-dataset of curated audio-text pairs covering number reading, dates, and currency values is being prepared. It will be used in future fine-tuning phases to improve numerical pronunciation accuracy.
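
Until such data is available, one possible workaround is to spell out numbers in the input text before synthesis, so the model only sees ordinary words. The sketch below is not part of this model or of F5-TTS; the function names are hypothetical, and it only handles non-negative integers (optionally written with a dot as thousands separator). Decimals, ordinals, dates, and currency symbols are left untouched or handled naively.

```python
import re

# Words for 0..11; larger numbers are composed from these.
BASIC = ["nol", "satu", "dua", "tiga", "empat", "lima", "enam",
         "tujuh", "delapan", "sembilan", "sepuluh", "sebelas"]

# Scale words, largest first.
SCALES = [(10**12, "triliun"), (10**9, "miliar"), (10**6, "juta"), (10**3, "ribu")]


def _join(head: str, rest: int) -> str:
    """Append the spelled-out remainder, if there is one."""
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


def number_to_words(n: int) -> str:
    """Spell out a non-negative integer in Indonesian."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n < 12:
        return BASIC[n]
    if n < 20:
        return f"{BASIC[n - 10]} belas"
    if n < 100:
        return _join(f"{BASIC[n // 10]} puluh", n % 10)
    if n < 200:
        return _join("seratus", n - 100)    # 100 is "seratus", not "satu ratus"
    if n < 1000:
        return _join(f"{BASIC[n // 100]} ratus", n % 100)
    if n < 2000:
        return _join("seribu", n - 1000)    # 1000 is "seribu", not "satu ribu"
    for value, name in SCALES:
        if n >= value:
            return _join(f"{number_to_words(n // value)} {name}", n % value)
    raise AssertionError("unreachable")


# Matches "1.500.000" (dot as thousands separator) or a plain run of digits.
NUMBER_RE = re.compile(r"\d{1,3}(?:\.\d{3})+|\d+")


def normalize_numbers(text: str) -> str:
    """Replace integer literals in `text` with Indonesian words."""
    return NUMBER_RE.sub(
        lambda m: number_to_words(int(m.group().replace(".", ""))), text
    )


print(normalize_numbers("Tahun 2025"))
print(normalize_numbers("Rp 1.500.000"))
```

The normalized string can then be passed as the generation text in place of the original input.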

🔊 Sample Audio

Example inference outputs using the Ijazah_Palsu_V1 model:

Text Input:
Caranya adalah cari kata kunci yang paling populer di situ, yaitu...

Text Input:
Kan Judi juga namanya Dewa, kemudian yang di Kamboja itu juga namanya Dewa.

Text Input:
kamarnya, lemarinya, rumahnya, dan lain sebagainya, dan mereka membuang barang-barang yang sudah saatnya dibuang, disitulah pentingnya di-clutter.

Text Input:
Ini adalah model TTS pertama saya. Kalau ada kekurangan, mohon dimaafkan.

📦 Usage

Manual Download & Usage

To use this model manually, download the .pt checkpoint and the corresponding vocab.txt file and place them inside your F5-TTS checkpoint folder.

Inference via F5-TTS CLI:

f5-tts_infer-cli \
  --model "PapaRazi/Ijazah_Palsu_V1" \
  --ref_audio "ref.wav" \
  --ref_text "reference text" \
  --gen_text "generated Indonesian text"
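
When the checkpoint and vocabulary have been downloaded manually, the CLI can also be pointed at the local files. The sketch below builds such a command from Python. It assumes the `--ckpt_file` and `--vocab_file` options of `f5-tts_infer-cli` (confirm with `f5-tts_infer-cli --help` for your installed version). All file paths are hypothetical placeholders, and the helper name `build_infer_command` is introduced here for illustration only.

```python
import shlex


def build_infer_command(ckpt_file: str, vocab_file: str, ref_audio: str,
                        ref_text: str, gen_text: str,
                        model: str = "F5TTS_v1_Base") -> list[str]:
    """Assemble an f5-tts_infer-cli argument list for a local checkpoint."""
    return [
        "f5-tts_infer-cli",
        "--model", model,            # base architecture named in Model Details
        "--ckpt_file", ckpt_file,    # assumed flag: path to the .pt checkpoint
        "--vocab_file", vocab_file,  # assumed flag: path to vocab.txt
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]


command = build_infer_command(
    ckpt_file="ckpts/Ijazah_Palsu_V1/model.pt",   # hypothetical path
    vocab_file="ckpts/Ijazah_Palsu_V1/vocab.txt", # hypothetical path
    ref_audio="ref.wav",
    ref_text="reference text",
    gen_text="generated Indonesian text",
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(command))
# To run it: import subprocess; subprocess.run(command, check=True)
```

Building the command as a list avoids shell-quoting problems when the reference or generation text contains quotes or commas.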