
Whisper Small - Arabic Multi-Dialect

Fine-tuned Whisper Small model for Arabic speech recognition across multiple dialects.

Model Details

Model Description

This is a fine-tuned version of OpenAI's Whisper Small model, trained on Arabic multi-dialect speech data for automatic speech recognition tasks.

  • Developed by: MadLook
  • Model type: Whisper (Encoder-Decoder Transformer)
  • Language(s): Arabic
  • License: Apache 2.0
  • Finetuned from model: openai/whisper-small

Uses

Direct Use

This model can be used for automatic transcription of Arabic speech across multiple dialects. It processes audio files and outputs Arabic text transcriptions.

Downstream Use

This model can be integrated into:

  • Voice assistants for Arabic speakers
  • Subtitle generation systems (a timestamp sketch follows this list)
  • Voice-to-text applications
  • Arabic language learning tools
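
For subtitle-style output, the transformers ASR pipeline can return segment-level timestamps. A minimal sketch, assuming a local file named arabic_audio.mp3 (on some transformers versions the end timestamp of the final chunk may be None):

from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="MadLook/whisper-small-arabic-multidialect",
    return_timestamps=True,  # emit (start, end) pairs per decoded chunk
)

result = transcriber("arabic_audio.mp3")
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]  # end may be None for the last chunk
    print(f"{start} --> {end}: {chunk['text']}")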

Out-of-Scope Use

  • Not suitable for production-critical applications without further validation
  • Not designed for languages other than Arabic
  • Not recommended for medical or legal transcription requiring high accuracy

Bias, Risks, and Limitations

  • Moderate accuracy: 48.85% WER on the validation set
  • Performance varies across Arabic dialects
  • Best results on clear, high-quality audio
  • Trained on a limited dataset (a 40% subset; see Training Data)
  • May not generalize well to domain-specific vocabulary

Recommendations

Users should validate performance on their specific use case before deployment. Consider additional fine-tuning for specific dialects or domains.

How to Get Started with the Model

from transformers import pipeline

# The ASR pipeline handles audio decoding, feature extraction, and generation.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="MadLook/whisper-small-arabic-multidialect"
)

# Decoding mp3 input requires ffmpeg to be installed.
result = transcriber("arabic_audio.mp3")
print(result["text"])

Or with more control:

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("MadLook/whisper-small-arabic-multidialect")
processor = WhisperProcessor.from_pretrained("MadLook/whisper-small-arabic-multidialect")

# Load audio
audio, sr = librosa.load("arabic_audio.mp3", sr=16000)

# Process
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(input_features)

# Decode
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
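
Whisper operates on 30-second windows, so the manual approach above truncates longer audio. For long recordings, the pipeline's built-in chunking is the simplest route. A minimal sketch (long_arabic_recording.mp3 is a hypothetical file, and batch_size is illustrative, not a tuned value):

from transformers import pipeline

# Chunked long-form transcription: the pipeline splits the audio into
# overlapping 30 s windows and stitches the decoded text back together.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="MadLook/whisper-small-arabic-multidialect",
    chunk_length_s=30,
    batch_size=8,  # illustrative; tune to available GPU memory
)

print(transcriber("long_arabic_recording.mp3")["text"])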

Training Details

Training Data

Trained on a 40% subset of an Arabic multi-dialect speech dataset:

  • Training samples: 37,835
  • Validation samples: 2,628
  • Test samples: 2,628

Training Procedure

Training Hyperparameters

  • Training epochs: 7
  • Training batch size: 12
  • Evaluation batch size: 12
  • Learning rate: 1e-5
  • Warmup steps: 300
  • Optimizer: AdamW
  • LR scheduler: Cosine
  • Training regime: fp16 mixed precision
  • Gradient checkpointing: Enabled
  • Gradient accumulation steps: 1
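
These settings map naturally onto transformers' Seq2SeqTrainingArguments. A minimal sketch mirroring the list above (output_dir is an illustrative path, and predict_with_generate is an assumption needed so evaluation decodes text for WER):

from transformers import Seq2SeqTrainingArguments

# Sketch of training arguments mirroring the hyperparameters listed above.
# AdamW is the library's default optimizer, so it needs no explicit setting.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-arabic-multidialect",  # illustrative path
    num_train_epochs=7,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    learning_rate=1e-5,
    warmup_steps=300,
    lr_scheduler_type="cosine",
    fp16=True,
    gradient_checkpointing=True,
    gradient_accumulation_steps=1,
    predict_with_generate=True,  # assumption: decode text at eval time for WER
)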

Evaluation

Testing Data, Factors & Metrics

Testing Data

Validation set: 2,628 samples from the Arabic multi-dialect dataset

Metrics

  • WER (Word Error Rate): Primary metric
  • CER (Character Error Rate): Secondary metric
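
Both metrics can be computed with the Hugging Face evaluate library: WER counts word-level substitutions, deletions, and insertions relative to the reference length, and CER does the same at the character level. A minimal sketch with hypothetical reference/prediction pairs:

import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Hypothetical reference transcript ("Welcome to the meeting") and a
# model output that drops one word.
references = ["مرحبا بكم في الاجتماع"]
predictions = ["مرحبا بكم الاجتماع"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))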

Results

Validation Set Performance:

  • WER: 48.85%

Technical Specifications

Model Architecture and Objective

  • Architecture: Whisper Small (Encoder-Decoder Transformer)
  • Parameters: ~244M
  • Objective: Sequence-to-sequence speech recognition
  • Input: 80-channel log-mel spectrogram (see the feature sketch after this list)
  • Output: Arabic text tokens
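
The input representation can be inspected through the processor's feature extractor: Whisper pads or trims audio to a 30-second window, which at a 10 ms hop yields 3,000 frames. A minimal sketch using synthetic audio:

import numpy as np
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("MadLook/whisper-small-arabic-multidialect")

# One second of silence at 16 kHz; Whisper pads it to a 30 s window.
audio = np.zeros(16000, dtype=np.float32)
features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

print(features.shape)  # expected: torch.Size([1, 80, 3000])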

Compute Infrastructure

Hardware

  • GPU: CUDA-enabled GPU with 25 GB VRAM
  • Training time: ~6 hours

Software

  • Framework: Transformers + PyTorch
  • Precision: FP16 mixed precision training

Citation

BibTeX:

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

Model Card Authors

MadLook
