
Whisper Small - Arabic Multi-Dialect

Fine-tuned Whisper Small model for Arabic speech recognition across multiple dialects.

Model Details

Model Description

This is a fine-tuned version of OpenAI's Whisper Small model, trained on Arabic multi-dialect speech data for automatic speech recognition tasks.

  • Developed by: MadLook
  • Model type: Whisper (Encoder-Decoder Transformer)
  • Language(s): Arabic
  • License: Apache 2.0
  • Finetuned from model: openai/whisper-small

Uses

Direct Use

This model can be used for automatic transcription of Arabic speech across multiple dialects. It processes audio files and outputs Arabic text transcriptions.

Downstream Use

This model can be integrated into:

  • Voice assistants for Arabic speakers
  • Subtitle generation systems (a timestamp sketch follows this list)
  • Voice-to-text applications
  • Arabic language learning tools
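
For subtitle-style output, the transformers ASR pipeline can return segment-level timestamps. A minimal sketch, assuming a local file named arabic_audio.mp3 (on some transformers versions the end timestamp of the final chunk may be None):

from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="MadLook/whisper-small-arabic-multidialect",
    return_timestamps=True,  # emit (start, end) pairs per decoded chunk
)

result = transcriber("arabic_audio.mp3")
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]  # end may be None for the last chunk
    print(f"{start} --> {end}: {chunk['text']}")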

Out-of-Scope Use

  • Not suitable for production-critical applications without further validation
  • Not designed for languages other than Arabic
  • Not recommended for medical or legal transcription requiring high accuracy

Bias, Risks, and Limitations

  • Moderate accuracy: 48.85% WER on the validation set
  • Performance varies across Arabic dialects
  • Best results on clear, high-quality audio
  • Trained on a limited dataset (a 40% subset; see Training Data)
  • May not generalize well to domain-specific vocabulary

Recommendations

Users should validate performance on their specific use case before deployment. Consider additional fine-tuning for specific dialects or domains.

How to Get Started with the Model

from transformers import pipeline

# The ASR pipeline handles audio decoding, feature extraction, and generation.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="MadLook/whisper-small-arabic-multidialect"
)

# Decoding mp3 input requires ffmpeg to be installed.
result = transcriber("arabic_audio.mp3")
print(result["text"])

Or with more control:

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("MadLook/whisper-small-arabic-multidialect")
processor = WhisperProcessor.from_pretrained("MadLook/whisper-small-arabic-multidialect")

# Load audio
audio, sr = librosa.load("arabic_audio.mp3", sr=16000)

# Process
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(input_features)

# Decode
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
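
Whisper operates on 30-second windows, so the manual approach above truncates longer audio. For long recordings, the pipeline's built-in chunking is the simplest route. A minimal sketch (long_arabic_recording.mp3 is a hypothetical file, and batch_size is illustrative, not a tuned value):

from transformers import pipeline

# Chunked long-form transcription: the pipeline splits the audio into
# overlapping 30 s windows and stitches the decoded text back together.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="MadLook/whisper-small-arabic-multidialect",
    chunk_length_s=30,
    batch_size=8,  # illustrative; tune to available GPU memory
)

print(transcriber("long_arabic_recording.mp3")["text"])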

Training Details

Training Data

Trained on a 40% subset of an Arabic multi-dialect speech dataset:

  • Training samples: 37,835
  • Validation samples: 2,628
  • Test samples: 2,628

Training Procedure

Training Hyperparameters

  • Training epochs: 7
  • Training batch size: 12
  • Evaluation batch size: 12
  • Learning rate: 1e-5
  • Warmup steps: 300
  • Optimizer: AdamW
  • LR scheduler: Cosine
  • Training regime: fp16 mixed precision
  • Gradient checkpointing: Enabled
  • Gradient accumulation steps: 1
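
These settings map naturally onto transformers' Seq2SeqTrainingArguments. A minimal sketch mirroring the list above (output_dir is an illustrative path, and predict_with_generate is an assumption needed so evaluation decodes text for WER):

from transformers import Seq2SeqTrainingArguments

# Sketch of training arguments mirroring the hyperparameters listed above.
# AdamW is the library's default optimizer, so it needs no explicit setting.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-arabic-multidialect",  # illustrative path
    num_train_epochs=7,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    learning_rate=1e-5,
    warmup_steps=300,
    lr_scheduler_type="cosine",
    fp16=True,
    gradient_checkpointing=True,
    gradient_accumulation_steps=1,
    predict_with_generate=True,  # assumption: decode text at eval time for WER
)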

Evaluation

Testing Data, Factors & Metrics

Testing Data

Validation set: 2,628 samples from the Arabic multi-dialect dataset

Metrics

  • WER (Word Error Rate): Primary metric
  • CER (Character Error Rate): Secondary metric
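
Both metrics can be computed with the Hugging Face evaluate library: WER counts word-level substitutions, deletions, and insertions relative to the reference length, and CER does the same at the character level. A minimal sketch with hypothetical reference/prediction pairs:

import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Hypothetical reference transcript ("Welcome to the meeting") and a
# model output that drops one word.
references = ["مرحبا بكم في الاجتماع"]
predictions = ["مرحبا بكم الاجتماع"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))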

Results

Validation Set Performance:

  • WER: 48.85%

Technical Specifications

Model Architecture and Objective

  • Architecture: Whisper Small (Encoder-Decoder Transformer)
  • Parameters: ~244M
  • Objective: Sequence-to-sequence speech recognition
  • Input: 80-channel log-mel spectrogram (see the feature sketch after this list)
  • Output: Arabic text tokens
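
The input representation can be inspected through the processor's feature extractor: Whisper pads or trims audio to a 30-second window, which at a 10 ms hop yields 3,000 frames. A minimal sketch using synthetic audio:

import numpy as np
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("MadLook/whisper-small-arabic-multidialect")

# One second of silence at 16 kHz; Whisper pads it to a 30 s window.
audio = np.zeros(16000, dtype=np.float32)
features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

print(features.shape)  # expected: torch.Size([1, 80, 3000])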

Compute Infrastructure

Hardware

  • GPU: CUDA-enabled GPU with 25 GB VRAM
  • Training time: ~6 hours

Software

  • Framework: Transformers + PyTorch
  • Precision: FP16 mixed precision training

Citation

BibTeX:

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

Model Card Authors

MadLook
