# Bagus Whisper Small Indonesian (ONNX)
This is an ONNX-converted version of Bagus/whisper-small-id-cv17, optimized for browser-based inference with Transformers.js.
## Model Description

Bagus/whisper-small-id-cv17 is a fine-tuned Whisper Small model for Indonesian automatic speech recognition (ASR). It achieves a 5.9% Word Error Rate (WER) on the Common Voice 17.0 Indonesian test set, a strong result among openly available Indonesian ASR models.
This ONNX version provides:
- ✅ Browser-compatible format for client-side inference
- ✅ Multiple quantization levels (q4f16, fp16, int8, q8)
- ✅ Optimized for Transformers.js
- ✅ No server required - runs entirely in the browser
## Model Details
- Base Model: OpenAI Whisper Small (242M parameters)
- Language: Indonesian (id)
- Training Data: Mozilla Common Voice 17.0 (Indonesian)
- Performance: 5.9% WER
- License: Apache 2.0
- Original Model: Bagus/whisper-small-id-cv17
## Quantization Options
This repository includes multiple quantization levels optimized for different use cases:
| Quantization | Encoder | Decoder | Total Size | Use Case |
|---|---|---|---|---|
| q4f16 | 51 MB | 138 MB | ~191 MB | Recommended for browsers - Best balance |
| fp16 | 168 MB | 293 MB | ~463 MB | Higher quality, larger size |
| int8/q8 | 87 MB | 299 MB | ~386 MB | Good quality, moderate size |
| fp32 | 337 MB | 587 MB | ~924 MB | Full precision (reference) |
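As a rough illustration of choosing among these at load time (assuming the `dtype` option of recent Transformers.js releases; the `pickDtype` helper is hypothetical, not part of the library):

```javascript
import { pipeline } from '@xenova/transformers';

// Hypothetical helper: map a rough download budget (MB) to a dtype
// from the table above. The dtype strings mirror the quantization
// suffixes in this repository; adjust if your library version differs.
function pickDtype(budgetMB) {
  if (budgetMB >= 900) return 'fp32';  // full precision (~924 MB)
  if (budgetMB >= 450) return 'fp16';  // higher quality (~463 MB)
  if (budgetMB >= 380) return 'q8';    // moderate size (~386 MB)
  return 'q4f16';                      // recommended default (~191 MB)
}

const transcriber = await pipeline(
  'automatic-speech-recognition',
  'cmaree/Bagus-whisper-small-id-onnx',
  { dtype: pickDtype(200) } // -> 'q4f16'
);
```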
## Usage

### With Transformers.js (Browser)
```javascript
import { pipeline } from '@xenova/transformers';

// Create ASR pipeline
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'cmaree/Bagus-whisper-small-id-onnx',
  {
    dtype: 'q4f16', // Recommended for browser
    device: 'wasm', // or 'webgpu' if available
  }
);

// Transcribe audio
const result = await transcriber(audioData, {
  language: 'indonesian',
  task: 'transcribe',
});
console.log(result.text);
```
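The `audioData` argument must be a mono Float32Array sampled at 16 kHz, which is what Whisper's feature extractor expects. A minimal sketch for decoding an uploaded file in the browser, assuming the browser resamples decoded audio to the `AudioContext` sample rate (current Chrome and Firefox do):

```javascript
// Decode an uploaded audio file to a 16 kHz mono Float32Array.
async function loadAudio(file) {
  const arrayBuffer = await file.arrayBuffer();
  // An AudioContext created at 16 kHz resamples decoded audio for us.
  const audioCtx = new AudioContext({ sampleRate: 16000 });
  const audioBuffer = await audioCtx.decodeAudioData(arrayBuffer);
  // Use the first channel; mix down first if you need true mono.
  return audioBuffer.getChannelData(0);
}

// Usage with an <input type="file"> element:
const file = document.querySelector('input[type="file"]').files[0];
const audioData = await loadAudio(file);
```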
### Custom Wrapper (Recommended)
```javascript
import {
  AutoProcessor,
  AutoModelForSpeechSeq2Seq,
  AutoTokenizer
} from '@xenova/transformers';

class IndonesianWhisper {
  constructor() {
    this.model = null;
    this.processor = null;
    this.tokenizer = null;
  }

  async load(options = {}) {
    const modelId = 'cmaree/Bagus-whisper-small-id-onnx';
    const device = options.device || 'wasm';
    this.processor = await AutoProcessor.from_pretrained(modelId);
    this.tokenizer = await AutoTokenizer.from_pretrained(modelId);
    this.model = await AutoModelForSpeechSeq2Seq.from_pretrained(modelId, {
      device: device,
      // q4f16 keeps the WASM download small; WebGPU prefers full precision
      dtype: device === 'webgpu' ? 'fp32' : 'q4f16'
    });
  }

  async transcribe(audioData, options = {}) {
    // Extract log-mel features; audioData should be 16 kHz mono
    const processed = await this.processor(audioData, {
      sampling_rate: options.sampling_rate || 16000
    });
    const outputs = await this.model.generate(processed.input_features, {
      max_new_tokens: 128,
      num_beams: 1, // greedy decoding
      language: 'indonesian',
      task: 'transcribe',
      return_timestamps: false,
    });
    // Decode token IDs back to text, dropping special tokens
    const transcription = this.tokenizer.batch_decode(outputs, {
      skip_special_tokens: true
    });
    return { text: transcription[0] };
  }
}

// Usage
const whisper = new IndonesianWhisper();
await whisper.load({ device: 'wasm' });
const result = await whisper.transcribe(audioBuffer);
console.log(result.text);
```
## Performance Benchmarks

### Accuracy

- Word Error Rate (WER): 5.9%
- Dataset: Common Voice 17.0 (Indonesian test set)
- Comparison:
  - Whisper Tiny (generic): ~25-30% WER
  - Whisper Small (generic): ~10-15% WER
  - Bagus Whisper Small (Indonesian): 5.9% WER ✅
### Inference Speed (Browser)

Tested on typical hardware with the WASM backend:
| Device | Browser | Speed (RTF) | Notes |
|---|---|---|---|
| Desktop (i7) | Chrome | ~0.3x | 3-4 seconds for 10s audio |
| Laptop (M1) | Safari | ~0.2x | 2-3 seconds for 10s audio |
| Mobile (Flagship) | Chrome | ~0.5x | 5-6 seconds for 10s audio |
RTF = Real-time Factor (lower is faster)
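To measure RTF on your own hardware, time a transcription and divide by the clip duration; a minimal sketch reusing the `transcriber` and `audioData` from the examples above:

```javascript
// Measure real-time factor: processing time / audio duration.
const durationSec = audioData.length / 16000; // 16 kHz samples

const t0 = performance.now();
await transcriber(audioData, { language: 'indonesian', task: 'transcribe' });
const elapsedSec = (performance.now() - t0) / 1000;

const rtf = elapsedSec / durationSec;
console.log(`RTF: ${rtf.toFixed(2)}x (lower is faster)`);
```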
With WebGPU (when available):
- 2-3x faster inference
- ~0.1-0.15x RTF on modern GPUs
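WebGPU can be feature-detected at runtime via `navigator.gpu`; a small sketch that opts in when available and otherwise falls back to WASM, mirroring the `device`/`dtype` pairing used in the wrapper above:

```javascript
import { pipeline } from '@xenova/transformers';

// Prefer WebGPU when the browser exposes it, otherwise fall back to WASM.
const device = navigator.gpu ? 'webgpu' : 'wasm';

const transcriber = await pipeline(
  'automatic-speech-recognition',
  'cmaree/Bagus-whisper-small-id-onnx',
  {
    device,
    // Same pairing as the wrapper above: full precision on GPU,
    // 4-bit weights with fp16 activations on WASM.
    dtype: device === 'webgpu' ? 'fp32' : 'q4f16',
  }
);
```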
## Training Details

### Original Model Training
- Base Model: OpenAI Whisper Small
- Training Data: Mozilla Common Voice 17.0 (Indonesian subset)
- Training Framework: PyTorch 2.3.0
- Learning Rate: 1e-05
- Batch Size: 32 (with gradient accumulation)
- Training Steps: 20,000
- Mixed Precision: Native AMP
### ONNX Conversion
- Conversion Tool: Transformers.js conversion script
- ONNX Opset: 14
- Quantization: q4f16 and int8/q8 (dynamic quantization), plus an fp16 variant
- Optimization: ONNXSlim for size reduction
## Intended Use

### Primary Use Cases

- ✅ Real-time Indonesian speech transcription in web browsers
- ✅ Offline voice-to-text applications
- ✅ Voice assistants and chatbots
- ✅ Transcription services
- ✅ Accessibility tools
### Out of Scope

- ❌ Speaker diarization (identifying who is speaking)
- ❌ Real-time translation (use with translation models)
- ❌ Audio quality enhancement
- ❌ Languages other than Indonesian
## Limitations

**Audio Quality**: Performance degrades with:
- Background noise
- Low-quality recordings
- Multiple speakers
- Strong accents or dialects
**Domain Specificity**: Trained on the Common Voice dataset, the model:

- May perform better on clear, scripted speech
- May struggle with spontaneous conversation
- May not recognize domain-specific jargon
**Resource Requirements** (a minimal capability check is sketched after this list):

- Requires a ~191 MB download (q4f16)
- Needs a modern browser with WASM support
- May be slow on low-end mobile devices
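A rough pre-flight check before committing to the ~191 MB download, using only standard browser globals:

```javascript
// Rough pre-flight check before fetching the model.
function checkSupport() {
  const hasWasm = typeof WebAssembly === 'object'
    && typeof WebAssembly.instantiate === 'function';
  const hasWebGpu = !!navigator.gpu; // optional accelerator
  return { hasWasm, hasWebGpu };
}

const { hasWasm, hasWebGpu } = checkSupport();
if (!hasWasm) {
  console.warn('This browser cannot run the model (no WebAssembly).');
}
```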
## Ethical Considerations

### Bias

- Trained on Common Voice data, which may not represent all Indonesian speakers equally
- May have biases toward certain accents, dialects, or demographic groups
- Users should evaluate performance on their specific use case
### Privacy
- Model runs entirely in the browser (no data sent to servers)
- Audio processing happens locally
- Users maintain full control over their data
### Responsible Use

- Should not be used for surveillance without consent
- Should not be the sole basis for high-stakes decisions
- Users should verify transcriptions for critical applications
## Citation

### Original Model

```bibtex
@misc{bagus2024whisper,
  author = {Bagus},
  title = {Whisper Small Indonesian CV17},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Bagus/whisper-small-id-cv17}
}
```
### Whisper

```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
```
### Transformers.js

```bibtex
@software{transformersjs,
  author = {Xenova},
  title = {Transformers.js: State-of-the-art Machine Learning for the web},
  year = {2024},
  url = {https://github.com/xenova/transformers.js}
}
```
## Acknowledgments
- Original Model: Bagus for fine-tuning Whisper on Indonesian
- Base Model: OpenAI for the Whisper architecture
- Dataset: Mozilla Foundation for Common Voice
- Conversion Tools: Xenova for Transformers.js and ONNX conversion scripts
- ONNX Optimization: Microsoft for ONNX Runtime
## Contact
For issues or questions about this ONNX version:
- Open an issue on the model repository
- Check Transformers.js documentation
For questions about the original model, refer to Bagus/whisper-small-id-cv17 on Hugging Face.
## License
This model is released under the Apache 2.0 license, consistent with the original Whisper model and the fine-tuned version.