Bagus Whisper Small Indonesian (ONNX)

This is an ONNX-converted version of Bagus/whisper-small-id-cv17, optimized for browser-based inference with Transformers.js.

Model Description

Bagus/whisper-small-id-cv17 is a fine-tuned Whisper Small model for Indonesian automatic speech recognition (ASR). It achieves a 5.9% word error rate (WER) on the Common Voice 17.0 Indonesian test set, making it one of the more accurate openly available Indonesian ASR models.

This ONNX version provides:

  • ✅ Browser-compatible format for client-side inference
  • ✅ Multiple quantization levels (q4f16, fp16, int8/q8)
  • ✅ Optimized for Transformers.js
  • ✅ No server required - runs entirely in the browser

Model Details

  • Base Model: OpenAI Whisper Small (244M parameters)
  • Language: Indonesian (id)
  • Training Data: Mozilla Common Voice 17.0 (Indonesian)
  • Performance: 5.9% WER
  • License: Apache 2.0
  • Original Model: Bagus/whisper-small-id-cv17

Quantization Options

This repository includes multiple quantization levels optimized for different use cases:

| Quantization | Encoder | Decoder | Total Size | Use Case |
|--------------|---------|---------|------------|----------|
| q4f16        | 51 MB   | 138 MB  | ~191 MB    | Recommended for browsers - best size/quality balance |
| fp16         | 168 MB  | 293 MB  | ~463 MB    | Higher quality, larger size |
| int8/q8      | 87 MB   | 299 MB  | ~386 MB    | Good quality, moderate size |
| fp32         | 337 MB  | 587 MB  | ~924 MB    | Full precision (reference) |
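If you want to pick a backend at load time, the sketch below feature-detects WebGPU via navigator.gpu and falls back to WASM, mirroring the dtype choices used in the examples that follow (the function name is illustrative):

```javascript
// Minimal sketch: choose device and dtype at runtime.
// WebGPU is detected via navigator.gpu; otherwise fall back to WASM.
function pickBackend() {
  const hasWebGPU = typeof navigator !== 'undefined' && 'gpu' in navigator;
  return {
    device: hasWebGPU ? 'webgpu' : 'wasm',
    // Mirrors the wrapper below: fp32 on WebGPU, q4f16 (~191 MB) on WASM.
    dtype: hasWebGPU ? 'fp32' : 'q4f16',
  };
}
```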

Usage

With Transformers.js (Browser)

```javascript
import { pipeline } from '@xenova/transformers';

// Create the ASR pipeline (downloads and caches the model on first use)
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'cmaree/Bagus-whisper-small-id-onnx',
  {
    dtype: 'q4f16',  // Recommended for browser
    device: 'wasm',  // or 'webgpu' if available
  }
);

// Transcribe audio (a 16 kHz mono Float32Array; see the helper below)
const result = await transcriber(audioData, {
  language: 'indonesian',
  task: 'transcribe',
});

console.log(result.text);
```
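The audioData argument is expected to be a mono Float32Array sampled at 16 kHz (Whisper's input rate). Here is a minimal sketch for decoding a user-selected file with the Web Audio API; fileInput is a hypothetical `<input type="file">` element:

```javascript
// Decode an audio File/Blob into a 16 kHz mono Float32Array.
async function fileToAudioData(file) {
  const arrayBuffer = await file.arrayBuffer();
  // An AudioContext created at 16 kHz resamples during decoding.
  const ctx = new AudioContext({ sampleRate: 16000 });
  const decoded = await ctx.decodeAudioData(arrayBuffer);
  await ctx.close();
  return decoded.getChannelData(0); // first channel only
}

const audioData = await fileToAudioData(fileInput.files[0]);
```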

Custom Wrapper (Recommended)

```javascript
import {
  AutoProcessor,
  AutoModelForSpeechSeq2Seq,
  AutoTokenizer
} from '@xenova/transformers';

class IndonesianWhisper {
  constructor() {
    this.model = null;
    this.processor = null;
    this.tokenizer = null;
  }

  async load(options = {}) {
    const modelId = 'cmaree/Bagus-whisper-small-id-onnx';
    const device = options.device || 'wasm';

    this.processor = await AutoProcessor.from_pretrained(modelId);
    this.tokenizer = await AutoTokenizer.from_pretrained(modelId);
    this.model = await AutoModelForSpeechSeq2Seq.from_pretrained(modelId, {
      device: device,
      dtype: device === 'webgpu' ? 'fp32' : 'q4f16'
    });
  }

  // audioData: mono Float32Array sampled at 16 kHz
  async transcribe(audioData, options = {}) {
    const processed = await this.processor(audioData, {
      sampling_rate: options.sampling_rate || 16000
    });

    const outputs = await this.model.generate(processed.input_features, {
      max_new_tokens: 128,
      num_beams: 1,  // greedy decoding
      language: 'indonesian',
      task: 'transcribe',
      return_timestamps: false,
    });

    const transcription = this.tokenizer.batch_decode(outputs, {
      skip_special_tokens: true
    });

    return { text: transcription[0] };
  }
}

// Usage
const whisper = new IndonesianWhisper();
await whisper.load({ device: 'wasm' });
const result = await whisper.transcribe(audioBuffer);
console.log(result.text);
```
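For live input, here is a hedged sketch that records a short clip from the microphone with MediaRecorder and reuses the fileToAudioData helper from the pipeline example above (the fixed duration is illustrative):

```javascript
// Record `seconds` of microphone audio, then transcribe it.
async function recordAndTranscribe(whisper, seconds = 5) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);

  recorder.start();
  await new Promise((r) => setTimeout(r, seconds * 1000));
  await new Promise((r) => { recorder.onstop = r; recorder.stop(); });
  stream.getTracks().forEach((t) => t.stop());

  const blob = new Blob(chunks, { type: recorder.mimeType });
  const audioData = await fileToAudioData(blob); // helper defined earlier
  return whisper.transcribe(audioData);
}
```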

Performance Benchmarks

Accuracy

  • Word Error Rate (WER): 5.9%
  • Dataset: Common Voice 17.0 (Indonesian test set)
  • Comparison:
    • Whisper Tiny (generic): ~25-30% WER
    • Whisper Small (generic): ~10-15% WER
    • Bagus Whisper Small (Indonesian): 5.9% WER ⭐

Inference Speed (Browser)

Tested on typical hardware with WASM backend:

| Device            | Browser | Speed (RTF) | Notes                      |
|-------------------|---------|-------------|----------------------------|
| Desktop (i7)      | Chrome  | ~0.3x       | 3-4 seconds for 10s audio  |
| Laptop (M1)       | Safari  | ~0.2x       | 2-3 seconds for 10s audio  |
| Mobile (flagship) | Chrome  | ~0.5x       | 5-6 seconds for 10s audio  |

RTF = real-time factor: processing time divided by audio duration (lower is faster)

With WebGPU (when available):

  • 2-3x faster inference
  • ~0.1-0.15x RTF on modern GPUs
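These numbers vary with hardware, so it is worth measuring on your own devices. A minimal sketch that computes RTF for a pipeline transcriber:

```javascript
// RTF = processing time / audio duration (lower is faster).
async function measureRTF(transcriber, audioData, sampleRate = 16000) {
  const audioSeconds = audioData.length / sampleRate;
  const start = performance.now();
  await transcriber(audioData, { language: 'indonesian', task: 'transcribe' });
  const elapsedSeconds = (performance.now() - start) / 1000;
  return elapsedSeconds / audioSeconds;
}
```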

Training Details

Original Model Training

  • Base Model: OpenAI Whisper Small
  • Training Data: Mozilla Common Voice 17.0 (Indonesian subset)
  • Training Framework: PyTorch 2.3.0
  • Learning Rate: 1e-05
  • Batch Size: 32 (with gradient accumulation)
  • Training Steps: 20,000
  • Mixed Precision: Native AMP

ONNX Conversion

  • Conversion Tool: Transformers.js conversion script
  • ONNX Opset: 14
  • Quantization: dynamic quantization (int8/q8), 4-bit weight quantization (q4f16), and fp16 conversion
  • Optimization: ONNXSlim for size reduction

Intended Use

Primary Use Cases

  • ✅ Real-time Indonesian speech transcription in web browsers
  • ✅ Offline voice-to-text applications
  • ✅ Voice assistants and chatbots
  • ✅ Transcription services
  • ✅ Accessibility tools

Out of Scope

  • ❌ Speaker diarization (identifying who is speaking)
  • ❌ Real-time translation (use with translation models)
  • ❌ Audio quality enhancement
  • ❌ Languages other than Indonesian

Limitations

  1. Audio Quality: Performance degrades with:
    • Background noise
    • Low-quality recordings
    • Multiple speakers
    • Strong accents or dialects
  2. Domain Specificity: Trained on the Common Voice dataset, so it:
    • May perform better on clear, scripted speech
    • May struggle with spontaneous conversation
    • May not recognize domain-specific jargon
  3. Resource Requirements:
    • Requires a ~191 MB download for q4f16 (see the loading-progress sketch after this list)
    • Needs a modern browser with WASM support
    • May be slow on low-end mobile devices
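Because the first load pulls ~191 MB (q4f16) over the network, it helps to surface download progress to the user. A sketch using the progress_callback option accepted by Transformers.js's pipeline/from_pretrained (files are cached by the browser after the first download):

```javascript
import { pipeline } from '@xenova/transformers';

// Report per-file download progress while the model loads.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'cmaree/Bagus-whisper-small-id-onnx',
  {
    dtype: 'q4f16',
    device: 'wasm',
    progress_callback: (p) => {
      if (p.status === 'progress') {
        console.log(`${p.file}: ${Math.round(p.progress)}%`);
      }
    },
  }
);
```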

Ethical Considerations

Bias

  • Trained on Common Voice data which may not represent all Indonesian speakers equally
  • May have biases toward certain accents, dialects, or demographic groups
  • Users should evaluate performance on their specific use case

Privacy

  • Model runs entirely in the browser (no data sent to servers)
  • Audio processing happens locally
  • Users maintain full control over their data

Responsible Use

  • Should not be used for surveillance without consent
  • Should not be sole basis for high-stakes decisions
  • Users should verify transcriptions for critical applications

Citation

Original Model

```bibtex
@misc{bagus2024whisper,
  author = {Bagus},
  title = {Whisper Small Indonesian CV17},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Bagus/whisper-small-id-cv17}
}
```

Whisper

```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
```

Transformers.js

```bibtex
@software{transformersjs,
  author = {Xenova},
  title = {Transformers.js: State-of-the-art Machine Learning for the web},
  year = {2024},
  url = {https://github.com/xenova/transformers.js}
}
```

Acknowledgments

  • Original Model: Bagus for fine-tuning Whisper on Indonesian
  • Base Model: OpenAI for the Whisper architecture
  • Dataset: Mozilla Foundation for Common Voice
  • Conversion Tools: Xenova for Transformers.js and ONNX conversion scripts
  • ONNX Optimization: Microsoft for ONNX Runtime

Contact

For issues or questions about this ONNX version:

  • Open a discussion on the cmaree/Bagus-whisper-small-id-onnx repository on Hugging Face

For the original model:

  • See Bagus/whisper-small-id-cv17 (https://huggingface.co/Bagus/whisper-small-id-cv17)

License

This model is released under the Apache 2.0 license, consistent with the original Whisper model and the fine-tuned version.
