Bagus Whisper Small Indonesian (ONNX)

This is an ONNX-converted version of Bagus/whisper-small-id-cv17, optimized for browser-based inference with Transformers.js.

Model Description

Bagus/whisper-small-id-cv17 is a fine-tuned Whisper Small model for Indonesian automatic speech recognition (ASR). It achieves a 5.9% word error rate (WER) on the Common Voice 17.0 Indonesian test set, making it one of the more accurate openly available Indonesian ASR models.

This ONNX version provides:

  • ✅ Browser-compatible format for client-side inference
  • ✅ Multiple quantization levels (q4f16, fp16, int8/q8)
  • ✅ Optimized for Transformers.js
  • ✅ No server required - runs entirely in the browser

Model Details

  • Base Model: OpenAI Whisper Small (244M parameters)
  • Language: Indonesian (id)
  • Training Data: Mozilla Common Voice 17.0 (Indonesian)
  • Performance: 5.9% WER
  • License: Apache 2.0
  • Original Model: Bagus/whisper-small-id-cv17

Quantization Options

This repository includes multiple quantization levels optimized for different use cases:

| Quantization | Encoder | Decoder | Total Size | Use Case |
|--------------|---------|---------|------------|----------|
| q4f16        | 51 MB   | 138 MB  | ~191 MB    | Recommended for browsers - best size/quality balance |
| fp16         | 168 MB  | 293 MB  | ~463 MB    | Higher quality, larger size |
| int8/q8      | 87 MB   | 299 MB  | ~386 MB    | Good quality, moderate size |
| fp32         | 337 MB  | 587 MB  | ~924 MB    | Full precision (reference) |
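If you want to pick a backend at load time, the sketch below feature-detects WebGPU via navigator.gpu and falls back to WASM, mirroring the dtype choices used in the examples that follow (the function name is illustrative):

```javascript
// Minimal sketch: choose device and dtype at runtime.
// WebGPU is detected via navigator.gpu; otherwise fall back to WASM.
function pickBackend() {
  const hasWebGPU = typeof navigator !== 'undefined' && 'gpu' in navigator;
  return {
    device: hasWebGPU ? 'webgpu' : 'wasm',
    // Mirrors the wrapper below: fp32 on WebGPU, q4f16 (~191 MB) on WASM.
    dtype: hasWebGPU ? 'fp32' : 'q4f16',
  };
}
```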

Usage

With Transformers.js (Browser)

```javascript
import { pipeline } from '@xenova/transformers';

// Create the ASR pipeline (downloads and caches the model on first use)
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'cmaree/Bagus-whisper-small-id-onnx',
  {
    dtype: 'q4f16',  // Recommended for browser
    device: 'wasm',  // or 'webgpu' if available
  }
);

// Transcribe audio (a 16 kHz mono Float32Array; see the helper below)
const result = await transcriber(audioData, {
  language: 'indonesian',
  task: 'transcribe',
});

console.log(result.text);
```
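The audioData argument is expected to be a mono Float32Array sampled at 16 kHz (Whisper's input rate). Here is a minimal sketch for decoding a user-selected file with the Web Audio API; fileInput is a hypothetical `<input type="file">` element:

```javascript
// Decode an audio File/Blob into a 16 kHz mono Float32Array.
async function fileToAudioData(file) {
  const arrayBuffer = await file.arrayBuffer();
  // An AudioContext created at 16 kHz resamples during decoding.
  const ctx = new AudioContext({ sampleRate: 16000 });
  const decoded = await ctx.decodeAudioData(arrayBuffer);
  await ctx.close();
  return decoded.getChannelData(0); // first channel only
}

const audioData = await fileToAudioData(fileInput.files[0]);
```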

Custom Wrapper (Recommended)

```javascript
import {
  AutoProcessor,
  AutoModelForSpeechSeq2Seq,
  AutoTokenizer
} from '@xenova/transformers';

class IndonesianWhisper {
  constructor() {
    this.model = null;
    this.processor = null;
    this.tokenizer = null;
  }

  async load(options = {}) {
    const modelId = 'cmaree/Bagus-whisper-small-id-onnx';
    const device = options.device || 'wasm';

    this.processor = await AutoProcessor.from_pretrained(modelId);
    this.tokenizer = await AutoTokenizer.from_pretrained(modelId);
    this.model = await AutoModelForSpeechSeq2Seq.from_pretrained(modelId, {
      device: device,
      dtype: device === 'webgpu' ? 'fp32' : 'q4f16'
    });
  }

  // audioData: mono Float32Array sampled at 16 kHz
  async transcribe(audioData, options = {}) {
    const processed = await this.processor(audioData, {
      sampling_rate: options.sampling_rate || 16000
    });

    const outputs = await this.model.generate(processed.input_features, {
      max_new_tokens: 128,
      num_beams: 1,  // greedy decoding
      language: 'indonesian',
      task: 'transcribe',
      return_timestamps: false,
    });

    const transcription = this.tokenizer.batch_decode(outputs, {
      skip_special_tokens: true
    });

    return { text: transcription[0] };
  }
}

// Usage
const whisper = new IndonesianWhisper();
await whisper.load({ device: 'wasm' });
const result = await whisper.transcribe(audioBuffer);
console.log(result.text);
```
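For live input, here is a hedged sketch that records a short clip from the microphone with MediaRecorder and reuses the fileToAudioData helper from the pipeline example above (the fixed duration is illustrative):

```javascript
// Record `seconds` of microphone audio, then transcribe it.
async function recordAndTranscribe(whisper, seconds = 5) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);

  recorder.start();
  await new Promise((r) => setTimeout(r, seconds * 1000));
  await new Promise((r) => { recorder.onstop = r; recorder.stop(); });
  stream.getTracks().forEach((t) => t.stop());

  const blob = new Blob(chunks, { type: recorder.mimeType });
  const audioData = await fileToAudioData(blob); // helper defined earlier
  return whisper.transcribe(audioData);
}
```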

Performance Benchmarks

Accuracy

  • Word Error Rate (WER): 5.9%
  • Dataset: Common Voice 17.0 (Indonesian test set)
  • Comparison:
    • Whisper Tiny (generic): ~25-30% WER
    • Whisper Small (generic): ~10-15% WER
    • Bagus Whisper Small (Indonesian): 5.9% WER ⭐

Inference Speed (Browser)

Tested on typical hardware with WASM backend:

| Device            | Browser | Speed (RTF) | Notes                      |
|-------------------|---------|-------------|----------------------------|
| Desktop (i7)      | Chrome  | ~0.3x       | 3-4 seconds for 10s audio  |
| Laptop (M1)       | Safari  | ~0.2x       | 2-3 seconds for 10s audio  |
| Mobile (flagship) | Chrome  | ~0.5x       | 5-6 seconds for 10s audio  |

RTF = real-time factor: processing time divided by audio duration (lower is faster)

With WebGPU (when available):

  • 2-3x faster inference
  • ~0.1-0.15x RTF on modern GPUs
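These numbers vary with hardware, so it is worth measuring on your own devices. A minimal sketch that computes RTF for a pipeline transcriber:

```javascript
// RTF = processing time / audio duration (lower is faster).
async function measureRTF(transcriber, audioData, sampleRate = 16000) {
  const audioSeconds = audioData.length / sampleRate;
  const start = performance.now();
  await transcriber(audioData, { language: 'indonesian', task: 'transcribe' });
  const elapsedSeconds = (performance.now() - start) / 1000;
  return elapsedSeconds / audioSeconds;
}
```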

Training Details

Original Model Training

  • Base Model: OpenAI Whisper Small
  • Training Data: Mozilla Common Voice 17.0 (Indonesian subset)
  • Training Framework: PyTorch 2.3.0
  • Learning Rate: 1e-05
  • Batch Size: 32 (with gradient accumulation)
  • Training Steps: 20,000
  • Mixed Precision: Native AMP

ONNX Conversion

  • Conversion Tool: Transformers.js conversion script
  • ONNX Opset: 14
  • Quantization: dynamic quantization (int8/q8), 4-bit weight quantization (q4f16), and fp16 conversion
  • Optimization: ONNXSlim for size reduction

Intended Use

Primary Use Cases

  • ✅ Real-time Indonesian speech transcription in web browsers
  • ✅ Offline voice-to-text applications
  • ✅ Voice assistants and chatbots
  • ✅ Transcription services
  • ✅ Accessibility tools

Out of Scope

  • ❌ Speaker diarization (identifying who is speaking)
  • ❌ Real-time translation (use with translation models)
  • ❌ Audio quality enhancement
  • ❌ Languages other than Indonesian

Limitations

  1. Audio Quality: Performance degrades with:
    • Background noise
    • Low-quality recordings
    • Multiple speakers
    • Strong accents or dialects
  2. Domain Specificity: Trained on the Common Voice dataset, so it:
    • May perform better on clear, scripted speech
    • May struggle with spontaneous conversation
    • May not recognize domain-specific jargon
  3. Resource Requirements:
    • Requires a ~191 MB download for q4f16 (see the loading-progress sketch after this list)
    • Needs a modern browser with WASM support
    • May be slow on low-end mobile devices
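Because the first load pulls ~191 MB (q4f16) over the network, it helps to surface download progress to the user. A sketch using the progress_callback option accepted by Transformers.js's pipeline/from_pretrained (files are cached by the browser after the first download):

```javascript
import { pipeline } from '@xenova/transformers';

// Report per-file download progress while the model loads.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'cmaree/Bagus-whisper-small-id-onnx',
  {
    dtype: 'q4f16',
    device: 'wasm',
    progress_callback: (p) => {
      if (p.status === 'progress') {
        console.log(`${p.file}: ${Math.round(p.progress)}%`);
      }
    },
  }
);
```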

Ethical Considerations

Bias

  • Trained on Common Voice data which may not represent all Indonesian speakers equally
  • May have biases toward certain accents, dialects, or demographic groups
  • Users should evaluate performance on their specific use case

Privacy

  • Model runs entirely in the browser (no data sent to servers)
  • Audio processing happens locally
  • Users maintain full control over their data

Responsible Use

  • Should not be used for surveillance without consent
  • Should not be sole basis for high-stakes decisions
  • Users should verify transcriptions for critical applications

Citation

Original Model

```bibtex
@misc{bagus2024whisper,
  author = {Bagus},
  title = {Whisper Small Indonesian CV17},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Bagus/whisper-small-id-cv17}
}
```

Whisper

```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
```

Transformers.js

```bibtex
@software{transformersjs,
  author = {Xenova},
  title = {Transformers.js: State-of-the-art Machine Learning for the web},
  year = {2024},
  url = {https://github.com/xenova/transformers.js}
}
```

Acknowledgments

  • Original Model: Bagus for fine-tuning Whisper on Indonesian
  • Base Model: OpenAI for the Whisper architecture
  • Dataset: Mozilla Foundation for Common Voice
  • Conversion Tools: Xenova for Transformers.js and ONNX conversion scripts
  • ONNX Optimization: Microsoft for ONNX Runtime

Contact

For issues or questions about this ONNX version:

  • Open a discussion on the cmaree/Bagus-whisper-small-id-onnx repository on Hugging Face

For the original model:

  • See Bagus/whisper-small-id-cv17 (https://huggingface.co/Bagus/whisper-small-id-cv17)

License

This model is released under the Apache 2.0 license, consistent with the original Whisper model and the fine-tuned version.
