Den4ikAI/whisper-large-v2-no-digits-norm-punct

This is a modified version of the openai/whisper-large-v2 model. All tokens corresponding to digits, along with tokens containing extraneous punctuation, have been removed from its vocabulary.

The goal of this modification is to force the model to write numbers as words rather than digits. This is useful for text normalization, for example when preparing data for text-to-speech (TTS) systems, where numbers must be fully spelled out.
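
To make the idea of "tokens corresponding to digits" concrete, here is a minimal sketch of how such tokens could be located in a vocabulary. It is an illustration only: the source does not describe how the tokens were actually removed, and `find_digit_token_ids` and `toy_vocab` are hypothetical names invented for this example.

```python
ASCII_DIGITS = set("0123456789")


def find_digit_token_ids(vocab):
    """Return the sorted ids of tokens whose text contains an ASCII digit.

    `vocab` maps token strings to integer ids. This helper is hypothetical
    and is not taken from the model's own tooling.
    """
    return sorted(
        token_id
        for token, token_id in vocab.items()
        if any(ch in ASCII_DIGITS for ch in token)
    )


# Toy vocabulary standing in for a real tokenizer vocabulary (assumption).
toy_vocab = {"Ġthe": 0, "20": 1, "Ġ2000": 2, "Ġрублей": 3, "1st": 4, ".": 5}

digit_ids = find_digit_token_ids(toy_vocab)
print(digit_ids)  # [1, 2, 4]
```

With a real tokenizer, a mapping of this shape is available from `processor.tokenizer.get_vocab()`. A related but different technique is to leave the vocabulary intact and pass such ids to `generate` through its `suppress_tokens` argument on the original model; that is an alternative, not necessarily what was done here.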

Comparison with the Original Model

The example below shows how the two models differ when transcribing the same audio clip, which contains the phrase “Билет стоил двадцать тысяч рублей” (“The ticket cost twenty thousand rubles”).

Model: openai/whisper-large-v2 (original)
Transcription output: <|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Билет стоил **20000** рублей.<|endoftext|>

Model: Den4ikAI/whisper-large-v2-no-digits-norm-punct (this model)
Transcription output: <|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Билет стоил **двадцать тысяч** рублей.<|endoftext|>

As you can see, this modified model correctly normalized the number into words, whereas the original version left it as digits.
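
When comparing many clips, this check can be automated. The sketch below strips the Whisper special tokens from a raw transcription and reports whether any digits remain. The helper names (`strip_special_tokens`, `has_digits`) are hypothetical, and the two strings are the outputs quoted above with the bold markers removed.

```python
import re

# Matches Whisper-style special tokens such as <|ru|> or <|endoftext|>.
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def strip_special_tokens(raw):
    """Remove <|...|> markers and surrounding whitespace from a raw output."""
    return SPECIAL_TOKEN.sub("", raw).strip()


def has_digits(text):
    """Return True if the text contains at least one digit character."""
    return any(ch.isdigit() for ch in text)


original_output = (
    "<|startoftranscript|><|ru|><|transcribe|><|notimestamps|>"
    " Билет стоил 20000 рублей.<|endoftext|>"
)
modified_output = (
    "<|startoftranscript|><|ru|><|transcribe|><|notimestamps|>"
    " Билет стоил двадцать тысяч рублей.<|endoftext|>"
)

for name, raw in [("original", original_output), ("modified", modified_output)]:
    text = strip_special_tokens(raw)
    print(name, "| digits present:", has_digits(text), "|", text)
```

A regular expression is used here only to keep the example free of dependencies. When decoding with the processor, passing `skip_special_tokens=True` to `batch_decode` removes the markers directly.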

How to Use

You can use this model just like any other Whisper model in the transformers library.

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio
import torch

# Specify the device (GPU if available)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the audio file
wav, sr = torchaudio.load("numbers5.mp3")
# Convert to mono and resample to 16 kHz
if wav.shape[0] > 1:
    wav = torch.mean(wav, dim=0, keepdim=True)
resampler = torchaudio.transforms.Resample(sr, 16000)
wav = resampler(wav)
audio_input = wav.squeeze(0)

# Load the processor and model
model_id = "Den4ikAI/whisper-large-v2-no-digits-norm-punct"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

# Prepare inputs and extract features
input_features = processor(
    audio_input,
    sampling_rate=16000,
    return_tensors="pt"
).input_features.to(device)

# Generate token IDs (for Russian specify language="russian")
predicted_ids = model.generate(input_features, language="russian", task="transcribe")

# Decode tokens back to text
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=False
)

print(transcription)

# Example output for an audio clip with numbers:
# ['<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Билет стоил двадцать тысяч рублей.<|endoftext|>']