How can I transcribe an audio file that’s longer than an hour when I have only 12 GB of VRAM?
Short clips of a few dozen seconds transcribe flawlessly, but long files immediately run out of memory.
Is there any way to keep VRAM usage to around 8 GB while processing an 11‑hour recording?
You could try two things:
- Limit the attention window, as discussed here: https://developer.nvidia.com/blog/pushing-the-boundaries-of-speech-recognition-with-nemo-parakeet-asr-models/#parakeet_models_for_long-form_speech_inference (this comes at the cost of a small accuracy degradation). You could also vary the attention window length based on GPU memory consumption.
- Use chunk-based inference; see this script: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py. You could use a buffer of 30 s and a chunk length of 10 s (adjust these based on your speed vs. accuracy requirements); a minimal sketch of the idea follows below.
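To see the shape of the chunked approach without the full script, here is a minimal sketch. It is not the NeMo buffered script: it naively splits the file into fixed-length chunks on disk and transcribes them one at a time, so only one chunk's activations live on the GPU at once. It assumes a mono 16 kHz input and the soundfile package, and it can cut words at chunk boundaries, which the buffered script avoids with its overlap logic.

import os
import tempfile

import soundfile as sf
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)

CHUNK_SECS = 30  # keep each forward pass small enough for ~8 GB of VRAM

audio, sr = sf.read("test.wav")  # assumes mono 16 kHz audio
texts = []
with tempfile.TemporaryDirectory() as tmp:
    for i, start in enumerate(range(0, len(audio), CHUNK_SECS * sr)):
        chunk_path = os.path.join(tmp, f"chunk_{i:05d}.wav")
        sf.write(chunk_path, audio[start:start + CHUNK_SECS * sr], sr)
        texts.append(asr_model.transcribe([chunk_path])[0].text)

print(" ".join(texts))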
Thanks, the first approach worked for me.
Here is my test.py:
import time

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v2")

# Enable local attention (128 frames of left/right context)
asr_model.change_attention_model("rel_pos_local_attn", [128, 128])
# Enable chunking for the subsampling module (1 = auto-select the factor)
asr_model.change_subsampling_conv_chunking_factor(1)

# Time only the transcription itself, not the model reconfiguration
start_time = time.time()
# output = asr_model.transcribe(['2086-149220-0033.wav'], timestamps=True)
output = asr_model.transcribe(['test.wav'], timestamps=True)
execution_time = time.time() - start_time
word_timestamps = output[0].timestamp['word']  # word-level stamps (unused below)
segment_timestamps = output[0].timestamp['segment']
for stamp in segment_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")
print(f"Done: {execution_time:.2f} secs")
A 22-minute audio file took 9 seconds to transcribe, about twice as fast as Whisper large-v3-turbo.
Recognition accuracy on American TV shows with background music and laughter is pretty good.
Occasionally “Ted” is recognized as “Dad”, and the segmenter breaks a sentence after “Mrs.”:
54.96s - 55.84s : Oh, Dad. (Correct: Ted.)
112.48s - 113.36s : She’s the future Mrs.
113.44s - 114.32000000000001s : Ted Mosby.
(Should be one segment: She’s the future Mrs. Ted Mosby.)
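As a workaround for the “Mrs.” break, segments whose text ends in a title abbreviation could be merged with the following segment before printing. This is a sketch of a post-processing idea, not NeMo behavior; the title list is my assumption.

# Merge a segment that ends with a title abbreviation into the next one,
# so "She's the future Mrs." + "Ted Mosby." print as a single line.
TITLES = ("Mr.", "Mrs.", "Ms.", "Dr.")

def merge_title_breaks(segments):
    merged = []
    for seg in segments:
        if merged and merged[-1]["segment"].rstrip().endswith(TITLES):
            merged[-1]["segment"] = merged[-1]["segment"].rstrip() + " " + seg["segment"]
            merged[-1]["end"] = seg["end"]
        else:
            merged.append(dict(seg))
    return merged

for stamp in merge_title_breaks(segment_timestamps):
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")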
For the second approach, I tried adding the following to speech_to_text_buffered_infer_rnnt.py:
asr_model.change_attention_model("rel_pos_local_attn", [128, 128]) # local attn
asr_model.change_subsampling_conv_chunking_factor(1)
But it didn’t work; I still ran into a VRAM overflow.
Could you provide reference code that stays within 8 GB of VRAM?
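One way to check how close a given configuration gets to the 8 GB target is to read PyTorch’s peak-allocation counter around the transcribe call. A minimal sketch using standard torch.cuda APIs; the file name is a placeholder.

import torch
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)
asr_model.change_attention_model("rel_pos_local_attn", [128, 128])
asr_model.change_subsampling_conv_chunking_factor(1)

# Reset the peak counter, run transcription, then read back the high-water mark
torch.cuda.reset_peak_memory_stats()
asr_model.transcribe(["test.wav"])
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM during transcription: {peak_gb:.2f} GB")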
Very impressed with parakeet-tdt-0.6b-v2, but it doesn’t seem to work on long-form audio if you need timestamps.
I noticed the long-form script (examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py) returns the logits; does that mean timestamps can be extracted?
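In principle, yes: each encoder output frame covers a fixed slice of audio, so a frame index from the logits can be mapped to seconds. A sketch assuming the 0.01 s feature hop and the 8x subsampling typical of FastConformer models; verify both against your model’s config before relying on it.

# Convert an encoder frame index into a timestamp in seconds.
# Both constants are assumptions based on typical FastConformer configs;
# read the real values from the model's preprocessor/encoder settings.
WINDOW_STRIDE_SECS = 0.01  # featurizer hop length
ENCODER_SUBSAMPLING = 8    # encoder downsampling factor

def frame_to_seconds(frame_idx: int) -> float:
    return frame_idx * WINDOW_STRIDE_SECS * ENCODER_SUBSAMPLING

print(f"{frame_to_seconds(687):.2f}")  # 54.96, matching the segment start above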
@will1130 any progress on a solution from your side? I’ve also been working on this, as it breaks my workflow.
Unfortunately, I’ve determined that it is a bug in the NeMo library itself. See this issue: https://github.com/NVIDIA/NeMo/issues/7166#issuecomment-2902756222
So far it has been more than two weeks without any resolution from NVIDIA’s side.