How can I transcribe an audio file that’s longer than an hour when I have only 12 GB of VRAM?

#15
by will1130 - opened

Short clips of just a few dozen seconds transcribe flawlessly, but large files immediately run out of memory.

Is there any way to keep VRAM usage to around 8 GB while processing an 11‑hour recording?

NVIDIA org

You could try two things:

  1. Limit the attention window as discussed here: https://developer.nvidia.com/blog/pushing-the-boundaries-of-speech-recognition-with-nemo-parakeet-asr-models/#parakeet_models_for_long-form_speech_inference (this comes at the cost of a slight degradation in accuracy). You could vary the attention window length based on GPU memory consumption.
  2. Use chunk-based inference; see this script: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py. You could use a buffer of 30 s and a chunk length of 10 s (adjust this based on your performance vs. accuracy requirements).
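For intuition, the buffer/chunk scheme in option 2 can be sketched in plain Python: each short chunk is decoded inside a larger buffer window so the model sees left and right context, but only the middle chunk's text is kept. This helper (function and variable names are mine, not from the NeMo script) just computes the spans:

```python
def chunk_spans(total_secs, chunk_secs=10.0, buffer_secs=30.0):
    """Compute (buffer_start, buffer_end, chunk_start, chunk_end) spans.

    Each chunk of `chunk_secs` is decoded inside a `buffer_secs` window
    padded on both sides, so the model sees surrounding context while
    only the middle chunk's transcription is kept. Peak memory is bounded
    by `buffer_secs`, not by the total file length.
    """
    pad = (buffer_secs - chunk_secs) / 2.0
    spans = []
    t = 0.0
    while t < total_secs:
        chunk_end = min(t + chunk_secs, total_secs)
        buf_start = max(0.0, t - pad)
        buf_end = min(total_secs, chunk_end + pad)
        spans.append((buf_start, buf_end, t, chunk_end))
        t = chunk_end
    return spans
```

Because every decode only ever sees one buffer-sized window, VRAM usage stays flat no matter how long the recording is.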

Thanks, I succeeded with the first approach.
Here is my test.py:

import nemo.collections.asr as nemo_asr
import time

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v2")
start_time = time.time()
# Enable local attention
asr_model.change_attention_model("rel_pos_local_attn", [128, 128])  # local attn
 
# Enable chunking for subsampling module
asr_model.change_subsampling_conv_chunking_factor(1)  # 1 = auto select

# output = asr_model.transcribe(['2086-149220-0033.wav'], timestamps=True)

output = asr_model.transcribe(['test.wav'], timestamps=True)

end_time = time.time()
execution_time = end_time - start_time

word_timestamps = output[0].timestamp['word']
segment_timestamps = output[0].timestamp['segment']

for stamp in segment_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")

print(f"Done: {execution_time:.2f} Secs")

A 22-minute audio file took 9 seconds to transcribe, which is about twice as fast as Whisper large-v3-turbo.

The recognition accuracy on American TV shows with background music and laughter is pretty good.
Occasionally, “Ted” is recognized as “Dad”, and “Mrs.” is split off into a separate segment.

54.96s - 55.84s : Oh, Dad. (Correct: Ted.)

112.48s - 113.36s : She’s the future Mrs.
113.44s - 114.32000000000001s : Ted Mosby.
(Correct should be: She’s the future Mrs. Ted Mosby.)
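As a possible post-processing workaround for the "Mrs." split (my own sketch, not part of NeMo), a segment whose text ends with a title abbreviation can be merged with the following segment:

```python
# Abbreviations that should not end a segment (my own list; extend as needed).
TITLES = {"Mr.", "Mrs.", "Ms.", "Dr.", "St."}

def merge_title_breaks(segments):
    """Merge segments ending in a title abbreviation into the next one.

    `segments` is a list of dicts with 'start', 'end', and 'segment' keys,
    matching the NeMo segment-timestamp output shown above.
    """
    merged = []
    for seg in segments:
        words = merged[-1]["segment"].split() if merged else []
        if words and words[-1] in TITLES:
            prev = merged[-1]
            merged[-1] = {
                "start": prev["start"],
                "end": seg["end"],
                "segment": prev["segment"].rstrip() + " " + seg["segment"].lstrip(),
            }
        else:
            merged.append(dict(seg))
    return merged
```

Applied to the example above, the two segments would come back as one: "She's the future Mrs. Ted Mosby." spanning 112.48 s to 114.32 s.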

For the second approach, I tried adding the following to speech_to_text_buffered_infer_rnnt.py:

    asr_model.change_attention_model("rel_pos_local_attn", [128, 128])  # local attn
    asr_model.change_subsampling_conv_chunking_factor(1) 

But it didn’t work; I still ran into VRAM overflow.
Could you provide reference code that uses only 8 GB of VRAM?

Very impressed with parakeet-tdt-0.6b-v2, but it doesn't seem to work on long-form audio if you need timestamps.
I noticed the long-form script (examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py) returns the logits; does that mean timestamps can be extracted?

@will1130 any progress on a solution from your side? I've also been working on this, as it breaks my workflow.

Any update? @will1130

Unfortunately, I've determined that it is a bug in the NeMo library itself. See this issue: https://github.com/NVIDIA/NeMo/issues/7166#issuecomment-2902756222

So far it has been more than two weeks without any resolution from NVIDIA's side.
