CUDA Out of Memory

#14
by ep5000

Hi,

I have 2 x 24GB Tesla P40 GPUs and I get a CUDA out-of-memory error while only one GPU is saturated; the second GPU stays idle. It looks like device_map="auto" isn't taking effect. Any thoughts?

Here's the code I'm using (pulled from the examples; the only change is attn_implementation="eager", since this GPU model doesn't support flash attention):


import sys
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

model_path = "nanonets/Nanonets-OCR-s"

model = AutoModelForImageTextToText.from_pretrained(
    model_path, 
    torch_dtype="auto", 
    device_map="auto", 
    attn_implementation="eager"
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)


def ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=4096):
    prompt = """Extract the text from the above document as if you were reading it naturally. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and β˜‘ for check boxes."""
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": prompt},
        ]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)
    
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens (strip the echoed prompt tokens)
    generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)]
    
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]

image_path = sys.argv[1]
result = ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=15000)
print(result)

With the above code, the following error is generated:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 15.36 GiB. GPU 0 has a total capacity of 23.87 GiB of which 3.25 GiB is free. Including non-PyTorch memory, this process has 20.61 GiB memory in use. Of the allocated memory 20.31 GiB is allocated by PyTorch, and 132.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Nanonets org

What are your image dimensions? The model may be creating too many image tokens from a very large image. Try resizing the image to a fixed size like 2048x2048 and see if it works. You should be able to run this on a single 24 GB GPU.
