High-Accuracy ECG Image Interpretation with LLaMA 3.2

This repository contains the official fine-tuned model from the paper: "High-Accuracy ECG Image Interpretation using Parameter-Efficient LoRA Fine-Tuning with Multimodal LLaMA 3.2".

Paper: arXiv:2501.18670

This model was developed by Nandakishor M and Anjali M at Convai Innovations. It is designed to provide high-accuracy, comprehensive interpretation of electrocardiogram (ECG) images.

Model Details

Base Model: unsloth/Llama-3.2-11B-Vision-Instruct
Fine-tuning Strategy: Parameter-Efficient LoRA
Dataset: ECGInstruct, a large-scale dataset with 1 million instruction-following samples derived from public sources like MIMIC-IV ECG and PTB-XL.
Primary Use: Automated analysis and report generation from ECG images to assist cardiologists and medical professionals in diagnosing a wide range of cardiac conditions.

How to Use

This model was trained using Unsloth to achieve high performance and memory efficiency. The following code provides a complete example of how to load the model in 4-bit precision and run inference.

You can also run this code on a free Google Colab instance.

import torch
from unsloth import FastVisionModel
from transformers import TextStreamer
from PIL import Image
from IPython.display import display

# Make sure you have an ECG image file, e.g., 'my_ecg.jpg'
image_path = "my_ecg.jpg"

# Load the 4-bit quantized model and processor
model, processor = FastVisionModel.from_pretrained(
    model_name="convaiinnovations/ECG-Instruct-Llama-3.2-11B-Vision",
    max_seq_length=4096,
    dtype=None,
    load_in_4bit=True,
    device_map="cuda"
)

# Enable fast inference mode
FastVisionModel.for_inference(model)

# Load the image
image = Image.open(image_path).convert("RGB")

# Define the instruction
query = "You are an expert cardiologist. Write an in-depth diagnosis report from this ECG data, including the final diagnosis."

# Prepare the prompt
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": query}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

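# Process inputs. The chat template already inserts the begin-of-text token,
# so add_special_tokens=False keeps the tokenizer from adding a second one.
inputs = processor(
    text=input_text,
    images=image,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")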

# Set up streamer for token-by-token output
text_streamer = TextStreamer(processor.tokenizer, skip_prompt=True)

# Generate the report
_ = model.generate(**inputs,
                    streamer=text_streamer,
                    max_new_tokens=512,
                    use_cache=True,
                    temperature=0.2,
                    min_p=0.1)

# To see the input image in a notebook:
# display(image.resize((600, 400)))
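
The prompt structure used above (one image placeholder followed by a text instruction) can be wrapped in a small helper so that other instructions reuse the same layout. This is a minimal sketch: `build_ecg_messages` and `DEFAULT_QUERY` are hypothetical names introduced here for illustration and are not part of the model or of Unsloth.

```python
from typing import Any, Dict, List

# Same instruction as in the example above.
DEFAULT_QUERY = (
    "You are an expert cardiologist. Write an in-depth diagnosis report "
    "from this ECG data, including the final diagnosis."
)


def build_ecg_messages(query: str = DEFAULT_QUERY) -> List[Dict[str, Any]]:
    """Return a single-turn chat: one image placeholder, then the text query."""
    if not query.strip():
        raise ValueError("query must not be empty")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},                # slot filled by the ECG image
                {"type": "text", "text": query},  # instruction for the model
            ],
        }
    ]
```

The returned list can be passed to `processor.apply_chat_template(messages, add_generation_prompt=True)` exactly as in the example above.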

Training and Fine-tuning

The model was fine-tuned on the ECGInstruct dataset using a parameter-efficient LoRA strategy, which significantly improves performance on ECG interpretation tasks while preserving the base model's extensive knowledge.

Key Hyperparameters:

LoRA Rank (r): 64
LoRA Alpha (alpha): 128
LoRA Dropout: 0.05
Learning Rate: 2e-4 with a cosine scheduler
Epochs: 3
Hardware: 4x NVIDIA A100 80GB GPUs
Framework: Unsloth with DeepSpeed ZeRO-2

Note: As described in the paper, the lm_head and embed_tokens layers were excluded from LoRA adaptation to maintain generation stability.
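
To make the numbers above concrete, the sketch below collects them in plain Python, without any training framework. It assumes the standard LoRA convention of scaling the low-rank update by alpha / r, and a cosine decay with no warmup (the warmup setting is not stated here). The names `LoraHyperparams`, `is_lora_candidate`, and `cosine_lr` are hypothetical, as are the module paths in the usage note; this is not the authors' training code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraHyperparams:
    rank: int = 64          # LoRA Rank (r)
    alpha: int = 128        # LoRA Alpha
    dropout: float = 0.05   # LoRA Dropout
    peak_lr: float = 2e-4   # Learning Rate
    epochs: int = 3         # Epochs

    @property
    def scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update by alpha / r.
        return self.alpha / self.rank


# Layers that the note above says were left out of LoRA adaptation.
EXCLUDED_LAYERS = ("lm_head", "embed_tokens")


def is_lora_candidate(module_name: str) -> bool:
    """Return True unless a component of the dotted name is an excluded layer."""
    return not any(part in EXCLUDED_LAYERS for part in module_name.split("."))


def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps (assumes no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

With the listed values, `LoraHyperparams().scaling` is 128 / 64, so the adapter update is weighted by a factor of 2. A path such as `language_model.lm_head` (an illustrative name) is rejected by `is_lora_candidate`, while an attention projection path is accepted.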

Evaluation

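The fine-tuned model outperforms the baseline LLaMA 3.2 model on every reported metric. For abnormality detection, AUC rises from 0.51 to 0.98, Macro F1 rises from 0.33 to 0.74, and Hamming Loss falls from 0.49 to 0.11. For report generation, the Report Score rises from 47.8 to 85.4.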

Task                   Metric        Baseline  Ours (Fine-tuned)
Abnormality Detection  AUC           0.51      0.98
Abnormality Detection  Macro F1      0.33      0.74
Abnormality Detection  Hamming Loss  0.49      0.11
Report Generation      Report Score  47.8      85.4

Report Score was evaluated using GPT-4o against expert-annotated ground truth reports.
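
For readers unfamiliar with the abnormality detection metrics, the sketch below implements the textbook definitions of Hamming Loss and Macro F1 for multi-label predictions in plain Python. It assumes labels are encoded as 0/1 indicator rows and that a label with no true or predicted positives gets an F1 of 0. The evaluation code and label set used in the paper are not shown here, so the functions and toy data are illustrative only.

```python
from typing import Sequence

Labels = Sequence[Sequence[int]]  # one row of 0/1 indicators per ECG


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for t_row, p_row in zip(y_true, y_pred)
        for t, p in zip(t_row, p_row)
    )
    return wrong / total


def macro_f1(y_true: Labels, y_pred: Labels) -> float:
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # Assumed convention: F1 is 0 when a label never occurs or is predicted.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


# Toy data: 4 ECGs, 3 hypothetical abnormality labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

print(hamming_loss(y_true, y_pred))  # 3 wrong slots out of 12
print(macro_f1(y_true, y_pred))      # mean of per-label F1: 1, 1/2, 2/3
```

Lower is better for Hamming Loss and higher is better for Macro F1, which is why the two columns move in opposite directions in the table above.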

Citation

If you use this model in your research, please cite our paper:

@misc{nandakishor2025highaccuracy,
      title={High-Accuracy ECG Image Interpretation using Parameter-Efficient LoRA Fine-Tuning with Multimodal LLaMA 3.2},
      author={Nandakishor M and Anjali M},
      year={2025},
      eprint={2501.18670},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
