High-Accuracy ECG Image Interpretation with LLaMA 3.2
This repository contains the official fine-tuned model from the paper: "High-Accuracy ECG Image Interpretation using Parameter-Efficient LoRA Fine-Tuning with Multimodal LLaMA 3.2".
Paper: arXiv:2501.18670
This model was developed by Nandakishor M and Anjali M at Convai Innovations. It is designed to provide high-accuracy, comprehensive interpretation of electrocardiogram (ECG) images.
Model Details
- Base Model: unsloth/Llama-3.2-11B-Vision-Instruct
- Fine-tuning Strategy: Parameter-Efficient LoRA
- Dataset: ECGInstruct, a large-scale dataset with 1 million instruction-following samples derived from public sources such as MIMIC-IV ECG and PTB-XL.
- Primary Use: Automated analysis and report generation from ECG images to assist cardiologists and medical professionals in diagnosing a wide range of cardiac conditions.
How to Use
This model was trained using Unsloth to achieve high performance and memory efficiency. The following code provides a complete example of how to load the model in 4-bit precision and run inference.
You can run this code on a free Google Colab instance.
import torch
from unsloth import FastVisionModel
from transformers import AutoProcessor, TextStreamer
from PIL import Image
from IPython.display import display
# Make sure you have an ECG image file, e.g., 'my_ecg.jpg'
image_path = "my_ecg.jpg"
# Load the 4-bit quantized model and processor
model, processor = FastVisionModel.from_pretrained(
    model_name="convaiinnovations/ECG-Instruct-Llama-3.2-11B-Vision",
    max_seq_length=4096,
    dtype=None,
    load_in_4bit=True,
    device_map="cuda",
)
# Enable fast inference mode
FastVisionModel.for_inference(model)
# Load the image
image = Image.open(image_path).convert("RGB")
# Define the instruction
query = "You are an expert cardiologist. Write an in-depth diagnosis report from this ECG data, including the final diagnosis."
# Prepare the prompt
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": query},
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
# Process inputs
inputs = processor(
    text=input_text,
    images=image,
    add_special_tokens=False,  # the chat template already inserts the BOS token
    return_tensors="pt",
).to("cuda")
# Set up streamer for token-by-token output
text_streamer = TextStreamer(processor.tokenizer, skip_prompt=True)
# Generate the report
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=512,
    use_cache=True,
    temperature=0.2,
    min_p=0.1,
)
# To see the input image in a notebook:
# display(image.resize((600, 400)))
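The 4-bit loading used above matters mainly because of weight storage. The following is a rough back-of-the-envelope sketch, not a measurement: it assumes about 11 billion parameters (read off the "11B" in the model name) and counts weights only, ignoring quantization metadata, activations, and the KV cache. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Assumption: ~11 billion parameters, taken from the "11B" in the model name.
NUM_PARAMS = 11e9

for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink from roughly 22 GB at 16-bit to roughly 5.5 GB at 4-bit; actual GPU usage during inference will be higher.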
Training and Fine-tuning
The model was fine-tuned on the ECGInstruct dataset using a parameter-efficient LoRA strategy, which significantly improves performance on ECG interpretation tasks while preserving the base model's extensive knowledge.
Key Hyperparameters:
- LoRA Rank (r): 64
- LoRA Alpha (alpha): 128
- LoRA Dropout: 0.05
- Learning Rate: 2e-4 with a cosine scheduler
- Epochs: 3
- Hardware: 4x NVIDIA A100 80GB GPUs
- Framework: Unsloth with DeepSpeed ZeRO-2
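To make the "2e-4 with a cosine scheduler" entry concrete, here is a minimal sketch of a cosine learning-rate decay. It assumes no warmup and a decay all the way to zero, neither of which is stated in this card, and the total step count is a hypothetical placeholder.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 2e-4) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical; the real step count depends on batch size and epochs

for step in (0, 250, 500, 750, 1000):
    print(f"step {step:4d}: lr = {cosine_lr(step, TOTAL_STEPS):.2e}")
```

The rate starts at the peak value, passes through half the peak at the midpoint, and flattens out as it approaches the final step.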
Note: As described in the paper, the lm_head and embed_tokens layers were excluded from LoRA adaptation to maintain generation stability.
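The rank and alpha values above can be read through the standard LoRA formulation, in which the weight update is (alpha / r) * B @ A with A of shape r x d_in and B of shape d_out x r. The sketch below assumes that standard formulation and uses a hypothetical 4096 x 4096 projection purely for illustration; it does not describe which layers the authors adapted.

```python
def lora_scaling(r: int, alpha: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


R, ALPHA = 64, 128      # values from the hyperparameter list
D_IN = D_OUT = 4096     # hypothetical layer size, for illustration only

added = lora_param_count(D_IN, D_OUT, R)
full = D_IN * D_OUT

print(f"scaling factor alpha/r: {lora_scaling(R, ALPHA)}")
print(f"adapter parameters:     {added}")
print(f"fraction of full layer: {added / full:.4f}")
```

For a layer of this size, the adapter trains about 3% as many parameters as the full weight matrix, which is what makes the approach parameter-efficient.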
Evaluation
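The fine-tuned model demonstrates state-of-the-art performance and outperforms the baseline LLaMA 3.2 model on every metric in the table below. For example, abnormality-detection AUC rises from 0.51 to 0.98 and the report score from 47.8 to 85.4.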
Task | Metric | Baseline | Ours (Fine-tuned) |
---|---|---|---|
Abnormality Detection | AUC | 0.51 | 0.98 |
Abnormality Detection | Macro F1 | 0.33 | 0.74 |
Abnormality Detection | Hamming Loss | 0.49 | 0.11 |
Report Generation | Report Score | 47.8 | 85.4 |
Report Score was evaluated using GPT-4o against expert-annotated ground truth reports.
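For readers unfamiliar with the multi-label metrics in the table, the sketch below shows how Hamming loss (lower is better) and macro F1 (higher is better) are commonly computed from binary label matrices. The toy labels are invented for illustration, and the paper's exact protocol (label set, decision thresholds, zero-division handling) is not specified here, so treat this as an assumption-laden illustration, not the authors' evaluation code.

```python
from typing import List

Labels = List[List[int]]  # one row per ECG, one 0/1 column per condition


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def macro_f1(y_true: Labels, y_pred: Labels) -> float:
    """Unweighted mean of per-label F1; a label with no positives scores 0."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


# Toy example: 4 ECGs, 3 hypothetical conditions.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

print(f"Hamming loss: {hamming_loss(y_true, y_pred):.4f}")
print(f"Macro F1:     {macro_f1(y_true, y_pred):.4f}")
```

Because macro F1 averages over labels without weighting, rare conditions count as much as common ones, which is one reason it can sit well below AUC on imbalanced data.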
Citation
If you use this model in your research, please cite our paper:
@misc{nandakishor2025highaccuracy,
title={High-Accuracy ECG Image Interpretation using Parameter-Efficient LoRA Fine-Tuning with Multimodal LLaMA 3.2},
author={Nandakishor M and Anjali M},
year={2025},
eprint={2501.18670},
archivePrefix={arXiv},
primaryClass={cs.CV}
}