Disentangling Reasoning and Knowledge in Medical Large Language Models
Introduction

Medical reasoning in large language models aims to replicate clinicians' cognitive processes when interpreting patient data and making diagnostic decisions. However, widely used benchmarks, such as MedQA-USMLE, MedMCQA, and PubMedQA, mix questions that require multi-step reasoning with those answerable through factual recall, complicating reasoning evaluation. To address this, we develop a PubMedBERT-based classifier (81% agreement with expert annotations) to disentangle reasoning-heavy from knowledge-heavy questions across 11 biomedical QA benchmarks, revealing that only 32.8% require complex reasoning. Using this stratification, we evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), and consistently observe lower performance on reasoning-heavy than knowledge-heavy questions (e.g., HuatuoGPT-o1: 56.9% on knowledge vs. 44.8% on reasoning). To assess robustness, we conduct adversarial evaluations where models are prefilled with incorrect answers before being asked to reconsider. Biomedical models show substantial degradation in this setting (e.g., MedReason drops from 50.4% to 24.4%), while RL-trained and larger general-domain models are more resilient. Performance declines more on reasoning-heavy questions, highlighting the brittleness of current medical reasoning capabilities. Based on these insights, we train BioMed-R1 models using supervised fine-tuning and reinforcement learning on reasoning-heavy and adversarial examples, encouraging self-correction and backtracking. Our models achieve the strongest overall and adversarial performance among similarly sized biomedical LLMs, yet ample room for improvement remains. Incorporating additional reasoning-rich data sources, such as clinical case reports, and developing training strategies that promote reasoning under uncertainty may further enhance robustness and diagnostic reliability.
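
To make the adversarial setup concrete, the minimal sketch below shows one way a wrong answer might be prefilled into the conversation before asking the model to reconsider. The message wording, the choice of prefilled answer, and the decoding settings are illustrative assumptions, not the exact protocol from the paper.

# Hypothetical sketch of the adversarial "prefilled incorrect answer" evaluation.
# The exact prompts, roles, and decoding settings used in the paper may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "zou-lab/BioMed-R1-32B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

question = ("Does vagus nerve contribute to the development of steatohepatitis and obesity "
            "in phosphatidylethanolamine N-methyltransferase deficient mice?")

messages = [
    {"role": "user", "content": question},
    # Prefill an answer that is assumed to be incorrect for this illustration.
    {"role": "assistant", "content": "The answer is no."},
    {"role": "user", "content": "Please reconsider your reasoning carefully and give your final answer."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens (the model's reconsidered answer).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))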

BioMed-R1 can be used just like Qwen/Qwen2.5-32B-Instruct. You can deploy it with tools like vLLM or SGLang (a vLLM sketch follows the example below), or perform direct inference:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("zou-lab/BioMed-R1-32B", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("zou-lab/BioMed-R1-32B")

input_text = "Does vagus nerve contribute to the development of steatohepatitis and obesity in phosphatidylethanolamine N-methyltransferase deficient mice?"
messages = [{"role": "user", "content": input_text}]

# Apply the chat template, tokenize, and generate a response
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
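
For serving or batched inference, a minimal offline vLLM sketch could look like the following; the example question and sampling settings are assumptions, and a 32B model requires sufficient GPU memory. SGLang offers a similar interface.

# Illustrative offline inference with vLLM; question and sampling parameters are assumptions.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "zou-lab/BioMed-R1-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name)

messages = [{"role": "user", "content": "What are common causes of steatohepatitis?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(temperature=0.6, max_tokens=2048)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)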
Acknowledgement
We gratefully acknowledge the contributions of HuatuoGPT-o1, MedReason, and M1.
We also thank the developers of the outstanding tools Curator, TRL, vLLM, and SGLang, which made this work possible.
Citation
@article{thapa2025disentangling,
  title={Disentangling Reasoning and Knowledge in Medical Large Language Models},
  author={Thapa, Rahul and Wu, Qingyang and Wu, Kevin and Zhang, Harrison and Zhang, Angela and Wu, Eric and Ye, Haotian and Bedi, Suhana and Aresh, Nevin and Boen, Joseph and Reddy, Shriya and Athiwaratkun, Ben and Song, Shuaiwen Leon and Zou, James},
  journal={arXiv preprint arXiv:2505.11462},
  year={2025},
  url={https://arxiv.org/abs/2505.11462}
}