Disentangling Reasoning and Knowledge in Medical Large Language Models
Introduction

Medical reasoning in large language models aims to replicate clinicians' cognitive processes when interpreting patient data and making diagnostic decisions. However, widely used benchmarks, such as MedQA-USMLE, MedMCQA, and PubMedQA, mix questions that require multi-step reasoning with those answerable through factual recall, complicating reasoning evaluation. To address this, we develop a PubMedBERT-based classifier (81% agreement with expert annotations) to disentangle reasoning-heavy from knowledge-heavy questions across 11 biomedical QA benchmarks, revealing that only 32.8% require complex reasoning. Using this stratification, we evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), and consistently observe lower performance on reasoning-heavy than knowledge-heavy questions (e.g., HuatuoGPT-o1: 56.9% on knowledge vs. 44.8% on reasoning). To assess robustness, we conduct adversarial evaluations where models are prefilled with incorrect answers before being asked to reconsider. Biomedical models show substantial degradation in this setting (e.g., MedReason drops from 50.4% to 24.4%), while RL-trained and larger general-domain models are more resilient. Performance declines more on reasoning-heavy questions, highlighting the brittleness of current medical reasoning capabilities. Based on these insights, we train BioMed-R1 models using supervised fine-tuning and reinforcement learning on reasoning-heavy and adversarial examples, encouraging self-correction and backtracking. Our models achieve the strongest overall and adversarial performance among similarly sized biomedical LLMs, yet ample room for improvement remains. Incorporating additional reasoning-rich data sources, such as clinical case reports, and developing training strategies that promote reasoning under uncertainty may further enhance robustness and diagnostic reliability.
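
To make the adversarial setup concrete, the minimal sketch below shows one way a wrong answer might be prefilled into the conversation before asking the model to reconsider. The message wording, the choice of prefilled answer, and the decoding settings are illustrative assumptions, not the exact protocol from the paper.

# Hypothetical sketch of the adversarial "prefilled incorrect answer" evaluation.
# The exact prompts, roles, and decoding settings used in the paper may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "zou-lab/BioMed-R1-32B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

question = ("Does vagus nerve contribute to the development of steatohepatitis and obesity "
            "in phosphatidylethanolamine N-methyltransferase deficient mice?")

messages = [
    {"role": "user", "content": question},
    # Prefill an answer that is assumed to be incorrect for this illustration.
    {"role": "assistant", "content": "The answer is no."},
    {"role": "user", "content": "Please reconsider your reasoning carefully and give your final answer."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens (the model's reconsidered answer).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))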

BioMed-R1 can be used just like Qwen/Qwen2.5-32B-Instruct. You can deploy it with tools like vLLM or SGLang (a vLLM sketch follows the example below), or perform direct inference:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("zou-lab/BioMed-R1-32B", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("zou-lab/BioMed-R1-32B")

input_text = "Does vagus nerve contribute to the development of steatohepatitis and obesity in phosphatidylethanolamine N-methyltransferase deficient mice?"
messages = [{"role": "user", "content": input_text}]

# Apply the chat template, tokenize, and generate a response
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
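
For serving or batched inference, a minimal offline vLLM sketch could look like the following; the example question and sampling settings are assumptions, and a 32B model requires sufficient GPU memory. SGLang offers a similar interface.

# Illustrative offline inference with vLLM; question and sampling parameters are assumptions.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "zou-lab/BioMed-R1-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name)

messages = [{"role": "user", "content": "What are common causes of steatohepatitis?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(temperature=0.6, max_tokens=2048)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)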
Acknowledgement
We gratefully acknowledge the contributions of HuatuoGPT-o1, MedReason, and M1.
We also thank the developers of the outstanding tools Curator, TRL, vLLM, and SGLang, which made this work possible.
Citation
@article{thapa2025disentangling,
  title={Disentangling Reasoning and Knowledge in Medical Large Language Models},
  author={Thapa, Rahul and Wu, Qingyang and Wu, Kevin and Zhang, Harrison and Zhang, Angela and Wu, Eric and Ye, Haotian and Bedi, Suhana and Aresh, Nevin and Boen, Joseph and Reddy, Shriya and Athiwaratkun, Ben and Song, Shuaiwen Leon and Zou, James},
  journal={arXiv preprint arXiv:2505.11462},
  year={2025},
  url={https://arxiv.org/abs/2505.11462}
}