
Disentangling Reasoning and Knowledge in Medical Large Language Models

Introduction

(Figure: overall workflow)

Medical reasoning in large language models aims to replicate clinicians' cognitive processes when interpreting patient data and making diagnostic decisions. However, widely used benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA mix questions that require multi-step reasoning with questions answerable through factual recall, which complicates the evaluation of reasoning. To address this, we develop a PubMedBERT-based classifier (81% agreement with expert annotations) to disentangle reasoning-heavy from knowledge-heavy questions across 11 biomedical QA benchmarks, revealing that only 32.8% require complex reasoning. Using this stratification, we evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), and consistently observe lower performance on reasoning than on knowledge questions (e.g., HuatuoGPT-o1: 56.9% on knowledge vs. 44.8% on reasoning). To assess robustness, we conduct adversarial evaluations in which models are prefilled with incorrect answers before being asked to reconsider. Biomedical models degrade substantially in this setting (e.g., MedReason drops from 50.4% to 24.4%), while RL-trained and larger general-domain models are more resilient. Performance declines more on reasoning-heavy questions, highlighting the brittleness of current medical reasoning capabilities. Based on these insights, we train BioMed-R1 models using supervised fine-tuning and reinforcement learning on reasoning-heavy and adversarial examples, encouraging self-correction and backtracking. Our models achieve the strongest overall and adversarial performance among similarly sized biomedical LLMs, yet ample room for improvement remains. Incorporating additional reasoning-rich data sources, such as clinical case reports, and developing training strategies that promote reasoning under uncertainty may further enhance robustness and diagnostic reliability.

(Figure: reasoning vs. knowledge stratification)
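
The adversarial probe described above prefills the model with an incorrect answer and then asks it to reconsider. Below is a minimal sketch of how such a conversation could be assembled with the chat template; the question, the wrong-answer wording, and the follow-up prompt are illustrative assumptions, not the exact protocol from the paper.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zou-lab/BioMed-R1-32B")

# Illustrative question with a deliberately incorrect prefilled answer
# (hyperkalemia, not hypokalemia, is the typical finding).
question = "Which electrolyte abnormality is most typical of primary adrenal insufficiency?"
messages = [
    {"role": "user", "content": question},
    {"role": "assistant", "content": "The answer is hypokalemia."},  # prefilled incorrect answer
    {"role": "user", "content": "Are you sure? Please reconsider and give your final answer."},
]

# The resulting prompt can be passed to generate() exactly as in the usage example below.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)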

Usage

BioMed-R1 can be used just like Qwen/Qwen2.5-32B-Instruct. You can deploy it with tools such as vLLM or SGLang, or run direct inference with Transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer (torch_dtype="auto" picks up the BF16 weights).
model = AutoModelForCausalLM.from_pretrained(
    "zou-lab/BioMed-R1-32B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("zou-lab/BioMed-R1-32B")

input_text = "Does vagus nerve contribute to the development of steatohepatitis and obesity in phosphatidylethanolamine N-methyltransferase deficient mice?"
messages = [{"role": "user", "content": input_text}]

# Apply the chat template, tokenize, and generate the response.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
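
For higher-throughput or batched inference, the same prompt can instead be served with vLLM. The sketch below uses vLLM's offline LLM API; the tensor_parallel_size and sampling values are placeholders to adapt to your hardware, not recommended defaults.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("zou-lab/BioMed-R1-32B")
# tensor_parallel_size is a placeholder; set it to the number of GPUs available.
llm = LLM(model="zou-lab/BioMed-R1-32B", tensor_parallel_size=2)

messages = [{"role": "user", "content": "Does vagus nerve contribute to the development of steatohepatitis and obesity in phosphatidylethanolamine N-methyltransferase deficient mice?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Illustrative sampling settings.
sampling_params = SamplingParams(temperature=0.6, max_tokens=2048)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)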

๐Ÿ™๐Ÿผ Acknowledgement

We gratefully acknowledge the contributions of HuatuoGPT-o1, MedReason, and M1.
We also thank the developers of the outstanding tools Curator, TRL, vLLM, and SGLang, which made this work possible.

📖 Citation

@article{thapa2025disentangling,
  title={Disentangling Reasoning and Knowledge in Medical Large Language Models},
  author={Thapa, Rahul and Wu, Qingyang and Wu, Kevin and Zhang, Harrison and Zhang, Angela and Wu, Eric and Ye, Haotian and Bedi, Suhana and Aresh, Nevin and Boen, Joseph and Reddy, Shriya and Athiwaratkun, Ben and Song, Shuaiwen Leon and Zou, James},
  journal={arXiv preprint arXiv:2505.11462},
  year={2025},
  url={https://arxiv.org/abs/2505.11462}
}