Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration


Junqi Gao 1,2, Zhichang Guo 2, Dazhi Zhang 2, Dong Li 1,2, Runze Liu 3, Pengfei Li 2, Kai Tian 4, Biqing Qi 1,†

1 Shanghai Artificial Intelligence Laboratory

2 School of Mathematics, Harbin Institute of Technology

3 Tsinghua Shenzhen International Graduate School, Tsinghua University

4 Department of Electronic Engineering, Tsinghua University

† Corresponding Author

πŸ“„ Introduction

Bohdi is a novel framework for heterogeneous Large Language Model (LLM) fusion that integrates the strengths of multiple source LLMs into a target LLM through adaptive knowledge exploration and automatic data generation. Unlike existing methods, which rely on real data from a limited set of domains and fix the data allocation proportions in advance, Bohdi dynamically adjusts sampling according to the target LLM's performance and generates data automatically via a hierarchical knowledge tree. This ensures comprehensive domain coverage and balanced capability enhancement without any real data. Our GitHub page is Bohdi.

✨ Features

πŸš€ Synthetic-Data-Only Fusion: Bohdi operates without relying on real data, making it highly efficient and versatile.

🌳 Dynamic Domain Exploration: Through the hierarchical knowledge tree and Sprout/Harvest operations, Bohdi explores new domains and generates data automatically.

πŸ”„ Adaptive Data Allocation: The DynaBranches mechanism with IR dynamically adjusts data sampling proportions according to the target LLM's capabilities; a toy sketch of these operations follows below.
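
To make these mechanisms concrete, below is a toy Python sketch of a hierarchical knowledge tree with Sprout/Harvest operations and accuracy-driven sampling weights. Everything here (KnowledgeTree, sprout, harvest, update) is a hypothetical illustration under simplified assumptions, not the actual Bohdi implementation; see the paper for the real DynaBranches and IR definitions.

# Illustrative sketch only -- all names are hypothetical, not Bohdi's API.
import random
from collections import defaultdict

class KnowledgeTree:
    """Toy hierarchical knowledge tree: node -> list of sub-domains."""
    def __init__(self):
        self.children = defaultdict(list)
        self.weight = defaultdict(lambda: 1.0)  # per-node sampling weight

    def sprout(self, node, new_topics):
        """Sprout: expand a node with newly explored sub-domains."""
        self.children[node].extend(new_topics)

    def harvest(self, node):
        """Harvest: pick a sub-domain to generate synthetic data for."""
        leaves = self.children[node] or [node]
        weights = [self.weight[leaf] for leaf in leaves]
        return random.choices(leaves, weights=weights, k=1)[0]

    def update(self, node, target_accuracy):
        """Adaptive allocation: sample weaker domains more often."""
        self.weight[node] = max(0.1, 1.0 - target_accuracy)

tree = KnowledgeTree()
tree.sprout("root", ["math", "coding", "instruction-following"])
for step in range(3):
    topic = tree.harvest("root")
    acc = random.random()  # stand-in for the target LLM's measured accuracy
    tree.update(topic, acc)
    print(f"step {step}: generate synthetic data for '{topic}' (acc={acc:.2f})")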

βš™οΈ Installation

Main Environment for Distillation

conda env create -f environment_Bohdi.yaml

Environment for Evaluation

conda env create -f opencompass_env.yaml

Preparation for Evaluation Suite

# The version we used: opencompass 0.3.4
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
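
To verify that the editable install picked up the expected version, you can run a quick Python check (assuming the package registers itself under the distribution name opencompass):

# Sanity check: should print 0.3.4 for our setup.
from importlib.metadata import version
print(version("opencompass"))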

⏳ Distillation Training

To train the target LLM using Bohdi, follow these steps:

  1. Prepare Source LLMs: Ensure you have access to the source LLMs you want to fuse. To follow our setup, download the following models (a download sketch is provided after this list):
    # Source Models
    Qwen/Qwen2.5-14B-Instruct
    mistralai/Mistral-Small-24B-Instruct-2501
    microsoft/phi-4
    # Target Models
    meta-llama/Llama-3.2-3B-Instruct
    meta-llama/Llama-3.1-8B-Instruct
    Qwen/Qwen2.5-7B-Instruct
    google/gemma-2-9b-it
    
  2. Run Bohdi for Distillation: First configure the relevant paths in run_bohdi.sh according to your actual paths, and then run:
    source activate bohdi
    cd <your_project_path>
    bash run_bohdi.sh
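
As referenced in step 1, one possible way to fetch all of the checkpoints above is via the huggingface_hub library (an assumption on our part; any download method works). Note that the gated repositories (meta-llama, google) require accepting their licenses and logging in with huggingface-cli login first; adjust local_dir to match the paths configured in run_bohdi.sh.

# Hypothetical download helper; requires `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

MODELS = [
    # Source models
    "Qwen/Qwen2.5-14B-Instruct",
    "mistralai/Mistral-Small-24B-Instruct-2501",
    "microsoft/phi-4",
    # Target models
    "meta-llama/Llama-3.2-3B-Instruct",
    "meta-llama/Llama-3.1-8B-Instruct",
    "Qwen/Qwen2.5-7B-Instruct",
    "google/gemma-2-9b-it",
]

for repo_id in MODELS:
    snapshot_download(repo_id=repo_id, local_dir=f"models/{repo_id.split('/')[-1]}")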
    

πŸ“ Evaluation

We use OpenCompass for evaluation, with inference performed via vLLM. To evaluate your model, please configure the relevant paths in eval_opencompass.sh according to your actual paths, and then run:

source activate opencompass
cd <your_project_path>
bash eval_opencompass.sh
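
For a quick qualitative check of the released checkpoint without the full OpenCompass pipeline, it can also be loaded directly with the transformers library. This is a minimal sketch assuming transformers and torch are installed; the prompt is purely illustrative.

# Minimal inference sketch (not part of the evaluation pipeline).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ChetKao/Bohdi-Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Briefly explain heterogeneous model fusion."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))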

πŸ“š Citation

@article{gao2025bohdi,
  title={Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration},
  author={Junqi Gao and Zhichang Guo and Dazhi Zhang and Dong Li and Runze Liu and Pengfei Li and Kai Tian and Biqing Qi},
  journal={arXiv preprint arXiv:2506.15721},
  year={2025},
  url={https://doi.org/10.48550/arXiv.2506.15721}
}
πŸ”§ Model Details

Model: ChetKao/Bohdi-Qwen2.5-7B-Instruct (Safetensors, 7.62B params, BF16)
Base model: Qwen/Qwen2.5-7B
Quantizations: 2 models available