Junqi Gao 1,2, Zhichang Guo 2, Dazhi Zhang 2, Dong Li 1,2, Runze Liu 3, Pengfei Li 2, Kai Tian 4, Biqing Qi 1,†
1 Shanghai Artificial Intelligence Laboratory
2 School of Mathematics, Harbin Institute of Technology
3 Tsinghua Shenzhen International Graduate School, Tsinghua University
4 Department of Electronic Engineering, Tsinghua University
† Corresponding Author
Introduction
Bohdi is a framework for heterogeneous Large Language Model (LLM) fusion that integrates the strengths of multiple source LLMs into a target LLM through adaptive knowledge exploration and automatic data generation. Existing methods rely on real data from a limited set of domains and allocate that data in fixed proportions. Bohdi instead adjusts its sampling dynamically based on the target LLM's performance and generates data automatically through a hierarchical knowledge tree. This ensures comprehensive domain coverage and balanced capability enhancement without the need for real data. Our GitHub page is Bohdi.
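To make the idea of performance-driven sampling concrete, here is a minimal Python sketch. It is not Bohdi's actual allocation rule. It assumes a per-domain accuracy estimate for the target LLM and applies a plain softmax over error rates, so that weaker domains receive a larger share of the generated data. The function name `sampling_proportions`, the `temperature` parameter, and the domain names are all hypothetical.

```python
import math


def sampling_proportions(accuracy, temperature=0.5):
    """Map per-domain accuracy (0..1) to sampling proportions.

    Assumption for illustration: lower accuracy => larger share.
    This is a plain softmax over error rates, not Bohdi's mechanism.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not accuracy:
        return {}
    errors = {domain: 1.0 - acc for domain, acc in accuracy.items()}
    # Subtract the largest error before exponentiating for numerical stability.
    peak = max(errors.values())
    weights = {d: math.exp((e - peak) / temperature) for d, e in errors.items()}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical accuracy estimates for the target LLM.
    estimates = {"math": 0.40, "coding": 0.65, "reasoning": 0.80}
    for domain, share in sampling_proportions(estimates).items():
        print(f"{domain}: {share:.2f}")
```

A lower `temperature` concentrates sampling on the weakest domains, while a higher one moves the allocation toward uniform.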
Features
- Synthetic-Data-Only Fusion: Bohdi operates without relying on real data, which keeps it efficient and versatile.
- Dynamic Domain Exploration: Through the hierarchical knowledge tree and its Sprout/Harvest operations, Bohdi explores new domains and generates data automatically.
- Adaptive Data Allocation: The DynaBranches mechanism with IR dynamically adjusts data sampling proportions based on the target LLM's capabilities.
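The sketch below shows one way a hierarchical knowledge tree with Sprout and Harvest operations could be represented. It rests on two assumptions that the text above does not confirm: that Sprout attaches new sub-domains to a node, and that Harvest selects a leaf domain for which data is then generated. The class `KnowledgeNode`, its methods, and the domain names are hypothetical and are not taken from the Bohdi codebase. Hard-coded values and a seeded random choice stand in for the automatic exploration and generation steps.

```python
import random
from dataclasses import dataclass, field


@dataclass
class KnowledgeNode:
    """One domain in a hypothetical hierarchical knowledge tree."""
    name: str
    children: list = field(default_factory=list)

    def sprout(self, sub_domains):
        """Assumed Sprout step: attach new sub-domains, skipping duplicates."""
        known = {child.name for child in self.children}
        for sub in sub_domains:
            if sub not in known:
                self.children.append(KnowledgeNode(sub))
                known.add(sub)
        return self

    def harvest(self, rng):
        """Assumed Harvest step: walk to a random leaf and return its path.

        A real pipeline would hand this path to a generator model to
        produce a training example; here only the path is returned.
        """
        path, node = [self.name], self
        while node.children:
            node = rng.choice(node.children)
            path.append(node.name)
        return path


if __name__ == "__main__":
    root = KnowledgeNode("knowledge")
    root.sprout(["math", "coding"])
    root.children[0].sprout(["algebra", "geometry"])
    print(" > ".join(root.harvest(random.Random(0))))
```

Replacing the uniform `rng.choice` with a weighted choice is where an adaptive allocation rule would plug in.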
Installation
Main Environment for Distillation
conda env create -f environment_Bohdi.yaml
Environment for Evaluation
conda env create -f opencompass_env.yaml
Preparation for Evaluation Suite
# The version we used: opencompass 0.3.4
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
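Because cloning the repository installs whatever revision is current rather than a pinned release, it can be useful to confirm which OpenCompass version ended up in the environment. The following is a small sketch using the standard library. The helper `check_version` is hypothetical, and the distribution name `opencompass` is assumed to match what the editable install registers.

```python
from importlib.metadata import PackageNotFoundError, version


def check_version(package, expected):
    """Return (ok, installed); installed is None if the package is missing."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False, None
    return installed == expected, installed


if __name__ == "__main__":
    # 0.3.4 is the version stated above; the distribution name is an assumption.
    ok, installed = check_version("opencompass", "0.3.4")
    if installed is None:
        print("opencompass is not installed in this environment")
    elif ok:
        print("opencompass 0.3.4 found")
    else:
        print(f"opencompass {installed} found, expected 0.3.4")
```

Run it inside the evaluation environment after the install step.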
Distillation Training
To train the target LLM using Bohdi, follow these steps:
- Prepare Source LLMs: Ensure you have access to the source LLMs you want to fuse. If you want to follow our setup, please download the following models:
# Source Models
Qwen/Qwen2.5-14B-Instruct
mistralai/Mistral-Small-24B-Instruct-2501
microsoft/phi-4
# Target Models
meta-llama/Llama-3.2-3B-Instruct
meta-llama/Llama-3.1-8B-Instruct
Qwen/Qwen2.5-7B-Instruct
google/gemma-2-9b-it
- Run Bohdi For Distillation
Please first configure the relevant paths in run_bohdi.sh according to your actual paths, and then run:
source activate bohdi
cd your project path
bash run_bohdi.sh
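For the model preparation step above, one possible way to fetch the listed checkpoints is the `snapshot_download` function from the `huggingface_hub` package. This is a sketch, not part of the Bohdi repository. The `models` root directory, the `org__name` folder layout, and the helper names are assumptions, and whatever paths you choose still have to be entered in run_bohdi.sh by hand. Some of these repositories are gated and may require accepting a license and logging in first.

```python
from pathlib import Path

SOURCE_MODELS = [
    "Qwen/Qwen2.5-14B-Instruct",
    "mistralai/Mistral-Small-24B-Instruct-2501",
    "microsoft/phi-4",
]
TARGET_MODELS = [
    "meta-llama/Llama-3.2-3B-Instruct",
    "meta-llama/Llama-3.1-8B-Instruct",
    "Qwen/Qwen2.5-7B-Instruct",
    "google/gemma-2-9b-it",
]


def local_dir_for(repo_id, root):
    """Map 'org/name' to '<root>/org__name' (a hypothetical layout)."""
    return Path(root) / repo_id.replace("/", "__")


def download_all(repo_ids, root="models"):
    """Download each repository snapshot into its own folder under root."""
    # Imported lazily so the helpers above work without huggingface_hub.
    from huggingface_hub import snapshot_download

    for repo_id in repo_ids:
        target = local_dir_for(repo_id, root)
        snapshot_download(repo_id=repo_id, local_dir=str(target))
        print(f"downloaded {repo_id} -> {target}")


if __name__ == "__main__":
    # Trim TARGET_MODELS to the single target you intend to train.
    download_all(SOURCE_MODELS + TARGET_MODELS)
```

The checkpoints are large, so downloading only the source models plus one target model saves considerable disk space.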
Evaluation
We use OpenCompass for evaluation and run inference with vLLM. To evaluate your model, first configure the relevant paths in eval_opencompass.sh according to your actual paths, and then run:
source activate opencompass
cd your project path
bash eval_opencompass.sh
Citation
@article{gao2025bohdi,
title={Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration},
author={Junqi Gao and Zhichang Guo and Dazhi Zhang and Dong Li and Runze Liu and Pengfei Li and Kai Tian and Biqing Qi},
journal={arXiv preprint arXiv:2506.15721},
year={2025},
url={https://doi.org/10.48550/arXiv.2506.15721}
}