Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration


Junqi Gao 1,2, Zhichang Guo 2, Dazhi Zhang 2, Dong Li 1,2, Runze Liu 3, Pengfei Li 2, Kai Tian 4, Biqing Qi 1,†

1 Shanghai Artificial Intelligence Laboratory

2 School of Mathematics, Harbin Institute of Technology

3 Tsinghua Shenzhen International Graduate School, Tsinghua University

4 Department of Electronic Engineering, Tsinghua University

† Corresponding Author

πŸ“„ Introduction

Bohdi is a novel framework for heterogeneous Large Language Model (LLM) fusion that integrates the strengths of multiple source LLMs into a target LLM through adaptive knowledge exploration and automatic data generation. Unlike existing methods, which rely on real data from a limited set of domains and fix the data allocation proportions in advance, Bohdi dynamically adjusts sampling according to the target LLM's performance and generates data automatically via a hierarchical knowledge tree. This ensures comprehensive domain coverage and balanced capability enhancement without any real data. Our GitHub page is Bohdi.

✨ Features

πŸš€ Synthetic-Data-Only Fusion: Bohdi operates without relying on real data, making it highly efficient and versatile.

🌳 Dynamic Domain Exploration: Through the hierarchical knowledge tree and Sprout/Harvest operations, Bohdi explores new domains and generates data automatically.

πŸ”„ Adaptive Data Allocation: The DynaBranches mechanism with IR dynamically adjusts data sampling proportions according to the target LLM's capabilities; a toy sketch of these operations follows below.
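
To make these mechanisms concrete, below is a toy Python sketch of a hierarchical knowledge tree with Sprout/Harvest operations and accuracy-driven sampling weights. Everything here (KnowledgeTree, sprout, harvest, update) is a hypothetical illustration under simplified assumptions, not the actual Bohdi implementation; see the paper for the real DynaBranches and IR definitions.

# Illustrative sketch only -- all names are hypothetical, not Bohdi's API.
import random
from collections import defaultdict

class KnowledgeTree:
    """Toy hierarchical knowledge tree: node -> list of sub-domains."""
    def __init__(self):
        self.children = defaultdict(list)
        self.weight = defaultdict(lambda: 1.0)  # per-node sampling weight

    def sprout(self, node, new_topics):
        """Sprout: expand a node with newly explored sub-domains."""
        self.children[node].extend(new_topics)

    def harvest(self, node):
        """Harvest: pick a sub-domain to generate synthetic data for."""
        leaves = self.children[node] or [node]
        weights = [self.weight[leaf] for leaf in leaves]
        return random.choices(leaves, weights=weights, k=1)[0]

    def update(self, node, target_accuracy):
        """Adaptive allocation: sample weaker domains more often."""
        self.weight[node] = max(0.1, 1.0 - target_accuracy)

tree = KnowledgeTree()
tree.sprout("root", ["math", "coding", "instruction-following"])
for step in range(3):
    topic = tree.harvest("root")
    acc = random.random()  # stand-in for the target LLM's measured accuracy
    tree.update(topic, acc)
    print(f"step {step}: generate synthetic data for '{topic}' (acc={acc:.2f})")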

βš™οΈ Installation

Main Environment for Distillation

conda env create -f environment_Bohdi.yaml

Environment for Evaluation

conda env create -f opencompass_env.yaml

Preparation for Evaluation Suite

# The version we used: opencompass 0.3.4
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
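
To verify that the editable install picked up the expected version, you can run a quick Python check (assuming the package registers itself under the distribution name opencompass):

# Sanity check: should print 0.3.4 for our setup.
from importlib.metadata import version
print(version("opencompass"))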

⏳ Distillation Training

To train the target LLM using Bohdi, follow these steps:

  1. Prepare Source LLMs: Ensure you have access to the source LLMs you want to fuse. To follow our setup, download the following models (a download sketch is provided after this list):
    # Source Models
    Qwen/Qwen2.5-14B-Instruct
    mistralai/Mistral-Small-24B-Instruct-2501
    microsoft/phi-4
    # Target Models
    meta-llama/Llama-3.2-3B-Instruct
    meta-llama/Llama-3.1-8B-Instruct
    Qwen/Qwen2.5-7B-Instruct
    google/gemma-2-9b-it
    
  2. Run Bohdi for Distillation: First configure the relevant paths in run_bohdi.sh according to your actual paths, and then run:
    source activate bohdi
    cd <your_project_path>
    bash run_bohdi.sh
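
As referenced in step 1, one possible way to fetch all of the checkpoints above is via the huggingface_hub library (an assumption on our part; any download method works). Note that the gated repositories (meta-llama, google) require accepting their licenses and logging in with huggingface-cli login first; adjust local_dir to match the paths configured in run_bohdi.sh.

# Hypothetical download helper; requires `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

MODELS = [
    # Source models
    "Qwen/Qwen2.5-14B-Instruct",
    "mistralai/Mistral-Small-24B-Instruct-2501",
    "microsoft/phi-4",
    # Target models
    "meta-llama/Llama-3.2-3B-Instruct",
    "meta-llama/Llama-3.1-8B-Instruct",
    "Qwen/Qwen2.5-7B-Instruct",
    "google/gemma-2-9b-it",
]

for repo_id in MODELS:
    snapshot_download(repo_id=repo_id, local_dir=f"models/{repo_id.split('/')[-1]}")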
    

πŸ“ Evaluation

We use OpenCompass for evaluation, with inference performed via vLLM. To evaluate your model, please configure the relevant paths in eval_opencompass.sh according to your actual paths, and then run:

source activate opencompass
cd <your_project_path>
bash eval_opencompass.sh
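
For a quick qualitative check of the released checkpoint without the full OpenCompass pipeline, it can also be loaded directly with the transformers library. This is a minimal sketch assuming transformers and torch are installed; the prompt is purely illustrative.

# Minimal inference sketch (not part of the evaluation pipeline).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ChetKao/Bohdi-Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Briefly explain heterogeneous model fusion."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))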

πŸ“š Citation

@article{gao2025bohdi,
  title={Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration},
  author={Junqi Gao and Zhichang Guo and Dazhi Zhang and Dong Li and Runze Liu and Pengfei Li and Kai Tian and Biqing Qi},
  journal={arXiv preprint arXiv:2506.15721},
  year={2025},
  url={https://doi.org/10.48550/arXiv.2506.15721}
}
πŸ”§ Model Details

Model: ChetKao/Bohdi-Qwen2.5-7B-Instruct (Safetensors, 7.62B params, BF16)
Base model: Qwen/Qwen2.5-7B
Quantizations: 2 models available