# Speeding Up Training
Section under construction. Feel free to contribute!
## vLLM for fast generation in online methods
Online methods such as GRPO or Online DPO require the model to generate completions during training, which is often slow and can significantly increase training time. To speed up generation, you can use vLLM, a library that enables fast inference through, among other things, PagedAttention. TRL's online trainers support vLLM, greatly improving training speed.
To use vLLM, first install it:

```shell
pip install vllm
```

or install TRL with the vLLM extra:

```shell
pip install "trl[vllm]"
```
For Online DPO, enable vLLM by passing `use_vllm=True` in the training arguments:

```python
from trl import OnlineDPOConfig

training_args = OnlineDPOConfig(..., use_vllm=True)
```

For GRPO, first start a vLLM server:

```shell
trl vllm-serve --model <model_name>
```

Then run the training script, again passing `use_vllm=True` in the training arguments:

```python
from trl import GRPOConfig

training_args = GRPOConfig(..., use_vllm=True)
```
You can customize the server configuration by passing additional arguments. For more information, see vLLM integration.
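As a sketch of such customization, the server command accepts extra flags, for example to shard the served model across several GPUs or change the port. The exact flag names below are assumptions based on the TRL CLI's conventions; check `trl vllm-serve --help` for the options available in your version:

```shell
# Serve the model sharded across 2 GPUs via tensor parallelism, on a custom port.
# --tensor_parallel_size and --port are assumed flag names; verify with --help.
trl vllm-serve --model <model_name> --tensor_parallel_size 2 --port 8000
```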
When using vLLM, ensure that the GPUs assigned to training and those assigned to generation are separate to avoid resource conflicts. For instance, if you plan to use 4 GPUs for training and another 4 for vLLM generation, you can control the allocation with `CUDA_VISIBLE_DEVICES`.
Set GPUs 0-3 for vLLM generation:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model <model_name>
```

And GPUs 4-7 for training:

```shell
CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py
```
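If the vLLM server runs on a different machine or a non-default port, the trainer must be told where to reach it. The following is a minimal sketch assuming `GRPOConfig` exposes `vllm_server_host` and `vllm_server_port` options; the parameter names and the host address shown are assumptions, so confirm them against your TRL version's `GRPOConfig` reference:

```python
from trl import GRPOConfig

# Hypothetical values: vllm_server_host/vllm_server_port are assumed
# parameter names, and 192.168.1.10 is a placeholder address for the
# machine running `trl vllm-serve`.
training_args = GRPOConfig(
    ...,
    use_vllm=True,
    vllm_server_host="192.168.1.10",
    vllm_server_port=8000,
)
```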