Spaces:

ivangabriele
/

trl-sandbox

Paused

App Files Files Community

trl-sandbox / docs /source /distributing_training.md

ivangabriele

feat: initialize project

2f5127c verified 14 days ago

preview code

raw

history blame contribute delete

2.62 kB

	# Distributing Training

	<Tip warning={true}>
	Section under construction. Feel free to contribute!
	</Tip>

	## Multi-GPU Training with TRL

	The trainers in TRL use [🤗 Accelerate](https://github.com/huggingface/accelerate) to enable distributed training across multiple GPUs or nodes. To do so, first create an [🤗 Accelerate](https://github.com/huggingface/accelerate) config file by running

	```bash
	accelerate config
	```

	and answering the questions according to your multi-GPU / multi-node setup. You can then launch distributed training by running:

	```bash
	accelerate launch train.py
	```

	We also provide config files in the [examples folder](https://github.com/huggingface/trl/tree/main/examples/accelerate_configs) that can be used as templates. To use these templates, simply pass the path to the config file when launching a job, e.g.:

	```shell
	accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml train.py <SCRIPT_ARGS>
	```

	This automatically distributes the workload across all available GPUs.

	Under the hood, [🤗 Accelerate](https://github.com/huggingface/accelerate) creates one model per GPU. Each process:
	- Processes its own batch of data
	- Computes the loss and gradients for that batch
	- Shares gradient updates across all GPUs

	![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/multi_gpu.png)

	The effective batch size is calculated as:

	$$
	\text{Batch Size} = \text{per\_device\_train\_batch\_size} \times \text{num\_devices} \times \text{gradient\_accumulation\_steps}
	$$

	To maintain a consistent batch size when scaling to multiple GPUs, make sure to update `per_device_train_batch_size` and `gradient_accumulation_steps` accordingly.

	Example, these configurations are equivalent, and should yield the same results:

	\| Number of GPUs \| Per device batch size \| Gradient accumulation steps \| Comments \|
	\| --- \| --- \| --- \| --- \|
	\| 1 \| 32 \| 1 \| Possibly high memory usage, but faster training \|
	\| 1 \| 4 \| 8 \| Lower memory usage, slower training \|
	\| 8 \| 4 \| 1 \| Multi-GPU to get the best of both worlds \|

	<Tip>

	Having one model per GPU can lead to high memory usage, which may not be feasible for large models or low-memory GPUs. In such cases, you can leverage [DeepSpeed](https://github.com/deepspeedai/DeepSpeed), which provides optimizations like model sharding, Zero Redundancy Optimizer, mixed precision training, and offloading to CPU or NVMe. Check out our [DeepSpeed Integration](deepspeed_integration.md) guide for more details.

	</Tip>

	## Multi-Nodes Training

	We're working on a guide for multi-node training. Stay tuned! 🚀