Published 6/12/2023, last updated 6/24/2024
Multi-GPU Inference with Accelerate (archived)
Run inference faster by passing prompts to multiple GPUs in parallel.
UPDATE 2024: The information in this post may be dated or inaccurate. As Alex Salinas pointed out in the comments below, this code should likely use torchpippy instead of split_between_processes.
Historically, more attention has been paid to distributed training than to distributed inference. After all, training is more computationally expensive. Larger and more complex LLMs, however, can take a long time to perform text completion tasks. Whether for research or in production, it's valuable to parallelize inference in order to maximize performance.
It's important to recognize that there's a difference between distributing the weights of one model across multiple GPUs and distributing prompts or inputs across multiple copies of a model. The first is relatively simple, while the second (which I'll be focusing on) is slightly more involved.
A week ago, in version 0.20.0, HuggingFace Accelerate released a feature that significantly simplifies multi-GPU inference: Accelerator.split_between_processes(). It's based on torch.distributed, but is much simpler to use.
Let's look at how we can use this new feature with LLaMA. The code will be written assuming that you've saved LLaMA weights in the Hugging Face Transformers format.
First, start out by importing the required modules and initializing the tokenizer and model.
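Here's a minimal sketch of what that setup might look like. The checkpoint path is a placeholder for wherever you saved the converted weights, and I also create the Accelerator here since we'll need it in a moment:

```python
from accelerate import Accelerator
from transformers import LlamaForCausalLM, LlamaTokenizer

# Placeholder path -- point this at your LLaMA weights saved in the
# Hugging Face Transformers format.
model_path = "./llama-7b-hf"

# Each parallel process gets its own Accelerator (and its own rank/device).
accelerator = Accelerator()

tokenizer = LlamaTokenizer.from_pretrained(model_path)

# device_map="auto" lets Accelerate spread the weights across available GPUs.
model = LlamaForCausalLM.from_pretrained(model_path, device_map="auto")
```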
Notice how we pass in device_map="auto". This allows Accelerate to evenly spread the model's weights across the available GPUs.

If we wanted, we could instead run model.to(accelerator.device), which would transfer the model to a specific GPU. accelerator.device is different for each process running in parallel, so you could have one copy of the model loaded onto GPU 0, another loaded onto GPU 1, and so on. In this case, though, we'll stick with device_map="auto", which allows us to use larger models than could fit on a single GPU.
Next, we'll write the code to perform inference!
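Here's a rough sketch of what that inference code might look like; the prompts and generation settings below are just illustrative:

```python
prompts = [
    "The king of France is",
    "The capital of Canada is",
    "The tallest mountain in the world is",
    "The author of Don Quixote is",
]

# split_between_processes() hands each process its own slice of the prompts.
with accelerator.split_between_processes(prompts) as subset_prompts:
    for prompt in subset_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        output = model.generate(**inputs, max_new_tokens=20)
        completion = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f"[process {accelerator.process_index}] {completion}")
```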
Finally, all that remains is to launch Accelerate using the Accelerate CLI:
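Assuming the script above is saved as inference.py (the filename is arbitrary), launching it across 4 processes might look like this:

```
accelerate launch --num_processes 4 inference.py
```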
When we run the above code, 4 copies of the model are loaded across available GPUs. Our prompts are evenly split across the 4 models, which significantly improves performance.
The output of the above code (after logs from loading the models) should look like this:
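(The exact completions will vary from run to run; the placeholders below only sketch the shape of the output, with each process printing the results for its share of the prompts.)

```
[process 0] The king of France is ...
[process 1] The capital of Canada is ...
[process 2] The tallest mountain in the world is ...
[process 3] The author of Don Quixote is ...
```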
I hope this was helpful! You can learn more about Accelerate and distributed inference in the Accelerate documentation if you're interested.
If you liked this article, don't forget to share it and follow me at @nebrelbug on X!