CAUTION: Support for GPU acceleration is preliminary. There are known issues.
In general, all backends supported by GGML are available, with a focus on the backends below.
| Backend | Target devices |
|---|---|
| CUDA | Nvidia GPU |
| RPC | Any |
| Vulkan | GPU |
To build with Vulkan:

```sh
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release
```

To build with CUDA:

```sh
cmake -B build -DGGML_CUDA=1
cmake --build build --config Release
```

For more information, please check out Build llama.cpp locally.
Use `-ngl` (`--n_gpu_layers`) to specify how many layers are deployed to each backend device
(just treat the word "gpu" as an alias for "backend device").
We refer to everything before the first layer as the "Prolog", and
everything after the last layer as the "Epilog". "Prolog" and "Epilog" are treated as special layers, and they can also be selected in `-ngl`
by including `prolog` and `epilog` respectively.
Suppose there is a model with 10 hidden layers:

- `-ngl 5`: put the first 5 layers on the first device;
- `-ngl 100`: put all layers on the first device;
- `-ngl 5,prolog`: put the first 5 layers and the "Prolog" layer on the first device;
- `-ngl 100,prolog,epilog`: put all layers, the "Prolog" layer, and the "Epilog" layer on the first device;
- `-ngl all`: equivalent to `-ngl 99999,prolog,epilog`.
The full format of `-ngl` is `-ngl [id:]layer_specs[;id:layer_specs]...`, where `id` is a device ID. If `id` is omitted, 0 is assumed.
`layer_specs` can be a positive integer, `prolog`, `epilog`, a comma-separated combination of these, or just `all`.
Suppose device 0 is a GPU and device 1 is a CPU: `-ngl 1:5;0:10` puts the first 5 layers on the CPU, the next 10 layers on the GPU,
and all remaining layers on the CPU by default.
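The grammar above can be sketched as a toy parser. This is purely illustrative, not the project's actual implementation: `parse_ngl` and its return shape are made up here, but it follows the documented rules (omitted `id` defaults to 0, integer specs consume hidden layers in order, `all` also covers Prolog and Epilog).

```python
def parse_ngl(spec: str, n_layers: int) -> dict:
    """Sketch of the -ngl grammar: [id:]layer_specs[;id:layer_specs]...

    Returns a mapping from item to device ID, where an item is either a
    hidden-layer index (0-based) or the string "prolog" / "epilog".
    Unassigned layers stay on the default (CPU) device and are omitted.
    """
    assign = {}
    next_layer = 0  # integer specs consume hidden layers front to back
    for group in spec.split(";"):
        dev_str, _, specs = group.rpartition(":")
        dev = int(dev_str) if dev_str else 0  # omitted id -> device 0
        for item in specs.split(","):
            if item == "all":
                # "all" = every remaining layer plus Prolog and Epilog
                for layer in range(next_layer, n_layers):
                    assign[layer] = dev
                next_layer = n_layers
                assign["prolog"] = assign["epilog"] = dev
            elif item in ("prolog", "epilog"):
                assign[item] = dev
            else:
                # a positive integer: take that many of the remaining layers
                take = min(int(item), n_layers - next_layer)
                for layer in range(next_layer, next_layer + take):
                    assign[layer] = dev
                next_layer += take
    return assign

# -ngl 1:5;0:10 on a 20-layer model:
# layers 0-4 -> device 1 (CPU), layers 5-14 -> device 0 (GPU),
# layers 15-19 unassigned (default CPU)
print(parse_ngl("1:5;0:10", 20))
```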
You can use `-mgl` (`--model_gpu_layers`) to specify how many layers of a specific model are deployed to different backend devices.
The syntax is `-mgl MODEL N`, in which `N` shares the same syntax as `-ngl`, and `MODEL` can be:

- `main`: the main model;
- `vis`: the vision accessory model (which typically projects images/videos into the LLM);
- `aud`: the audio accessory model (which typically projects audio into the LLM);
- `any`: any model.

`-ngl N` is equivalent to `-mgl any N`.
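The stated equivalence can be pictured as a simple argument rewrite. This is only a sketch of the idea (the `normalize` function and argument handling are hypothetical, not the project's code): every `-ngl SPEC` pair becomes `-mgl any SPEC`.

```python
def normalize(args: list) -> list:
    """Hypothetical rewrite pass: expand -ngl SPEC into -mgl any SPEC."""
    out = []
    i = 0
    while i < len(args):
        if args[i] == "-ngl" and i + 1 < len(args):
            out += ["-mgl", "any", args[i + 1]]
            i += 2
        else:
            out.append(args[i])
            i += 1
    return out

print(normalize(["-m", "model.bin", "-ngl", "all"]))
# → ['-m', 'model.bin', '-mgl', 'any', 'all']
```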
Tip: Use `--show_devices` to list all available devices and `--show` to check the basic hyperparameters of a model.
- Custom operators (`ggml::map_custom...`): if the hidden layers of a model use custom operators, then GPU acceleration is unavailable.
- Models with `tie_word_embeddings = true`: ensure that the "Prolog" and "Epilog" layers are on the same device.
- Other issues: if a model has 10 hidden layers and `-ngl 10` does not work, try `-ngl all`, `-ngl 10,epilog`, or `-ngl 9`.
- Having trouble with the Python binding on Windows with CUDA? Copy these DLLs to the `bindings` folder:
  `cublas64_12.dll`, `cudart64_12.dll`, `cublasLt64_12.dll`.
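The DLL-copy step can be scripted. A minimal sketch, assuming a typical CUDA 12.x install path on Windows (the `CUDA_BIN` location below is an assumption; adjust it to your installation, and run this from the repository root so `bindings` resolves correctly):

```python
import shutil
from pathlib import Path

# ASSUMPTION: adjust this to your actual CUDA installation directory.
CUDA_BIN = Path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin")
BINDINGS = Path("bindings")

# The three CUDA runtime DLLs the Python binding needs, per the note above.
DLLS = ["cublas64_12.dll", "cudart64_12.dll", "cublasLt64_12.dll"]

def copy_dlls(src: Path, dst: Path, names=DLLS) -> None:
    """Copy the listed DLLs from src into dst, creating dst if needed."""
    dst.mkdir(parents=True, exist_ok=True)
    for name in names:
        shutil.copy2(src / name, dst / name)
```

Call `copy_dlls(CUDA_BIN, BINDINGS)` after verifying the paths.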