
Commit cdd6466

Add SPMD on GPU instructions (#6684)
1 parent 39db138 commit cdd6466

File tree

2 files changed: 45 additions & 7 deletions


docs/pjrt.md

Lines changed: 3 additions & 7 deletions
@@ -28,8 +28,6 @@ _New features in PyTorch/XLA r2.0_:
* New `xm.rendezvous` implementation that scales to thousands of TPU cores
* [experimental] `torch.distributed` support for TPU v2 and v3, including
  `pjrt://` `init_method`
- * [experimental] Single-host GPU support in PJRT. Multi-host support coming
-   soon!

## TL;DR

@@ -192,8 +190,6 @@ for more information.

### GPU

- *Warning: GPU support is still highly experimental!*
-
### Single-node GPU training

To use GPUs with PJRT, simply set `PJRT_DEVICE=CUDA` and configure
@@ -226,7 +222,7 @@ PJRT_DEVICE=CUDA torchrun \
- `--nnodes`: the number of GPU machines to use.
- `--node_rank`: the index of the current GPU machine. The value can be 0, 1, ..., ${NUMBER_GPU_VM}-1.
- `--nproc_per_node`: the number of GPU devices to be used on the current machine.
- - `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form <host>:<port>. The `host` will be the internal IP address. The port can be any available port on the machine.
+ - `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form `host:port`. The `host` will be the internal IP address. The `port` can be any available port on the machine. For single-node training/inference, this parameter can be omitted.

For example, if you want to train on 2 GPU machines, machine_0 and machine_1, run the following on the first GPU machine, machine_0:

@@ -235,7 +231,7 @@ For example, if you want to train on 2 GPU machines: machine_0 and machine_1, on
--nnodes=2 \
--node_rank=0 \
--nproc_per_node=4 \
- --rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
+ --rdzv_endpoint="<MACHINE_0_INTERNAL_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```

On the second GPU machine, run
@@ -245,7 +241,7 @@ On the second GPU machine, run
--nnodes=2 \
--node_rank=1 \
--nproc_per_node=4 \
- --rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
+ --rdzv_endpoint="<MACHINE_0_INTERNAL_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```

The difference between the two commands above is `--node_rank` and, potentially, `--nproc_per_node` if you want to use a different number of GPU devices on each machine. Everything else is identical. For more information about `torchrun`, please refer to this [page](https://pytorch.org/docs/stable/elastic/run.html).
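
For context (not part of this commit's diff), a minimal sanity-check script that could be launched with the `torchrun` commands above might look like the sketch below. The file name `check_pjrt_gpu.py` is hypothetical, and the sketch assumes the standard PyTorch/XLA runtime helpers `xm.xla_device()`, `xr.global_ordinal()`, and `xr.world_size()`:

```
# check_pjrt_gpu.py -- hypothetical sanity check, not part of this commit.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr


def main():
    # Each torchrun process is bound to one XLA CUDA device when PJRT_DEVICE=CUDA.
    device = xm.xla_device()
    t = torch.randn(2, 2, device=device)
    result = (t @ t).cpu()  # materialize the computation on the GPU through XLA
    print(f"rank {xr.global_ordinal()}/{xr.world_size()} on {device}: "
          f"result shape {tuple(result.shape)}")


if __name__ == "__main__":
    main()
```

It would be launched the same way as the ImageNet example, e.g. `PJRT_DEVICE=CUDA torchrun --nnodes=2 --node_rank=0 --nproc_per_node=4 --rdzv_endpoint="<MACHINE_0_INTERNAL_IP_ADDRESS>:12355" check_pjrt_gpu.py`.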

docs/spmd.md

Lines changed: 42 additions & 0 deletions
@@ -357,6 +357,48 @@ Unlike existing DDP and FSDP, under the SPMD mode, there is always a single proc
There is no code change required to go from single TPU host to TPU Pod if you construct your mesh and partition spec based on the number of devices instead of some hardcoded constant. To run the PyTorch/XLA workload on a TPU Pod, please refer to the [Pods section](https://github.com/pytorch/xla/blob/master/docs/pjrt.md#pods) of our PJRT guide.

+ ### Running SPMD on GPU
+
+ PyTorch/XLA supports SPMD on NVIDIA GPUs (single-node or multi-node). The training/inference script remains the same as the one used for TPU, such as this [ResNet script](https://github.com/pytorch/xla/blob/1dc78948c0c9d018d8d0d2b4cce912552ab27083/test/spmd/test_train_spmd_imagenet.py). To execute the script using SPMD, we leverage `torchrun`:
+
+ ```
+ PJRT_DEVICE=CUDA \
+ torchrun \
+ --nnodes=${NUM_GPU_MACHINES} \
+ --node_rank=${RANK_OF_CURRENT_MACHINE} \
+ --nproc_per_node=1 \
+ --rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:<PORT>" \
+ training_or_inference_script_using_spmd.py
+ ```
+ - `--nnodes`: the number of GPU machines to use.
+ - `--node_rank`: the index of the current GPU machine. The value can be 0, 1, ..., ${NUM_GPU_MACHINES}-1.
+ - `--nproc_per_node`: the value must be 1 due to the SPMD requirement.
+ - `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form `host:port`. The `host` will be the internal IP address. The `port` can be any available port on the machine. For single-node training/inference, this parameter can be omitted.
+
+ For example, if you want to train a ResNet model on 2 GPU machines using SPMD, you can run the script below on the first machine:
+ ```
+ XLA_USE_SPMD=1 PJRT_DEVICE=CUDA \
+ torchrun \
+ --nnodes=2 \
+ --node_rank=0 \
+ --nproc_per_node=1 \
+ --rdzv_endpoint="<MACHINE_0_INTERNAL_IP_ADDRESS>:12355" \
+ pytorch/xla/test/spmd/test_train_spmd_imagenet.py --fake_data --batch_size 128
+ ```
+ and run the following on the second machine:
+ ```
+ XLA_USE_SPMD=1 PJRT_DEVICE=CUDA \
+ torchrun \
+ --nnodes=2 \
+ --node_rank=1 \
+ --nproc_per_node=1 \
+ --rdzv_endpoint="<MACHINE_0_INTERNAL_IP_ADDRESS>:12355" \
+ pytorch/xla/test/spmd/test_train_spmd_imagenet.py --fake_data --batch_size 128
+ ```
+
+ For more information, please refer to the [SPMD support on GPU RFC](https://github.com/pytorch/xla/issues/6256).

## Reference Examples

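As a companion to the instructions above (not part of this commit's diff), the sketch below shows how a script launched this way might build its mesh from the global device count rather than a hard-coded constant, which is also why `--nproc_per_node=1` suffices: the single SPMD process per machine drives all of the job's devices. It assumes the PyTorch/XLA SPMD APIs `xr.global_runtime_device_count()`, `xs.Mesh`, and `xs.mark_sharding`, and is illustrative rather than the exact code used by the ResNet script:

```
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

# XLA_USE_SPMD=1 (set by the launch command above) puts the runtime in SPMD mode,
# so this single process sees every GPU participating in the job.
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices, 1), ('data', 'model'))

# Shard a batch along the 'data' mesh axis; the remaining dims are replicated.
batch = torch.randn(128, 3, 224, 224).to(xm.xla_device())
xs.mark_sharding(batch, mesh, ('data', None, None, None))
```

Because the mesh shape is derived from `num_devices`, the same script runs unchanged on one machine or many, matching the "no code change" note above.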
