
Commit c011ec3 (parent: cfb5505)

Be more clear about CUDA vs. MPI_Init order.

Signed-off-by: George Bosilca <gbosilca@nvidia.com>

1 file changed: +17, -11 lines


docs/tuning-apps/networking/cuda.rst

@@ -669,17 +669,23 @@ Tuning Guide*, which can be found in the `Cornelis Networks Customer Center
 When do I need to select a CUDA device?
 ---------------------------------------
 
-"mpi-cuda-dev-selection"
-
-OpenMPI requires CUDA resources allocated for internal use. These
-are allocated lazily when they are first needed, e.g. CUDA IPC mem handles
-are created when a communication routine first requires them during a
-transfer. So, the CUDA device needs to be selected before the first MPI
-call requiring a CUDA resource. MPI_Init and most communicator related
-operations do not create any CUDA resources (guaranteed for MPI_Init,
-MPI_Comm_rank, MPI_Comm_size, MPI_Comm_split_type and MPI_Comm_free). It
-is thus possible to use those routines to query rank information and use
-those to select a GPU, e.g. using
+Open MPI requires CUDA resources allocated for internal use. When possible,
+these resources are allocated lazily when they are first needed, e.g. CUDA
+IPC mem handles are created when a communication routine first requires them
+during a transfer. MPI_Init and most communicator related operations do not
+create any CUDA resources (guaranteed at least for MPI_Comm_rank,
+MPI_Comm_size on ``MPI_COMM_WORLD``).
+
+However, this is not always the case. In certain instances, such as when
+using PSM2 or the ``smcuda`` BTL (with the OB1 PML), it is not feasible to
+delay the CUDA resources allocation. Consequently, these resources will need
+to be allocated during ``MPI_Init()``.
+
+Regardless of the situation, the CUDA device must be selected before the first
+MPI call that requires a CUDA resource. When CUDA resources can be initialized
+lazily, it is possible to use the aforementioned communicator-related operations
+to query rank information and utilize that to select a GPU.
 
 .. code-block:: c
 
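The trailing context line shows that the section continues with a ``.. code-block:: c`` example, which this hunk does not include. As a minimal sketch of the lazy path the new text describes (query rank information with CUDA-free MPI calls, then bind a device before the first CUDA-dependent MPI call), assuming an illustrative rank-modulo-device mapping rather than the documentation's actual example:

/* Illustrative sketch, not the code block from the docs (that block is
 * outside this hunk). MPI_Init and MPI_Comm_rank on MPI_COMM_WORLD are
 * guaranteed not to allocate CUDA resources, so the device can be chosen
 * after MPI_Init but before the first CUDA-dependent MPI call.
 * The rank % ndev mapping is an assumption for illustration. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
    int rank = 0, ndev = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ndev);
    if (ndev > 0) {
        /* Bind this rank to a device before any GPU-buffer MPI traffic. */
        cudaSetDevice(rank % ndev);
    }

    /* ... allocate GPU buffers and do CUDA-aware MPI communication ... */

    MPI_Finalize();
    return 0;
}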
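For the eager case the new paragraphs call out (PSM2, or the ``smcuda`` BTL with the OB1 PML, allocating CUDA resources inside ``MPI_Init()``), the device has to be selected before ``MPI_Init()`` itself. A hedged sketch, assuming the job is launched with Open MPI's mpirun, which exports OMPI_COMM_WORLD_LOCAL_RANK to each process:

/* Sketch for launchers that export a local-rank environment variable;
 * with other launchers an equivalent variable (or another discovery
 * mechanism) would be needed. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    const char *lrank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int ndev = 0;

    cudaGetDeviceCount(&ndev);
    if (lrank != NULL && ndev > 0) {
        /* Choose the device before MPI_Init(), since PSM2/smcuda may
         * allocate CUDA resources during initialization. */
        cudaSetDevice(atoi(lrank) % ndev);
    }

    MPI_Init(&argc, &argv);
    /* ... */
    MPI_Finalize();
    return 0;
}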