14.4 Device Mesh
Setting up distributed communicators, i.e. NVIDIA Collective Communication Library (NCCL) communicators, for distributed training can pose a significant challenge. For workloads that compose different parallelisms, users would need to manually set up and manage the NCCL communicators (for example, ProcessGroup) for each parallelism solution. This process can be complicated and error-prone. DeviceMesh simplifies it, making it more manageable and less susceptible to mistakes.
14.4.1 What is DeviceMesh
DeviceMesh is a higher-level abstraction that manages ProcessGroups. It allows users to effortlessly create inter-node and intra-node process groups without worrying about how to set up ranks correctly for the different sub-process groups. Users can also easily manage the underlying process groups/devices for multi-dimensional parallelism via DeviceMesh.
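As a rough sketch, creating such a mesh with init_device_mesh() might look like the following; the mesh shape (2 hosts with 4 GPUs each) and the dimension names "inter"/"intra" are assumptions made here for illustration:

```python
from torch.distributed.device_mesh import init_device_mesh

# A 2-D mesh over 8 GPUs: dimension 0 spans hosts, dimension 1 spans the GPUs
# within a host. The (2, 4) shape and the names "inter"/"intra" are illustrative.
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("inter", "intra"))

# The underlying ProcessGroup for each dimension remains accessible.
inter_node_group = mesh_2d["inter"].get_group()
intra_node_group = mesh_2d["intra"].get_group()
```

Every rank runs the same call (for example under torchrun), and DeviceMesh works out which sub-group the rank belongs to along each dimension.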
14.4.2 Why DeviceMesh is Useful
DeviceMesh is useful when working with multi-dimensional parallelism (i.e. 3-D parallel) where parallelism composability is required, for example when your parallelism solution needs both communication across hosts and communication within each host. In such a homogeneous setup, a 2-D mesh connects the devices within each host and connects each device with its counterpart on the other hosts.
Without DeviceMesh, users would need to manually set up NCCL communicators and CUDA devices on each process before applying any parallelism, which can be quite complicated. The following code snippet illustrates a hybrid-sharding 2-D parallel pattern set up without DeviceMesh: first, we need to manually compute the shard group and replicate group; then, we need to assign the correct shard and replicate group to each rank.
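A sketch of that manual setup, assuming a single 8-GPU host launched with torchrun (so two shard groups of four ranks each), might look like this:

```python
import os

import torch
import torch.distributed as dist

# Understand the world topology (RANK/WORLD_SIZE are set by torchrun).
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
print(f"Running on {rank=} in a world with {world_size=}")

# Create process groups to manage a 2-D-like parallel pattern.
dist.init_process_group("nccl")
torch.cuda.set_device(rank % torch.cuda.device_count())

# Create shard groups (e.g. (0, 1, 2, 3), (4, 5, 6, 7))
# and assign the correct shard group to each rank.
num_node_devices = torch.cuda.device_count()
shard_rank_lists = (
    list(range(0, num_node_devices // 2)),
    list(range(num_node_devices // 2, num_node_devices)),
)
shard_groups = (
    dist.new_group(shard_rank_lists[0]),
    dist.new_group(shard_rank_lists[1]),
)
current_shard_group = (
    shard_groups[0] if rank in shard_rank_lists[0] else shard_groups[1]
)

# Create replicate groups (e.g. (0, 4), (1, 5), (2, 6), (3, 7))
# and assign the correct replicate group to each rank.
current_replicate_group = None
shard_factor = len(shard_rank_lists[0])
for i in range(num_node_devices // 2):
    replicate_group_ranks = list(range(i, num_node_devices, shard_factor))
    replicate_group = dist.new_group(replicate_group_ranks)
    if rank % (num_node_devices // 2) == i:
        current_replicate_group = replicate_group
```

With init_device_mesh(), the same 2-D layout is described in a single call, as in the earlier sketch, while the underlying ProcessGroups stay accessible whenever a collective needs them directly.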
14.4.3 How to use DeviceMesh with HSDP
Hybrid Sharding Data Parallel (HSDP) is a 2-D strategy that performs FSDP within a host and DDP across hosts, as sketched below.
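A minimal sketch of wiring HSDP up with a 2-D DeviceMesh follows; the mesh shape (2 replicate groups of 4 shard ranks), the toy model, and the choice of FSDP's HYBRID_SHARD strategy with the device_mesh argument are assumptions for illustration rather than the only valid configuration:

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy


class ToyModel(nn.Module):
    """A small model used only to illustrate the HSDP wrapping."""

    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


# HSDP on a (2, 4) mesh: DDP-style replication across the first dimension,
# FSDP sharding within the second. The shape assumes 8 GPUs in total.
mesh_2d = init_device_mesh("cuda", (2, 4))
model = FSDP(
    ToyModel(),
    device_mesh=mesh_2d,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
```

This would typically be launched with something like `torchrun --nproc_per_node=8`, so that each GPU gets one process.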
14.4.4 How to Use DeviceMesh for Your Custom Parallel Solutions
When working with large-scale training, you might have a more complex custom parallel training composition. For example, you may need to slice out sub-meshes for different parallelism solutions. DeviceMesh allows users to slice a child mesh from the parent mesh and reuse the NCCL communicators that were already created when the parent mesh was initialized, as sketched below.
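For example, a 3-D mesh might be sliced into a 2-D sub-mesh for HSDP and a 1-D sub-mesh for tensor parallelism. The shape and dimension names below are illustrative assumptions:

```python
from torch.distributed.device_mesh import init_device_mesh

# A (2, 2, 2) mesh over 8 GPUs; shape and dimension names are assumptions.
mesh_3d = init_device_mesh(
    "cuda", (2, 2, 2), mesh_dim_names=("replicate", "shard", "tp")
)

# Slice child meshes from the parent mesh; the communicators created when the
# parent mesh was initialized are reused rather than rebuilt.
hsdp_mesh = mesh_3d["replicate", "shard"]
tp_mesh = mesh_3d["tp"]

# The underlying process groups remain accessible via get_group().
replicate_group = hsdp_mesh["replicate"].get_group()
shard_group = hsdp_mesh["shard"].get_group()
tp_group = tp_mesh.get_group()
```

The sliced sub-meshes share the parent mesh's communicators, so no extra ProcessGroups need to be created for them.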
14.4.5 Conclusion
In conclusion, we have learned about DeviceMesh and init_device_mesh(), as well as how they can be used to describe the layout of devices across the cluster.