graph/tuning: disable LL128 on PATH_P2C inter-node paths (GH200 corruption fix)#2053

Open
ryanhankins wants to merge 1 commit into NVIDIA:master from ryanhankins:master

@ryanhankins

On GH200 (Grace Hopper), the inter-node NIC->GPU receive path is classified as PATH_P2C (PCIe->NVLink-C2C). LL128 requires that when the receiver observes the flag at the end of a 128-byte chunk, all preceding bytes in that chunk are already visible in GPU memory. That ordering guarantee is not met on GH200: the NIC writes directly into GPU HBM via PCIe->NVLink-C2C, and the two 64-byte halves of the 128-byte chunk can become visible to the GPU independently. If the half containing the flag (the last 8 bytes of the chunk) becomes visible before the other half, the consumer reads stale or incomplete data, resulting in silent corruption.
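The failure mode can be illustrated with a small host-side model. This is only a sketch, not NCCL code: it assumes a 128-byte line modelled as sixteen 8-byte words with the flag in the last word, and the helper names (`makeLine`, `tornView`, `flagReady`, `dataIntact`) are hypothetical. It shows that a receiver which trusts the flag alone accepts a line whose first 64-byte half is still stale.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Simplified model of one 128-byte line: 16 words of 8 bytes, flag in the last word.
// The layout is an assumption made for illustration only.
constexpr std::size_t kWords = 16;     // 128 bytes / 8 bytes per word
constexpr std::size_t kHalfWords = 8;  // one 64-byte half
using Line = std::array<std::uint64_t, kWords>;

// Hypothetical sender: fills 15 data words, then stores the flag in the last word.
inline Line makeLine(std::uint64_t payload, std::uint64_t flag) {
  Line line{};
  for (std::size_t i = 0; i + 1 < kWords; ++i) line[i] = payload + i;
  line[kWords - 1] = flag;
  return line;
}

// Models the torn state: only the upper 64-byte half (which holds the flag)
// has become visible; the lower half still shows the previous contents.
inline Line tornView(const Line& stale, const Line& fresh) {
  Line view = stale;
  for (std::size_t i = kHalfWords; i < kWords; ++i) view[i] = fresh[i];
  return view;
}

// Receiver-side readiness test that looks at the flag word only.
inline bool flagReady(const Line& line, std::uint64_t expectedFlag) {
  return line[kWords - 1] == expectedFlag;
}

// True if every data word in `seen` matches what the sender wrote in `sent`.
inline bool dataIntact(const Line& seen, const Line& sent) {
  for (std::size_t i = 0; i + 1 < kWords; ++i)
    if (seen[i] != sent[i]) return false;
  return true;
}
```

In this model, a torn view passes `flagReady` while failing `dataIntact`, which corresponds to the silent corruption described above.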

Previously the condition to enable LL128 on Hopper/Blackwell was:

typeInter <= PATH_PXN

This included PATH_P2C because numerically P2C < PXN. The fix changes the condition to:

typeInter <= PATH_PXB || typeInter == PATH_PXN

which explicitly excludes PATH_P2C. LL128 remains enabled for NVLink-switch fabrics (PATH_PXN) and closer intra-node paths where the ordering guarantees hold. LL and Simple are unaffected and were verified clean on GH200.
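The difference between the two conditions can be checked with a compile-time sketch. The numeric values of the path constants below are assumptions chosen only to reproduce the ordering PXB < P2C < PXN implied by this description, and `ll128AllowedOld` / `ll128AllowedNew` are hypothetical names rather than functions in NCCL.

```cpp
// Illustrative path-type constants. Only the relative order PXB < P2C < PXN
// comes from the description; the concrete values are assumed for this sketch.
enum PathType : int {
  PATH_PXB = 5,
  PATH_P2C = 6,
  PATH_PXN = 7,
};

// Old condition: every path type up to and including PXN, which sweeps in P2C.
constexpr bool ll128AllowedOld(int typeInter) {
  return typeInter <= PATH_PXN;
}

// New condition: paths up to PXB, plus PXN itself, so P2C is left out.
constexpr bool ll128AllowedNew(int typeInter) {
  return typeInter <= PATH_PXB || typeInter == PATH_PXN;
}

static_assert(ll128AllowedOld(PATH_P2C), "old condition enabled LL128 on P2C");
static_assert(!ll128AllowedNew(PATH_P2C), "new condition disables LL128 on P2C");
static_assert(ll128AllowedNew(PATH_PXB) && ll128AllowedNew(PATH_PXN),
              "PXB and PXN keep LL128 under the new condition");
```

Under these assumed values, the two predicates differ only for PATH_P2C.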

Verified:
NCCL_PROTO=Simple - clean
NCCL_PROTO=LL - clean
NCCL_PROTO=LL128 - corrupt (fixed by this change)
NCCL_LL128_C2C=0 - clean (env-var workaround, superseded by this fix)

Description

On GH200 (Grace Hopper), the NIC->GPU receive path is classified as PATH_P2C (PCIe->NVLink-C2C). NCCL's LL128 protocol requires that when the receiver sees the flag at the end of a 128-byte chunk, all preceding data in that chunk is already visible in GPU memory. That guarantee does not hold on GH200: the two 64-byte halves of the chunk can become visible independently across the PCIe->C2C boundary, causing silent data corruption.

Root cause: the previous condition typeInter <= PATH_PXN inadvertently included PATH_P2C (numerically P2C < PXN).

Fix: change the condition to typeInter <= PATH_PXB || typeInter == PATH_PXN, explicitly excluding PATH_P2C. LL128 stays enabled for NVLink-switch fabrics and tighter intra-node paths. LL and Simple are unaffected.

Verified clean on GH200 with the Simple, LL, and LL128 protocols (LL128 with this change applied).

Reference any related issues or PRs:

#2001

Performance Impact

N/A

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>