Description
I am testing p6-b300 instances, and they fail to launch when I request EFA interfaces (if EFA is disabled, the nodes boot, but without EFA).
Based on consultation with AWS engineering, B300 nodes differ from earlier instance types in a way that likely requires a change in the source code:
P6-B300 has 17 network cards, of which the primary network card (network card index 0) only supports ENA (4 ENIs), the remaining secondary network cards (indexes 1-16) support EFA and ENA. It is required to attach an ENI from the primary network card (network card 0) on device index 0 to be used as default ENI for instance connectivity. Given the default ENI attachment is satisfied, EFAs and ENAs can be attached to the secondary cards as desired.
Example to support ENA on primary and EFA on secondary NICs:
--network-interfaces
NetworkCardIndex=0,DeviceIndex=0,Groups=$SG_ID,SubnetId=$SUBNET_ID,InterfaceType=interface \ # required
NetworkCardIndex={1..16},DeviceIndex=0,Groups=$SG_ID,SubnetId=$SUBNET_ID,InterfaceType=efa-only # for additional ENIs alternatively use type efa or add optional interface type on the primary/secondary cards
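To make the layout above concrete, here is a minimal Python sketch that builds the same per-card interface list as plain dictionaries. The key names mirror the fields in the CLI example; the function name `build_network_interfaces` and its parameters are hypothetical and are not taken from the ParallelCluster source. It assumes, as in the example, `DeviceIndex=0` on every card.

```python
from typing import Dict, List

PRIMARY_CARD = 0
NUM_NETWORK_CARDS = 17  # per the description above: card 0 plus cards 1-16


def build_network_interfaces(
    security_group_id: str,
    subnet_id: str,
    secondary_type: str = "efa-only",
) -> List[Dict]:
    """Return one interface spec per network card.

    Card 0 always gets a plain ENA interface ("interface"), because the
    primary card does not support EFA. Cards 1-16 get `secondary_type`.
    """
    if secondary_type not in ("efa", "efa-only", "interface"):
        raise ValueError(f"unsupported interface type: {secondary_type}")

    interfaces = []
    for card in range(NUM_NETWORK_CARDS):
        interfaces.append(
            {
                "NetworkCardIndex": card,
                "DeviceIndex": 0,  # assumption: same device index as the CLI example
                "Groups": [security_group_id],
                "SubnetId": subnet_id,
                "InterfaceType": "interface" if card == PRIMARY_CARD else secondary_type,
            }
        )
    return interfaces
```

Calling `build_network_interfaces("sg-123", "subnet-abc")` (placeholder IDs) yields 17 entries, with only the first one of type `interface`.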
I see this error in the clustermgmtd log file:
2025-12-06 00:23:17,350 - [slurm_plugin.instance_manager:_launch_instances] - ERROR - Encountered exception when launching instances for nodes (x2) ['b300-st-b300-1', 'b300-st-b300-2']: An error occurred (AttachmentLimitExceeded) when calling the RunInstances operation: EFA interface count 17 exceeds allowed limit forp6-b300.48xlarge. EFA ENI limits exceeded on following network cards: Network Card 0 (requested: 1, limit: 0)
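The error suggests that an EFA interface is being requested on every card, including card 0. A rough sketch of the kind of per-card check that would catch this before calling RunInstances is below. The function `find_efa_violations` is hypothetical, and the per-card limits are assumptions: card 0 has a limit of 0 according to the log, and a limit of 1 for cards 1-16 is a guess based on the total count of 17 being rejected.

```python
from collections import Counter
from typing import Dict, List, Tuple

EFA_TYPES = {"efa", "efa-only"}


def find_efa_violations(
    interfaces: List[Dict], efa_limits: Dict[int, int]
) -> List[Tuple[int, int, int]]:
    """Return (card, requested, limit) for each card whose EFA count exceeds its limit."""
    requested = Counter(
        ni["NetworkCardIndex"]
        for ni in interfaces
        if ni.get("InterfaceType") in EFA_TYPES
    )
    return [
        (card, count, efa_limits.get(card, 0))
        for card, count in sorted(requested.items())
        if count > efa_limits.get(card, 0)
    ]


# Assumed limits: no EFA on card 0, one EFA on each of cards 1-16.
limits = {0: 0, **{card: 1 for card in range(1, 17)}}

# What the failing request appears to look like: EFA on all 17 cards.
request = [{"NetworkCardIndex": card, "InterfaceType": "efa"} for card in range(17)]

violations = find_efa_violations(request, limits)
```

With these inputs, `violations` is `[(0, 1, 0)]`, which matches the "Network Card 0 (requested: 1, limit: 0)" part of the log line.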
Our config looks like this:
- Name: b300
  CustomSlurmSettings:
    ...
  HealthChecks:
    ...
  CapacityType: CAPACITY_BLOCK
  ComputeResources:
    - Name: b300
      InstanceType: p6-b300.48xlarge
      CapacityReservationTarget:
        CapacityReservationId: ...
      MinCount: 2
      MaxCount: 2
      Efa:
        Enabled: true
      Networking:
        PlacementGroup:
          Enabled: false
  Networking:
    SubnetIds:
      - ...
  ComputeSettings:
    LocalStorage:
      RootVolume:
        Size: 500
  Iam:
    S3Access:
      ...
    AdditionalIamPolicies:
      ...