Cannot create p6-b300 instances #7143

@dakoner

Description

I am testing p6-b300 instances, and they fail to launch when I request EFA interfaces (if EFA is disabled, the nodes boot, but without EFA).

Based on consultation with AWS engineering, there is a difference in B300 nodes that likely requires a change in the source code:

P6-B300 has 17 network cards. The primary network card (network card index 0) supports only ENA (4 ENIs); the remaining secondary network cards (indexes 1-16) support both EFA and ENA. An ENI from the primary network card (network card 0) must be attached at device index 0 to serve as the default ENI for instance connectivity. Once the default ENI attachment is satisfied, EFAs and ENAs can be attached to the secondary cards as desired.
Example to support ENA on primary and EFA on secondary NICs:
--network-interfaces
NetworkCardIndex=0,DeviceIndex=0,Groups=$SG_ID,SubnetId=$SUBNET_ID,InterfaceType=interface \ # required
NetworkCardIndex={1..16},DeviceIndex=0,Groups=$SG_ID,SubnetId=$SUBNET_ID,InterfaceType=efa-only # for additional ENIs alternatively use type efa or add optional interface type on the primary/secondary cards
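
For illustration, the layout described in the AWS note can be sketched in Python as the list of interface specifications one would pass as `NetworkInterfaces` to an EC2 `RunInstances` call. This is only a sketch of the quoted guidance, not the ParallelCluster implementation; the function name `build_network_interfaces` and the constants are hypothetical, and `DeviceIndex=0` on the secondary cards simply mirrors the example above.

```python
from typing import Dict, List

PRIMARY_CARD = 0                 # ENA only, per the AWS note above
SECONDARY_CARDS = range(1, 17)   # cards 1-16 support EFA and ENA


def build_network_interfaces(subnet_id: str, security_group_ids: List[str]) -> List[Dict]:
    """Hypothetical helper: one ENA interface on card 0, one EFA-only
    interface on each secondary card."""

    def spec(card: int, interface_type: str) -> Dict:
        return {
            "NetworkCardIndex": card,
            "DeviceIndex": 0,  # mirrors the CLI example above
            "Groups": list(security_group_ids),
            "SubnetId": subnet_id,
            "InterfaceType": interface_type,
        }

    # Card 0 carries the required default ENI and must not be an EFA.
    interfaces = [spec(PRIMARY_CARD, "interface")]
    # Every secondary card gets an EFA-only interface.
    interfaces += [spec(card, "efa-only") for card in SECONDARY_CARDS]
    return interfaces
```

The result has 17 entries, with the first being the plain ENA interface on card 0. Using `efa` instead of `efa-only` on the secondary cards would be the alternative mentioned in the note.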

I see this error in the clustermgmtd log file:

2025-12-06 00:23:17,350 - [slurm_plugin.instance_manager:_launch_instances] - ERROR - Encountered exception when launching instances for nodes (x2) ['b300-st-b300-1', 'b300-st-b300-2']: An error occurred (AttachmentLimitExceeded) when calling the RunInstances operation: EFA interface count 17 exceeds allowed limit forp6-b300.48xlarge. EFA ENI limits exceeded on following network cards: Network Card 0 (requested: 1, limit: 0)
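
The error reads as if one EFA was requested on each of the 17 cards, including card 0, which only supports ENA. I have not confirmed this in the source code, so the following is only a sketch of the condition the message implies. The helper `efa_on_ena_only_cards` and the sample request are hypothetical.

```python
from typing import Dict, List

EFA_TYPES = {"efa", "efa-only"}
ENA_ONLY_CARDS = {0}  # per the AWS note: card 0 on P6-B300 supports only ENA


def efa_on_ena_only_cards(interfaces: List[Dict]) -> List[int]:
    """Return the sorted card indexes where an EFA is requested on an ENA-only card."""
    return sorted(
        {
            ni.get("NetworkCardIndex", 0)
            for ni in interfaces
            if ni.get("InterfaceType") in EFA_TYPES
            and ni.get("NetworkCardIndex", 0) in ENA_ONLY_CARDS
        }
    )


# Hypothetical request mirroring the failure: an EFA on every card, 0 through 16.
failing_request = [{"NetworkCardIndex": card, "InterfaceType": "efa"} for card in range(17)]

# Same request with card 0 switched to a plain ENA interface.
fixed_request = [{"NetworkCardIndex": 0, "InterfaceType": "interface"}] + failing_request[1:]

print(efa_on_ena_only_cards(failing_request))
print(efa_on_ena_only_cards(fixed_request))
```

The first call flags card 0, matching "Network Card 0 (requested: 1, limit: 0)" in the log; the second flags nothing.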

Our config looks like this:

  - Name: b300
    CustomSlurmSettings:
      ...
    HealthChecks:
      ...
    CapacityType: CAPACITY_BLOCK
    ComputeResources:
      - Name: b300
        InstanceType: p6-b300.48xlarge
        CapacityReservationTarget:
          CapacityReservationId: ...
        MinCount: 2
        MaxCount: 2
        Efa:
          Enabled: true
        Networking:
          PlacementGroup:
            Enabled: false
    Networking:
      SubnetIds:
        - ...
    ComputeSettings:
      LocalStorage:
        RootVolume:
          Size: 500
    Iam:
      S3Access:
        ...
      AdditionalIamPolicies:
        ...
