What happened?
When implementing Project Quotas, we're running into a problem where rolling restarts of machine deployments fail due to quota validation. The issue occurs because during a rolling update, the machine controller temporarily creates new machines before deleting old ones, causing a brief period where the total resource usage exceeds the quota limits.
Error Message:
failed to sync Machineset replicas: admission webhook "machines.cluster.k8c.io" denied the request: requested CPU "6" would exceed current quota (quota/used "20"/"18")
Root Cause:
- Machine controller uses a rolling update strategy with
maxSurge to maintain availability
- New machines are created before old machines are deleted
- Quota validation happens at admission webhook level before the old machines are removed
- This creates a temporary overage that fails quota validation
Expected behavior
The machine controller should be able to perform rolling updates without failing quota validation when:
- The replica count of the MachineDeployment is not being increased (i.e., it's a rolling restart/update, not a scale-up)
- The temporary overage is within the
maxSurge limits of the rolling update strategy
- The total resource usage will return to normal levels once the rolling update completes
Proposed Solution:
- Add a quota bypass mechanism for machines created during rolling updates
- Use an annotation
kubermatic.io/bypass-quota-validation=true on machines created during rolling updates
- Modify the admission webhook to detect quota-related errors and bypass validation when the annotation is present
- Ensure the bypass only applies to quota-related errors, not other validation failures
How to reproduce the issue?
-
Setup a MachineDeployment with quota limits:
apiVersion: "cluster.k8s.io/v1alpha1"
kind: MachineDeployment
metadata:
name: test-deployment
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 1
template:
spec:
providerSpec:
value:
cloudProvider: "azure"
cloudProviderSpec:
vmSize: "Standard_D2s_v3" # 2 vCPUs per machine
-
Set up resource quotas:
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-quota
spec:
hard:
requests.cpu: "6" # Total of 3 machines * 2 CPUs = 6 CPUs
-
Trigger a rolling update by changing the machine template (e.g., update the OS image or kubelet version)
-
Observe the failure when the machine controller tries to create the 4th machine (3 existing + 1 new for rolling update)
Additional details
Current Flow:
MachineDeployment Update → MachineSet Scaling → Machine Creation → Quota Check → FAIL
Proposed Flow:
MachineDeployment Update → MachineSet Scaling → Machine Creation (with bypass annotation) → Quota Check → BYPASS → SUCCESS
What happened?
When implementing Project Quotas, we're running into a problem where rolling restarts of machine deployments fail due to quota validation. The issue occurs because during a rolling update, the machine controller temporarily creates new machines before deleting old ones, causing a brief period where the total resource usage exceeds the quota limits.
Error Message:
Root Cause:
maxSurgeto maintain availabilityExpected behavior
The machine controller should be able to perform rolling updates without failing quota validation when:
maxSurgelimits of the rolling update strategyProposed Solution:
kubermatic.io/bypass-quota-validation=trueon machines created during rolling updatesHow to reproduce the issue?
Setup a MachineDeployment with quota limits:
Set up resource quotas:
Trigger a rolling update by changing the machine template (e.g., update the OS image or kubelet version)
Observe the failure when the machine controller tries to create the 4th machine (3 existing + 1 new for rolling update)
Additional details
Current Flow:
Proposed Flow: