Commit 41d16f8

Fix cfn-hup endless loop after rollback to cluster state older than 24h
When a cluster update fails and triggers a rollback to a state older than 24 hours, cfn-hup enters an endless loop on the head node. This happens because:

1. The rollback restores the launch template metadata to reference an expired wait condition handle (wait conditions expire after 24h)
2. cfn-signal fails to signal the expired handle and returns non-zero
3. cfn-hup sees the non-zero exit code and does not update its local metadata cache (metadata_db.json)
4. On the next polling interval, cfn-hup detects the same "change" and re-triggers the update recipe, creating an infinite loop

This fix appends "; exit 0" to the update command, ensuring cfn-hup always updates its metadata cache regardless of whether cfn-signal succeeds or fails. This prevents the endless loop while still allowing CloudFormation to handle timeouts appropriately.
1 parent c16d21a commit 41d16f8
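
The loop described in the commit message can be illustrated with a toy model. This is a minimal sketch, not cfn-hup's actual implementation: `simulate_polls` and its parameters are hypothetical names, and the only behavior it assumes is the one stated above, namely that the local cache is refreshed only when the hook command exits with 0.

```python
def simulate_polls(remote_metadata, cached_metadata, hook_exit_code, polls=3):
    """Count how often the update hook fires over `polls` polling intervals.

    Hypothetical model of the behavior described in the commit message:
    the hook runs whenever remote and cached metadata differ, and the
    cache is refreshed only if the hook exits with status 0.
    """
    runs = 0
    for _ in range(polls):
        if remote_metadata != cached_metadata:
            runs += 1  # a "change" is detected, so the hook is triggered
            if hook_exit_code == 0:
                cached_metadata = remote_metadata  # cache catches up
    return runs


# Hook keeps failing (expired handle): the same change is seen on every poll.
looping = simulate_polls("rolled-back", "failed-update", hook_exit_code=1)

# Hook always exits 0 (the fix): the change is handled exactly once.
fixed = simulate_polls("rolled-back", "failed-update", hook_exit_code=0)

print(looping, fixed)
```

In this model the failing hook fires on every poll, while the hook that always exits 0 fires once and then stays quiet.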

2 files changed, 10 additions and 1 deletion

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@ CHANGELOG
 - Reduce EFA installation time for Ubuntu by ~20 minutes by only holding kernel packages for the installed kernel.
 - Add GetFunction and GetPolicy permissions to PClusterBuildImageCleanupRole to prevent AccessDenied errors during build image stack deletion.
 - Fix validation error messages when `DevSettings` is null or `DevSettings/InstanceTypesData` is missing required fields.
+- Fix an issue where cfn-hup enters an endless loop on the head node after a rollback to a cluster state older than 24 hours, caused by cfn-signal failing to signal an expired wait condition handle.
 
 3.14.0
 ------

cli/src/pcluster/templates/cluster_stack.py

Lines changed: 9 additions & 1 deletion
@@ -1496,6 +1496,13 @@ def _add_head_node(self):
 "chefUpdate": {
     "commands": {
         "chef": {
+            # This command runs the update recipe and signals CloudFormation with the result.
+            # The trailing "; exit 0" ensures cfn-hup always updates its local metadata cache
+            # (metadata_db.json) regardless of whether cfn-signal succeeds or fails.
+            # Without this, if cfn-signal fails (e.g., due to an expired wait condition handle
+            # after a rollback to a state older than 24h), cfn-hup would not update its cache
+            # and would enter an endless loop, re-triggering the update recipe every minute.
+            # See: https://issues.amazon.com/issues/PCLUSTER-XXXX
             "command": (
                 ". /etc/parallelcluster/pcluster_cookbook_environment.sh; "
                 "cinc-client --local-mode --config /etc/chef/client.rb --log_level info"
@@ -1508,7 +1515,8 @@ def _add_head_node(self):
                 f" '{self.wait_condition_handle.ref}' ||"
                 f" $CFN_BOOTSTRAP_VIRTUALENV_PATH/cfn-signal --exit-code=1 --reason='Update failed'"
                 f" --region {self.stack.region} --url {cloudformation_url}"
-                f" '{self.wait_condition_handle.ref}'"
+                f" '{self.wait_condition_handle.ref}';"
+                " exit 0"
             ),
             "cwd": "/etc/chef",
         }
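
The effect of the trailing "; exit 0" on the exit status that the caller sees can be checked in isolation. The sketch below is only an illustration under stated assumptions: it needs a POSIX `sh` on the PATH, `exit_status` is a hypothetical helper, and the shell builtin `false` stands in for the failing cinc-client and cfn-signal calls, which are not run here.

```python
import subprocess


def exit_status(command: str) -> int:
    """Run `command` through a POSIX shell and return its exit status."""
    return subprocess.run(["sh", "-c", command]).returncode


# Stand-in for the old command: the fallback signal also fails,
# so the whole chain reports a non-zero status to the caller.
before = exit_status("false || false")

# Stand-in for the new command: "; exit 0" runs after the chain,
# so the caller sees 0 whatever the chain returned.
after = exit_status("false || false; exit 0")

print(before, after)
```

Because `exit 0` is separated by `;` rather than `&&` or `||`, it runs unconditionally. The trade-off is that the hook's own exit status no longer carries failure information, so failures are reported only through whatever cfn-signal manages to deliver and, as the commit message notes, through CloudFormation's timeout handling.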
