
Commit 61c38ae

Aggregate polling for retrieving job pending reason
1 parent 1eee4f9 commit 61c38ae

6 files changed

Lines changed: 192 additions & 95 deletions


docs/config_reference.rst

Lines changed: 44 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -438,12 +438,14 @@ System Partition Configuration
438438

439439
Ignore the ``ReqNodeNotAvail`` Slurm state.
440440

441-
If a job associated to a test is in pending state with the Slurm reason ``ReqNodeNotAvail`` and a list of unavailable nodes is also specified, ReFrame will check the status of the nodes and, if all of them are indeed down, it will cancel the job.
442-
Sometimes, however, when Slurm's backfill algorithm takes too long to compute, Slurm will set the pending reason to ``ReqNodeNotAvail`` and mark all system nodes as unavailable, causing ReFrame to kill the job.
441+
If a job associated with a test is in a pending state with the Slurm reason ``ReqNodeNotAvail``, ReFrame will cancel the job.
442+
Sometimes, however, when Slurm's backfill algorithm takes too long to compute, Slurm will set the pending reason to ``ReqNodeNotAvail`` and mark all system nodes as unavailable, causing ReFrame to cancel its jobs.
443443
In such cases, you may set this parameter to :obj:`True` to avoid this.
444444

445445
This option is relevant for the Slurm backends only.
446446

447+
.. seealso:: :attr:`~config.systems.partitions.sched_options.slurm_job_cancel_reasons`
448+
447449
.. py:attribute:: systems.partitions.sched_options.job_submit_timeout
448450
449451
:required: No
@@ -490,6 +492,45 @@ System Partition Configuration
490492
.. versionadded:: 4.8
491493

492494

495+
.. py:attribute:: systems.partitions.sched_options.slurm_job_cancel_reasons
496+
497+
:required: No
498+
:default: ``["ReqNodeNotAvail"]``
499+
500+
Reasons to proactively cancel a pending Slurm job.
501+
502+
If a job associated with a test is in a pending state with one of the reasons listed here, ReFrame will cancel the job.
503+
504+
This option is relevant for the Slurm backends only.
505+
506+
.. versionadded:: 4.10
507+
508+
.. seealso::
509+
510+
:attr:`~config.systems.partitions.sched_options.ignore_reqnodenotavail`
511+
:attr:`~config.systems.partitions.sched_options.slurm_pending_job_reason_poll_freq`
512+
513+
514+
.. py:attribute:: systems.partitions.sched_options.slurm_pending_job_reason_poll_freq
515+
516+
:required: No
517+
:default: ``10``
518+
519+
Frequency of polling for the reason a Slurm job is pending.
520+
521+
When using the ``slurm`` backend, ReFrame needs to explicitly issue an ``squeue`` command to get the reason a job is pending, in order to cancel it if it is blocked indefinitely.
522+
This option controls the frequency of this polling.
523+
It is an integer denoting how many regular job polling cycles must elapse before the ``squeue`` command is issued again.
524+
ReFrame will issue a single ``squeue`` command to query all pending jobs at once.
525+
However, if your system is sensitive to Slurm RPC calls, you may consider increasing this value.
526+
527+
This option is relevant for the ``slurm`` backend only.
528+
529+
.. versionadded:: 4.10
530+
531+
.. seealso:: :attr:`~config.systems.partitions.sched_options.slurm_job_cancel_reasons`
532+
533+
493534
.. py:attribute:: systems.partitions.sched_options.unqualified_hostnames
494535
495536
:required: No
@@ -1647,7 +1688,7 @@ The additional properties for the ``httpjson`` handler are the following:
16471688
These may depend on the server configuration.
16481689

16491690
.. note::
1650-
If you specify an authorization header here, it will be evaluated at the start of the test session and potentially expire.
1691+
If you specify an authorization header here, it will be evaluated at the start of the test session and potentially expire.
16511692
Consider using the :attr:`~config.logging.handlers_perflog..httpjson..authorization_header` parameter instead for dynamic authorization headers.
16521693

16531694
.. versionadded:: 4.2
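
Putting the two new scheduler options together, a minimal sketch of a partition entry in a ReFrame settings file might look as follows; the system, partition and launcher names are illustrative and everything unrelated to the scheduler is omitted::

    site_configuration = {
        'systems': [
            {
                'name': 'example-cluster',      # made-up system name
                'hostnames': ['login'],
                'partitions': [
                    {
                        'name': 'compute',
                        'scheduler': 'slurm',
                        'launcher': 'srun',
                        'sched_options': {
                            # Cancel jobs pending for any of these reasons
                            'slurm_job_cancel_reasons': [
                                'ReqNodeNotAvail',
                                'PartitionDown'
                            ],
                            # Query the pending reason every 20 polling
                            # cycles instead of the default 10
                            'slurm_pending_job_reason_poll_freq': 20
                        }
                    }
                ]
            }
        ]
    }

With this setting, the aggregated ``squeue`` query is issued once every 20 regular job polling cycles.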

docs/manpage.rst

Lines changed: 31 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -2050,7 +2050,7 @@ Whenever an environment variable is associated with a configuration option, its
20502050

20512051
.. envvar:: RFM_IGNORE_REQNODENOTAVAIL
20522052

2053-
Do not treat specially jobs in pending state with the reason ``ReqNodeNotAvail`` (Slurm only).
2053+
Do not cancel jobs that are pending due to the reason ``ReqNodeNotAvail`` (Slurm only).
20542054

20552055
.. table::
20562056
:align: left
@@ -2402,6 +2402,36 @@ Whenever an environment variable is associated with a configuration option, its
24022402
.. versionadded:: 4.7
24032403

24042404

2405+
.. envvar:: RFM_SLURM_JOB_CANCEL_REASONS
2406+
2407+
Reasons to proactively cancel a pending Slurm job.
2408+
2409+
.. table::
2410+
:align: left
2411+
2412+
================================== ==================
2413+
Associated command line option N/A
2414+
Associated configuration parameter :attr:`~config.systems.partitions.sched_options.slurm_job_cancel_reasons`
2415+
================================== ==================
2416+
2417+
.. versionadded:: 4.10
2418+
2419+
2420+
.. envvar:: RFM_SLURM_PENDING_JOB_REASON_POLL_FREQ
2421+
2422+
Frequency of polling for the reason a Slurm job is pending.
2423+
2424+
.. table::
2425+
:align: left
2426+
2427+
================================== ==================
2428+
Associated command line option N/A
2429+
Associated configuration parameter :attr:`~config.systems.partitions.sched_options.slurm_pending_job_reason_poll_freq`
2430+
================================== ==================
2431+
2432+
.. versionadded:: 4.10
2433+
2434+
24052435
.. envvar:: RFM_STAGE_DIR
24062436

24072437
Directory prefix for staging test resources.

reframe/core/schedulers/__init__.py

Lines changed: 12 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -175,6 +175,18 @@ def cancel(self, job):
175175
:meta private:
176176
'''
177177

178+
def cancel_many(self, jobs):
179+
'''Cancel multiple jobs at once.
180+
181+
By default, this method calls :meth:`cancel` for each job. Backends
182+
may override this method to implement more efficient cancellation.
183+
184+
:arg jobs: The job descriptors to cancel.
185+
:meta private:
186+
'''
187+
for job in jobs:
188+
self.cancel(job)
189+
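
To show what this hook is meant for, here is a minimal, hypothetical backend sketch that batches all cancellations into a single command instead of issuing one per job; the ``mycancel`` utility and the backend itself are made up, and the real Slurm override appears in the next file::

    import reframe.core.schedulers as sched
    import reframe.utility.osext as osext


    class MyBatchScheduler(sched.JobScheduler):
        # submit(), poll(), finished() etc. omitted for brevity

        def cancel_many(self, jobs):
            # Hypothetical CLI that accepts several job IDs at once,
            # similar to `scancel <id> <id> ...`
            if not jobs:
                return

            osext.run_command(
                'mycancel ' + ' '.join(job.jobid for job in jobs),
                check=True
            )
            for job in jobs:
                job._is_cancelling = True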
178190
@abc.abstractmethod
179191
def finished(self, job):
180192
'''Poll a job.

reframe/core/schedulers/slurm.py

Lines changed: 82 additions & 91 deletions
Original file line number | Diff line number | Diff line change
@@ -20,7 +20,6 @@
2020
from reframe.core.backends import register_scheduler
2121
from reframe.core.exceptions import (SpawnedProcessError,
2222
JobBlockedError,
23-
JobError,
2423
JobSchedulerError)
2524
from reframe.utility import nodelist_abbrev, seconds_to_hms
2625

@@ -101,13 +100,6 @@ def is_cancelling(self):
101100

102101
@register_scheduler('slurm')
103102
class SlurmJobScheduler(sched.JobScheduler):
104-
# In some systems, scheduler performance is sensitive to the squeue poll
105-
# ratio. In this backend, squeue is used to obtain the reason a job is
106-
# blocked, so as to cancel it if it is blocked indefinitely. The following
107-
# variable controls the frequency of squeue polling compared to the
108-
# standard job state polling using sacct.
109-
SACCT_SQUEUE_RATIO = 10
110-
111103
# This matches the format for both normal and heterogeneous jobs,
112104
# as well as job arrays.
113105
# For heterogeneous jobs, the job_id has the following format:
@@ -125,24 +117,19 @@ def __init__(self):
125117
# Reasons to cancel a pending job: if the job is expected to remain
126118
# pending for a much longer time than usual (mostly if a sysadmin
127119
# intervention is required)
128-
self._cancel_reasons = ['FrontEndDown',
129-
'Licenses', # May require sysadmin
130-
'NodeDown',
131-
'PartitionDown',
132-
'PartitionInactive',
133-
'PartitionNodeLimit',
134-
'QOSJobLimit',
135-
'QOSResourceLimit',
136-
'QOSUsageThreshold']
137-
ignore_reqnodenotavail = self.get_option('ignore_reqnodenotavail')
138-
if not ignore_reqnodenotavail:
139-
self._cancel_reasons.append('ReqNodeNotAvail')
120+
self._cancel_reasons = self.get_option('slurm_job_cancel_reasons')
121+
if self.get_option('ignore_reqnodenotavail'):
122+
with suppress(ValueError):
123+
self._cancel_reasons.remove('ReqNodeNotAvail')
140124

141125
self._update_state_count = 0
142126
self._submit_timeout = self.get_option('job_submit_timeout')
143127
self._use_nodes_opt = self.get_option('use_nodes_option')
144128
self._resubmit_on_errors = self.get_option('resubmit_on_errors')
145129
self._max_sacct_failures = self.get_option('max_sacct_failures')
130+
self._pending_job_reason_poll_freq = self.get_option(
131+
'slurm_pending_job_reason_poll_freq'
132+
)
146133
self._num_sacct_failures = 0
147134
self._sched_access_in_submit = self.get_option(
148135
'sched_access_in_submit'
@@ -529,6 +516,7 @@ def poll(self, *jobs):
529516
jobid = re.split(r'_|\+', s.group('jobid'))[0]
530517
job_info.setdefault(jobid, []).append(s)
531518

519+
# Update job information
532520
for job in jobs:
533521
try:
534522
jobarr_info = job_info[job.jobid]
@@ -538,10 +526,6 @@ def poll(self, *jobs):
538526
# Join the states with ',' in case of job arrays|heterogeneous jobs
539527
job._state = ','.join(m.group('state') for m in jobarr_info)
540528

541-
if not self._update_state_count % self.SACCT_SQUEUE_RATIO:
542-
self._cancel_if_blocked(job)
543-
544-
self._cancel_if_pending_too_long(job)
545529
if slurm_state_completed(job.state):
546530
# Since Slurm exitcodes are positive take the maximum one
547531
job._exitcode = max(
@@ -554,74 +538,71 @@ def poll(self, *jobs):
554538
job, (m.group('end') for m in jobarr_info)
555539
)
556540

557-
def _cancel_if_pending_too_long(self, job):
558-
if not job.max_pending_time or not slurm_state_pending(job.state):
559-
return
541+
# Cancel jobs that are blocked or have been pending for too long
542+
if not self._update_state_count % self._pending_job_reason_poll_freq:
543+
self._cancel_if_blocked(jobs)
544+
545+
self._cancel_if_pending_too_long(jobs)
546+
547+
def _cancel_if_pending_too_long(self, jobs):
548+
cancel_joblist = []
549+
for job in jobs:
550+
if not job.max_pending_time or not slurm_state_pending(job.state):
551+
continue
560552

561-
t_pending = time.time() - job.submit_time
562-
if t_pending >= job.max_pending_time:
563-
self.log('maximum pending time for job exceeded; cancelling it')
564-
self.cancel(job)
565-
job._exception = JobError('maximum pending time exceeded',
566-
job.jobid)
553+
t_pending = time.time() - job.submit_time
554+
if t_pending >= job.max_pending_time:
555+
self.log('maximum pending time for job exceeded; cancelling it')
556+
cancel_joblist.append(job)
567557

568-
def _cancel_if_blocked(self, job, reasons=None):
569-
if (job.is_cancelling or not slurm_state_pending(job.state)):
558+
if cancel_joblist:
559+
self.cancel_many(cancel_joblist)
560+
561+
def _cancel_if_blocked(self, jobs, reasons=None):
562+
pending_jobs = {
563+
job.jobid: job for job in jobs
564+
if not job.is_cancelling and slurm_state_pending(job.state)
565+
}
566+
if not pending_jobs:
570567
return
571568

572-
if not reasons:
573-
completed = osext.run_command('squeue -h -j %s -o %%r' % job.jobid)
574-
reasons = completed.stdout.splitlines()
575-
if not reasons:
576-
# Can't retrieve job's state. Perhaps it has finished already
577-
# and does not show up in the output of squeue
578-
return
569+
pending_reasons = {}
if reasons:
570+
for jobid, reason in reasons.items():
571+
try:
572+
pending_reasons[pending_jobs[jobid]] = reason
573+
except KeyError:
574+
# Job listed in reasons is not pending
575+
continue
576+
else:
577+
job_ids = ",".join(pending_jobs.keys())
578+
completed = osext.run_command(f'squeue -h -j {job_ids} -o "%A|%r"')
579+
pending_reasons = {}
580+
for line in completed.stdout.splitlines():
581+
jobid, reason = line.split('|', maxsplit=1)
582+
pending_reasons[pending_jobs[jobid]] = reason
583+
584+
cancel_joblist = {}
585+
for job, reason_descr in pending_reasons.items():
586+
# The reason description may have two parts as follows:
587+
# "ReqNodeNotAvail, UnavailableNodes:nid00[408,411-415]"
588+
reason = reason_descr.split(',', maxsplit=1)[0].strip()
589+
if reason not in self._cancel_reasons:
590+
continue
579591

580-
# For slurm job arrays the squeue output consists of multiple lines
581-
for r in reasons:
582-
self._do_cancel_if_blocked(job, r)
592+
cancel_joblist[job] = reason_descr
583593

584-
def _do_cancel_if_blocked(self, job, reason_descr):
585-
'''Check if blocking reason ``reason_descr`` is unrecoverable and
586-
cancel the job in this case.'''
594+
if not cancel_joblist:
595+
return
587596

588-
# The reason description may have two parts as follows:
589-
# "ReqNodeNotAvail, UnavailableNodes:nid00[408,411-415]"
590-
try:
591-
reason, reason_details = reason_descr.split(',', maxsplit=1)
592-
except ValueError:
593-
# no reason details
594-
reason, reason_details = reason_descr, None
595-
596-
if reason in self._cancel_reasons:
597-
if reason == 'ReqNodeNotAvail' and reason_details:
598-
self.log('Job blocked due to ReqNodeNotAvail')
599-
node_match = re.match(
600-
r'UnavailableNodes:(?P<node_names>\S+)?',
601-
reason_details.strip()
602-
)
603-
if node_match:
604-
node_names = node_match['node_names']
605-
if node_names:
606-
# Retrieve the info of the unavailable nodes and check
607-
# if they are indeed down. According to Slurm's docs
608-
# this should not be necessary, but we check anyways
609-
# to be on the safe side.
610-
self.log(f'Checking if nodes {node_names!r} '
611-
f'are indeed unavailable')
612-
nodes = self._get_nodes_by_name(node_names)
613-
if not any(self.is_node_down(n) for n in nodes):
614-
return
615-
616-
self.cancel(job)
617-
reason_msg = (
618-
'job cancelled because it was blocked due to '
619-
'a perhaps non-recoverable reason: ' + reason
620-
)
621-
if reason_details is not None:
622-
reason_msg += ', ' + reason_details
623-
624-
job._exception = JobBlockedError(reason_msg, job.jobid)
597+
# Cancel all jobs at once
598+
self.cancel_many(cancel_joblist.keys())
599+
600+
# Set the job exception
601+
for job, reason_descr in cancel_joblist.items():
602+
job._exception = JobBlockedError(
603+
f'job cancelled because it was blocked '
604+
f'due to reason: {reason_descr}', job.jobid
605+
)
625606

626607
def wait(self, job):
627608
# Quickly return in case we have finished already
@@ -640,8 +621,17 @@ def wait(self, job):
640621
self._merge_files(job)
641622

642623
def cancel(self, job):
643-
_run_strict(f'scancel {job.jobid}', timeout=self._submit_timeout)
644-
job._is_cancelling = True
624+
self.cancel_many([job])
625+
626+
def cancel_many(self, jobs):
627+
if not jobs:
628+
self.log('no jobs to cancel')
629+
return
630+
631+
_run_strict(f'scancel {" ".join(job.jobid for job in jobs)}',
632+
timeout=self._submit_timeout)
633+
for job in jobs:
634+
job._is_cancelling = True
645635

646636
def finished(self, job):
647637
return slurm_state_completed(job.state)
@@ -707,10 +697,11 @@ def poll(self, *jobs):
707697

708698
# Use ',' to join nodes to be consistent with Slurm syntax
709699
job._nodespec = ','.join(m.group('nodespec') for m in job_match)
710-
self._cancel_if_blocked(
711-
job, [s.group('reason') for s in state_match]
712-
)
713-
self._cancel_if_pending_too_long(job)
700+
701+
self._cancel_if_blocked(
702+
jobs, {jobid: s.group('reason') for jobid, s in jobinfo.items()}
703+
)
704+
self._cancel_if_pending_too_long(jobs)
714705

715706

716707
def _create_nodes(descriptions):
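
For reference, the aggregation introduced above boils down to the following standalone sketch: a single ``squeue`` call covers all pending job IDs and its ``jobid|reason`` output is turned into a per-job reason map. The job IDs and reasons in the usage comment are made-up sample values::

    import subprocess


    def pending_reasons(jobids):
        '''Return a {jobid: reason} map for the given pending Slurm jobs.

        A single squeue invocation covers all jobs; '%A' prints the job ID
        and '%r' the pending reason.
        '''
        completed = subprocess.run(
            ['squeue', '-h', '-j', ','.join(jobids), '-o', '%A|%r'],
            capture_output=True, text=True, check=True
        )
        reasons = {}
        for line in completed.stdout.splitlines():
            jobid, reason_descr = line.split('|', maxsplit=1)
            # The reason may carry details, e.g.
            # 'ReqNodeNotAvail, UnavailableNodes:nid00[408,411-415]';
            # keep only the leading keyword for matching against the
            # cancellation list.
            reasons[jobid] = reason_descr.split(',', maxsplit=1)[0].strip()

        return reasons


    # Hypothetical usage:
    #   pending_reasons(['1001', '1002'])
    #   -> {'1001': 'ReqNodeNotAvail', '1002': 'Priority'}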
