
Recovering control of workers #978

@samuelcajahuaringa

Description


Dear Hyperqueue team,

My script for task farming is:

#!/bin/bash
rm -rf hq-server-dir
mkdir hq-server-dir
rm -rf job-*

export HQ_SERVER_DIR="$(pwd)/hq-server-dir"

# start the server on the login node
./hq server start --server-dir "$HQ_SERVER_DIR" &
sleep 5

# reserve 2 compute nodes (2 workers) for a fixed amount of time
./request_hq_workers.sh 2

# task farming
./hq submit --server-dir "$HQ_SERVER_DIR" --array 0-39 --cpus=1 --resource gpus/nvidia=1 ./run_reaxff_lammps.sh

# Python script that stops the server and workers once all tasks have finished
python3 monitor_hq.py 1 40
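The monitor_hq.py script itself is not shown in this issue. A minimal sketch of what such a monitor might look like, assuming it takes a job id and a task count, periodically runs `hq job info`, and stops the server once every task is reported finished (the exact format of the `hq job info` output is an assumption here and the parsing may need adjusting for the HyperQueue version in use):

```python
# monitor_hq.py -- hypothetical sketch; the real script is not shown in the issue.
# Usage: python3 monitor_hq.py <job_id> <n_tasks>
import subprocess
import sys
import time


def job_finished(info_text: str, n_tasks: int) -> bool:
    """Return True if the `hq job info` text reports all tasks finished.

    Assumes the task-state summary contains a line such as 'FINISHED   40';
    adjust the parsing to match the actual hq output.
    """
    for line in info_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].upper().startswith("FINISHED"):
            try:
                return int(parts[-1]) >= n_tasks
            except ValueError:
                pass
    return False


def monitor(job_id: str, n_tasks: int, poll_seconds: int = 60) -> None:
    """Poll the job state and stop the server when all tasks are done."""
    while True:
        out = subprocess.run(
            ["./hq", "job", "info", job_id],
            capture_output=True, text=True,
        ).stdout
        if job_finished(out, n_tasks):
            # Stopping the server also shuts down the connected workers.
            subprocess.run(["./hq", "server", "stop"])
            return
        time.sleep(poll_seconds)


if __name__ == "__main__" and len(sys.argv) >= 3:
    monitor(sys.argv[1], int(sys.argv[2]))
```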

This works fine for short runs (for testing), but when I try to use it for long runs of 10-24 hours, the server on the login node can crash and I lose control of the workers (compute nodes), so no new tasks can be assigned and the nodes sit idle until the rest of the requested time runs out. Is there any way to recover control of the workers if I still have the information from the crashed server?

Best regards,
Samuel
