Dear Hyperqueue team,
My script for task farming is:
```bash
#!/bin/bash
rm -rf hq-server-dir
mkdir hq-server-dir
rm -rf job-*
export HQ_SERVER_DIR="$(pwd)/hq-server-dir"

# Start the server on the login node
./hq server start --server-dir "$HQ_SERVER_DIR" &
sleep 5

# Script that reserves 2 compute nodes (2 workers) for a specified time
./request_hq_workers.sh 2

# Task farming
./hq submit --server-dir "$HQ_SERVER_DIR" --array 0-39 --cpus=1 --resource gpus/nvidia=1 ./run_reaxff_lammps.sh

# Python script that stops the server and workers once all tasks are finished
python3 monitor_hq.py 1 40
```
This works fine for short runs (testing), but when I try to use it for longer runs (10-24 hours), the server on the login node can crash, and I lose control of the workers (compute nodes): no new tasks can be assigned, and the nodes sit idle until the rest of the requested time expires. Is there any way to recover control of the workers after a crash, given that I still have the crashed server's information?
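To illustrate the kind of recovery I have in mind, here is a minimal watchdog sketch that simply restarts the server command whenever it exits. The `SERVER_CMD` and `MAX_RESTARTS` variables are placeholders I made up for illustration; in my setup `SERVER_CMD` would be the `./hq server start --server-dir "$HQ_SERVER_DIR"` line from the script above (and I assume the workers could reconnect to the restarted server, which is the part I am unsure about):

```bash
#!/bin/bash
# Watchdog sketch: re-run a server command each time it exits,
# up to MAX_RESTARTS times. SERVER_CMD is a placeholder command;
# substitute the real `./hq server start ...` invocation.
SERVER_CMD=${SERVER_CMD:-"sleep 1"}   # placeholder standing in for the server
MAX_RESTARTS=${MAX_RESTARTS:-3}

restarts=0
while [ "$restarts" -lt "$MAX_RESTARTS" ]; do
  $SERVER_CMD                          # blocks until the server exits/crashes
  restarts=$((restarts + 1))
  echo "server exited (restart $restarts/$MAX_RESTARTS)"
done
```

Of course this only restarts the server process; whether the already-running workers and the queued tasks can be reattached to the new server instance is exactly my question.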
Best regards,
Samuel