
Recovering control of workers #978

@samuelcajahuaringa

Description


Dear Hyperqueue team,

My script for task farming is:

#!/bin/bash
rm -rf hq-server-dir
mkdir hq-server-dir
rm -rf job-*

export HQ_SERVER_DIR="$(pwd)/hq-server-dir"

# start the server on the login node
./hq server start --server-dir "$HQ_SERVER_DIR" &
sleep 5

# reserve 2 compute nodes (2 workers) for a fixed amount of time
./request_hq_workers.sh 2

# task farming
./hq submit --server-dir "$HQ_SERVER_DIR" --array 0-39 --cpus=1 --resource gpus/nvidia=1 ./run_reaxff_lammps.sh

# Python script that stops the server and workers once all tasks have finished
python3 monitor_hq.py 1 40
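The monitor_hq.py script itself is not shown in this issue. A minimal sketch of what such a monitor might look like, assuming it takes a job id and a task count, periodically runs `hq job info`, and stops the server once every task is reported finished (the exact format of the `hq job info` output is an assumption here and the parsing may need adjusting for the HyperQueue version in use):

```python
# monitor_hq.py -- hypothetical sketch; the real script is not shown in the issue.
# Usage: python3 monitor_hq.py <job_id> <n_tasks>
import subprocess
import sys
import time


def job_finished(info_text: str, n_tasks: int) -> bool:
    """Return True if the `hq job info` text reports all tasks finished.

    Assumes the task-state summary contains a line such as 'FINISHED   40';
    adjust the parsing to match the actual hq output.
    """
    for line in info_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].upper().startswith("FINISHED"):
            try:
                return int(parts[-1]) >= n_tasks
            except ValueError:
                pass
    return False


def monitor(job_id: str, n_tasks: int, poll_seconds: int = 60) -> None:
    """Poll the job state and stop the server when all tasks are done."""
    while True:
        out = subprocess.run(
            ["./hq", "job", "info", job_id],
            capture_output=True, text=True,
        ).stdout
        if job_finished(out, n_tasks):
            # Stopping the server also shuts down the connected workers.
            subprocess.run(["./hq", "server", "stop"])
            return
        time.sleep(poll_seconds)


if __name__ == "__main__" and len(sys.argv) >= 3:
    monitor(sys.argv[1], int(sys.argv[2]))
```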

This works fine for short runs (for testing), but when I try to use it for long runs of 10-24 hours, the server on the login node can crash and I lose control of the workers (compute nodes), so no new tasks can be assigned and the nodes sit idle until the rest of the requested time runs out. Is there any way to recover control of the workers if I still have the information from the crashed server?

Best regards,
Samuel
