feat: automatic job grouping#95

Draft
Stellatsuu wants to merge 16 commits into main from automatic-job-grouping

Conversation

@Stellatsuu
Contributor

@Stellatsuu Stellatsuu commented Feb 3, 2026

DRAFT PR
cc @aldbr @arrabito @natthan-pigoux

Closes: #66
Related to: #61

Changes:

  • added input_data: list[str | File] to TransformationSubmissionModel
  • added optional inputs-file parameter to Transformation CLI:
    dirac-cwl transformation submit file.cwl --inputs-file file.yaml
  • renamed parameter-path to input_files: list[str] in Job CLI:
    dirac-cwl job submit file.cwl --input-files file1.yaml file2.yaml ...
  • added a group_size hint to the dirac:ExecutionHooks of Transformation Workflows, for example:
hints:
  - class: dirac:ExecutionHooks
    group_size: (int)
  • group_size determines, in submit_transformation_router, how many jobs are created and how many input files each job contains. It defaults to 1, which means one job is created per input in the inputs file. Once the list of jobs is built, it is sent to the job_router and processed.
  • added simple tests and workflows (e.g. counting and listing the input files contained in each created job)
  • quick fix for the JobWrapper-related tests: task.cwl was created during post_process but never cleaned up after running the tests. I couldn't manage to create a fixture to do that (I got strange errors), so it can probably be done more cleanly.
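To illustrate the grouping behaviour described above, here is a minimal sketch of the chunking logic (not the actual submit_transformation_router code; the function name and types are illustrative):

```python
from typing import Any


def group_inputs(inputs: list[Any], group_size: int = 1) -> list[list[Any]]:
    """Split the transformation inputs into per-job groups.

    With group_size=1 (the default) each input becomes its own job;
    larger values pack several inputs into a single job.
    """
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    return [inputs[i:i + group_size] for i in range(0, len(inputs), group_size)]


jobs = group_inputs(["f1.yaml", "f2.yaml", "f3.yaml"], group_size=2)
print(len(jobs))  # 2 jobs: ["f1.yaml", "f2.yaml"] and ["f3.yaml"]
```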

Comments:

TODO after this PR:

class TransformationSubmissionModel(BaseModel):
"""Transformation definition sent to the router."""

# Allow arbitrary types to be passed to the model
model_config = ConfigDict(arbitrary_types_allowed=True)

task: CommandLineTool | Workflow | ExpressionTool
input_data: Optional[list[str | File] | None] = None
Contributor

As we are going to integrate input sandbox within transformations (#92), it would be interesting to see if we could reuse the JobInputModel (renamed as InputModel?)

Contributor Author

@Stellatsuu Stellatsuu Feb 10, 2026

Regarding @arrabito comments:

  • I agree that we don't need an input sandbox for now, so inputs can't be local files.
  • I don't remember how we plan to add support for sandboxes in the transformation system. For simplicity, I would keep just LFN paths for now.
  • As said before, in my opinion there is no need to support/create sandboxes for now.

Do I still make this change in this PR? Or would it be better to do it in a (future) sandbox PR? Maybe I misunderstood what you meant here.

Contributor

Let's make this change in a future sandbox PR I would say

Contributor

Thinking a little bit further, we may also want to allow local file paths, but only to be used for Local execution (without adding them to SB).

So if the submission is local we allow only local paths, while if the submission is to DIRAC we allow only LFN paths.

In this way, we could also execute transformations locally.

Eventually later on, we will also allow local file paths for DIRAC submission (adding them to ISB).

@aldbr what do you think?
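The rule sketched above (local paths only for local execution, LFN paths only for DIRAC submission) could look roughly like this; a hypothetical validator, assuming LFNs are identified by an "LFN:" prefix (an assumption, not the project's actual convention):

```python
def validate_input_paths(paths: list[str], local_submission: bool) -> None:
    """Hypothetical sketch of the submission rule discussed above.

    Local submissions accept only local paths; DIRAC submissions accept
    only LFN paths (naively identified here by an "LFN:" prefix).
    """
    for p in paths:
        is_lfn = p.startswith("LFN:")
        if local_submission and is_lfn:
            raise ValueError(f"Local execution accepts only local paths, got {p}")
        if not local_submission and not is_lfn:
            raise ValueError(f"DIRAC submission accepts only LFN paths, got {p}")


# Accepted: LFNs for a DIRAC submission, local paths for local execution
validate_input_paths(["LFN:/vo/campaign/f1"], local_submission=False)
validate_input_paths(["/tmp/inputs/f1.yaml"], local_submission=True)
```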

@Stellatsuu
Contributor Author

@aldbr Regarding this part of the code:

# Temporary comment
# if transformation_execution_hooks.configuration and transformation_execution_hooks.group_size:
#     # Get the metadata class
#     transformation_metadata = transformation_execution_hooks.to_runtime(transformation)
#
#     # Build the input cwl for the jobs to submit
#     logger.info("Getting the input data for the transformation...")
#     input_data_dict = {}
#     min_length = None
#     for input_name, group_size in transformation_execution_hooks.group_size.items():
#         # Get input query
#         logger.info("\t- Getting input query for %s...", input_name)
#         input_query = transformation_metadata.get_input_query(input_name)
#         if not input_query:
#             raise RuntimeError("Input query not found.")
#
#         # Wait for the input to be available
#         logger.info("\t- Waiting for input data for %s...", input_name)
#         logger.debug("\t\t- Query: %s", input_query)
#         logger.debug("\t\t- Group Size: %s", group_size)
#         while not (inputs := _get_inputs(input_query, group_size)):
#             logger.debug("\t\t- Result: %s", inputs)
#             time.sleep(5)
#         logger.info("\t- Input data for %s available.", input_name)
#         if not min_length or len(inputs) < min_length:
#             min_length = len(inputs)
#
#         # Update the input data in the metadata
#         # Only keep the first min_length inputs
#         input_data_dict[input_name] = inputs[:min_length]
#
#     # Get the JobModelParameter for each input
#     job_model_params = _generate_job_model_parameter(input_data_dict)
#     logger.info("Input data for the transformation retrieved!")

Are we planning on keeping it? Just so I un-comment it and make the changes related to the group_size type change.

@Stellatsuu
Contributor Author

Waiting on #66 (comment) and #95 (comment) approval about what we're doing, and then, PR should be ready to be fully reviewed (and potentially merged 🙏).

@aldbr
Contributor

aldbr commented Feb 10, 2026


Are we planning on keeping it? Just so I un-comment it and make the changes related to the group_size type change.

Yes we want to keep it. A transformation should either get inputs from the CLI, or from a DataCatalog/Bookkeeping service.

@Stellatsuu
Contributor Author

Stellatsuu commented Feb 10, 2026

I'm also not sure whether the job_grouping workflow is worth keeping. It only counts the files in each job and lists them, based on the inputs_file content.
I'm not sure what else to test beyond that.

Also, the test_run_transformation_with_inputs_file test only checks whether the transformation succeeds; it doesn't check whether the number of groups and jobs created is correct. Maybe I should change that? If so, I think it would be better to create a dedicated test for this case, so we keep a more general test for whether a transformation with an inputs_file succeeds, plus a dedicated test for this automatic grouping check.

If you have any ideas, let me know.

@Stellatsuu
Contributor Author


Are we planning on keeping it? Just so I un-comment it and make the changes related to the group_size type change.

Yes we want to keep it. A transformation should either get inputs from the CLI, or from a DataCatalog/Bookkeeping service.

Since group_size is now an int, this code is kinda broken now, no? I need an input_name to retrieve the input_query: input_query = transformation_metadata.get_input_query(input_name), where would this value be now? In a Transformation hint? As a list of input_names?

@arrabito
Contributor

Since group_size is now an int, this code is kinda broken now, no? I need an input_name to retrieve the input_query: input_query = transformation_metadata.get_input_query(input_name), where would this value be now? In a Transformation hint? As a list of input_names?

As far as I can see, I'm not sure that any input_name is needed anymore.

In the current QueryBasedPlugin, input_name is just used to build the LFN path, see:

https://github.com/DIRACGrid/dirac-cwl/blob/main/src/dirac_cwl/execution_hooks/plugins/core.py#L37C31-L37C41

Probably we could just change get_input_query to not take any argument and just build the LFN path as:

/query_root/vo/campaign/site/data_type

instead of:

/query_root/vo/campaign/site/data_type/input_name

Then, I guess that the group_size in yaml file should be specified as:

- class: dirac:ExecutionHooks
  hook_plugin: QueryBasedPlugin
  group_size: 5

instead of:

- class: dirac:ExecutionHooks
  hook_plugin: QueryBasedPlugin
  group_size:
    input-data: 5

@aldbr do you agree?

(Maybe some other changes are needed that I haven't thought of.)
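A rough sketch of what the argument-free get_input_query could look like (hypothetical signature and field names; it only illustrates the path layout above, not the real plugin code):

```python
def get_input_query(query_root: str, vo: str, campaign: str, site: str, data_type: str) -> str:
    """Hypothetical sketch: build the LFN path without any input_name segment,
    i.e. /query_root/vo/campaign/site/data_type."""
    return "/" + "/".join([query_root, vo, campaign, site, data_type])


print(get_input_query("query_root", "vo", "campaign", "site", "data_type"))
# /query_root/vo/campaign/site/data_type
```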

Comment on lines +98 to +99
task_file = job_wrapper.job_path / "task.cwl"
task_file.unlink(missing_ok=True)
Contributor

I see that you are using these 2 lines in different tests.
Can you create or reuse an existing fixture within conftest?
You can generally yield and then add these 2 lines so they are executed after the test runs. Example: https://docs.pytest.org/en/6.2.x/fixture.html#yield-fixtures-recommended

Contributor Author

I tried to add a fixture like that:

@pytest.fixture
def cleanup_wrapper():
    job_wrapper = JobWrapper()
    yield job_wrapper

    task_file = job_wrapper.job_path / "task.cwl"
    task_file.unlink(missing_ok=True)

So I could use the yielded JobWrapper in the tests and clean up afterwards (because I need access to the job_path value).

But I kept getting errors about the create_sandbox method. This also happens when the fixture is not called in any test; just having it in conftest makes the tests fail:

self = <[AttributeError("'pathlib._local.PosixPath' object has no attribute '_raw_paths'") raised in repr()] PosixPath object at 0x1070f0c80>
args = (<coroutine object create_sandbox at 0x1070ebb40>,), paths = [], arg = <coroutine object create_sandbox at 0x1070ebb40>
path = <coroutine object create_sandbox at 0x1070ebb40>
E TypeError: argument should be a str or an os.PathLike object where __fspath__ returns a str, not 'coroutine'

I spent a lot of time on this yesterday and I still don't understand why it occurs, so I just added the lines that were working.

If you have any ideas on why it doesn't work, let me know.
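For what it's worth, that TypeError usually means an un-awaited coroutine was passed where a path was expected. A minimal reproduction of the error class, with a hypothetical async create_sandbox standing in for the real method:

```python
import asyncio
from pathlib import Path


async def create_sandbox() -> str:
    # Hypothetical stand-in for an async create_sandbox method
    return "/tmp/sandbox"


coro = create_sandbox()  # called without await: this is a coroutine object, not a str
try:
    Path(coro)  # Path() rejects it with the same TypeError as in the traceback
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
finally:
    coro.close()  # avoid the "coroutine was never awaited" RuntimeWarning

# Awaiting (or running) the coroutine first yields the actual path
sandbox = Path(asyncio.run(create_sandbox()))
print(sandbox)  # /tmp/sandbox
```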

Contributor

I haven't looked at your code, but I had a similar error when the DIRAC_PROTO_LOCAL variable was not correctly set.
So maybe you can check the lines in your code that use os.environ["DIRAC_PROTO_LOCAL"].

@aldbr
Contributor

aldbr commented Feb 10, 2026


@aldbr do you agree?


Yes I agree. In any case, this is going to be revised at some point with the hints proposed in #69

@Stellatsuu Stellatsuu self-assigned this Feb 17, 2026
@Stellatsuu
Contributor Author

Current PR status:

@Stellatsuu Stellatsuu deployed to github-pages February 19, 2026 12:56 — with GitHub Actions Active
@aldbr aldbr requested a review from natthan-pigoux March 2, 2026 16:48


Development

Successfully merging this pull request may close these issues.

Support data-processing transformations specified through a static list of input files
