Add MLflow support and expose logging configuration in TrainingArgs #680

RobotSail · 2026-01-27T14:33:45Z

Summary

- Adds MLflow as a new logging backend alongside TensorBoard, W&B, and async JSONL
- Exposes logging configuration through `TrainingArgs` for programmatic API usage
- Adds `wandb_project` and `wandb_entity` fields to `TrainingArgs` for consistency

Changes

New `TrainingArgs` fields

| Field | Type | Default | Description |
|---|---|---|---|
| `logger_type` | `str` | `"async"` | Comma-separated loggers: `tensorboard`, `wandb`, `mlflow`, `async` |
| `run_name` | `str \| None` | `None` | Run name with placeholder support (`{time}`, `{rank}`, etc.) |
| `mlflow_tracking_uri` | `str \| None` | `None` | MLflow tracking server URI |
| `mlflow_experiment_name` | `str \| None` | `None` | MLflow experiment name |
| `wandb_project` | `str \| None` | `None` | W&B project name |
| `wandb_entity` | `str \| None` | `None` | W&B team/entity name |
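
To illustrate the placeholder support mentioned for `run_name`, here is a minimal sketch of how `{time}` and `{rank}` might be expanded. The helper name `substitute_placeholders`, the timestamp format, and the use of the `RANK` environment variable are assumptions made for illustration; the PR's own helper may behave differently.

```python
import os
from datetime import datetime


def substitute_placeholders(run_name, now=None, environ=None):
    """Hypothetical helper: expand {time} and {rank} in a run name."""
    if run_name is None:
        return None
    now = now or datetime.now()
    environ = os.environ if environ is None else environ
    values = {
        # Assumed timestamp format; the real format is not stated in the PR.
        "time": now.strftime("%Y%m%d-%H%M%S"),
        # Assumes the launcher exports RANK, as torchrun does.
        "rank": environ.get("RANK", "0"),
    }
    for key, value in values.items():
        run_name = run_name.replace("{" + key + "}", value)
    return run_name
```

Unknown placeholders are left untouched in this sketch, so a typo such as `{tiem}` stays visible in the resulting run name instead of raising an error.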

New `MLflowHandler` class

Implements the same interface as TensorBoardHandler and WandbHandler:

- Logs metrics via `mlflow.log_metrics()`
- Logs hyperparameters via `mlflow.log_params()`
- Supports `tracking_uri` and `experiment_name` configuration
- Falls back to `MLFLOW_TRACKING_URI` and `MLFLOW_EXPERIMENT_NAME` env vars
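
The fallback order in the last bullet can be sketched as an explicit resolution step. The function name `resolve_mlflow_settings` is hypothetical. In the handler excerpt quoted later in this thread, the handler only calls `mlflow.set_tracking_uri()` and `mlflow.set_experiment()` when a value is given, so the fallback there appears to be left to MLflow's own handling of these environment variables.

```python
import os


def resolve_mlflow_settings(tracking_uri=None, experiment_name=None, environ=None):
    """Hypothetical helper: explicit arguments win, env vars are the fallback."""
    environ = os.environ if environ is None else environ
    return (
        tracking_uri or environ.get("MLFLOW_TRACKING_URI"),
        experiment_name or environ.get("MLFLOW_EXPERIMENT_NAME"),
    )
```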

Updated `run_training()` API

Previously, run_training() hardcoded the logger to "async". Now it reads from TrainingArgs:

# Before
setup_metric_logger("async", None, train_args.ckpt_output_dir)

# After
setup_metric_logger(
    train_args.logger_type,
    train_args.run_name,
    train_args.ckpt_output_dir,
    mlflow_tracking_uri=train_args.mlflow_tracking_uri,
    mlflow_experiment_name=train_args.mlflow_experiment_name,
    wandb_project=train_args.wandb_project,
    wandb_entity=train_args.wandb_entity,
)
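
Because `logger_type` is a comma-separated string, something has to split and validate it before handlers are created. A minimal sketch of that step follows; `parse_logger_type` and `VALID_LOGGERS` are illustrative names, not part of the PR, and the real `setup_metric_logger()` may validate differently.

```python
# Backend names taken from the TrainingArgs field description above.
VALID_LOGGERS = ("tensorboard", "wandb", "mlflow", "async")


def parse_logger_type(logger_type):
    """Hypothetical helper: split 'wandb,mlflow' into a validated list."""
    names = [name.strip() for name in logger_type.split(",") if name.strip()]
    unknown = [name for name in names if name not in VALID_LOGGERS]
    if unknown:
        raise ValueError(f"Unknown logger(s): {', '.join(unknown)}")
    # Drop duplicates while keeping the order the user wrote.
    return list(dict.fromkeys(names))
```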

Example Usage

from instructlab.training import run_training, TrainingArgs, TorchrunArgs

# Launch settings for a single node; the values here are placeholders
torch_args = TorchrunArgs(
    nnodes=1,
    nproc_per_node=8,
    node_rank=0,
    rdzv_id=123,
    rdzv_endpoint="127.0.0.1:12345",
)

train_args = TrainingArgs(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    data_path="./data.jsonl",
    ckpt_output_dir="./outputs",
    # ... other required fields ...

    # New logging configuration
    logger_type="wandb,mlflow",
    run_name="experiment-{time}",
    mlflow_tracking_uri="http://localhost:5000",
    mlflow_experiment_name="my-experiments",
    wandb_project="my-project",
)

run_training(torch_args, train_args)

Test plan

- Verify MLflow handler logs metrics correctly to a local MLflow server
- Verify W&B logging still works with new `wandb_project`/`wandb_entity` fields
- Verify backward compatibility: existing code without logging params defaults to `async`
- Verify comma-separated `logger_type` enables multiple backends simultaneously

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features

- Added MLflow experiment tracking with configurable tracking URI and experiment name.
- Added optional Weights & Biases integration (project and entity).
- Added logger selection, run naming, and TensorBoard log-dir options for metric backends.
- New CLI options to configure these metric logging backends and experiment settings.

Documentation

- Updated usage examples to show multi-backend metric logging and configuration.


coderabbitai · 2026-01-27T14:34:08Z

Walkthrough

Adds MLflow integration and related logging options: new TrainingArgs fields for run and backend configuration, an MLflowHandler and MLflow wiring in the metric logger, and CLI/entrypoint propagation of the new logging flags.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Configuration Extension `src/instructlab/training/config.py` | Added seven fields to `TrainingArgs`: `logger_type`, `run_name`, `mlflow_tracking_uri`, `mlflow_experiment_name`, `wandb_project`, `wandb_entity`, and `tensorboard_log_dir`. |
| Logging Backend Implementation `src/instructlab/training/logger.py` | Added public `MLflowHandler` with lifecycle (`_setup`, `emit`, `close`), safe `mlflow` import handling, metric flattening and hparams handling, and extended `setup_metric_logger()` signature to accept MLflow/WandB/TensorBoard parameters. |
| CLI Integration / Entrypoints `src/instructlab/training/main_ds.py` | Added CLI options `--mlflow_tracking_uri`, `--mlflow_experiment_name`, `--wandb_project`, `--wandb_entity`, `--tensorboard_log_dir`; updated calls to `setup_metric_logger()` and propagated flags into subprocess argument construction. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as main_ds.py
    participant Setup as setup_metric_logger()
    participant MLflowHandler
    participant MLflow as MLflow Backend

    User->>CLI: invoke training with --mlflow_tracking_uri / --mlflow_experiment_name / --run_name
    CLI->>Setup: call setup_metric_logger(..., mlflow_tracking_uri, mlflow_experiment_name, run_name, ...)
    Setup->>MLflowHandler: instantiate MLflowHandler(run_name, tracking_uri, experiment_name, ...)
    MLflowHandler->>MLflowHandler: _setup() (validate mlflow import, configure)
    MLflowHandler->>MLflow: set_tracking_uri(tracking_uri)
    MLflowHandler->>MLflow: set_experiment(experiment_name)
    MLflowHandler->>MLflow: start_run(run_name)
    Note over Setup,MLflowHandler: Handler registered with logging configuration
    CLI->>MLflowHandler: emit(LogRecord with metrics)
    MLflowHandler->>MLflow: log_metrics(flattened_metrics, step)
    MLflowHandler->>MLflow: end_run() on close()
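
The lifecycle in the diagram can be sketched as a `logging.Handler` whose backend is injected, which lets the call order be checked without an MLflow server. `SketchMetricHandler` and `RecordingBackend` are illustrative names, lazy setup on first `emit()` and the `step` key in the record are assumptions, and this is not the PR's `MLflowHandler`. Passing the real `mlflow` module as `backend` should work, since the sketch only calls `set_tracking_uri`, `set_experiment`, `start_run`, `log_metrics`, and `end_run`.

```python
import logging


class RecordingBackend:
    """Stand-in for the mlflow module that records every call made to it."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(*args, **kwargs):
            self.calls.append((name, args, kwargs))
            return name  # any non-None value works as a run handle here
        return record


class SketchMetricHandler(logging.Handler):
    """Illustrative handler following the _setup / emit / close lifecycle."""

    def __init__(self, backend, run_name=None, tracking_uri=None,
                 experiment_name=None, level=logging.INFO):
        super().__init__(level)
        self.backend = backend
        self.run_name = run_name
        self.tracking_uri = tracking_uri
        self.experiment_name = experiment_name
        self._run = None

    def _setup(self):
        # Configure the backend, then open the run.
        if self.tracking_uri:
            self.backend.set_tracking_uri(self.tracking_uri)
        if self.experiment_name:
            self.backend.set_experiment(self.experiment_name)
        self._run = self.backend.start_run(run_name=self.run_name)

    def emit(self, record):
        # Only dict payloads are treated as metrics in this sketch.
        if not isinstance(record.msg, dict):
            return
        if self._run is None:
            self._setup()
        metrics = dict(record.msg)
        step = metrics.pop("step", None)  # assumed key for the global step
        self.backend.log_metrics(metrics, step=step)

    def close(self):
        # End the run exactly once, then let logging clean up the handler.
        if self._run is not None:
            self.backend.end_run()
            self._run = None
        super().close()
```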

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibbled code and stitched a trail,

Runs and metrics set to sail,
MLflow whispers, WandB beams,
TensorBoard keeps dreaming dreams,
A rabbit logs the data tale.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and specifically captures the two main changes: adding MLflow support to the logging system and exposing logging configuration through TrainingArgs fields. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%. |


Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through flat kwargs in sft(), osft(), and lora_sft() convenience functions.

## New Parameters

- `loggers`: List of loggers to enable (e.g., ["wandb", "mlflow", "jsonl"])
- `run_name`: Run name with placeholder support ({time}, {rank})
- `log_level`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `logging_steps`: How often to log metrics
- `wandb_project`, `wandb_entity`, `wandb_run_name`: W&B configuration
- `tensorboard_log_dir`: TensorBoard output directory
- `mlflow_tracking_uri`, `mlflow_experiment_name`: MLflow configuration

## Backend Support

| Logger      | SFT | OSFT | LoRA |
|-------------|-----|------|------|
| wandb       | Yes | Yes  | Yes  |
| tensorboard | Yes | No   | Yes  |
| mlflow      | Yes | No   | Yes  |
| jsonl       | Yes | Yes  | No   |

OSFT emits warnings for unsupported loggers/params and continues.

Depends on: instructlab/training#680

Co-Authored-By: Claude Opus 4.5 <[email protected]>
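
The warn-and-continue behaviour described for OSFT can be sketched as a filter over the requested loggers. The support sets below are copied from the backend support table in this commit message; the names `SUPPORTED_LOGGERS` and `filter_loggers` are hypothetical, and applying the same policy to SFT and LoRA is a simplifying assumption, since the message only states it for OSFT.

```python
import warnings

# Support matrix transcribed from the commit message's table.
SUPPORTED_LOGGERS = {
    "sft": {"wandb", "tensorboard", "mlflow", "jsonl"},
    "osft": {"wandb", "jsonl"},
    "lora": {"wandb", "tensorboard", "mlflow"},
}


def filter_loggers(algorithm, loggers):
    """Hypothetical helper: keep supported loggers, warn about the rest."""
    supported = SUPPORTED_LOGGERS[algorithm]
    kept = []
    for name in loggers:
        if name in supported:
            kept.append(name)
        else:
            warnings.warn(
                f"{algorithm} does not support logger '{name}'; ignoring it",
                stacklevel=2,
            )
    return kept
```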

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@src/instructlab/training/main_ds.py`:
- Around line 275-283: The call to setup_metric_logger in main() uses
unnecessary defensive getattr() for mlflow/wandb fields; replace getattr(args,
"mlflow_tracking_uri", None), getattr(args, "mlflow_experiment_name", None),
getattr(args, "wandb_project", None), and getattr(args, "wandb_entity", None)
with direct attribute access args.mlflow_tracking_uri,
args.mlflow_experiment_name, args.wandb_project, and args.wandb_entity
respectively so it matches the pattern used in run_training() and with
train_args.
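
The reasoning behind this suggestion is that `argparse` sets an attribute on the namespace for every declared option, even when the flag is absent from the command line. A small sketch, using a cut-down parser with two of the option names from this PR (the `build_parser` function itself is illustrative):

```python
import argparse


def build_parser():
    """Illustrative parser declaring two of the new logging options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mlflow_tracking_uri", type=str, default=None)
    parser.add_argument("--wandb_project", type=str, default=None)
    return parser


# No flags passed: both attributes still exist and hold their defaults,
# so args.mlflow_tracking_uri is safe without getattr().
args = build_parser().parse_args([])
print(args.mlflow_tracking_uri, args.wandb_project)
```

Direct access also has a practical benefit: if an option is ever renamed or removed, `args.some_option` fails loudly with `AttributeError`, whereas `getattr(args, "some_option", None)` would silently pass `None`.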

🧹 Nitpick comments (2)

src/instructlab/training/logger.py (2)

638-665: Unused log_dir parameter.

The log_dir parameter is stored as self.log_dir but never used in _setup() or elsewhere. The docstring mentions it's "used as artifact location" but the implementation doesn't pass it to MLflow. Either use it to set the artifact location or remove it to avoid confusion.

♻️ Option 1: Use log_dir as artifact location

     def _setup(self):
         """Initialize the MLflow run with the configured settings."""
         if mlflow is None:
             msg = (
                 "Could not initialize MLflowHandler because package mlflow could not be imported.\n"
                 "Please ensure it is installed by running 'pip install mlflow'"
             )
             raise RuntimeError(msg)

         if self.tracking_uri:
             mlflow.set_tracking_uri(self.tracking_uri)

         if self.experiment_name:
-            mlflow.set_experiment(self.experiment_name)
+            mlflow.set_experiment(
+                self.experiment_name,
+                artifact_location=str(self.log_dir),
+            )

         self._mlflow_run = mlflow.start_run(
             run_name=self.run_name, **self.mlflow_init_kwargs
         )

♻️ Option 2: Remove unused parameter

     def __init__(
         self,
         level: int = logging.INFO,
         run_name: str | None = None,
-        log_dir: str | os.PathLike = "logs",
         tracking_uri: str | None = None,
         experiment_name: str | None = None,
         **mlflow_init_kwargs: Any,
     ):
         """Initialize the MLflow logger and check for required dependencies.

         Args:
             level: The logging level for this handler
             run_name: Name of the run, can contain placeholders
-            log_dir: Directory where MLflow artifacts should be stored (used as artifact location)
             tracking_uri: MLflow tracking server URI (e.g., "http://localhost:5000")
             experiment_name: Name of the MLflow experiment
             **mlflow_init_kwargs: Additional keyword arguments passed to mlflow.start_run()
         """
         super().__init__(level)

         self.run_name = _substitute_placeholders(run_name)
-        self.log_dir = Path(log_dir)
         self.tracking_uri = tracking_uri
         self.experiment_name = experiment_name
         self.mlflow_init_kwargs = mlflow_init_kwargs.copy()

         self._mlflow_run = None

Note: If removing log_dir, also update setup_metric_logger to not pass it to the MLflow handler config.

711-721: Consider adding a debug log for skipped non-numeric metrics.

Non-numeric values are silently skipped. For consistency with TensorBoardHandler (which warns on type errors), consider adding a debug-level message to help users understand why certain values aren't appearing in MLflow metrics.

♻️ Proposed change

         # Filter to only numeric values for metrics
         metrics_dict = {}
         for k, v in flat_dict.items():
             try:
                 metrics_dict[k] = float(v)
             except (ValueError, TypeError):
                 # Skip non-numeric values for metrics
-                pass
+                warnings.warn(
+                    f"MLflowHandler skipping non-numeric metric '{k}' with value {type(v).__name__}",
+                    stacklevel=2,
+                )
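
For context, the flatten-then-filter step that this comment refers to can be sketched as two small functions. The names `flatten_metrics` and `numeric_only` and the `/` separator are assumptions; the sketch reports skipped keys at debug level rather than with `warnings.warn`, which is one of the two options discussed here.

```python
import logging

logger = logging.getLogger(__name__)


def flatten_metrics(nested, prefix="", sep="/"):
    """Flatten nested dicts into a single level, joining keys with sep."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_metrics(value, name, sep))
        else:
            flat[name] = value
    return flat


def numeric_only(flat):
    """Keep values that convert to float; log the ones that do not."""
    metrics = {}
    for key, value in flat.items():
        try:
            metrics[key] = float(value)
        except (ValueError, TypeError):
            logger.debug(
                "Skipping non-numeric metric %r of type %s",
                key,
                type(value).__name__,
            )
    return metrics
```

Note that `float()` also accepts numeric strings and booleans, so a value such as `"1e-4"` is kept as a metric in this sketch.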

src/instructlab/training/main_ds.py

…logging

- Add tensorboard_log_dir field to TrainingArgs in config.py
- Update setup_metric_logger to use tensorboard_log_dir when provided
- Add CLI argument for tensorboard_log_dir
- Wire tensorboard_log_dir through run_training() to subprocess command

This allows users to specify a custom directory for TensorBoard logs, defaulting to output_dir if not specified.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Replace defensive getattr() with direct attribute access in main_ds.py since args are guaranteed to exist from argparse defaults
- Remove unused log_dir parameter from MLflowHandler
- Add debug logging for non-numeric metrics skipped by MLflowHandler

Co-Authored-By: Claude Opus 4.5 <[email protected]>

add support for mlflow

d33da9a

mergify bot added the ci-failure label Jan 27, 2026

RobotSail mentioned this pull request Jan 27, 2026

Add unified logging configuration for all algorithms Red-Hat-AI-Innovation-Team/training_hub#34


fix formatting changes

d35920b

coderabbitai bot reviewed Jan 27, 2026


RobotSail and others added 2 commits January 27, 2026 14:56
