RobotSail (Member) commented Jan 27, 2026

Summary

  • Adds MLflow as a new logging backend alongside TensorBoard, W&B, and async JSONL
  • Exposes logging configuration through TrainingArgs for programmatic API usage
  • Adds wandb_project and wandb_entity fields to TrainingArgs for consistency

Changes

New TrainingArgs fields

| Field                  | Type          | Default | Description                                                 |
|------------------------|---------------|---------|-------------------------------------------------------------|
| logger_type            | str           | "async" | Comma-separated loggers: tensorboard, wandb, mlflow, async  |
| run_name               | str \| None   | None    | Run name with placeholder support ({time}, {rank}, etc.)    |
| mlflow_tracking_uri    | str \| None   | None    | MLflow tracking server URI                                  |
| mlflow_experiment_name | str \| None   | None    | MLflow experiment name                                      |
| wandb_project          | str \| None   | None    | W&B project name                                            |
| wandb_entity           | str \| None   | None    | W&B team/entity name                                        |
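
The placeholder support in run_name could work roughly as follows. This is a minimal sketch: the helper name `substitute_placeholders`, the timestamp format, and the explicit `rank` argument are assumptions for illustration, not the library's actual implementation.

```python
from __future__ import annotations

from datetime import datetime


def substitute_placeholders(
    run_name: str | None,
    rank: int = 0,
    now: datetime | None = None,
) -> str | None:
    """Expand {time} and {rank} in a run name (hypothetical helper)."""
    if run_name is None:
        return None
    now = now or datetime.now()
    values = {
        # Assumed timestamp layout; the real format may differ.
        "time": now.strftime("%Y%m%d-%H%M%S"),
        "rank": str(rank),
    }
    result = run_name
    for key, value in values.items():
        # Plain string replacement leaves unknown placeholders untouched.
        result = result.replace("{" + key + "}", value)
    return result
```

Using str.replace rather than str.format means a name containing an unrecognized placeholder passes through unchanged instead of raising KeyError.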

New MLflowHandler class

Implements the same interface as TensorBoardHandler and WandbHandler:

  • Logs metrics via mlflow.log_metrics()
  • Logs hyperparameters via mlflow.log_params()
  • Supports tracking_uri and experiment_name configuration
  • Falls back to MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_NAME env vars
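
The precedence between explicit arguments and environment variables might look like the sketch below. The function `resolve_mlflow_settings` is hypothetical; the handler in this PR may instead skip the explicit calls and let MLflow read the environment variables itself.

```python
from __future__ import annotations

import os
from typing import Mapping


def resolve_mlflow_settings(
    tracking_uri: str | None = None,
    experiment_name: str | None = None,
    env: Mapping[str, str] | None = None,
) -> tuple[str | None, str | None]:
    """Prefer explicit values, then fall back to MLFLOW_* env vars."""
    env = os.environ if env is None else env
    return (
        tracking_uri or env.get("MLFLOW_TRACKING_URI"),
        experiment_name or env.get("MLFLOW_EXPERIMENT_NAME"),
    )
```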

Updated run_training() API

Previously, run_training() hardcoded the logger to "async". Now it reads from TrainingArgs:

# Before
setup_metric_logger("async", None, train_args.ckpt_output_dir)

# After
setup_metric_logger(
    train_args.logger_type,
    train_args.run_name,
    train_args.ckpt_output_dir,
    mlflow_tracking_uri=train_args.mlflow_tracking_uri,
    mlflow_experiment_name=train_args.mlflow_experiment_name,
    wandb_project=train_args.wandb_project,
    wandb_entity=train_args.wandb_entity,
)

Example Usage

from instructlab.training import run_training, TrainingArgs, TorchrunArgs

train_args = TrainingArgs(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    data_path="./data.jsonl",
    ckpt_output_dir="./outputs",
    # ... other required fields ...
    
    # New logging configuration
    logger_type="wandb,mlflow",
    run_name="experiment-{time}",
    mlflow_tracking_uri="http://localhost:5000",
    mlflow_experiment_name="my-experiments",
    wandb_project="my-project",
)

# torch_args is a TorchrunArgs instance; its construction is omitted here
run_training(torch_args, train_args)

Test plan

  • Verify MLflow handler logs metrics correctly to a local MLflow server
  • Verify W&B logging still works with new wandb_project/wandb_entity fields
  • Verify backward compatibility: existing code without logging params defaults to async
  • Verify comma-separated logger_type enables multiple backends simultaneously
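
For the comma-separated logger_type check, a parser along these lines is one way to picture the expected behavior. The function `parse_logger_type` and its error handling are assumptions, not the code in this PR.

```python
from __future__ import annotations

# Backend names taken from the logger_type description above.
KNOWN_LOGGERS = ("tensorboard", "wandb", "mlflow", "async")


def parse_logger_type(logger_type: str) -> list[str]:
    """Split a comma-separated logger string into validated backend names."""
    names = [part.strip().lower() for part in logger_type.split(",") if part.strip()]
    unknown = [name for name in names if name not in KNOWN_LOGGERS]
    if unknown:
        raise ValueError(f"Unknown logger(s): {', '.join(unknown)}")
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(names))
```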

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added MLflow experiment tracking with configurable tracking URI and experiment name.
    • Added optional Weights & Biases integration (project and entity).
    • Added logger selection, run naming, and TensorBoard log-dir options for metric backends.
    • New CLI options to configure these metric logging backends and experiment settings.
  • Documentation

    • Updated usage examples to show multi-backend metric logging and configuration.

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Adds MLflow integration and related logging options: new TrainingArgs fields for run and backend configuration, an MLflowHandler and MLflow wiring in the metric logger, and CLI/entrypoint propagation of the new logging flags.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Configuration Extension (src/instructlab/training/config.py) | Added seven fields to TrainingArgs: logger_type, run_name, mlflow_tracking_uri, mlflow_experiment_name, wandb_project, wandb_entity, and tensorboard_log_dir. |
| Logging Backend Implementation (src/instructlab/training/logger.py) | Added public MLflowHandler with lifecycle (_setup, emit, close), safe mlflow import handling, metric flattening and hparams handling, and extended setup_metric_logger() signature to accept MLflow/WandB/TensorBoard parameters. |
| CLI Integration / Entrypoints (src/instructlab/training/main_ds.py) | Added CLI options --mlflow_tracking_uri, --mlflow_experiment_name, --wandb_project, --wandb_entity, --tensorboard_log_dir; updated calls to setup_metric_logger() and propagated flags into subprocess argument construction. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as main_ds.py
    participant Setup as setup_metric_logger()
    participant MLflowHandler
    participant MLflow as MLflow Backend

    User->>CLI: invoke training with --mlflow_tracking_uri / --mlflow_experiment_name / --run_name
    CLI->>Setup: call setup_metric_logger(..., mlflow_tracking_uri, mlflow_experiment_name, run_name, ...)
    Setup->>MLflowHandler: instantiate MLflowHandler(run_name, tracking_uri, experiment_name, ...)
    MLflowHandler->>MLflowHandler: _setup() (validate mlflow import, configure)
    MLflowHandler->>MLflow: set_tracking_uri(tracking_uri)
    MLflowHandler->>MLflow: set_experiment(experiment_name)
    MLflowHandler->>MLflow: start_run(run_name)
    Note over Setup,MLflowHandler: Handler registered with logging configuration
    CLI->>MLflowHandler: emit(LogRecord with metrics)
    MLflowHandler->>MLflow: log_metrics(flattened_metrics, step)
    MLflowHandler->>MLflow: end_run() on close()
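
The lifecycle in the diagram (set up a run, emit metrics, end the run on close) can be sketched with the standard logging module alone. Everything here is illustrative: `RecordingBackend` is a hypothetical stand-in for the mlflow module, and `MetricHandler` is not the PR's MLflowHandler.

```python
import logging


class RecordingBackend:
    """Hypothetical stand-in for the mlflow module; records calls only."""

    def __init__(self):
        self.calls = []

    def start_run(self, run_name=None):
        self.calls.append(("start_run", run_name))

    def log_metrics(self, metrics, step=None):
        self.calls.append(("log_metrics", dict(metrics), step))

    def end_run(self):
        self.calls.append(("end_run",))


class MetricHandler(logging.Handler):
    """Forwards dict-valued log records to a metrics backend."""

    def __init__(self, backend, run_name=None, level=logging.INFO):
        super().__init__(level)
        self.backend = backend
        self.run_name = run_name
        self._started = False

    def _setup(self):
        self.backend.start_run(run_name=self.run_name)
        self._started = True

    def emit(self, record):
        if not isinstance(record.msg, dict):
            return  # Only dict messages are treated as metrics.
        if not self._started:
            self._setup()  # Start the run lazily on first metric.
        self.backend.log_metrics(record.msg, step=getattr(record, "step", None))

    def close(self):
        if self._started:
            self.backend.end_run()
            self._started = False
        super().close()


backend = RecordingBackend()
handler = MetricHandler(backend, run_name="demo")
logger = logging.getLogger("metrics_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
logger.info({"loss": 0.5}, extra={"step": 1})
handler.close()
```

Guarding close() with the `_started` flag keeps it safe to call more than once, which matters because logging.shutdown() also closes handlers at interpreter exit.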

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibbled code and stitched a trail,

Runs and metrics set to sail,
MLflow whispers, WandB beams,
TensorBoard keeps dreaming dreams,
A rabbit logs the data tale.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and specifically captures the two main changes: adding MLflow support to the logging system and exposing logging configuration through TrainingArgs fields. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%. |

@mergify mergify bot added the ci-failure label Jan 27, 2026
RobotSail added a commit to Red-Hat-AI-Innovation-Team/training_hub that referenced this pull request Jan 27, 2026
Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through
flat kwargs in sft(), osft(), and lora_sft() convenience functions.

## New Parameters

- `loggers`: List of loggers to enable (e.g., ["wandb", "mlflow", "jsonl"])
- `run_name`: Run name with placeholder support ({time}, {rank})
- `log_level`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `logging_steps`: How often to log metrics
- `wandb_project`, `wandb_entity`, `wandb_run_name`: W&B configuration
- `tensorboard_log_dir`: TensorBoard output directory
- `mlflow_tracking_uri`, `mlflow_experiment_name`: MLflow configuration

## Backend Support

| Logger      | SFT | OSFT | LoRA |
|-------------|-----|------|------|
| wandb       | Yes | Yes  | Yes  |
| tensorboard | Yes | No   | Yes  |
| mlflow      | Yes | No   | Yes  |
| jsonl       | Yes | Yes  | No   |

OSFT emits warnings for unsupported loggers/params and continues.
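
The warn-and-continue behavior could be pictured as below. This is a sketch only: `SUPPORTED` mirrors the table above, while `filter_loggers` and the choice to warn for every algorithm (not just OSFT) are assumptions.

```python
import warnings

# Support matrix copied from the Backend Support table.
SUPPORTED = {
    "sft": {"wandb", "tensorboard", "mlflow", "jsonl"},
    "osft": {"wandb", "jsonl"},
    "lora": {"wandb", "tensorboard", "mlflow"},
}


def filter_loggers(algorithm, loggers):
    """Keep supported loggers; warn about the rest and continue (hypothetical)."""
    supported = SUPPORTED[algorithm]
    kept = []
    for name in loggers:
        if name in supported:
            kept.append(name)
        else:
            warnings.warn(
                f"{algorithm} does not support the {name!r} logger; ignoring it",
                stacklevel=2,
            )
    return kept
```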

Depends on: instructlab/training#680

Co-Authored-By: Claude Opus 4.5 <[email protected]>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/instructlab/training/main_ds.py`:
- Around line 275-283: The call to setup_metric_logger in main() uses
unnecessary defensive getattr() for mlflow/wandb fields; replace getattr(args,
"mlflow_tracking_uri", None), getattr(args, "mlflow_experiment_name", None),
getattr(args, "wandb_project", None), and getattr(args, "wandb_entity", None)
with direct attribute access args.mlflow_tracking_uri,
args.mlflow_experiment_name, args.wandb_project, and args.wandb_entity
respectively so it matches the pattern used in run_training() and with
train_args.
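
Why direct attribute access is safe here: argparse sets every declared option on the namespace, using the default when the flag is absent. A small sketch, with the flag names taken from this PR and the types and defaults assumed:

```python
import argparse

parser = argparse.ArgumentParser()
# Types and defaults are assumptions; only the flag names come from the PR.
parser.add_argument("--mlflow_tracking_uri", type=str, default=None)
parser.add_argument("--wandb_project", type=str, default=None)

# No flags passed: attributes still exist and hold their defaults.
args = parser.parse_args([])
tracking_uri = args.mlflow_tracking_uri  # None, no getattr() needed

# Flag passed: the attribute holds the supplied value.
args_with_flag = parser.parse_args(["--wandb_project", "my-project"])
project = args_with_flag.wandb_project
```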
🧹 Nitpick comments (2)
src/instructlab/training/logger.py (2)

638-665: Unused log_dir parameter.

The log_dir parameter is stored as self.log_dir but never used in _setup() or elsewhere. The docstring mentions it's "used as artifact location" but the implementation doesn't pass it to MLflow. Either use it to set the artifact location or remove it to avoid confusion.

♻️ Option 1: Use log_dir as artifact location
     def _setup(self):
         """Initialize the MLflow run with the configured settings."""
         if mlflow is None:
             msg = (
                 "Could not initialize MLflowHandler because package mlflow could not be imported.\n"
                 "Please ensure it is installed by running 'pip install mlflow'"
             )
             raise RuntimeError(msg)

         if self.tracking_uri:
             mlflow.set_tracking_uri(self.tracking_uri)

         if self.experiment_name:
-            mlflow.set_experiment(self.experiment_name)
+            mlflow.set_experiment(
+                self.experiment_name,
+                artifact_location=str(self.log_dir),
+            )

         self._mlflow_run = mlflow.start_run(
             run_name=self.run_name, **self.mlflow_init_kwargs
         )
♻️ Option 2: Remove unused parameter
     def __init__(
         self,
         level: int = logging.INFO,
         run_name: str | None = None,
-        log_dir: str | os.PathLike = "logs",
         tracking_uri: str | None = None,
         experiment_name: str | None = None,
         **mlflow_init_kwargs: Any,
     ):
         """Initialize the MLflow logger and check for required dependencies.

         Args:
             level: The logging level for this handler
             run_name: Name of the run, can contain placeholders
-            log_dir: Directory where MLflow artifacts should be stored (used as artifact location)
             tracking_uri: MLflow tracking server URI (e.g., "http://localhost:5000")
             experiment_name: Name of the MLflow experiment
             **mlflow_init_kwargs: Additional keyword arguments passed to mlflow.start_run()
         """
         super().__init__(level)

         self.run_name = _substitute_placeholders(run_name)
-        self.log_dir = Path(log_dir)
         self.tracking_uri = tracking_uri
         self.experiment_name = experiment_name
         self.mlflow_init_kwargs = mlflow_init_kwargs.copy()

         self._mlflow_run = None

Note: If removing log_dir, also update setup_metric_logger to not pass it to the MLflow handler config.


711-721: Consider adding a debug log for skipped non-numeric metrics.

Non-numeric values are silently skipped. For consistency with TensorBoardHandler (which warns on type errors), consider adding a debug-level message to help users understand why certain values aren't appearing in MLflow metrics.

♻️ Proposed change
         # Filter to only numeric values for metrics
         metrics_dict = {}
         for k, v in flat_dict.items():
             try:
                 metrics_dict[k] = float(v)
             except (ValueError, TypeError):
                 # Skip non-numeric values for metrics
-                pass
+                warnings.warn(
+                    f"MLflowHandler skipping non-numeric metric '{k}' with value {type(v).__name__}",
+                    stacklevel=2,
+                )
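
To make the flattening and numeric filtering concrete, here is a standalone sketch. The helper names `flatten_dict` and `split_numeric` and the "/" separator are assumptions; the PR's handler may flatten keys differently.

```python
def flatten_dict(d, parent_key="", sep="/"):
    """Flatten nested dicts into a single level with joined keys."""
    flat = {}
    for key, value in d.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten_dict(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def split_numeric(flat):
    """Return (metrics, skipped_keys): values float() accepts vs. the rest."""
    metrics, skipped = {}, []
    for key, value in flat.items():
        try:
            metrics[key] = float(value)
        except (ValueError, TypeError):
            skipped.append(key)
    return metrics, skipped


flat = flatten_dict({"train": {"loss": 0.5, "phase": "warmup"}, "step": 3, "tag": None})
metrics, skipped = split_numeric(flat)
```

Returning the skipped keys alongside the metrics lets the caller decide whether to warn, log at debug level, or stay silent.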

RobotSail and others added 2 commits January 27, 2026 14:56
…logging

- Add tensorboard_log_dir field to TrainingArgs in config.py
- Update setup_metric_logger to use tensorboard_log_dir when provided
- Add CLI argument for tensorboard_log_dir
- Wire tensorboard_log_dir through run_training() to subprocess command

This allows users to specify a custom directory for TensorBoard logs,
defaulting to output_dir if not specified.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Replace defensive getattr() with direct attribute access in main_ds.py
  since args are guaranteed to exist from argparse defaults
- Remove unused log_dir parameter from MLflowHandler
- Add debug logging for non-numeric metrics skipped by MLflowHandler

Co-Authored-By: Claude Opus 4.5 <[email protected]>