Dotnet - Add support for Foundry Adaptive evals by alliscode · Pull Request #6267 · microsoft/agent-framework

alliscode · 2026-06-02T15:10:49Z

This pull request adds support for consuming pre-existing Azure AI Foundry rubric (adaptive) evaluators in the .NET agent framework, enabling per-dimension scoring and CI gating on those dimensions. It introduces new core types for rubric evaluators, updates the Foundry evaluation pipeline to accept and process rubric references, and provides a comprehensive end-to-end sample (Evaluation_FoundryRubric) demonstrating usage. Additional documentation and environment setup instructions are included.

Rubric Evaluator Support

Introduced new core types in Microsoft.Agents.AI for rubric evaluators, including RubricScore, GeneratedEvaluatorRef, and per-dimension breakdowns in EvalScoreResult.Dimensions. Added assertion methods for CI gating on dimension scores.
Updated Foundry evaluation wiring to accept FoundryEvaluatorSpec (either built-in names or rubric refs), emit the correct wire shape for rubric evaluators, and preserve rubric refs through the evaluation pipeline. Rubric evaluators are skipped for ground-truth checks.
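As a rough illustration of the shapes described above, here is a minimal Python sketch. The field list for the score record follows the commit message further down (Id, Score?, Applicable, Weight, Reason); the `EvaluatorSpec` alias and the `describe` helper are hypothetical names used only for this example and are not the framework's actual definitions.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class RubricScore:
    """One rubric dimension's score (illustrative mirror of the PR's record)."""
    id: str
    score: Optional[float]  # may be None for non-applicable dimensions
    applicable: bool
    weight: float
    reason: Optional[str] = None


@dataclass(frozen=True)
class GeneratedEvaluatorRef:
    """Reference to a pre-existing rubric evaluator by name and optional version."""
    name: str
    version: Optional[str] = None
    display_name: Optional[str] = None


# Hypothetical alias: a spec is either a built-in evaluator name or a rubric ref.
EvaluatorSpec = Union[str, GeneratedEvaluatorRef]


def describe(spec: EvaluatorSpec) -> str:
    """Hypothetical helper showing how a mixed list can be told apart."""
    if isinstance(spec, GeneratedEvaluatorRef):
        version = spec.version or "latest"
        return f"rubric:{spec.name}@{version}"
    return f"builtin:{spec}"
```

A mixed list such as `[GeneratedEvaluatorRef("policy-rubric", "3"), "relevance"]` can then be handled uniformly by branching on the element type.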

Sample and Documentation

Added a new sample project Evaluation_FoundryRubric that demonstrates connecting to a pre-existing Foundry agent and rubric evaluator, mixing rubric and built-in evaluators, reading per-dimension scores, and enforcing CI quality gates.
Updated related sample READMEs to reference the new rubric evaluator sample.
Added .env.example for environment variable setup and expanded documentation on required endpoints and evaluator/agent distinctions.

Other Improvements

Clarified comments and parameter docs in the Foundry evaluation converter to reflect support for rubric evaluators and their data mapping/tool definitions.

This update enables robust integration with custom rubric evaluators in Azure AI Foundry, supporting advanced evaluation scenarios and CI gating on custom dimensions.

Contribution Checklist

The code builds clean without any errors or warnings
The PR follows the Contribution Guidelines
All unit tests pass, and I have added new tests where possible
Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

github-actions · 2026-06-02T15:15:13Z

Python Test Coverage Report

File	Stmts	Miss	Cover	Missing
packages/foundry/agent_framework_foundry
_foundry_evals.py	336	8	97%	471–472, 507–508, 665, 670, 853, 920
TOTAL	37782	4395	88%

Python Unit Test Overview

Tests	Skipped	Failures	Errors	Time
7517	34 💤	0 ❌	0 🔥	2m 0s ⏱️

Copilot

Pull request overview

Adds cross-language support (Python + .NET) for consuming pre-existing Azure AI Foundry rubric/adaptive evaluators by reference (name/version), surfacing per-dimension rubric scores, and providing assertion helpers for CI gating.

Changes:

Introduces rubric evaluator core types (GeneratedEvaluatorRef, RubricScore, per-dimension score breakdowns) and CI assertion helpers.
Extends Foundry eval wiring to accept mixed evaluator specs (built-ins + rubric refs), emit the correct wire shape (incl. evaluator version), and parse per-dimension scores from result samples.
Adds end-to-end samples and documentation updates showing how to run rubric-based evals and gate on rubric dimensions.

File	Description
python/uv.lock	Lockfile update for dependency specifier ordering.
python/samples/05-end-to-end/evaluation/foundry_evals/README.md	Documents how to reference rubric evaluators and gate on dimensions.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_with_rubric_sample.py	New runnable sample using a rubric evaluator + dimension gating.
python/samples/05-end-to-end/evaluation/foundry_evals/.env.example	Adds env vars for agent + rubric evaluator refs.
python/packages/foundry/tests/test_foundry_evals.py	Adds unit tests for rubric refs and rubric dimension extraction.
python/packages/foundry/agent_framework_foundry/_foundry_evals.py	Adds `GeneratedEvaluatorRef` support and rubric dimension parsing into results.
python/packages/foundry/agent_framework_foundry/__init__.py	Exports `GeneratedEvaluatorRef`.
python/packages/core/tests/core/test_local_eval.py	Adds tests for per-dimension rubric assertion helpers.
python/packages/core/agent_framework/foundry/__init__.pyi	Exposes `GeneratedEvaluatorRef` in stubs.
python/packages/core/agent_framework/foundry/__init__.py	Lazy-export mapping for `GeneratedEvaluatorRef`.
python/packages/core/agent_framework/_evaluation.py	Adds `RubricScore`, `EvalScoreResult.dimensions`, and rubric assertion helpers.
python/packages/core/agent_framework/__init__.py	Exports `RubricScore` at top-level.
dotnet/tests/Microsoft.Agents.AI.UnitTests/EvaluationTests.cs	Adds tests for rubric types + new assertion helpers.
dotnet/tests/Microsoft.Agents.AI.Foundry.UnitTests/FoundryEvalsTests.cs	Updates tests to use `FoundryEvaluatorSpec` and adds rubric parsing tests.
dotnet/tests/Microsoft.Agents.AI.Foundry.UnitTests/FoundryEvalConverterTests.cs	Adds tests ensuring rubric refs emit correct testing criteria wire shape.
dotnet/src/Microsoft.Agents.AI/Evaluation/RubricScore.cs	New core type representing a rubric dimension score.
dotnet/src/Microsoft.Agents.AI/Evaluation/GeneratedEvaluatorRef.cs	New core type referencing a provider-registered rubric evaluator.
dotnet/src/Microsoft.Agents.AI/Evaluation/EvalItemResult.cs	Adds `EvalScoreResult.Dimensions` to carry per-dimension rubric breakdown.
dotnet/src/Microsoft.Agents.AI/Evaluation/AgentEvaluationResults.cs	Adds score/dimension assertion helpers for CI gating (incl. recursion into sub-results).
dotnet/src/Microsoft.Agents.AI.Foundry/Evaluation/FoundryEvalWireModels.cs	Adds wire model support for `evaluator_version`.
dotnet/src/Microsoft.Agents.AI.Foundry/Evaluation/FoundryEvaluatorSpec.cs	New discriminated spec (built-in name vs rubric ref) with implicit conversions.
dotnet/src/Microsoft.Agents.AI.Foundry/Evaluation/FoundryEvals.cs	Accepts evaluator specs, preserves rubric refs through filtering, parses rubric dimensions from samples.
dotnet/src/Microsoft.Agents.AI.Foundry/Evaluation/FoundryEvalConverter.cs	Emits correct testing criteria for rubric refs (name/version + mapping) and skips rubric refs for ground-truth checks.
dotnet/samples/05-end-to-end/Evaluation/Evaluation_FoundryRubric/README.md	New sample documentation for Foundry rubric evaluation + gating.
dotnet/samples/05-end-to-end/Evaluation/Evaluation_FoundryRubric/Program.cs	New end-to-end sample program mixing rubric + built-ins and gating on a dimension.
dotnet/samples/05-end-to-end/Evaluation/Evaluation_FoundryRubric/Evaluation_FoundryRubric.csproj	New sample project.
dotnet/samples/05-end-to-end/Evaluation/Evaluation_FoundryRubric/.env.example	New sample env template for rubric evaluation.
dotnet/samples/02-agents/Evaluation/Evaluation_Multimodal/README.md	Adds link to the new rubric evaluation sample.
dotnet/samples/02-agents/Evaluation/Evaluation_ExpectedOutputs/README.md	Adds link to the new rubric evaluation sample.
dotnet/agent-framework-dotnet.slnx	Adds the new rubric evaluation sample project to the solution.
docs/decisions/0023-foundry-evals-integration.md	Records the follow-up decision and design notes for rubric evaluator consumption.

Copilot's findings

Files reviewed: 30/31 changed files
Comments generated: 3

github-actions

Automated Code Review

Reviewers: 4 | Confidence: 90%

✓ Correctness

No actionable issues found in this dimension.

✓ Security Reliability

No actionable issues found in this dimension.

✓ Test Coverage

The PR adds comprehensive test coverage for the new rubric evaluator functionality in .NET (AssertScoreAtLeast, AssertDimensionScoreAtLeast, AssertNoFailedItems, ParseRubricScores, BuildTestingCriteria with rubric refs, FilterToolEvaluators with rubric refs, FindMissingGroundTruthEvaluators skipping rubric refs). The Python tests cover assert_dimension_score_at_least thoroughly, as well as BuildTestingCriteria, FilterToolEvaluators, and ParseRubricScores. However, the Python assert_score_at_least and assert_no_failed_items methods have zero test coverage despite having non-trivial logic (recursion into sub_results, offender formatting, threshold comparisons).

✗ Design Approach

I found two design issues in the new rubric support. The new sample advertises a CI quality gate but catches and suppresses the failure instead of returning a failing exit code, so the sample’s main scenario does not actually gate CI. Separately, the .NET Foundry evaluator path still auto-appends ToolCallAccuracy whenever tools are present, even when the caller explicitly provided a rubric-only evaluator list; that overrides explicit configuration in a way the Python implementation in this repo avoids.

I found one design issue in the new rubric-score extraction path: the helper says it defensively handles SDK shape variation, but its top-level fallback only works for dict samples. A typed SDK sample object that exposes dimension_scores or rubric_scores directly on the sample instance is silently treated as a non-rubric evaluator, so per-dimension scores disappear from EvalScoreResult.dimensions.

Flagged Issues

dotnet/src/Microsoft.Agents.AI.Foundry/Evaluation/FoundryEvals.cs:155-159 unconditionally appends ToolCallAccuracy when tools are present, overriding an explicit rubric-only evaluator list. The Python implementation only auto-adds tool evaluators when evaluators is None.
python/packages/foundry/agent_framework_foundry/_foundry_evals.py:534-555 — _extract_rubric_scores() only searches top-level rubric keys for dict samples. A typed SDK sample object exposing dimension_scores/rubric_scores directly (no properties wrapper) is silently treated as non-rubric, losing per-dimension scores.
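The first flagged issue comes down to when defaults are allowed to kick in. A minimal sketch of the behavior the review attributes to the Python implementation (auto-add tool evaluators only when `evaluators` is None) might look like this. The constant names, default list, and `resolve_evaluators` function are assumptions made for illustration, not the repository's actual code.

```python
from typing import List, Optional, Sequence

# Hypothetical names; the real built-in evaluator identifiers may differ.
TOOL_CALL_ACCURACY = "tool_call_accuracy"
DEFAULT_EVALUATORS = ["relevance", "coherence"]


def resolve_evaluators(evaluators: Optional[Sequence[str]], has_tools: bool) -> List[str]:
    """Return the evaluator list to run, respecting explicit caller choices."""
    if evaluators is not None:
        # Explicit configuration wins: never append anything the caller did not ask for.
        return list(evaluators)
    resolved = list(DEFAULT_EVALUATORS)
    if has_tools:
        # Only the implicit default path picks up the tool evaluator.
        resolved.append(TOOL_CALL_ACCURACY)
    return resolved
```

With this shape, a rubric-only list passed by the caller is returned unchanged even when tools are present.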

Automated review by alliscode's agents

Address PR microsoft#6267 review comments on the .NET FoundryEvals integration:

- Add source-compat overloads accepting `string[] evaluators` for `FoundryEvals` ctor, `EvaluateTracesAsync`, and `EvaluateFoundryTargetAsync` so existing callers passing string arrays keep compiling unchanged. New overloads forward via a private `ToSpecs` helper that wraps each name through the implicit `string -> FoundryEvaluatorSpec` conversion.
- Guard against `default(FoundryEvaluatorSpec)` entries (both `BuiltinName` and `GeneratedRef` null) that would NRE the downstream converter. Adds `FoundryEvaluatorSpec.IsValid` / `EnsureValid` plus an internal `EnsureAllSpecsValid` helper, wired into the main ctor and both static evaluation entry points.
- Add 6 unit tests covering the new validation surface.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PR microsoft#6267 review comment: the FoundryRubric sample swallowed the AssertDimensionScoreAtLeast failure, so a CI run that included it as a quality gate would still exit 0 even when the rubric regressed. Set `System.Environment.ExitCode = 1` in the catch so CI fails while still letting the rest of the sample's logging complete cleanly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
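The same pattern (record the failure, finish logging, then exit non-zero) can be sketched in Python. This is an illustrative analogue of the .NET fix, not code from the sample; `run_gate` and its `check` parameter are hypothetical names, and the use of AssertionError is an assumption.

```python
from typing import Callable


def run_gate(check: Callable[[], None]) -> int:
    """Run a quality-gate callable and return the process exit code.

    The failure is reported but not re-raised, so any remaining logging
    still runs; the non-zero return value is what makes CI fail.
    """
    exit_code = 0
    try:
        check()
    except AssertionError as exc:
        print(f"Quality gate failed: {exc}")
        exit_code = 1
    print("Evaluation complete.")
    return exit_code
```

A script would end with `sys.exit(run_gate(my_gate))` so the exit code reaches the CI runner.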

PR microsoft#6267 review comment: `_extract_rubric_scores` only searched the `properties` dict when the sample exposed one. When the Azure AI Projects typed SDK returns a Sample object that puts `dimension_scores` / `rubric_scores` directly on the instance (no `properties` wrapper), we missed them and surfaced no per-dimension scores. Add an `else: containers.append(sample)` branch so non-dict typed samples are also inspected for the score keys. Covered by two new tests: one with `dimension_scores` directly on a typed Sample without a `properties` wrapper, and one with the legacy `rubric_scores` key in the same shape.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
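A simplified sketch of the lookup order described in this commit follows. It is not the actual `_extract_rubric_scores` implementation; `find_rubric_entries` and `_get` are hypothetical helpers, and the sketch always includes the sample itself as a fallback container rather than reproducing the exact branch structure of the fix.

```python
from typing import Any, List, Optional

# Canonical key first, then the legacy key, as described in the PR.
SCORE_KEYS = ("dimension_scores", "rubric_scores")


def _get(container: Any, key: str) -> Any:
    """Read a key from a dict or an attribute from a typed object."""
    if isinstance(container, dict):
        return container.get(key)
    return getattr(container, key, None)


def find_rubric_entries(sample: Any) -> Optional[List[Any]]:
    """Locate the raw per-dimension entries on a result sample, if any."""
    containers = []
    properties = _get(sample, "properties")
    if properties is not None:
        containers.append(properties)
    # Top-level fallback: works for dict samples and typed SDK objects alike.
    containers.append(sample)
    for container in containers:
        for key in SCORE_KEYS:
            value = _get(container, key)
            if isinstance(value, list):
                return value
    return None
```

Routing both dict and attribute access through one accessor is what keeps typed samples from being silently treated as non-rubric results.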

PR microsoft#6267 review comments: both assertion helpers shipped without unit tests. Add `TestAssertScoreAtLeast` (above threshold, below w/ offenders, evaluator filter, sub_results recursion) and `TestAssertNoFailedItems` (all passing, failed/errored statuses, sub_results recursion) with a shared `_score_results` fixture builder.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds the core rubric-evaluator surface that mirrors the Python work in PR microsoft#6101 (commit e45b934). Provider-agnostic types only; no Foundry coupling. Subsequent commits will wire these into FoundryEvals.

- RubricScore: per-dimension score record (Id, Score?, Applicable, Weight, Reason).
- EvalScoreResult.Dimensions: optional init-only list of RubricScore. Null for non-rubric (built-in) evaluators.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds the provider-agnostic surface for referencing a pre-existing rubric evaluator and gating CI on per-item / per-dimension thresholds. Mirrors Python PR microsoft#6101 commits e5830dd (ref type) and 4bc6046 (asserts).

- GeneratedEvaluatorRef: name + optional version/display-name, plus a Latest(name) factory for versionless refs (discouraged for CI; consumers should warn at run time).
- AgentEvaluationResults.AssertScoreAtLeast: walks DetailedItems[].Scores, optionally filtered by evaluator name, recurses into SubResults.
- AgentEvaluationResults.AssertDimensionScoreAtLeast: walks each score's Dimensions list, skips non-applicable dimensions by default, supports requireApplicable to flip that, recurses into SubResults.
- AgentEvaluationResults.AssertNoFailedItems: walks DetailedItems for fail/error statuses, recurses into SubResults.

All helpers throw InvalidOperationException (matches existing AssertAllPassed). Truncates offender lists to the first 5 with a '+N more' suffix to keep CI output readable, mirroring the Python helpers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
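The dimension gate described above (walk scores, skip non-applicable dimensions unless asked otherwise, recurse into sub-results, truncate offenders) can be sketched as follows. The `Dim`, `Score`, `Item`, and `Results` classes are simplified stand-ins invented for this example, and raising AssertionError is an assumption; the .NET helpers throw InvalidOperationException.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dim:
    id: str
    score: Optional[float]
    applicable: bool = True


@dataclass
class Score:
    name: str
    dimensions: Optional[List[Dim]] = None  # None for non-rubric evaluators


@dataclass
class Item:
    item_id: str
    scores: List[Score] = field(default_factory=list)


@dataclass
class Results:
    items: List[Item] = field(default_factory=list)
    sub_results: List["Results"] = field(default_factory=list)


def _collect(results, dimension_id, minimum, require_applicable, offenders):
    for item in results.items:
        for score in item.scores:
            for dim in score.dimensions or []:
                if dim.id != dimension_id:
                    continue
                if not dim.applicable:
                    # Skipped by default; flagged only when the caller insists.
                    if require_applicable:
                        offenders.append(f"{item.item_id}: not applicable")
                    continue
                if dim.score is None or dim.score < minimum:
                    offenders.append(f"{item.item_id}: {dim.score}")
    for sub in results.sub_results:
        _collect(sub, dimension_id, minimum, require_applicable, offenders)


def assert_dimension_score_at_least(results, dimension_id, minimum,
                                    require_applicable=False, max_listed=5):
    offenders: List[str] = []
    _collect(results, dimension_id, minimum, require_applicable, offenders)
    if offenders:
        shown = ", ".join(offenders[:max_listed])
        extra = len(offenders) - max_listed
        suffix = f" (+{extra} more)" if extra > 0 else ""
        raise AssertionError(
            f"Dimension '{dimension_id}' below {minimum}: {shown}{suffix}")
```

Collecting every offender before raising, rather than failing on the first one, is what makes the truncated '+N more' summary possible.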

Adds FoundryEvaluatorSpec, a readonly-struct union with implicit conversions from both string and GeneratedEvaluatorRef so call sites can mix built-in evaluator names with rubric evaluator references:

    var evals = new FoundryEvals(
        projectClient,
        model,
        new GeneratedEvaluatorRef("policy-rubric", "3"),
        FoundryEvals.Relevance,
        FoundryEvals.Coherence);

FoundryEvals constructors (3 overloads), EvaluateTracesAsync, and EvaluateFoundryTargetAsync now take FoundryEvaluatorSpec[]/params instead of string[]/params. Existing call sites using string literals or string[] keep working unchanged via implicit conversion.

FoundryEvalConverter.BuildTestingCriteria emits the documented Foundry wire format for rubric refs:

    {
      "type": "azure_ai_evaluator",
      "name": <DisplayName ?? Name>,
      "evaluator_name": <Name>,
      "evaluator_version": <Version>,   // omitted when null
      "initialization_parameters": { "deployment_name": <model> },
      "data_mapping": { conversation arrays, optional tool_definitions }
    }

WireTestingCriterion gains an optional EvaluatorVersion field. Rubric refs are preserved through FilterToolEvaluators (tool-aware but not tool-required) and ignored by FindMissingGroundTruthEvaluators. A versionless ref emits a Trace.TraceWarning at criterion-build time so CI authors notice the floating version (mirrors the Python warning).

Adds 6 new Foundry unit tests (3 BuildTestingCriteria rubric paths, 1 FindMissingGroundTruthEvaluators, 1 FilterToolEvaluators preservation, 1 mixed-order). 369/369 Foundry tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
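For readers who prefer code to the wire-format listing, here is a hedged Python sketch of building that criterion. The key names come from the commit message above; the `build_rubric_criterion` function, its parameters, and the warning text are hypothetical. The contents of `data_mapping` are taken as an argument because the commit message does not spell them out.

```python
import warnings
from typing import Any, Dict, Optional


def build_rubric_criterion(name: str,
                           model: str,
                           data_mapping: Dict[str, Any],
                           version: Optional[str] = None,
                           display_name: Optional[str] = None) -> Dict[str, Any]:
    """Build one testing-criterion dict for a rubric evaluator reference."""
    if version is None:
        # Floating versions make CI runs non-reproducible, so surface it loudly.
        warnings.warn(
            f"Rubric evaluator '{name}' has no pinned version; "
            "the latest version will be used.",
            stacklevel=2,
        )
    criterion: Dict[str, Any] = {
        "type": "azure_ai_evaluator",
        "name": display_name or name,
        "evaluator_name": name,
        "initialization_parameters": {"deployment_name": model},
        "data_mapping": dict(data_mapping),
    }
    if version is not None:
        # Emitted only when pinned; a versionless ref omits the key entirely.
        criterion["evaluator_version"] = version
    return criterion
```

Omitting the key for versionless refs, instead of sending a null, matches the "omitted when null" note in the listing.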

…core

Adds FoundryEvals.ParseRubricScores, called per result inside ParseDetailedItem. Each EvalScoreResult now populates Dimensions when the evaluator's sample carries a rubric breakdown. Accepts three shapes for forward compatibility with provider SDK iterations:

1. sample.properties.dimension_scores (canonical Foundry runtime shape)
2. sample.properties.rubric_scores (preview/legacy key)
3. top-level sample.dimension_scores / sample.rubric_scores (defensive fallback)

Entries missing 'id', 'weight', or 'applicable' are skipped without invalidating well-formed siblings. Non-applicable dimensions may omit 'score' (parsed as null).

Adds 6 unit tests covering canonical and legacy keys, top-level fallback, no-match returns null, malformed-entry skipping, and the non-applicable null-score path. 375/375 Foundry tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
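The entry-level rules in this commit (skip entries missing 'id', 'weight', or 'applicable'; allow a missing 'score') can be illustrated with a short Python sketch. `parse_rubric_entries` is a hypothetical function operating on already-located raw entries; it is not the .NET ParseRubricScores method, and returning plain dicts is a simplification.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("id", "weight", "applicable")


def parse_rubric_entries(entries: List[Any]) -> List[Dict[str, Any]]:
    """Parse raw dimension entries, skipping malformed ones individually."""
    parsed: List[Dict[str, Any]] = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        if any(key not in entry for key in REQUIRED_KEYS):
            # A malformed entry is dropped without invalidating its siblings.
            continue
        parsed.append({
            "id": str(entry["id"]),
            "score": entry.get("score"),  # None when the dimension omits it
            "applicable": bool(entry["applicable"]),
            "weight": float(entry["weight"]),
            "reason": entry.get("reason"),
        })
    return parsed
```

Validating per entry rather than per list means one bad dimension from the service does not erase the rest of the breakdown.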

Adds dotnet/samples/05-end-to-end/Evaluation/Evaluation_FoundryRubric mirroring the Python evaluate_with_rubric_sample.py:

- Fetches a pre-existing Foundry agent via AgentAdministrationClient (GetAgentAsync for latest, GetAgentVersionAsync when FOUNDRY_AGENT_VERSION is pinned).
- References a rubric evaluator by GeneratedEvaluatorRef(name, version); falls back to GeneratedEvaluatorRef.Latest(name) with the documented floating-version warning.
- Mixes the rubric with FoundryEvals.Relevance and FoundryEvals.Coherence in a single FoundryEvals run (implicit string-and-ref conversion).
- Prints per-dimension breakdowns from EvalScoreResult.Dimensions for each item.
- Demonstrates a CI quality gate with AssertDimensionScoreAtLeast("general_quality", 3.0).

Documents the FOUNDRY_PROJECT_ENDPOINT footgun (must be project-scoped URL .../api/projects/<project>, not the bare Azure OpenAI endpoint) and the Eval-Definition-vs-Rubric-Evaluator distinction in the README. Ships a .env.example with the FOUNDRY_* variables. Registers the project in agent-framework-dotnet.slnx and cross-links from the sibling Evaluation_Multimodal / Evaluation_ExpectedOutputs READMEs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
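The endpoint footgun mentioned above lends itself to an early sanity check. The sketch below is an assumption-laden illustration: `is_project_scoped_endpoint` is a hypothetical helper that is not part of the sample, it only checks that the URL path ends in `/api/projects/<project>`, and the hostnames in the usage note are made up.

```python
from urllib.parse import urlparse


def is_project_scoped_endpoint(url: str) -> bool:
    """Return True when the URL path ends with /api/projects/<project>."""
    path_parts = [part for part in urlparse(url).path.split("/") if part]
    if len(path_parts) < 3:
        return False
    # The last three segments must be: "api", "projects", <project name>.
    return path_parts[-3:-1] == ["api", "projects"]
```

For instance, `is_project_scoped_endpoint("https://example.invalid/api/projects/my-project")` is true, while a bare endpoint such as `https://example.invalid/` is rejected, which lets a script fail fast with a clear message before any evaluation call is made.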