Dotnet - Add support for Foundry Adaptive evals #6267
Pull request overview
Adds cross-language support (Python + .NET) for consuming pre-existing Azure AI Foundry rubric/adaptive evaluators by reference (name/version), surfacing per-dimension rubric scores, and providing assertion helpers for CI gating.
Changes:
- Introduces rubric evaluator core types (`GeneratedEvaluatorRef`, `RubricScore`, per-dimension score breakdowns) and CI assertion helpers.
- Extends Foundry eval wiring to accept mixed evaluator specs (built-ins + rubric refs), emit the correct wire shape (incl. evaluator version), and parse per-dimension scores from result samples.
- Adds end-to-end samples and documentation updates showing how to run rubric-based evals and gate on rubric dimensions.
| File | Description |
|---|---|
| python/uv.lock | Lockfile update for dependency specifier ordering. |
| python/samples/05-end-to-end/evaluation/foundry_evals/README.md | Documents how to reference rubric evaluators and gate on dimensions. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_with_rubric_sample.py | New runnable sample using a rubric evaluator + dimension gating. |
| python/samples/05-end-to-end/evaluation/foundry_evals/.env.example | Adds env vars for agent + rubric evaluator refs. |
| python/packages/foundry/tests/test_foundry_evals.py | Adds unit tests for rubric refs and rubric dimension extraction. |
| python/packages/foundry/agent_framework_foundry/_foundry_evals.py | Adds GeneratedEvaluatorRef support and rubric dimension parsing into results. |
| python/packages/foundry/agent_framework_foundry/__init__.py | Exports GeneratedEvaluatorRef. |
| python/packages/core/tests/core/test_local_eval.py | Adds tests for per-dimension rubric assertion helpers. |
| python/packages/core/agent_framework/foundry/__init__.pyi | Exposes GeneratedEvaluatorRef in stubs. |
| python/packages/core/agent_framework/foundry/__init__.py | Lazy-export mapping for GeneratedEvaluatorRef. |
| python/packages/core/agent_framework/_evaluation.py | Adds RubricScore, EvalScoreResult.dimensions, and rubric assertion helpers. |
| python/packages/core/agent_framework/__init__.py | Exports RubricScore at top-level. |
| dotnet/tests/Microsoft.Agents.AI.UnitTests/EvaluationTests.cs | Adds tests for rubric types + new assertion helpers. |
| dotnet/tests/Microsoft.Agents.AI.Foundry.UnitTests/FoundryEvalsTests.cs | Updates tests to use FoundryEvaluatorSpec and adds rubric parsing tests. |
| dotnet/tests/Microsoft.Agents.AI.Foundry.UnitTests/FoundryEvalConverterTests.cs | Adds tests ensuring rubric refs emit correct testing criteria wire shape. |
| dotnet/src/Microsoft.Agents.AI/Evaluation/RubricScore.cs | New core type representing a rubric dimension score. |
| dotnet/src/Microsoft.Agents.AI/Evaluation/GeneratedEvaluatorRef.cs | New core type referencing a provider-registered rubric evaluator. |
| dotnet/src/Microsoft.Agents.AI/Evaluation/EvalItemResult.cs | Adds EvalScoreResult.Dimensions to carry per-dimension rubric breakdown. |
| dotnet/src/Microsoft.Agents.AI/Evaluation/AgentEvaluationResults.cs | Adds score/dimension assertion helpers for CI gating (incl. recursion into sub-results). |
| dotnet/src/Microsoft.Agents.AI.Foundry/Evaluation/FoundryEvalWireModels.cs | Adds wire model support for evaluator_version. |
| dotnet/src/Microsoft.Agents.AI.Foundry/Evaluation/FoundryEvaluatorSpec.cs | New discriminated spec (built-in name vs rubric ref) with implicit conversions. |
| dotnet/src/Microsoft.Agents.AI.Foundry/Evaluation/FoundryEvals.cs | Accepts evaluator specs, preserves rubric refs through filtering, parses rubric dimensions from samples. |
| dotnet/src/Microsoft.Agents.AI.Foundry/Evaluation/FoundryEvalConverter.cs | Emits correct testing criteria for rubric refs (name/version + mapping) and skips rubric refs for ground-truth checks. |
| dotnet/samples/05-end-to-end/Evaluation/Evaluation_FoundryRubric/README.md | New sample documentation for Foundry rubric evaluation + gating. |
| dotnet/samples/05-end-to-end/Evaluation/Evaluation_FoundryRubric/Program.cs | New end-to-end sample program mixing rubric + built-ins and gating on a dimension. |
| dotnet/samples/05-end-to-end/Evaluation/Evaluation_FoundryRubric/Evaluation_FoundryRubric.csproj | New sample project. |
| dotnet/samples/05-end-to-end/Evaluation/Evaluation_FoundryRubric/.env.example | New sample env template for rubric evaluation. |
| dotnet/samples/02-agents/Evaluation/Evaluation_Multimodal/README.md | Adds link to the new rubric evaluation sample. |
| dotnet/samples/02-agents/Evaluation/Evaluation_ExpectedOutputs/README.md | Adds link to the new rubric evaluation sample. |
| dotnet/agent-framework-dotnet.slnx | Adds the new rubric evaluation sample project to the solution. |
| docs/decisions/0023-foundry-evals-integration.md | Records the follow-up decision and design notes for rubric evaluator consumption. |
Copilot's findings
- Files reviewed: 30/31 changed files
- Comments generated: 3
Automated Code Review
Reviewers: 4 | Confidence: 90%
✓ Correctness
No actionable issues found in this dimension.
✓ Security Reliability
No actionable issues found in this dimension.
✓ Test Coverage
The PR adds comprehensive test coverage for the new rubric evaluator functionality in .NET (AssertScoreAtLeast, AssertDimensionScoreAtLeast, AssertNoFailedItems, ParseRubricScores, BuildTestingCriteria with rubric refs, FilterToolEvaluators with rubric refs, FindMissingGroundTruthEvaluators skipping rubric refs). The Python side tests assert_dimension_score_at_least thoroughly and covers BuildTestingCriteria, FilterToolEvaluators, and ParseRubricScores. However, the Python assert_score_at_least and assert_no_failed_items methods have zero test coverage despite having non-trivial logic (recursion into sub_results, offender formatting, threshold comparisons).
✗ Design Approach
I found two design issues in the new rubric support. The new sample advertises a CI quality gate but catches and suppresses the failure instead of returning a failing exit code, so the sample's main scenario does not actually gate CI. Separately, the .NET Foundry evaluator path still auto-appends `ToolCallAccuracy` whenever tools are present, even when the caller explicitly provided a rubric-only evaluator list; that overrides explicit configuration in a way the Python implementation in this repo avoids.

I found one design issue in the new rubric-score extraction path: the helper says it defensively handles SDK shape variation, but its top-level fallback only works for dict samples. A typed SDK sample object that exposes `dimension_scores` or `rubric_scores` directly on the sample instance is silently treated as a non-rubric evaluator, so per-dimension scores disappear from `EvalScoreResult.dimensions`.
Flagged Issues
- dotnet/src/Microsoft.Agents.AI.Foundry/Evaluation/FoundryEvals.cs:155-159 unconditionally appends `ToolCallAccuracy` when tools are present, overriding an explicit rubric-only evaluator list. The Python implementation only auto-adds tool evaluators when `evaluators is None`.
- python/packages/foundry/agent_framework_foundry/_foundry_evals.py:534-555: `_extract_rubric_scores()` only searches top-level rubric keys for dict samples. A typed SDK sample object exposing `dimension_scores` / `rubric_scores` directly (no `properties` wrapper) is silently treated as non-rubric, losing per-dimension scores.
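The first flagged issue comes down to when defaults are allowed to apply. A minimal Python sketch of the behavior the review attributes to the Python implementation (auto-add the tool evaluator only when the caller passed no list) might look like the following. The function name `resolve_evaluators` and the evaluator name strings are hypothetical and are not taken from the PR.

```python
from __future__ import annotations

from typing import Optional, Sequence

# Hypothetical names, used only for illustration.
DEFAULT_EVALUATORS = ["relevance", "coherence"]
TOOL_EVALUATOR = "tool_call_accuracy"


def resolve_evaluators(evaluators: Optional[Sequence[str]], has_tools: bool) -> list[str]:
    """Return the evaluator list to run.

    Defaults (including the tool evaluator) apply only when the caller
    passed None; an explicit list is returned unchanged.
    """
    if evaluators is not None:
        # Explicit configuration wins, even when tools are present.
        return list(evaluators)
    resolved = list(DEFAULT_EVALUATORS)
    if has_tools:
        resolved.append(TOOL_EVALUATOR)
    return resolved
```

With this shape, a rubric-only list such as `["policy-rubric"]` stays rubric-only regardless of whether the agent uses tools.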
Automated review by alliscode's agents
a7cef93 to 5e6bf02
Address PR microsoft#6267 review comments on the .NET FoundryEvals integration:
- Add source-compat overloads accepting `string[] evaluators` for `FoundryEvals` ctor, `EvaluateTracesAsync`, and `EvaluateFoundryTargetAsync` so existing callers passing string arrays keep compiling unchanged. New overloads forward via a private `ToSpecs` helper that wraps each name through the implicit `string -> FoundryEvaluatorSpec` conversion.
- Guard against `default(FoundryEvaluatorSpec)` entries (both `BuiltinName` and `GeneratedRef` null) that would NRE the downstream converter. Adds `FoundryEvaluatorSpec.IsValid` / `EnsureValid` plus an internal `EnsureAllSpecsValid` helper, wired into the main ctor and both static evaluation entry points.
- Add 6 unit tests covering the new validation surface.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
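The validation described in this commit can be illustrated with a small Python analogue. This is a sketch only: `EvaluatorRef`, `EvaluatorSpec`, and `to_specs` are hypothetical stand-ins for the .NET types, and treating "both fields set" as invalid is an assumption of this sketch (the commit only mentions the "both null" case).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, List, Optional, Union


@dataclass(frozen=True)
class EvaluatorRef:
    """Hypothetical stand-in for a rubric evaluator reference."""
    name: str
    version: Optional[str] = None


@dataclass(frozen=True)
class EvaluatorSpec:
    """Hypothetical union: a built-in name or a rubric reference."""
    builtin_name: Optional[str] = None
    generated_ref: Optional[EvaluatorRef] = None

    @property
    def is_valid(self) -> bool:
        # Assumption: exactly one of the two fields must be set.
        return (self.builtin_name is None) != (self.generated_ref is None)


def to_specs(evaluators: Iterable[Union[str, EvaluatorRef, EvaluatorSpec]]) -> List[EvaluatorSpec]:
    """Wrap plain names and refs into specs, rejecting empty specs early."""
    specs: List[EvaluatorSpec] = []
    for index, entry in enumerate(evaluators):
        if isinstance(entry, str):
            spec = EvaluatorSpec(builtin_name=entry)
        elif isinstance(entry, EvaluatorRef):
            spec = EvaluatorSpec(generated_ref=entry)
        else:
            spec = entry
        if not isinstance(spec, EvaluatorSpec) or not spec.is_valid:
            # Fail at the entry point instead of deep inside a converter.
            raise ValueError(f"Evaluator spec at index {index} is empty or malformed.")
        specs.append(spec)
    return specs
```

Failing at the entry point gives the caller an index to fix, rather than a null-reference error from the downstream converter.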
PR microsoft#6267 review comment: the FoundryRubric sample swallowed the AssertDimensionScoreAtLeast failure, so a CI run that included it as a quality gate would still exit 0 even when the rubric regressed. Set `System.Environment.ExitCode = 1` in the catch so CI fails while still letting the rest of the sample's logging complete cleanly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
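The same pattern (record the failure, finish logging, exit non-zero) can be sketched in Python. The helper `run_quality_gate` is hypothetical, and the sketch assumes the gate signals failure by raising `AssertionError`; the .NET helpers in this PR throw `InvalidOperationException` instead.

```python
from typing import Callable


def run_quality_gate(gate: Callable[[], None]) -> int:
    """Run a gate callable and return a process exit code.

    The failure is remembered rather than swallowed, so the remaining
    logging still runs but the process can exit non-zero afterwards.
    """
    exit_code = 0
    try:
        gate()
        print("Quality gate passed.")
    except AssertionError as exc:
        print(f"Quality gate failed: {exc}")
        exit_code = 1
    print("Evaluation summary complete.")
    return exit_code


def failing_gate() -> None:
    """Example gate that always fails (illustrative only)."""
    raise AssertionError("general_quality below 3.0")
```

A script would typically end with `sys.exit(run_quality_gate(my_gate))` so that the CI job observes the non-zero code.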
PR microsoft#6267 review comment: `_extract_rubric_scores` only searched the `properties` dict when the sample exposed one. When the Azure AI Projects typed SDK returns a Sample object that puts `dimension_scores` / `rubric_scores` directly on the instance (no `properties` wrapper), we missed them and surfaced no per-dimension scores. Add an `else: containers.append(sample)` branch so non-dict typed samples are also inspected for the score keys. Covered by two new tests: one with `dimension_scores` directly on a typed Sample without a `properties` wrapper, and one with the legacy `rubric_scores` key in the same shape. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
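The lookup order described here can be sketched as follows. This is not the body of `_extract_rubric_scores`; the function name `find_raw_dimension_scores` is hypothetical, and `SimpleNamespace` stands in for a typed SDK sample object.

```python
from types import SimpleNamespace
from typing import Any, List, Optional

# Key names taken from the commit message above.
SCORE_KEYS = ("dimension_scores", "rubric_scores")


def _lookup(container: Any, key: str) -> Any:
    """Read a key from a dict or an attribute from an object."""
    if isinstance(container, dict):
        return container.get(key)
    return getattr(container, key, None)


def find_raw_dimension_scores(sample: Any) -> Optional[List[Any]]:
    """Search `properties` first, then the sample itself (dict or typed)."""
    containers = []
    properties = _lookup(sample, "properties")
    if properties is not None:
        containers.append(properties)
    # Always inspect the sample too, so typed samples without a
    # `properties` wrapper are not treated as non-rubric results.
    containers.append(sample)
    for container in containers:
        for key in SCORE_KEYS:
            value = _lookup(container, key)
            if isinstance(value, list):
                return value
    return None


# A typed sample with scores directly on the instance.
typed_sample = SimpleNamespace(dimension_scores=[{"id": "general_quality"}])
```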
PR microsoft#6267 review comments: both assertion helpers shipped without unit tests. Add `TestAssertScoreAtLeast` (above threshold, below w/ offenders, evaluator filter, sub_results recursion) and `TestAssertNoFailedItems` (all passing, failed/errored statuses, sub_results recursion) with a shared `_score_results` fixture builder. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds the core rubric-evaluator surface that mirrors the Python work in PR microsoft#6101 (commit e45b934). Provider-agnostic types only; no Foundry coupling. Subsequent commits will wire these into FoundryEvals.
- RubricScore: per-dimension score record (Id, Score?, Applicable, Weight, Reason).
- EvalScoreResult.Dimensions: optional init-only list of RubricScore. Null for non-rubric (built-in) evaluators.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds the provider-agnostic surface for referencing a pre-existing rubric evaluator and gating CI on per-item / per-dimension thresholds. Mirrors Python PR microsoft#6101 commits e5830dd (ref type) and 4bc6046 (asserts).
- GeneratedEvaluatorRef: name + optional version/display-name, plus a Latest(name) factory for versionless refs (discouraged for CI; consumers should warn at run time).
- AgentEvaluationResults.AssertScoreAtLeast: walks DetailedItems[].Scores, optionally filtered by evaluator name, recurses into SubResults.
- AgentEvaluationResults.AssertDimensionScoreAtLeast: walks each score's Dimensions list, skips non-applicable dimensions by default, supports requireApplicable to flip that, recurses into SubResults.
- AgentEvaluationResults.AssertNoFailedItems: walks DetailedItems for fail/error statuses, recurses into SubResults.
All helpers throw InvalidOperationException (matches existing AssertAllPassed). Truncates offender lists to the first 5 with a '+N more' suffix to keep CI output readable, mirroring the Python helpers.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
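A simplified Python sketch of the per-dimension gate described above. All types here are hypothetical stand-ins rather than the framework's classes, and the reading of `require_applicable` (a non-applicable dimension counts as a failure when set) is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricScore:
    id: str
    score: Optional[float]
    applicable: bool = True


@dataclass
class ScoreResult:
    name: str
    dimensions: Optional[List[RubricScore]] = None


@dataclass
class ItemResult:
    item_id: str
    scores: List[ScoreResult] = field(default_factory=list)


@dataclass
class Results:
    items: List[ItemResult] = field(default_factory=list)
    sub_results: List["Results"] = field(default_factory=list)


def _offenders(results: Results, dim_id: str, threshold: float, require_applicable: bool) -> List[str]:
    found: List[str] = []
    for item in results.items:
        for score in item.scores:
            for dim in score.dimensions or []:
                if dim.id != dim_id:
                    continue
                if not dim.applicable:
                    # Skipped by default; counted only when required.
                    if require_applicable:
                        found.append(f"{item.item_id}: not applicable")
                    continue
                if dim.score is None or dim.score < threshold:
                    found.append(f"{item.item_id}: {dim.score}")
    for sub in results.sub_results:  # recurse into nested results
        found.extend(_offenders(sub, dim_id, threshold, require_applicable))
    return found


def assert_dimension_score_at_least(results: Results, dim_id: str, threshold: float,
                                    require_applicable: bool = False, max_listed: int = 5) -> None:
    offenders = _offenders(results, dim_id, threshold, require_applicable)
    if not offenders:
        return
    extra = len(offenders) - max_listed
    suffix = f" (+{extra} more)" if extra > 0 else ""
    listed = ", ".join(offenders[:max_listed])
    raise AssertionError(f"Dimension '{dim_id}' below {threshold}: {listed}{suffix}")
```

Collecting all offenders before raising is what makes the truncated "+N more" message possible; raising on the first failure would hide how widespread a regression is.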
Adds FoundryEvaluatorSpec, a readonly-struct union with implicit conversions
from both string and GeneratedEvaluatorRef so call sites can mix built-in
evaluator names with rubric evaluator references:
    var evals = new FoundryEvals(
        projectClient, model,
        new GeneratedEvaluatorRef("policy-rubric", "3"),
        FoundryEvals.Relevance,
        FoundryEvals.Coherence);
FoundryEvals constructors (3 overloads), EvaluateTracesAsync, and
EvaluateFoundryTargetAsync now take FoundryEvaluatorSpec[]/params instead of
string[]/params. Existing call sites using string literals or string[] keep
working unchanged via implicit conversion.
FoundryEvalConverter.BuildTestingCriteria emits the documented Foundry wire
format for rubric refs:
    {
      "type": "azure_ai_evaluator",
      "name": <DisplayName ?? Name>,
      "evaluator_name": <Name>,
      "evaluator_version": <Version>,   // omitted when null
      "initialization_parameters": { "deployment_name": <model> },
      "data_mapping": { conversation arrays, optional tool_definitions }
    }
WireTestingCriterion gains an optional EvaluatorVersion field. Rubric refs
are preserved through FilterToolEvaluators (tool-aware but not tool-required)
and ignored by FindMissingGroundTruthEvaluators. A versionless ref emits a
Trace.TraceWarning at criterion-build time so CI authors notice the floating
version (mirrors the Python warning).
Adds 6 new Foundry unit tests (3 BuildTestingCriteria rubric paths, 1
FindMissingGroundTruthEvaluators, 1 FilterToolEvaluators preservation, 1
mixed-order). 369/369 Foundry tests pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
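For readers following the wire shape above, a Python sketch of building one such criterion is shown below. The field names come from the commit message; `EvaluatorRef`, `build_rubric_criterion`, and the caller-supplied `data_mapping` argument are hypothetical, and the warning text is illustrative.

```python
import warnings
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional


@dataclass(frozen=True)
class EvaluatorRef:
    """Hypothetical stand-in for a rubric evaluator reference."""
    name: str
    version: Optional[str] = None
    display_name: Optional[str] = None


def build_rubric_criterion(ref: EvaluatorRef, model: str,
                           data_mapping: Mapping[str, Any]) -> Dict[str, Any]:
    """Build a testing criterion dict in the shape listed above."""
    criterion: Dict[str, Any] = {
        "type": "azure_ai_evaluator",
        "name": ref.display_name if ref.display_name is not None else ref.name,
        "evaluator_name": ref.name,
        "initialization_parameters": {"deployment_name": model},
        "data_mapping": dict(data_mapping),
    }
    if ref.version is None:
        # Versionless refs float to the latest version; warn so CI authors notice.
        warnings.warn(f"Evaluator '{ref.name}' has no pinned version.")
    else:
        criterion["evaluator_version"] = ref.version
    return criterion
```

Omitting the key entirely, rather than sending a null value, matches the "omitted when null" note in the wire format.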
…core
Adds FoundryEvals.ParseRubricScores, called per result inside ParseDetailedItem. Each EvalScoreResult now populates Dimensions when the evaluator's sample carries a rubric breakdown. Accepts three shapes for forward compatibility with provider SDK iterations:
1. sample.properties.dimension_scores (canonical Foundry runtime shape)
2. sample.properties.rubric_scores (preview/legacy key)
3. top-level sample.dimension_scores / sample.rubric_scores (defensive fallback)
Entries missing 'id', 'weight', or 'applicable' are skipped without invalidating well-formed siblings. Non-applicable dimensions may omit 'score' (parsed as null). Adds 6 unit tests covering canonical and legacy keys, top-level fallback, no-match returns null, malformed-entry skipping, and the non-applicable null-score path. 375/375 Foundry tests pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
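The entry-level rules in this commit (skip malformed entries, allow a missing score) can be sketched in Python as follows. `ParsedDimension` and `parse_dimension_entries` are hypothetical names, and skipping entries with non-numeric values is an extra assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List, Optional

REQUIRED_KEYS = ("id", "weight", "applicable")


@dataclass
class ParsedDimension:
    id: str
    score: Optional[float]
    applicable: bool
    weight: float
    reason: Optional[str] = None


def parse_dimension_entries(raw_entries: Iterable[Any]) -> List[ParsedDimension]:
    """Parse raw rubric entries, skipping malformed ones individually."""
    parsed: List[ParsedDimension] = []
    for entry in raw_entries:
        if not isinstance(entry, dict):
            continue
        if any(key not in entry for key in REQUIRED_KEYS):
            continue  # a bad entry must not invalidate its siblings
        raw_score = entry.get("score")
        try:
            score = float(raw_score) if raw_score is not None else None
            weight = float(entry["weight"])
        except (TypeError, ValueError):
            continue  # assumption: non-numeric values are also skipped
        parsed.append(ParsedDimension(
            id=str(entry["id"]),
            score=score,  # None when the dimension omitted its score
            applicable=bool(entry["applicable"]),
            weight=weight,
            reason=entry.get("reason"),
        ))
    return parsed
```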
Adds dotnet/samples/05-end-to-end/Evaluation/Evaluation_FoundryRubric mirroring
the Python evaluate_with_rubric_sample.py:
- Fetches a pre-existing Foundry agent via AgentAdministrationClient
(GetAgentAsync for latest, GetAgentVersionAsync when FOUNDRY_AGENT_VERSION
is pinned).
- References a rubric evaluator by GeneratedEvaluatorRef(name, version);
falls back to GeneratedEvaluatorRef.Latest(name) with the documented
floating-version warning.
- Mixes the rubric with FoundryEvals.Relevance and FoundryEvals.Coherence
in a single FoundryEvals run (implicit string-and-ref conversion).
- Prints per-dimension breakdowns from EvalScoreResult.Dimensions for each
item.
- Demonstrates a CI quality gate with AssertDimensionScoreAtLeast("general_quality", 3.0).
Documents the FOUNDRY_PROJECT_ENDPOINT footgun (must be project-scoped URL
.../api/projects/<project>, not the bare Azure OpenAI endpoint) and the
Eval-Definition-vs-Rubric-Evaluator distinction in the README. Ships a
.env.example with the FOUNDRY_* variables.
Registers the project in agent-framework-dotnet.slnx and cross-links from
the sibling Evaluation_Multimodal / Evaluation_ExpectedOutputs READMEs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
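The endpoint footgun mentioned in this commit lends itself to a preflight check. The sketch below is not part of the sample; the function name `looks_project_scoped` and the host names are hypothetical, and the only assumption taken from the README description is that a project-scoped URL ends in `/api/projects/<project>`.

```python
from urllib.parse import urlparse


def looks_project_scoped(endpoint: str) -> bool:
    """Return True when the URL path ends in /api/projects/<project>."""
    segments = [part for part in urlparse(endpoint).path.split("/") if part]
    # Expect at least: "api", "projects", "<project>" as the final three segments.
    return len(segments) >= 3 and segments[-3:-1] == ["api", "projects"]
```

Running such a check before creating the client turns a confusing service error into an immediate, local configuration message.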
5564cd0 to 710a8a7
710a8a7 to db25a71
…ic sample The Azure AI Foundry rubric evaluator concept doc page has not yet been published, so the link in the sample README and Program.cs comment 404s. Drop the references until the upstream doc is live. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    string projectEndpoint = Environment.GetEnvironmentVariable("FOUNDRY_PROJECT_ENDPOINT_3")
        ?? throw new InvalidOperationException("FOUNDRY_PROJECT_ENDPOINT is not set.");
    string model = Environment.GetEnvironmentVariable("FOUNDRY_MODEL_3")
        ?? throw new InvalidOperationException("FOUNDRY_MODEL is not set.");
This pull request adds support for consuming pre-existing Azure AI Foundry rubric (adaptive) evaluators in the .NET agent framework, enabling per-dimension scoring and CI gating on those dimensions. It introduces new core types for rubric evaluators, updates the Foundry evaluation pipeline to accept and process rubric references, and provides an end-to-end sample (`Evaluation_FoundryRubric`) demonstrating usage. Additional documentation and environment setup instructions are included.

Rubric Evaluator Support
- Added core types in `Microsoft.Agents.AI` for rubric evaluators, including `RubricScore`, `GeneratedEvaluatorRef`, and per-dimension breakdowns in `EvalScoreResult.Dimensions`. Added assertion methods for CI gating on dimension scores.
- Updated the Foundry evaluation pipeline to accept `FoundryEvaluatorSpec` (either built-in names or rubric refs), emit the correct wire shape for rubric evaluators, and preserve rubric refs through the evaluation pipeline. Rubric evaluators are skipped for ground-truth checks.

Sample and Documentation
- Added an end-to-end sample, `Evaluation_FoundryRubric`, that demonstrates connecting to a pre-existing Foundry agent and rubric evaluator, mixing rubric and built-in evaluators, reading per-dimension scores, and enforcing CI quality gates.
- Added `.env.example` for environment variable setup and expanded documentation on required endpoints and evaluator/agent distinctions.

This update enables integration with custom rubric evaluators in Azure AI Foundry, including CI gating on custom rubric dimensions.
Contribution Checklist