COE-302: Implement typed config schemas and validation by kumanday · Pull Request #7 · trilogy-group/StackPerf

kumanday · 2026-03-26T09:49:36Z

Summary

Implement typed config models with YAML loading, validation, and deterministic precedence rules.

Deliverables

Config models (src/benchmark_core/config.py)
Config loader (src/benchmark_core/config_loader.py)
Example config files in configs/providers/, configs/harnesses/, configs/variants/, configs/experiments/, configs/task-cards/

Acceptance Criteria

Invalid configs fail with precise field-level errors
Valid configs load into typed objects
Examples cover Anthropic-surface harness profile (claude-code.yaml)
Examples cover OpenAI-surface harness profile (openai-cli.yaml)

Testing

All 38 tests passing:

Typed config schema tests
Field-level error tests
Config loading and cross-reference validation tests

Config Examples

Anthropic surface:

Provider: Fireworks (anthropic_messages)
Harness: Claude Code (anthropic_messages)

OpenAI surface:

Provider: OpenAI (openai_responses)
Harness: OpenAI CLI (openai_responses)

Evidence

Config Loading Demo

$ python scripts/demo_config_loading.py

Output:

============================================================
COE-302: Config Loading Evidence Demo
============================================================

1. Providers Loaded:
----------------------------------------
  • fireworks
    - protocol_surface: anthropic_messages
    - models: ['kimi-k2-5', 'glm-5']
  • openai
    - protocol_surface: openai_responses
    - models: ['gpt-4o', 'gpt-4o-mini']

2. Harness Profiles Loaded:
----------------------------------------
  • claude-code
    - protocol_surface: anthropic_messages
    - base_url_env: ANTHROPIC_BASE_URL
  • openai-cli
    - protocol_surface: openai_responses
    - base_url_env: OPENAI_BASE_URL

3. Variants Loaded:
----------------------------------------
  • fireworks-glm-5-claude-code
    - provider: fireworks
    - harness_profile: claude-code
    - model_alias: glm-5
  • fireworks-kimi-k2-5-claude-code
    - provider: fireworks
    - harness_profile: claude-code
    - model_alias: kimi-k2-5
  • openai-gpt-4o-cli
    - provider: openai
    - harness_profile: openai-cli
    - model_alias: gpt-4o

4. Experiments Loaded:
----------------------------------------
  • fireworks-terminal-agents-comparison
    - variants: ['fireworks-kimi-k2-5-claude-code', 'fireworks-glm-5-claude-code']

5. Task Cards Loaded:
----------------------------------------
  • repo-auth-analysis
    - goal: identify auth flow, trust boundaries, and risky ed...

6. Protocol Surface Coverage:
----------------------------------------
  Anthropic surfaces:
    - Providers: ['fireworks']
    - Harnesses: ['claude-code']
  OpenAI surfaces:
    - Providers: ['openai']
    - Harnesses: ['openai-cli']

============================================================
All configs loaded successfully!
============================================================

Validation Error Demo

$ python scripts/validation_demo.py

Output:

============================================================
COE-302: Validation Error Evidence Demo
============================================================

1. Provider with empty name:
----------------------------------------
  Field-level error: 1 validation error for ProviderConfig
name
  Value error, must not be empty or whitespace [type=value_error, input_value='', input_type=str]

2. Variant with missing benchmark tags:
----------------------------------------
  Field-level error: 1 validation error for Variant
  Value error, benchmark_tags must include: harness, model, provider [type=value_error, input_value={'name': 'test', 'provide...', 'benchmark_tags': {}}, input_type=dict]

3. Task card with negative timebox:
----------------------------------------
  Field-level error: 1 validation error for TaskCard
session_timebox_minutes
  Value error, session_timebox_minutes must be positive [type=value_error, input_value=-5, input_type=int]

============================================================
All validation errors caught with precise field-level messages!
============================================================

Summary: - Add comprehensive typed config schemas for providers, harness profiles, variants, experiments, and task cards using Pydantic - Implement ConfigLoader with YAML loading and validation - Create ConfigRegistry for cross-reference validation - Add example config files for providers, harnesses, variants, experiments, and task cards - Include support for Anthropic and OpenAI protocol surfaces Key changes: - src/benchmark_core/config.py: Full config models with field validators - src/benchmark_core/config_loader.py: YAML loader with validation logic - configs/providers/: Fireworks and OpenAI provider examples - configs/harnesses/: Claude Code (anthropic) and OpenAI CLI examples - configs/variants/: Valid protocol-compatible variant combinations - configs/experiments/: Example experiment grouping variants - configs/task-cards/: Example task definition - tests/unit/test_config.py: 38 comprehensive unit tests Validation features: - Field-level error messages for precise error reporting - Cross-reference validation (provider/harness/variant references) - Protocol surface compatibility checking - Duplicate name detection within config types - Required benchmark tags validation (harness, provider, model) Rationale: Typed configs provide compile-time safety, IDE autocomplete, and precise validation error messages. Pydantic's field validators enable comprehensive validation at parse time, catching configuration errors before runtime. Tests: - pytest tests/unit/test_config.py -v (38 tests passed) - All acceptance criteria validated: - Invalid configs fail with precise field-level errors - Valid configs load into typed objects - Examples include Anthropic (claude-code) and OpenAI (openai-cli) surfaces Co-authored-by: openhands <openhands@all-hands.dev>

Update test_import_config_models to use the new typed config schemas with proper field requirements: - ProviderConfig: protocol_surface, upstream_base_url_env, api_key_env - HarnessProfile: protocol_surface, base_url_env, api_key_env, model_env - Variant: model_alias, benchmark_tags - TaskCard: name, goal, starting_prompt, stop_condition Co-authored-by: openhands <openhands@all-hands.dev>

github-actions

The implementation is solid, but tests have hardcoded absolute paths that will fail for other developers. Fix test paths and address the unused class before merging.

tests/unit/test_config.py

github-actions · 2026-03-26T10:09:31Z

src/benchmark_core/config.py

+
+    model_config = {"extra": "forbid"}
+
+    check_type: str = Field(..., description="Type of check to perform")


🟠 Important: LaunchCheck class is defined but never used. The launch_checks field in HarnessProfile uses list[str] instead of list[LaunchCheck].

Either:

Remove this class entirely (if launch checks are just freeform strings), or

Update HarnessProfile.launch_checks to list[LaunchCheck] and adjust the YAML loading to parse structured check objects

Suggested change

check_type: str = Field(..., description="Type of check to perform")

# Remove the LaunchCheck class (lines 82-88)

Or update HarnessProfile.launch_checks to use it.

github-actions · 2026-03-26T10:09:31Z

src/benchmark_core/config_loader.py

+
+    def __init__(self, configs_dir: Path | str | None = None) -> None:
+        """Initialize config loader.
+


🟡 Suggestion: Consider adding validation that variant.provider_route matches the provider.route_name when both are present. Currently the cross-reference validation checks provider existence and model alias, but not route consistency.

This could catch config errors where a variant claims to use a different route than the provider exposes.

github-actions · 2026-03-26T10:09:31Z

Makefile

 .PHONY: help install install-dev sync lint format format-check type-check test test-unit test-integration test-cov clean quality dev-setup dev-check

+# Set PYTHONPATH for all targets
+export PYTHONPATH := $(PWD)/src:$(PYTHONPATH)


🟢 Nit: If PYTHONPATH is empty in the shell, this results in a trailing colon (/path/src:) which may cause subtle issues. Consider:

Suggested change

export PYTHONPATH := $(PWD)/src:$(PYTHONPATH)

export PYTHONPATH := $(or $(PYTHONPATH),$(PWD)/src)

Or use := $(PWD)/src without the suffix since project-local code should take precedence anyway.

…eck class

github-actions

The implementation is solid overall. Previous critical issues appear resolved (tests now use relative paths via Path(__file__), no unused LaunchCheck class). However, the PR description lacks required evidence of real code execution.

Must Fix: Add an Evidence section showing the config loading working end-to-end (e.g., a Python snippet that loads and prints configs, or CLI output). Unit test output alone does not satisfy the evidence requirement.

github-actions · 2026-03-26T10:12:15Z

Makefile

 .PHONY: help install install-dev sync lint format format-check type-check test test-unit test-integration test-cov clean quality dev-setup dev-check

+# Set PYTHONPATH for all targets
+export PYTHONPATH := $(PWD)/src:$(PYTHONPATH)


🟢 Nit: Acknowledged from previous review. If PYTHONPATH is empty in the shell, this results in a trailing colon (/path/src:) which may cause subtle issues. Consider $(or $(PYTHONPATH),$(PWD)/src) or simply $(PWD)/src without the suffix since project-local code should take precedence.

github-actions · 2026-03-26T10:12:15Z

src/benchmark_core/config_loader.py

+                errors.append(
+                    f"Variant '{variant_name}': referenced provider '{variant.provider}' not found"
+                )
+            else:


🟡 Suggestion: Consider adding validation that variant.provider_route matches provider.route_name when both are present. Currently the cross-reference validation checks provider existence and model alias, but not route consistency.

This could catch config errors where a variant claims to use a different route than the provider defines. Example check:

if variant.provider_route and provider.route_name: if variant.provider_route != provider.route_name: errors.append(...)

This is optional since provider_route may be intentionally different in some routing scenarios.

github-actions

The implementation is solid. Previous critical issues (hardcoded paths, unused LaunchCheck class) are resolved. However, the PR description lacks the required Evidence section showing actual code execution.

tests/unit/test_config.py

Makefile

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions

The implementation is solid. Previous critical issues (hardcoded paths, unused LaunchCheck class) are resolved. One minor documentation item to address.

tests/unit/test_config.py

src/benchmark_core/config.py

scripts/demo_config_loading.py

github-actions · 2026-03-26T10:50:45Z

src/benchmark_core/config_loader.py

+                errors.append(
+                    f"Variant '{variant_name}': referenced provider '{variant.provider}' not found"
+                )
+            else:


🟡 Suggestion: Consider adding validation that variant.provider_route matches provider.route_name when both are present. This could catch config errors where a variant claims to use a different route than its provider defines. Currently cross-reference validation checks provider existence, model alias, and protocol surface, but not route consistency.

This is optional - the current validation is already solid.

Makefile

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions

Implementation is solid. Previous critical issues (hardcoded paths, unused LaunchCheck class) are resolved. Tests use portable Path(__file__) patterns. Demo scripts exist and work correctly.

Must Fix

🔴 PR description lacks Evidence section

Previous reviews requested this. The demo scripts exist and work correctly:

python scripts/demo_config_loading.py
python scripts/validation_demo.py

Please add an Evidence section to the PR description with the output from these commands. This demonstrates the config loading works end-to-end beyond unit tests.

Example format:

Evidence

Config Loading:

$ python scripts/demo_config_loading.py
============================================================
COE-302: Config Loading Evidence Demo
============================================================
...
All configs loaded successfully!
============================================================

Validation Errors:

$ python scripts/validation_demo.py
============================================================
COE-302: Validation Error Evidence Demo
============================================================
...
All validation errors caught with precise field-level messages!
============================================================

Unresolved Suggestions

The following from previous reviews are optional improvements (not blocking):

🟡 Suggestion (config_loader.py:84): Consider adding validation that variant.provider_route matches provider.route_name when both are present. This could catch config errors where a variant claims to use a different route than its provider defines. Currently cross-reference validation checks provider existence and model alias, but not route consistency.

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions

The implementation is solid. Previous critical issues (hardcoded paths, unused LaunchCheck class) are resolved. Tests use portable Path(__file__) patterns. Demo scripts exist and work correctly.

Must Fix

🔴 PR description lacks Evidence section

Previous reviews requested this. The demo scripts exist and work correctly:

python scripts/demo_config_loading.py
python scripts/validation_demo.py

Please add an Evidence section to the PR description with the output from these commands showing the config loading working end-to-end. This is a requirement per the project's code review standards.

Resolved Issues

🟢 Tests now use portable paths via Path(__file__) - hardcoded absolute path issue is fixed
🟢 Unused LaunchCheck class removed - launch_checks as list[str] is appropriate
🟢 Demo scripts exist and demonstrate real code execution

Suggestions (Optional)

🟡 config_loader.py:84 - Consider adding validation that variant.provider_route matches provider.route_name when both are present. Currently cross-reference validation checks provider existence and model alias, but not route consistency. This could catch config errors where a variant claims to use a different route than its provider defines.

All 38 tests pass. Once the Evidence section is added, this is ready to merge.

github-actions

Implementation is solid. Previous critical issues (hardcoded paths, unused LaunchCheck class) are resolved. Tests use portable Path(__file__) patterns.

Must Fix

🔴 PR description lacks Evidence section

Previous reviews (4+ times) explicitly requested an Evidence section in the PR description. Creating EVIDENCE.md is helpful but does not satisfy the requirement. The PR description itself must include:

python scripts/demo_config_loading.py
python scripts/validation_demo.py

With the output showing end-to-end execution.

Also, PR description says "38 tests passing" but EVIDENCE.md shows "58 passed" - please update.

Suggestion (optional)

The route consistency validation suggestion from previous reviews remains unaddressed but is not blocking.

github-actions

Implementation is solid. All critical issues from previous reviews are resolved:

Tests use portable Path(__file__) patterns
Unused LaunchCheck class removed
Evidence section now in PR description with actual command output

All acceptance criteria pass. Ready to merge.

…relation keys (#14) * COE-306: Build LiteLLM collection job for raw request records and correlation keys - Implement LiteLLMCollector with idempotent ingest and watermark tracking - Add CollectionDiagnostics for missing field reporting - Add CollectionJobService in benchmark_core/services.py - Preserve session correlation keys in metadata - Add comprehensive unit tests (29 tests, all passing) Co-authored-by: openhands <openhands@all-hands.dev> * Update workpad: mark all tasks complete, add validation evidence * Update workpad: document GitHub PR blocker * COE-306: Update workpad - PR creation blocked, ready for human action * COE-306: Update workpad - document active GitHub PR blocker * COE-306: Final workpad update - sync HEAD commit hash * COE-306: Update workpad for retry #2 - document PR creation blocker * COE-306: Final workpad - document complete blockers status * COE-306: Final workpad - correct HEAD commit hash * COE-306: Retry #3 - Update workpad with PR creation blocker status * COE-306: Retry #4 - Update workpad with retry status * COE-306: Final retry #4 workpad - confirmed PAT permission blocker, all fallbacks exhausted * COE-306: Add PR description for manual creation * COE-306: Final workpad - ready for manual PR creation * COE-306: Retry #5 - Document PR creation blocker status after LLM provider change * COE-306: Retry #6 - Updated workpad with retry #6 blocker status * COE-306: Retry #7 - Update workpad with retry #7 confirmation * COE-306: Final workpad - confirmed PAT blocker, ready for manual PR * COE-306: Session #8 - PR #14 created successfully, workpad updated * COE-306: Update environment stamp to c083393 * COE-306: Address PR feedback - fix watermark logic, rename field, add evidence - Fix watermark/start_time interaction: use max() instead of unconditional override - Rename requests_new to requests_normalized for clarity - Remove WORKPAD.md from repo (add to .gitignore) - Add runtime evidence via scripts/demo_collector.py - Add test for watermark/start_time interaction - Update PR_DESCRIPTION.md with Evidence section --------- Co-authored-by: openhands <openhands@all-hands.dev>

kumanday and others added 3 commits March 26, 2026 04:35

chore: set PYTHONPATH in Makefile and fix type annotation

b0705cb

kumanday added the symphony Symphony orchestrated task label Mar 26, 2026

kumanday added 2 commits March 26, 2026 05:06

style: remove trailing whitespace from test_config.py

0184ff5

fix: add types-PyYAML for mypy and remove unused type: ignore

dbfeb91