
COE-302: Implement typed config schemas and validation#7

Merged
kumanday merged 11 commits into main from
leonardogonzalez/coe-302-implement-typed-config-schemas-and-validation-for-providers
Mar 26, 2026

Conversation


@kumanday kumanday commented Mar 26, 2026

Summary

Implement typed config models with YAML loading, validation, and deterministic precedence rules.

Deliverables

  • Config models (src/benchmark_core/config.py)
  • Config loader (src/benchmark_core/config_loader.py)
  • Example config files in configs/providers/, configs/harnesses/, configs/variants/, configs/experiments/, configs/task-cards/
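
As a hedged sketch of what one of these typed models might look like: the field names below come from the commit notes later in this thread (ProviderConfig: protocol_surface, upstream_base_url_env, api_key_env), while the `models` field and the exact validator wording are assumptions modeled on the demo output, not the actual implementation in src/benchmark_core/config.py.

```python
from pydantic import BaseModel, Field, field_validator


class ProviderConfig(BaseModel):
    """Illustrative typed provider config; field set is partly assumed."""

    model_config = {"extra": "forbid"}  # reject unknown YAML keys at parse time

    name: str
    protocol_surface: str = Field(..., description="e.g. anthropic_messages or openai_responses")
    upstream_base_url_env: str
    api_key_env: str
    models: list[str] = Field(default_factory=list)

    @field_validator("name")
    @classmethod
    def name_not_blank(cls, v: str) -> str:
        # Mirrors the field-level error shown in the validation demo below.
        if not v.strip():
            raise ValueError("must not be empty or whitespace")
        return v
```

Because `extra` is forbidden and validators run at parse time, a misspelled key or blank name fails immediately with a field-level message instead of surfacing at runtime.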

Acceptance Criteria

  • Invalid configs fail with precise field-level errors
  • Valid configs load into typed objects
  • Examples cover Anthropic-surface harness profile (claude-code.yaml)
  • Examples cover OpenAI-surface harness profile (openai-cli.yaml)

Testing

All 38 tests passing:

  • Typed config schema tests
  • Field-level error tests
  • Config loading and cross-reference validation tests

Config Examples

Anthropic surface:

  • Provider: Fireworks (anthropic_messages)
  • Harness: Claude Code (anthropic_messages)

OpenAI surface:

  • Provider: OpenAI (openai_responses)
  • Harness: OpenAI CLI (openai_responses)
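
A minimal sketch of how one of these YAML files might load into a typed object. The real loader lives in src/benchmark_core/config_loader.py; the inline YAML keys and the trimmed model here are assumptions based on the demo output further down.

```python
import yaml  # PyYAML, assumed available in the project environment
from pydantic import BaseModel


class ProviderConfig(BaseModel):
    """Trimmed illustrative model; the real one carries more fields."""

    model_config = {"extra": "forbid"}
    name: str
    protocol_surface: str
    models: list[str]


raw = yaml.safe_load("""
name: fireworks
protocol_surface: anthropic_messages
models: [kimi-k2-5, glm-5]
""")
provider = ProviderConfig.model_validate(raw)
print(provider.protocol_surface)  # anthropic_messages
```
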

Evidence

Config Loading Demo

$ python scripts/demo_config_loading.py

Output:

============================================================
COE-302: Config Loading Evidence Demo
============================================================

1. Providers Loaded:
----------------------------------------
  • fireworks
    - protocol_surface: anthropic_messages
    - models: ['kimi-k2-5', 'glm-5']
  • openai
    - protocol_surface: openai_responses
    - models: ['gpt-4o', 'gpt-4o-mini']

2. Harness Profiles Loaded:
----------------------------------------
  • claude-code
    - protocol_surface: anthropic_messages
    - base_url_env: ANTHROPIC_BASE_URL
  • openai-cli
    - protocol_surface: openai_responses
    - base_url_env: OPENAI_BASE_URL

3. Variants Loaded:
----------------------------------------
  • fireworks-glm-5-claude-code
    - provider: fireworks
    - harness_profile: claude-code
    - model_alias: glm-5
  • fireworks-kimi-k2-5-claude-code
    - provider: fireworks
    - harness_profile: claude-code
    - model_alias: kimi-k2-5
  • openai-gpt-4o-cli
    - provider: openai
    - harness_profile: openai-cli
    - model_alias: gpt-4o

4. Experiments Loaded:
----------------------------------------
  • fireworks-terminal-agents-comparison
    - variants: ['fireworks-kimi-k2-5-claude-code', 'fireworks-glm-5-claude-code']

5. Task Cards Loaded:
----------------------------------------
  • repo-auth-analysis
    - goal: identify auth flow, trust boundaries, and risky ed...

6. Protocol Surface Coverage:
----------------------------------------
  Anthropic surfaces:
    - Providers: ['fireworks']
    - Harnesses: ['claude-code']
  OpenAI surfaces:
    - Providers: ['openai']
    - Harnesses: ['openai-cli']

============================================================
All configs loaded successfully!
============================================================

Validation Error Demo

$ python scripts/validation_demo.py

Output:

============================================================
COE-302: Validation Error Evidence Demo
============================================================

1. Provider with empty name:
----------------------------------------
  Field-level error: 1 validation error for ProviderConfig
name
  Value error, must not be empty or whitespace [type=value_error, input_value='', input_type=str]

2. Variant with missing benchmark tags:
----------------------------------------
  Field-level error: 1 validation error for Variant
  Value error, benchmark_tags must include: harness, model, provider [type=value_error, input_value={'name': 'test', 'provide...', 'benchmark_tags': {}}, input_type=dict]

3. Task card with negative timebox:
----------------------------------------
  Field-level error: 1 validation error for TaskCard
session_timebox_minutes
  Value error, session_timebox_minutes must be positive [type=value_error, input_value=-5, input_type=int]

============================================================
All validation errors caught with precise field-level messages!
============================================================
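
The error format above comes from Pydantic's field validators. As a hedged sketch of the kind of validator that produces the timebox message (the actual TaskCard model has more required fields, per the commit notes below):

```python
from pydantic import BaseModel, ValidationError, field_validator


class TaskCard(BaseModel):
    """Illustrative fragment of the task card model."""

    name: str
    session_timebox_minutes: int

    @field_validator("session_timebox_minutes")
    @classmethod
    def timebox_positive(cls, v: int) -> int:
        if v <= 0:
            raise ValueError("session_timebox_minutes must be positive")
        return v


try:
    TaskCard(name="repo-auth-analysis", session_timebox_minutes=-5)
except ValidationError as exc:
    # Pydantic names the offending field and echoes the value error,
    # matching the demo output above.
    print(exc)
```
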

kumanday and others added 3 commits March 26, 2026 04:35
Summary:
- Add comprehensive typed config schemas for providers, harness profiles,
  variants, experiments, and task cards using Pydantic
- Implement ConfigLoader with YAML loading and validation
- Create ConfigRegistry for cross-reference validation
- Add example config files for providers, harnesses, variants,
  experiments, and task cards
- Include support for Anthropic and OpenAI protocol surfaces

Key changes:
- src/benchmark_core/config.py: Full config models with field validators
- src/benchmark_core/config_loader.py: YAML loader with validation logic
- configs/providers/: Fireworks and OpenAI provider examples
- configs/harnesses/: Claude Code (anthropic) and OpenAI CLI examples
- configs/variants/: Valid protocol-compatible variant combinations
- configs/experiments/: Example experiment grouping variants
- configs/task-cards/: Example task definition
- tests/unit/test_config.py: 38 comprehensive unit tests

Validation features:
- Field-level error messages for precise error reporting
- Cross-reference validation (provider/harness/variant references)
- Protocol surface compatibility checking
- Duplicate name detection within config types
- Required benchmark tags validation (harness, provider, model)
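
The cross-reference step can be sketched in plain Python: every variant must point at a known provider and one of that provider's model aliases. The names mirror the demo output above; the real logic lives in src/benchmark_core/config_loader.py and this is only an assumed shape of it.

```python
# Registry data shaped like the demo output; "broken-variant" is wrong on purpose.
providers = {
    "fireworks": ["kimi-k2-5", "glm-5"],
    "openai": ["gpt-4o", "gpt-4o-mini"],
}
variants = {
    "fireworks-glm-5-claude-code": ("fireworks", "glm-5"),
    "broken-variant": ("fireworks", "gpt-4o"),  # alias belongs to openai, not fireworks
}

errors: list[str] = []
for variant_name, (provider, model_alias) in variants.items():
    if provider not in providers:
        errors.append(f"Variant '{variant_name}': referenced provider '{provider}' not found")
    elif model_alias not in providers[provider]:
        errors.append(
            f"Variant '{variant_name}': model alias '{model_alias}' not offered by '{provider}'"
        )

print(errors)
```
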

Rationale:
Typed configs provide compile-time safety, IDE autocomplete, and precise
validation error messages. Pydantic's field validators enable comprehensive
validation at parse time, catching configuration errors before runtime.

Tests:
- pytest tests/unit/test_config.py -v (38 tests passed)
- All acceptance criteria validated:
  - Invalid configs fail with precise field-level errors
  - Valid configs load into typed objects
  - Examples include Anthropic (claude-code) and OpenAI (openai-cli) surfaces

Co-authored-by: openhands <openhands@all-hands.dev>
Update test_import_config_models to use the new typed config schemas
with proper field requirements:
- ProviderConfig: protocol_surface, upstream_base_url_env, api_key_env
- HarnessProfile: protocol_surface, base_url_env, api_key_env, model_env
- Variant: model_alias, benchmark_tags
- TaskCard: name, goal, starting_prompt, stop_condition

Co-authored-by: openhands <openhands@all-hands.dev>
@kumanday kumanday added the symphony Symphony orchestrated task label Mar 26, 2026
@github-actions github-actions bot left a comment

The implementation is solid, but tests have hardcoded absolute paths that will fail for other developers. Fix test paths and address the unused class before merging.


model_config = {"extra": "forbid"}

check_type: str = Field(..., description="Type of check to perform")

🟠 Important: LaunchCheck class is defined but never used. The launch_checks field in HarnessProfile uses list[str] instead of list[LaunchCheck].

Either:

  1. Remove this class entirely (if launch checks are just freeform strings), or
  2. Update HarnessProfile.launch_checks to list[LaunchCheck] and adjust the YAML loading to parse structured check objects
Suggested change
check_type: str = Field(..., description="Type of check to perform")
# Remove the LaunchCheck class (lines 82-88)

Or update HarnessProfile.launch_checks to use it.
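
Option 2 from the review could look like the following hedged sketch. Only `check_type` appears in the diff above; the `target` field and the trimmed HarnessProfile are illustrative assumptions.

```python
from pydantic import BaseModel, Field


class LaunchCheck(BaseModel):
    """Structured launch check (option 2 from the review)."""

    check_type: str = Field(..., description="Type of check to perform")
    target: str = ""  # hypothetical field, e.g. an env var name to verify


class HarnessProfile(BaseModel):
    """Trimmed to the fields relevant here; the real model has more."""

    name: str
    launch_checks: list[LaunchCheck] = Field(default_factory=list)


# YAML loading would hand Pydantic a list of dicts, parsed into LaunchCheck objects.
profile = HarnessProfile.model_validate({
    "name": "claude-code",
    "launch_checks": [{"check_type": "env_var", "target": "ANTHROPIC_BASE_URL"}],
})
```
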


def __init__(self, configs_dir: Path | str | None = None) -> None:
"""Initialize config loader.


🟡 Suggestion: Consider adding validation that variant.provider_route matches the provider.route_name when both are present. Currently the cross-reference validation checks provider existence and model alias, but not route consistency.

This could catch config errors where a variant claims to use a different route than the provider exposes.

Makefile Outdated
.PHONY: help install install-dev sync lint format format-check type-check test test-unit test-integration test-cov clean quality dev-setup dev-check

# Set PYTHONPATH for all targets
export PYTHONPATH := $(PWD)/src:$(PYTHONPATH)

🟢 Nit: If PYTHONPATH is empty in the shell, this results in a trailing colon (/path/src:) which may cause subtle issues. Consider:

Suggested change
export PYTHONPATH := $(PWD)/src:$(PYTHONPATH)
export PYTHONPATH := $(or $(PYTHONPATH),$(PWD)/src)

Or use := $(PWD)/src without the suffix since project-local code should take precedence anyway.

@github-actions github-actions bot left a comment

The implementation is solid overall. Previous critical issues appear resolved (tests now use relative paths via Path(__file__), no unused LaunchCheck class). However, the PR description lacks required evidence of real code execution.

Must Fix: Add an Evidence section showing the config loading working end-to-end (e.g., a Python snippet that loads and prints configs, or CLI output). Unit test output alone does not satisfy the evidence requirement.

Makefile Outdated
.PHONY: help install install-dev sync lint format format-check type-check test test-unit test-integration test-cov clean quality dev-setup dev-check

# Set PYTHONPATH for all targets
export PYTHONPATH := $(PWD)/src:$(PYTHONPATH)

🟢 Nit: Acknowledged from previous review. If PYTHONPATH is empty in the shell, this results in a trailing colon (/path/src:) which may cause subtle issues. Consider $(or $(PYTHONPATH),$(PWD)/src) or simply $(PWD)/src without the suffix since project-local code should take precedence.

errors.append(
f"Variant '{variant_name}': referenced provider '{variant.provider}' not found"
)
else:

🟡 Suggestion: Consider adding validation that variant.provider_route matches provider.route_name when both are present. Currently the cross-reference validation checks provider existence and model alias, but not route consistency.

This could catch config errors where a variant claims to use a different route than the provider defines. Example check:

if variant.provider_route and provider.route_name:
    if variant.provider_route != provider.route_name:
        errors.append(...)

This is optional since provider_route may be intentionally different in some routing scenarios.

@github-actions github-actions bot left a comment

The implementation is solid. Previous critical issues (hardcoded paths, unused LaunchCheck class) are resolved. However, the PR description lacks the required Evidence section showing actual code execution.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions github-actions bot left a comment

The implementation is solid. Previous critical issues (hardcoded paths, unused LaunchCheck class) are resolved. One minor documentation item to address.

errors.append(
f"Variant '{variant_name}': referenced provider '{variant.provider}' not found"
)
else:

🟡 Suggestion: Consider adding validation that variant.provider_route matches provider.route_name when both are present. This could catch config errors where a variant claims to use a different route than its provider defines. Currently cross-reference validation checks provider existence, model alias, and protocol surface, but not route consistency.

This is optional - the current validation is already solid.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions github-actions bot left a comment

Implementation is solid. Previous critical issues (hardcoded paths, unused LaunchCheck class) are resolved. Tests use portable Path(__file__) patterns. Demo scripts exist and work correctly.

Must Fix

🔴 PR description lacks Evidence section

Previous reviews requested this. The demo scripts exist and work correctly:

python scripts/demo_config_loading.py
python scripts/validation_demo.py

Please add an Evidence section to the PR description with the output from these commands. This demonstrates the config loading works end-to-end beyond unit tests.

Example format:

Evidence

Config Loading:

$ python scripts/demo_config_loading.py
============================================================
COE-302: Config Loading Evidence Demo
============================================================
...
All configs loaded successfully!
============================================================

Validation Errors:

$ python scripts/validation_demo.py
============================================================
COE-302: Validation Error Evidence Demo
============================================================
...
All validation errors caught with precise field-level messages!
============================================================

Unresolved Suggestions

The following from previous reviews are optional improvements (not blocking):

🟡 Suggestion (config_loader.py:84): Consider adding validation that variant.provider_route matches provider.route_name when both are present. This could catch config errors where a variant claims to use a different route than its provider defines. Currently cross-reference validation checks provider existence and model alias, but not route consistency.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions github-actions bot left a comment

The implementation is solid. Previous critical issues (hardcoded paths, unused LaunchCheck class) are resolved. Tests use portable Path(__file__) patterns. Demo scripts exist and work correctly.

Must Fix

🔴 PR description lacks Evidence section

Previous reviews requested this. The demo scripts exist and work correctly:

python scripts/demo_config_loading.py
python scripts/validation_demo.py

Please add an Evidence section to the PR description with the output from these commands showing the config loading working end-to-end. This is a requirement per the project's code review standards.

Resolved Issues

🟢 Tests now use portable paths via Path(__file__) - hardcoded absolute path issue is fixed
🟢 Unused LaunchCheck class removed - launch_checks as list[str] is appropriate
🟢 Demo scripts exist and demonstrate real code execution

Suggestions (Optional)

🟡 config_loader.py:84 - Consider adding validation that variant.provider_route matches provider.route_name when both are present. Currently cross-reference validation checks provider existence and model alias, but not route consistency. This could catch config errors where a variant claims to use a different route than its provider defines.

All 38 tests pass. Once the Evidence section is added, this is ready to merge.

@github-actions github-actions bot left a comment

Implementation is solid. Previous critical issues (hardcoded paths, unused LaunchCheck class) are resolved. Tests use portable Path(__file__) patterns.

Must Fix

🔴 PR description lacks Evidence section

Previous reviews (4+ times) explicitly requested an Evidence section in the PR description. Creating EVIDENCE.md is helpful but does not satisfy the requirement. The PR description itself must include:

python scripts/demo_config_loading.py
python scripts/validation_demo.py

With the output showing end-to-end execution.

Also, PR description says "38 tests passing" but EVIDENCE.md shows "58 passed" - please update.

Suggestion (optional)

The route consistency validation suggestion from previous reviews remains unaddressed but is not blocking.

@github-actions github-actions bot left a comment

Implementation is solid. All critical issues from previous reviews are resolved:

  • Tests use portable Path(__file__) patterns
  • Unused LaunchCheck class removed
  • Evidence section now in PR description with actual command output

All acceptance criteria pass. Ready to merge.

@kumanday kumanday merged commit 7bdcae5 into main Mar 26, 2026
3 checks passed
kumanday added a commit that referenced this pull request Mar 27, 2026
…relation keys (#14)

* COE-306: Build LiteLLM collection job for raw request records and correlation keys

- Implement LiteLLMCollector with idempotent ingest and watermark tracking
- Add CollectionDiagnostics for missing field reporting
- Add CollectionJobService in benchmark_core/services.py
- Preserve session correlation keys in metadata
- Add comprehensive unit tests (29 tests, all passing)

Co-authored-by: openhands <openhands@all-hands.dev>

* Update workpad: mark all tasks complete, add validation evidence

* Update workpad: document GitHub PR blocker

* COE-306: Update workpad - PR creation blocked, ready for human action

* COE-306: Update workpad - document active GitHub PR blocker

* COE-306: Final workpad update - sync HEAD commit hash

* COE-306: Update workpad for retry #2 - document PR creation blocker

* COE-306: Final workpad - document complete blockers status

* COE-306: Final workpad - correct HEAD commit hash

* COE-306: Retry #3 - Update workpad with PR creation blocker status

* COE-306: Retry #4 - Update workpad with retry status

* COE-306: Final retry #4 workpad - confirmed PAT permission blocker, all fallbacks exhausted

* COE-306: Add PR description for manual creation

* COE-306: Final workpad - ready for manual PR creation

* COE-306: Retry #5 - Document PR creation blocker status after LLM provider change

* COE-306: Retry #6 - Updated workpad with retry #6 blocker status

* COE-306: Retry #7 - Update workpad with retry #7 confirmation

* COE-306: Final workpad - confirmed PAT blocker, ready for manual PR

* COE-306: Session #8 - PR #14 created successfully, workpad updated

* COE-306: Update environment stamp to c083393

* COE-306: Address PR feedback - fix watermark logic, rename field, add evidence

- Fix watermark/start_time interaction: use max() instead of unconditional override
- Rename requests_new to requests_normalized for clarity
- Remove WORKPAD.md from repo (add to .gitignore)
- Add runtime evidence via scripts/demo_collector.py
- Add test for watermark/start_time interaction
- Update PR_DESCRIPTION.md with Evidence section

---------

Co-authored-by: openhands <openhands@all-hands.dev>
