Conversation
Summary: - Add comprehensive typed config schemas for providers, harness profiles, variants, experiments, and task cards using Pydantic - Implement ConfigLoader with YAML loading and validation - Create ConfigRegistry for cross-reference validation - Add example config files for providers, harnesses, variants, experiments, and task cards - Include support for Anthropic and OpenAI protocol surfaces Key changes: - src/benchmark_core/config.py: Full config models with field validators - src/benchmark_core/config_loader.py: YAML loader with validation logic - configs/providers/: Fireworks and OpenAI provider examples - configs/harnesses/: Claude Code (anthropic) and OpenAI CLI examples - configs/variants/: Valid protocol-compatible variant combinations - configs/experiments/: Example experiment grouping variants - configs/task-cards/: Example task definition - tests/unit/test_config.py: 38 comprehensive unit tests Validation features: - Field-level error messages for precise error reporting - Cross-reference validation (provider/harness/variant references) - Protocol surface compatibility checking - Duplicate name detection within config types - Required benchmark tags validation (harness, provider, model) Rationale: Typed configs provide compile-time safety, IDE autocomplete, and precise validation error messages. Pydantic's field validators enable comprehensive validation at parse time, catching configuration errors before runtime. Tests: - pytest tests/unit/test_config.py -v (38 tests passed) - All acceptance criteria validated: - Invalid configs fail with precise field-level errors - Valid configs load into typed objects - Examples include Anthropic (claude-code) and OpenAI (openai-cli) surfaces Co-authored-by: openhands <openhands@all-hands.dev>
Update test_import_config_models to use the new typed config schemas with proper field requirements: - ProviderConfig: protocol_surface, upstream_base_url_env, api_key_env - HarnessProfile: protocol_surface, base_url_env, api_key_env, model_env - Variant: model_alias, benchmark_tags - TaskCard: name, goal, starting_prompt, stop_condition Co-authored-by: openhands <openhands@all-hands.dev>
src/benchmark_core/config.py
Outdated
|
|
||
| model_config = {"extra": "forbid"} | ||
|
|
||
| check_type: str = Field(..., description="Type of check to perform") |
There was a problem hiding this comment.
🟠 Important: LaunchCheck class is defined but never used. The launch_checks field in HarnessProfile uses list[str] instead of list[LaunchCheck].
Either:
- Remove this class entirely (if launch checks are just freeform strings), or
- Update
HarnessProfile.launch_checkstolist[LaunchCheck]and adjust the YAML loading to parse structured check objects
| check_type: str = Field(..., description="Type of check to perform") | |
| # Remove the LaunchCheck class (lines 82-88) |
Or update HarnessProfile.launch_checks to use it.
|
|
||
| def __init__(self, configs_dir: Path | str | None = None) -> None: | ||
| """Initialize config loader. | ||
|
|
There was a problem hiding this comment.
🟡 Suggestion: Consider adding validation that variant.provider_route matches the provider.route_name when both are present. Currently the cross-reference validation checks provider existence and model alias, but not route consistency.
This could catch config errors where a variant claims to use a different route than the provider exposes.
Makefile
Outdated
| .PHONY: help install install-dev sync lint format format-check type-check test test-unit test-integration test-cov clean quality dev-setup dev-check | ||
|
|
||
| # Set PYTHONPATH for all targets | ||
| export PYTHONPATH := $(PWD)/src:$(PYTHONPATH) |
There was a problem hiding this comment.
🟢 Nit: If PYTHONPATH is empty in the shell, this results in a trailing colon (/path/src:) which may cause subtle issues. Consider:
| export PYTHONPATH := $(PWD)/src:$(PYTHONPATH) | |
| export PYTHONPATH := $(or $(PYTHONPATH),$(PWD)/src) |
Or use := $(PWD)/src without the suffix since project-local code should take precedence anyway.
There was a problem hiding this comment.
The implementation is solid overall. Previous critical issues appear resolved (tests now use relative paths via Path(__file__), no unused LaunchCheck class). However, the PR description lacks required evidence of real code execution.
Must Fix: Add an Evidence section showing the config loading working end-to-end (e.g., a Python snippet that loads and prints configs, or CLI output). Unit test output alone does not satisfy the evidence requirement.
Makefile
Outdated
| .PHONY: help install install-dev sync lint format format-check type-check test test-unit test-integration test-cov clean quality dev-setup dev-check | ||
|
|
||
| # Set PYTHONPATH for all targets | ||
| export PYTHONPATH := $(PWD)/src:$(PYTHONPATH) |
There was a problem hiding this comment.
🟢 Nit: Acknowledged from previous review. If PYTHONPATH is empty in the shell, this results in a trailing colon (/path/src:) which may cause subtle issues. Consider $(or $(PYTHONPATH),$(PWD)/src) or simply $(PWD)/src without the suffix since project-local code should take precedence.
| errors.append( | ||
| f"Variant '{variant_name}': referenced provider '{variant.provider}' not found" | ||
| ) | ||
| else: |
There was a problem hiding this comment.
🟡 Suggestion: Consider adding validation that variant.provider_route matches provider.route_name when both are present. Currently the cross-reference validation checks provider existence and model alias, but not route consistency.
This could catch config errors where a variant claims to use a different route than the provider defines. Example check:
if variant.provider_route and provider.route_name:
if variant.provider_route != provider.route_name:
errors.append(...)This is optional since provider_route may be intentionally different in some routing scenarios.
Co-authored-by: openhands <openhands@all-hands.dev>
| errors.append( | ||
| f"Variant '{variant_name}': referenced provider '{variant.provider}' not found" | ||
| ) | ||
| else: |
There was a problem hiding this comment.
🟡 Suggestion: Consider adding validation that variant.provider_route matches provider.route_name when both are present. This could catch config errors where a variant claims to use a different route than its provider defines. Currently cross-reference validation checks provider existence, model alias, and protocol surface, but not route consistency.
This is optional - the current validation is already solid.
Co-authored-by: openhands <openhands@all-hands.dev>
There was a problem hiding this comment.
Implementation is solid. Previous critical issues (hardcoded paths, unused LaunchCheck class) are resolved. Tests use portable Path(__file__) patterns. Demo scripts exist and work correctly.
Must Fix
🔴 PR description lacks Evidence section
Previous reviews requested this. The demo scripts exist and work correctly:
python scripts/demo_config_loading.py
python scripts/validation_demo.pyPlease add an Evidence section to the PR description with the output from these commands. This demonstrates the config loading works end-to-end beyond unit tests.
Example format:
Evidence
Config Loading:
$ python scripts/demo_config_loading.py
============================================================
COE-302: Config Loading Evidence Demo
============================================================
...
All configs loaded successfully!
============================================================
Validation Errors:
$ python scripts/validation_demo.py
============================================================
COE-302: Validation Error Evidence Demo
============================================================
...
All validation errors caught with precise field-level messages!
============================================================
Unresolved Suggestions
The following from previous reviews are optional improvements (not blocking):
🟡 Suggestion (config_loader.py:84): Consider adding validation that variant.provider_route matches provider.route_name when both are present. This could catch config errors where a variant claims to use a different route than its provider defines. Currently cross-reference validation checks provider existence and model alias, but not route consistency.
Co-authored-by: openhands <openhands@all-hands.dev>
There was a problem hiding this comment.
The implementation is solid. Previous critical issues (hardcoded paths, unused LaunchCheck class) are resolved. Tests use portable Path(__file__) patterns. Demo scripts exist and work correctly.
Must Fix
🔴 PR description lacks Evidence section
Previous reviews requested this. The demo scripts exist and work correctly:
python scripts/demo_config_loading.py
python scripts/validation_demo.pyPlease add an Evidence section to the PR description with the output from these commands showing the config loading working end-to-end. This is a requirement per the project's code review standards.
Resolved Issues
🟢 Tests now use portable paths via Path(__file__) - hardcoded absolute path issue is fixed
🟢 Unused LaunchCheck class removed - launch_checks as list[str] is appropriate
🟢 Demo scripts exist and demonstrate real code execution
Suggestions (Optional)
🟡 config_loader.py:84 - Consider adding validation that variant.provider_route matches provider.route_name when both are present. Currently cross-reference validation checks provider existence and model alias, but not route consistency. This could catch config errors where a variant claims to use a different route than its provider defines.
All 38 tests pass. Once the Evidence section is added, this is ready to merge.
There was a problem hiding this comment.
Implementation is solid. Previous critical issues (hardcoded paths, unused LaunchCheck class) are resolved. Tests use portable Path(__file__) patterns.
Must Fix
🔴 PR description lacks Evidence section
Previous reviews (4+ times) explicitly requested an Evidence section in the PR description. Creating EVIDENCE.md is helpful but does not satisfy the requirement. The PR description itself must include:
python scripts/demo_config_loading.py
python scripts/validation_demo.pyWith the output showing end-to-end execution.
Also, PR description says "38 tests passing" but EVIDENCE.md shows "58 passed" - please update.
Suggestion (optional)
The route consistency validation suggestion from previous reviews remains unaddressed but is not blocking.
There was a problem hiding this comment.
Implementation is solid. All critical issues from previous reviews are resolved:
- Tests use portable
Path(__file__)patterns - Unused
LaunchCheckclass removed - Evidence section now in PR description with actual command output
All acceptance criteria pass. Ready to merge.
…relation keys (#14) * COE-306: Build LiteLLM collection job for raw request records and correlation keys - Implement LiteLLMCollector with idempotent ingest and watermark tracking - Add CollectionDiagnostics for missing field reporting - Add CollectionJobService in benchmark_core/services.py - Preserve session correlation keys in metadata - Add comprehensive unit tests (29 tests, all passing) Co-authored-by: openhands <openhands@all-hands.dev> * Update workpad: mark all tasks complete, add validation evidence * Update workpad: document GitHub PR blocker * COE-306: Update workpad - PR creation blocked, ready for human action * COE-306: Update workpad - document active GitHub PR blocker * COE-306: Final workpad update - sync HEAD commit hash * COE-306: Update workpad for retry #2 - document PR creation blocker * COE-306: Final workpad - document complete blockers status * COE-306: Final workpad - correct HEAD commit hash * COE-306: Retry #3 - Update workpad with PR creation blocker status * COE-306: Retry #4 - Update workpad with retry status * COE-306: Final retry #4 workpad - confirmed PAT permission blocker, all fallbacks exhausted * COE-306: Add PR description for manual creation * COE-306: Final workpad - ready for manual PR creation * COE-306: Retry #5 - Document PR creation blocker status after LLM provider change * COE-306: Retry #6 - Updated workpad with retry #6 blocker status * COE-306: Retry #7 - Update workpad with retry #7 confirmation * COE-306: Final workpad - confirmed PAT blocker, ready for manual PR * COE-306: Session #8 - PR #14 created successfully, workpad updated * COE-306: Update environment stamp to c083393 * COE-306: Address PR feedback - fix watermark logic, rename field, add evidence - Fix watermark/start_time interaction: use max() instead of unconditional override - Rename requests_new to requests_normalized for clarity - Remove WORKPAD.md from repo (add to .gitignore) - Add runtime evidence via scripts/demo_collector.py - Add test for watermark/start_time interaction - Update PR_DESCRIPTION.md with Evidence section --------- Co-authored-by: openhands <openhands@all-hands.dev>
Summary
Implement typed config models with YAML loading, validation, and deterministic precedence rules.
Deliverables
Acceptance Criteria
Testing
All 38 tests passing:
Config Examples
Anthropic surface:
OpenAI surface:
Evidence
Config Loading Demo
Output:
Validation Error Demo
Output: