A local-first Python workspace for transforming heterogeneous ophthalmology datasets into a consistent, analysis-ready structure.
This repository exists to solve a recurring research problem: ophthalmology datasets are highly inconsistent across naming, labels, severity scales, modalities, and metadata quality.
The project standardizes those differences into a single canonical schema and a consistent rule system so data can be reliably compared, validated, and reused across studies. Key goals:
- Cross-dataset harmonization of retinal and ophthalmic imaging data.
- Standardized diagnosis normalization and severity mapping.
- Reproducible data processing with transparent, rule-based logic.
- Better downstream readiness for analytics, machine learning, and registry-style studies.
At a high level, the pipeline proceeds as follows:
- Discover or ingest input datasets under src/INPUT/.
- Load data through the loader layer in src/loaders/.
- Standardize structure using the canonical schema in src/schema.py.
- Apply harmonization and inference rules from src/rules.py.
- Validate outputs and quality flags in pipeline/testing workflows.
- Export harmonized artifacts to src/OUTPUT/ for analysis.
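The steps above can be sketched as a minimal load → standardize → validate flow. Everything here is illustrative: the field aliases, the `CANONICAL_FIELDS` tuple, and the function names are assumptions for the sketch, not the actual API in `src/loaders/` or `src/schema.py`.

```python
# Hypothetical canonical field names; the real schema lives in src/schema.py.
CANONICAL_FIELDS = ("patient_id", "modality", "diagnosis", "severity")

def load_records(raw_rows):
    """Map heterogeneous source rows onto the canonical field set."""
    # Example source-specific column aliases (made up for illustration).
    aliases = {"pid": "patient_id", "img_type": "modality", "dx": "diagnosis"}
    records = []
    for row in raw_rows:
        renamed = {aliases.get(k, k): v for k, v in row.items()}
        # Missing canonical fields default to None rather than failing.
        records.append({f: renamed.get(f) for f in CANONICAL_FIELDS})
    return records

def validate(records):
    """Keep only records that carry the required canonical fields."""
    return [r for r in records if r["patient_id"] and r["diagnosis"]]

raw = [
    {"pid": "p1", "img_type": "fundus", "dx": "DR", "severity": "mild"},
    {"pid": "", "dx": "AMD"},  # dropped: empty patient_id
]
clean = validate(load_records(raw))
```

The point of the sketch is the shape of the flow: every loader emits the same canonical dict, so validation and export never need source-specific logic.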
Current documentation and codebase design target the following outcomes:
- One shared schema vocabulary across source datasets.
- Consistent diagnosis category mapping and severity interpretation.
- Cleaner provenance and input tracking.
- Reduced ambiguity when extending pipelines or onboarding new datasets.
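Consistent severity interpretation can be illustrated as a two-step mapping: source-specific grades onto shared labels, then labels onto one ordinal scale. The grade tables below are invented examples; the project's actual mappings are documented in docs/DATA-PROCESSING/CODEBOOK.md.

```python
# One shared ordinal scale for severity across all sources.
SEVERITY_SCALE = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}

# Hypothetical per-source grade vocabularies mapped to the shared labels.
SOURCE_GRADES = {
    "dataset_a": {"R0": "none", "R1": "mild", "R2": "moderate", "R3": "severe"},
    "dataset_b": {"0": "none", "1": "mild", "2": "moderate", "3": "severe"},
}

def harmonize_severity(source, grade):
    """Translate a source-specific grade into (shared label, ordinal value)."""
    label = SOURCE_GRADES[source][str(grade)]
    return label, SEVERITY_SCALE[label]
```

With this shape, adding a new dataset means adding one entry to the grade table rather than touching downstream code.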
The workspace is designed to support:
- Multi-modality ophthalmology data harmonization.
- Clinical finding extraction and normalization.
- Confidence-aware inference for noisy labels.
- Dataset onboarding at scale using automated discovery utilities.
- Rule and schema extension without rewriting the full pipeline.
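Confidence-aware inference for noisy labels can be sketched as rules that each return a candidate value plus a confidence score, with the highest-confidence candidate winning. The rule names, vocabulary, and scores here are assumptions for illustration; the real rules live in `src/rules.py`.

```python
def rule_exact_match(label):
    """High confidence when the label is already a known diagnosis."""
    known = {"diabetic retinopathy", "glaucoma", "amd"}
    if label.lower() in known:
        return label.lower(), 1.0
    return None, 0.0

def rule_keyword(label):
    """Lower-confidence fallback based on a keyword heuristic."""
    if "retinopathy" in label.lower():
        return "diabetic retinopathy", 0.6
    return None, 0.0

def infer_diagnosis(label, rules=(rule_exact_match, rule_keyword)):
    """Run every rule and keep the (value, confidence) pair that scores highest."""
    candidates = [rule(label) for rule in rules]
    return max(candidates, key=lambda c: c[1])
```

Because the confidence travels with the inferred value, downstream validation can flag low-confidence harmonizations instead of silently accepting them.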
See detailed references:
- docs/DATA-PROCESSING/SCHEMA.md
- docs/DATA-PROCESSING/RULES.md
- docs/DATA-PROCESSING/CODEBOOK.md
This project structure helps surface discoveries such as:
- Which diagnosis labels collapse into equivalent clinical categories.
- Where datasets disagree semantically despite similar field names.
- Which imaging modalities and metadata patterns create frequent validation failures.
- How much harmonization confidence varies by source and data quality.
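The first of those discoveries, labels collapsing into equivalent categories, can be surfaced with a simple grouping pass over a synonym table. The synonym table below is a made-up example, not the project's codebook.

```python
from collections import defaultdict

# Hypothetical synonym table: raw labels -> canonical clinical category.
SYNONYMS = {
    "dr": "diabetic retinopathy",
    "npdr": "diabetic retinopathy",
    "diabetic retinopathy": "diabetic retinopathy",
    "amd": "age-related macular degeneration",
    "armd": "age-related macular degeneration",
}

def collapse_report(labels):
    """Report canonical categories that absorb more than one raw label."""
    groups = defaultdict(set)
    for label in labels:
        groups[SYNONYMS.get(label.lower(), label.lower())].add(label)
    return {cat: sorted(labs) for cat, labs in groups.items() if len(labs) > 1}
```

Running the report across all sources makes the collapsed categories, and any labels left unmapped, immediately visible.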
```
ophthalmology-dataset-harmonization/
├── README.md
├── docs/
│   ├── INDEX.md
│   ├── METHODOLOGY.md
│   ├── PROJECT_FILES.md
│   ├── inputs.md
│   ├── AUTOMATED_DISCOVERY_README.md
│   └── DATA-PROCESSING/
│       ├── SCHEMA.md
│       ├── RULES.md
│       └── CODEBOOK.md
├── src/
│   ├── schema.py
│   ├── rules.py
│   ├── loaders/
│   ├── pipeline/
│   ├── INPUT/
│   └── OUTPUT/
└── test/
```

- docs/INDEX.md: start here for role-based reading paths.
- docs/METHODOLOGY.md: conceptual and methodological framework.
- docs/PROJECT_FILES.md: authoritative map of maintained documentation.
- docs/inputs.md: how to structure and register incoming datasets.
- docs/AUTOMATED_DISCOVERY_README.md: automated dataset discovery workflow.
Run the test suites with:

```bash
python -m pytest test/test_robust_schema.py -v
python -m pytest test/test_expanded_rules.py -v
```

Documentation and structure have been refactored to remove merge conflicts and align the narrative with the actual repository purpose: ophthalmology dataset harmonization.