A local-first Python workspace for transforming heterogeneous ophthalmology datasets into a consistent, analysis-ready structure.
This repository exists to solve a recurring research problem: ophthalmology datasets are highly inconsistent across naming, labels, severity scales, modalities, and metadata quality.
The project standardizes those differences into a single canonical schema and a consistent rule system so data can be reliably compared, validated, and reused across studies. Key goals:
- Cross-dataset harmonization of retinal and ophthalmic imaging data.
- Standardized diagnosis normalization and severity mapping.
- Reproducible data processing with transparent, rule-based logic.
- Better downstream readiness for analytics, machine learning, and registry-style studies.
At a high level, the pipeline proceeds as follows:
- Discover or ingest input datasets under src/INPUT/.
- Load data through the loader layer in src/loaders/.
- Standardize structure using the canonical schema in src/schema.py.
- Apply harmonization and inference rules from src/rules.py.
- Validate outputs and quality flags in pipeline/testing workflows.
- Export harmonized artifacts to src/OUTPUT/ for analysis.
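The steps above can be sketched as a minimal load → standardize → validate flow. Everything here is illustrative: the field aliases, the `CANONICAL_FIELDS` tuple, and the function names are assumptions for the sketch, not the actual API in `src/loaders/` or `src/schema.py`.

```python
# Hypothetical canonical field names; the real schema lives in src/schema.py.
CANONICAL_FIELDS = ("patient_id", "modality", "diagnosis", "severity")

def load_records(raw_rows):
    """Map heterogeneous source rows onto the canonical field set."""
    # Example source-specific column aliases (made up for illustration).
    aliases = {"pid": "patient_id", "img_type": "modality", "dx": "diagnosis"}
    records = []
    for row in raw_rows:
        renamed = {aliases.get(k, k): v for k, v in row.items()}
        # Missing canonical fields default to None rather than failing.
        records.append({f: renamed.get(f) for f in CANONICAL_FIELDS})
    return records

def validate(records):
    """Keep only records that carry the required canonical fields."""
    return [r for r in records if r["patient_id"] and r["diagnosis"]]

raw = [
    {"pid": "p1", "img_type": "fundus", "dx": "DR", "severity": "mild"},
    {"pid": "", "dx": "AMD"},  # dropped: empty patient_id
]
clean = validate(load_records(raw))
```

The point of the sketch is the shape of the flow: every loader emits the same canonical dict, so validation and export never need source-specific logic.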
Current documentation and codebase design target the following outcomes:
- One shared schema vocabulary across source datasets.
- Consistent diagnosis category mapping and severity interpretation.
- Cleaner provenance and input tracking.
- Reduced ambiguity when extending pipelines or onboarding new datasets.
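Consistent severity interpretation can be illustrated as a two-step mapping: source-specific grades onto shared labels, then labels onto one ordinal scale. The grade tables below are invented examples; the project's actual mappings are documented in docs/DATA-PROCESSING/CODEBOOK.md.

```python
# One shared ordinal scale for severity across all sources.
SEVERITY_SCALE = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}

# Hypothetical per-source grade vocabularies mapped to the shared labels.
SOURCE_GRADES = {
    "dataset_a": {"R0": "none", "R1": "mild", "R2": "moderate", "R3": "severe"},
    "dataset_b": {"0": "none", "1": "mild", "2": "moderate", "3": "severe"},
}

def harmonize_severity(source, grade):
    """Translate a source-specific grade into (shared label, ordinal value)."""
    label = SOURCE_GRADES[source][str(grade)]
    return label, SEVERITY_SCALE[label]
```

With this shape, adding a new dataset means adding one entry to the grade table rather than touching downstream code.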
The workspace is designed to support:
- Multi-modality ophthalmology data harmonization.
- Clinical finding extraction and normalization.
- Confidence-aware inference for noisy labels.
- Dataset onboarding at scale using automated discovery utilities.
- Rule and schema extension without rewriting the full pipeline.
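Confidence-aware inference for noisy labels can be sketched as rules that each return a candidate value plus a confidence score, with the highest-confidence candidate winning. The rule names, vocabulary, and scores here are assumptions for illustration; the real rules live in `src/rules.py`.

```python
def rule_exact_match(label):
    """High confidence when the label is already a known diagnosis."""
    known = {"diabetic retinopathy", "glaucoma", "amd"}
    if label.lower() in known:
        return label.lower(), 1.0
    return None, 0.0

def rule_keyword(label):
    """Lower-confidence fallback based on a keyword heuristic."""
    if "retinopathy" in label.lower():
        return "diabetic retinopathy", 0.6
    return None, 0.0

def infer_diagnosis(label, rules=(rule_exact_match, rule_keyword)):
    """Run every rule and keep the (value, confidence) pair that scores highest."""
    candidates = [rule(label) for rule in rules]
    return max(candidates, key=lambda c: c[1])
```

Because the confidence travels with the inferred value, downstream validation can flag low-confidence harmonizations instead of silently accepting them.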
See detailed references:
- docs/DATA-PROCESSING/SCHEMA.md
- docs/DATA-PROCESSING/RULES.md
- docs/DATA-PROCESSING/CODEBOOK.md
This project structure helps surface discoveries such as:
- Which diagnosis labels collapse into equivalent clinical categories.
- Where datasets disagree semantically despite similar field names.
- Which imaging modalities and metadata patterns create frequent validation failures.
- How much harmonization confidence varies by source and data quality.
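The first of those discoveries, labels collapsing into equivalent categories, can be surfaced with a simple grouping pass over a synonym table. The synonym table below is a made-up example, not the project's codebook.

```python
from collections import defaultdict

# Hypothetical synonym table: raw labels -> canonical clinical category.
SYNONYMS = {
    "dr": "diabetic retinopathy",
    "npdr": "diabetic retinopathy",
    "diabetic retinopathy": "diabetic retinopathy",
    "amd": "age-related macular degeneration",
    "armd": "age-related macular degeneration",
}

def collapse_report(labels):
    """Report canonical categories that absorb more than one raw label."""
    groups = defaultdict(set)
    for label in labels:
        groups[SYNONYMS.get(label.lower(), label.lower())].add(label)
    return {cat: sorted(labs) for cat, labs in groups.items() if len(labs) > 1}
```

Running the report across all sources makes the collapsed categories, and any labels left unmapped, immediately visible.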
```
ophthalmology-dataset-harmonization/
├── README.md
├── docs/
│   ├── INDEX.md
│   ├── METHODOLOGY.md
│   ├── PROJECT_FILES.md
│   ├── inputs.md
│   ├── AUTOMATED_DISCOVERY_README.md
│   └── DATA-PROCESSING/
│       ├── SCHEMA.md
│       ├── RULES.md
│       └── CODEBOOK.md
├── src/
│   ├── schema.py
│   ├── rules.py
│   ├── loaders/
│   ├── pipeline/
│   ├── INPUT/
│   └── OUTPUT/
└── test/
```

- docs/INDEX.md: start here for role-based reading paths.
- docs/METHODOLOGY.md: conceptual and methodological framework.
- docs/PROJECT_FILES.md: authoritative map of maintained documentation.
- docs/inputs.md: how to structure and register incoming datasets.
- docs/AUTOMATED_DISCOVERY_README.md: automated dataset discovery workflow.
Run the test suites with:

```bash
python -m pytest test/test_robust_schema.py -v
python -m pytest test/test_expanded_rules.py -v
```

Documentation and structure have been refactored to remove merge conflicts and align the narrative with the actual repository purpose: ophthalmology dataset harmonization.