VEP protein

Using protein sequence models to compute Variant Effect Predictions (VEP) across biobank-scale populations.

Overview

This project aims to use protein language models (e.g. ESM2, ESM3) to score the effect of genetic variants on protein sequence, and to study how these variant effect predictions (VEPs) vary across population-scale haplotypes (e.g. 1000 Genomes, HGDP). The main goals are:

- Population-aware VEP: Compute VEPs not only for a single reference sequence, but for many haplotypes representing natural variation, so that predictions can be personalized to an individual’s genetic background.
- Joint effects and interactions: Quantify how wild-type (WT) population variants and clinical or disease-associated variants jointly influence VEP scores, and test for non-additive interactions between them. This helps interpret incomplete penetrance and context-dependent pathogenicity.
- Benchmarking: Compare protein-LM-based VEPs to clinical labels and to existing predictors using ProteinGym, ClinVar, and related resources.
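
The non-additivity idea in the goals above can be illustrated with a 2x2 design: score the reference, the haplotype background alone, the clinical variant alone, and both together. The following is a minimal sketch, assuming each score is a single number per sequence (for example a log-likelihood-style VEP). The function name and the example values are hypothetical and are not taken from the codebase, whose own metric definitions may differ.

```python
def deviation_from_additive_2x2(ref, background, clinical, combined):
    """Interaction term for a 2x2 design.

    Each argument is a scalar model score for one sequence:
    ref        - reference sequence
    background - reference + population (haplotype) variants
    clinical   - reference + clinical variant
    combined   - reference + both
    Returns combined minus the score expected if the two effects simply add.
    """
    effect_background = background - ref
    effect_clinical = clinical - ref
    expected = ref + effect_background + effect_clinical
    return combined - expected


# Made-up scores for illustration only.
additive = deviation_from_additive_2x2(ref=0.0, background=-1.0, clinical=-4.0, combined=-5.0)
buffered = deviation_from_additive_2x2(ref=0.0, background=-1.0, clinical=-4.0, combined=-2.0)
print(additive, buffered)
```

A value near zero suggests the background and the clinical variant act independently; a value far from zero suggests the background modifies the variant's predicted effect.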

The codebase provides pipelines to fetch haplotypes, run ESM-based scoring (e.g. masked marginals, pseudo-perplexity), merge and analyze VEP outputs, fit linear models for joint WT–clinical effects, and perform interaction testing. Results are intended to support both method development and applied analyses (e.g. penetrance, case studies).
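
To make the two scoring strategies named above concrete, here is a minimal sketch that works on precomputed log-probabilities rather than on a real model. It assumes `masked_log_probs[i]` holds the model's log-probabilities over the 20 standard amino acids at position `i`, obtained from a forward pass with position `i` masked. All function names and the toy data are illustrative assumptions, not the API of `src/`.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def log_softmax(logits):
    """Convert one row of logits to log-probabilities."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def masked_marginal_score(masked_log_probs, sequence, mutations):
    """Sum of log p(mut) - log p(wt) over the mutated positions.

    mutations is a list of (wt, zero_based_pos, mut) tuples.
    """
    score = 0.0
    for wt, pos, mut in mutations:
        if sequence[pos] != wt:
            raise ValueError(f"expected {wt} at {pos}, found {sequence[pos]}")
        row = masked_log_probs[pos]
        score += row[AA_INDEX[mut]] - row[AA_INDEX[wt]]
    return score


def pseudo_perplexity(masked_log_probs, sequence):
    """exp of the negative mean masked log-probability of the observed residues."""
    total = sum(masked_log_probs[i][AA_INDEX[aa]] for i, aa in enumerate(sequence))
    return math.exp(-total / len(sequence))


# Toy stand-in for model output: uniform everywhere except position 1,
# where the observed lysine (K) gets a logit of 2.0.
sequence = "MKT"
logits = [[0.0] * 20 for _ in sequence]
logits[1][AA_INDEX["K"]] = 2.0
masked_log_probs = [log_softmax(row) for row in logits]

score = masked_marginal_score(masked_log_probs, sequence, [("K", 1, "A")])
print(round(score, 6), round(pseudo_perplexity(masked_log_probs, sequence), 3))
```

In this toy case the K-to-A substitution scores negative because the stand-in model prefers lysine at that position. With a real model, the log-probabilities would come from one masked forward pass per position.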

Environment setup

Conda environment files live in conda/. Use the environment that matches what you want to run:

| Environment | File | Use case |
| --- | --- | --- |
| esm2 | `conda/esm2.yml` | Main VEP workflow: ESM2 models, VEP pipeline, haplosaurus, ProteinGym, most analysis notebooks. Start here for the VEP tutorial. |
| esm3 | `conda/esm3.yml` | ESM3 models and related notebooks. |
| esmfold | `conda/esmfold.yml` | ESMFold structure prediction (separate Python/toolchain). |

Quick start:

git clone https://github.com/bschilder/VEP_protein.git && cd VEP_protein
conda env create -f conda/esm2.yml
conda activate esm2

Then set DATA_DIR in src/config.py to the path where you want to store data (default: ~/projects/data/). Use the esm2 kernel when opening the VEP tutorial notebook.

Hardware: A GPU is recommended for running ESM model inference; larger models and batch jobs will be slow on CPU only.
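
A quick way to check whether a GPU is visible before launching inference is sketched below. It assumes the environment ships PyTorch and falls back to CPU when it does not; `pick_device` is a hypothetical helper, not a function from this repository.

```python
def pick_device():
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running inference on: {device}")
```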

Data: External data (ProteinGym clinical sets, haplotype sequences from Haplosaurus/1000 Genomes, etc.) is downloaded on first use or by pipeline scripts into DATA_DIR. Ensure sufficient disk space for model weights and VEP outputs.
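
Free space at the data location can be checked before the first download. The snippet below is a small sketch using only the standard library; the `DATA_DIR` value is the default quoted above, and `free_gib` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds `path`."""
    path = Path(path).expanduser()
    # Walk up to the nearest existing parent so the check also works
    # before the data directory has been created.
    while not path.exists():
        path = path.parent
    return shutil.disk_usage(path).free / 2**30


DATA_DIR = "~/projects/data/"  # default location; adjust to match src/config.py
print(f"{free_gib(DATA_DIR):.1f} GiB free")
```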

Code organization

| Directory | Description |
| --- | --- |
| `src/` | Python package used by notebooks and scripts. Key modules: |
| → `vep_pipeline.py` | Runs the VEP pipeline (ESM models, scoring strategies, ProteinGym/clinical variants). |
| → `vep_analysis.py` | Analysis, plotting, and aggregation of VEP results (distributions, interactions, figures). |
| → `vep_metrics.py` | VEP-related metrics and evaluation. |
| → `haplosaurus.py` | Haplotype and population sequence handling (e.g. 1000 Genomes, HGDP). |
| → `proteingym.py` | ProteinGym clinical mutation and benchmark data. |
| → `config.py` | Global config: `DATA_DIR`, `PARAMS_VEP`, `PARAMS_HAPLOTYPES`, palettes. |
| → `analysis/` | Analysis helpers: attributions, distributions, matrices, VEP GMM. |
| → `Align/` | Alignment utilities (ClustalOmega, Clustalw). |
| → `benchmark/` | ClinVar and related benchmarking. |
| → `colabfold/` | ColabFold and categorical Jacobians. |
| → `ESM.py`, `ESM_predict.py`, `ESM3.py`, `ESMfold.py` | ESM model loading and inference. |
| `notebooks/` | Step-by-step tutorials and analyses; entry point for reproducing results. |
| `config/` | ProHap and other pipeline configs (YAML). |
| `docs/` | Method and metric write-ups (e.g. interactions). |
| `scripts/` | Shell scripts for batch runs (e.g. `vep_pipeline.sh`, haplosaurus/HGDP). |
| `tests/` | Pytest tests for haplotype ref, preprocessed index, ID mapping, mutation index, MSA query. |
| `results/` | Outputs (e.g. plots, parquet). |

Data (VEP outputs, caches, etc.) is written under DATA_DIR as set in src/config.py; the repo itself stays code-only.

Getting started

1. Clone and enter the repo

git clone https://github.com/bschilder/VEP_protein.git && cd VEP_protein

2. Create and activate the main environment

conda env create -f conda/esm2.yml
conda activate esm2

3. Set the data directory

Edit DATA_DIR in src/config.py to the on-disk location where you want to store data.

4. Run the VEP tutorial

Open notebooks/VEP.ipynb, select the esm2 kernel, and run the cells. The notebook imports src (e.g. vep_pipeline, vep_analysis, haplosaurus, proteingym) and walks through loading proteins, clinical variants, and computing VEPs.
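
Before any scoring, a clinical variant has to be placed on a given sequence background. The sketch below shows one way to do that, assuming variants are written in the common wild-type/position/mutant notation with 1-based positions (e.g. `A2V`) and that multi-mutants are joined with colons. `apply_mutations` and the toy sequence are hypothetical and do not reflect how the tutorial notebook or `src/` implements this step.

```python
import re

MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutations(sequence, mutant):
    """Apply mutations such as "A2V" or "A2V:K3E" (1-based positions) to a sequence."""
    residues = list(sequence)
    for token in mutant.split(":"):
        match = MUTATION_RE.match(token)
        if match is None:
            raise ValueError(f"unrecognized mutation: {token!r}")
        wt, pos, mut = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= pos <= len(residues):
            raise ValueError(f"{token}: position outside sequence of length {len(residues)}")
        if residues[pos - 1] != wt:
            # Guards against applying a variant to the wrong isoform or background.
            raise ValueError(f"{token}: expected {wt}, found {residues[pos - 1]}")
        residues[pos - 1] = mut
    return "".join(residues)


reference = "MAKT"                              # toy sequence
haplotype = apply_mutations(reference, "A2V")   # population variant
patient = apply_mutations(haplotype, "K3E")     # clinical variant on that background
print(haplotype, patient)
```

Checking the wild-type residue before substituting matters in a population-aware setting, because a haplotype may already differ from the reference at the position a clinical variant refers to.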

Tutorials and notebooks

Notebooks are the main way to reproduce and explore the analyses.

| Notebook | Description |
| --- | --- |
| VEP.ipynb | Main tutorial: VEP with protein LMs (candidate proteins, ProteinGym clinical mutations, pipeline usage). |
| VEP_case_studies.ipynb | Case studies using the VEP pipeline. |
| VEP_embeddings.ipynb | Embeddings and VEP. |
| VEP_penetrance.ipynb | Penetrance-related VEP analysis. |
| variant_annotation.ipynb | Variant annotation workflow. |
| variant_attribution.ipynb | Attribution for variants. |
| ProHap.ipynb | ProHap haplotype workflow. |
| haplosaurus.ipynb | Haplosaurus and haplotype handling. |
| 1KG.ipynb, 1KG_wt.ipynb | 1000 Genomes–based analyses. |
| HGDP.ipynb | HGDP population analyses. |
| ProteinGym.ipynb | ProteinGym benchmarks. |
| Zenodo.ipynb | Upload/download manuscript data to Zenodo (personalizedVEP record). |
| ESM3.ipynb | ESM3 model usage. |
| Evo2_figures.ipynb | Evo2-related figures. |
| evolocity.ipynb | Evolocity analysis (use `conda/evolocity.yml`). |
| GVL.ipynb | GVL workflow (use `conda/gvl.yml`). |
| distributions.ipynb, embeddings.ipynb | Distribution and embedding analyses. |
| colabfold.ipynb, categorical_jacobians.ipynb | ColabFold and Jacobians. |

Other notebooks in notebooks/ (e.g. ensemblVEP, OpenTargets, patient_embeddings) follow the same pattern: they rely on src/ and optional conda envs as noted above.

Documentation

- Interaction metrics: Definitions and usage of interaction metrics (e.g. deviation_from_additive, delta_r2, epistasis_fstat).
- Methods (interactions): Methodological details for interaction analyses.
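
As a rough illustration of what regression-based interaction metrics can look like, the sketch below compares an additive linear model with one that adds a background-by-clinical interaction term, then reports the gain in R² and the nested-model F statistic. This is a generic nested-model comparison on simulated data; it is an assumption about the general approach and should not be read as the definition of `delta_r2` or `epistasis_fstat` used in the docs.

```python
import numpy as np


def fit_ols(X, y):
    """Return (R^2, residual sum of squares) of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    return 1.0 - rss / tss, rss


def interaction_test(background, clinical, score):
    """Gain in R^2 and F statistic from adding a background x clinical term."""
    background = np.asarray(background, dtype=float)
    clinical = np.asarray(clinical, dtype=float)
    score = np.asarray(score, dtype=float)
    X_add = np.column_stack([np.ones_like(score), background, clinical])
    X_int = np.column_stack([X_add, background * clinical])
    r2_add, rss_add = fit_ols(X_add, score)
    r2_int, rss_int = fit_ols(X_int, score)
    df_resid = len(score) - X_int.shape[1]
    # One extra parameter in the larger model, so the numerator has 1 degree of freedom.
    f_stat = (rss_add - rss_int) / (rss_int / df_resid)
    return r2_int - r2_add, f_stat


# Simulated data: presence/absence of a background variant and a clinical variant,
# with a built-in interaction of +3 on top of additive effects of -1 and -4.
rng = np.random.default_rng(0)
background = rng.integers(0, 2, size=200)
clinical = rng.integers(0, 2, size=200)
score = -1.0 * background - 4.0 * clinical + 3.0 * background * clinical
score = score + rng.normal(0.0, 0.5, size=200)

gain_r2, f_stat = interaction_test(background, clinical, score)
print(f"R^2 gain: {gain_r2:.3f}, F: {f_stat:.1f}")
```

A large F statistic relative to an F(1, n - 4) distribution indicates that the additive model is missing structure that the interaction term captures.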

Running tests

From the repo root with the appropriate environment activated (e.g. esm2):

pytest tests/ -v

This runs tests in tests/ (haplotype ref, preprocessed index, ID mapping, mutation index, MSA query).

Citation

If you use this code in your work, please cite the accompanying manuscript (see Manuscript). BibTeX and DOI can be added here once available.

Manuscript

The manuscript/ directory contains the accompanying manuscript sources, including manuscript/figures.ipynb (figure generation) and manuscript/fig/, manuscript/tbl/ for figures and tables.

License

This project is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). You may share and adapt the material with attribution; commercial use is not permitted. See LICENSE for the full terms.
