Using protein sequence models to compute Variant Effect Predictions (VEP) across biobank-scale populations.
This project aims to use protein language models (e.g. ESM2, ESM3) to score the effect of genetic variants on protein sequence, and to study how these variant effect predictions (VEPs) vary across population-scale haplotypes (e.g. 1000 Genomes, HGDP). The main goals are:
- Population-aware VEP: Compute VEPs not only for a single reference sequence, but for many haplotypes representing natural variation, so that predictions can be personalized to an individual’s genetic background.
- Joint effects and interactions: Quantify how wild-type (WT) population variants and clinical or disease-associated variants jointly influence VEP scores, and test for non-additive interactions between them. This helps interpret incomplete penetrance and context-dependent pathogenicity.
- Benchmarking: Compare protein-LM-based VEPs to clinical labels and to existing predictors using ProteinGym, ClinVar, and related resources.
The codebase provides pipelines to fetch haplotypes, run ESM-based scoring (e.g. masked-marginals, pseudo-ppl), merge and analyze VEP outputs, fit linear models for joint WT–clinical effects, and perform interaction testing. Results are intended to support both method development and applied analyses (e.g. penetrance, case studies).
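As a rough illustration of the masked-marginals strategy named above: the score of a substitution is commonly taken as the model's log-probability of the mutant residue minus that of the wild-type residue at the masked position. The sketch below uses a hand-written table of per-position probabilities in place of real ESM output, and the helper names (`parse_mutation`, `masked_marginal_score`) are hypothetical, not functions from this repo.

```python
import math

def parse_mutation(mutation):
    """Split a substitution such as 'A2T' into (wild-type, 1-based position, mutant)."""
    return mutation[0], int(mutation[1:-1]), mutation[-1]

def masked_marginal_score(log_probs, sequence, mutation):
    """Return log p(mutant) - log p(wild-type) at the mutated position.

    `log_probs[i]` is assumed to map each residue to the model's
    log-probability at position i when that position is masked.
    """
    wt, pos, mut = parse_mutation(mutation)
    if sequence[pos - 1] != wt:
        raise ValueError(f"{mutation}: sequence has {sequence[pos - 1]} at position {pos}")
    site = log_probs[pos - 1]
    return site[mut] - site[wt]

# Toy two-residue "protein" with made-up probabilities (not model output).
sequence = "MA"
toy_log_probs = [
    {"M": math.log(0.90), "A": math.log(0.05), "T": math.log(0.05)},
    {"M": math.log(0.30), "A": math.log(0.60), "T": math.log(0.10)},
]

score = masked_marginal_score(toy_log_probs, sequence, "A2T")
print(f"A2T score: {score:.3f}")  # negative: the mutant is less likely than wild-type
```

Scoring the same variant on a different haplotype only changes `sequence` and the log-probabilities the model returns for it, which is what makes the prediction background-dependent.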
- Environment setup
- Code organization
- Getting started
- Tutorials and notebooks
- Documentation
- Citation
- Manuscript
- License
Conda environment files live in conda/. Use the environment that matches what you want to run:
| Environment | File | Use case |
|---|---|---|
| esm2 | conda/esm2.yml | Main VEP workflow: ESM2 models, VEP pipeline, haplosaurus, ProteinGym, most analysis notebooks. Start here for the VEP tutorial. |
| esm3 | conda/esm3.yml | ESM3 models and related notebooks. |
| esmfold | conda/esmfold.yml | ESMFold structure prediction (separate Python/toolchain). |
Quick start:

    git clone https://github.com/bschilder/VEP_protein.git && cd VEP_protein
    conda env create -f conda/esm2.yml
    conda activate esm2

Then set DATA_DIR in src/config.py to the path where you want to store data (default: ~/projects/data/). Use the esm2 kernel when opening the VEP tutorial notebook.
Hardware: A GPU is recommended for running ESM model inference; larger models and batch jobs will be slow on CPU only.
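Before launching a long scoring run it can help to confirm which device will be used. The following is a minimal sketch that assumes PyTorch is the inference backend and falls back to CPU when it is not installed; `pick_device` is a hypothetical helper, not part of this repo.

```python
def pick_device():
    """Return 'cuda' when PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Running inference on: {device}")
```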
Data: External data (ProteinGym clinical sets, haplotype sequences from Haplosaurus/1000 Genomes, etc.) is downloaded on first use or by pipeline scripts into DATA_DIR. Ensure sufficient disk space for model weights and VEP outputs.
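A quick way to check free disk space before downloading is sketched below. The path mirrors the default mentioned above; the 50 GiB threshold is an arbitrary placeholder rather than a measured requirement, and `free_gib` and `has_room` are hypothetical helpers.

```python
from pathlib import Path
import shutil

# Default location quoted in this README; change it to match src/config.py.
DATA_DIR = Path("~/projects/data/").expanduser()

def free_gib(path):
    """Free space in GiB on the filesystem holding `path`.

    If `path` does not exist yet, walk up to the nearest existing parent.
    """
    probe = Path(path)
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1024**3

def has_room(path, required_gib):
    """True when at least `required_gib` GiB are free for `path`."""
    return free_gib(path) >= required_gib

REQUIRED_GIB = 50  # placeholder threshold, not a documented requirement
print(f"{free_gib(DATA_DIR):.1f} GiB free; enough: {has_room(DATA_DIR, REQUIRED_GIB)}")
```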
| Directory | Description |
|---|---|
| src/ | Python package used by notebooks and scripts. Key modules: |
| → vep_pipeline.py | Runs the VEP pipeline (ESM models, scoring strategies, ProteinGym/clinical variants). |
| → vep_analysis.py | Analysis, plotting, and aggregation of VEP results (distributions, interactions, figures). |
| → vep_metrics.py | VEP-related metrics and evaluation. |
| → haplosaurus.py | Haplotype and population sequence handling (e.g. 1000 Genomes, HGDP). |
| → proteingym.py | ProteinGym clinical mutation and benchmark data. |
| → config.py | Global config: DATA_DIR, PARAMS_VEP, PARAMS_HAPLOTYPES, palettes. |
| → analysis/ | Analysis helpers: attributions, distributions, matrices, VEP GMM. |
| → Align/ | Alignment utilities (ClustalOmega, Clustalw). |
| → benchmark/ | ClinVar and related benchmarking. |
| → colabfold/ | ColabFold and categorical Jacobians. |
| → ESM.py, ESM_predict.py, ESM3.py, ESMfold.py | ESM model loading and inference. |
| notebooks/ | Step-by-step tutorials and analyses; entry point for reproducing results. |
| config/ | ProHap and other pipeline configs (YAML). |
| docs/ | Method and metric write-ups (e.g. interactions). |
| scripts/ | Shell scripts for batch runs (e.g. vep_pipeline.sh, haplosaurus/HGDP). |
| tests/ | Pytest tests for haplotype ref, preprocessed index, ID mapping, mutation index, MSA query. |
| results/ | Outputs (e.g. plots, parquet). |
Data (VEP outputs, caches, etc.) is written under DATA_DIR as set in src/config.py; the repo itself stays code-only.
- Clone and enter the repo

      git clone https://github.com/bschilder/VEP_protein.git && cd VEP_protein

- Create and activate the main environment

      conda env create -f conda/esm2.yml
      conda activate esm2

- Set the data directory

  Edit `DATA_DIR` in `src/config.py` to the on-disk location where you want to store data.

- Run the VEP tutorial

  Open `notebooks/VEP.ipynb`, select the esm2 kernel, and run the cells. The notebook imports `src` (e.g. `vep_pipeline`, `vep_analysis`, `haplosaurus`, `proteingym`) and walks through loading proteins, clinical variants, and computing VEPs.
Notebooks are the main way to reproduce and explore the analyses.
| Notebook | Description |
|---|---|
| VEP.ipynb | Main tutorial: VEP with protein LMs — candidate proteins, ProteinGym clinical mutations, pipeline usage. |
| VEP_case_studies.ipynb | Case studies using the VEP pipeline. |
| VEP_embeddings.ipynb | Embeddings and VEP. |
| VEP_penetrance.ipynb | Penetrance-related VEP analysis. |
| variant_annotation.ipynb | Variant annotation workflow. |
| variant_attribution.ipynb | Attribution for variants. |
| ProHap.ipynb | ProHap haplotype workflow. |
| haplosaurus.ipynb | Haplosaurus and haplotype handling. |
| 1KG.ipynb, 1KG_wt.ipynb | 1000 Genomes–based analyses. |
| HGDP.ipynb | HGDP population analyses. |
| ProteinGym.ipynb | ProteinGym benchmarks. |
| Zenodo.ipynb | Upload/download manuscript data to Zenodo (personalizedVEP record). |
| ESM3.ipynb | ESM3 model usage. |
| Evo2_figures.ipynb | Evo2-related figures. |
| evolocity.ipynb | Evolocity analysis (use conda/evolocity.yml). |
| GVL.ipynb | GVL workflow (use conda/gvl.yml). |
| distributions.ipynb, embeddings.ipynb | Distribution and embedding analyses. |
| colabfold.ipynb, categorical_jacobians.ipynb | ColabFold and Jacobians. |
Other notebooks in notebooks/ (e.g. ensemblVEP, OpenTargets, patient_embeddings) follow the same pattern: they rely on src/ and optional conda envs as noted above.
- Interaction metrics: Definitions and usage of interaction metrics (e.g. `deviation_from_additive`, `delta_r2`, `epistasis_fstat`).
- Methods (interactions): Methodological details for interaction analyses.
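To give a feel for what a non-additivity measure captures, the sketch below computes the interaction coefficient of a saturated 2x2 linear model, score = b0 + b1*background + b2*clinical + b3*background*clinical. With one score per cell, b3 is the joint score minus what the two single-variant effects predict additively. This is a generic formulation with made-up numbers; the metrics documented in docs/ may be defined differently, and `interaction_term` is a hypothetical name.

```python
def interaction_term(s_ref, s_background, s_clinical, s_joint):
    """Interaction coefficient b3 of the saturated 2x2 model.

    s_ref:        score with neither variant (often 0 by construction)
    s_background: score with only the population (WT background) variant
    s_clinical:   score with only the clinical variant
    s_joint:      score with both variants present
    """
    return s_joint - s_background - s_clinical + s_ref

# Toy scores, invented for illustration only.
toy = {"ref": 0.0, "background": -0.4, "clinical": -2.1, "joint": -3.3}

b3 = interaction_term(toy["ref"], toy["background"], toy["clinical"], toy["joint"])
additive_expectation = toy["background"] + toy["clinical"] - toy["ref"]
print(f"additive expectation: {additive_expectation:.2f}, joint: {toy['joint']:.2f}, b3: {b3:.2f}")
```

A value of b3 near zero means the two variants combine additively on this background; a clearly negative value, as in the toy numbers, means the pair scores as more damaging than the sum of its parts.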
From the repo root with the appropriate environment activated (e.g. esm2):

    pytest tests/ -v

This runs the tests in tests/ (haplotype ref, preprocessed index, ID mapping, mutation index, MSA query).
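For readers adding their own checks, a test in the style pytest collects might look like the sketch below. The helper `apply_substitution` is hypothetical and defined inline so the example is self-contained; it is not one of the functions covered by the existing test suite.

```python
def apply_substitution(sequence, mutation):
    """Apply a substitution like 'K2R' (1-based position) to a protein sequence."""
    wt, pos, mut = mutation[0], int(mutation[1:-1]), mutation[-1]
    if not 1 <= pos <= len(sequence):
        raise IndexError(f"position {pos} is outside the sequence")
    if sequence[pos - 1] != wt:
        raise ValueError(f"expected {wt} at position {pos}, found {sequence[pos - 1]}")
    return sequence[:pos - 1] + mut + sequence[pos:]

def test_apply_substitution_changes_one_residue():
    assert apply_substitution("MKTAY", "K2R") == "MRTAY"

def test_apply_substitution_rejects_wrong_wild_type():
    # The stated wild-type residue must match the sequence, which guards
    # against scoring a variant on a haplotype that already differs there.
    try:
        apply_substitution("MKTAY", "A2R")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for mismatched wild-type")
```

Saved as a file named `test_*.py` under tests/, functions prefixed with `test_` are picked up by the `pytest tests/ -v` command above.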
If you use this code in your work, please cite the accompanying manuscript (see Manuscript). BibTeX and DOI can be added here once available.
The manuscript/ directory contains the accompanying manuscript sources, including manuscript/figures.ipynb (figure generation) and manuscript/fig/, manuscript/tbl/ for figures and tables.
This project is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). You may share and adapt the material with attribution; commercial use is not permitted. See LICENSE for the full terms.
