TransMEP is a tool that uses transfer learning embeddings from protein language models to train variant prediction models on existing mutagenesis data. It focuses on speed and ease of use: you provide your dataset and obtain a prediction model, accompanied by detailed reports on performance, hyperparameter optimization, training sample importance, and even an attribution to individual mutations.
@article{Hoffbauer2024,
title = {TransMEP: Transfer learning on large protein language models to predict mutation effects of proteins from a small known dataset},
url = {http://dx.doi.org/10.1101/2024.01.12.575432},
DOI = {10.1101/2024.01.12.575432},
publisher = {Cold Spring Harbor Laboratory},
author = {Hoffbauer, Tilman and Strodel, Birgit},
year = {2024},
month = jan
}
Tip: Want to skip the hassle of installation, or do not have an NVIDIA GPU? Just use the Google Colab notebook, which gives you limited access to free GPUs.
TransMEP's main dependency is PyTorch, which you should install first (Guide). While TransMEP does not require an NVIDIA GPU, it is significantly faster with one. Usually, setting up PyTorch with GPU acceleration via CUDA is worth the hassle. After that, just install TransMEP on top:
pip install transmep
For a wet-lab study, we recommend the following workflow:
- Sample an initial set of mutants that seed future optimization. For example, one could select the wild type and 9 random mutants with a single mutation. Evaluate these mutants in the lab and collect their target values.
- Train an initial model.
- View the available reports. Note that computing R² values is not applicable in this setting.
- Determine the next variant via the UCB criterion.
- Test it in the lab, measure its target value and go back to step 2 until a sufficiently high target value is achieved.
Each step corresponds to one of the commands explained below.
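The first step of the workflow above can be sketched in a few lines of Python. This is only an illustration: the helper names `random_single_mutants` and `initial_variants_csv`, the toy wild type sequence, and the assumption that positions in the mutation strings are 1-based are not taken from TransMEP itself.

```python
import csv
import io
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues


def random_single_mutants(wildtype, n, rng):
    """Return n distinct single mutations such as 'K2G' (1-based positions are an assumption)."""
    mutants = set()
    while len(mutants) < n:
        pos = rng.randrange(len(wildtype))
        old = wildtype[pos]
        new = rng.choice([aa for aa in AMINO_ACIDS if aa != old])
        mutants.add(f"{old}{pos + 1}{new}")
    return sorted(mutants)


def initial_variants_csv(wildtype, n_mutants=9, seed=0):
    """Build CSV text with the columns 'mutations' and 'y' for the initial lab round."""
    rng = random.Random(seed)
    rows = [""] + random_single_mutants(wildtype, n_mutants, rng)  # "" is the wild type
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["mutations", "y"])
    for mutations in rows:
        writer.writerow([mutations, ""])  # y is filled in after the lab measurement
    return buffer.getvalue()


# Hypothetical toy sequence, only for demonstration.
print(initial_variants_csv("MKTAYIAKQR"))
```

Once the `y` column has been filled in with measured target values, a file like this has the shape that `transmep train` expects for its variants.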
If you plan on using TransMEP, please get in touch with us.
transmep train
You will be prompted for all relevant parameters if you do not specify them on the command line (for help see transmep --help and transmep train --help).
- model-path: Your model will be saved as a Torch state dict to this location. All paths can be either local or remote; for more details, see fsspec.
- wildtype-path: Path to the wildtype sequence. Here, TransMEP expects a text file with the sequence in the amino acid single letter code, including X for rare amino acids.
- variants-path: Path to the known variants that should be used for training (have a look at transmep split for splitting datasets into training and test subsets). Here, TransMEP expects a CSV with at least two columns, named mutations and y. While y is just the target value, mutations should contain the list of mutations of each mutant in the format A123B+C234D (the wildtype is an empty string).
- alpha-min, alpha-max: These parameters describe the range of $\alpha$ values that are tried during hyperparameter optimization. The $\alpha$ value controls the regularization of the model: the higher the value, the worse the fit to the training data, but the less likely overfitting becomes. Note that the training process can become numerically unstable for very low $\alpha$ values.
- gamma-min, gamma-max: These parameters describe the range of $\gamma$ values that are tried. The $\gamma$ value is a scale parameter, i.e. it controls how correlated two samples are if they are close in the embedding space.
- alpha-steps, gamma-steps: The resolution of the grid search. Note that the runtime grows linearly with the product of alpha-steps and gamma-steps.
- validation-iterations: Number of iterations for repeated holdout during validation. Higher values give estimates for the generalization error with lower variance, but increase runtime.
- batch-size: Number of validation iterations to process in one batch. Lower this value if you get CUDA out of memory errors, and increase it if GPU utilization is low.
- holdout-fraction: Fraction of samples for holdout during validation. Higher values also decrease the variance of the generalization error estimate, but they increase the bias of the estimate. Usually, you do not want to change this value.
- grid-search-output: Path to save the output of the grid search report to. If you pass an empty string, nothing will be saved. This file can be used for the reports later on.
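To get a feeling for how the grid and validation parameters interact, here is a minimal Python sketch. It builds the two grids from the default ranges shown in the example session and draws repeated holdout splits. The logarithmic spacing and the helper names `log_grid` and `holdout_splits` are assumptions for illustration; TransMEP's internal implementation may differ.

```python
import math
import random


def log_grid(low, high, steps):
    """Return `steps` values spaced evenly on a log10 scale from low to high."""
    if steps == 1:
        return [low]
    lo, hi = math.log10(low), math.log10(high)
    return [10 ** (lo + i * (hi - lo) / (steps - 1)) for i in range(steps)]


def holdout_splits(n_samples, holdout_fraction, iterations, seed=0):
    """Yield (train_indices, holdout_indices) pairs for repeated holdout."""
    rng = random.Random(seed)
    n_holdout = max(1, round(n_samples * holdout_fraction))
    for _ in range(iterations):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_holdout:], indices[:n_holdout]


alphas = log_grid(1e-4, 1e5, 50)  # alpha-min, alpha-max, alpha-steps
gammas = log_grid(1e-3, 1e6, 50)  # gamma-min, gamma-max, gamma-steps
print(len(alphas) * len(gammas))  # number of (alpha, gamma) pairs to evaluate
```

With 50 steps each, the grid contains 2500 hyperparameter pairs, and every pair is evaluated on all validation iterations. This is why halving either step count roughly halves the runtime.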
transmep predict
This command can be used for predicting the target value for new variants.
The variants file is again expected to be a CSV, but this time only the mutations column is required.
The output contains the columns prediction (estimate for the target value) and prediction_std (estimate for the standard deviation of the prediction).
The latter is calculated under the assumption that the RBF kernel used in TransMEP is a correct fit for the fitness landscape, so this value should be handled with care and only used for comparing predictions.
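For intuition about what the RBF kernel assumption means, the following Python sketch evaluates a generic RBF kernel on toy two-dimensional vectors. The parameterization $\exp(-\gamma \lVert x - y \rVert^2)$, the function name `rbf_kernel`, and the example vectors are assumptions for illustration. TransMEP applies its kernel to protein language model embeddings and may scale it differently.

```python
import math


def rbf_kernel(x, y, gamma):
    """Generic RBF kernel exp(-gamma * ||x - y||^2); this parameterization is an assumption."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Toy vectors standing in for embeddings.
reference = [0.0, 0.0]
near = [0.1, 0.0]
far = [2.0, 0.0]

for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(reference, near, gamma), rbf_kernel(reference, far, gamma))
```

The kernel value is 1 for identical vectors and decays towards 0 with distance. A model built on such a kernel is only confident near training samples, which is why the reported standard deviation depends so strongly on whether this similarity measure matches the real fitness landscape.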
transmep ucb
With this command, you can search for a variant with a high UCB value that should be evaluated next. It runs a genetic algorithm using the following parameters:
- model-path: Path to the model from transmep train.
- wildtype-path: Path to the wild type sequence.
- batch-size: Size of the batches used for inference on the model to calculate the missing UCB values.
- kappa: Kappa value in the UCB criterion.
- population-size: Size of the population in the genetic algorithm.
- restarts: How many repetitions of the genetic algorithm should be performed.
- num_mutations: Maximum number of mutations per mutant.
- sites: Comma separated list of positions that should be mutated. Set to all to allow mutations on all positions.
- mutation-probability: Hyperparameter of the genetic algorithm. If not set, this is set to 1 / number of sites.
- crossover-probability: Hyperparameter of the genetic algorithm. If not set, this defaults to 0.5.
- min-diversity: Minimum fraction of distinct mutants in the population before stopping optimization. This defaults to 0.1.
- max-generations: Maximum number of generations per genetic algorithm repetition. This defaults to 100.
Depending on your available computing time, we recommend trying some variations of the mutation-probability, crossover-probability, min-diversity and max-generations parameters to find the candidate with the highest UCB value.
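To illustrate the role of kappa, here is a minimal Python sketch of the usual upper confidence bound, which adds kappa times the predicted standard deviation to the prediction. The function name `ucb` and the candidate values are made up for illustration. The sketch assumes TransMEP uses this standard form and leaves out the genetic algorithm entirely.

```python
def ucb(prediction, prediction_std, kappa):
    """Standard upper confidence bound: mean plus kappa standard deviations."""
    return prediction + kappa * prediction_std


# Made-up (prediction, prediction_std) pairs, only for demonstration.
candidates = {
    "A12G": (0.90, 0.05),
    "A12G+L30K": (0.70, 0.20),
    "L30K": (0.80, 0.02),
}

for kappa in (0.0, 3.0):
    ranked = sorted(candidates, key=lambda m: ucb(*candidates[m], kappa), reverse=True)
    print(kappa, ranked)
```

With kappa set to 0, the ranking follows the predictions alone. With kappa set to 3, the uncertain double mutant moves to the top, which is the exploration effect of higher kappa values.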
transmep reports
For variants-path, one should now pass a set of new variants, e.g. the test dataset.
The supported reports are:
- r2: Estimate the coefficient of determination of this model, i.e. the fraction of variance in the dataset explained by the model. This also reports a confidence interval which is based on bootstrapping and the standard deviation estimated by the model.
- mutation_attribution: This report estimates the effect of every single mutation of a mutant on the total value. The y-axis contains the mutations while the x-axis contains the variants. Each column sums up to the predicted target value of the variant.
- grid_search: Here, one can observe the estimated generalization error during grid search for various hyperparameter valuations ($\alpha$ and $\gamma$). If the optimum is close to the border, rerun the training process with larger parameter ranges. The colors are on a logarithmic scale.
- training_samples_importance: This report calculates the importance of each training sample for the prediction of a variant. Each row sums up to 1.
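As a rough illustration of what the r2 report measures, the following Python sketch computes the coefficient of determination and a simple percentile bootstrap interval. The helper names and the toy values are hypothetical. Unlike the report described above, this sketch only resamples the test set and does not use the standard deviation estimated by the model.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def bootstrap_r2_interval(y_true, y_pred, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for R^2, resampling test samples with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    while len(scores) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_true = [y_true[i] for i in idx]
        if len(set(sample_true)) < 2:
            continue  # R^2 is undefined when all resampled targets are identical
        scores.append(r2_score(sample_true, [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((1 - level) / 2 * (n_resamples - 1))]
    upper = scores[int((1 + level) / 2 * (n_resamples - 1))]
    return lower, upper


# Made-up measured and predicted target values, only for demonstration.
y_true = [0.1, 0.4, 0.35, 0.8, 0.9, 1.2]
y_pred = [0.2, 0.3, 0.4, 0.7, 1.0, 1.1]
print(r2_score(y_true, y_pred), bootstrap_r2_interval(y_true, y_pred))
```

With only a handful of test variants, the interval tends to be wide, which is one reason why R² values are not meaningful for the very small datasets of the wet-lab workflow.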
This example is based on a dataset from [Wu et al. 2019], which was also used in my bachelor thesis as C75.
Let's first train a model on the provided example data.
$ transmep train
Path to save the model to (model-path): https://mutation-prediction.thoffbauer.de/transmep/c-wt.txt
Path to the wild type sequence file (wildtype-path): ^CAborted!
(transmep-py3.10) [t@tpc transmep-publication]$ transmep train
Path to save the model to (model-path): mymodel.pt
Path to the wild type sequence file (wildtype-path): https://mutation-prediction.thoffbauer.de/transmep/c-wt.txt
Path to the variants for training (variants-path): https://mutation-prediction.thoffbauer.de/transmep/c-variants-train.csv
Minimum alpha hyperparameter (alpha-min) [0.0001]:
Maximum alpha hyperparameter (alpha-max) [100000.0]:
Number of alpha steps during grid search (alpha-steps) [50]:
Minimum gamma hyperparameter (gamma-min) [0.001]:
Maximum gamma hyperparameter (gamma-max) [1000000.0]:
Number of gamma steps during grid search (gamma-steps) [50]:
Number of iterations for repeated holdout during validation (validation-iterations) [1000]:
Number of validation iterations to process in one batch (batch-size) [100]:
Block size to process in one batch for distance matrix calculation (block-size) [100]:
Fraction of samples to use for validation during repeated holdout (holdout-fraction) [0.1]:
Path to save grid search output to, pass empty string for no output: mymodel-grid.npz
Welcome to TransMEP!
Loading dataset
Loading protein language model (esm2_t30_150M_UR50D)
Embedding training variants
Embedding: 100%|██████████| 424/424 [00:10<00:00, 40.72it/s]
Performing grid search for hyper parameters
HPO: 100%|██████████| 10/10 [02:21<00:00, 14.12s/it]
Grid search wall time: 141.4151s
Fitting final model
Final model trained & saved!
We will also check the grid search report.
$ transmep report
? Which reports do you want to create? [Grid search report (grid_search)]
? What output formats do you want? Output formats not supported by a report will be skipped quietly. done (2 selections)
? Path to the grid search output mymodel-grid.npz
? Report file prefix mymodel
Creating report grid_search
All reports created!
You can find the report here. As we are satisfied with the result, we move on to finding promising new candidates:
$ transmep ucb
Path to load the model from: mymodel.pt
Path to the wild type sequence file: https://mutation-prediction.thoffbauer.de/transmep/c-wt.txt
Kappa value for UCB. Higher kappa values lead to more exploration: 3
Size of population per restart: 100
How many initializations to try for genetic optimization: 10
Maximum number of mutations to allow.: 5
Comma separated list of positions that should be mutated. Set to 'all' to allow mutations on all positions: 32,46,49,51,53,56,97
Criterion optimization: 100%|██████████| 10/10 [01:13<00:00, 7.39s/it]
Rank 0 with UCB = 1.1940
Y32G+F46S+I53D+L56K+V97N
Rank 1 with UCB = 1.1917
Y32G+F46S+I53G+L56K+V97E
Rank 2 with UCB = 1.1215
Y32A+F46S+I53T+L56A+V97E
Rank 3 with UCB = 1.1201
Y32S+F46A+I53D+L56A+V97R
Rank 4 with UCB = 1.0993
Y32E+I53D+L56V+V97R
Rank 5 with UCB = 1.0981
Y32I+I53D+L56S+V97S
Rank 6 with UCB = 1.0935
Y32T+I53D+V97R
Rank 7 with UCB = 1.0876
Y32S+I53E+V97R
Rank 8 with UCB = 1.0827
Y32P+F46Q+I53S+L56G+V97E
Rank 9 with UCB = 1.0812
Y32G+I53P+L56V+V97E
This project uses Poetry for dependency management, so please install this first. Then, you can install all dependencies using this command:
poetry install
And open a shell inside the virtual environment using:
poetry shell
If you make some changes, please create a PR to merge them into this repository. Also, please ensure that your code is well formatted by running black . and isort .. Tests can be executed using pytest.