DeNovo Peptide Sequencing with Hybrid Multi-Modal Deep Learning

Kyle Sherman

Developed using the PyTorch framework

Project Overview

This project aims to introduce an innovative deep learning framework for de novo peptide sequencing from tandem mass spectrometry (MS/MS) data.

Leveraging a novel hybrid architecture, this model aims to set a new standard for accuracy and robustness in predicting complete peptide sequences, even for previously uncharacterized peptides, directly from their fragmentation patterns.

De novo sequencing (sequencing "from scratch", without a reference database) addresses a critical challenge in proteomics: identifying novel proteins, unexpected post-translational modifications (PTMs), and sequence variants that traditional database search methods miss. This approach combines the strengths of several neural network paradigms to interpret complex mass spectral features and translate them into accurate amino acid sequences.

The Approach

This model employs a complex hybrid architecture designed to capture both sequential and structural information embedded within mass spectra:

- Multi-Modal Transformer Encoder-Decoder: The core of the model. It processes sequences of mass spectral peaks (m/z, intensity) together with precursor information, using self-attention to learn long-range dependencies.
- CNN Integration: 1D Convolutional Neural Networks are incorporated within the encoder to extract robust local features and characteristic fragmentation patterns from binned mass spectra.
- Graph Neural Network (GNN) Integration: A novel component that models the graph-like relationships within mass spectral data. Peaks are treated as nodes, while edges represent potential amino acid mass differences or neutral losses, allowing the model to learn structural dependencies directly.
- Retrieval-Augmented Generation (RAG): To improve accuracy for known peptides and make predictions more robust, the model integrates a retrieval mechanism. It consults a large, pre-indexed library of known peptide-spectrum matches (PSMs) to guide or refine its de novo predictions, combining the strengths of database search with generative capabilities.
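As a rough illustration of the graph idea described above, the sketch below links any two peaks whose m/z gap matches an amino acid residue mass. It is not the project's implementation: the function name `build_peak_graph`, the tolerance `tol`, and the reduced residue table are assumptions made for this example, fragments are assumed to be singly charged, and neutral-loss edges are left out for brevity.

```python
from itertools import combinations

# Monoisotopic residue masses in Da (a subset, for brevity).
# "L" stands for both leucine and isoleucine, which share the same mass.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "F": 147.06841,
}


def build_peak_graph(mz_values, tol=0.02):
    """Return (sorted peaks, edges) for a list of fragment m/z values.

    Each edge is (i, j, residue): peak j minus peak i matches the residue
    mass within `tol` Da. Assumes singly charged fragments, so an m/z gap
    equals a mass gap (an assumption of this sketch).
    """
    peaks = sorted(mz_values)
    edges = []
    for i, j in combinations(range(len(peaks)), 2):
        gap = peaks[j] - peaks[i]
        for residue, mass in RESIDUE_MASS.items():
            if abs(gap - mass) <= tol:
                edges.append((i, j, residue))
    return peaks, edges


# Hypothetical b-ion peaks for the peptide "GAS" (b1, b2, b3), given unsorted.
peaks, edges = build_peak_graph([216.09788, 58.02874, 129.06585])
print(edges)  # [(0, 1, 'A'), (1, 2, 'S')]
```

An edge list in this form maps directly onto the node and edge inputs that graph libraries expect, which is why it is a convenient intermediate representation.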

Results & Performance Metrics (Coming Soon)

To be filled out once training results become available.

Initial Training Data

The model is initially trained and extensively validated on a high-quality, labeled E. coli MS/MS dataset comprising [Number] spectra and [Number] unique peptide sequences.

Key metrics being tracked:
- Peptide Accuracy: Exact match percentage of predicted sequences to ground truth.
- Amino Acid Accuracy: Percentage of correctly predicted amino acids across all sequences.
- (Optional: Add any other relevant metrics like precision, recall, or spectral similarity scores if you plan to report them.)
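The two accuracy metrics listed above could be computed along these lines. This is a minimal sketch, not the project's evaluation code: the function names are hypothetical, sequences are assumed to be plain strings of one-letter residue codes, and amino acids are compared position by position, which is only one of several possible definitions of amino acid accuracy.

```python
def peptide_accuracy(predicted, truth):
    """Fraction of predicted sequences that match the ground truth exactly."""
    if not truth:
        return 0.0
    matches = sum(p == t for p, t in zip(predicted, truth))
    return matches / len(truth)


def amino_acid_accuracy(predicted, truth):
    """Fraction of ground-truth residues predicted correctly, by position.

    Position-wise comparison is an assumption of this sketch; residues
    beyond the end of a shorter prediction count as wrong.
    """
    correct = 0
    total = 0
    for p, t in zip(predicted, truth):
        total += len(t)
        correct += sum(a == b for a, b in zip(p, t))
    return correct / total if total else 0.0


# Hypothetical predictions and labels, for illustration only.
predicted = ["PEPTIDE", "GASK"]
truth = ["PEPTIDE", "GAVK"]
print(peptide_accuracy(predicted, truth))     # 0.5
print(amino_acid_accuracy(predicted, truth))  # 10 of 11 residues correct
```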

Problem Statement

De novo peptide sequencing determines the amino acid sequence of a peptide directly from mass spectrometry (MS) or tandem mass spectrometry (MS/MS) data, without relying on a reference database. It analyzes the mass-to-charge ratios (m/z) of the peptide fragments generated during fragmentation and infers the sequence from the mass differences between fragment ions, each of which corresponds to a specific amino acid.
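Because the instrument reports m/z rather than mass, a conversion step is usually needed before masses can be compared. The sketch below shows the standard relationship for protonated ions in positive mode; the function names are chosen for this example and the positive-mode assumption is ours, not something stated by the project.

```python
PROTON_MASS = 1.007276  # Da


def neutral_mass(mz, charge):
    """Neutral mass of an ion observed at `mz` carrying `charge` protons."""
    return (mz - PROTON_MASS) * charge


def mz_from_mass(mass, charge):
    """Expected m/z of a molecule of neutral `mass` carrying `charge` protons."""
    return mass / charge + PROTON_MASS


# A 1000 Da peptide with two protons is observed near m/z 501.007.
observed = mz_from_mass(1000.0, 2)
print(round(observed, 6))                   # 501.007276
print(round(neutral_mass(observed, 2), 6))  # 1000.0
```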

De Novo Process Overview

- Mass Spectrometry: Peptides are ionized and fragmented in a mass spectrometer, producing a spectrum of fragment ion masses.
- Fragment Analysis: The mass differences between peaks in the spectrum are used to infer the order of amino acids, since each amino acid has a characteristic mass.
- Sequence Reconstruction: Algorithms assemble the sequence by matching mass differences to amino acids.
- No Database Required: Unlike database search methods, de novo sequencing requires no prior knowledge of protein sequences, making it well suited to analyzing novel proteins.
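The fragment analysis and sequence reconstruction steps above can be illustrated with a toy ladder reader. This is a deliberately simplified sketch: it assumes a clean, complete series of singly charged ions of one type, which real spectra rarely provide, and the function name `read_ladder` and tolerance `tol` are hypothetical. The residue before the first peak is not recovered, because it precedes the first gap.

```python
# Monoisotopic residue masses in Da.
# "L" stands for both leucine and isoleucine, which share the same mass.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks, tol=0.02):
    """Translate gaps between consecutive peaks into residue letters.

    A gap that matches no residue mass within `tol` Da is reported as "?".
    """
    ordered = sorted(peaks)
    sequence = []
    for low, high in zip(ordered, ordered[1:]):
        gap = high - low
        # Pick the residue whose mass is closest to the observed gap.
        best = min(RESIDUE_MASS, key=lambda aa: abs(RESIDUE_MASS[aa] - gap))
        sequence.append(best if abs(RESIDUE_MASS[best] - gap) <= tol else "?")
    return "".join(sequence)


# Hypothetical b-ion ladder (b1 to b4) for the peptide "GAVL".
print(read_ladder([58.02874, 129.06585, 228.13426, 341.21832]))  # AVL
```

Real spectra contain noise peaks, missing fragments, and mixed ion types, which is the gap a learned model is meant to close.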

File Structure

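Each spectrum is stored as one block in Mascot Generic Format (MGF):

BEGIN IONS
TITLE=<scan title>
PEPMASS=<precursor m/z>
CHARGE=<precursor charge, e.g. 2+>
<fragment m/z> <intensity>  (one line per peak)
END IONS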
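A block in this layout can be read with a few lines of standard-library Python. The parser below is a minimal sketch for illustration, assuming MGF-style `KEY=value` header lines and whitespace-separated peak lines; the function name `parse_mgf` and the sample values are made up. Pyteomics, listed under Key Libraries, ships its own MGF reader in `pyteomics.mgf`, which would be the more robust choice in practice.

```python
def parse_mgf(text):
    """Parse MGF-style text into a list of spectra.

    Each spectrum is a dict with "params" (header fields as strings)
    and "peaks" (a list of (m/z, intensity) float tuples).
    """
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                fields = line.split()
                if len(fields) >= 2:
                    current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


# Made-up example block, for illustration only.
example = """BEGIN IONS
TITLE=example scan
PEPMASS=501.007276
CHARGE=2+
58.02874 120.0
129.06585 300.5
END IONS
"""

spectra = parse_mgf(example)
print(spectra[0]["params"]["CHARGE"])  # 2+
print(len(spectra[0]["peaks"]))        # 2
```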

Dataset

The dataset consists of mass spectra from Escherichia coli (E. coli). It was obtained from PRIDE, a free and open European repository of proteomics data whose datasets are intended specifically for proteomics research. The dataset is complete and well documented, which makes it simpler to work with and makes the accuracy of our findings easier to verify. In practice, however, this model can and should be trained on a variety of datasets.

Technical Requirements

The project is built to leverage modern deep learning frameworks while optimizing for my performance-constrained desktop hardware.

- Language: Python
- Deep Learning Framework: PyTorch (with CUDA and optional CPU compatibility)
- Key Libraries: Pyteomics, Matplotlib, NumPy, scikit-learn, SciPy, Deep Graph Library, PyTorch Geometric, faiss / annoy / hnswlib (for RAG's ANN search), pandas.

KSherman97/Proteomics
