Kyle Sherman. Developed using the PyTorch framework.
This project aims to introduce an innovative deep learning framework for de novo peptide sequencing from tandem mass spectrometry (MS/MS) data.
Leveraging a novel hybrid architecture, this model aims to set a new standard for accuracy and robustness in predicting complete peptide sequences, even for previously uncharacterized peptides, directly from their fragmentation patterns.
De novo sequencing (sequencing "from scratch", without a reference database) addresses a critical challenge in proteomics. The goal is to enable the identification of novel proteins, unexpected post-translational modifications (PTMs), and sequence variants missed by traditional database search methods. This approach combines the strengths of several neural network paradigms to interpret complex mass spectral features and translate them into accurate amino acid sequences.
This model employs a complex hybrid architecture designed to capture both sequential and structural information embedded within mass spectra:
- Multi-Modal Transformer Encoder-Decoder: The core of the model. It processes sequences of mass spectral peaks (m/z, intensity) together with precursor information, using self-attention to learn long-range dependencies.
- CNN Integration: 1D Convolutional Neural Networks are incorporated within the encoder to extract robust local features and characteristic fragmentation patterns from binned mass spectra.
- Graph Neural Network (GNN) Integration: A novel component that models the graph-like relationships within mass spectral data. Peaks are treated as nodes, while edges represent potential amino acid mass differences or neutral losses, allowing the model to learn structural dependencies directly.
- Retrieval-Augmented Generation (RAG): To enhance accuracy for known peptides and provide robust predictions, the model integrates a retrieval mechanism. It consults a large, pre-indexed library of known peptide-spectrum matches (PSMs) to guide or refine its de novo predictions, combining the best of database search with generative capabilities.
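To make the graph view of a spectrum more concrete, the following is a minimal sketch of how peaks could be linked by candidate amino acid mass differences and neutral losses. It is not the project's implementation: the function name `build_spectrum_graph`, the tolerance of 0.02 Da, the reduced residue table, and the assumption of singly charged fragments are all illustrative choices.

```python
from itertools import combinations

# Monoisotopic residue masses (Da) for a handful of amino acids.
# Illustrative subset only; a real model would cover all residues and PTMs.
RESIDUE_MASSES = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "P": 97.05276,
    "V": 99.06841,
    "L": 113.08406,  # leucine and isoleucine share this mass
}
# Common neutral losses (Da): water and ammonia.
NEUTRAL_LOSSES = {"-H2O": 18.01056, "-NH3": 17.02655}


def build_spectrum_graph(mz_values, tol=0.02):
    """Return (sorted peaks, edges), where each edge is (i, j, label).

    An edge links two peaks whose m/z gap matches a residue mass or a
    neutral loss within `tol`. Assumes singly charged fragments, so an
    m/z gap can be read directly as a mass gap.
    """
    known_gaps = {**RESIDUE_MASSES, **NEUTRAL_LOSSES}
    peaks = sorted(mz_values)
    edges = []
    for (i, low), (j, high) in combinations(enumerate(peaks), 2):
        gap = high - low
        for label, mass in known_gaps.items():
            if abs(gap - mass) <= tol:
                edges.append((i, j, label))
    return peaks, edges


# Hypothetical peak list, chosen so that three gaps match known masses.
peaks, edges = build_spectrum_graph([100.0, 157.02146, 228.05857, 210.04801])
for i, j, label in edges:
    print(f"{peaks[i]:.4f} -> {peaks[j]:.4f}: {label}")
```

The resulting node and edge lists are the kind of input a graph library such as PyTorch Geometric or Deep Graph Library would expect once converted to tensors; that conversion is left out here.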
To be filled out once training results become available
The model is initially trained and extensively validated on a high-quality, labeled E. coli MS/MS dataset, comprising [Number] of spectra and [Number] of unique peptide sequences.
- Key Metrics:
- Peptide Accuracy: Exact match percentage of predicted sequences to ground truth.
- Amino Acid Accuracy: Percentage of correctly predicted amino acids across all sequences.
- (Optional: Add any other relevant metrics like precision, recall, or spectral similarity scores if you plan to report them.)
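The two metrics above can be sketched as follows. This assumes that sequences are plain strings of one-letter residue codes without modifications and that amino acids are compared position by position; other evaluation schemes (for example, mass-tolerant matching) exist and may give different numbers. The function names are hypothetical.

```python
def peptide_accuracy(predicted, truth):
    """Fraction of predictions that match the ground-truth sequence exactly."""
    if not truth:
        return 0.0
    matches = sum(p == t for p, t in zip(predicted, truth))
    return matches / len(truth)


def amino_acid_accuracy(predicted, truth):
    """Fraction of ground-truth residues reproduced at the same position."""
    correct = 0
    total = 0
    for p, t in zip(predicted, truth):
        total += len(t)
        # zip stops at the shorter sequence, so missing residues count as errors.
        correct += sum(a == b for a, b in zip(p, t))
    return correct / total if total else 0.0


# Hypothetical predictions and labels, for illustration only.
predicted = ["PEPTIDE", "ACDK", "GG"]
truth = ["PEPTIDE", "ACEK", "GGA"]
print(f"Peptide accuracy:    {peptide_accuracy(predicted, truth):.3f}")
print(f"Amino acid accuracy: {amino_acid_accuracy(predicted, truth):.3f}")
```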
De novo peptide sequencing is a method for determining the amino acid sequence of a peptide directly from mass spectrometry (MS) or tandem mass spectrometry (MS/MS) data, without relying on a reference database. It analyzes the mass-to-charge ratios (m/z) of the peptide fragments generated during MS and determines the sequence by interpreting the mass differences between fragment ions, which correspond to specific amino acids.
- Mass Spectrometry - Peptides are ionized and fragmented in a mass spectrometer, producing a spectrum of fragment ion masses.
- Fragment Analysis - The mass differences between peaks in the spectrum are used to infer the order of amino acids, with each amino acid having a characteristic mass.
- Sequence Reconstruction - Algorithms create the sequence by matching mass differences to amino acids.
- No Database Required - Unlike other methods, de novo sequencing requires no prior knowledge of protein sequences, making this method ideal for analyzing novel proteins.
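As a toy illustration of the steps above, the sketch below generates an idealized b-ion ladder for a peptide and then reads the sequence back from the gaps between consecutive peaks. It assumes a complete, noise-free ladder of singly charged b-ions and a reduced residue table, which real spectra rarely provide; the function names and the 0.02 Da tolerance are illustrative.

```python
PROTON = 1.007276  # mass of a proton (Da)

# Monoisotopic residue masses (Da); illustrative subset only.
RESIDUE_MASSES = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "P": 97.05276,
    "V": 99.06841,
    "L": 113.08406,  # isoleucine has the same mass and cannot be told apart
}


def b_ion_ladder(peptide):
    """Theoretical singly charged b-ion m/z values for a peptide."""
    ladder = []
    running = PROTON
    for residue in peptide:
        running += RESIDUE_MASSES[residue]
        ladder.append(running)
    return ladder


def read_ladder(mz_values, tol=0.02):
    """Infer residues from the gaps in a complete b-ion ladder.

    Gaps that match no known residue mass are reported as "?".
    """
    peaks = sorted(mz_values)
    if not peaks:
        return ""
    # The first gap is measured from the bare proton, the rest between peaks.
    gaps = [peaks[0] - PROTON] + [b - a for a, b in zip(peaks, peaks[1:])]
    sequence = []
    for gap in gaps:
        matches = [aa for aa, m in RESIDUE_MASSES.items() if abs(gap - m) <= tol]
        sequence.append(matches[0] if matches else "?")
    return "".join(sequence)


ladder = b_ion_ladder("GASPVL")
print(read_ladder(ladder))
```

Missing peaks, noise, multiple charge states, and mixed ion series are what make the real problem hard and motivate a learned model rather than a direct lookup like this one.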
BEGIN IONS
TITLE = 'SCAN TITLE'
PEPMASS = Precursor m/z
CHARGE = Charge of the precursor ion
Mass / Charge  Intensity (one fragment peak per line)
END IONS
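For illustration, a block in this format can be read with a few lines of plain Python. The sketch below is a simplified reader, not a full MGF implementation: it assumes each peak line holds an m/z value optionally followed by an intensity, keeps header values as strings, and uses made-up example data. In practice, the MGF reader in Pyteomics (`pyteomics.mgf.read`) is likely the better choice.

```python
def parse_mgf(text):
    """Parse MGF-formatted text into a list of spectrum dictionaries.

    Each spectrum has "params" (header key/value strings) and "peaks"
    (a list of (m/z, intensity) tuples).
    """
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is None:
            continue  # ignore anything outside a spectrum block
        elif "=" in line and not line[0].isdigit():
            # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
            key, value = line.split("=", 1)
            current["params"][key.strip().upper()] = value.strip()
        else:
            # Peak line: m/z followed by an optional intensity.
            fields = line.split()
            mz = float(fields[0])
            intensity = float(fields[1]) if len(fields) > 1 else 0.0
            current["peaks"].append((mz, intensity))
    return spectra


# Hypothetical spectrum, for illustration only.
EXAMPLE = """BEGIN IONS
TITLE=example scan
PEPMASS=500.25
CHARGE=2+
100.1 10.0
200.2 20.5
END IONS
"""

for spectrum in parse_mgf(EXAMPLE):
    print(spectrum["params"], spectrum["peaks"])
```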
The dataset consists of mass spectra of Escherichia coli (E. coli). This data was obtained from the free and open European PRIDE data repository. The datasets in this repository are intended specifically for research in the field of proteomics. This dataset is complete and well documented, which makes it simpler to work with and makes the accuracy of our findings easier to verify. In practice, however, this model can and should be trained on a variety of datasets.
The project is built to leverage modern deep learning frameworks while optimizing for my performance-constrained desktop hardware.
- Language: Python
- Deep Learning Framework: PyTorch (with CUDA and optional CPU compatibility)
- Key Libraries: Pyteomics, Matplotlib, NumPy, scikit-learn, SciPy, Deep Graph Library, PyTorch Geometric, faiss/annoy/hnswlib (for RAG's ANN search), pandas.