CDState is an unsupervised deconvolution method for tumor bulk RNA-sequencing data, aimed at identifying malignant cell states and their proportions.

- Clone the repository:

      git clone https://github.com/BoevaLab/CDState.git
      cd CDState

- Create the Conda environment:

      conda env create -f environment.yml

- Activate the environment:

      conda activate cdstate

- Verify Python version:

      python --version  # should be Python 3.10.x

Input bulk data should have genes in rows, samples in columns.
Input purity should have a column 'purity', with samples in rows.
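
The layout requirements above can be checked before running CDState. The following is a minimal sketch, not part of the CDState package: `check_inputs` is a hypothetical helper, the toy gene and sample names are invented, and the check that purity lies in [0, 1] is an assumption. It takes purity as a data frame with a 'purity' column and one row per sample.

```python
import pandas as pd


def check_inputs(data: pd.DataFrame, purity: pd.DataFrame) -> pd.Series:
    """Hypothetical helper (not part of CDState): check the input layout.

    Returns the purity values reordered to match the sample columns of `data`.
    """
    if "purity" not in purity.columns:
        raise ValueError("purity must contain a column named 'purity'")
    missing = data.columns.difference(purity.index)
    if len(missing) > 0:
        raise ValueError(f"samples without a purity value: {list(missing)}")
    aligned = purity.loc[data.columns, "purity"]
    # Assumption: purity is a fraction between 0 and 1.
    if not aligned.between(0, 1).all():
        raise ValueError("purity values are expected to lie in [0, 1]")
    return aligned


# Toy inputs: genes in rows, samples in columns (names are made up).
data = pd.DataFrame(
    [[5.0, 3.0, 8.0], [0.0, 2.0, 1.0]],
    index=["GENE_A", "GENE_B"],
    columns=["s1", "s2", "s3"],
)
purity = pd.DataFrame({"purity": [0.9, 0.4, 0.7]}, index=["s3", "s1", "s2"])

aligned_purity = check_inputs(data, purity)
```

Reordering purity to follow `data.columns` avoids silently pairing a purity value with the wrong sample when the two files list samples in different orders.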

    import CDState_base as cd
    import pandas as pd
    import copy
    import numpy as np

    data = pd.read_csv("data/bulkified_mixes/mixa_bulk_sum.csv", index_col=0, sep=',', header=0)
    proportions = pd.read_csv("data/bulkified_mixes/seta_bulk_sum.csv", index_col=0, sep=',', header=0)
    purity = proportions.loc[:, 'Malignant']
    purity.rename(index="purity", inplace=True)
    purity.index = data.columns

- Create CDState object:

      k = 3  # number of sources
      cn = cd.CDState(data, num_bases=k, global_round=False)

- Prepare data (filter out genes from sex chromosomes and keep only highly variable genes for deconvolution):

      cn.prepare_data()

- Initialize sources as random k samples from `cn.data` after gene filtering:

      n_cols = cn.data.shape[1]
      cols = np.random.choice(n_cols, size=k, replace=False)
      initial_sources = cn.data[:, cols]
      cn.W = copy.copy(initial_sources)
      cn.W += 1e-10  # add pseudocount to avoid division by 0

- Run Step 1:

      cn.factorize()

- Run Step 2:

      cnG = cd.CDState(data, purity, num_bases=k, global_round=True, gene_list=cn.gene_list)
      cnG.H = copy.copy(cn.H)  # start from proportions found in Step 1
      cnG.W = copy.copy(cn.W)  # start from sources found in Step 1
      cnG.prepare_data()
      cnG.factorize()

- (Optional) Recover source expression for filtered-out genes:

      cnG.infer_full()

CDState returns the following outputs:
- outputs from Step 1 are stored in `cn`, outputs from Step 2 are stored in `cnG`
- `cn.H` and `cnG.H`: numpy array with inferred source proportions [sources x number of samples]
- `cn.W` and `cnG.W`: numpy array with inferred source expression [filtered genes x sources]
- `cn.full_W` and `cnG.full_W`: pandas data frame with source expression inferred for all genes from `data` [genes x sources]
- `cn.gene_list` and `cnG.gene_list`: list of genes used for deconvolution
- `cnG.mal`: column indexes of malignant sources (malignant = `cnG.W[:, cnG.mal]`)

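
To make the shapes listed above concrete, the sketch below uses random stand-in arrays rather than real CDState output. It assumes, as in standard matrix-factorization deconvolution, that the filtered expression matrix is approximated by the product `W @ H`, that proportions sum to 1 per sample, and that rows of `H` follow the same source order as columns of `W`. The index list `mal` stands in for `cnG.mal`, and all values are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_sources, n_samples = 6, 3, 4

# Stand-ins for cnG.W [filtered genes x sources] and cnG.H [sources x samples].
W = rng.random((n_genes, n_sources))
H = rng.random((n_sources, n_samples))
H = H / H.sum(axis=0, keepdims=True)  # assumption: proportions sum to 1 per sample

# Assumed reconstruction of the bulk matrix [filtered genes x samples].
reconstructed = W @ H

# Stand-in for cnG.mal: column indexes of the malignant sources.
mal = [0, 2]
malignant_sources = W[:, mal]      # [filtered genes x malignant sources]
malignant_proportions = H[mal, :]  # [malignant sources x samples] (assumed row order)

print(reconstructed.shape, malignant_sources.shape, malignant_proportions.shape)
```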
CDState uses the following parameters that can be adjusted by a user:
| Parameter | Description |
|---|---|
| `data` | pandas data frame with bulk RNA-seq expression [genes x samples] |
| `purity` | pandas data frame with purity values for input samples, index should be called 'purity' |
| `num_bases` | Number of sources for deconvolution (default: 4) |
| `global_round` | if False, CDState runs Step 1, otherwise runs Step 2 (default: False) |
| `l1` and `l2` | Weights used to prioritize reconstruction error and cosine similarity in Step 2 (default: l1=1, l2=0) |
| `threshold_low` and `threshold_high` | quantiles for gene filtering (keeps genes between the two values) (default: threshold_low = 0.3, threshold_high = 0.99) |
| `gene_list` | input gene list to be used for deconvolution (instead of CDState internal gene filtering) (default: None) |

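
As an illustration of how `threshold_low` and `threshold_high` could act, the sketch below keeps genes whose across-sample variance lies between two quantiles. This is an assumption for illustration only: the statistic that CDState actually thresholds is not stated here, and `filter_genes_by_quantile` is a hypothetical helper, not a CDState function. The toy expression matrix is randomly generated.

```python
import numpy as np
import pandas as pd


def filter_genes_by_quantile(data, threshold_low=0.3, threshold_high=0.99):
    """Hypothetical helper: keep genes whose across-sample variance lies
    between the two quantiles (inclusive). Variance is an assumed statistic."""
    gene_var = data.var(axis=1)
    low = gene_var.quantile(threshold_low)
    high = gene_var.quantile(threshold_high)
    keep = (gene_var >= low) & (gene_var <= high)
    return data.loc[keep]


# Toy matrix: 100 genes x 5 samples, with variance growing along the gene axis.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.random((100, 5)) * np.arange(1, 101)[:, None],
    index=[f"gene_{i}" for i in range(100)],
    columns=[f"s{j}" for j in range(5)],
)

filtered = filter_genes_by_quantile(toy)
gene_list = filtered.index.tolist()
```

A list built this way, or by any other selection procedure, could then be supplied through the `gene_list` parameter in place of the internal gene filtering.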
Jupyter notebooks with tutorials on how to run CDState, select number of components and characterize output sources can be found in notebooks.
For method overview please see our Wiki.
Bulkified data used for CDState benchmarking can be found in data:
- `bulkified_cancer` contains five datasets: breast cancer (Wu et al.), ovarian cancer (Vázquez-García et al.), glioblastoma (Neftel et al.), lung cancer (Kim et al.), and squamous cell carcinoma (Ji et al.).
- `bulkified_mixes` contains four simulated mixes generated using lung cancer data (Kim et al.).

For details check README inside the two directories.
CDState malignant sources and proportions identified across The Cancer Genome Atlas bulk RNA-seq data can be found in TCGA_states. For details check README inside the directory.
If you use CDState in your work, you can cite it using:

    @article {Kraft2025.03.01.641017,
        author = {Kraft, Agnieszka and Yates, Josephine and Barkmann, Florian and Boeva, Valentina},
        title = {CDState: an unsupervised approach to predict malignant cell heterogeneity in tumor bulk RNA-sequencing data},
        elocation-id = {2025.03.01.641017},
        year = {2025},
        doi = {10.1101/2025.03.01.641017},
        publisher = {Cold Spring Harbor Laboratory},
        URL = {https://www.biorxiv.org/content/early/2025/04/09/2025.03.01.641017},
        eprint = {https://www.biorxiv.org/content/early/2025/04/09/2025.03.01.641017.full.pdf},
        journal = {bioRxiv}
    }

CDState is licensed under the MIT License. See the LICENSE file for details.
