CDState

An unsupervised approach to predict malignant cell heterogeneity in tumor bulk RNA-sequencing data

CDState is an unsupervised deconvolution method for tumor bulk RNA-sequencing data, aimed at identifying malignant cell states and their proportions.

Installation with Conda

Clone the repository:

git clone https://github.com/BoevaLab/CDState.git
cd CDState

Create the Conda environment:
```
conda env create -f environment.yml
```
Activate the environment:
```
conda activate cdstate
```

Verify Python version:

python --version # should be Python 3.10.x

Basic usage

Input bulk data should have genes in rows, samples in columns.

Input purity should have a column 'purity', with samples in rows.
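
Before running CDState it can help to confirm that both inputs have the expected orientation and refer to the same samples. The sketch below is not part of CDState: `check_inputs` is a hypothetical helper, the toy data are made up, and the check that purity lies in [0, 1] is an assumption about how purity is encoded.

```python
import numpy as np
import pandas as pd


def check_inputs(data: pd.DataFrame, purity: pd.DataFrame) -> pd.Series:
    """Hypothetical helper (not part of CDState): validate input orientation.

    Returns the purity values reordered to match the sample order in `data`.
    """
    if "purity" not in purity.columns:
        raise ValueError("purity must contain a column named 'purity'")
    # Every sample (column of `data`) needs a row in `purity`
    missing = data.columns.difference(purity.index)
    if len(missing) > 0:
        raise ValueError(f"samples without a purity value: {list(missing)}")
    aligned = purity.loc[data.columns, "purity"]
    # Assumption: purity is a fraction between 0 and 1
    if not aligned.between(0, 1).all():
        raise ValueError("purity values are expected to lie in [0, 1]")
    return aligned


# Toy example: 4 genes x 3 samples, purity rows in a different order
data = pd.DataFrame(
    np.arange(12, dtype=float).reshape(4, 3),
    index=["g1", "g2", "g3", "g4"],
    columns=["s1", "s2", "s3"],
)
purity = pd.DataFrame({"purity": [0.9, 0.5, 0.7]}, index=["s3", "s1", "s2"])
aligned = check_inputs(data, purity)
```

Aligning by sample label, as done here with `.loc`, avoids silently pairing purity values with the wrong samples when the two files are ordered differently.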

import CDState_base as cd
import pandas as pd
import copy
import numpy as np

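# Bulk expression matrix: genes in rows, samples in columns
data = pd.read_csv("data/bulkified_mixes/mixa_bulk_sum.csv", index_col=0, sep=',', header=0)
# Table from which the 'Malignant' column is taken as purity
proportions = pd.read_csv("data/bulkified_mixes/seta_bulk_sum.csv", index_col=0, sep=',', header=0)

purity = proportions.loc[:, 'Malignant'].copy()
purity.rename(index="purity", inplace=True)  # a scalar here sets the Series name
purity.index = data.columns  # positional: assumes the same sample order in both files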

Create CDState object:

k = 3 # number of sources
cn = cd.CDState(data, num_bases=k, global_round = False)

Prepare data - filter out genes from sex chromosomes and keep only highly variable genes for deconvolution:

cn.prepare_data()

Initialize sources as k random samples from cn.data after gene filtering:

n_cols = cn.data.shape[1]
cols = np.random.choice(n_cols, size=k, replace=False)
initial_sources = cn.data[:, cols]
cn.W = copy.copy(initial_sources)
cn.W += 1e-10 # add pseudocount to avoid division by 0
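
The initialization above draws columns with `np.random.choice`, which uses NumPy's global random state, so the starting sources differ between runs unless a seed is set. A minimal sketch of a reproducible variant follows; the seed value and the toy matrix standing in for `cn.data` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (any integer works)

# Toy stand-in for cn.data after gene filtering: 6 genes x 5 samples
toy_data = rng.random((6, 5))

k = 3  # number of sources
# Pick k distinct sample columns
cols = rng.choice(toy_data.shape[1], size=k, replace=False)

# Fancy indexing already returns a new array; .copy() just makes that explicit.
# The pseudocount keeps every entry strictly positive.
W0 = toy_data[:, cols].copy() + 1e-10
```

With a real CDState object, `W0` would be built from `cn.data` and assigned to `cn.W` as in the snippet above.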

Run Step 1:

cn.factorize()

Run Step 2:

cnG = cd.CDState(data, purity, num_bases=k, global_round = True, gene_list = cn.gene_list)
cnG.H = copy.copy(cn.H) # start from proportions found in Step 1
cnG.W = copy.copy(cn.W) # start from sources found in Step 1
cnG.prepare_data()
cnG.factorize()

(Optional) Recover source expression for filtered-out genes:

cnG.infer_full()
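
This README does not describe how `infer_full` recovers expression for the filtered-out genes. One common way to approach such a step, shown here purely as an illustration and not as CDState's implementation, is to keep the proportions H fixed and solve a least-squares problem for the source expression of every gene. All data below are simulated, and clipping negative values to zero is a simplification of a proper nonnegative solver.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_samples, k = 8, 20, 3

# Simulated nonnegative sources and per-sample proportions summing to 1
W_true = rng.random((n_genes, k))
H = rng.dirichlet(np.ones(k), size=n_samples).T  # shape (k, n_samples)
X_full = W_true @ H                              # noiseless bulk for all genes

# X_full ≈ W @ H is equivalent to H.T @ W.T ≈ X_full.T,
# which is an ordinary least-squares problem in W.T
W_T, *_ = np.linalg.lstsq(H.T, X_full.T, rcond=None)

# Clip tiny negative values so the estimate stays nonnegative
W_est = np.clip(W_T.T, 0, None)
```

On noiseless simulated data the estimate matches the simulated sources up to numerical precision; with real data the fit would only be approximate.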

Outputs

CDState returns the following outputs:

Outputs from Step 1 are stored in cn; outputs from Step 2 are stored in cnG.
cn.H and cnG.H: numpy array with inferred source proportions [sources x number of samples]
cn.W and cnG.W: numpy array with inferred source expression [filtered genes x sources]
cn.full_W and cnG.full_W: pandas data frame with source expression inferred for all genes from data [genes x sources]
cn.gene_list and cnG.gene_list: list of genes used for deconvolution
cnG.mal: column indices of the malignant sources in cnG.W (malignant = cnG.W[:, cnG.mal])
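
To make the shapes listed above concrete, the sketch below uses random arrays as stand-ins for `cnG.W`, `cnG.H`, and `cnG.mal` and shows how the malignant sources and their proportions can be sliced out. The values of `mal` are made up, and reading the bulk data as approximately `W @ H` is an assumption based on the documented shapes.

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_samples, k = 10, 6, 3

# Toy stand-ins with the documented shapes
W = rng.random((n_genes, k))    # like cnG.W: [filtered genes x sources]
H = rng.random((k, n_samples))  # like cnG.H: [sources x samples]
mal = [0, 2]                    # like cnG.mal (made-up indices)

malignant_W = W[:, mal]  # expression profiles of the malignant sources
malignant_H = H[mal, :]  # their proportions in each sample

# Summed proportion of malignant sources per sample
malignant_total = malignant_H.sum(axis=0)

# Assumption: the bulk matrix is approximated by the product W @ H
reconstruction = W @ H  # [filtered genes x samples]
```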

Input parameters:

CDState uses the following parameters that can be adjusted by a user:

| Parameter | Description |
| --- | --- |
| `data` | Pandas data frame with bulk RNA-seq expression [genes x samples] |
| `purity` | Pandas data frame with purity values for input samples, index should be called 'purity' |
| `num_bases` | Number of sources for deconvolution (default: 4) |
| `global_round` | If False, CDState runs Step 1, otherwise runs Step 2 (default: False) |
| `l1` and `l2` | Weights used to prioritize reconstruction error and cosine similarity in Step 2 (default: l1=1, l2=0) |
| `threshold_low` and `threshold_high` | Quantiles for gene filtering (keeps genes between the two values) (default: threshold_low = 0.3, threshold_high = 0.99) |
| `gene_list` | Input gene list to be used for deconvolution (instead of CDState internal gene filtering) (default: None) |
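
To illustrate what the two quantile thresholds do, the sketch below keeps genes whose score lies between the `threshold_low` and `threshold_high` quantiles. Scoring genes by their variance across samples is an assumption based on the mention of highly variable genes; CDState's actual criterion may differ, and the expression matrix here is simulated.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated expression matrix: 200 genes x 12 samples
expr = rng.gamma(shape=2.0, scale=1.0, size=(200, 12))

threshold_low, threshold_high = 0.3, 0.99  # README defaults

# Assumption: genes are scored by variance across samples
gene_var = expr.var(axis=1)

# Convert the two quantiles into cutoffs on the score
low_cut, high_cut = np.quantile(gene_var, [threshold_low, threshold_high])

# Keep genes whose score falls between the two cutoffs
keep = (gene_var >= low_cut) & (gene_var <= high_cut)
filtered = expr[keep, :]
```

With the defaults, roughly the bottom 30% and the top 1% of genes by score are dropped, which removes both low-information genes and extreme outliers.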

Tutorials

Jupyter notebooks with tutorials on how to run CDState, select the number of components, and characterize output sources can be found in the notebooks directory. For a method overview, please see our Wiki.

Data

Bulkified data used for CDState benchmarking can be found in the data directory:

bulkified_cancer contains five datasets: breast cancer (Wu et al.), ovarian cancer (Vázquez-García et al.), glioblastoma (Neftel et al.), lung cancer (Kim et al.), and squamous cell carcinoma (Ji et al.).
bulkified_mixes contains four simulated mixes generated using the lung cancer data (Kim et al.).

For details, see the README inside each of the two directories.

TCGA malignant cell states

CDState malignant sources and proportions identified across The Cancer Genome Atlas bulk RNA-seq data can be found in TCGA_states. For details, see the README inside the directory.

Citing CDState

If you use CDState in your work, you can cite it using

@article {Kraft2025.03.01.641017,
	author = {Kraft, Agnieszka and Yates, Josephine and Barkmann, Florian and Boeva, Valentina},
	title = {CDState: an unsupervised approach to predict malignant cell heterogeneity in tumor bulk RNA-sequencing data},
	elocation-id = {2025.03.01.641017},
	year = {2025},
	doi = {10.1101/2025.03.01.641017},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/04/09/2025.03.01.641017},
	eprint = {https://www.biorxiv.org/content/early/2025/04/09/2025.03.01.641017.full.pdf},
	journal = {bioRxiv}
}

License

CDState is licensed under the MIT License. See the LICENSE file for details.
