signaturescoring.scoring_methods package#

Submodules#

signaturescoring.scoring_methods.adjusted_neighborhood_scoring module#

signaturescoring.scoring_methods.adjusted_neighborhood_scoring.score_genes(adata: AnnData, gene_list: List[str], ctrl_size: int = 100, gene_pool: List[str] | None = None, df_mean_var: DataFrame | None = None, remove_genes_with_invalid_control_set: bool = True, store_path_mean_var_data: str | None = None, score_name: str = 'ANS_score', copy: bool = False, return_control_genes: bool = False, return_gene_list: bool = False, use_raw: bool | None = None) → AnnData | None#

Adjusted neighborhood gene signature scoring method (ANS) scores each cell in the dataset for a passed signature (gene_list) and stores the scores in the data object. Implementation is inspired by score_genes method of Scanpy (https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.score_genes.html#scanpy.tl.score_genes)

Parameters:

adata – AnnData object containing the preprocessed (log-normalized) gene expression data.
gene_list – A list of genes for which the cells are scored.
ctrl_size – The number of control genes selected for each gene in the gene_list.
gene_pool – The pool of genes out of which control genes can be selected.
df_mean_var – A pandas DataFrame containing the average expression (and variance) for each gene in the dataset. If df_mean_var is None, the average gene expression and variance is computed during gene signature scoring.
remove_genes_with_invalid_control_set – If true, the scoring method removes genes from the gene_list for which no optimal control set can be computed, i.e., if a gene belongs to the ctrl_size/2 genes with the largest average expression.
store_path_mean_var_data – Path to store data and visualizations created during average gene expression computation. If it is None, data and visualizations are not stored.
score_name – Column name for scores added in .obs of data.
copy – Indicates whether original or a copy of adata is modified.
return_control_genes – Indicates if method returns selected control genes.
return_gene_list – Indicates if method returns the possibly reduced gene list.
use_raw – Whether to compute gene signature score on raw data stored in .raw attribute of adata

Returns:

If copy=True, the method returns a copy of the original data with stored ANS scores in .obs, otherwise None is returned.
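The control-set selection described for ctrl_size and remove_genes_with_invalid_control_set can be illustrated with a small standalone sketch. It is not the package's implementation: the function name, the dictionary input, and the handling of the low-expression end are assumptions made for illustration. Genes are ordered by average expression, and each signature gene receives the ctrl_size non-signature genes around its position; a gene with fewer than ctrl_size/2 more highly expressed neighbors is dropped.

```python
from typing import Dict, List


def neighborhood_control_genes(
    mean_expr: Dict[str, float], gene_list: List[str], ctrl_size: int = 100
) -> Dict[str, List[str]]:
    """Pick ctrl_size control genes around each signature gene (hypothetical helper)."""
    half = ctrl_size // 2
    sig = set(gene_list)
    # Assumption: only non-signature genes may serve as controls.
    pool = sorted((g for g in mean_expr if g not in sig), key=mean_expr.get)
    controls: Dict[str, List[str]] = {}
    for gene in gene_list:
        # Number of pool genes with lower average expression than this gene.
        pos = sum(mean_expr[g] < mean_expr[gene] for g in pool)
        if pos + half > len(pool):
            # Too few more highly expressed neighbors: no valid control set.
            continue
        # Assumption: at the low-expression end the window is shifted upwards.
        start = max(pos - half, 0)
        controls[gene] = pool[start:start + ctrl_size]
    return controls


mean_expr = {f"g{i}": float(i) for i in range(10)}
controls = neighborhood_control_genes(mean_expr, ["g5", "g9"], ctrl_size=4)
print(controls)
```

In this toy example g9 has the largest average expression, so it receives no control set and is missing from the result.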

signaturescoring.scoring_methods.compute_signature_score module#

signaturescoring.scoring_methods.compute_signature_score.compute_signature_score(adata: AnnData, sig_genes: List[str], ctrl_genes: List[List[str]], max_block: int = 10000, max_nr_ctrl: int = 10000)#

The method computes for all Tirosh-based scoring methods the cell scores given the signature genes and the control gene sets for each signature gene.

Parameters:

adata – AnnData object containing the preprocessed (log-normalized) gene expression data.
sig_genes – A list of genes for which the cells are scored.
ctrl_genes – A list of control gene lists. The length of the outer list must correspond to the length of sig_genes.
max_block – Maximum number of cells before changing to parallel score computation for grouped cells.
max_nr_ctrl – Maximum total number of control genes before changing to parallel score computation for grouped cells.

Returns:

Returns vector of scores for each cell.
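As a rough illustration of a Tirosh-style score, the sketch below subtracts, for every signature gene, the mean expression of its control genes from the gene's own expression and averages these differences per cell. The function name, the gene-to-values dictionary, and the exact aggregation are assumptions; the package operates on AnnData and may compute the score differently.

```python
from statistics import mean
from typing import Dict, List


def tirosh_style_scores(
    expr: Dict[str, List[float]], sig_genes: List[str], ctrl_genes: List[List[str]]
) -> List[float]:
    """Per cell: mean over signature genes of (expression - mean of its controls)."""
    if len(sig_genes) != len(ctrl_genes):
        raise ValueError("ctrl_genes needs one control list per signature gene")
    n_cells = len(next(iter(expr.values())))
    scores = []
    for cell in range(n_cells):
        diffs = [
            expr[gene][cell] - mean(expr[c][cell] for c in controls)
            for gene, controls in zip(sig_genes, ctrl_genes)
        ]
        scores.append(mean(diffs))
    return scores


# Toy data: each gene maps to its expression in two cells.
expr = {"A": [2.0, 0.0], "B": [4.0, 1.0], "c1": [1.0, 1.0], "c2": [3.0, 1.0]}
scores = tirosh_style_scores(expr, ["A", "B"], [["c1", "c2"], ["c1"]])
print(scores)
```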

signaturescoring.scoring_methods.corrected_scanpy_scoring module#

signaturescoring.scoring_methods.corrected_scanpy_scoring.score_genes(adata: AnnData, gene_list: Sequence[str], ctrl_size: int = 100, gene_pool: Sequence[str] | None = None, n_bins: int = 25, score_name: str = 'corrected_scanpy_score', random_state: None | int | RandomState = 0, copy: bool = False, use_raw: bool | None = None, verbose: int = 0) → AnnData | None#

This scoring method is very similar to Scanpy’s scoring method (https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.score_genes.html, incl. code). Scanpy’s scoring method does not allow sampling from an expression bin more than once, even if more than one signature gene falls into the same expression bin. This behaviour is corrected in this scoring method.

Parameters:

adata – AnnData object containing the preprocessed (log-normalized) gene expression data.
gene_list – A list of genes for which the cells are scored.
ctrl_size – The number of control genes selected for each gene in the gene_list.
gene_pool – The pool of genes out of which control genes can be selected.
n_bins – The number of average gene expression bins to use.
score_name – Column name for scores added in .obs of data.
random_state – Seed for random state
copy – Indicates whether original or a copy of adata is modified.
use_raw – Whether to compute gene signature score on raw data stored in .raw attribute of adata
verbose – If verbose is larger than 0, print statements are shown.

Returns:

If copy=True, the method returns a copy of the original data with stored corrected Scanpy scores in .obs, otherwise None is returned.
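The correction can be pictured as drawing one control set per signature gene instead of one per distinct expression bin. The following sketch uses only the standard library; the function name and the precomputed gene-to-bin mapping are assumptions for illustration and do not reflect the package's internals.

```python
import random
from typing import Dict, List


def sample_controls_per_gene(
    gene_bin: Dict[str, int], gene_list: List[str], ctrl_size: int, seed: int = 0
) -> Dict[str, List[str]]:
    """Draw up to ctrl_size control genes for every signature gene from its bin."""
    rng = random.Random(seed)
    sig = set(gene_list)
    controls: Dict[str, List[str]] = {}
    for gene in gene_list:
        # Candidates: non-signature genes that share the gene's expression bin.
        candidates = sorted(
            g for g, b in gene_bin.items() if b == gene_bin[gene] and g not in sig
        )
        # One draw per signature gene, so a shared bin is sampled repeatedly.
        controls[gene] = rng.sample(candidates, min(ctrl_size, len(candidates)))
    return controls


gene_bin = {"s1": 0, "s2": 0, "a": 0, "b": 0, "c": 0, "d": 1}
sampled = sample_controls_per_gene(gene_bin, ["s1", "s2"], ctrl_size=2)
print(sampled)
```

Both s1 and s2 fall into bin 0 here, and each still receives its own two control genes.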

signaturescoring.scoring_methods.gene_signature_scoring module#

signaturescoring.scoring_methods.gene_signature_scoring.score_signature(adata: AnnData, gene_list: List[str], method: str = 'adjusted_neighborhood_scoring', **kwarg) → AnnData | None#

Wrapper method to call one of the available gene expression signature scoring methods (ANS, Seurat, Seurat_AG, Seurat_LVG, Scanpy, Jasmine, UCell).

Parameters:

adata – AnnData object containing the gene expression data.
gene_list – A list of genes, i.e., the gene expression signature, for which the cells are scored.
method – Scoring method to use. One of [‘adjusted_neighborhood_scoring’, ‘seurat_scoring’, ‘seurat_ag_scoring’, ‘seurat_lvg_scoring’, ‘scanpy_scoring’, ‘jasmine_scoring’, ‘ucell_scoring’, ‘neighborhood_scoring’, ‘corrected_scanpy_scoring’]
**kwarg – Other keyword arguments specific for the scoring method. See below names of individual scoring methods and their available keyword arguments.

Returns:

If copy=True, the method returns a copy of the original data with the scores of the selected method stored in .obs, otherwise None is returned.

Notes

ANS: Adjusted neighborhood signature scoring method (see signaturescoring.scoring_methods.adjusted_neighborhood_scoring.score_genes)
Seurat: Scoring method based on Tirosh et al. 2016: https://doi.org/10.1126/science.aad0501 (see signaturescoring.scoring_methods.seurat_scoring.score_genes)
Seurat_AG, Seurat_LVG: Modifications of the Seurat method. The former selects all genes in an expression bin as control genes; the latter selects the least variable genes of an expression bin as control genes (see signaturescoring.scoring_methods.seurat_[ag/lvg]_scoring.score_genes)
Scanpy: Scoring method implemented in Scanpy: https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes.html
Jasmine: Rank-based signature scoring method by Noureen et al. 2022: https://doi.org/10.7554/eLife.71994, (see signaturescoring.scoring_methods.jasmine_scoring.score_genes)
UCell: Rank-based signature scoring method by Andreatta et al. 2021: https://doi.org/10.1016/j.csbj.2021.06.043, (see signaturescoring.scoring_methods.ucell_scoring.score_genes)
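A wrapper of this kind is commonly built as a lookup from the method string to a scoring function, with the remaining keyword arguments forwarded unchanged. The sketch below shows that pattern with two hypothetical stand-in functions; it is an assumption about the general design and not the package's actual code.

```python
from typing import Callable, Dict, List


def _ans_stub(gene_list: List[str], ctrl_size: int = 100) -> str:
    # Hypothetical stand-in for an ANS scoring function.
    return f"ANS: {len(gene_list)} genes, ctrl_size={ctrl_size}"


def _ucell_stub(gene_list: List[str], maxRank: int = 1500) -> str:
    # Hypothetical stand-in for a UCell scoring function.
    return f"UCell: {len(gene_list)} genes, maxRank={maxRank}"


SCORERS: Dict[str, Callable[..., str]] = {
    "adjusted_neighborhood_scoring": _ans_stub,
    "ucell_scoring": _ucell_stub,
}


def score_signature_sketch(
    gene_list: List[str], method: str = "adjusted_neighborhood_scoring", **kwargs
) -> str:
    """Look up the scoring function by name and forward method-specific kwargs."""
    try:
        scorer = SCORERS[method]
    except KeyError:
        raise ValueError(
            f"Unknown method {method!r}; choose one of {sorted(SCORERS)}"
        ) from None
    return scorer(gene_list, **kwargs)


print(score_signature_sketch(["A", "B"]))
print(score_signature_sketch(["A"], method="ucell_scoring", maxRank=10))
```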

signaturescoring.scoring_methods.gmm_postprocessing module#

class signaturescoring.scoring_methods.gmm_postprocessing.GMMPostprocessor(n_components: int = 3, covariance_type: str = 'full', init_params: str = 'k-means++', n_init: int = 30)#

Bases: object

The GMMPostprocessor class is used to correct for incomparable score ranges in gene signature scoring. It fits a Gaussian Mixture Model on gene signature scores and assigns clusters to signatures.

Variables:

n_components – Defines the number of clusters we expect in the Gaussian Mixture Model. For postprocessing gene expression signatures we use n_components=#signatures or n_components=(#signatures+1).
gmm – Corresponds to the GMM used for postprocessing.

Parameters:

n_components – Number of clusters we expect in the Gaussian Mixture Model.
covariance_type – The type of covariance used in GMM. Available methods ‘full’, ‘tied’, ‘diag’, ‘spherical’.
init_params – Method to initialize parameters in GMM. Available methods ‘kmeans’, ‘k-means++’, ‘random’, ‘random_from_data’
n_init – Number of initializations done.

assign_clusters_to_signatures(adata: AnnData, score_names: List[str], gmm_proba_names: List[str], plot: bool = False, store_plot_path: str | None = None) → dict#

The method computes the assignments of GMM clusters to gene expression signatures by computing the correlation of each cluster’s probabilities with the signatures’ scores.

Parameters:

adata – AnnData object containing the gene expression data.
score_names – Name of signature scores columns.
gmm_proba_names – Name of GMM cluster probability columns.
plot – Plot scatterplots of scores and probabilities for each signature and GMM cluster.
store_plot_path – Path to location where scatterplots should be stored. If None, plots are not stored.

Returns:

The assignments of each signature to a cluster from GMM postprocessing.
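The correlation-based assignment can be sketched without any GMM library: given score columns and cluster-probability columns, each signature is mapped to the probability column it correlates with most strongly. The helper names, the use of Pearson correlation, and the argmax rule are assumptions for illustration; the package's rule (for example its tie handling) may differ.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def assign_clusters(
    scores: Dict[str, List[float]], probas: Dict[str, List[float]]
) -> Dict[str, str]:
    """Map each score column to the most strongly correlated probability column."""
    return {
        score_name: max(probas, key=lambda p: pearson(score_values, probas[p]))
        for score_name, score_values in scores.items()
    }


# Toy columns for four cells; the column names are made up.
scores = {"sigA": [0.9, 0.8, 0.1, 0.0], "sigB": [0.0, 0.1, 0.7, 0.9]}
probas = {"gmm_0": [0.1, 0.2, 0.9, 0.8], "gmm_1": [0.9, 0.8, 0.1, 0.2]}
assignment = assign_clusters(scores, probas)
print(assignment)
```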

fit_and_predict(adata: AnnData, score_names: List[str], store_name: str | None = None, inplace: bool = True) → str | List[str] | DataFrame | None#

The method fits previously initialized GMM on signature scores.

Parameters:

adata – AnnData object containing the gene expression data.
score_names – Name of signature scores columns on which the GMM is fit.
store_name – Prefix of new columns with probabilities
inplace – Whether the probabilities are stored in adata or returned in a new pandas DataFrame.

Returns:

If ‘inplace=True’, the names of the new columns are returned. If ‘inplace=False’, the names of the new columns and the DataFrame containing the cluster probabilities are returned.

signaturescoring.scoring_methods.gmm_postprocessing.check_score_names(adata: AnnData, score_names: List[str])#

Asserts that names associated with score columns exist in the data.

Parameters:

adata – AnnData object containing the gene expression data.
score_names – Names of score columns

Returns:

None

Raises:

AssertionError – If any value in score_names is not contained in adata.

signaturescoring.scoring_methods.jasmine_scoring module#

signaturescoring.scoring_methods.jasmine_scoring.compute_avg_ranks_sig_subset(X_data, index, columns, gene_list, X_indices=None, X_indptr=None, X_shape=None)#

Compute the average ranks for a given signature for each cell.

Parameters:

X_data – Gene expression data.
index – Index of cells.
columns – Names of genes.
gene_list – Signature genes.
X_indices – For sparse matrix reconstruction indices. If None, method assumes X_data to be a dense matrix.
X_indptr – For sparse matrix reconstruction index pointers. If None, method assumes X_data to be a dense matrix.
X_shape – For sparse matrix reconstruction shape of original matrix. If None, method assumes X_data to be a dense matrix.

Returns:

For each cell in X_data the method returns the average ranks.

signaturescoring.scoring_methods.jasmine_scoring.compute_avg_ranks_signature(adata, sparse_X, gene_list, bs, joblib_kwargs)#

Create groups of manageable sizes. For each group compute for each cell the ranks of the genes and select the ranks that belong to the signature genes

Parameters:

adata – AnnData object with gene expression data.
sparse_X – Indicates if data is sparse.
gene_list – Signature genes.
bs – The number of cells in a processing batch.
joblib_kwargs – Keyword arguments for parallel execution with joblib.

Returns:

For each cell in adata the method returns the average ranks

signaturescoring.scoring_methods.jasmine_scoring.likelihood_calculation(adata, genes)#

Computes the enrichment value as the likelihood of the values returned by preparation.

Parameters:

adata – AnnData object with gene expression data
genes – Signature genes

Returns:

Enrichment score based on Likelihood

signaturescoring.scoring_methods.jasmine_scoring.or_calculation(adata, genes)#

Computes the enrichment value as the odds ratio of the values returned by preparation.

Parameters:

adata – AnnData object with gene expression data
genes – Signature genes

Returns:

Enrichment score based on Odds Ratio

signaturescoring.scoring_methods.jasmine_scoring.preparation(adata, genes)#

Preparation for the computation of the expression value in JASMINE scoring.

Parameters:

adata – AnnData object with gene expression data.
genes – Signature genes.

Returns:

The method returns the number of signature genes expressed, signature genes not expressed, non-signature genes expressed, and non-signature genes not expressed
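The four counts form a 2x2 contingency table, from which an odds ratio and a likelihood-style ratio can be derived. The sketch below uses the textbook definitions; the function and argument names are hypothetical, and whether the package uses exactly these formulas or guards against zero counts is an assumption not confirmed here.

```python
def odds_ratio(sig_exp: int, sig_not: int, nonsig_exp: int, nonsig_not: int) -> float:
    """Odds of being expressed for signature genes relative to non-signature genes."""
    # No guard against zero counts: sig_not == 0 or nonsig_exp == 0 raises.
    return (sig_exp * nonsig_not) / (sig_not * nonsig_exp)


def likelihood_ratio(sig_exp: int, sig_not: int, nonsig_exp: int, nonsig_not: int) -> float:
    """Fraction of signature genes expressed over fraction of non-signature genes expressed."""
    frac_sig = sig_exp / (sig_exp + sig_not)
    frac_nonsig = nonsig_exp / (nonsig_exp + nonsig_not)
    return frac_sig / frac_nonsig


# Toy cell: 5 of 10 signature genes and 250 of 1000 other genes are expressed.
counts = (5, 5, 250, 750)
print(odds_ratio(*counts), likelihood_ratio(*counts))
```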

signaturescoring.scoring_methods.jasmine_scoring.rank_calculation(cell_data, genes)#

Compute the ranks of the gene expressions for a cell

Parameters:

cell_data – gene expression data for a given cell
genes – signature genes

Returns:

average rank of signature genes for a given cell
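A minimal sketch of an average-rank computation for one cell follows. It ranks all genes in ascending order of expression and gives tied genes their mean rank; both choices, as well as the function name and dictionary input, are assumptions, and the package may rank only a subset of genes or treat ties differently.

```python
from typing import Dict, List


def average_signature_rank(cell_expr: Dict[str, float], genes: List[str]) -> float:
    """Average rank of the signature genes in one cell; ties share their mean rank."""
    ordered = sorted(cell_expr, key=cell_expr.get)  # rank 1 = lowest expression
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of genes tied with ordered[i].
        while j + 1 < len(ordered) and cell_expr[ordered[j + 1]] == cell_expr[ordered[i]]:
            j += 1
        tied_rank = (i + 1 + j + 1) / 2  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = tied_rank
        i = j + 1
    # Signature genes missing from the cell data are ignored.
    present = [g for g in genes if g in ranks]
    return sum(ranks[g] for g in present) / len(present)


cell = {"a": 0.0, "b": 0.0, "c": 1.0, "d": 3.0, "e": 2.0}
avg_rank = average_signature_rank(cell, ["d", "a"])
print(avg_rank)
```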

signaturescoring.scoring_methods.jasmine_scoring.score_genes(adata: AnnData, gene_list: List[str], score_method: str = 'likelihood', bs: int = 500, score_name: str = 'JASMINE_score', random_state: int | None = None, copy: bool = False, use_raw: bool | None = None, verbose: int = 0, joblib_kwargs: dict = {'n_jobs': 4}) → AnnData | None#

JASMINE signature scoring method is a Python implementation of the scoring method proposed by Noureen et al. 2022.

Nighat Noureen, Zhenqing Ye, Yidong Chen, Xiaojing Wang, and Siyuan Zheng. "Signature-scoring methods developed for bulk samples are not adequate for cancer single-cell RNA sequencing data". In: eLife 11 (Feb. 2022), e71994.

Implementation is inspired by score_genes method of Scanpy (https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.score_genes.html#scanpy.tl.score_genes)

Parameters:

adata – AnnData object containing the gene expression data.
gene_list – A list of genes (signature) for which the cells are scored.
score_method – Which enrichment value computation to use (‘oddsratio’ or ‘likelihood’).
bs – The number of cells in a processing batch.
score_name – Column name for scores added in .obs of data.
random_state – Seed for random state.
copy – Indicates whether original or a copy of adata is modified.
use_raw – Whether to compute gene signature score on raw data stored in .raw attribute of adata
verbose – If verbose is larger than 0, print statements are shown.
joblib_kwargs – Keyword argument for parallel execution with joblib.

Returns:

If copy=True, the method returns a copy of the original data with stored JASMINE scores in .obs, otherwise None is returned.

signaturescoring.scoring_methods.neighborhood_scoring module#

signaturescoring.scoring_methods.neighborhood_scoring.score_genes(adata: AnnData, gene_list: List[str], ctrl_size: int = 100, gene_pool: List[str] | None = None, df_mean_var: DataFrame | None = None, remove_genes_with_invalid_control_set: bool = True, store_path_mean_var_data: str | None = None, score_name: str = 'NS_score', copy: bool = False, return_control_genes: bool = False, return_gene_list: bool = False, use_raw: bool | None = None) → AnnData | None#

Neighborhood gene signature scoring method (NS) scores each cell in the dataset for a passed signature (gene_list) and stores the scores in the data object. Implementation is inspired by score_genes method of Scanpy (https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.score_genes.html#scanpy.tl.score_genes)

Parameters:

adata – AnnData object containing the preprocessed (log-normalized) gene expression data.
gene_list – A list of genes for which the cells are scored.
ctrl_size – The number of control genes selected for each gene in the gene_list.
gene_pool – The pool of genes out of which control genes can be selected.
df_mean_var – A pandas DataFrame containing the average expression (and variance) for each gene in the dataset. If df_mean_var is None, the average gene expression and variance is computed during gene signature scoring
remove_genes_with_invalid_control_set – If true, the scoring method removes genes from the gene_list for which no optimal control set can be computed, i.e., if a gene belongs to the ctrl_size/2 genes with the largest average expression.
store_path_mean_var_data – Path to store data and visualizations created during average gene expression computation. If it is None, data and visualizations are not stored.
score_name – Column name for scores added in .obs of data.
copy – Indicates whether original or a copy of adata is modified.
return_control_genes – Indicates if method returns selected control genes.
return_gene_list – Indicates if method returns the possibly reduced gene list.
use_raw – Whether to compute gene signature score on raw data stored in .raw attribute of adata

Returns:

If copy=True, the method returns a copy of the original data with stored NS scores in .obs, otherwise None is returned.

signaturescoring.scoring_methods.seurat_ag_scoring module#

signaturescoring.scoring_methods.seurat_ag_scoring.score_genes(adata: AnnData, gene_list: List[str], n_bins: int = 25, gene_pool: List[str] | None = None, df_mean_var: DataFrame | None = None, store_path_mean_var_data: str | None = None, score_name: str = 'Seurat_AG_score', copy: bool = False, return_control_genes: bool = False, return_gene_list: bool = False, use_raw: bool | None = None) → AnnData | None#

All genes as control genes scoring method (Seurat_AG) scores each cell in the dataset for a passed signature (gene_list) and stores the scores in the data object. Implementation is inspired by score_genes method of Scanpy (https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.score_genes.html#scanpy.tl.score_genes)

Parameters:

adata – AnnData object containing the preprocessed (log-normalized) gene expression data.
gene_list – A list of genes for which the cells are scored.
n_bins – The number of average gene expression bins to use.
gene_pool – The pool of genes out of which control genes can be selected.
df_mean_var – A pandas DataFrame containing the average expression (and variance) for each gene in the dataset. If df_mean_var is None, the average gene expression and variance is computed during gene signature scoring
store_path_mean_var_data – Path to store data and visualizations created during average gene expression computation. If it is None, data and visualizations are not stored.
score_name – Column name for scores added in .obs of data.
copy – Indicates whether original or a copy of adata is modified.
return_control_genes – Indicates if method returns selected control genes.
return_gene_list – Indicates if method returns the possibly reduced gene list.
use_raw – Whether to compute gene signature score on raw data stored in .raw attribute of adata.

Returns:

If copy=True, the method returns a copy of the original data with stored Seurat_AG scores in .obs, otherwise None is returned.

signaturescoring.scoring_methods.seurat_lvg_scoring module#

signaturescoring.scoring_methods.seurat_lvg_scoring.score_genes(adata: AnnData, gene_list: List[str], ctrl_size: int = 100, n_bins: int = 25, gene_pool: List[str] | None = None, lvg_computation_version: str = 'v1', lvg_computation_method: str = 'seurat', nr_norm_bins: int = 5, df_mean_var: DataFrame | None = None, store_path_mean_var_data: str | None = None, score_name: str = 'Seurat_LVG_score', copy: bool = False, return_control_genes: bool = False, return_gene_list: bool = False, use_raw: bool | None = None) → AnnData | None#

Least variable genes as control genes scoring method (Seurat_LVG) scores each cell in the dataset for a passed signature (gene_list) and stores the scores in the data object. Implementation is inspired by score_genes method of Scanpy (https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.score_genes.html#scanpy.tl.score_genes)

Parameters:

adata – AnnData object containing the preprocessed (log-normalized) gene expression data.
gene_list – A list of genes for which the cells are scored.
ctrl_size – The number of control genes selected for each gene in the gene_list.
n_bins – The number of average gene expression bins to use.
gene_pool – The pool of genes out of which control genes can be selected.
lvg_computation_version – The version of the least variable genes selection defines if the genes with the smallest dispersion are chosen directly from an expression bin (v1) or whether the expressions are binned a second round (v2).
lvg_computation_method – Indicates which method should be used to compute the least variable genes. We can use ‘seurat’ or ‘cell_ranger’. See reference https://scanpy.readthedocs.io/en/latest/generated/scanpy.pp.highly_variable_genes.html#scanpy-pp-highly-variable-genes
nr_norm_bins – The number of sub-bins to use if lvg_computation_version='v2'.
df_mean_var – A pandas DataFrame containing the average expression (and variance) for each gene in the dataset. If df_mean_var is None, the average gene expression and variance is computed during gene signature scoring.
store_path_mean_var_data – Path to store data and visualizations created during average gene expression computation. If it is None, data and visualizations are not stored.
score_name – Column name for scores added in .obs of data.
copy – Indicates whether original or a copy of adata is modified.
return_control_genes – Indicates if method returns selected control genes.
return_gene_list – Indicates if method returns the possibly reduced gene list.
use_raw – Whether to compute gene signature score on raw data stored in .raw attribute of adata

Returns:

If copy=True, the method returns a copy of the original data with stored Seurat_LVG scores in .obs, otherwise None is returned.
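The idea of taking the least variable genes of an expression bin as controls can be sketched as follows. Genes are split into n_bins equally sized bins by average expression, and for each signature gene the ctrl_size non-signature genes with the smallest variance in the same bin are selected. The function name, the (mean, variance) input, the equal-size binning, and the use of plain variance in place of a dispersion measure are all assumptions for illustration.

```python
from typing import Dict, List, Tuple


def least_variable_controls(
    mean_var: Dict[str, Tuple[float, float]],
    gene_list: List[str],
    ctrl_size: int,
    n_bins: int,
) -> Dict[str, List[str]]:
    """For each signature gene, pick the least variable genes of its expression bin."""
    ordered = sorted(mean_var, key=lambda g: mean_var[g][0])
    # Equally sized bins over the ranking by average expression.
    bin_of = {g: i * n_bins // len(ordered) for i, g in enumerate(ordered)}
    sig = set(gene_list)
    controls: Dict[str, List[str]] = {}
    for gene in gene_list:
        same_bin = [g for g in ordered if bin_of[g] == bin_of[gene] and g not in sig]
        same_bin.sort(key=lambda g: mean_var[g][1])  # smallest variance first
        controls[gene] = same_bin[:ctrl_size]
    return controls


variances = [0.5, 0.1, 0.9, 0.2, 1.0, 0.3, 0.05, 0.4]
mean_var = {f"g{i}": (float(i), v) for i, v in enumerate(variances)}
lvg_controls = least_variable_controls(mean_var, ["g2"], ctrl_size=2, n_bins=2)
print(lvg_controls)
```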

signaturescoring.scoring_methods.seurat_scoring module#

signaturescoring.scoring_methods.seurat_scoring.score_genes(adata: AnnData, gene_list: Sequence[str], ctrl_size: int = 100, n_bins: int = 25, gene_pool: Sequence[str] | None = None, df_mean_var: DataFrame | None = None, store_path_mean_var_data: str | None = None, score_name: str = 'Seurat_score', random_state: int | None = None, copy: bool = False, return_control_genes: bool = False, return_gene_list: bool = False, use_raw: bool | None = None) → AnnData | None#

Python implementation of the scoring method available in the R package Seurat (AddModuleScore), which is based on Tirosh et al. 2016. The method (Seurat) scores each cell in the dataset for a passed signature (gene_list) and stores the scores in the data object. Implementation is inspired by score_genes method of Scanpy (https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.score_genes.html#scanpy.tl.score_genes)

Parameters:

adata – AnnData object containing the preprocessed (log-normalized) gene expression data.
gene_list – A list of genes for which the cells are scored.
ctrl_size – The number of control genes selected for each gene in the gene_list.
n_bins – The number of average gene expression bins to use.
gene_pool – The pool of genes out of which control genes can be selected.
df_mean_var – A pandas DataFrame containing the average expression (and variance) for each gene in the dataset. If df_mean_var is None, the average gene expression and variance is computed during gene signature scoring.
store_path_mean_var_data – Path to store data and visualizations created during average gene expression computation. If it is None, data and visualizations are not stored.
score_name – Column name for scores added in .obs of data.
random_state – Seed for random state.
copy – Indicates whether original or a copy of adata is modified.
return_control_genes – Indicates if method returns selected control genes.
return_gene_list – Indicates if method returns the possibly reduced gene list.
use_raw – Whether to compute gene signature score on raw data stored in .raw attribute of adata.

Returns:

If copy=True, the method returns a copy of the original data with stored Seurat scores in .obs, otherwise None is returned.

signaturescoring.scoring_methods.ucell_scoring module#

signaturescoring.scoring_methods.ucell_scoring.compute_ranks_and_ustat(X_data, index, columns, gene_list, X_indices=None, X_indptr=None, X_shape=None, maxRank=1500)#

The following method computes for each cell in X_data the UCell score.

Parameters:

X_data – Current batch of gene expression data.
index – Index of cells.
columns – Names of genes.
gene_list – Signature genes.
X_indices – For sparse matrix reconstruction indices. If None, method assumes X_data to be a dense matrix.
X_indptr – For sparse matrix reconstruction index pointers. If None, method assumes X_data to be a dense matrix.
X_shape – For sparse matrix reconstruction shape of original matrix. If None, method assumes X_data to be a dense matrix.
maxRank – Cutoff for maximum rank allowed.

Returns:

For each cell in X_data the method returns the UCell score.

signaturescoring.scoring_methods.ucell_scoring.score_genes(adata: AnnData, gene_list: List[str], maxRank: int = 1500, bs: int = 500, score_name: str = 'UCell_score', random_state: int | None = None, copy: bool = False, use_raw: bool | None = None, verbose: int = 0, joblib_kwargs: dict = {'n_jobs': 4}) → AnnData | None#

UCell signature scoring method is a Python implementation of the scoring method proposed by Andreatta et al. 2021.

Massimo Andreatta and Santiago J Carmona. "UCell: Robust and scalable single-cell gene signature scoring". In: Comput. Struct. Biotechnol. J. 19 (June 2021), pp. 3796–3798.

Implementation is inspired by score_genes method of Scanpy (https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.score_genes.html#scanpy.tl.score_genes)

Parameters:

adata – AnnData object containing the gene expression data.
gene_list – A list of genes (signature) for which the cells are scored.
maxRank – Cutoff for maximum rank allowed.
bs – The number of cells in a processing batch.
score_name – Column name for scores added in .obs of data.
random_state – Seed for random state.
copy – Indicates whether original or a copy of adata is modified.
use_raw – Whether to compute gene signature score on raw data stored in .raw attribute of adata
verbose – If verbose is larger than 0, print statements are shown.
joblib_kwargs – Keyword argument for parallel execution with joblib.

Returns:

If copy=True, the method returns a copy of the original data with stored UCell scores in .obs, otherwise None is returned.

signaturescoring.scoring_methods.ucell_scoring.u_stat(rank_value, maxRank=1500)#

The method computes the U statistic on signature gene ranks.

Parameters:

rank_value – Ranks of the signature genes.
maxRank – Cutoff for maximum rank allowed.

Returns:

The U statistic for given signature gene ranks.
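For orientation, the sketch below follows the normalized score as published for UCell: ranks above the cutoff are set to maxRank + 1, the U statistic is the rank sum minus n(n+1)/2, and the score is 1 - U / (n * maxRank). The function name is hypothetical, and whether u_stat returns the raw U statistic or this normalized value is not stated above, so the final normalization step is an assumption.

```python
from typing import List


def ucell_style_score(rank_value: List[float], max_rank: int = 1500) -> float:
    """Normalized U-statistic score from signature gene ranks (rank 1 = highest expression)."""
    n = len(rank_value)
    # Ranks beyond the cutoff are all set to max_rank + 1.
    capped = [r if r <= max_rank else max_rank + 1 for r in rank_value]
    u = sum(capped) - n * (n + 1) / 2
    return 1 - u / (n * max_rank)


# The three top-ranked genes give the maximal score.
print(ucell_style_score([1, 2, 3]))
# One top-ranked gene and one gene beyond a cutoff of 10.
print(ucell_style_score([1, 9999], max_rank=10))
```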

Module contents#

The submodule with signature scoring methods.