signaturescoring.utils package#

Submodules#

signaturescoring.utils.metrics module#

signaturescoring.utils.metrics.get_AUC_and_F1_performance(adata_obs: DataFrame, scoring_names: List[str], label_col: str = 'healthy', label_of_interest: str = 'unhealthy', old_df: DataFrame | None = None, sample_id: str | None = None, store_data_path: str | None = None, save: bool = False, log: str | None = None) DataFrame#

Compute the area under the ROC curve (AUC) and classification metrics such as F1 for signature scores, given ground-truth labels.

Parameters:
  • adata_obs – ‘.obs’ part of an AnnData object.

  • scoring_names – Names of signature scores in ‘.obs’ to be evaluated.

  • label_col – Name of column containing cell type labels, i.e., ground truth for performance measures.

  • label_of_interest – Name of label corresponding to the positive class, i.e., label expected to be associated with high scores.

  • old_df – Existing DataFrame that should be extended.

  • sample_id – Optional sample id name that can be added to a column.

  • store_data_path – Optional path to the location in which performance metrics can be stored.

  • save – Indicates whether performance metrics should be stored as ‘.csv’.

  • log – Name of logger.

Returns:

DataFrame with the evaluated performance metrics
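The two kinds of metric named above can be illustrated without the package. The following is a minimal sketch, not the library's implementation: `auroc` and `f1_at_threshold` are hypothetical helper names, and the decision threshold of 0.25 is an arbitrary assumption chosen for the example.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[str], positive: str) -> float:
    """Probability that a random positive cell outscores a random negative cell."""
    pos = [s for s, lab in zip(scores, labels) if lab == positive]
    neg = [s for s, lab in zip(scores, labels) if lab != positive]
    if not pos or not neg:
        raise ValueError("Both classes must be present.")
    # Count pairwise wins of positives over negatives; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_at_threshold(scores: Sequence[float], labels: Sequence[str],
                    positive: str, threshold: float) -> float:
    """F1 score when cells scoring above `threshold` are called positive."""
    predicted = [s > threshold for s in scores]
    actual = [lab == positive for lab in labels]
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Toy data: three 'unhealthy' cells (positive class) and two 'healthy' cells.
scores = [0.9, 0.2, 0.7, 0.1, 0.3]
labels = ["unhealthy", "unhealthy", "unhealthy", "healthy", "healthy"]
auc_value = auroc(scores, labels, positive="unhealthy")
f1_value = f1_at_threshold(scores, labels, positive="unhealthy", threshold=0.25)
```

The pairwise formulation is quadratic in the number of cells, so it is only suitable for small illustrations like this one.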

signaturescoring.utils.metrics.get_test_statistics(adata: AnnData, scoring_names: List[str], test_method: str = 'kstest', alternative: str = 'greater', label_col: str = 'Group', label_whsc: str = 'Group1', old_df: DataFrame | None = None, store_data_path: str | None = None, save: bool = False, log: str | None = None) DataFrame#

For each listed score, this function measures how well the score separates the two groups by applying a two-sample test with the selected method. The results can optionally be stored in a given folder.

Parameters:
  • adata – AnnData object containing the preprocessed (log-normalized) gene expression data.

  • scoring_names – Column names of signature scores in ‘.obs’ attribute.

  • test_method – Selected test to conduct on scores. Available tests: kstest, mannwhitneyu, auc, or auc-dist

  • alternative – Alternative of two-sample test.

  • label_col – Label column in ‘.obs’ of ‘adata’.

  • label_whsc – Label of positive class, i.e., label associated with high scores.

  • old_df – Existing DataFrame that should be extended.

  • store_data_path – Path to location to store performance files.

  • save – Indicates whether performance measurements should be stored.

  • log – Name of logger.

Returns:

A DataFrame containing the test results for all names in scoring_names. It contains the following columns: [‘Scoring method’, ‘Test method’, ‘Statistic’, ‘pvalue’].

Raises:

ValueError

  1. If label_col is not available in adata.obs.
  2. If label_whsc is not a category in label_col.
  3. If label_col does not contain exactly two categories.
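The label checks and the idea of a two-sample test can be sketched with the standard library alone. This is an illustration under assumptions, not the package's code: `validate_labels` and `ks_statistic` are hypothetical names, `obs` is a plain dictionary standing in for ‘.obs’, and the statistic shown is the two-sided Kolmogorov-Smirnov distance rather than the one-sided variant selected by alternative='greater'.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def validate_labels(obs: Dict[str, List], label_col: str, label_whsc: str) -> None:
    """Mirror the three documented ValueError conditions on a dict of columns."""
    if label_col not in obs:
        raise ValueError(f"{label_col!r} is not a column of obs.")
    categories = set(obs[label_col])
    if label_whsc not in categories:
        raise ValueError(f"{label_whsc!r} is not a category in {label_col!r}.")
    if len(categories) != 2:
        raise ValueError(f"{label_col!r} must contain exactly two categories.")


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided KS statistic: the largest gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    for x in sa + sb:
        cdf_a = bisect_right(sa, x) / len(sa)
        cdf_b = bisect_right(sb, x) / len(sb)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


obs = {
    "Group": ["Group1", "Group1", "Group1", "Group2", "Group2", "Group2"],
    "score": [0.8, 0.9, 1.0, 0.1, 0.2, 0.3],
}
validate_labels(obs, label_col="Group", label_whsc="Group1")
high = [s for s, g in zip(obs["score"], obs["Group"]) if g == "Group1"]
low = [s for s, g in zip(obs["score"], obs["Group"]) if g != "Group1"]
separation = ks_statistic(high, low)
```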

signaturescoring.utils.utils module#

signaturescoring.utils.utils.check_signature_genes(var_names: ~typing.List[str], gene_list: ~typing.List[str], return_type: ~typing.Any = <class 'list'>) List[str]#

The method checks the availability of signature genes in the list of available genes (var_names). Genes not present in var_names are excluded.

Parameters:
  • var_names – List of available genes in the dataset.

  • gene_list – List of genes to score for (signature).

  • return_type – Indicates whether to return a list or a set.

Returns:

Filtered gene list.

Raises:

ValueError – If return_type is not list or set, or if gene_list is empty after filtering.
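The documented behaviour can be summarised in a few lines. The sketch below is a stand-alone approximation written for this page; `filter_signature_genes` is a hypothetical name, and details such as preserving the input order are assumptions.

```python
from typing import List, Set, Union


def filter_signature_genes(var_names: List[str], gene_list: List[str],
                           return_type: type = list) -> Union[List[str], Set[str]]:
    """Keep only the signature genes that are present in var_names."""
    if return_type not in (list, set):
        raise ValueError("return_type must be list or set.")
    available = set(var_names)
    # Assumption: the order of gene_list is preserved when a list is returned.
    kept = [gene for gene in gene_list if gene in available]
    if not kept:
        raise ValueError("No signature gene is available in the dataset.")
    return return_type(kept)


kept_list = filter_signature_genes(["A", "B", "C"], ["B", "X", "C"])
kept_set = filter_signature_genes(["A", "B", "C"], ["B", "X", "C"], return_type=set)
```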

signaturescoring.utils.utils.checks_ctrl_size(ctrl_size: int, gene_pool_size: int, gene_list_size: int)#

Applies input checks on the control set size and verifies that valid control sets can be built with the desired size.

Parameters:
  • ctrl_size – The number of control genes selected for each gene in the gene_list.

  • gene_pool_size – The number of genes in the allowed control genes pool.

  • gene_list_size – The number of genes in the gene expression signature.

Returns:

None

Raises:

ValueError – If the checks on the control set size fail.
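The exact checks are not listed here, so the following is only a plausible sketch. It assumes two conditions: that ctrl_size is a positive integer, and that the pool still holds at least ctrl_size genes once the signature genes are removed (the condition documented for get_data_for_gene_pool). `check_ctrl_size_sketch` is a hypothetical name.

```python
def check_ctrl_size_sketch(ctrl_size: int, gene_pool_size: int, gene_list_size: int) -> None:
    """Raise ValueError if control sets of the requested size cannot be built."""
    # Assumption: the control set size must be a positive integer.
    if not isinstance(ctrl_size, int) or ctrl_size <= 0:
        raise ValueError("ctrl_size must be a positive integer.")
    # Condition taken from the get_data_for_gene_pool documentation.
    if gene_pool_size - gene_list_size < ctrl_size:
        raise ValueError("Not enough genes in the pool to build control sets.")


check_ctrl_size_sketch(ctrl_size=100, gene_pool_size=2000, gene_list_size=50)
```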

signaturescoring.utils.utils.commonPrefix(arr, low, high)#

A Divide and Conquer based function to find the longest common prefix. This is similar to the merge sort technique. The method was implemented by Rachit Belwariar. https://www.geeksforgeeks.org/longest-common-prefix-using-divide-and-conquer-algorithm/?ref=lbp

Parameters:
  • arr – Array of strings.

  • low – index of the lowest element in range

  • high – index of the highest element in range

Returns:

Common prefix of all strings in arr

signaturescoring.utils.utils.commonPrefixUtil(str1, str2)#

Utility Function to find the common prefix between strings str1 and str2. The method was implemented by Rachit Belwariar. https://www.geeksforgeeks.org/longest-common-prefix-using-divide-and-conquer-algorithm/?ref=lbp

Parameters:
  • str1 – First string

  • str2 – Second string

Returns:

Common prefix of the two strings
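The divide-and-conquer scheme behind commonPrefix and commonPrefixUtil can be sketched as follows. This is a fresh illustration of the technique rather than the packaged code; the snake_case names are hypothetical, and `high` is assumed to be an inclusive index.

```python
from typing import List


def common_prefix_pair(str1: str, str2: str) -> str:
    """Longest common prefix of two strings."""
    length = 0
    for a, b in zip(str1, str2):
        if a != b:
            break
        length += 1
    return str1[:length]


def common_prefix(arr: List[str], low: int, high: int) -> str:
    """Longest common prefix of arr[low..high], with both indices inclusive."""
    if low == high:
        return arr[low]
    mid = (low + high) // 2
    # Solve both halves independently, then merge the two partial prefixes.
    left = common_prefix(arr, low, mid)
    right = common_prefix(arr, mid + 1, high)
    return common_prefix_pair(left, right)


names = ["adjusted_neighborhood_scoring", "adjusted_seurat_scoring", "adjusted_scoring"]
prefix = common_prefix(names, 0, len(names) - 1)
```

As in merge sort, the range is halved until single strings remain, and the partial results are merged pairwise on the way back up.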

signaturescoring.utils.utils.get_bins_wrt_avg_gene_expression(gene_means: Any, n_bins: int, verbose: int = 0) Series#

Method to compute the average gene expression bins.

Parameters:
  • gene_means – Average gene expression vector.

  • n_bins – Number of desired bins.

  • verbose – Show print statements if larger than 0.

Returns:

Series containing gene to expression bin assignment.
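One common way to build such bins is to rank genes by average expression and split the ranking into groups of roughly equal size. The sketch below shows that idea with plain dictionaries; it is an assumption about the general approach, not a copy of the package's binning rule, and `expression_bins` is a hypothetical name.

```python
from typing import Dict


def expression_bins(gene_means: Dict[str, float], n_bins: int) -> Dict[str, int]:
    """Assign each gene to one of n_bins rank-based bins of near-equal size."""
    ranked = sorted(gene_means, key=gene_means.get)  # lowest expression first
    n_genes = len(ranked)
    # Integer arithmetic maps rank 0..n_genes-1 onto bin 0..n_bins-1.
    return {gene: (rank * n_bins) // n_genes for rank, gene in enumerate(ranked)}


means = {"g1": 0.1, "g2": 2.5, "g3": 0.4, "g4": 1.2, "g5": 3.0, "g6": 0.9}
bins = expression_bins(means, n_bins=3)
```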

signaturescoring.utils.utils.get_data_for_gene_pool(adata: AnnData, gene_pool: List[str], gene_list: List[str], ctrl_size: int | None = None, check_gene_list: bool = True) Tuple[AnnData, List[str]]#

Filters the dataset for the genes in gene_pool and the genes in gene_list.

Parameters:
  • adata – AnnData object containing the preprocessed (log-normalized) gene expression data.

  • gene_pool – List of genes from which the control genes can be selected.

  • gene_list – List of genes (signature) scoring methods want to score for.

  • check_gene_list – Indicates whether gene list should be checked.

Returns:

The adata subset (filtered if necessary) and the new gene_pool.

Raises:

ValueError

  1. If the variable gene_pool does not have the correct type.
  2. If no valid genes were passed as reference set.
  3. If there are not enough genes in gene_pool (len(gene_pool) - len(gene_list) < ctrl_size) to compute scoring control sets.

signaturescoring.utils.utils.get_gene_list_real_data(adata: AnnData, dge_method: str = 'wilcoxon', dge_key: str = 'wilcoxon', dge_pval_cutoff: float = 0.01, dge_log2fc_min: float = 0, nr_de_genes: int = 100, mode: str = 'most_diff_expressed', label_col: str = 'healthy', label_of_interest: str = 'unhealthy', random_state: int | None = 0, log: str | None = None, copy: bool = False, verbose: int = 0) List[str]#

This function returns the signature genes for a given dataset. It first gets all differentially expressed genes for a group of interest and then selects a signature based on the defined mode.

Parameters:
  • adata – AnnData object containing the preprocessed (log-normalized) gene expression data.

  • dge_method – Method for DGEX in Scanpy. Available methods: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’

  • dge_key – Name of key that is added in ‘.uns’

  • dge_pval_cutoff – Cutoff value of adjusted p-value, i.e., max adjusted p-value

  • dge_log2fc_min – Cutoff minimum value of log fold change.

  • nr_de_genes – Number of genes in signature.

  • mode – Select the most, the least, or randomly chosen differentially expressed genes. One of ‘most_diff_expressed’, ‘least_diff_expressed’, or ‘random’.

  • label_col – Name of column containing cell type labels for DGEX.

  • label_of_interest – Name of class we want to get the signature for.

  • random_state – Seed for random state.

  • log – Name of logger.

  • copy – Whether to do the DGEX computation inplace or on a copy of adata.

  • verbose – Show print statements if verbose larger than 0.

Returns:

Gene expression signature, i.e., list of genes.

Raises:

ValueError – If mode is not in [“most_diff_expressed”, “least_diff_expressed”, “random”] or label_col is not in adata.obs.
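The selection step controlled by mode can be illustrated separately from the differential expression analysis. The sketch below assumes that `ranked_genes` has already been filtered by the p-value and log fold change cutoffs and is sorted from most to least differentially expressed; `select_signature` is a hypothetical helper, not part of the package.

```python
import random
from typing import List, Optional


def select_signature(ranked_genes: List[str], nr_de_genes: int,
                     mode: str = "most_diff_expressed",
                     random_state: Optional[int] = 0) -> List[str]:
    """Pick nr_de_genes genes from a ranked list of differentially expressed genes."""
    n = min(nr_de_genes, len(ranked_genes))
    if mode == "most_diff_expressed":
        return ranked_genes[:n]
    if mode == "least_diff_expressed":
        return ranked_genes[len(ranked_genes) - n:]
    if mode == "random":
        # A seeded generator keeps the random signature reproducible.
        return random.Random(random_state).sample(ranked_genes, n)
    raise ValueError(f"Unknown mode: {mode!r}")


ranked = ["g1", "g2", "g3", "g4", "g5"]
top = select_signature(ranked, 2, mode="most_diff_expressed")
bottom = select_signature(ranked, 2, mode="least_diff_expressed")
drawn = select_signature(ranked, 2, mode="random", random_state=0)
```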

signaturescoring.utils.utils.get_least_variable_genes_per_bin_v1(adata: AnnData, cuts: Any, ctrl_size: int, method: str = 'seurat', gene_pool: List[str] | None = None) dict#

This method implements v1 of the least variable control genes selection for a given dataset. The method uses provided expression bins to select from each bin the genes with the smallest dispersion.

Parameters:
Returns:

A dictionary mapping to each expression bin (i.e., distinct values in cuts) a set of genes with the least variation.
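The per-bin selection can be sketched with plain dictionaries. In this illustration the dispersion values are supplied directly, whereas the package derives them from the data with the chosen method; `least_variable_per_bin` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Hashable, Set


def least_variable_per_bin(dispersion: Dict[str, float], cuts: Dict[str, Hashable],
                           ctrl_size: int) -> Dict[Hashable, Set[str]]:
    """For each expression bin, keep the ctrl_size genes with the smallest dispersion."""
    genes_by_bin = defaultdict(list)
    for gene, expression_bin in cuts.items():
        genes_by_bin[expression_bin].append(gene)
    return {
        expression_bin: set(sorted(genes, key=dispersion.get)[:ctrl_size])
        for expression_bin, genes in genes_by_bin.items()
    }


dispersion = {"a": 0.5, "b": 0.1, "c": 0.9, "d": 0.2, "e": 0.3}
cuts = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
controls = least_variable_per_bin(dispersion, cuts, ctrl_size=2)
```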

signaturescoring.utils.utils.get_least_variable_genes_per_bin_v2(adata: AnnData, cuts: Any, ctrl_size: int, method: str = 'seurat', gene_pool: List[str] | None = None, nr_norm_bins: int = 5) dict#

This method implements v2 of the least variable control genes selection for a given dataset. For each expression bin, the genes are binned a second time by their average expression, and normalized dispersion scores are computed within these sub-bins. The method then selects for each expression bin the genes with the smallest normalized dispersion.

Parameters:
  • adata – AnnData object containing the preprocessed (log-normalized) gene expression data.

  • cuts – Assignment of genes to expression bins.

  • ctrl_size – The number of control genes selected for each expression bin.

  • method – Indicates which method should be used to compute the least variable genes. We can use ‘seurat’ or ‘cell_ranger’. See reference https://scanpy.readthedocs.io/en/latest/generated/scanpy.pp.highly_variable_genes.html#scanpy-pp-highly-variable-genes

  • gene_pool – The pool of genes out of which control genes can be selected.

  • nr_norm_bins – The number of bins required for the highly variable genes computation for each expression bin.

Returns:

A dictionary mapping to each expression bin (i.e., distinct values in cuts) a set of genes with least variation.

Raises:

ValueError – If method is not in [“seurat”, “cell_ranger”]

signaturescoring.utils.utils.get_mean_and_variance_gene_expression(adata: AnnData, estim_var: bool = False, loess_span: float = 0.3, loess_degree: int = 2, show_plots: bool = False, store_path: str | None = None, store_data_prefix: str = '') DataFrame#

This function computes the average gene expression and the variance of genes for the passed data. Additionally, one can compute the estimated variance and standard deviation by regressing out the mean. The variance is estimated by fitting a loess curve on the log10 mean and log10 variance.

Parameters:
Returns:

DataFrame containing the average gene expression and the variance of each gene. If estim_var=True, it additionally contains the estimated variance and standard deviation of each gene and the loess r2-score (in data.attrs).

signaturescoring.utils.utils.nanmean(x: Any, axis: int, dtype=None) float#

Sparse equivalent of numpy.nanmean, using the sparse nanmean implementation of Scanpy (scverse/scanpy).

Parameters:
  • x – Data matrix.

  • axis – Axis along which to compute mean (0: column-wise, 1: row-wise).

  • dtype – Desired type of the mean vector.

Returns:

Mean vector of desired dimension
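The axis convention and the NaN handling can be shown on a small dense example. This sketch works on a list of rows and does not reproduce the sparse implementation; `dense_nanmean` is a hypothetical name.

```python
import math
from typing import List, Sequence


def dense_nanmean(rows: Sequence[Sequence[float]], axis: int) -> List[float]:
    """Mean that ignores NaN entries (axis 0: one mean per column, axis 1: per row)."""
    vectors = list(zip(*rows)) if axis == 0 else list(rows)
    means = []
    for vector in vectors:
        values = [v for v in vector if not math.isnan(v)]
        # A vector consisting only of NaN has no defined mean.
        means.append(sum(values) / len(values) if values else math.nan)
    return means


matrix = [[1.0, math.nan], [3.0, 4.0]]
column_means = dense_nanmean(matrix, axis=0)
row_means = dense_nanmean(matrix, axis=1)
```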

signaturescoring.utils.utils.nextnonexistent(f: str) str#

Method to get the next available filename if the original filename already exists.

Parameters:

f – Filename.

Returns:

New unique filename.
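A typical way to obtain such a filename is to append a running counter until no file with that name exists. The sketch below assumes a `_1`, `_2`, ... suffix placed before the extension; the naming scheme actually used by nextnonexistent may differ, and `next_free_filename` is a hypothetical name.

```python
import os
import tempfile


def next_free_filename(path: str) -> str:
    """Return path itself if unused, else the first free 'name_<i>.ext' variant."""
    root, ext = os.path.splitext(path)
    candidate, counter = path, 0
    while os.path.exists(candidate):
        counter += 1
        candidate = f"{root}_{counter}{ext}"  # assumed suffix scheme
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "metrics.csv")
    first = next_free_filename(target)   # nothing exists yet
    open(target, "w").close()            # now the original name is taken
    second = next_free_filename(target)
    names = (os.path.basename(first), os.path.basename(second))
```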

Module contents#

Submodule with utility functions for signature scoring methods.