Preprocessing

scripts.preprocessing.preprocessing.determine_seq_technology(adata: AnnData)

Determine the sequencing technology used from the AnnData object.

Parameters:

adata (anndata.AnnData) – Input data object

Returns:

Sequencing technology identifier

Return type:

str

Raises:
  • KeyError – If technology information is missing

  • ValueError – If multiple technologies are found
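The error behaviour described above can be sketched as follows. This is only an illustration, not the module's code: it works on a plain dict standing in for `adata.obs`, and both the helper name `determine_technology` and the column name `technology` are assumptions.

```python
def determine_technology(obs: dict) -> str:
    """Return the single technology label stored under obs["technology"].

    `obs` is a stand-in for adata.obs; the key name is hypothetical.
    """
    if "technology" not in obs:
        raise KeyError("technology information is missing")
    unique = set(obs["technology"])
    if len(unique) != 1:
        # Zero or several distinct labels: the technology is ambiguous
        raise ValueError(f"expected exactly one technology, found {sorted(unique)}")
    return unique.pop()


label = determine_technology({"technology": ["10x", "10x", "10x"]})
```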

scripts.preprocessing.preprocessing.get_args() → Namespace

Parse command line arguments for the preprocessing pipeline.

Returns:

Parsed command line arguments including:
  • input: Path to input h5ad file

  • output: Path for output h5ad file

  • excluded_sample: List of sample names to exclude

  • min_genes: Minimum number of genes required per cell

  • min_counts: Minimum number of counts required per cell

  • max_pct_mt: Maximum percentage of mitochondrial counts allowed per cell

Return type:

Namespace
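A parser that yields the attributes listed above might look like the sketch below. The flag spellings, types, and defaults are assumptions made for illustration; only the attribute names come from this page.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flag names mirror the documented attributes; types and defaults are guesses.
    parser = argparse.ArgumentParser(description="Preprocessing pipeline (sketch)")
    parser.add_argument("--input", required=True, help="Path to input h5ad file")
    parser.add_argument("--output", required=True, help="Path for output h5ad file")
    parser.add_argument("--excluded_sample", nargs="*", default=[],
                        help="Sample names to exclude")
    parser.add_argument("--min_genes", type=int, default=None,
                        help="Minimum number of genes required per cell")
    parser.add_argument("--min_counts", type=int, default=None,
                        help="Minimum number of counts required per cell")
    parser.add_argument("--max_pct_mt", type=float, default=None,
                        help="Maximum percentage of mitochondrial counts allowed")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere
args = build_parser().parse_args(
    ["--input", "in.h5ad", "--output", "out.h5ad",
     "--excluded_sample", "S1", "S2", "--min_genes", "200"]
)
```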

scripts.preprocessing.preprocessing.get_counts_from_tpm(tpm: csr_matrix, technology: str) → csr_matrix

Convert TPM (Transcripts Per Million) values to estimated count data.

Parameters:
  • tpm (csr_matrix) – Sparse matrix of TPM values

  • technology (str) – Sequencing technology used

Returns:

Estimated count data

Return type:

csr_matrix

Raises:

NotImplementedError – If the sequencing technology is not recognized

scripts.preprocessing.preprocessing.get_counts_per_cell(x: csr_matrix)

Calculate the total expression count for each cell.

Parameters:

x (csr_matrix) – Sparse matrix containing gene expression data

Returns:

Array containing total counts for each cell

Return type:

numpy.ndarray
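A per-cell total over a sparse matrix can be computed as in the sketch below. It assumes rows are cells and columns are genes (the AnnData convention); the helper name `counts_per_cell` is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix


def counts_per_cell(x: csr_matrix) -> np.ndarray:
    # x.sum(axis=1) returns an (n_cells, 1) matrix; flatten it to a 1-D array
    return np.asarray(x.sum(axis=1)).ravel()


x = csr_matrix(np.array([[1, 0, 2],
                         [0, 0, 0],
                         [3, 4, 0]]))
totals = counts_per_cell(x)  # one total per row: 3, 0 and 7
```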

scripts.preprocessing.preprocessing.get_tpm_counts(input, count_type: str, technology: str) → Tuple[csr_matrix, csr_matrix]

Process input data to get both TPM and count matrices.

Parameters:
  • input – Input count or TPM data

  • count_type (str) – Type of count data provided

  • technology (str) – Sequencing technology used

Returns:

TPM values and count data

Return type:

Tuple[csr_matrix, csr_matrix]

scripts.preprocessing.preprocessing.main()

scripts.preprocessing.preprocessing.make_datadir(output)

scripts.preprocessing.preprocessing.normalize(counts: csr_matrix, target_counts=100000.0) → csr_matrix

Normalize counts to a target sum per cell.

Parameters:
  • counts (csr_matrix) – Raw count data

  • target_counts (float) – Target sum for each cell after normalization

Returns:

Normalized count data

Return type:

csr_matrix
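Scaling each cell to a fixed total can be sketched as a row-wise rescaling of the sparse matrix. The helper name `normalize_rows` is hypothetical, and leaving all-zero cells untouched is an assumption rather than documented behaviour.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags


def normalize_rows(counts: csr_matrix, target_counts: float = 1e5) -> csr_matrix:
    totals = np.asarray(counts.sum(axis=1)).ravel()
    # Per-cell scale factor; empty cells get factor 0 and stay all-zero (assumption)
    scale = np.divide(target_counts, totals,
                      out=np.zeros_like(totals, dtype=float),
                      where=totals > 0)
    # Left-multiplying by a diagonal matrix rescales every row without densifying
    return (diags(scale) @ counts).tocsr()


counts = csr_matrix(np.array([[1.0, 3.0],
                              [0.0, 0.0],
                              [2.0, 2.0]]))
normalized = normalize_rows(counts, target_counts=100.0)
```

After the call, every non-empty row of `normalized` sums to `target_counts`.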

scripts.preprocessing.preprocessing.preprocessing(adata: AnnData, excluded_samples: List[str], min_genes: int, min_counts: int, max_pct_mt: float) → AnnData

Perform complete preprocessing pipeline on single-cell RNA sequencing data.

Parameters:
  • adata (anndata.AnnData) – Input data object

  • excluded_samples (List[str]) – List of sample names to exclude

  • min_genes (int) – Minimum number of genes required per cell

  • min_counts (int) – Minimum number of counts required per cell

  • max_pct_mt (float) – Maximum percentage of mitochondrial counts allowed

Returns:

Fully preprocessed data object

Return type:

anndata.AnnData

scripts.preprocessing.preprocessing.r_names(adata: AnnData) → AnnData

Convert variable and observation names to R-compatible format.

Parameters:

adata (anndata.AnnData) – Input data object

Returns:

Data object with R-compatible names

Return type:

anndata.AnnData

scripts.preprocessing.preprocessing.read_anndata(path: PathLike) → AnnData

scripts.preprocessing.preprocessing.remove_excluded_samples(adata: AnnData, excluded_samples: List[str]) → AnnData

Remove specified samples from the dataset.

Parameters:
  • adata (anndata.AnnData) – Input data object

  • excluded_samples (List[str]) – List of sample names to exclude

Returns:

Filtered data object

Return type:

anndata.AnnData

scripts.preprocessing.preprocessing.remove_high_mt_cells(adata: AnnData, max_pct_mt: float) → AnnData

Remove cells with high mitochondrial gene expression.

Parameters:
  • adata (anndata.AnnData) – Input data object

  • max_pct_mt (float) – Maximum percentage of mitochondrial counts allowed

Returns:

Filtered data object

Return type:

anndata.AnnData
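The mitochondrial filter can be illustrated on a dense array. In this sketch the helper name `keep_low_mt` is hypothetical, mitochondrial genes are assumed to be those whose names start with `MT-`, and the threshold is assumed to be inclusive; the module may differ on all three points.

```python
import numpy as np


def keep_low_mt(counts, gene_names, max_pct_mt: float) -> np.ndarray:
    """Return a boolean mask of cells whose mitochondrial share is acceptable."""
    counts = np.asarray(counts, dtype=float)
    # Assumption: mitochondrial genes are recognised by the "MT-" prefix
    is_mt = np.array([g.upper().startswith("MT-") for g in gene_names])
    totals = counts.sum(axis=1)
    mt_totals = counts[:, is_mt].sum(axis=1)
    # Percentage of each cell's counts that come from mitochondrial genes
    pct_mt = np.divide(100.0 * mt_totals, totals,
                       out=np.zeros_like(totals), where=totals > 0)
    return pct_mt <= max_pct_mt


mask = keep_low_mt([[90, 10], [50, 50]], ["ACTB", "MT-CO1"], max_pct_mt=20.0)
```

Here the first cell has 10% mitochondrial counts and is kept, while the second has 50% and is dropped.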

scripts.preprocessing.preprocessing.remove_low_count_cells(adata: AnnData, min_counts: int | None, min_genes: int | None) → AnnData

Filter cells based on minimum count and gene expression thresholds.

Parameters:
  • adata (anndata.AnnData) – Input data object

  • min_counts (Optional[int]) – Minimum counts required per cell

  • min_genes (Optional[int]) – Minimum genes required per cell

Returns:

Filtered data object

Return type:

anndata.AnnData
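Because both thresholds are optional, each filter is applied only when its value is given. The sketch below shows that logic on a dense array; the helper name `keep_cells` is hypothetical, and counting a gene as detected when its count is above zero is an assumption.

```python
import numpy as np


def keep_cells(counts, min_counts=None, min_genes=None) -> np.ndarray:
    """Return a boolean mask of cells passing the thresholds that were given."""
    counts = np.asarray(counts)
    keep = np.ones(counts.shape[0], dtype=bool)
    if min_counts is not None:
        keep &= counts.sum(axis=1) >= min_counts
    if min_genes is not None:
        # A gene counts as detected in a cell if it has at least one count
        keep &= (counts > 0).sum(axis=1) >= min_genes
    return keep


counts = [[5, 0, 0],
          [1, 1, 1],
          [0, 0, 0]]
mask = keep_cells(counts, min_counts=3, min_genes=2)
```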

scripts.preprocessing.preprocessing.remove_samples(adata: AnnData) → AnnData

Remove samples with fewer than 50 cells in G1 phase.

Parameters:

adata (anndata.AnnData) – Input data object

Returns:

Filtered data object

Return type:

anndata.AnnData
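The sample-level rule can be sketched with plain Python lists standing in for the per-cell sample and phase annotations. The helper name `samples_to_keep` and the phase label `"G1"` are assumptions; only the threshold of 50 comes from the description above.

```python
from collections import Counter


def samples_to_keep(samples, phases, min_g1_cells: int = 50) -> set:
    """Return the samples that have at least `min_g1_cells` cells in G1."""
    # Count G1 cells per sample; samples without G1 cells count as 0
    g1_counts = Counter(s for s, p in zip(samples, phases) if p == "G1")
    return {s for s in set(samples) if g1_counts[s] >= min_g1_cells}


# Sample A has 55 G1 cells, sample B only 10
samples = ["A"] * 60 + ["B"] * 60
phases = ["G1"] * 55 + ["S"] * 5 + ["G1"] * 10 + ["G2M"] * 50
kept = samples_to_keep(samples, phases)
```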

scripts.preprocessing.preprocessing.score_cell_cycle(adata: AnnData) → AnnData

scripts.preprocessing.preprocessing.set_tpm_counts(adata: AnnData, technology: str) → AnnData

Set TPM and count data in AnnData object and perform log transformation.

Parameters:
  • adata (anndata.AnnData) – Input data object

  • technology (str) – Sequencing technology used

Returns:

Processed data object with TPM and counts

Return type:

anndata.AnnData

scripts.preprocessing.preprocessing.subset_malignant(adata: AnnData) → AnnData

Subset data to include only malignant cells.

Parameters:

adata (anndata.AnnData) – Input data object

Returns:

Data object containing only malignant cells

Return type:

anndata.AnnData

Raises:

ValueError – If malignant cells cannot be identified

scripts.preprocessing.preprocessing.write_anndata(adata: AnnData, output_path: PathLike) → None

Subsampling

scripts.preprocessing.subsample.get_args() → Namespace

Parse command line arguments for the sample subsetting tool.

Returns:

Parsed command line arguments including:
  • input (str): Path to input h5ad file

  • output (str): Path for output h5ad file

  • n_samples (str): Number of samples to keep or ‘all’

  • random_seed (int): Seed for random number generator

Return type:

Namespace

scripts.preprocessing.subsample.main() → None

Main function to execute the sample subsetting workflow.

The function performs the following steps:

  1. Loads an AnnData object from the specified input file

  2. If n_samples is ‘all’, saves the complete dataset unchanged

  3. Otherwise, randomly selects the specified number of samples

  4. Saves the subsetted data to the specified output file

Raises:

AssertionError – If the requested number of samples exceeds the number available, or if the subsetting operation does not produce the expected result
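The selection step of this workflow can be sketched with the standard library alone. The helper name `choose_samples` is hypothetical, file loading and saving are left out, and the real script may draw its random selection differently.

```python
import random


def choose_samples(sample_ids, n_samples: str, random_seed: int) -> list:
    """Pick sample names to keep; n_samples is a number as text, or 'all'."""
    available = sorted(set(sample_ids))
    if n_samples == "all":
        return available
    n = int(n_samples)
    assert n <= len(available), "requested more samples than are available"
    # A seeded generator makes the selection reproducible
    rng = random.Random(random_seed)
    chosen = sorted(rng.sample(available, n))
    assert len(chosen) == n, "subsetting did not produce the expected result"
    return chosen


cell_samples = ["s1", "s2", "s3", "s4", "s1"]  # one entry per cell
subset = choose_samples(cell_samples, "2", random_seed=42)
```

Cells would then be kept if their sample name is in `subset`.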