Preprocessing#

scripts.preprocessing.preprocessing.determine_seq_technology(adata: AnnData)[source]#

Determine the sequencing technology used from the AnnData object.

Parameters:

adata (anndata.AnnData) – Input data object

Returns:

Sequencing technology identifier

Return type:

str

Raises:

KeyError – If technology information is missing
ValueError – If multiple technologies are found
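
The check described above can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: it assumes the technology label is stored in a per-cell metadata column, uses a plain pandas DataFrame as a stand-in for `adata.obs`, and the column name `technology` is hypothetical.

```python
import pandas as pd


def determine_seq_technology_sketch(obs: pd.DataFrame, column: str = "technology") -> str:
    """Return the single technology label found in obs[column].

    `column` is a hypothetical name; the real pipeline may store the label elsewhere.
    """
    if column not in obs.columns:
        raise KeyError(f"Column '{column}' is missing from the cell metadata")

    # Collect the distinct, non-missing labels across all cells.
    technologies = obs[column].dropna().unique()
    if len(technologies) == 0:
        raise KeyError(f"Column '{column}' contains no technology information")
    if len(technologies) > 1:
        raise ValueError(f"Multiple technologies found: {sorted(map(str, technologies))}")
    return str(technologies[0])
```

With an AnnData object, the same function could be called as `determine_seq_technology_sketch(adata.obs)`.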

scripts.preprocessing.preprocessing.get_args() → Namespace[source]#

Parse command line arguments for the preprocessing pipeline.

Returns:

Parsed command line arguments including:

input: Path to input h5ad file
output: Path for output h5ad file
excluded_sample: List of sample names to exclude
min_genes: Minimum number of genes required per cell
min_counts: Minimum number of counts required per cell
max_pct_mt: Maximum percentage of mitochondrial counts allowed

Return type:

Namespace
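
A parser producing the attributes listed above might look like the sketch below. The attribute names come from the list; the flag spellings, defaults, and the `required` settings are assumptions.

```python
import argparse


def get_args_sketch(argv=None) -> argparse.Namespace:
    """Parse preprocessing arguments; flag names and defaults are assumed."""
    parser = argparse.ArgumentParser(description="Preprocess an h5ad file")
    parser.add_argument("--input", required=True, help="Path to input h5ad file")
    parser.add_argument("--output", required=True, help="Path for output h5ad file")
    # Zero or more sample names, collected into a list.
    parser.add_argument("--excluded_sample", nargs="*", default=[],
                        help="Sample names to exclude")
    parser.add_argument("--min_genes", type=int, default=None,
                        help="Minimum number of genes required per cell")
    parser.add_argument("--min_counts", type=int, default=None,
                        help="Minimum number of counts required per cell")
    parser.add_argument("--max_pct_mt", type=float, default=None,
                        help="Maximum percentage of mitochondrial counts allowed")
    # argv=None makes argparse read sys.argv[1:], as in a normal CLI run.
    return parser.parse_args(argv)
```

Passing an explicit list, for example `get_args_sketch(["--input", "in.h5ad", "--output", "out.h5ad"])`, makes the parser easy to exercise without a shell.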

scripts.preprocessing.preprocessing.get_counts_from_tpm(tpm: csr_matrix, technology: str) → csr_matrix[source]#

Convert TPM (Transcripts Per Million) values to estimated count data.

Parameters:

tpm (csr_matrix) – Sparse matrix of TPM values
technology (str) – Sequencing technology used

Returns:

Estimated count data

Return type:

csr_matrix

Raises:

NotImplementedError – If the sequencing technology is not recognized

scripts.preprocessing.preprocessing.get_counts_per_cell(x: csr_matrix)[source]#

Calculate the total count of gene expression for each cell.

Parameters:

x (csr_matrix) – Sparse matrix containing gene expression data

Returns:

Array containing total counts for each cell

Return type:

numpy.ndarray
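
As a minimal sketch of this calculation, assuming the usual AnnData layout in which rows are cells and columns are genes:

```python
import numpy as np
from scipy.sparse import csr_matrix


def counts_per_cell_sketch(x: csr_matrix) -> np.ndarray:
    """Sum expression over genes (columns) for every cell (row)."""
    # Summing a sparse matrix along an axis yields a 2-D np.matrix,
    # so convert it and flatten to a 1-D array of length n_cells.
    return np.asarray(x.sum(axis=1)).ravel()
```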

scripts.preprocessing.preprocessing.get_tpm_counts(input, count_type: str, technology: str) → Tuple[csr_matrix, csr_matrix][source]#

Process input data to get both TPM and count matrices.

Parameters:

input – Input count or TPM data
count_type (str) – Type of count data provided
technology (str) – Sequencing technology used

Returns:

TPM values and count data

Return type:

Tuple[csr_matrix, csr_matrix]

scripts.preprocessing.preprocessing.main()[source]#

scripts.preprocessing.preprocessing.make_datadir(output)[source]#

scripts.preprocessing.preprocessing.normalize(counts: csr_matrix, target_counts=100000.0) → csr_matrix[source]#

Normalize counts to a target sum per cell.

Parameters:

counts (csr_matrix) – Raw count data
target_counts (float) – Target sum for each cell after normalization

Returns:

Normalized count data

Return type:

csr_matrix
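
One way to scale each cell to a fixed total while keeping the matrix sparse is to left-multiply by a diagonal matrix of per-cell factors. The sketch below assumes rows are cells; leaving all-zero cells untouched is an assumption about edge-case handling, not documented behavior.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags


def normalize_sketch(counts: csr_matrix, target_counts: float = 100000.0) -> csr_matrix:
    """Scale every row so that it sums to target_counts."""
    totals = np.asarray(counts.sum(axis=1)).ravel().astype(float)
    # Per-cell scaling factor; cells with zero total get a factor of 0
    # so that no division by zero occurs.
    scale = np.divide(target_counts, totals,
                      out=np.zeros_like(totals), where=totals > 0)
    # Multiplying by a diagonal matrix rescales rows without densifying.
    return csr_matrix(diags(scale) @ counts)
```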

scripts.preprocessing.preprocessing.preprocessing(adata: AnnData, excluded_samples: List[str], min_genes: int, min_counts: int, max_pct_mt: float) → AnnData[source]#

Perform complete preprocessing pipeline on single-cell RNA sequencing data.

Parameters:

adata (anndata.AnnData) – Input data object
excluded_samples (List[str]) – List of sample names to exclude
min_genes (int) – Minimum number of genes required per cell
min_counts (int) – Minimum number of counts required per cell
max_pct_mt (float) – Maximum percentage of mitochondrial counts allowed

Returns:

Fully preprocessed data object

Return type:

anndata.AnnData

scripts.preprocessing.preprocessing.r_names(adata: AnnData) → AnnData[source]#

Convert variable and observation names to R-compatible format.

Parameters:

adata (anndata.AnnData) – Input data object

Returns:

Data object with R-compatible names

Return type:

anndata.AnnData
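
What "R-compatible" means here is not specified. One plausible reading is the convention of R's `make.names`, which replaces invalid characters with `.` and prepends `X` when a name does not start with a letter (or with a dot not followed by a digit). The sketch below implements a simplified, ASCII-only version of that convention on a single string; it ignores R reserved words and may differ from what `r_names` actually does.

```python
import re


def to_r_name_sketch(name: str) -> str:
    """Approximate R's make.names for one name (ASCII only, no reserved words)."""
    # Anything other than letters, digits, dots, and underscores becomes a dot.
    cleaned = re.sub(r"[^A-Za-z0-9._]", ".", name)
    # A valid R name starts with a letter, or with a dot not followed by a digit.
    if not re.match(r"^([A-Za-z]|\.(?![0-9]))", cleaned):
        cleaned = "X" + cleaned
    return cleaned
```

Applied to an AnnData object, this could be mapped over `adata.var_names` and `adata.obs_names`.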

scripts.preprocessing.preprocessing.read_anndata(path: PathLike) → AnnData[source]#

scripts.preprocessing.preprocessing.remove_excluded_samples(adata: AnnData, excluded_samples: List[str]) → AnnData[source]#

Remove specified samples from the dataset.

Parameters:

adata (anndata.AnnData) – Input data object
excluded_samples (List[str]) – List of sample names to exclude

Returns:

Filtered data object

Return type:

anndata.AnnData

scripts.preprocessing.preprocessing.remove_high_mt_cells(adata: AnnData, max_pct_mt: float) → AnnData[source]#

Remove cells with high mitochondrial gene expression.

Parameters:

adata (anndata.AnnData) – Input data object
max_pct_mt (float) – Maximum percentage of mitochondrial counts allowed

Returns:

Filtered data object

Return type:

anndata.AnnData
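
The filter can be illustrated on a bare count matrix. This sketch assumes mitochondrial genes are identified by an `MT-` symbol prefix (the human gene naming convention) and that the threshold is inclusive; neither detail is confirmed by the documentation above.

```python
import numpy as np
from scipy.sparse import csr_matrix


def high_mt_mask_sketch(counts: csr_matrix, gene_names, max_pct_mt: float,
                        mt_prefix: str = "MT-") -> np.ndarray:
    """Return a boolean mask of cells whose mitochondrial share is acceptable."""
    # Flag mitochondrial genes by symbol prefix (an assumption).
    is_mt = np.array([g.upper().startswith(mt_prefix) for g in gene_names])
    total = np.asarray(counts.sum(axis=1)).ravel().astype(float)
    mt_total = np.asarray(counts[:, is_mt].sum(axis=1)).ravel()
    # Percentage of each cell's counts that come from mitochondrial genes;
    # empty cells are assigned 0 to avoid division by zero.
    pct_mt = np.divide(100.0 * mt_total, total,
                       out=np.zeros_like(total), where=total > 0)
    return pct_mt <= max_pct_mt
```

On an AnnData object the mask would be used as `adata[mask].copy()`.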

scripts.preprocessing.preprocessing.remove_low_count_cells(adata: AnnData, min_counts: int | None, min_genes: int | None) → AnnData[source]#

Filter cells based on minimum count and gene expression thresholds.

Parameters:

adata (anndata.AnnData) – Input data object
min_counts (Optional[int]) – Minimum counts required per cell
min_genes (Optional[int]) – Minimum genes required per cell

Returns:

Filtered data object

Return type:

anndata.AnnData
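
Because both thresholds are optional, each one is applied only when given. A minimal sketch on a raw count matrix, assuming rows are cells and that a gene counts as detected when its count is greater than zero:

```python
import numpy as np
from scipy.sparse import csr_matrix


def low_count_mask_sketch(counts: csr_matrix, min_counts=None, min_genes=None) -> np.ndarray:
    """Return a boolean mask of cells passing the supplied thresholds."""
    keep = np.ones(counts.shape[0], dtype=bool)
    if min_counts is not None:
        total_counts = np.asarray(counts.sum(axis=1)).ravel()
        keep &= total_counts >= min_counts
    if min_genes is not None:
        # Number of genes with a non-zero count in each cell.
        detected = (counts > 0).astype(np.int32)
        genes_per_cell = np.asarray(detected.sum(axis=1)).ravel()
        keep &= genes_per_cell >= min_genes
    return keep
```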

scripts.preprocessing.preprocessing.remove_samples(adata: AnnData) → AnnData[source]#

Remove samples with fewer than 50 cells in G1 phase.

Parameters:

adata (anndata.AnnData) – Input data object

Returns:

Filtered data object

Return type:

anndata.AnnData
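
The rule can be sketched on a pandas DataFrame standing in for `adata.obs`. The column names `sample` and `phase` are assumptions (scanpy's cell cycle scoring writes a `phase` column, but the sample column used by this pipeline is not documented here).

```python
import pandas as pd


def samples_to_keep_sketch(obs: pd.DataFrame, sample_col: str = "sample",
                           phase_col: str = "phase", min_g1_cells: int = 50) -> list:
    """Return the samples that have at least min_g1_cells cells in G1 phase."""
    # Count G1 cells per sample; samples with no G1 cells get a count of 0.
    g1_per_sample = (obs[phase_col] == "G1").groupby(obs[sample_col]).sum()
    return g1_per_sample[g1_per_sample >= min_g1_cells].index.tolist()
```

Cells would then be retained with a mask such as `obs["sample"].isin(samples_to_keep_sketch(obs))`.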

scripts.preprocessing.preprocessing.score_cell_cycle(adata: AnnData) → AnnData[source]#

scripts.preprocessing.preprocessing.set_tpm_counts(adata: AnnData, technology: str) → AnnData[source]#

Set TPM and count data in AnnData object and perform log transformation.

Parameters:

adata (anndata.AnnData) – Input data object
technology (str) – Sequencing technology used

Returns:

Processed data object with TPM and counts

Return type:

anndata.AnnData

scripts.preprocessing.preprocessing.subset_malignant(adata: AnnData) → AnnData[source]#

Subset data to include only malignant cells.

Parameters:

adata (anndata.AnnData) – Input data object

Returns:

Data object containing only malignant cells

Return type:

anndata.AnnData

Raises:

ValueError – If malignant cells cannot be identified

scripts.preprocessing.preprocessing.write_anndata(adata: AnnData, output_path: PathLike) → None[source]#

Subsampling#

scripts.preprocessing.subsample.get_args() → Namespace[source]#

Parse command line arguments for the sample subsetting tool.

Returns:

Parsed command line arguments including:

input (str): Path to input h5ad file
output (str): Path for output h5ad file
n_samples (str): Number of samples to keep or ‘all’
random_seed (int): Seed for random number generator

Return type:

Namespace

scripts.preprocessing.subsample.main() → None[source]#

Main function to execute the sample subsetting workflow.

The function performs the following steps:

1. Loads an AnnData object from the specified input file
2. If n_samples is ‘all’, saves the complete dataset unchanged
3. Otherwise, randomly selects the specified number of samples
4. Saves the subsetted data to the specified output file

Raises:

AssertionError – If the requested number of samples exceeds the available samples, or if the subsetting operation fails to produce the expected results
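
The selection logic in this workflow (everything except file loading and saving) can be sketched as a pure function. The function name and the use of Python's `random` module are assumptions; the actual script may draw samples differently.

```python
import random


def select_samples_sketch(sample_names, n_samples: str, random_seed: int) -> list:
    """Pick which samples to keep; n_samples is a number as a string, or 'all'."""
    unique_samples = sorted(set(sample_names))
    if n_samples == "all":
        return unique_samples

    n = int(n_samples)
    assert n <= len(unique_samples), (
        f"Requested {n} samples but only {len(unique_samples)} are available"
    )
    # A dedicated generator keeps the draw reproducible for a given seed.
    rng = random.Random(random_seed)
    selected = rng.sample(unique_samples, n)
    assert len(set(selected)) == n, "Subsetting did not produce the expected number of samples"
    return sorted(selected)
```

Sorting the unique names before sampling makes the result depend only on the seed and the set of sample names, not on the order of cells in the input.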
