Preprocessing
- scripts.preprocessing.preprocessing.determine_seq_technology(adata: AnnData)
Determine the sequencing technology used from the AnnData object.
- Parameters:
adata (anndata.AnnData) – Input data object
- Returns:
Sequencing technology identifier
- Return type:
str
- Raises:
KeyError – If technology information is missing
ValueError – If multiple technologies are found
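The lookup described above can be sketched as follows. This is a minimal illustration, not the script's actual code: it operates on a pandas DataFrame standing in for `adata.obs`, and the column name `technology` is an assumption.

```python
import pandas as pd


def determine_technology_sketch(obs: pd.DataFrame, column: str = "technology") -> str:
    """Return the single technology label found in `obs[column]`.

    `column` is a hypothetical name; the real script may read a different field.
    """
    if column not in obs.columns:
        raise KeyError(f"'{column}' not found in obs")
    technologies = obs[column].dropna().unique()
    if len(technologies) == 0:
        raise KeyError(f"'{column}' contains no technology information")
    if len(technologies) > 1:
        raise ValueError(f"Multiple technologies found: {sorted(technologies)}")
    return str(technologies[0])


obs = pd.DataFrame({"technology": ["10x", "10x", "10x"]})
tech = determine_technology_sketch(obs)
```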
- scripts.preprocessing.preprocessing.get_args() -> Namespace
Parse command line arguments for the preprocessing pipeline.
- Returns:
- Parsed command line arguments including:
input: Path to input h5ad file
output: Path for output h5ad file
excluded_sample: List of sample names to exclude
min_genes: Minimum number of genes required per cell
min_counts: Minimum number of counts required per cell
max_pct_mt: Maximum percentage of mitochondrial counts allowed
- Return type:
Namespace
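A parser producing the fields listed above might look like the sketch below. The flag spellings, defaults, and `required` settings are assumptions made for illustration; only the attribute names come from the documentation.

```python
import argparse


def get_args_sketch(argv=None) -> argparse.Namespace:
    """Build a parser with the documented fields (flag names are assumed)."""
    parser = argparse.ArgumentParser(description="Preprocess an h5ad file.")
    parser.add_argument("--input", required=True, help="Path to input h5ad file")
    parser.add_argument("--output", required=True, help="Path for output h5ad file")
    parser.add_argument("--excluded_sample", nargs="*", default=[],
                        help="Sample names to exclude")
    parser.add_argument("--min_genes", type=int, default=None,
                        help="Minimum number of genes required per cell")
    parser.add_argument("--min_counts", type=int, default=None,
                        help="Minimum number of counts required per cell")
    parser.add_argument("--max_pct_mt", type=float, default=None,
                        help="Maximum percentage of mitochondrial counts allowed")
    # Passing argv explicitly keeps the function testable without a real command line.
    return parser.parse_args(argv)


args = get_args_sketch([
    "--input", "in.h5ad",
    "--output", "out.h5ad",
    "--excluded_sample", "S1", "S2",
    "--min_genes", "200",
    "--max_pct_mt", "20",
])
```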
- scripts.preprocessing.preprocessing.get_counts_from_tpm(tpm: csr_matrix, technology: str) -> csr_matrix
Convert TPM (Transcripts Per Million) values to estimated count data.
- Parameters:
tpm (csr_matrix) – Sparse matrix of TPM values
technology (str) – Sequencing technology used
- Returns:
Estimated count data
- Return type:
csr_matrix
- Raises:
NotImplementedError – If the sequencing technology is not recognized
- scripts.preprocessing.preprocessing.get_counts_per_cell(x: csr_matrix)
Calculate the total expression count for each cell.
- Parameters:
x (csr_matrix) – Sparse matrix containing gene expression data
- Returns:
Array containing total counts for each cell
- Return type:
numpy.ndarray
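A minimal sketch of a per-cell total, assuming cells are stored as rows (the usual AnnData layout); the function name is illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix


def counts_per_cell_sketch(x: csr_matrix) -> np.ndarray:
    """Sum each row of a cells-by-genes sparse matrix."""
    # Summing a sparse matrix along axis=1 gives an (n_cells, 1) result;
    # flatten it into a plain 1-D array.
    return np.asarray(x.sum(axis=1)).ravel()


x = csr_matrix(np.array([[1, 0, 2],
                         [0, 0, 0],
                         [3, 4, 0]]))
totals = counts_per_cell_sketch(x)
```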
- scripts.preprocessing.preprocessing.get_tpm_counts(input, count_type: str, technology: str) -> Tuple[csr_matrix, csr_matrix]
Process input data to get both TPM and count matrices.
- Parameters:
input – Input count or TPM data
count_type (str) – Type of count data provided
technology (str) – Sequencing technology used
- Returns:
TPM values and count data
- Return type:
Tuple[csr_matrix, csr_matrix]
- scripts.preprocessing.preprocessing.normalize(counts: csr_matrix, target_counts=100000.0) -> csr_matrix
Normalize counts to a target sum per cell.
- Parameters:
counts (csr_matrix) – Raw count data
target_counts (float) – Target sum for each cell after normalization
- Returns:
Normalized count data
- Return type:
csr_matrix
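Scaling each cell to a fixed total can be sketched as below. This assumes cells are rows and that cells with zero counts are left as all zeros; how the actual function treats empty cells is not documented.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags


def normalize_sketch(counts: csr_matrix, target_counts: float = 1e5) -> csr_matrix:
    """Scale every row so that it sums to `target_counts`."""
    totals = np.asarray(counts.sum(axis=1)).ravel().astype(float)
    scale = np.zeros_like(totals)
    nonzero = totals > 0
    # Assumption: rows with no counts keep a scale factor of 0 to avoid division by zero.
    scale[nonzero] = target_counts / totals[nonzero]
    # Left-multiplying by a diagonal matrix rescales rows without densifying.
    return csr_matrix(diags(scale) @ counts)


counts = csr_matrix(np.array([[1.0, 3.0],
                              [0.0, 0.0],
                              [2.0, 2.0]]))
normalized = normalize_sketch(counts)
row_sums = np.asarray(normalized.sum(axis=1)).ravel()
```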
- scripts.preprocessing.preprocessing.preprocessing(adata: AnnData, excluded_samples: List[str], min_genes: int, min_counts: int, max_pct_mt: float) -> AnnData
Perform complete preprocessing pipeline on single-cell RNA sequencing data.
- Parameters:
adata (anndata.AnnData) – Input data object
excluded_samples (List[str]) – List of sample names to exclude
min_genes (int) – Minimum number of genes required per cell
min_counts (int) – Minimum number of counts required per cell
max_pct_mt (float) – Maximum percentage of mitochondrial counts allowed
- Returns:
Fully preprocessed data object
- Return type:
anndata.AnnData
- scripts.preprocessing.preprocessing.r_names(adata: AnnData) -> AnnData
Convert variable and observation names to R-compatible format.
- Parameters:
adata (anndata.AnnData) – Input data object
- Returns:
Data object with R-compatible names
- Return type:
anndata.AnnData
- scripts.preprocessing.preprocessing.remove_excluded_samples(adata: AnnData, excluded_samples: List[str]) -> AnnData
Remove specified samples from the dataset.
- Parameters:
adata (anndata.AnnData) – Input data object
excluded_samples (List[str]) – List of sample names to exclude
- Returns:
Filtered data object
- Return type:
anndata.AnnData
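The filtering step can be illustrated on a pandas DataFrame standing in for `adata.obs`. The column name `sample` is an assumption; the real script may store sample labels elsewhere.

```python
import pandas as pd


def remove_excluded_samples_sketch(obs: pd.DataFrame, excluded_samples,
                                   column: str = "sample") -> pd.DataFrame:
    """Drop all rows whose sample label is in `excluded_samples`."""
    keep = ~obs[column].isin(excluded_samples)
    return obs.loc[keep].copy()


obs = pd.DataFrame({"sample": ["S1", "S2", "S3", "S1"]},
                   index=["c1", "c2", "c3", "c4"])
filtered = remove_excluded_samples_sketch(obs, ["S1"])
```

With an AnnData object, the same boolean mask could be applied to the object itself, for example `adata[keep.to_numpy()].copy()`.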
- scripts.preprocessing.preprocessing.remove_high_mt_cells(adata: AnnData, max_pct_mt: float) -> AnnData
Remove cells with high mitochondrial gene expression.
- Parameters:
adata (anndata.AnnData) – Input data object
max_pct_mt (float) – Maximum percentage of mitochondrial counts allowed
- Returns:
Filtered data object
- Return type:
anndata.AnnData
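One way to compute such a filter is sketched below. It assumes mitochondrial genes are identified by an `MT-` prefix (human gene symbols) and that a cell is kept when its percentage is less than or equal to the threshold; neither detail is confirmed by the documentation.

```python
import numpy as np
from scipy.sparse import csr_matrix


def high_mt_mask_sketch(counts: csr_matrix, gene_names, max_pct_mt: float) -> np.ndarray:
    """Return a boolean mask of cells to keep (True = at or below the threshold)."""
    # Assumption: mitochondrial genes carry the "MT-" prefix.
    is_mt = np.array([name.upper().startswith("MT-") for name in gene_names])
    total = np.asarray(counts.sum(axis=1)).ravel().astype(float)
    mt = np.asarray(counts[:, is_mt].sum(axis=1)).ravel().astype(float)
    # Cells with zero total counts get 0 percent instead of a division error.
    pct_mt = np.divide(100.0 * mt, total, out=np.zeros_like(total), where=total > 0)
    return pct_mt <= max_pct_mt


genes = ["MT-CO1", "ACTB", "GAPDH"]
counts = csr_matrix(np.array([[5, 5, 0],
                              [1, 9, 10],
                              [0, 0, 0]]))
keep = high_mt_mask_sketch(counts, genes, max_pct_mt=20.0)
```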
- scripts.preprocessing.preprocessing.remove_low_count_cells(adata: AnnData, min_counts: int | None, min_genes: int | None) -> AnnData
Filter cells based on minimum count and gene expression thresholds.
- Parameters:
adata (anndata.AnnData) – Input data object
min_counts (Optional[int]) – Minimum counts required per cell
min_genes (Optional[int]) – Minimum genes required per cell
- Returns:
Filtered data object
- Return type:
anndata.AnnData
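Because both thresholds are optional, a filter of this kind has to skip whichever one is `None`. The sketch below shows that logic on a sparse cells-by-genes matrix; treating "genes per cell" as the number of genes with a nonzero count is an assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix


def low_count_mask_sketch(counts: csr_matrix, min_counts=None, min_genes=None) -> np.ndarray:
    """Return a boolean mask of cells that pass every threshold that is set."""
    keep = np.ones(counts.shape[0], dtype=bool)
    if min_counts is not None:
        keep &= np.asarray(counts.sum(axis=1)).ravel() >= min_counts
    if min_genes is not None:
        # Number of genes with a strictly positive count in each cell.
        genes_detected = np.asarray((counts > 0).sum(axis=1)).ravel()
        keep &= genes_detected >= min_genes
    return keep


counts = csr_matrix(np.array([[1, 0, 0],
                              [2, 3, 0],
                              [5, 5, 5]]))
keep_both = low_count_mask_sketch(counts, min_counts=5, min_genes=2)
keep_none = low_count_mask_sketch(counts)
```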
- scripts.preprocessing.preprocessing.remove_samples(adata: AnnData) -> AnnData
Remove samples with fewer than 50 cells in G1 phase.
- Parameters:
adata (anndata.AnnData) – Input data object
- Returns:
Filtered data object
- Return type:
anndata.AnnData
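Read as "drop every sample that has fewer than 50 G1-phase cells", the filter could be sketched as follows on a DataFrame standing in for `adata.obs`. The column names `sample` and `phase` and the label `G1` are assumptions, as is this reading of the description.

```python
import pandas as pd


def remove_small_samples_sketch(obs: pd.DataFrame, min_g1_cells: int = 50,
                                sample_col: str = "sample",
                                phase_col: str = "phase") -> pd.DataFrame:
    """Keep only samples with at least `min_g1_cells` cells labeled G1."""
    g1_counts = obs.loc[obs[phase_col] == "G1", sample_col].value_counts()
    kept_samples = g1_counts[g1_counts >= min_g1_cells].index
    return obs.loc[obs[sample_col].isin(kept_samples)].copy()


obs = pd.DataFrame({
    "sample": ["A"] * 65 + ["B"] * 110,
    "phase": ["G1"] * 60 + ["S"] * 5 + ["G1"] * 10 + ["S"] * 100,
})
filtered = remove_small_samples_sketch(obs)
```

Sample B is dropped in this toy input even though it has more cells overall, because only 10 of them are in G1.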
- scripts.preprocessing.preprocessing.set_tpm_counts(adata: AnnData, technology: str) -> AnnData
Set TPM and count data in AnnData object and perform log transformation.
- Parameters:
adata (anndata.AnnData) – Input data object
technology (str) – Sequencing technology used
- Returns:
Processed data object with TPM and counts
- Return type:
anndata.AnnData
- scripts.preprocessing.preprocessing.subset_malignant(adata: AnnData) -> AnnData
Subset data to include only malignant cells.
- Parameters:
adata (anndata.AnnData) – Input data object
- Returns:
Data object containing only malignant cells
- Return type:
anndata.AnnData
- Raises:
ValueError – If malignant cells cannot be identified
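How malignant cells are encoded is not documented. The sketch below assumes a boolean column named `malignant` in a DataFrame standing in for `adata.obs`, and raises `ValueError` when that column is missing or not boolean.

```python
import pandas as pd


def subset_malignant_sketch(obs: pd.DataFrame, column: str = "malignant") -> pd.DataFrame:
    """Keep only rows flagged as malignant (column name and dtype are assumed)."""
    if column not in obs.columns:
        raise ValueError(f"Cannot identify malignant cells: '{column}' is missing")
    if not pd.api.types.is_bool_dtype(obs[column]):
        raise ValueError(f"Cannot identify malignant cells: '{column}' is not boolean")
    return obs.loc[obs[column]].copy()


obs = pd.DataFrame({"malignant": [True, False, True]}, index=["c1", "c2", "c3"])
malignant = subset_malignant_sketch(obs)
```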
Subsampling
- scripts.preprocessing.subsample.get_args() -> Namespace
Parse command line arguments for the sample subsetting tool.
- Returns:
- Parsed command line arguments including:
input (str): Path to input h5ad file
output (str): Path for output h5ad file
n_samples (str): Number of samples to keep or ‘all’
random_seed (int): Seed for random number generator
- Return type:
Namespace
- scripts.preprocessing.subsample.main() -> None
Main function to execute the sample subsetting workflow.
The function performs the following steps:
1. Loads an AnnData object from the specified input file
2. If n_samples is ‘all’, saves the complete dataset unchanged
3. Otherwise, randomly selects the specified number of samples
4. Saves the subsetted data to the specified output file
- Raises:
AssertionError – If the requested number of samples exceeds the number of available samples, or if the subsetting operation fails to produce the expected result
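The selection logic in the steps above can be sketched without file I/O, using a DataFrame in place of `adata.obs`. The column name `sample`, the use of NumPy's `default_rng`, and the sorting of sample names before drawing are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd


def subsample_samples_sketch(obs: pd.DataFrame, n_samples: str, random_seed: int,
                             sample_col: str = "sample") -> pd.DataFrame:
    """Randomly keep `n_samples` samples, or everything if n_samples == 'all'."""
    if n_samples == "all":
        return obs
    n = int(n_samples)
    # Sorting makes the draw independent of row order for a given seed.
    samples = np.sort(obs[sample_col].unique())
    assert n <= len(samples), "Requested more samples than are available"
    rng = np.random.default_rng(random_seed)
    chosen = rng.choice(samples, size=n, replace=False)
    subset = obs.loc[obs[sample_col].isin(chosen)].copy()
    assert subset[sample_col].nunique() == n, "Subsetting produced an unexpected result"
    return subset


obs = pd.DataFrame({"sample": list("AABBCCDD")})
subset = subsample_samples_sketch(obs, "2", random_seed=0)
everything = subsample_samples_sketch(obs, "all", random_seed=0)
```

Fixing the seed makes the selection reproducible: calling the function again with the same inputs returns the same samples.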