Running CanSig on your data

This guide explains how to run CanSig on your own datasets. The pipeline uses a configuration-based approach where you specify your data, methods, and parameters in YAML files.

Configuration Structure#

Add your configuration file to the configs folder. It should follow this structure:

root: results/your_analysis  # Output directory

methods:
  # List the integration methods you want to use
  bbknn:
  harmony:
  genenmf:
  unintegrated:
  scvi:
  scalop:
  cca:

scenarios:
  your_scenario_name:
    data_path: path/to/your/data.h5ad  # Path to your AnnData file
    signatures:
      - signature_name1
      - signature_name2
    preprocessing:  # Optional preprocessing parameters
      min_genes: 1000  # Minimum number of genes per cell
      max_pct_mt: 20   # Maximum percentage of mitochondrial genes
      excluded_samples: ["sample1", "sample2"]  # Samples to exclude

signatures:
  signature_name1:
    annotation_path: path/to/annotations1.csv
    scoring_scenario: your_scenario_name  # Dataset used for comparing gene signature scores
    n_cluster: 4
 

Input Data#

CanSig expects input data in AnnData (.h5ad) format, and the AnnData object must meet a few structural requirements. adata.obs must contain a “sample” column that stores the batch ID of each cell in the dataset. adata.var_names should contain gene symbols as the feature identifiers. adata.uns must include a “counts_type” field specifying the sequencing platform used to generate the data. Supported values for counts_type are “10x” for 10x Genomics data, “microwell array-based platform” for Microwell data, “microwell-seq” for Microwell-seq data, “smartseq2” or “SmartSeq2” for Smart-seq2 data, and “seqwell” for Seq-Well data. All of these sequencing technologies are used in the Curated Cancer Cell Atlas.
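The requirements above can be checked before starting the pipeline. The following is a minimal sketch, not part of CanSig: the function name check_cansig_input is hypothetical, and the Ensembl-ID test is only a heuristic for spotting var_names that are not gene symbols.

```python
# Values of adata.uns["counts_type"] listed as supported in this guide.
SUPPORTED_COUNTS_TYPES = {
    "10x",
    "microwell array-based platform",
    "microwell-seq",
    "smartseq2",
    "SmartSeq2",
    "seqwell",
}


def check_cansig_input(adata):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper: works on any object exposing .obs, .uns and
    .var_names the way an AnnData object does.
    """
    problems = []

    # adata.obs must carry the batch ID of every cell in a "sample" column.
    if "sample" not in adata.obs:
        problems.append('adata.obs has no "sample" column')

    # adata.uns["counts_type"] must name one of the supported platforms.
    counts_type = adata.uns.get("counts_type")
    if counts_type is None:
        problems.append('adata.uns has no "counts_type" entry')
    elif counts_type not in SUPPORTED_COUNTS_TYPES:
        problems.append(f"unsupported counts_type: {counts_type!r}")

    # Heuristic only: Ensembl-style IDs suggest var_names are not gene symbols.
    if any(str(name).startswith("ENSG") for name in adata.var_names):
        problems.append("adata.var_names look like Ensembl IDs, not gene symbols")

    return problems
```

With the anndata package installed, the helper could be called as check_cansig_input(anndata.read_h5ad("path/to/your/data.h5ad")).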

In the signatures section of your configuration file:

Create an entry for each cell type signature
Specify the path to the annotation file (annotation_path)
Set the number of expected clusters (n_cluster)
Link to the appropriate scenario using scoring_scenario
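Because scenarios and signatures refer to each other by name, a small consistency check can catch typos before a run. The sketch below is an illustration, not CanSig code: check_config is a hypothetical helper, and it assumes that every name in a scenario's signatures list needs an entry under the top-level signatures key and that each such entry carries annotation_path, scoring_scenario, and n_cluster.

```python
def check_config(config):
    """Return a list of inconsistencies between scenarios and signatures.

    Hypothetical helper operating on the configuration as a plain dict.
    """
    problems = []
    scenarios = config.get("scenarios") or {}
    signatures = config.get("signatures") or {}

    # Every signature a scenario lists should have its own entry.
    for scenario_name, scenario in scenarios.items():
        for sig_name in (scenario or {}).get("signatures") or []:
            if sig_name not in signatures:
                problems.append(
                    f"scenario {scenario_name!r} lists undefined signature {sig_name!r}"
                )

    # Every signature entry should be complete and point at a known scenario.
    for sig_name, sig in signatures.items():
        sig = sig or {}
        for field in ("annotation_path", "scoring_scenario", "n_cluster"):
            if field not in sig:
                problems.append(f"signature {sig_name!r} is missing {field!r}")
        target = sig.get("scoring_scenario")
        if target is not None and target not in scenarios:
            problems.append(
                f"signature {sig_name!r} points at unknown scenario {target!r}"
            )

    return problems
```

If PyYAML is available, the dict can be obtained with yaml.safe_load on the opened configuration file and passed straight to the helper.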

Running the Pipeline#

Once your configuration is set up:

Place your configuration file in the root directory

Run the pipeline using:

snakemake --configfile your_config.yml -c <number-of-cores>

Output Structure#

Results will be organized in your specified directory as follows:

results/your_analysis/
├── benchmarks/     # Performance benchmarks for each pipeline step
├── corrs/          # Correlation analysis results
├── data/           # Processed data files
├── integration/    # Integrated datasets for each method
├── logs/           # Pipeline execution logs
├── metasigs/       # Generated meta-signatures
└── scores/         # Evaluation scores and metrics

Each directory contains the intermediate and final results for that stage of the pipeline.

Advanced Configuration#

Random Sampling#

You can configure random downsampling of samples by adding these parameters to your scenario:

scenarios:
  your_scenario:
    n_samples: [5, 10, 15]  # List of sample sizes to try
    random_seeds: 5  # Number of random seeds for replication
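To get a feel for how many runs such a setting implies, the sketch below enumerates every combination of sample size and seed. It is an assumption-laden illustration rather than CanSig's implementation: the function downsampling_runs is hypothetical, seeds are assumed to be 0 to random_seeds - 1, and samples are assumed to be drawn without replacement.

```python
import itertools
import random


def downsampling_runs(sample_ids, n_samples, random_seeds):
    """Yield (n, seed, chosen_samples) for every size/seed combination.

    Hypothetical helper: seeds are assumed to be 0 .. random_seeds - 1.
    """
    pool = sorted(sample_ids)  # fixed order so each seed is reproducible
    for n, seed in itertools.product(n_samples, range(random_seeds)):
        if n > len(pool):
            continue  # cannot draw more samples than the dataset contains
        rng = random.Random(seed)
        yield n, seed, sorted(rng.sample(pool, n))


# Example: 20 sample IDs with the settings from the snippet above.
sample_ids = [f"sample{i}" for i in range(20)]
runs = list(downsampling_runs(sample_ids, n_samples=[5, 10, 15], random_seeds=5))
print(len(runs), "downsampled runs")
```

Under these assumptions the number of runs is the number of sample sizes times the number of seeds, so compute time grows accordingly.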

Bulk RNA Integration#

To also compare your signatures on bulk RNA-seq data, add bulk_path to your scenario:

scenarios:
  your_scenario:
    bulk_path: path/to/bulk_data.h5ad  # Path to bulk RNA-seq data