Curated Cancer Cell Atlas (CCCA) Data Download Guide

The CanSig-benchmark project includes tools to download and process datasets from the Curated Cancer Cell Atlas (CCCA). This guide explains how to download datasets and convert them into processed samples for analysis.

Overview

The CCCA fetcher (ccafetcher.py) handles:

Downloading raw data and metadata from Dropbox URLs
Processing ZIP files containing expression matrices
Converting data to AnnData format for analysis
Merging metadata with expression data

Dataset Metadata Structure

Datasets are defined in CSV files (e.g., 3ca.csv) with the following columns:

Tissue: Tissue type (e.g., lung, breast, brain)
Title: Study author and year
Article: DOI/URL to the original publication
Data: Dropbox URL for the expression data (ZIP format)
Meta_data: Dropbox URL for the metadata (CSV format)
Disease: Disease type or condition
Technology: Sequencing technology (10x, SmartSeq2, etc.)
N_cells: Number of cells in the dataset
N_sampels: Number of samples in the dataset
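Because the fetcher works from whatever rows the metadata file contains, it can help to write a smaller CSV before running it. The following is a minimal sketch using only the Python standard library; the helper name subset_metadata and the in-memory example rows are illustrative placeholders, not part of ccafetcher.py or real 3CA entries.

```python
import csv
import io


def subset_metadata(rows, tissue):
    """Keep only rows whose Tissue column matches `tissue` (case-insensitive)."""
    wanted = tissue.strip().lower()
    return [row for row in rows if row["Tissue"].strip().lower() == wanted]


# Tiny in-memory stand-in for 3ca.csv; values are placeholders, not real entries.
example_csv = io.StringIO(
    "Tissue,Title,Technology\n"
    "brain,Study A 2019,SmartSeq2\n"
    "lung,Study B 2020,10x\n"
    "Brain,Study C 2021,10x\n"
)
rows = list(csv.DictReader(example_csv))
brain_rows = subset_metadata(rows, "brain")

# Write the subset back out with the same header, ready to pass to --metadata.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(brain_rows)
subset_csv_text = out.getvalue()
```

For a real file, replace the io.StringIO objects with open("3ca.csv", newline="") and an output file of your choice. Note that row indices change after subsetting, which matters for any option that selects a dataset by index.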

Basic Usage

1. Download Only (Fetch Mode)

To download raw datasets without processing:

⚠️ Warning: This downloads every dataset listed in the metadata file. Subset the metadata file to the datasets you need before running the command.

python ccafetcher.py --metadata 3ca.csv --download-dir data/downloads --fetch

This will:

Create the download directory if it doesn’t exist
Download all ZIP files and metadata CSV files
Store files with hashed names based on their URLs
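The idea of naming files by a hash of their URL can be sketched as follows. This guide does not state which hash function or file-name layout ccafetcher.py uses, so SHA-256, the helper names hashed_filename and local_path, and the example URL are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def hashed_filename(url, suffix):
    """Return a deterministic file name derived from the URL.

    SHA-256 is an assumption; the real fetcher may use a different scheme.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return digest + suffix


def local_path(download_dir, url, suffix):
    """Location of the downloaded file inside the download directory."""
    return Path(download_dir) / hashed_filename(url, suffix)


example_url = "https://example.com/placeholder/data.zip"  # placeholder URL
path = local_path("data/downloads", example_url, ".zip")

# The same URL always maps to the same path, so an existing file can be reused.
already_downloaded = path.exists()
```

A URL-derived name like this lets a fetcher skip downloads that are already on disk without keeping a separate index of what was fetched.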

2. Process Single Dataset

To download and convert a single dataset to AnnData format, pass its row index with --sample. The following command processes the first row of the metadata file:

python ccafetcher.py --metadata 3ca.csv --download-dir data/downloads --dataset-dir data/raw --sample 0

Parameters:

--metadata: Path to the metadata CSV file
--download-dir: Directory to store raw downloaded files
--dataset-dir: Directory to store processed AnnData files
--sample: Index of the dataset to process (0-based)
--fetch: Download files if not already present
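Since --sample takes one index at a time, processing several datasets means running the script once per index. The sketch below builds those command lines from the flags listed above; the helper name build_command and the chosen indices are illustrative, and the actual call is left commented out so that nothing is downloaded by accident.

```python
import shlex


def build_command(sample, metadata="3ca.csv",
                  download_dir="data/downloads", dataset_dir="data/raw"):
    """Assemble the argument list for one ccafetcher.py run."""
    return [
        "python", "ccafetcher.py",
        "--metadata", metadata,
        "--download-dir", download_dir,
        "--dataset-dir", dataset_dir,
        "--sample", str(sample),
    ]


# Illustrative indices; pick the rows you need from your metadata file.
commands = [build_command(i) for i in (0, 1, 2)]

for cmd in commands:
    print(shlex.join(cmd))
    # To execute for real, import subprocess and uncomment:
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a directory name contains spaces.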

Output Format

Processed datasets are saved as .h5ad files (AnnData format) in the specified dataset directory. The naming convention is:

Single dataset per ZIP: {Author}_{Tissue}.h5ad
Multiple datasets per ZIP: {Author}_{Tissue}_{Subdirectory}.h5ad
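The naming convention can be expressed as a small helper. This is a sketch only: the function name output_filename is hypothetical, and the guide does not say how {Author} is derived from the Title column or whether names are sanitized, so the example takes the parts as given.

```python
def output_filename(author, tissue, subdirectory=None):
    """Build the .h5ad file name following the convention described above."""
    parts = [author, tissue]
    if subdirectory:
        # Only ZIP files that contain several datasets add a third component.
        parts.append(subdirectory)
    return "_".join(parts) + ".h5ad"


single = output_filename("Author", "brain")
multiple = output_filename("Author", "brain", "subdirA")
```

Here "Author", "brain", and "subdirA" are placeholder values rather than names of real datasets.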

Data Structure

Each processed AnnData object contains:

X: Expression matrix (sparse CSR format)
obs: Cell-level metadata (observations)
var: Gene-level metadata (variables)
uns: Unstructured metadata including count type

Changes made to the 3CA

We made the following corrections and changes to the 3CA data:

Fixed the metadata of Neftel et al. by changing the technology to 10x for some samples.
Converted the .rds in Couturier et al. to .mtx.
Subset the dataset from Couturier et al. to cells that were identified as IDH WT in genetic_hormonal_features and as GBM in histology.
Subset the dataset from Yuan et al. to GBM patients.