CCLE_expression.ipynb — Gene-expression preprocessing

Extract a cancer-focused subset of the raw CCLE 22Q2 expression matrix and give it clean gene-symbol headers for downstream joins.

| File | Drive location | Notes |
| --- | --- | --- |
| CCLE_expression.csv | CCLE-DepMap22Q2_geneexp/ | 1,406 cell lines × 19,221 genes (log₂(TPM + 1)) |
| CancerGeneCensus_GRCh38_COSMIC_v101.csv | same folder | List of 735 cancer-associated genes |
| sample_info.csv | same folder | CCLE sample ↔ DepMap ID mapping (for later alignment) |

Load full matrix with pandas.read_csv (mixed-dtype header fix).
Subset to COSMIC genes → 735 columns retained.
Rename columns from Ensembl IDs to HGNC gene symbols (e.g., ENSG00000141510 → TP53).
Integrity checks:
- No missing values per gene.
- 1,406 unique DepMap IDs confirmed.
Save cleaned table as CCLE_expression_cleaned.csv (≈ 16 MB).
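
The subset, rename, and integrity-check steps above can be sketched as follows. This is a minimal illustration on a toy in-memory table, not the notebook's actual code: the helper `clean_expression`, the `id_to_symbol` dictionary, and the toy values are assumptions made for the example. On the real data, the matrix would first be read with `pandas.read_csv` and the mapping would cover the 735 COSMIC genes.

```python
import pandas as pd


def clean_expression(expr: pd.DataFrame, id_to_symbol: dict) -> pd.DataFrame:
    """Keep only mapped gene columns, rename them, and run integrity checks.

    `expr` is indexed by DepMap ID with one column per gene identifier.
    `id_to_symbol` (hypothetical) maps source column names to gene symbols.
    """
    # Subset: keep columns that appear in the mapping, preserving their order.
    keep = [col for col in expr.columns if col in id_to_symbol]
    cleaned = expr.loc[:, keep].rename(columns=id_to_symbol)

    # Integrity checks mirroring the list above.
    assert not cleaned.isna().any().any(), "missing expression values found"
    assert cleaned.index.is_unique, "duplicate DepMap IDs found"
    return cleaned


# Toy stand-in for the full matrix (values are made up).
expr = pd.DataFrame(
    {
        "ENSG00000141510": [5.1, 4.8],
        "ENSG00000012048": [3.2, 2.9],
        "ENSG00000000003": [1.0, 0.7],  # not in the mapping, so it is dropped
    },
    index=pd.Index(["ACH-000001", "ACH-000002"], name="DepMap_ID"),
)
id_to_symbol = {"ENSG00000141510": "TP53", "ENSG00000012048": "BRCA1"}

cleaned = clean_expression(expr, id_to_symbol)
# On real data the result would then be written out, e.g.
# cleaned.to_csv("CCLE_expression_cleaned.csv")
```

Building the column list explicitly, rather than indexing with the full gene list, avoids a `KeyError` if a census gene is absent from the expression matrix.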

| File | Used by |
| --- | --- |
| CCLE_expression_cleaned.csv | 01_data_prep.ipynb |

Focusing on 735 cancer-associated genes reduces dimensionality roughly 26-fold (19,221 → 735 columns) while preserving biologically relevant signals.
Leaving expression on a log₂(TPM + 1) scale avoids negative values and allows z-scoring later in Data Integration & Preparation.ipynb.
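
To illustrate why this scale is convenient, the sketch below applies log₂(TPM + 1) to made-up TPM values and then z-scores each gene across cell lines. The helper `zscore_per_gene` and the toy numbers are assumptions for the example; the downstream notebook may standardize differently.

```python
import numpy as np
import pandas as pd


def zscore_per_gene(expr: pd.DataFrame) -> pd.DataFrame:
    """Standardize each gene (column) to mean 0 and unit variance across cell lines."""
    mean = expr.mean(axis=0)
    std = expr.std(axis=0, ddof=0)  # population standard deviation
    return (expr - mean) / std


# Made-up TPM values for three cell lines.
tpm = pd.DataFrame(
    {"TP53": [15.0, 63.0, 255.0], "BRCA1": [0.0, 3.0, 7.0]},
    index=["ACH-000001", "ACH-000002", "ACH-000003"],
)

# log2(TPM + 1) maps TPM = 0 to 0, so the transformed values are never negative.
log_expr = np.log2(tpm + 1)

z = zscore_per_gene(log_expr)
```

A gene with identical expression in every cell line has zero standard deviation and would yield NaN here, so such columns may need to be dropped or handled before standardizing.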
A lightweight 16 MB CSV is quick to load in Colab and small enough to version-control if desired.

No additional files are written; intermediate DataFrames stay in memory only.
