CCLE_expression.ipynb — Gene-expression preprocessing
Objective
Extract a cancer-focused subset of the raw CCLE 22Q2 expression matrix and give it clean gene-symbol headers for downstream joins.
Inputs
CCLE_expression.csv (in CCLE-DepMap22Q2_geneexp/): 1,406 cell lines × 19,221 genes, log₂(TPM + 1)
CancerGeneCensus_GRCh38_COSMIC_v101.csv (same folder): list of 735 cancer-associated genes
sample_info.csv (same folder): CCLE sample ↔ DepMap ID mapping (for later alignment)
Processing steps
Load the full matrix with pandas.read_csv (mixed-dtype header fix).
Subset to COSMIC genes → 735 columns retained.
Rename columns from Ensembl IDs to HGNC gene symbols (e.g., ENSG00000141510 → TP53).
Integrity checks:
  No missing values per gene.
  1,406 unique DepMap IDs confirmed.
Save the cleaned table as CCLE_expression_cleaned.csv (≈ 16 MB).
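The subset, rename, and integrity-check steps can be sketched as follows. This is a minimal illustration on a two-row toy frame, not the notebook's actual code: the helper names `subset_and_rename` and `check_integrity`, the toy DepMap IDs, and the one-entry `mapping` dictionary are all assumptions made for the example.

```python
import pandas as pd


def subset_and_rename(expr: pd.DataFrame, ensembl_to_symbol: dict) -> pd.DataFrame:
    """Keep only columns whose Ensembl ID is in the mapping, then rename them to symbols."""
    keep = [col for col in expr.columns if col in ensembl_to_symbol]
    return expr.loc[:, keep].rename(columns=ensembl_to_symbol)


def check_integrity(cleaned: pd.DataFrame) -> None:
    """Raise ValueError if any gene has missing values or any row ID is repeated."""
    missing = cleaned.isna().sum()
    if missing.any():
        raise ValueError(f"Missing values in: {list(missing[missing > 0].index)}")
    if not cleaned.index.is_unique:
        raise ValueError("Duplicate DepMap IDs in index")


# Toy stand-in for the expression matrix (rows: cell lines, columns: Ensembl IDs).
toy = pd.DataFrame(
    {"ENSG00000141510": [5.1, 4.8], "ENSG00000000003": [2.0, 3.3]},
    index=pd.Index(["ACH-000001", "ACH-000002"], name="DepMap_ID"),
)

# Hypothetical one-gene mapping; the real one would cover all 735 census genes.
mapping = {"ENSG00000141510": "TP53"}

cleaned = subset_and_rename(toy, mapping)
check_integrity(cleaned)  # passes silently when both checks hold
```

On the real matrix, the same two calls would be followed by `cleaned.to_csv("CCLE_expression_cleaned.csv")` to produce the output file.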
Outputs consumed later
CCLE_expression_cleaned.csv: read by 01_data_prep.ipynb
Rationale
Focusing on 735 cancer-associated genes reduces dimensionality roughly 26-fold (19,221 → 735 genes) while preserving biologically relevant signals.
Leaving expression on a log₂(TPM + 1) scale avoids negative values and allows z-scoring later in Data Integration & Preparation.ipynb.
A lightweight 16 MB CSV is quick to load in Colab and small enough to version-control if desired.
No additional files are written; intermediate DataFrames stay in memory only.
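For context on the z-scoring mentioned above, here is a small sketch of what per-gene standardization of log₂(TPM + 1) values could look like. How the downstream notebook actually z-scores is not stated here, so the per-gene (column-wise) direction, the population standard deviation (`ddof=0`), the handling of constant genes, and the function name `zscore_per_gene` are all assumptions.

```python
import pandas as pd


def zscore_per_gene(expr: pd.DataFrame) -> pd.DataFrame:
    """Center each gene (column) on 0 and scale it to unit standard deviation."""
    mean = expr.mean(axis=0)
    std = expr.std(axis=0, ddof=0)
    # A gene with identical values everywhere has std 0; dividing by 1 instead
    # leaves its centered values at 0 rather than producing NaN.
    std = std.replace(0, 1.0)
    return (expr - mean) / std


# Toy log2(TPM + 1) values: three cell lines, two genes (one of them constant).
toy_expr = pd.DataFrame(
    {"TP53": [4.0, 6.0, 8.0], "KRAS": [3.0, 3.0, 3.0]},
    index=["line_a", "line_b", "line_c"],
)

z = zscore_per_gene(toy_expr)
```

After this transform, each non-constant gene has mean 0 and standard deviation 1 across cell lines, which puts genes with very different expression ranges on a comparable scale.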