CCLE_expression.ipynb — Gene-expression preprocessing

Objective

Extract a cancer-focused subset of the raw CCLE 22Q2 expression matrix and give it clean gene-symbol headers for downstream joins.


Inputs

| File | Drive location | Notes |
| --- | --- | --- |
| CCLE_expression.csv | CCLE-DepMap22Q2_geneexp/ | 1,406 cell lines × 19,221 genes (log₂(TPM + 1)) |
| CancerGeneCensus_GRCh38_COSMIC_v101.csv | same folder | List of 735 cancer-associated genes |
| sample_info.csv | same folder | CCLE sample ↔ DepMap ID mapping (for later alignment) |


Processing steps

  1. Load full matrix with pandas.read_csv (mixed-dtype header fix).

  2. Subset to COSMIC genes → 735 columns retained.

  3. Rename columns from Ensembl IDs to HGNC gene symbols (e.g., ENSG00000141510 → TP53).

  4. Integrity checks:

    • No missing values per gene.

    • 1,406 unique DepMap IDs confirmed.

  5. Save cleaned table as CCLE_expression_cleaned.csv (≈ 16 MB).
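
The steps above can be sketched roughly as follows. This is a minimal illustration on toy in-memory data, not the notebook's actual code: the census column names ("Gene Symbol", "Ensembl ID"), the helper name clean_expression, and the use of low_memory=False as the mixed-dtype fix are all assumptions, and the real files may be laid out differently.

```python
import io
import pandas as pd

# Toy stand-ins for CCLE_expression.csv and the COSMIC census file.
# Values and census column names are assumptions for illustration only.
expr_csv = io.StringIO(
    "DepMap_ID,ENSG00000141510,ENSG00000012048,ENSG00000000003\n"
    "ACH-000001,5.1,3.2,0.0\n"
    "ACH-000002,4.7,2.9,1.4\n"
)
census_csv = io.StringIO(
    "Gene Symbol,Ensembl ID\n"
    "TP53,ENSG00000141510\n"
    "BRCA1,ENSG00000012048\n"
)


def clean_expression(expr_src, census_src):
    """Hypothetical helper: subset to census genes, rename, and check."""
    # Step 1: low_memory=False lets pandas infer each column's dtype
    # from the whole file instead of chunk by chunk.
    expr = pd.read_csv(expr_src, index_col=0, low_memory=False)
    census = pd.read_csv(census_src)

    # Step 2: keep only columns whose Ensembl ID appears in the census.
    id_to_symbol = dict(zip(census["Ensembl ID"], census["Gene Symbol"]))
    keep = [col for col in expr.columns if col in id_to_symbol]
    subset = expr[keep]

    # Step 3: rename Ensembl IDs to gene symbols.
    subset = subset.rename(columns=id_to_symbol)

    # Step 4: integrity checks (no missing values, unique DepMap IDs).
    assert not subset.isna().any().any(), "missing expression values"
    assert subset.index.is_unique, "duplicate DepMap IDs"
    return subset


cleaned = clean_expression(expr_csv, census_csv)
print(cleaned.columns.tolist())

# Step 5 would write the result to disk, for example:
# cleaned.to_csv("CCLE_expression_cleaned.csv")
```

On the real matrix, the same checks would be expected to report 735 columns and 1,406 unique row IDs, matching the counts listed above.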


Outputs consumed later

| File | Used by |
| --- | --- |
| CCLE_expression_cleaned.csv | 01_data_prep.ipynb |


Rationale

  • Focusing on 735 cancer-associated genes reduces dimensionality roughly 26-fold (19,221 → 735 columns) while preserving biologically relevant signals.

  • Leaving expression on a log₂(TPM + 1) scale avoids negative values and allows z-scoring later in Data Integration & Preparation.ipynb.

  • A lightweight 16 MB CSV is quick to load in Colab and small enough to version-control if desired.
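
As an illustration of the later z-scoring mentioned above, per-gene standardisation of log₂(TPM + 1) values might look like the sketch below. The data are toy values, the function name zscore_per_gene is hypothetical, and the choice of population standard deviation (ddof=0) is an assumption; the downstream notebook may use a different convention.

```python
import pandas as pd

# Toy log2(TPM + 1) values; real data would come from
# CCLE_expression_cleaned.csv (rows = cell lines, columns = genes).
expr = pd.DataFrame(
    {"TP53": [4.0, 5.0, 6.0], "BRCA1": [1.0, 3.0, 5.0]},
    index=["ACH-000001", "ACH-000002", "ACH-000003"],
)


def zscore_per_gene(df):
    """Centre each gene column on 0 and scale it to unit standard deviation."""
    # ddof=0 (population std) is an assumption; pandas defaults to ddof=1.
    return (df - df.mean(axis=0)) / df.std(axis=0, ddof=0)


z = zscore_per_gene(expr)
print(z.round(3))
```

A gene with identical expression in every cell line has zero standard deviation and would produce NaN here, so such columns would need to be dropped or handled separately before scaling.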

No additional files are written; intermediate DataFrames stay in memory only.
