02_gen_dataset.ipynb — Generating labelled datasets

Objective

Transform cleaned ln(IC₅₀) values and aligned gene-expression data into stratified Train / Validation / Test splits for both classification and regression tasks.


Inputs

| File | Drive location | Notes |
| --- | --- | --- |
| ic50_cleaned.csv | GDSC2_drugsens/datasets/ | Long-format ln(IC₅₀) table (108,696 drug-cell-line combinations from experimental measurements). |
| drug_info.csv | same folder | 228 compounds with validated SMILES. |
| GeneExp.csv | …/datasets/features/ | 676 cell lines × 735 cancer-associated genes. |


Processing steps ↴

  1. Merge tables by DrugID and CellLineID to obtain (lnIC50, SMILES, expression).

  2. Create binary label

    • Sensitive = 1 if ln(IC₅₀) ≤ –1 (IC₅₀ ≤ e⁻¹ ≈ 0.368 µM); otherwise 0. Class imbalance ≈ 8.5 : 1.

  3. Pivot to wide format

    • Drugs as columns, cell lines as rows for both classification and regression matrices.

  4. Stratified split by cell line (seed = 42)

    • 60 % Train, 20 % Validation, 20 % Test; class ratio preserved.

  5. Save results under sensitivity/pivot/

    • Classification, in pivot/clas/: DrugSens-Train.csv, DrugSens-Validhyper-Subsampling.csv, DrugSens-Trainhyper-Subsampling.csv, DrugSens-Test.csv

    • Regression: same file pattern in pivot/regr/
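
Steps 1 to 3 can be sketched with pandas on toy data. This is a minimal illustration, not the notebook's code: the tiny tables, the column names (DrugID, CellLineID, lnIC50, SMILES, Sensitive) and the variable names are assumptions, and the gene-expression join is left out because it would follow the same merge pattern on CellLineID.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for ic50_cleaned.csv and drug_info.csv (column names are assumed).
ic50 = pd.DataFrame({
    "DrugID":     [1, 1, 1, 2, 2, 2],
    "CellLineID": ["A", "B", "C", "A", "B", "C"],
    "lnIC50":     [-2.3, 0.4, -1.0, 1.7, -1.2, 3.1],
})
drug_info = pd.DataFrame({"DrugID": [1, 2], "SMILES": ["CCO", "c1ccccc1"]})

# Step 1: attach SMILES to every measurement; validate guards against
# duplicated DrugID rows in the drug table.
merged = ic50.merge(drug_info, on="DrugID", how="inner", validate="many_to_one")

# Step 2: Sensitive = 1 if ln(IC50) <= -1, i.e. IC50 <= exp(-1) µM.
THRESHOLD = -1.0
merged["Sensitive"] = (merged["lnIC50"] <= THRESHOLD).astype(int)
threshold_um = float(np.exp(THRESHOLD))  # the same cutoff on the linear scale

# Step 3: pivot to wide format, cell lines as rows and drugs as columns.
clas_wide = merged.pivot(index="CellLineID", columns="DrugID", values="Sensitive")
regr_wide = merged.pivot(index="CellLineID", columns="DrugID", values="lnIC50")
```

With real data, drug-cell-line combinations that were never measured show up as NaN in both wide matrices, so downstream code has to decide how to mask or impute them.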


Outputs consumed later

| File set | Used by |
| --- | --- |
| 4 CSVs in pivot/clas/ | Data Integration & Preparation.ipynb |
| 4 CSVs in pivot/regr/ | Optional regression experiments |


Rationale

  • Long → wide conversion lets us index quickly by either drug or cell line and matches common chem-bio modelling formats.

  • Stratifying by cell line avoids leakage of drug-specific IC₅₀ distribution into validation or test sets.

  • Separate “hyper-subsampling” splits allow aggressive class balancing during hyper-parameter sweeps without touching the held-out Test set.
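
The cell-line-level 60 / 20 / 20 split can be sketched as follows. This is a hedged illustration using only the standard library: the function name split_cell_lines and the "low" / "high" stratum labels are hypothetical, and the notebook does not state how each cell line's stratum is defined (a binned fraction of sensitive drugs per cell line is one possibility, assumed here).

```python
import random
from collections import defaultdict


def split_cell_lines(strata, fractions=(0.6, 0.2, 0.2), seed=42):
    """Assign every cell line to exactly one of train / valid / test.

    strata maps cell-line ID -> stratum label (an assumed per-cell-line
    grouping); each stratum is shuffled and cut separately so that the
    splits keep roughly the same stratum proportions.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for cell_line, label in sorted(strata.items()):
        by_stratum[label].append(cell_line)

    splits = {"train": [], "valid": [], "test": []}
    for label in sorted(by_stratum):
        members = by_stratum[label]
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_valid = round(fractions[1] * len(members))
        splits["train"] += members[:n_train]
        splits["valid"] += members[n_train:n_train + n_valid]
        splits["test"] += members[n_train + n_valid:]  # remainder goes to test
    return splits


# 20 hypothetical cell lines, half in each stratum.
strata = {f"CL{i:02d}": ("low" if i < 10 else "high") for i in range(20)}
splits = split_cell_lines(strata)

# No cell line may appear in more than one split.
assert not set(splits["train"]) & set(splits["valid"])
assert not set(splits["train"]) & set(splits["test"])
assert not set(splits["valid"]) & set(splits["test"])
```

Because the assignment is made once per cell line, every row of the wide matrix lands in a single split, and any subsampling for hyper-parameter sweeps can then be applied to the Train and Validation portions alone.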

Files in sensitivity/stack/ and the single-sheet DrugSens*.csv tables are retained for provenance but are not loaded by subsequent notebooks.
