02_gen_dataset.ipynb — Generating labelled datasets
Objective
Transform cleaned ln(IC₅₀) values and aligned gene-expression data into stratified Train / Validation / Test splits for both classification and regression tasks.
Inputs
ic50_cleaned.csv (GDSC2_drugsens/datasets/): long-format ln(IC₅₀) table (108,696 experimentally measured drug-cell-line combinations).
drug_info.csv (same folder): 228 compounds with validated SMILES.
GeneExp.csv (…/datasets/features/): 676 cell lines × 735 cancer-associated genes.
Processing steps ↴
Merge tables by DrugID and CellLineID to obtain (lnIC50, SMILES, expression).
Create binary label
Sensitive = 1 if ln(IC₅₀) ≤ −1 (i.e. IC₅₀ ≤ e⁻¹ ≈ 0.368 µM); else 0. Class imbalance ≈ 8.5 : 1.
Pivot to wide format
Drugs as columns, cell lines as rows for both classification and regression matrices.
Stratified split by cell line (seed = 42)
60 % Train, 20 % Validation, 20 % Test; class ratio preserved.
Save results under sensitivity/pivot/
Classification: DrugSens-Train.csv, DrugSens-Validhyper-Subsampling.csv, DrugSens-Trainhyper-Subsampling.csv, DrugSens-Test.csv
Regression: same file pattern in pivot/regr/
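The merge, labelling, and pivot steps above can be sketched with pandas on toy data. This is a minimal illustration, not the notebook's actual code: the column names (DrugID, CellLineID, lnIC50, SMILES, Sensitive) and the toy values are assumptions, and the expression table is omitted for brevity (it would be joined on CellLineID in the same way).

```python
import pandas as pd

# Toy stand-ins for ic50_cleaned.csv and drug_info.csv (column names are assumed).
ic50 = pd.DataFrame({
    "DrugID":     [1, 1, 2, 2, 1, 2],
    "CellLineID": ["A", "B", "A", "B", "C", "C"],
    "lnIC50":     [-1.5, 0.3, 2.1, -1.0, 1.2, -0.2],
})
drugs = pd.DataFrame({"DrugID": [1, 2], "SMILES": ["CCO", "c1ccccc1"]})

# Step 1: attach SMILES to every measurement; validate guards against duplicate drug rows.
merged = ic50.merge(drugs, on="DrugID", how="inner", validate="many_to_one")

# Step 2: binary label, Sensitive = 1 when ln(IC50) <= -1, else 0.
merged["Sensitive"] = (merged["lnIC50"] <= -1.0).astype(int)

# Step 3: long -> wide, cell lines as rows and drugs as columns.
clas_wide = merged.pivot(index="CellLineID", columns="DrugID", values="Sensitive")
regr_wide = merged.pivot(index="CellLineID", columns="DrugID", values="lnIC50")

print(clas_wide)
```

Drug-cell-line pairs that were never measured show up as NaN in the wide matrices, so downstream code has to decide whether to mask or impute them.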
Outputs consumed later
4 CSVs in pivot/clas/: consumed by Data Integration & Preparation.ipynb
4 CSVs in pivot/regr/: consumed by optional regression experiments
Rationale
Long → wide conversion lets us index quickly by either drug or cell line and matches common chem-bio modelling formats.
Stratifying by cell line avoids leakage of drug-specific IC₅₀ distribution into validation or test sets.
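One way to split at the cell-line level while roughly preserving the class ratio is to bin cell lines by their fraction of sensitive labels and allocate 60/20/20 within each bin. The sketch below is an assumption about how such a split could be done, not the notebook's implementation; the function name split_by_cell_line, the n_bins parameter, and the toy matrix are hypothetical.

```python
import numpy as np
import pandas as pd

def split_by_cell_line(clas_wide, seed=42, fractions=(0.6, 0.2, 0.2), n_bins=2):
    """Assign each row (cell line) to train/valid/test, stratified on its sensitive fraction."""
    rng = np.random.default_rng(seed)
    # Fraction of tested drugs each cell line is sensitive to (NaN entries are skipped).
    sens_frac = clas_wide.mean(axis=1)
    # Rank first so ties cannot collapse the quantile edges, then cut into equal-sized strata.
    strata = pd.qcut(sens_frac.rank(method="first"), q=n_bins, labels=False)
    assignment = {}
    for _, members in strata.groupby(strata):
        ids = rng.permutation(members.index.to_numpy())
        cut1 = int(round(fractions[0] * len(ids)))
        cut2 = int(round((fractions[0] + fractions[1]) * len(ids)))
        chunks = (ids[:cut1], ids[cut1:cut2], ids[cut2:])
        for name, chunk in zip(("train", "valid", "test"), chunks):
            for cell_line in chunk:
                assignment[cell_line] = name
    return pd.Series(assignment).reindex(clas_wide.index)

# Toy wide label matrix: 20 cell lines x 5 drugs.
toy_rng = np.random.default_rng(0)
toy = pd.DataFrame(
    toy_rng.integers(0, 2, size=(20, 5)),
    index=[f"CL{i:02d}" for i in range(20)],
    columns=[f"D{j}" for j in range(5)],
)
parts = split_by_cell_line(toy)
train, valid, test = (toy.loc[parts == p] for p in ("train", "valid", "test"))
print(parts.value_counts())
```

Because whole rows are assigned to a single split, no cell line's expression profile appears in more than one of Train, Validation, and Test.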
Separate “hyper-subsampling” splits allow aggressive class balancing during hyper-parameter sweeps without touching the held-out Test set.
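The source does not say how the subsampling is performed. A common choice is random undersampling of the majority class, sketched below on long-format toy data; the function name undersample_majority, the ratio parameter, and the Sensitive column name are assumptions made for illustration.

```python
import pandas as pd

def undersample_majority(long_df, label_col="Sensitive", ratio=1.0, seed=42):
    """Randomly drop majority-class rows until majority <= ratio * minority."""
    counts = long_df[label_col].value_counts()
    if len(counts) < 2 or counts.min() == counts.max():
        return long_df.copy()  # nothing to balance
    minority, majority = counts.idxmin(), counts.idxmax()
    n_keep = min(int(counts[majority]), int(ratio * counts[minority]))
    kept = long_df[long_df[label_col] == majority].sample(n=n_keep, random_state=seed)
    minority_rows = long_df[long_df[label_col] == minority]
    # Restore the original row order after concatenation.
    return pd.concat([minority_rows, kept]).sort_index()

# Toy frame with the 8.5 : 1 imbalance quoted above (17 non-sensitive, 2 sensitive).
toy_long = pd.DataFrame({"Sensitive": [0] * 17 + [1] * 2})
balanced = undersample_majority(toy_long, ratio=1.0)
print(balanced["Sensitive"].value_counts())
```

In this sketch the function would be applied only to the hyper-parameter Train and Validation subsets; the held-out Test set keeps its natural class ratio.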
Files in sensitivity/stack/ and the single-sheet DrugSens*.csv tables are retained for provenance but are not loaded by subsequent notebooks.