02_gen_dataset.ipynb — Generating labelled datasets

Objective

Transform cleaned ln(IC₅₀) values and aligned gene-expression data into stratified Train / Validation / Test splits for both classification and regression tasks.

Inputs

| File | Drive location | Notes |
| --- | --- | --- |
| ic50_cleaned.csv | GDSC2_drugsens/datasets/ | Long-format ln(IC₅₀) table (108,696 drug-cell-line combinations from experimental measurements). |
| drug_info.csv | same folder | 228 compounds with validated SMILES. |
| GeneExp.csv | …/datasets/features/ | 676 cell lines × 735 cancer-associated genes. |

Processing steps

Merge tables by DrugID and CellLineID to obtain (lnIC50, SMILES, expression).
Create binary label
- Sensitive = 1 if ln(IC₅₀) ≤ −1 (IC₅₀ ≤ e⁻¹ ≈ 0.368 µM); else 0. Class imbalance ≈ 8.5 : 1.
Pivot to wide format
- Drugs as columns, cell lines as rows for both classification and regression matrices.
Stratified split by cell line (seed = 42)
- 60 % Train, 20 % Validation, 20 % Test; class ratio preserved.
Save results under sensitivity/pivot/
- Classification (in pivot/clas/): DrugSens-Train.csv, DrugSens-Validhyper-Subsampling.csv, DrugSens-Trainhyper-Subsampling.csv, DrugSens-Test.csv
- Regression: same file pattern in pivot/regr/
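
The merge, labelling and pivot steps above can be sketched with pandas on toy data. This is a minimal illustration, not the notebook's actual code: the tables are made-up stand-ins for ic50_cleaned.csv, drug_info.csv and GeneExp.csv, and the column names `lnIC50`, `SMILES`, `TP53` and `KRAS` are assumptions (only `DrugID` and `CellLineID` are named in the steps).

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the three input files; all values are invented.
ic50 = pd.DataFrame({
    "DrugID":     [1, 1, 2, 2, 1, 2],
    "CellLineID": ["A", "B", "A", "B", "C", "C"],
    "lnIC50":     [-1.5, 0.3, -1.0, 2.1, -0.2, -3.0],
})
drugs = pd.DataFrame({"DrugID": [1, 2], "SMILES": ["CCO", "c1ccccc1"]})
expr = pd.DataFrame({
    "CellLineID": ["A", "B", "C"],
    "TP53": [5.1, 4.8, 6.0],   # hypothetical expression values
    "KRAS": [7.2, 6.9, 7.5],
})

# Step 1: attach SMILES (by DrugID) and expression (by CellLineID)
# to every measurement. validate= guards against duplicated keys.
merged = (
    ic50.merge(drugs, on="DrugID", how="inner", validate="many_to_one")
        .merge(expr, on="CellLineID", how="inner", validate="many_to_one")
)

# Step 2: Sensitive = 1 when ln(IC50) <= -1, i.e. IC50 <= exp(-1) ~ 0.368 uM.
THRESHOLD = -1.0
merged["Sensitive"] = (merged["lnIC50"] <= THRESHOLD).astype(int)

# Step 3: pivot to wide format, cell lines as rows and drugs as columns.
clas_wide = merged.pivot(index="CellLineID", columns="DrugID", values="Sensitive")
regr_wide = merged.pivot(index="CellLineID", columns="DrugID", values="lnIC50")

print(clas_wide)
print(regr_wide)
```

`DataFrame.pivot` raises an error if a (cell line, drug) pair appears more than once, which is a useful check that the long table really holds one measurement per combination. If duplicates were expected, `pivot_table` with an explicit aggregation would be the alternative.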

Outputs consumed later

| File set | Used by |
| --- | --- |
| 4 CSVs in pivot/clas/ | Data Integration & Preparation.ipynb |
| 4 CSVs in pivot/regr/ | Optional regression experiments |

Rationale

Long → wide conversion lets us index quickly by either drug or cell line and matches common chem-bio modelling formats.
Stratifying by cell line avoids leaking drug-specific IC₅₀ distributions into the validation or test sets.
Separate “hyper-subsampling” splits allow aggressive class balancing during hyper-parameter sweeps without touching the held-out Test set.
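
The split and subsampling ideas above can be sketched using only the standard library. The sketch makes several assumptions that the text does not confirm: the split is done on cell-line IDs so that each cell line lands in exactly one split, "subsampling" means randomly dropping majority-class items down to a 1 : 1 ratio, and label 0 is the majority class. It does not attempt to preserve the class ratio across splits. The helper names `split_cell_lines` and `subsample_majority` are hypothetical.

```python
import random

def split_cell_lines(cell_line_ids, seed=42, frac_train=0.6, frac_valid=0.2):
    """Shuffle unique cell-line IDs and cut them into train/valid/test lists."""
    ids = sorted(set(cell_line_ids))   # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)
    n_train = round(frac_train * len(ids))
    n_valid = round(frac_valid * len(ids))
    return (ids[:n_train],
            ids[n_train:n_train + n_valid],
            ids[n_train + n_valid:])

def subsample_majority(labels, seed=42):
    """Return sorted indices keeping every minority item and an equal
    number of randomly chosen majority items (assumed 1:1 target)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = random.Random(seed).sample(majority, len(minority))
    return sorted(minority + kept)

# 676 cell lines as in GeneExp.csv; the ID strings themselves are made up.
cell_lines = [f"CL{i:03d}" for i in range(676)]
train_ids, valid_ids, test_ids = split_cell_lines(cell_lines)
print(len(train_ids), len(valid_ids), len(test_ids))

# Toy labels at roughly 8.5 : 1, balanced only inside a training portion.
labels = [1] * 10 + [0] * 85
balanced_idx = subsample_majority(labels)
print(len(balanced_idx))
```

Balancing is applied only to indices drawn from the training portion, so the Test IDs returned by the split are never resampled.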

Files in sensitivity/stack/ and the single-sheet DrugSens*.csv tables are retained for provenance but are not loaded by subsequent notebooks.

