01_data_prep.ipynb — Preparing GDSC IC₅₀ and metadata
Objective
Create clean, analysis-ready tables that connect GDSC2 IC₅₀ values with COSMIC/DepMap cell-line IDs and compound SMILES strings.
Inputs
GDSC2_fitted_dose_response_27Oct23.xlsx (in GDSC2_drugsens/): IC₅₀ (µM) for 228 drugs × 987 cell lines
screened_compounds_rel_8.5.csv (in GDSC2_drugsens/): drug names, IDs, SMILES
sample_info.csv (in CCLE-DepMap22Q2_geneexp/): CCLE sample ↔ DepMap IDs
Processing steps ↴
Load the IC₅₀ sheet and cast the values to float32.
Map COSMIC_ID → DepMap_ID using sample_info.csv; drop rows without a match (987 → 676 cell lines).
Filter compounds: keep only the 228 drugs with valid canonical SMILES (verified via RDKit).
Long-format table: reshape the drug × cell-line matrix into a three-column table (DrugID, CellLineID, lnIC50).
Save cleaned artefacts into datasets/:
    drug_info.csv: 228 compounds, SMILES, GDSC IDs
    drug_name.txt: plain list of compound names for plotting
    cell_line_info.csv: created but not reused later
    ic50_cleaned.csv: long-format ln(IC₅₀)
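The casting, ID-mapping, and reshaping steps above can be sketched with pandas on a toy table. This is a minimal illustration, not the notebook's actual code: the IDs and values are invented, the column names in the stand-in for sample_info.csv are assumptions, and the values are assumed to be on the ln(IC₅₀) scale already. The SMILES filter is left out so that the sketch needs only pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the IC50 sheet: rows are cell lines (COSMIC IDs),
# columns are drugs. All IDs and values are invented for illustration.
ic50_wide = pd.DataFrame(
    {"1003": [2.5, 0.8, 4.1], "1004": [1.2, np.nan, 3.3]},
    index=pd.Index([905933, 905934, 999999], name="COSMIC_ID"),
).astype("float32")  # cast every value column to float32

# Toy stand-in for sample_info.csv; the column names are assumptions.
sample_info = pd.DataFrame(
    {"COSMIC_ID": [905933, 905934], "DepMap_ID": ["ACH-000001", "ACH-000002"]}
)

long_df = (
    ic50_wide.reset_index()
    # The inner join keeps only cell lines that have a DepMap match,
    # so COSMIC ID 999999 is dropped here.
    .merge(sample_info, on="COSMIC_ID", how="inner")
    .drop(columns="COSMIC_ID")
    # Wide -> long: one row per (cell line, drug) pair.
    .melt(id_vars="DepMap_ID", var_name="DrugID", value_name="lnIC50")
    .rename(columns={"DepMap_ID": "CellLineID"})
    .loc[:, ["DrugID", "CellLineID", "lnIC50"]]
)

print(long_df)
```

An inner join is used here so that mapping and dropping unmatched cell lines happen in one step; missing measurements are kept as NaN because the steps above do not say how they are handled.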
Outputs that feed the next notebook
drug_info.csv → 02_gen_dataset.ipynb
ic50_cleaned.csv → 02_gen_dataset.ipynb
Rationale
Aligning to DepMap IDs up front guarantees all downstream joins (expression, fingerprints) key on a single identifier.
Validating SMILES now avoids extra API calls later when generating Morgan fingerprints.
Reshaping to long format makes it easy to apply a ln(IC₅₀) threshold and to pivot into classification and regression matrices.
Intermediate CSVs not read again (cell_line_info.csv, etc.) are kept solely for provenance.
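To illustrate the point about the long format, the sketch below applies a threshold to ln(IC₅₀) and pivots the result into regression and classification matrices. The threshold value, the label direction (1 = sensitive, meaning below the threshold), and the toy data are all assumptions made for the example; none of them is taken from the notebooks.

```python
import pandas as pd

# Toy long-format table with the three columns (DrugID, CellLineID, lnIC50);
# the values are invented.
long_df = pd.DataFrame(
    {
        "DrugID": ["1003", "1003", "1004", "1004"],
        "CellLineID": ["ACH-000001", "ACH-000002", "ACH-000001", "ACH-000002"],
        "lnIC50": [0.9, -0.2, 0.2, 1.2],
    }
)

THRESHOLD = 0.5  # hypothetical cut-off on ln(IC50)

# Binary label per (drug, cell line) pair: 1 = sensitive, 0 = resistant.
long_df["sensitive"] = (long_df["lnIC50"] < THRESHOLD).astype(int)

# Regression matrix: cell lines x drugs, filled with ln(IC50).
regression_matrix = long_df.pivot(
    index="CellLineID", columns="DrugID", values="lnIC50"
)

# Classification matrix: same shape, filled with the binary labels.
classification_matrix = long_df.pivot(
    index="CellLineID", columns="DrugID", values="sensitive"
)

print(regression_matrix)
print(classification_matrix)
```

Because the threshold is applied to a single column before pivoting, changing the cut-off or using a per-drug cut-off only touches one line.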
Last updated