01_data_prep.ipynb — Preparing GDSC IC₅₀ and metadata

Create clean, analysis-ready tables that connect GDSC2 IC₅₀ values with COSMIC/DepMap cell-line IDs and compound SMILES strings.

| File | Drive location | Notes |
| --- | --- | --- |
| GDSC2_fitted_dose_response_27Oct23.xlsx | GDSC2_drugsens/ | IC₅₀ (µM) for 228 drugs × 987 cell lines |
| screened_compounds_rel_8.5.csv | GDSC2_drugsens/ | Drug names, IDs, SMILES |
| sample_info.csv | CCLE-DepMap22Q2_geneexp/ | CCLE sample ↔ DepMap IDs |

1. Load IC₅₀ sheet and cast to float32.
2. Map COSMIC_ID → DepMap_ID using sample_info.csv; drop rows without a match (987 → 676 cell lines).
3. Filter compounds: keep only 228 drugs with valid canonical SMILES (verified via RDKit).
4. Long-format table: reshape the drug-cell line matrix into a three-column table (DrugID, CellLineID, lnIC50).
5. Save cleaned artefacts into datasets/:
- drug_info.csv — 228 compounds, SMILES, GDSC IDs
- drug_name.txt — plain list of compound names for plotting
- cell_line_info.csv — created but not reused later
- ic50_cleaned.csv — long-format ln(IC₅₀)
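
The ID-mapping and reshaping steps above can be sketched with pandas on toy data. This is a minimal sketch, not the notebook's code: it assumes the IC₅₀ sheet has already been loaded as a wide matrix (cell lines as rows, drug IDs as columns) holding ln(IC₅₀) values, and the column names `COSMIC_ID` and `DepMap_ID`, the IDs, and the numbers are all illustrative assumptions.

```python
import pandas as pd

# Toy wide matrix (assumed layout): rows are COSMIC cell-line IDs,
# columns are drug IDs, values are hypothetical ln(IC50) numbers.
ic50_wide = pd.DataFrame(
    {1003: [1.2, -0.5, 3.1], 1004: [0.7, 2.4, 1.9]},
    index=pd.Index([683665, 684052, 999999], name="COSMIC_ID"),
).astype("float32")

# Toy stand-in for sample_info.csv (column names are assumptions).
sample_info = pd.DataFrame(
    {
        "COSMIC_ID": [683665, 684052],
        "DepMap_ID": ["ACH-000001", "ACH-000002"],
    }
)

# Map COSMIC_ID -> DepMap_ID. The inner merge drops cell lines
# that have no match in the mapping table (here: 999999).
mapped = ic50_wide.reset_index().merge(sample_info, on="COSMIC_ID", how="inner")

# Reshape the matrix into a three-column long table.
long = (
    mapped.melt(
        id_vars=["COSMIC_ID", "DepMap_ID"],
        var_name="DrugID",
        value_name="lnIC50",
    )
    .rename(columns={"DepMap_ID": "CellLineID"})
    [["DrugID", "CellLineID", "lnIC50"]]
)

print(long)
```

An inner merge is used here so that unmatched cell lines fall out in the same step as the mapping itself; comparing row counts before and after the merge is a cheap way to confirm how many lines were dropped.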

| File | Consumed by |
| --- | --- |
| drug_info.csv | 02_gen_dataset.ipynb |
| ic50_cleaned.csv | 02_gen_dataset.ipynb |

Aligning to DepMap IDs up front guarantees all downstream joins (expression, fingerprints) key on a single identifier.
Validating SMILES now avoids extra API calls later when generating Morgan fingerprints.
Reshaping to long format makes it easy to apply a ln(IC₅₀) threshold and to pivot into classification and regression matrices.
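
To illustrate the last point, here is a small sketch of how a long-format table can be thresholded and pivoted with pandas. The cutoff value, the `sensitive` column name, and the toy rows are hypothetical; the source does not state which threshold the downstream notebooks use.

```python
import pandas as pd

# Toy long-format table in the (DrugID, CellLineID, lnIC50) layout.
long = pd.DataFrame(
    {
        "DrugID": [1003, 1003, 1004, 1004],
        "CellLineID": ["ACH-000001", "ACH-000002", "ACH-000001", "ACH-000002"],
        "lnIC50": [1.2, -0.5, 0.7, 2.4],
    }
)

# Hypothetical cutoff: label a pair as sensitive (1) when ln(IC50) is below it.
LN_IC50_THRESHOLD = 1.0
long["sensitive"] = (long["lnIC50"] < LN_IC50_THRESHOLD).astype(int)

# Regression matrix: cell lines x drugs, continuous ln(IC50) values.
reg_matrix = long.pivot(index="CellLineID", columns="DrugID", values="lnIC50")

# Classification matrix: same shape, binary labels.
clf_matrix = long.pivot(index="CellLineID", columns="DrugID", values="sensitive")

print(reg_matrix)
print(clf_matrix)
```

Because both matrices come from the same long table, they share row and column order, so the regression and classification targets stay aligned without an extra join.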

Intermediate CSVs not read again (cell_line_info.csv, etc.) are kept solely for provenance.
