01_data_prep.ipynb — Preparing GDSC IC₅₀ and metadata

Objective

Create clean, analysis-ready tables that connect GDSC2 IC₅₀ values with COSMIC/DepMap cell-line IDs and compound SMILES strings.


Inputs

| File | Drive location | Notes |
| --- | --- | --- |
| GDSC2_fitted_dose_response_27Oct23.xlsx | GDSC2_drugsens/ | IC₅₀ (µM) for 228 drugs × 987 cell lines |
| screened_compounds_rel_8.5.csv | GDSC2_drugsens/ | Drug names, IDs, SMILES |
| sample_info.csv | CCLE-DepMap22Q2_geneexp/ | CCLE sample ↔ DepMap IDs |


Processing steps

  1. Load the IC₅₀ sheet and cast the values to float32.

  2. Map COSMIC_ID → DepMap_ID using sample_info.csv; drop rows without a match (987 → 676 cell lines).

  3. Filter compounds: keep only the 228 drugs that have a valid canonical SMILES string (verified via RDKit).

  4. Long-format table: reshape the drug × cell-line matrix into a three-column table (DrugID, CellLineID, lnIC50).

  5. Save cleaned artefacts into datasets/:

    • drug_info.csv — 228 compounds, SMILES, GDSC IDs

    • drug_name.txt — plain list of compound names for plotting

    • cell_line_info.csv — created but not reused later

    • ic50_cleaned.csv — long-format ln(IC₅₀)
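
Steps 1, 2 and 4 above can be sketched with pandas as follows. This is a minimal illustration on toy in-memory data, not the notebook's actual code: the drug labels, the DepMap IDs, the `COSMICID` column name in the sample-info table, and the assumption that the matrix already holds ln(IC₅₀) values are all placeholders chosen for the example.

```python
import pandas as pd

# Toy stand-in for the IC50 sheet: rows are cell lines keyed by COSMIC_ID,
# columns are drug IDs. Values are assumed to be ln(IC50) already.
ic50_wide = pd.DataFrame(
    {"1003": [2.1, -0.4, 1.3], "1004": [0.7, -1.2, 0.2]},
    index=pd.Index([683665, 684052, 999999], name="COSMIC_ID"),
).astype("float32")  # step 1: cast to float32

# Toy stand-in for sample_info.csv (column names are assumptions).
sample_info = pd.DataFrame(
    {"COSMICID": [683665, 684052], "DepMap_ID": ["ACH-000001", "ACH-000002"]}
)

# Step 2: an inner join drops every cell line without a DepMap match
# (here, COSMIC_ID 999999).
mapped = (
    ic50_wide.reset_index()
    .merge(sample_info, left_on="COSMIC_ID", right_on="COSMICID", how="inner")
    .drop(columns=["COSMIC_ID", "COSMICID"])
    .rename(columns={"DepMap_ID": "CellLineID"})
)

# Step 4: reshape the wide matrix into (DrugID, CellLineID, lnIC50) rows.
ic50_long = mapped.melt(
    id_vars="CellLineID", var_name="DrugID", value_name="lnIC50"
)[["DrugID", "CellLineID", "lnIC50"]]

print(ic50_long)
```

Using `how="inner"` makes the drop of unmatched cell lines explicit in the join itself; comparing the number of unique cell lines before and after the merge is a cheap way to confirm a reduction such as 987 → 676.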


Outputs that feed the next notebook

| File | Consumed by |
| --- | --- |
| drug_info.csv | 02_gen_dataset.ipynb |
| ic50_cleaned.csv | 02_gen_dataset.ipynb |


Rationale

  • Aligning to DepMap IDs up front guarantees all downstream joins (expression, fingerprints) key on a single identifier.

  • Validating SMILES now avoids extra API calls later when generating Morgan fingerprints.

  • Reshaping to long format makes it easy to apply a ln(IC₅₀) threshold and to pivot into classification and regression matrices.
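
As an illustration of the last point, the sketch below thresholds a toy long-format table and pivots it into a regression matrix and a classification matrix. The cut-off of 0.0, the `sensitive` column name, and the sample rows are hypothetical; the threshold the project actually uses is not stated here.

```python
import pandas as pd

# Toy long-format table with the three columns produced by this notebook.
ic50_long = pd.DataFrame(
    {
        "DrugID": ["1003", "1003", "1004", "1004"],
        "CellLineID": ["ACH-000001", "ACH-000002", "ACH-000001", "ACH-000002"],
        "lnIC50": [2.1, -0.4, 0.7, -1.2],
    }
)

# Hypothetical cut-off: a lower ln(IC50) means the cell line is more sensitive.
LN_IC50_THRESHOLD = 0.0
ic50_long["sensitive"] = (ic50_long["lnIC50"] < LN_IC50_THRESHOLD).astype("int8")

# Regression target: cell lines × drugs, continuous ln(IC50).
reg_matrix = ic50_long.pivot(index="CellLineID", columns="DrugID", values="lnIC50")

# Classification target: same layout, binary sensitive / resistant label.
cls_matrix = ic50_long.pivot(index="CellLineID", columns="DrugID", values="sensitive")

print(reg_matrix)
print(cls_matrix)
```

Because both matrices come from the same long table, they share row and column order, so a regression and a classification model can be trained on identically indexed targets.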

Intermediate CSVs not read again (cell_line_info.csv, etc.) are kept solely for provenance.

Last updated