Data Integration & Preparation.ipynb

Data Integration & Preparation.ipynb — Building the multimodal feature matrix

Objective

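Combine gene-expression profiles, Morgan fingerprints, and sensitivity labels into a single table with 2,783 features (735 gene features + 2,048 fingerprint bits) per (drug, cell line) pair, then save it in both CSV and Pickle formats so the training notebook can load it quickly in Colab.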


Inputs

| File | Drive location | Notes |
|---|---|---|
| GeneExp.csv | GDSC2_drugsens/datasets/features/ | 676 cell lines × 735 cancer-associated genes |
| drug_info.csv | GDSC2_drugsens/datasets/ | 228 compounds with canonical SMILES |
| 4 classification CSVs | …/sensitivity/pivot/clas/ | Train / Validhyper / Trainhyper / Test label matrices |

Morgan fingerprints are generated on-the-fly from the SMILES in drug_info.csv (radius = 2, nBits = 2,048). No extra input file is required.
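A minimal sketch of the on-the-fly fingerprint step, using the same radius and bit length as above. The helper name `morgan_bits` and the two example molecules are hypothetical; in the notebook the SMILES would be read from drug_info.csv, whose column names are not shown here.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem


def morgan_bits(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    """Return a 0/1 vector of length n_bits for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # MolFromSmiles returns None when the SMILES cannot be parsed.
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    bits = np.zeros(n_bits, dtype=np.uint8)
    bits[list(fp.GetOnBits())] = 1  # set only the bits that are on
    return bits


# Toy input (hypothetical); the real SMILES come from drug_info.csv.
example_smiles = {"ethanol": "CCO", "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O"}
fingerprints = {name: morgan_bits(smi) for name, smi in example_smiles.items()}
```

Raising on an unparsable SMILES is one possible policy; skipping the compound and logging it would be another. Recent RDKit releases may emit a deprecation warning for this call and offer a generator-based API in `rdFingerprintGenerator` instead.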


Processing steps

  1. Generate 2,048-bit fingerprints with RDKit (Chem.MolFromSmiles + AllChem.GetMorganFingerprintAsBitVect).

  2. Standardise gene features per gene (Z-score).

  3. Concatenate modalities

    • 735 Z-scored gene features

    • 2,048 binary fingerprint bits

    • Total = 2,783 features per (Drug, Cell Line) pair.

  4. Broadcast features across splits

    • Merge fingerprints and expression into each of the four label matrices (Train, Validhyper, Trainhyper, Test).

  5. Serialize datasets

    • multimodal_dataset_final.csv ≈ 1.6 GB — human-readable, gzip-compressible.

    • multimodal_dataset_final.pkl ≈ 6.8 GB — Pickle of NumPy arrays for fast Colab loading.

    • dataset_summary.txt — row/column counts and basic stats.
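Steps 2 to 4 can be sketched on toy data as follows. This is an illustration under assumptions, not the notebook's actual code: the label matrix is assumed to have cell lines as rows and drugs as columns with NaN for untested pairs, the Z-score uses the population standard deviation (`ddof=0`), and the names `cell_line`, `drug`, `label`, and `fp_i` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical names and sizes): 3 cell lines x 2 genes,
# 2 drugs x 4 fingerprint bits, and one pivoted label matrix.
expr = pd.DataFrame(
    [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]],
    index=["CL1", "CL2", "CL3"],
    columns=["GENE_A", "GENE_B"],
)
fps = pd.DataFrame(
    [[1, 0, 0, 1], [0, 1, 1, 0]],
    index=["DrugX", "DrugY"],
    columns=[f"fp_{i}" for i in range(4)],
    dtype="uint8",
)
labels = pd.DataFrame(
    [[1, 0], [np.nan, 1], [0, np.nan]],  # NaN = pair not tested (assumption)
    index=expr.index,
    columns=fps.index,
)

# Step 2: Z-score each gene column; fingerprint bits are left untouched.
expr_z = (expr - expr.mean(axis=0)) / expr.std(axis=0, ddof=0)

# Step 4 (first half): unpivot the label matrix to one row per tested pair.
pairs = (
    labels.rename_axis("cell_line")
    .reset_index()
    .melt(id_vars="cell_line", var_name="drug", value_name="label")
    .dropna(subset=["label"])
    .astype({"label": "int8"})
)

# Steps 3 and 4: attach gene features by cell line and bits by drug.
table = (
    pairs.merge(expr_z, left_on="cell_line", right_index=True)
    .merge(fps, left_on="drug", right_index=True)
    .reset_index(drop=True)
)
feature_cols = list(expr_z.columns) + list(fps.columns)
```

With the real inputs, `feature_cols` would hold 735 + 2,048 = 2,783 names, and the same merge would be repeated for each of the four label matrices.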


Outputs consumed later

| File | Used by |
|---|---|
| multimodal_dataset_final.pkl (preferred) | Model Development & Training.ipynb |
| multimodal_dataset_final.csv | Optional inspection or CPU-only environments |
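A downstream notebook might load the preferred Pickle and fall back to the CSV when it is missing. The sketch below assumes a layout that the description above does not confirm: the Pickle is taken to hold a dict with `"X"` and `"y"` arrays, and the CSV is taken to hold only numeric feature columns plus a `label` column. The helper `load_xy` is hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

PKL_NAME = "multimodal_dataset_final.pkl"
CSV_NAME = "multimodal_dataset_final.csv"


def load_xy(folder: Path, label_col: str = "label"):
    """Return (X, y), preferring the Pickle export over the CSV."""
    pkl_path = folder / PKL_NAME
    if pkl_path.exists():
        # Only unpickle files you produced yourself; pickle can run code.
        with pkl_path.open("rb") as fh:
            payload = pickle.load(fh)  # assumed layout: {"X": ..., "y": ...}
        return payload["X"], payload["y"]
    df = pd.read_csv(folder / CSV_NAME)
    y = df[label_col].to_numpy()
    X = df.drop(columns=[label_col]).to_numpy(dtype=np.float32)
    return X, y


# Demo on a tiny table written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    X_demo = np.array([[0.5, 1.0, 0.0], [-0.5, 0.0, 1.0]], dtype=np.float32)
    y_demo = np.array([1, 0])
    cols = ["GENE_A", "fp_0", "fp_1"]

    # Only the CSV exists at this point, so the fallback branch runs.
    pd.DataFrame(X_demo, columns=cols).assign(label=y_demo).to_csv(
        folder / CSV_NAME, index=False
    )
    X_csv, y_csv = load_xy(folder)

    # Once the Pickle is written, it takes priority.
    with (folder / PKL_NAME).open("wb") as fh:
        pickle.dump({"X": X_demo, "y": y_demo}, fh, protocol=pickle.HIGHEST_PROTOCOL)
    X_pkl, y_pkl = load_xy(folder)
```

Both branches return the same arrays here, which is a cheap consistency check to run once after exporting.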


Rationale

  • Z-scoring only the gene features leaves the binary fingerprint bits untouched while putting the gene features on a comparable scale.

  • Writing both CSV and Pickle balances transparency (CSV) against speed (the Pickle loads 8 to 10× faster in Colab).

  • Keeping four split-specific tables avoids accidental data leakage: each split passes through training exactly as originally stratified.
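One way to sanity-check the leakage point is to count the (drug, cell line) pairs shared between splits. This is a sketch on toy data with hypothetical column names. Which overlaps are acceptable depends on how the splits were built: the toy data assumes Trainhyper and Validhyper were carved out of Train, so their overlap with Train is by design, while any overlap with Test would be a problem.

```python
from itertools import combinations

import pandas as pd


def shared_pairs(splits: dict[str, pd.DataFrame], keys=("drug", "cell_line")):
    """Count the key tuples that each pair of splits has in common."""
    key_sets = {
        name: set(df[list(keys)].itertuples(index=False, name=None))
        for name, df in splits.items()
    }
    return {
        (a, b): len(key_sets[a] & key_sets[b])
        for a, b in combinations(key_sets, 2)
    }


# Toy splits (hypothetical): Train is the union of the two hyper splits.
train_hyper = pd.DataFrame({"drug": ["DrugX", "DrugY"], "cell_line": ["CL1", "CL1"]})
valid_hyper = pd.DataFrame({"drug": ["DrugX"], "cell_line": ["CL2"]})
train = pd.concat([train_hyper, valid_hyper], ignore_index=True)
test = pd.DataFrame({"drug": ["DrugY"], "cell_line": ["CL3"]})

overlaps = shared_pairs(
    {"Train": train, "Trainhyper": train_hyper, "Validhyper": valid_hyper, "Test": test}
)
```

In this toy setup every entry involving `"Test"` is zero, and so is the Trainhyper/Validhyper entry.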

Legacy file multimodal_features_scaled.csv (≈ 4.3 GB) is an earlier export and is no longer used downstream.
