Data Integration & Preparation.ipynb — Building the multimodal feature matrix
Objective
Combine gene-expression profiles, Morgan fingerprints, and sensitivity labels into a single 2,783-feature table, then save it in both CSV and Pickle formats for fast training in Colab.
Inputs
GeneExp.csv (GDSC2_drugsens/datasets/features/): 676 cell lines × 735 cancer-associated genes
drug_info.csv (GDSC2_drugsens/datasets/): 228 compounds with canonical SMILES
4 classification CSVs (…/sensitivity/pivot/clas/): Train / Validhyper / Trainhyper / Test label matrices
Morgan fingerprints are generated on the fly from the SMILES in drug_info.csv (radius = 2, nBits = 2,048). No extra input file is required.
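As a minimal sketch of that fingerprint step, the function below turns one SMILES string into a 2,048-element 0/1 vector using the two RDKit calls named under Processing steps. The function name morgan_bits and the choice to raise on an unparsable SMILES are assumptions made for illustration; the notebook may handle invalid rows differently.

```python
import numpy as np


def morgan_bits(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    """Return the Morgan fingerprint of one SMILES string as a 0/1 vector."""
    # RDKit is imported inside the function so the surrounding module
    # can still be loaded on machines where RDKit is not installed.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # MolFromSmiles returns None when it cannot parse the input.
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")

    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)

    # Copy the indices of the set bits into a dense NumPy vector.
    bits = np.zeros(n_bits, dtype=np.uint8)
    bits[list(fp.GetOnBits())] = 1
    return bits
```

Calling morgan_bits("CCO") returns a vector of length 2,048 for ethanol. Recent RDKit releases may emit a deprecation warning for GetMorganFingerprintAsBitVect and point to the rdFingerprintGenerator module instead; the call is kept here because it is the one the notebook names.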
Processing steps
Generate 2,048-bit fingerprints with RDKit (Chem.MolFromSmiles + AllChem.GetMorganFingerprintAsBitVect).
Standardise gene features per gene (Z-score).
Concatenate modalities
735 Z-scored gene features
2,048 binary fingerprint bits
Total = 2,783 features per (Drug, Cell Line) pair.
Broadcast features across splits
Merge fingerprints and expression into each of the four label matrices (Train, Validhyper, Trainhyper, Test).
Serialize datasets
multimodal_dataset_final.csv (≈ 1.6 GB): human-readable, gzip-compressible.
multimodal_dataset_final.pkl (≈ 6.8 GB): Pickle of NumPy arrays for fast Colab loading.
dataset_summary.txt: row/column counts and basic stats.
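The steps after fingerprint generation can be sketched on toy data as follows. This is an illustration under stated assumptions, not the notebook's actual code: the label matrices are assumed to have drugs as rows and cell lines as columns with NaN for unlabelled pairs, and the column names drug, cell_line, label, the fp_ prefix, and the dictionary layout inside the pickle are all hypothetical. With the real inputs, the same concatenation yields 735 + 2,048 = 2,783 feature columns.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

META_COLS = ["drug", "cell_line", "label"]  # hypothetical column names


def zscore_per_gene(expr: pd.DataFrame) -> pd.DataFrame:
    """Scale each gene column to mean 0 and standard deviation 1."""
    std = expr.std(axis=0, ddof=0).replace(0, 1.0)  # constant genes stay at 0
    return (expr - expr.mean(axis=0)) / std


def build_pair_table(labels, expr_z, fps):
    """Expand one drug x cell-line label matrix into one row per labelled pair."""
    long = (
        labels.rename_axis(index="drug")
        .reset_index()
        .melt(id_vars="drug", var_name="cell_line", value_name="label")
        .dropna(subset=["label"])  # unlabelled pairs are skipped
    )
    # Gene features are looked up by cell line, fingerprint bits by drug.
    table = long.merge(expr_z, left_on="cell_line", right_index=True)
    table = table.merge(fps, left_on="drug", right_index=True)
    return table.reset_index(drop=True)


def save_outputs(table, out_dir):
    """Write the CSV, a pickle of NumPy arrays, and a short text summary."""
    out_dir = Path(out_dir)
    feature_cols = [c for c in table.columns if c not in META_COLS]
    table.to_csv(out_dir / "multimodal_dataset_final.csv", index=False)
    arrays = {  # assumed layout; the real pickle may be organised differently
        "X": table[feature_cols].to_numpy(dtype=np.float32),
        "y": table["label"].to_numpy(),
        "feature_names": np.array(feature_cols),
    }
    with (out_dir / "multimodal_dataset_final.pkl").open("wb") as fh:
        pickle.dump(arrays, fh, protocol=pickle.HIGHEST_PROTOCOL)
    summary = f"rows: {len(table)}\nfeature columns: {len(feature_cols)}\n"
    (out_dir / "dataset_summary.txt").write_text(summary)


# Toy stand-ins: 3 cell lines x 2 genes, 2 drugs x 4 fingerprint bits.
expr = pd.DataFrame(
    {"GENE_A": [1.0, 2.0, 3.0], "GENE_B": [10.0, 10.0, 40.0]},
    index=["CL1", "CL2", "CL3"],
)
fps = pd.DataFrame(
    [[1, 0, 1, 0], [0, 1, 1, 0]],
    index=["DrugX", "DrugY"],
    columns=[f"fp_{i}" for i in range(4)],
)
labels = pd.DataFrame(
    [[1, 0, np.nan], [np.nan, 1, 0]],
    index=["DrugX", "DrugY"],
    columns=["CL1", "CL2", "CL3"],
)

table = build_pair_table(labels, zscore_per_gene(expr), fps)

with tempfile.TemporaryDirectory() as tmp:
    save_outputs(table, tmp)
    written = sorted(p.name for p in Path(tmp).iterdir())
```

The same build_pair_table call would be repeated once per label matrix (Train, Validhyper, Trainhyper, Test), so every split receives identical feature columns.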
Outputs consumed later
multimodal_dataset_final.pkl (preferred): Model Development & Training.ipynb
multimodal_dataset_final.csv: optional inspection or CPU-only environments
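A downstream notebook could pick between the two files with a small loader like the one below. This is a hedged sketch: load_dataset is a hypothetical helper, and it returns whatever object the pickle contains because the exact array layout is not specified here. The demo writes tiny stand-in files to a temporary directory.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd


def load_dataset(data_dir):
    """Return (data, source): the pickle if present, otherwise the CSV."""
    data_dir = Path(data_dir)
    pkl_path = data_dir / "multimodal_dataset_final.pkl"
    csv_path = data_dir / "multimodal_dataset_final.csv"
    if pkl_path.exists():
        with pkl_path.open("rb") as fh:
            return pickle.load(fh), "pkl"  # only unpickle files you trust
    return pd.read_csv(csv_path), "csv"


# Demo with stand-in files: first only the CSV exists, then the pickle too.
with tempfile.TemporaryDirectory() as tmp:
    csv_file = Path(tmp) / "multimodal_dataset_final.csv"
    pd.DataFrame({"label": [0, 1]}).to_csv(csv_file, index=False)
    csv_data, csv_source = load_dataset(tmp)

    with (Path(tmp) / "multimodal_dataset_final.pkl").open("wb") as fh:
        pickle.dump({"y": [0, 1]}, fh)
    pkl_data, pkl_source = load_dataset(tmp)
```

Because unpickling can execute arbitrary code, the pickle path is only appropriate for files produced by your own run of this notebook.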
Rationale
Z-scoring only the gene features leaves the binary fingerprints untouched while making the gene features comparable in magnitude.
Writing both CSV and Pickle balances transparency (CSV) against speed (Pickle loads 8-10× faster in Colab).
Keeping four split-specific tables avoids accidental data leakage; each split travels through training exactly as originally stratified.
The legacy file multimodal_features_scaled.csv (≈ 4.3 GB) is an earlier export and is no longer used downstream.
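One way to check the leakage rationale above is to count how many (drug, cell line) pairs two label matrices share. The sketch below assumes the same drugs-as-rows, cell-lines-as-columns layout with NaN for unlabelled pairs; the helper names and the toy matrices are made up for illustration. Which splits are expected to be disjoint depends on how they were originally stratified, which is not stated here, so the function only reports counts and asserts nothing.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def labelled_pairs(labels: pd.DataFrame) -> set:
    """Return the (row label, column label) positions that carry a label."""
    rows, cols = np.nonzero(labels.notna().to_numpy())
    return {(labels.index[r], labels.columns[c]) for r, c in zip(rows, cols)}


def split_overlaps(splits: dict) -> dict:
    """Count shared labelled pairs for every pair of splits."""
    pair_sets = {name: labelled_pairs(matrix) for name, matrix in splits.items()}
    return {
        (a, b): len(pair_sets[a] & pair_sets[b])
        for a, b in combinations(pair_sets, 2)
    }


# Toy matrices: 2 drugs x 2 cell lines, NaN marks an unlabelled pair.
idx, cols = ["DrugX", "DrugY"], ["CL1", "CL2"]
toy_splits = {
    "Train": pd.DataFrame([[1, np.nan], [0, 1]], index=idx, columns=cols),
    "Trainhyper": pd.DataFrame([[1, np.nan], [np.nan, np.nan]], index=idx, columns=cols),
    "Test": pd.DataFrame([[np.nan, 0], [np.nan, np.nan]], index=idx, columns=cols),
}

overlaps = split_overlaps(toy_splits)
```

In this toy case the Train and Test matrices share no pairs, while Trainhyper shares one pair with Train; a non-zero count between two splits that should be independent would be worth investigating.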