Data Integration & Preparation.ipynb — Building the multimodal feature matrix

Objective

Combine gene-expression profiles, Morgan fingerprints, and sensitivity labels into a single 2,783-feature table, then save it in both CSV and Pickle formats for fast training in Colab.

Inputs

| File | Drive location | Notes |
| --- | --- | --- |
| GeneExp.csv | GDSC2_drugsens/datasets/features/ | 676 cell lines × 735 cancer-associated genes |
| drug_info.csv | GDSC2_drugsens/datasets/ | 228 compounds with canonical SMILES |
| 4 classification CSVs | …/sensitivity/pivot/clas/ | Train / Validhyper / Trainhyper / Test label matrices |

Morgan fingerprints are generated on-the-fly from the SMILES in drug_info.csv (radius = 2, nBits = 2,048). No extra input file is required.
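As a minimal sketch of this step, the snippet below turns a SMILES string into a 2,048-bit vector using the two RDKit calls named in this page. The helper name `morgan_bits` and the two example molecules are illustrative assumptions, not taken from the notebook; in the real pipeline the SMILES would come from drug_info.csv.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem


def morgan_bits(smiles, radius=2, n_bits=2048):
    """Return a Morgan fingerprint as a 0/1 uint8 vector, or None if parsing fails.

    Hypothetical helper: the name and the None-on-failure policy are assumptions.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when it cannot parse the SMILES
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    bits = np.zeros(n_bits, dtype=np.uint8)
    bits[list(fp.GetOnBits())] = 1  # set only the positions RDKit marks as on
    return bits


# Illustrative molecules only (aspirin and caffeine), not rows of drug_info.csv.
examples = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
}
fingerprints = {name: morgan_bits(smi) for name, smi in examples.items()}
```

Returning `None` for unparsable SMILES makes it easy to spot and drop problematic compounds before the fingerprints are merged with the expression data.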

Processing steps

1. Generate 2,048-bit fingerprints with RDKit (Chem.MolFromSmiles + AllChem.GetMorganFingerprintAsBitVect).
2. Standardise gene features per gene (Z-score).
3. Concatenate modalities
   - 735 Z-scored gene features
   - 2,048 binary fingerprint bits
   - Total = 2,783 features per (Drug, Cell Line) pair.
4. Broadcast features across splits
   - Merge fingerprints and expression into each of the four label matrices (Train, Validhyper, Trainhyper, Test).
5. Serialize datasets
   - multimodal_dataset_final.csv (≈ 1.6 GB): human-readable, gzip-compressible.
   - multimodal_dataset_final.pkl (≈ 6.8 GB): Pickle of NumPy arrays for fast Colab loading.
   - dataset_summary.txt: row/column counts and basic stats.
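The standardise, concatenate, merge, and serialize steps can be sketched on toy data as follows. Everything concrete here is an assumption made for illustration: the column names (`cell_line`, `drug`, `label`, `fp_0`…), the label matrix layout (cell lines as rows, drugs as columns, NaN for pairs outside the split), the population standard deviation (`ddof=0`), the dictionary layout of the pickle, and the toy file names. The actual notebook may differ on each point.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy stand-ins for GeneExp.csv and the fingerprint table (hypothetical IDs and values).
expr = pd.DataFrame(
    [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]],
    index=pd.Index(["CL1", "CL2", "CL3"], name="cell_line"),
    columns=["GENE_A", "GENE_B"],
)
fps = pd.DataFrame(
    [[1, 0, 1, 0], [0, 1, 1, 0]],
    index=pd.Index(["D1", "D2"], name="drug"),
    columns=[f"fp_{i}" for i in range(4)],
)
# One label matrix (e.g. Train), assumed to be cell lines x drugs with NaN = pair not in split.
labels = pd.DataFrame(
    [[1.0, np.nan], [0.0, 1.0], [np.nan, 0.0]],
    index=expr.index,
    columns=fps.index,
)

# Z-score each gene (column) across cell lines; fingerprint bits are left as 0/1.
expr_z = (expr - expr.mean(axis=0)) / expr.std(axis=0, ddof=0)

# Reshape the label matrix to one row per labelled (drug, cell line) pair.
pairs = (
    labels.reset_index()
    .melt(id_vars="cell_line", var_name="drug", value_name="label")
    .dropna(subset=["label"])
)

# Attach the gene block by cell line and the fingerprint block by drug.
dataset = (
    pairs.merge(expr_z, left_on="cell_line", right_index=True)
    .merge(fps, left_on="drug", right_index=True)
    .reset_index(drop=True)
)

feature_cols = list(expr_z.columns) + list(fps.columns)
X = dataset[feature_cols].to_numpy(dtype=np.float32)
y = dataset["label"].to_numpy(dtype=np.int64)

# Write both formats to a temporary directory and read the pickle back.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    dataset.to_csv(out / "toy_dataset.csv", index=False)
    with open(out / "toy_dataset.pkl", "wb") as fh:
        pickle.dump({"X": X, "y": y, "feature_names": feature_cols}, fh)
    with open(out / "toy_dataset.pkl", "rb") as fh:
        restored = pickle.load(fh)

print(X.shape)  # one row per labelled pair, one column per feature
```

With the real inputs, the same merge would yield 735 + 2,048 = 2,783 feature columns per row, and the merge would be repeated once for each of the four label matrices.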

Outputs consumed later

| File | Used by |
| --- | --- |
| multimodal_dataset_final.pkl (preferred) | Model Development & Training.ipynb |
| multimodal_dataset_final.csv | Optional inspection or CPU-only environments |

Rationale

Z-scoring only the gene features leaves the binary fingerprint bits untouched while making the gene features comparable in magnitude.
Writing both CSV and Pickle balances transparency (CSV) against speed (Pickle loads 8-10× faster in Colab).
Keeping four split-specific tables avoids accidental data leakage; each split travels through training exactly as originally stratified.
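One way to sanity-check the leakage point is to look for (drug, cell line) pairs that appear in two splits expected to be disjoint, such as Train and Test. The sketch below assumes long-format split tables with hypothetical `drug` and `cell_line` columns; the helper `shared_pairs` is illustrative and not part of the notebook. This page does not state how Trainhyper and Validhyper relate to Train, so the check is only meaningful for splits known to be disjoint.

```python
import pandas as pd


def shared_pairs(split_a, split_b, keys=("drug", "cell_line")):
    """Return the key pairs that occur in both split tables (hypothetical helper)."""
    keys = list(keys)
    a = split_a[keys].drop_duplicates()
    b = split_b[keys].drop_duplicates()
    return a.merge(b, on=keys, how="inner")


# Toy splits with one deliberately duplicated pair (D1, CL2).
train = pd.DataFrame({"drug": ["D1", "D1", "D2"], "cell_line": ["CL1", "CL2", "CL2"]})
test = pd.DataFrame({"drug": ["D2", "D1"], "cell_line": ["CL3", "CL2"]})

overlap = shared_pairs(train, test)
print(overlap)  # any rows listed here are pairs present in both splits
```

An empty result means the two splits share no pairs; any rows returned point to the exact pairs that would need investigating.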

Legacy file multimodal_features_scaled.csv (≈ 4.3 GB) is an earlier export and is no longer used downstream.

