Key Datasets

Only the files that the notebooks actually read or write are listed here, grouped by stage in the pipeline.

1 Raw source files

| File | Location in Drive | First used in | Purpose |
| --- | --- | --- | --- |
| CCLE_expression.csv | CCLE-DepMap22Q2_geneexp/ | CCLE_expression.ipynb | Full CCLE 22Q2 expression matrix (1,406 cell lines × 19,221 genes). |
| GDSC2_fitted_dose_response_27Oct23.xlsx | GDSC2_drugsens/ | 01_data_prep.ipynb | IC₅₀ values for 228 drugs across 987 cell lines. |
| screened_compounds_rel_8.5.csv | GDSC2_drugsens/ | 01_data_prep.ipynb | Drug metadata, including SMILES strings used for fingerprints. |

2 Cleaned & aligned tables

| File | Location | First used in | What changed |
| --- | --- | --- | --- |
| CCLE_expression_cleaned.csv | CCLE-DepMap22Q2_geneexp/ | 01_data_prep.ipynb | Reduced to 735 cancer-associated genes; headers renamed to gene symbols. |
| drug_info.csv | GDSC2_drugsens/datasets/ | 01_data_prep.ipynb | 228 compounds with validated SMILES only. |
| cell_line_info.csv | GDSC2_drugsens/datasets/ | 01_data_prep.ipynb | CCLE ↔ DepMap–aligned cell-line metadata (676 cell lines). |
| ic50_cleaned.csv | GDSC2_drugsens/datasets/ | 01_data_prep.ipynb | Long-format ln(IC₅₀) table ready for labelling. |
| GeneExp.csv | GDSC2_drugsens/datasets/features/ | 02_gen_dataset.ipynb | Expression matrix trimmed to the 676 aligned cell lines. |
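
The gene-filtering and header-renaming step for CCLE_expression_cleaned.csv could look roughly like the sketch below. It is not the notebook's code: it assumes the raw headers follow a `SYMBOL (EntrezID)` pattern such as `TSPAN6 (7105)`, and the `cell_line` ID column, the helper names, and the three-gene sample are hypothetical stand-ins for the real 735-gene list.

```python
import csv
import io
import re

# Assumed raw header style: "SYMBOL (EntrezID)", e.g. "TSPAN6 (7105)".
_HEADER_RE = re.compile(r"^\s*(?P<symbol>[^\s(]+)\s*\(\d+\)\s*$")


def to_gene_symbol(header: str) -> str:
    """Return the bare gene symbol, or the header unchanged if it does not match."""
    match = _HEADER_RE.match(header)
    return match.group("symbol") if match else header.strip()


def clean_expression(raw_csv: str, keep_genes: set[str]) -> list[dict[str, str]]:
    """Rename gene columns to symbols and keep the ID column plus keep_genes."""
    reader = csv.reader(io.StringIO(raw_csv))
    header = next(reader)
    # The first column is treated as the cell-line identifier and left as-is.
    symbols = [header[0]] + [to_gene_symbol(h) for h in header[1:]]
    kept = [0] + [i for i, s in enumerate(symbols) if i > 0 and s in keep_genes]
    return [{symbols[i]: row[i] for i in kept} for row in reader]


# Tiny made-up input; values and the gene subset are placeholders.
raw = "cell_line,TSPAN6 (7105),TP53 (7157),BRCA1 (672)\nACH-000001,1.2,3.4,0.5\n"
cleaned = clean_expression(raw, keep_genes={"TP53", "BRCA1"})
print(cleaned)
```

In a notebook the same idea is usually a column rename followed by a column selection on a DataFrame; the standard-library version here only keeps the example dependency-free.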

3 Labelled & split datasets

| Files | Location | Notebook | Role |
| --- | --- | --- | --- |
| DrugSens_clas_pivot_train/val/test.csv | …/pivot/clas/ | 02_gen_dataset.ipynb | Binary sensitivity matrices, stratified 60 / 20 / 20. |
| DrugSens_regr_pivot_train/val/test.csv | …/pivot/regr/ | 02_gen_dataset.ipynb | Continuous ln(IC₅₀) matrices for regression experiments. |
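
As an illustration of what a stratified 60 / 20 / 20 split does, the sketch below splits row indices so that each label keeps roughly the same share in train, validation, and test. The table does not say which variable 02_gen_dataset.ipynb stratifies on, so the per-row binary label, the function name `stratified_split`, and the seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split row indices into train/val/test while preserving label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test set
    return sorted(train), sorted(val), sorted(test)


# Hypothetical labels: 30 sensitive (1) and 70 resistant (0) rows.
labels = [1] * 30 + [0] * 70
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting within each label group, rather than over the whole table at once, is what keeps a minority class from being under-represented in the smaller validation and test sets.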

4 Final multimodal feature set

| File | Location | Notebook | Contents |
| --- | --- | --- | --- |
| multimodal_dataset_final.csv | processed_datasets/ | Data Integration & Preparation.ipynb | 108,696 rows × 2,783 features (735 genes + 2,048 fingerprints). |
| multimodal_dataset_final.pkl | processed_datasets/ | Same as above | Same data as a 6.8 GB Pickle for faster loading in Colab. |
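
The 2,783-feature width follows from concatenating one 735-gene expression vector with one 2,048-bit fingerprint per (cell line, drug) pair. The sketch below shows that assembly under stated assumptions: the dictionary layout, the `build_row` helper, the identifiers, and the zero-filled vectors are hypothetical placeholders, and skipping pairs that lack either modality is a choice made for the example, not a description of the integration notebook.

```python
N_GENES = 735      # expression features per cell line
N_FP_BITS = 2048   # fingerprint bits per drug


def build_row(cell_line, drug, ln_ic50, expression, fingerprints):
    """Concatenate expression and fingerprint features for one (cell line, drug) pair."""
    genes = expression[cell_line]
    bits = fingerprints[drug]
    if len(genes) != N_GENES or len(bits) != N_FP_BITS:
        raise ValueError("unexpected feature vector length")
    return {
        "cell_line": cell_line,
        "drug": drug,
        "features": list(genes) + list(bits),
        "ln_ic50": ln_ic50,
    }


# Placeholder inputs; real values would come from GeneExp.csv, drug_info.csv,
# and ic50_cleaned.csv.
expression = {"ACH-000001": [0.0] * N_GENES}
fingerprints = {"drug_A": [0] * N_FP_BITS}
ic50_long = [("ACH-000001", "drug_A", 1.5), ("ACH-000002", "drug_A", 0.3)]

# Keep only pairs for which both modalities are available.
rows = [
    build_row(c, d, y, expression, fingerprints)
    for c, d, y in ic50_long
    if c in expression and d in fingerprints
]
print(len(rows), len(rows[0]["features"]))
```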

All datasets and the trained model checkpoint live exclusively in the shared Google Drive folder, keeping the GitHub repository lightweight.
