Key Datasets
Only the files that the notebooks actually read or write are listed here, grouped by stage in the pipeline.
1 Raw source files
| File | Location | Notebook | Description |
| --- | --- | --- | --- |
| CCLE_expression.csv | CCLE-DepMap22Q2_geneexp/ | CCLE_expression.ipynb | Full CCLE 22Q2 expression matrix (1,406 cell lines × 19,221 genes). |
| GDSC2_fitted_dose_response_27Oct23.xlsx | GDSC2_drugsens/ | 01_data_prep.ipynb | IC₅₀ values for 228 drugs across 987 cell lines. |
| screened_compounds_rel_8.5.csv | GDSC2_drugsens/ | 01_data_prep.ipynb | Drug metadata, including SMILES strings used for fingerprints. |
2 Cleaned & aligned tables
| File | Location | Notebook | Description |
| --- | --- | --- | --- |
| CCLE_expression_cleaned.csv | CCLE-DepMap22Q2_geneexp/ | 01_data_prep.ipynb | Reduced to 735 cancer-associated genes, renamed headers to gene symbols. |
| drug_info.csv | GDSC2_drugsens/datasets/ | 01_data_prep.ipynb | 228 compounds with validated SMILES only. |
| cell_line_info.csv | GDSC2_drugsens/datasets/ | 01_data_prep.ipynb | CCLE ↔ DepMap–aligned cell-line metadata (676 lines). |
| ic50_cleaned.csv | GDSC2_drugsens/datasets/ | 01_data_prep.ipynb | Long-format ln(IC₅₀) table ready for labelling. |
| GeneExp.csv | GDSC2_drugsens/datasets/features/ | 02_gen_dataset.ipynb | Expression matrix trimmed to the 676 aligned cell lines. |
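
The header cleanup described for CCLE_expression_cleaned.csv can be sketched as follows. This is a minimal illustration, not the code in 01_data_prep.ipynb: it assumes the raw headers follow the "SYMBOL (EntrezID)" pattern, and the function name `clean_expression`, the toy values, and the gene list are hypothetical.

```python
import re

import pandas as pd


def clean_expression(expr: pd.DataFrame, keep_genes: list[str]) -> pd.DataFrame:
    """Rename 'SYMBOL (EntrezID)' headers to 'SYMBOL' and keep selected genes.

    Assumption: raw column headers look like 'TP53 (7157)'.
    Genes in keep_genes that are absent from the matrix are skipped.
    """
    renamed = expr.rename(columns=lambda c: re.sub(r"\s*\(\d+\)$", "", c))
    present = [gene for gene in keep_genes if gene in renamed.columns]
    return renamed[present]


# Toy matrix: rows are cell lines, columns are genes (values are made up).
toy = pd.DataFrame(
    {
        "TP53 (7157)": [5.1, 4.8],
        "BRCA1 (672)": [3.2, 3.9],
        "ACTB (60)": [11.0, 10.7],
    },
    index=["ACH-000001", "ACH-000002"],
)

# Hypothetical gene list; the project uses 735 cancer-associated genes.
cleaned = clean_expression(toy, ["TP53", "BRCA1", "KRAS"])
print(list(cleaned.columns))
```

Trimming to the aligned cell lines is then a row selection on the same frame, for example `cleaned.loc[cleaned.index.intersection(aligned_ids)]`, where `aligned_ids` stands for whatever identifier column cell_line_info.csv provides.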
3 Labelled & split datasets
| File | Location | Notebook | Description |
| --- | --- | --- | --- |
| DrugSens_clas_pivot_train/val/test.csv | …/pivot/clas/ | 02_gen_dataset.ipynb | Binary sensitivity matrices, stratified 60 / 20 / 20. |
| DrugSens_regr_pivot_train/val/test.csv | …/pivot/regr/ | 02_gen_dataset.ipynb | Continuous ln(IC₅₀) matrices for regression experiments. |
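
A stratified 60 / 20 / 20 split keeps the class balance roughly equal across the three partitions. The sketch below shows the idea with the standard library only. It is an assumption about how such a split could be done, not the code in 02_gen_dataset.ipynb; the function name `stratified_split`, the seed, and the toy labels are hypothetical, and the notebook may stratify on a different key.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy labels: 30 sensitive (1) and 70 resistant (0) samples.
labels = [1] * 30 + [0] * 70
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting each class separately is what makes the split stratified: every partition ends up with the same 30 % share of sensitive samples as the full toy set.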
4 Final multimodal feature set
| File | Location | Notebook | Description |
| --- | --- | --- | --- |
| multimodal_dataset_final.csv | processed_datasets/ | Data Integration & Preparation.ipynb | 108,696 rows × 2,783 features (735 genes + 2,048 fingerprints). |
| multimodal_dataset_final.pkl | processed_datasets/ | same as above | Same data as a 6.8 GB Pickle for faster loading in Colab. |
All datasets and the trained model checkpoint live exclusively in the shared Google Drive folder, keeping the GitHub repository lightweight.