Download
Version 2
Download the data set in the format you prefer...
- Download data as Microsoft Excel Sheet or CSV data file including
- CAS_NO: CAS Number if available, otherwise identifier
- Source: click here for details on sources
- Activity: 0 indicates negative compounds, 1 indicates positive compounds
- Steroid: 1 indicates steroids
- WDI: name in the world drug index (if listed)
- Canonical_Smiles: canonical SMILES string (stereochemistry is not always specified using @)
- REFERENCE: reference to source paper
- Download SMILES including
- Canonical_Smiles: canonical SMILES string (stereochemistry is not always specified using @)
- CAS_NO: CAS Number if available, otherwise identifier
- Activity: 0 indicates negative compounds, 1 indicates positive compounds
- Download SD-File including the CAS number (CAS_NO). Not all software processes this SD-File correctly. If you run into problems, use the SMILES file instead, which has been reported to load correctly into a variety of tools.
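As an illustration of the column layout described above, the following is a minimal sketch of reading the CSV data file with Python's standard library. It assumes the file has a header row using the column names listed above and a comma as delimiter; the two sample rows (identifiers, labels and SMILES) are made-up placeholders, not entries from the data set.

```python
import csv
import io

# Made-up placeholder rows; the real file is ASSUMED to carry a header row
# with the column names listed on this page and to be comma-delimited.
SAMPLE = """CAS_NO,Activity,Steroid,Canonical_Smiles
EXAMPLE-1,1,0,C=O
EXAMPLE-2,0,0,CCO
"""

def load_compounds(handle):
    """Return (cas_no, smiles, activity) tuples in file order."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append((
            record["CAS_NO"],
            record["Canonical_Smiles"],
            int(record["Activity"]),  # 0 = negative, 1 = positive
        ))
    return rows

compounds = load_compounds(io.StringIO(SAMPLE))
n_positive = sum(activity for _, _, activity in compounds)
print(len(compounds), "compounds,", n_positive, "positive")
```

Keeping the rows in file order matters, because the cross-validation files refer to compounds by their position in the data files.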
Download the cross validation information...
If you use this data set to evaluate a prediction algorithm please use the predefined cross validation splits:
- We provide splits for a single 5-fold cross validation (files above)
- Each training set includes a static training set that never appears in any test set, since these compounds are known to standard commercial software.
- The index used in the csv files references the order of compounds in the data files
- test.csv: consists of 5 lines; each line describes one test set as a comma-separated list of indices into the compound files; each test set contains about 990 compounds
- train.csv: consists of 5 lines, each line describes a training set corresponding to the respective test set in test.csv
- Do not apply any model selection to the whole data set. Please perform scaling and feature selection on each training set of each trial separately to avoid introducing a positive bias.
- We optimized our models with respect to the AUC. For comparable results, we recommend using the same optimization criterion.
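The protocol above can be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions, not the authors' pipeline: indices are assumed to be zero-based (check this against the real files), the split contents, feature values and labels are made up, and a single scaled feature stands in for a model score. The point is that scaling parameters are estimated from each training set only and the AUC is computed per test set.

```python
import csv
import io
import statistics

def read_splits(handle):
    """Each non-empty line is one split: comma-separated compound indices."""
    return [[int(tok) for tok in row if tok.strip()]
            for row in csv.reader(handle) if row]

def fit_scaler(values):
    """Mean and standard deviation estimated from training values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against constant features
    return mean, std

def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties 0.5).

    Quadratic in the number of compounds, which is fine for test sets of
    about 1000 compounds.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy stand-ins for train.csv / test.csv and the data (all values made up).
train_csv = "0,1,2,3\n2,3,4,5\n"
test_csv = "4,5\n0,1\n"
feature = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
label = [0, 1, 0, 1, 0, 1]

train_sets = read_splits(io.StringIO(train_csv))
test_sets = read_splits(io.StringIO(test_csv))

fold_aucs = []
for train_idx, test_idx in zip(train_sets, test_sets):
    # Scaling parameters come from the training compounds of this fold only.
    mean, std = fit_scaler([feature[i] for i in train_idx])
    scores = [(feature[i] - mean) / std for i in test_idx]
    fold_aucs.append(auc([label[i] for i in test_idx], scores))

print(fold_aucs)
```

With the real files, `train_csv` and `test_csv` would be replaced by open file handles, and the scaled feature by the predictions of a model fitted on the training compounds of the same line.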
We'll be happy to add your findings and publications to the
list of results. Just send an email to .
If you use this data set for publications please cite our paper:
Katja Hansen, Sebastian Mika, Timon Schroeter, Andreas Sutter, Antonius ter Laak, Thomas Steger-Hartmann, Nikolaus Heinrich and Klaus-Robert Müller. Benchmark Data Set for in Silico Prediction of Ames Mutagenicity.
Journal of Chemical Information and Modeling, DOI 10.1021/ci900161g
BibTex for your convenience:
@Article{ToxBenchmark2009,
title = {Benchmark Data Set for in Silico Prediction of Ames Mutagenicity},
author = {Katja Hansen and Sebastian Mika and Timon Schroeter and Andreas Sutter and Antonius ter Laak
and Thomas {Steger-Hartmann} and Nikolaus Heinrich and {Klaus-Robert} M\"uller},
url = {http://dx.doi.org/10.1021/ci900161g},
doi = {10.1021/ci900161g},
journal = {Journal of Chemical Information and Modeling},
year = 2009,
}
Download Previous Versions
Version 1
Download the data set in the format you prefer...
- Download data as Microsoft Excel Sheet or CSV data file including
- CAS_NO: CAS Number if available, otherwise identifier
- Source: click here for details on sources
- Activity: 0 indicates negative compounds, 1 indicates positive compounds
- Steroid: 1 indicates steroids
- WDI: name in the world drug index (if listed)
- Canonical_Smiles: canonical SMILES string (stereochemistry is not always specified using @)
- REFERENCE: reference to source paper
- Download SMILES including
- Canonical_Smiles: canonical SMILES string (stereochemistry is not always specified using @)
- CAS_NO: CAS Number if available, otherwise identifier
- Activity: 0 indicates negative compounds, 1 indicates positive compounds
- Download SD-File including CAS_NO Number
- Precalculated descriptors using joelib2 are available in SD-File format for 7083 compounds:
- Archive (877MB) with joelib2 descriptors in 7083 SD-Files (one for each compound).
- SMILES of the seven missing compounds
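For tools that struggle with the SD-Files, the data fields can also be read directly. The sketch below relies only on the general SD-File layout (a data header line starting with `>` and naming the field in angle brackets, followed by value lines and a blank line, with `$$$$` closing each record). The field names `CAS_NO` and `SomeDescriptor` and the sample record are hypothetical; the tags actually used in the downloadable files may differ.

```python
import re

# A data header looks like "> <FIELD_NAME>"; the name is captured.
FIELD_HEADER = re.compile(r"^>.*<([^>]+)>")

def read_sdf_fields(text):
    """Return one dict of data fields per record; the mol block is ignored.

    Simplification: a title line that itself starts with '>' would be
    mistaken for a data header.
    """
    records = []
    fields, name = {}, None
    for line in text.splitlines():
        if line.strip() == "$$$$":          # end of record
            records.append({k: "\n".join(v) for k, v in fields.items()})
            fields, name = {}, None
            continue
        match = FIELD_HEADER.match(line)
        if match:
            name = match.group(1)
            fields[name] = []
        elif name is not None:
            if line.strip():
                fields[name].append(line.strip())
            else:
                name = None                  # blank line closes the field
    return records

# Made-up record with an empty mol block and hypothetical field names.
SAMPLE_SDF = """EXAMPLE-1
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <CAS_NO>
EXAMPLE-1

> <SomeDescriptor>
1.5

$$$$
"""

records = read_sdf_fields(SAMPLE_SDF)
print(records[0]["CAS_NO"], float(records[0]["SomeDescriptor"]))
```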
Download the cross validation information...
If you use this data set to evaluate a prediction algorithm please use the predefined cross validation splits:
- We provide splits for a 10-times 3-fold cross validation (files: test.csv, train.csv)
- In total, 30 prediction models are trained (10 repetitions of 3 folds each)
- The index used in the csv files references the order of compounds in the data files
- test.csv: consists of 30 lines; each line describes one test set as a comma-separated list of indices into the compound files; each test set contains about 2364 compounds
- train.csv: consists of 30 lines, each line describes a training set corresponding to the respective test set in test.csv
- Do not apply any model selection to the whole data set. Please perform scaling and feature selection on each training set of each trial separately to avoid introducing a positive bias.
- We optimized our models with respect to the AUC. For comparable results, we recommend using the same optimization criterion.
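When working with the 30 splits, it can help to sanity-check their structure and to summarize the 30 AUC values. The sketch below assumes that consecutive lines of test.csv belong to the same repetition (three lines per repetition); this ordering is an assumption that should be verified on the real file, for example with the disjointness check shown here. The test sets and AUC values are made-up placeholders.

```python
import statistics

def group_repetitions(test_sets, n_folds=3):
    """Chunk consecutive test sets into repetitions of n_folds folds each."""
    if len(test_sets) % n_folds:
        raise ValueError("number of test sets is not a multiple of n_folds")
    return [test_sets[i:i + n_folds]
            for i in range(0, len(test_sets), n_folds)]

def folds_are_disjoint(repetition):
    """True if no compound index appears in more than one fold."""
    seen = set()
    for fold in repetition:
        if seen.intersection(fold):
            return False
        seen.update(fold)
    return True

def summarize(aucs):
    """Mean and sample standard deviation over all trained models."""
    return statistics.fmean(aucs), statistics.stdev(aucs)

# Toy example: 2 repetitions of a 3-fold split over 6 compounds (made up).
toy_test_sets = [[0, 1], [2, 3], [4, 5], [1, 4], [0, 3], [2, 5]]
repetitions = group_repetitions(toy_test_sets)
all_disjoint = all(folds_are_disjoint(rep) for rep in repetitions)

# Placeholder AUC values, one per trained model.
mean_auc, std_auc = summarize([0.80, 0.82, 0.84])
print(len(repetitions), all_disjoint, round(mean_auc, 3), round(std_auc, 3))
```

If the disjointness check fails on the real file, the lines are probably ordered differently and the grouping should be adapted before reporting per-repetition results.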