Download

Version 2

Download the data set in the format you prefer...
  • Download data as Microsoft Excel Sheet or CSV data file including

    1. CAS_NO: CAS Number if available, otherwise identifier
    2. Source: click here for details on sources
    3. Activity: 0 indicates negative compounds, 1 indicates positive compounds
    4. Steroid: 1 indicates steroids
    5. WDI: name in the world drug index (if listed)
    6. Canonical_Smiles: canonical SMILES string (chirality is not always specified with @)
    7. REFERENCE: reference to source paper

  • Download SMILES including

    1. Canonical_Smiles: canonical SMILES string (chirality is not always specified with @)
    2. CAS_NO: CAS Number if available, otherwise identifier
    3. Activity: 0 indicates negative compounds, 1 indicates positive compounds

  • Download SD-File including CAS_NO (Not all software processes this SD-File correctly. If yours does not, please use the SMILES file instead, which has been reported to load correctly into a variety of tools.)
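The three-column SMILES layout listed above can be read with a few lines of Python. This is only a minimal sketch: it assumes the fields are separated by whitespace and appear in the listed order (Canonical_Smiles, CAS_NO, Activity), which the page does not state. The names `Compound` and `parse_smiles_lines` are hypothetical, and the sample rows are placeholders, not entries from the data set.

```python
from typing import Iterable, List, NamedTuple


class Compound(NamedTuple):
    smiles: str    # Canonical_Smiles
    cas_no: str    # CAS_NO, or another identifier if no CAS number exists
    activity: int  # 0 = negative, 1 = positive


def parse_smiles_lines(lines: Iterable[str]) -> List[Compound]:
    """Parse rows of the form '<smiles> <cas_no> <activity>'.

    Assumption: whitespace-separated fields in the order listed above.
    Adjust the split if the downloaded file uses another delimiter.
    """
    compounds = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or fields[2] not in ("0", "1"):
            continue  # skip blank lines, header rows, and malformed rows
        smiles, cas_no, activity = fields
        compounds.append(Compound(smiles, cas_no, int(activity)))
    return compounds


# Placeholder rows for illustration only (identifiers and labels are made up).
sample = [
    "Canonical_Smiles\tCAS_NO\tActivity",
    "CCO\texample-1\t0",
    "c1ccccc1N\texample-2\t1",
]
compounds = parse_smiles_lines(sample)
print(len(compounds), "compounds,", sum(c.activity for c in compounds), "positive")
```

For a real run, the list `sample` would be replaced by the open file handle of the downloaded SMILES file.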
Download the cross-validation information... If you use this data set to evaluate a prediction algorithm, please use the predefined cross-validation splits:
  • We provide splits for a single 5-fold cross-validation (files above)
  • Each training set includes a static training subset that is never part of any test set, because these compounds are already known to standard commercial software.
  • The index used in the csv files references the order of compounds in the data files
  • test.csv: consists of 5 lines; each line describes one test set using the indices of the compounds in the data files; the indices are separated by commas; each test set contains about 990 compounds
  • train.csv: consists of 5 lines, each line describes a training set corresponding to the respective test set in test.csv
  • Do not apply any model selection to the whole data set. Please perform scaling and feature selection on each training set of each trial separately to avoid introducing a positive bias.
  • We optimized our models with respect to the AUC. For comparable results, we recommend using the same optimization criterion.
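The protocol above can be sketched in Python as follows: read one line of indices per fold, estimate the scaling parameters from the training compounds only, and report the AUC on the matching test set. This is an illustration under stated assumptions, not the authors' code. It assumes the indices are 0-based (the page does not say), uses a single toy descriptor with made-up labels, and lets the standardized descriptor stand in for a real model. All function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def parse_split_line(line: str) -> List[int]:
    """Turn one line of train.csv or test.csv into a list of compound indices."""
    return [int(tok) for tok in line.strip().split(",") if tok.strip()]


def fit_scaler(column: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard deviation estimated from training values only."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0  # fall back to 1.0 for constant features
    return mu, sigma


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count one half. Quadratic in the set size, which is fine for a
    sketch. Requires at least one positive and one negative label.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy stand-ins: one descriptor value and one label per compound.
feature = [0.1, 0.9, 0.4, 0.8, 0.3, 0.7]
label = [0, 1, 0, 1, 0, 1]
train_lines = ["0,1,2,3", "2,3,4,5"]  # hypothetical contents of train.csv
test_lines = ["4,5", "0,1"]           # hypothetical contents of test.csv

fold_aucs = []
for train_line, test_line in zip(train_lines, test_lines):
    train_idx = parse_split_line(train_line)
    test_idx = parse_split_line(test_line)
    # Scaling parameters come from the training set of this fold only.
    mu, sigma = fit_scaler([feature[i] for i in train_idx])
    # Placeholder "model": the standardized descriptor itself is the score.
    scores = [(feature[i] - mu) / sigma for i in test_idx]
    fold_aucs.append(auc([label[i] for i in test_idx], scores))

print("mean AUC over folds:", mean(fold_aucs))
```

The point of the loop is that `fit_scaler` never sees a test compound. Any feature selection or hyperparameter search would go in the same place, inside the fold and restricted to `train_idx`.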

We'll be happy to add your findings and publications to the list of results. Just send an email to .

Publication

If you use this data set for publications please cite our paper:

Katja Hansen, Sebastian Mika, Timon Schroeter, Andreas Sutter, Antonius ter Laak, Thomas Steger-Hartmann, Nikolaus Heinrich and Klaus-Robert Müller. Benchmark Data Set for in Silico Prediction of Ames Mutagenicity.
Journal of Chemical Information and Modeling, DOI 10.1021/ci900161g

BibTeX for your convenience:
@Article{ToxBenchmark2009,
	title = {Benchmark Data Set for in Silico Prediction of Ames Mutagenicity},
	author = {Katja Hansen and Sebastian Mika and Timon Schroeter and Andreas Sutter and Antonius ter Laak 
		and Thomas {Steger-Hartmann} and Nikolaus Heinrich and {Klaus-Robert} M\"uller},
	url = {http://dx.doi.org/10.1021/ci900161g},
	doi = {10.1021/ci900161g},
	journal = {Journal of Chemical Information and Modeling},
	year = 2009,
}

Download Previous Versions

Version 1

Download the data set in the format you prefer...
  • Download data as Microsoft Excel Sheet or CSV data file including

    1. CAS_NO: CAS Number if available, otherwise identifier
    2. Source: click here for details on sources
    3. Activity: 0 indicates negative compounds, 1 indicates positive compounds
    4. Steroid: 1 indicates steroids
    5. WDI: name in the world drug index (if listed)
    6. Canonical_Smiles: canonical SMILES string (chirality is not always specified with @)
    7. REFERENCE: reference to source paper

  • Download SMILES including

    1. Canonical_Smiles: canonical SMILES string (chirality is not always specified with @)
    2. CAS_NO: CAS Number if available, otherwise identifier
    3. Activity: 0 indicates negative compounds, 1 indicates positive compounds

  • Download SD-File including CAS_NO Number
  • Precalculated descriptors using joelib2 are available in SD-File format for 7083 compounds:

    1. Archive (877MB) with joelib2 descriptors in 7083 SD-Files (one for each compound).
    2. SMILES of the seven missing compounds

Download the cross-validation information... If you use this data set to evaluate a prediction algorithm, please use the predefined cross-validation splits:
  • We provide splits for 10 repetitions of a 3-fold cross-validation (files: test.csv, train.csv)
  • In total, 30 prediction models are trained (10 repetitions × 3 folds)
  • The index used in the csv files references the order of compounds in the data files
  • test.csv: consists of 30 lines; each line describes one test set using the indices of the compounds in the data files; the indices are separated by commas; each test set contains about 2364 compounds
  • train.csv: consists of 30 lines, each line describes a training set corresponding to the respective test set in test.csv
  • Do not apply any model selection to the whole data set. Please perform scaling and feature selection on each training set of each trial separately to avoid introducing a positive bias.
  • We optimized our models with respect to the AUC. For comparable results, we recommend using the same optimization criterion.
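With 30 train/test pairs it can help to verify the split files before training and to report one summary over all 30 models afterwards. The sketch below checks that no compound index appears in both the training and the test set of the same line, then summarizes a list of per-model AUC values by mean and sample standard deviation. The function names are hypothetical, and the index sets and AUC values in the usage part are placeholders, not results.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def check_splits(train_sets: Sequence[Sequence[int]],
                 test_sets: Sequence[Sequence[int]]) -> None:
    """Raise ValueError if a train/test pair shares a compound index."""
    if len(train_sets) != len(test_sets):
        raise ValueError("train.csv and test.csv differ in line count")
    for k, (train, test) in enumerate(zip(train_sets, test_sets)):
        overlap = set(train) & set(test)
        if overlap:
            raise ValueError(f"line {k}: {len(overlap)} indices in both sets")


def summarize_auc(aucs: Sequence[float]) -> Dict[str, float]:
    """Mean and sample standard deviation of the per-model AUC values."""
    return {"n_models": len(aucs), "mean": mean(aucs), "std": stdev(aucs)}


# Placeholder index sets and AUC values for illustration only.
toy_train = [[0, 1, 2, 3], [2, 3, 4, 5], [0, 1, 4, 5]]
toy_test = [[4, 5], [0, 1], [2, 3]]
check_splits(toy_train, toy_test)  # passes silently when the sets are disjoint

summary = summarize_auc([0.80, 0.82, 0.84])
print(summary)
```

For the real files, `toy_train` and `toy_test` would each hold 30 index lists, one per line of train.csv and test.csv.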

We'll be happy to add your findings and publications to the list of results. Just send an email to .