GSK-3 Inhibitors: The Dataset

Here we provide the dataset of GSK-3 ATP-competitive inhibitors and non-inhibitors that was used in several of our studies. We hope it will be useful to other researchers in the field of chemoinformatics; the main features of this Dataset are the wide activity range of the compounds and the significant diversity of the actives (true inhibitors). The Dataset was manually constructed and checked by Dmitry Osolodkin and Daria Tsareva. To our knowledge it contains no mistakes, but nobody's perfect and we cannot guarantee their total absence (please email Dmitry if you find any).

The available version of the Dataset corresponds to the one used in the CBDD and BMCL papers and consists of 1685 compounds, published mostly in peer-reviewed journals before August 2010 (it also includes data from a single patent; the full list of references can be found in Table S3 of the CBDD paper Supplementary). Non-ATP-competitive inhibitors and nice half-sandwich ruthenium complexes were not included. Further versions may appear some day, but this is not guaranteed. You are free to expand this Dataset with newer data.

The SDF file was prepared with InstantJChem 5.3. In cases of ambiguity, structures were checked against the IUPAC names reported in the Experimental sections of the corresponding sources. The following fields are present:

CdId — internal number of a compound added by InstantJChem;
Mol Weight — molecular weight as calculated by InstantJChem;
Formula is also automatically added by InstantJChem;
IC50 is given in nM (sic!) as reported in the source article. Like the other raw activity fields, it is a text field, so inactive compounds carry values such as '>10000';
Article contains a short reference to the source of the data (the corresponding full references are given in the aforementioned Table S3);
Article_Name is a unique identifier (usually based on the family name of the first author; it always includes the reference number used inside the source paper);
cATP is the concentration of ATP used in the assay;
Ki20 was provided in several sources and was usually obtained from experimental IC50 values with the Cheng-Prusoff equation, using K_m = 20 μM. We could not find the origin of this K_m value, and in several cases the K_m value was not reported at all;
Ki is the experimental inhibition constant reported in certain sources;
pIC50 — experimental data were provided in this form in certain sources;
percent is the percent inhibition (usually at 10 μM) reported in certain sources;
Active is a Boolean field with three values: TRUE (IC50 < 10 μM), FALSE (other situations when activity was determined) and NULL (activity not determined). Users can adjust the threshold as they see fit.
Measured is a Boolean field with two values: TRUE (data reported as a single value, e.g., IC50 = 1 nM), and FALSE (data reported as something like IC50 > 10 μM).
pKi_calc is a numeric field containing the pKi20 values that we calculated for every compound for which this was possible. Zero was assigned to this field in all other cases, so a zero here marks a missing value rather than a measured one.
pIC50_calc is similar to the above, but the Cheng-Prusoff equation was not applied: it is simply -log of the raw IC50 value, stored as a numeric field (most useful for QSAR studies inside the subsets).
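
The relations between the raw text fields and the derived ones can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure we actually ran: the function names are hypothetical, the ATP concentration is assumed to be given in μM, the competitive form of the Cheng-Prusoff equation, Ki = IC50 / (1 + [ATP]/K_m), is assumed, and the negative logarithm is assumed to be taken after converting nM to mol/L. Compare the results with the stored pKi_calc and pIC50_calc columns before relying on them.

```python
import math
import re

KM_ATP_UM = 20.0                # K_m for ATP (uM) behind the Ki20 field
ACTIVE_THRESHOLD_NM = 10_000.0  # 10 uM threshold of the Active field


def parse_ic50(text):
    """Split a raw IC50 string (nM) into (value, qualifier).

    '100' gives (100.0, ''), '>10000' gives (10000.0, '>'),
    and anything unparseable gives (None, '').
    """
    match = re.fullmatch(r"\s*([<>]?)=?\s*(\d+(?:\.\d+)?)\s*", text or "")
    if match is None:
        return None, ""
    qualifier, number = match.groups()
    return float(number), qualifier


def is_active(ic50_nm, qualifier, threshold_nm=ACTIVE_THRESHOLD_NM):
    """Return True, False or None, mimicking the three-valued Active field."""
    if ic50_nm is None:
        return None   # activity not determined
    if qualifier == ">":
        return False  # only a lower bound is known: treated as inactive
    if qualifier == "<":
        return ic50_nm <= threshold_nm
    return ic50_nm < threshold_nm


def cheng_prusoff_ki(ic50_nm, atp_um, km_um=KM_ATP_UM):
    """Ki (nM) of a competitive inhibitor from IC50 (nM) and [ATP] (uM)."""
    return ic50_nm / (1.0 + atp_um / km_um)


def p_value(conc_nm):
    """-log10 of a concentration in nM after conversion to mol/L (assumed)."""
    return 9.0 - math.log10(conc_nm)


if __name__ == "__main__":
    for raw in ("100", ">10000", "n.d."):
        value, qualifier = parse_ic50(raw)
        measured = value is not None and qualifier == ""
        print(raw, value, measured, is_active(value, qualifier))
    ki_nm = cheng_prusoff_ki(100.0, atp_um=20.0)
    print(ki_nm, p_value(ki_nm), p_value(100.0))
```

With [ATP] equal to K_m the correction factor is exactly 2, so an IC50 of 100 nM corresponds to a Ki of 50 nM. Moving the activity threshold only requires passing a different threshold_nm value to is_active.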

If you find this Dataset useful and want to utilise it in your research, please cite our CBDD paper (Osolodkin D. I., Palyulin V. A., Zefirov N. S. Structure-based virtual screening of glycogen synthase kinase 3β inhibitors: Analysis of scoring functions applied to large true actives and decoy sets. Chemical Biology & Drug Design. 2011, 78, 378-390) and this page.