In a recent PLOS Computational Biology paper, Gerard van Westen, Anna Gaulton and John Overington showed how machine learning techniques can be used to generate predictive models of allosterism.
Allosterism, i.e. the biological activity of a compound that binds at a site other than the natural ligand site, is viewed as a desirable drug design strategy that could produce fewer side effects and greater selectivity. The impressive public-access resource ChEMBL was filtered to produce a high-quality data set of allosteric and non-allosteric compounds, which was then used to build random forest models for allosterism with the R statistical package. These models typically showed good accuracy (over 82% specificity and selectivity on test data).
As ChEMBL and Gerard van Westen have generously made available the raw data on compounds, classes and descriptors, this modelling can be reproduced. This has been done for the L0 (all protein-binding compounds) data, giving the same results.
Further work, using the open-source RDKit descriptors and the Knime machine learning nodes, has also been performed and gave predictive models of the same quality. In the typical case, with a 70:30 training:test set split, the combined total classification error for the test set was below 18%.
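To make the evaluation set-up concrete, here is a minimal sketch of a 70:30 split and of the total classification error computed on the held-out part. It uses only the Python standard library and toy labels, not the ChEMBL data. The function names `stratified_split` and `total_classification_error` are illustrative, and splitting each class separately (stratification) is an assumption, as the text above only states the 70:30 ratio.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class separately.

    Stratification is an assumption made for this sketch; the source
    only specifies a 70:30 training:test ratio.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


def total_classification_error(y_true, y_pred):
    """Fraction of compounds whose predicted class differs from the true class."""
    if not y_true:
        raise ValueError("empty test set")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Toy labels only: 10 allosteric and 10 non-allosteric compounds.
labels = ["allosteric"] * 10 + ["non-allosteric"] * 10
train_idx, test_idx = stratified_split(labels)
```

In a real run the test-set predictions would come from the trained random forest. The error function simply counts misclassified compounds in both classes together, which is how "combined total classification error" is read here.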
Such models may be used to filter or prioritise compound collections for allosteric compounds, either for particular screening campaigns or for compound collection enrichment.
The Knime-generated models also provide a prediction confidence value for each molecule. In a test case, a small number of compounds from a commercial screening collection were found to be present in the ChEMBL data set.
Models built using the same strategy (van Westen classification, RDKit descriptors, Knime machine learning nodes, 70:30 training:test set split), but with the overlap compounds excluded, were robust, with a low total prediction error of 14% for compounds with a prediction confidence > 0.67.
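The two steps described here, excluding overlap compounds and reporting the error only for confidently predicted compounds, can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Knime workflow itself. The record fields `key`, `predicted`, `actual` and `confidence` are hypothetical, the structure key could be any canonical identifier (for example an InChIKey), and the confidence is assumed to be a value between 0 and 1 attached to each prediction.

```python
def drop_overlap(records, reference_keys):
    """Remove records whose structure key also occurs in the reference set.

    'key' is a hypothetical field holding a canonical structure identifier.
    """
    return [r for r in records if r["key"] not in reference_keys]


def error_above_confidence(predictions, threshold=0.67):
    """Return (error, coverage) for predictions with confidence > threshold.

    error    : fraction of retained predictions that are wrong
    coverage : fraction of all predictions that were retained
    """
    if not predictions:
        raise ValueError("no predictions supplied")
    kept = [p for p in predictions if p["confidence"] > threshold]
    if not kept:
        return None, 0.0
    wrong = sum(1 for p in kept if p["predicted"] != p["actual"])
    return wrong / len(kept), len(kept) / len(predictions)


# Toy records only, not real screening or ChEMBL data.
training = [{"key": "A"}, {"key": "B"}, {"key": "C"}]
screening_keys = {"B"}
training_no_overlap = drop_overlap(training, screening_keys)

toy_predictions = [
    {"predicted": "allosteric", "actual": "allosteric", "confidence": 0.90},
    {"predicted": "allosteric", "actual": "non-allosteric", "confidence": 0.80},
    {"predicted": "non-allosteric", "actual": "allosteric", "confidence": 0.60},
    {"predicted": "non-allosteric", "actual": "non-allosteric", "confidence": 0.70},
]
error, coverage = error_above_confidence(toy_predictions)
```

Reporting the coverage alongside the error is a design choice of this sketch: a confidence cut-off lowers the error partly by discarding compounds, so it helps to know how many remain.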