Figure 1. Chemical space plots illustrating the exploration of a library of 266 compounds, with experimental data for potency against the 5HT1a target. In these chemical space plots, the proximity of two points represents the structural similarity between the corresponding compounds. The plots were generated from Tanimoto similarities between 2D fingerprints, projected using a PCA algorithm. (a) The chemical space of the library colored by potency. A number of regions can be seen containing potent compounds, including a circled example. Illustrative compound structures are also shown. (b) The same library, colored by probabilistic score against the property profile shown in Figure 2. Here it can be seen that the highlighted area of potent compounds is unlikely to achieve the required ADMET properties; however, an alternative area of chemistry is more likely to achieve a good balance of potency and ADMET properties. (c) The selected top-scoring compounds, also shown in Figure 2, are highlighted. These are the compounds that cannot be confidently distinguished from the top-scoring compound, indicating that several regions should be sampled in order to confidently identify the best lead series from this library. (d) The result of automatic expansion around representative members of the initial library to explore the properties of related compounds. These virtual compounds have been scored based on their ADMET properties only. Here we can see that, in the previously highlighted region with good potency but poor ADMET properties, it is possible to access compounds with the required property profile. This suggests that it may be valuable to explore further optimization of this potent chemistry.

The lead identification (LI) phase of drug discovery plays a critical role in determining the ultimate success of a project. Ideally, the starting point for LI will be a range of diverse hits, with activity against the intended biological target. There are many ways in which these may have been derived, including high-throughput screening, screening of a target-focused library, or fragment-based drug design. The primary goal of LI is to deliver one or more lead series with good potency and structure-activity relationships that indicate scope for further compound optimization during lead optimization (LO). However, it is now widely recognized that potency against the therapeutic target is not sufficient for a high-quality lead series.

The increasing cost of pharmaceutical R&D, driven by the high failure rate of projects and development candidates,1 has led to the acknowledgement that successful compounds must achieve a balance of potency with many other properties, including physicochemical properties and absorption, distribution, metabolism, elimination, and toxicity (ADMET). The earlier that high-quality chemistries can be identified, the greater the positive impact on productivity, due to faster progress through LO and a higher chance of ultimate success. It is important to consider a broad range of properties in LI to ensure that series nominated for LO provide good starting points in areas of chemistry in which the necessary compound properties are accessible.

This presents a number of challenges; in particular, the large quantity of data that must be considered to triage the initial hits in order to focus on a small number of series. In many cases, a large number of compounds will have been screened for potency and these data can be supplemented with predictions for many properties using in silico models. Furthermore, with high-throughput ADMET screening now widely accessible, experimental data may be available for a significant proportion of these compounds. In addition to the volume and complexity of data, these values may have significant uncertainty, due to statistical error in prediction and the experimental variability of high-throughput methods. Therefore, while trying to identify compounds with a good balance of properties, users need to ensure that they do not miss valuable opportunities due to predictive or experimental error.

The authors will discuss a number of approaches that help guide the exploration of a wide range of possibilities and quickly focus on those chemistries that are most likely to deliver a high-quality lead series. First, the “chemical space” of a lead identification project can be visualized to identify interesting areas of chemistry for further investigation. Furthermore, multi-parameter optimization (MPO) can quickly identify compounds in this chemical space that are likely to achieve the profile of properties required for a project. Finally, computational methods can be used to automatically expand chemistry around hits to identify additional opportunities for optimization to improve compound properties.

Chemical space
The space of possible “drug-like” compounds is vast; it has been estimated to exceed 10^60 potential compounds.2 This is an impossibly large space to explore, but in practical terms the chemical spaces of interest in a drug discovery project will range from many (tens of) thousands of compounds in early screening, through tens or hundreds of compounds in lead optimization, to a small handful of preclinical development candidates.


Figure 2. Plot of the results of probabilistic scoring of the 266-compound 5HT1a library described in the text. The compounds are ordered from left to right along the x-axis in order of their score and the score for each compound is plotted on the y-axis. The overall uncertainty in each score (1 standard deviation), due to the uncertainty in the underlying data, is shown by error bars around the corresponding point. From this it can be seen that the compounds highlighted in blue cannot be confidently distinguished from the top-scoring compound because their error bars overlap. The scores have been calculated against the inset scoring profile, showing the property criteria and importance of each criterion to the overall project objective.  

In early drug discovery, when researchers are dealing with a larger number of structurally diverse compounds, it is very useful to visualize patterns of properties across the diversity of chemistry being explored. This can highlight “hot spots” of similar compounds with good potency or properties that would suggest interesting chemistries to investigate in more detail.

The basis for any chemical space plot is a definition of similarity between compounds. There are many possibilities, including similarity in 2-dimensional (2D) structure defined by fingerprints that capture the pattern of atoms or functional groups present in the compound, 3-dimensional (3D) similarity in terms of shape or molecular field, and similarity in calculated or experimental compound properties.3 There are also many algorithms for converting the resulting high-dimensional similarity space into a 2D or 3D view that can be easily visualized and manipulated, including principal component analysis (PCA),4 multi-dimensional scaling,5 and Kohonen maps.6
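
As a rough illustration of how such a plot might be built, the sketch below computes Tanimoto similarities between fingerprints and projects them onto two axes. It is not the method used to produce Figure 1: the fingerprints are assumed to be given as sets of "on" bit positions, and the projection is an eigen-decomposition of the double-centered similarity matrix (a kernel-PCA-style step, done here by simple power iteration), which is only one plausible way to combine a Tanimoto index with PCA. All function names and the example fingerprints are hypothetical.

```python
import math

def tanimoto(a, b):
    """Tanimoto index of two fingerprints given as sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def centered_similarity(fps):
    """Pairwise Tanimoto matrix, double-centered so the projection is mean-free."""
    n = len(fps)
    sim = [[tanimoto(a, b) for b in fps] for a in fps]
    row_mean = [sum(row) / n for row in sim]
    grand_mean = sum(row_mean) / n
    return [[sim[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]

def leading_eigenpairs(matrix, k=2, iterations=500):
    """Top-k eigenpairs of a symmetric matrix via power iteration with deflation."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for component in range(k):
        # Deterministic, non-trivial starting vector (an arbitrary choice).
        v = [math.sin(i + 1 + component) for i in range(n)]
        for _ in range(iterations):
            w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        vv = sum(x * x for x in v)
        eigenvalue = sum(v[i] * mv[i] for i in range(n)) / vv  # Rayleigh quotient
        unit = [x / math.sqrt(vv) for x in v]
        pairs.append((eigenvalue, unit))
        # Deflate: remove the component just found before seeking the next one.
        for i in range(n):
            for j in range(n):
                m[i][j] -= eigenvalue * unit[i] * unit[j]
    return pairs

def chemical_space_coords(fps, k=2):
    """One k-dimensional point per compound; nearby points = similar fingerprints."""
    pairs = leading_eigenpairs(centered_similarity(fps), k)
    return [[math.sqrt(max(lam, 0.0)) * vec[i] for lam, vec in pairs]
            for i in range(len(fps))]

# Hypothetical fingerprints: two pairs of closely related compounds.
fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12, 13}, {10, 11, 12, 14}]
coords = chemical_space_coords(fingerprints)
```

In practice the fingerprints would come from a cheminformatics toolkit and the decomposition from a numerical library; the pure-Python version is used here only to keep the example self-contained.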

The most appropriate approach to use depends on the question that is being asked. In LI, where a project may be dealing with large numbers of structurally diverse compounds in a screening library, one goal is to find series of structurally similar compounds that are of particular interest. For this objective, a chemical space defined by 2D chemical structure provides a useful approach, as illustrated in Figure 1a. This chemical space shows the distribution of potency for a diverse library of 266 compounds with experimental data against the 5HT1a target. From this, it can be seen that a number of chemistries have good potency. However, in this scenario, a project team would ideally like to further focus its resources on potent chemistries that are also likely to have appropriate ADMET properties.

Identifying high-quality compounds
As discussed above, an efficacious and safe drug must have a balance of many properties and, from the earliest stages of drug discovery, the goal is to identify the compounds that are most likely to achieve the required property profile. MPO methods help to guide the design and selection of compounds that simultaneously achieve these multiple, often conflicting requirements.7

Probabilistic scoring is an MPO method that allows a project team to define the profile of properties they require in an ideal compound.8 Furthermore, the importance of each individual property criterion can be defined, to reflect the compromises that would be acceptable if an ideal compound cannot be found. All of the property data that is available for each compound, whether predicted or experimental, can then be easily assessed against the project-defined property profile, to identify those compounds with the highest chance of downstream success. Furthermore, the uncertainty in the underlying data can be explicitly taken into account, to determine when compounds can be distinguished with confidence. This ensures that chemistries are not inappropriately rejected based on uncertain data, reducing the risk of missing potentially valuable opportunities.
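
The idea can be sketched in a few lines of code. The example below is an illustrative simplification, not the published probabilistic scoring algorithm: it assumes each property value carries a Gaussian uncertainty, that each criterion is a desired range, and that a criterion with importance w contributes a factor of w*p + (1 - w), where p is the probability that the criterion is met. The profile, property names, and values are all hypothetical.

```python
import math

def prob_in_range(value, sd, low=-math.inf, high=math.inf):
    """P(low <= true value <= high), assuming Gaussian error with std dev sd > 0."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - value) / (sd * math.sqrt(2.0))))
    return cdf(high) - cdf(low)

def score(compound, profile):
    """Combine per-criterion probabilities into one score between 0 and 1.

    Assumed combination rule: importance 1 means a failed criterion drives the
    score to 0; importance 0 means the criterion is ignored.
    """
    total = 1.0
    for prop, (low, high, importance) in profile.items():
        value, sd = compound[prop]
        p = prob_in_range(value, sd, low, high)
        total *= importance * p + (1.0 - importance)
    return total

# Hypothetical profile: property -> (lower bound, upper bound, importance).
profile = {
    "pKi": (7.0, math.inf, 1.0),   # potency must exceed 7; critical
    "logP": (0.0, 3.5, 0.5),       # lipophilicity window; less important
}

# Hypothetical compounds: property -> (value, standard deviation).
borderline = {"pKi": (7.0, 0.5), "logP": (2.0, 0.3)}
potent_but_greasy = {"pKi": (8.5, 0.5), "logP": (5.0, 0.3)}

print(score(borderline, profile), score(potent_but_greasy, profile))
```

Because the uncertainty enters through the probabilities, a compound sitting exactly on a threshold is down-weighted rather than rejected outright, which is the behavior the text describes.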

Figure 2 shows an example of a profile, combining experimentally determined potency with predicted ADMET properties that are appropriate for an orally dosed compound against a target in the central nervous system. Figure 2 also shows the results of scoring the compounds in the 5HT1a library against this profile.

These scores can be plotted in the chemical space of this library, as shown in Figure 1b. This shows a clear “hot spot” of similar compounds, corresponding to a chemical series with the highest chance of success. At the same time, some of the chemistries with high potency have a low chance of overall success, suggesting that these should be given a lower priority. The compounds that cannot be confidently distinguished from the highest-scoring compound, as shown in Figure 2, are also highlighted in Figure 1c. This indicates that a number of diverse areas of chemistry appear to be promising. A good strategy would be to sample these chemistries and experimentally determine their properties. Experimental data will have lower uncertainty than predicted values, so the error bars on the scores will become smaller, increasing the confidence with which a high-quality lead series may be chosen.
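
The selection of compounds that cannot be confidently distinguished from the top scorer can be expressed as a simple overlap test. The sketch below follows the criterion stated in the Figure 2 caption (overlapping error bars of one standard deviation); the compound names, scores, and uncertainties are made up for illustration.

```python
def indistinguishable_from_top(scored):
    """Names of compounds whose 1-SD error bar overlaps that of the top scorer.

    `scored` is a list of (name, score, standard_deviation) tuples.
    """
    top_name, top_score, top_sd = max(scored, key=lambda c: c[1])
    lower_edge_of_top = top_score - top_sd
    return [name for name, s, sd in scored if s + sd >= lower_edge_of_top]

# Hypothetical scores with uncertainties.
scored = [
    ("A", 0.60, 0.10),
    ("B", 0.50, 0.10),
    ("C", 0.20, 0.05),
    ("D", 0.35, 0.20),
]
selected = indistinguishable_from_top(scored)
```

Note that compound D is retained despite its modest score because its uncertainty is large, whereas C is confidently worse than the top scorer and can be deprioritized.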

Automating the exploration of chemical space
Even when high-potency hits do not have appropriate ADMET properties, an area of chemical space containing multiple potent compounds may still be worth pursuing. Similar compounds with good ADMET properties may be accessible within this space, so such hits may yet represent opportunities for optimization.

It would be intractable to manually explore a large number of compounds related to each hit. However, computational approaches can automatically “expand” around the compounds in the existing library to generate new compound ideas and explore new chemistry. These virtual compounds can be automatically prioritized for more detailed consideration, using predictive models and MPO methods, to identify those most likely to be of interest.9

It is important that automatically generated compound structures make sense from a medicinal chemistry perspective. This can be achieved by applying medicinal chemistry transformation rules to an initial compound, representing typical compound modifications explored by medicinal chemists in the optimization of compounds.9 Applying a library of generally applicable transformations, representing tractable steps in chemistry space, can quickly explore a large number of related compounds that are likely to be synthetically accessible.

As an illustrative example, a library of 206 validated transformations was applied iteratively to representative compounds from the 5HT1a library to create two generations of new compounds, expanding on the chemistry in the initial library.10 The resulting chemical space is shown in Figure 1d and the area of chemical space highlighted has a strong chance of achieving the required balance of ADMET properties. These compounds are similar to a series of potent hits that are predicted to have poor properties, but this result suggests that optimization of these compounds could yield a high-quality lead series.
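
The iterative, generation-by-generation expansion can be illustrated with a deliberately simplified toy. In the sketch below, compounds are plain SMILES-like strings and each "transformation" is a hypothetical find-and-replace rule; real transformation libraries operate on molecular graphs with proper substructure matching, and the two rules shown are invented for this example rather than taken from the 206 transformations mentioned above.

```python
def apply_transformation(compound, rule):
    """All strings obtained by applying one rule at a single matching position."""
    pattern, replacement = rule
    products = set()
    start = compound.find(pattern)
    while start != -1:
        products.add(compound[:start] + replacement + compound[start + len(pattern):])
        start = compound.find(pattern, start + 1)
    return products

def expand(seeds, rules, generations=2):
    """Apply every rule to every compound, one generation at a time.

    Returns a list with one set of new (previously unseen) compounds per generation.
    """
    seen = set(seeds)
    frontier = set(seeds)
    history = []
    for _ in range(generations):
        children = set()
        for parent in frontier:
            for rule in rules:
                children |= apply_transformation(parent, rule)
        children -= seen          # keep only genuinely new ideas
        seen |= children
        history.append(children)
        frontier = children       # the next generation grows from the new compounds
    return history

# Hypothetical rules: (fragment to find, fragment to substitute).
rules = [
    ("Cl", "F"),                  # swap chlorine for fluorine
    ("C(=O)O", "C(=O)N"),         # carboxylic acid to primary amide
]
seeds = ["Clc1ccccc1C(=O)O"]      # 2-chlorobenzoic acid as a toy starting hit
generation_1, generation_2 = expand(seeds, rules, generations=2)
```

Each generated compound would then be passed to predictive models and scored against the project profile, so that only the most promising ideas are examined in detail.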

Computational approaches can help to guide LI. This article illustrated three such approaches: visualizing the distribution of compound properties across the diversity of interesting chemistry; quickly focusing on chemistries with the highest chance of downstream success; and rigorously exploring new compound ideas related to the initial hits to identify new opportunities.


To get the most value from these approaches, they must be easily and intuitively accessible to all members of a project team, not only computational experts. Collaboration between medicinal and computational chemists, ADMET scientists, and biologists can be facilitated by the interactive exploration of these areas, helping the project team to make full use of the available compound data in order to quickly identify high-quality leads.

1. Paul S, Mytelka D, Dunwiddie D, Persinger C, Munos B, Lindborg S, Schacht A. How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nat Rev Drug Discov. 2010;9:203-14.
2. Kirkpatrick P, Ellis C. Chemical Space. Nature. 2004;432(7019 (insight)):823-865.
3. Nikolova N, Jaworska J. Approaches to Measure Chemical Similarity – a Review. QSAR & Comb Sci. 2004;22(9-10):1006-1026.
4. Jolliffe IT. Principal Component Analysis, Second Edition. New York: Springer; 2002.
5. Schiffman SS, Reynolds ML, Young FW. Introduction to Multidimensional Scaling: Theory, Methods, and Applications. Bingley: Emerald Group; 1981.
6. Ultsch A, Siemon HP. Kohonen's Self Organizing Feature Maps for Exploratory Data Analysis. In: Widrow B, Angeniol B, editors. Proceedings of the International Neural Network Conference (INNC-90), Paris, France, July 9–13, 1990; 1990; Dordrecht. p. 305-308.
7. Segall MD. Multi-Parameter Optimization: Identifying high quality compounds with a balance of properties. Curr Pharm Des. 2012;18(9):1292-1310.
8. Segall MD, Champness E, Obrezanova O, Leeding C. Beyond Profiling: Using ADMET models to guide decisions. Chemistry and Biodiversity. 2009; 2144-2151.
9. Stewart K, Shiroda M, James C. Drug Guru: a computer software program for drug design using medicinal chemistry rules. Bioorg Med Chem. 2006;14:7011-22.
10. Segall MD, Champness EJ, Leeding C, Lilien R, Mettu R, Stevens B. Applying medicinal chemistry transformations to guide the search for high quality leads and candidates. J Chem Inf Model. 2011;51(11):2967–2976.