Computer-Aided Search and Discovery of Small Molecules
Mon, 07/11/2011 - 7:09am
Christian A. Lang, PhD, Chief Technology Officer; Acelot, Inc., Santa Barbara, Calif.

Chemists and pharmacologists have a multitude of software tools for searching and mining chemical compound repositories. These tools promise to reduce drug pipeline costs and time-to-market by shifting much of the labor-intensive in-vitro and in-vivo testing to the computer.

Figure 1. Transformation of a chemical compound (top) into pre-defined fragments (middle) and bits in a larger bitstring (bottom). Multiple fragments can set the same bit, leading to additional information loss. (All images courtesy of Acelot)

Current in-silico tools transform chemical compounds into an abstract representation such as a fingerprint (Figure 1). A given chemical compound is split into fragments, and each fragment is encoded as a bit in a bitstring. A search then compares the query bitstring with each bitstring in the database.
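The fingerprint approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fragment strings, the 16-bit length, the CRC32 hashing, and the Tanimoto coefficient as the comparison measure are all choices made for this example and are not taken from any particular product.

```python
import zlib

N_BITS = 16  # assumed, deliberately tiny so that bit collisions are likely


def fingerprint(fragments, n_bits=N_BITS):
    """Fold a collection of fragment strings into an integer bitstring."""
    bits = 0
    for frag in fragments:
        # crc32 is deterministic across runs, unlike the built-in hash()
        bits |= 1 << (zlib.crc32(frag.encode()) % n_bits)
    return bits


def tanimoto(fp_a, fp_b):
    """Bits set in both fingerprints divided by bits set in either."""
    union = bin(fp_a | fp_b).count("1")
    return bin(fp_a & fp_b).count("1") / union if union else 1.0


def search(query_fragments, database):
    """Compare the query bitstring with every database bitstring."""
    q = fingerprint(query_fragments)
    scored = [(tanimoto(q, fingerprint(frags)), name)
              for name, frags in database.items()]
    return sorted(scored, reverse=True)


# Hypothetical compounds, each already split into fragment strings.
database = {
    "mol_a": ["C-C", "C=O", "C-N"],
    "mol_b": ["C-C", "C-O"],
    "mol_c": ["c:c", "C-Cl"],
}

for score, name in search(["C-C", "C=O"], database):
    print(f"{name}: {score:.2f}")
```

Because different fragments can hash to the same bit position, two distinct fragment sets may produce identical bitstrings, which is the information loss noted in the Figure 1 caption.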

While searching and mining in such an abstract space is very efficient, it has several disadvantages. Much information about the global chemical structure is lost, so structure-based searching is less accurate. Working in the abstract space is not “natural” to chemists, who would rather think in terms of atoms and bonds. Querying and mining are not very flexible because the abstractions are typically tailored to specific query operations.

Figure 2. Matching between a query molecule (left) and a molecule in the database (right). The coloring indicates which parts of the molecules have been matched. Slight variations in the matched substructures are tolerated as long as the overall structure is preserved.  

These disadvantages can be avoided by performing all searching and mining operations directly in the chemical space. Instead of transforming compounds into abstractions, Acelot, Inc.’s suite of drug discovery tools stores them as a data structure consisting of labeled atoms and the bonds between them. Searching on this structure is then performed by finding the best match between the query compound and each compound stored in the repository (Figure 2).

The colors in Figure 2 indicate which substructures are matched. In contrast to fingerprints, a matching considers the global structure of compounds and can rank globally similar structures high even though they deviate locally. Each matching results in a similarity score that is based on the number of atoms and bonds matched and the preservation of the global structure.
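To make the idea of matching in chemical space concrete, here is a minimal sketch. It is not Acelot's algorithm: it assumes molecules are stored as a list of atom labels plus a set of bonds, searches exhaustively for the label-preserving atom mapping that keeps the most bonds, and normalizes by the size of the larger molecule. Unlike the matching described above, it tolerates no local variation, and its brute-force search is only practical for very small molecules. The helper names `mol`, `best_match`, and `similarity` are hypothetical.

```python
def mol(atoms, bonds):
    """Build a molecule as (atom labels, set of undirected bonds)."""
    return list(atoms), {frozenset(b) for b in bonds}


def best_match(query, target):
    """Return (matched atoms + matched bonds, mapping) for the best mapping."""
    q_atoms, q_bonds = query
    t_atoms, t_bonds = target
    best = (0, {})

    def extend(i, mapping, used):
        nonlocal best
        if i == len(q_atoms):
            # Count query bonds whose two ends map onto a bonded target pair.
            bonds = sum(
                1 for a, b in q_bonds
                if a in mapping and b in mapping
                and frozenset((mapping[a], mapping[b])) in t_bonds
            )
            total = len(mapping) + bonds
            if total > best[0]:
                best = (total, dict(mapping))
            return
        extend(i + 1, mapping, used)  # option 1: leave query atom i unmatched
        for j, label in enumerate(t_atoms):  # option 2: pair it with a free atom
            if j not in used and label == q_atoms[i]:
                mapping[i] = j
                used.add(j)
                extend(i + 1, mapping, used)
                del mapping[i]
                used.remove(j)

    extend(0, {}, set())
    return best


def similarity(query, target):
    """Matched atoms and bonds as a fraction of the larger molecule's size."""
    size = max(len(query[0]) + len(query[1]), len(target[0]) + len(target[1]))
    return best_match(query, target)[0] / size


# Heavy-atom skeletons only; hydrogens are omitted for brevity.
ethanol = mol(["C", "C", "O"], [(0, 1), (1, 2)])
ethylamine = mol(["C", "C", "N"], [(0, 1), (1, 2)])

print(similarity(ethanol, ethanol))
print(similarity(ethanol, ethylamine))
```

The mapping returned by `best_match` records exactly which atoms were paired, which is the information needed to color matched substructures as in Figure 2.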

Figure 3. ROC100 scores for various MDDR activity classes.  

In addition, since the matching happens in chemical space, the exact substructures that contribute to the matching can be highlighted to simplify interpretation of the result. The speed of structure matching can be improved significantly with an index structure such as Closure-Tree [1], which prunes parts of the search space that cannot contain any matchable compounds.
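The pruning idea can be illustrated with a much simpler filter than Closure-Tree. The sketch below is an assumption-laden stand-in, not the published index: it uses atom-label counts to compute an upper bound on how many atoms any matching could pair, and skips the costly matcher whenever that bound already falls below the required threshold. The function names and the `expensive_match` callback are hypothetical.

```python
from collections import Counter


def atom_upper_bound(query_atoms, target_atoms):
    """Most atoms any label-preserving matching could possibly pair up."""
    q, t = Counter(query_atoms), Counter(target_atoms)
    return sum(min(n, t[label]) for label, n in q.items())


def filtered_search(query_atoms, database, min_matched, expensive_match):
    """Run the costly matcher only where the cheap bound cannot rule it out."""
    hits, skipped = [], 0
    for name, target_atoms in database.items():
        if atom_upper_bound(query_atoms, target_atoms) < min_matched:
            skipped += 1  # the bound proves the threshold is unreachable
            continue
        matched = expensive_match(query_atoms, target_atoms)
        if matched >= min_matched:
            hits.append((matched, name))
    return sorted(hits, reverse=True), skipped


# Hypothetical database of heavy-atom label lists.
database = {
    "ethanol": ["C", "C", "O"],
    "water": ["O"],
    "ethylamine": ["C", "C", "N"],
}

# The bound itself stands in for a real structure matcher in this demo.
hits, skipped = filtered_search(["C", "C", "O"], database,
                                min_matched=2,
                                expensive_match=atom_upper_bound)
print(hits, skipped)
```

A hierarchical index applies the same principle to whole groups of compounds at once, so that an entire subtree can be discarded with a single bound computation.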

Figure 4. Sample results returned by fingerprints and SimFinder for one of the activity classes in Figure 3 (HMG). While there is some overlap, SimFinder returns new results due to its different approach.  

 

Figure 3 compares the quality of this chemical-space-based searching (integrated in Acelot’s SimFinder product) with that of fingerprints. The graph shows the ROC100 values for both approaches on a variety of activity classes from the MDL Drug Data Report (MDDR), a database of 100,000 biologically relevant compounds and derivatives managed by Accelrys and Thomson Reuters. ROC100 measures how many active compounds are returned in the top 100 ranked results. As can be seen, SimFinder outperforms fingerprints for all activity classes. In addition, the Venn diagram in Figure 4 shows that SimFinder can provide the chemist with new results not found by fingerprints.
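Both comparisons are straightforward to reproduce on one's own ranked result lists. The following sketch follows the description of the metric given above (actives among the top 100 results) and computes the three regions of a two-set Venn diagram; the function names and compound identifiers are invented for illustration.

```python
def actives_in_top_k(ranked_ids, active_ids, k=100):
    """Count known actives among the first k entries of a ranked result list."""
    active = set(active_ids)
    return sum(1 for cid in ranked_ids[:k] if cid in active)


def compare_hits(hits_a, hits_b):
    """Split two hit lists into the three regions of a Venn diagram."""
    a, b = set(hits_a), set(hits_b)
    return {"only_a": a - b, "both": a & b, "only_b": b - a}


# Hypothetical compound IDs, best-ranked first.
fingerprint_ranking = ["m1", "m7", "m3", "m9"]
structure_ranking = ["m3", "m4", "m1", "m8"]
actives = {"m1", "m3", "m4"}

print(actives_in_top_k(fingerprint_ranking, actives, k=3))
print(actives_in_top_k(structure_ranking, actives, k=3))

regions = compare_hits(
    [c for c in fingerprint_ranking if c in actives],
    [c for c in structure_ranking if c in actives],
)
print(regions)
```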

 

Figure 6. Sample MDDR molecules classified as BBB permeable by ActiPred. The color coding is obtained by matching positive training fragments from the BBB model with the molecule in question.  

Besides approximate matching for similar structures or substructures, working in the chemical space also enables interesting data mining applications. The GraphSig algorithm [2] discovers significant substructures in a test database of compounds. A substructure is deemed significant if it is statistically over- or under-represented compared with a distribution of background compounds. The algorithm achieves this by computing the distribution of all fragments in the background and test datasets, then identifying the fragments with low p-values, and finally extracting frequent subgraphs from a neighborhood of those fragments. In Figure 5, the upper row shows substructures that are statistically over-represented among BBB permeables. The lower row shows substructures that are rare among BBB permeables. Both results can be valuable for better activity understanding and lead development.
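The significance test at the heart of this approach can be sketched for a single fragment. The code below is an illustration only and does not reproduce the GraphSig pipeline: it assumes a plain binomial model in which the fragment's frequency in the background set is the null probability, and it reports whichever tail (over- or under-representation) is smaller. The function names and the counts in the example are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p, upper=True):
    """P(X >= k) if upper, else P(X <= k), for X ~ Binomial(n, p)."""
    ks = range(k, n + 1) if upper else range(0, k + 1)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in ks)


def fragment_significance(test_count, test_size,
                          background_count, background_size):
    """Return (direction, p_value) for one fragment under the binomial model."""
    p = background_count / background_size  # null frequency of the fragment
    over = binomial_tail(test_count, test_size, p, upper=True)
    under = binomial_tail(test_count, test_size, p, upper=False)
    return ("over", over) if over <= under else ("under", under)


# A fragment seen in 100 of 1,000 background compounds (10%)...
# ...that appears in 8 of 20 test compounds:
print(fragment_significance(8, 20, 100, 1000))
# ...that appears in none of 50 test compounds:
print(fragment_significance(0, 50, 100, 1000))
```

Fragments whose p-value falls below a chosen cutoff would then serve as seeds for the subsequent subgraph extraction step, which this sketch leaves out.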

 

Figure 5. Sample significant substructures returned by mining the MDDR BBB permeable subset with SigFinder. The top row shows over-represented fragments, the bottom row shows under-represented ones. This p-value based approach can pick up very subtle deviations as seen in the two right examples on the top.  

Existing algorithms can be enhanced by working in the chemical space. For example, current ADME/Tox prediction algorithms return mostly labels with scores. By building ADME/Tox models directly on molecular fragments, the contribution of each fragment to the overall activity can be determined. This knowledge can then be used to color-code the corresponding parts of a molecule, as done by Acelot’s ActiPred product. Figure 6 shows some molecules classified as BBB permeable by ActiPred together with the red-colored substructures that contribute most to permeability.

In conclusion, working directly in the space of chemical compounds has many advantages. It can provide higher quality and complementary results, enable new structure-based discovery algorithms, and enhance algorithms by providing additional insights. Advances in algorithm design can make these new approaches tractable even on normal desktop computers.

References
1. Huahai He and Ambuj K. Singh, “Closure-Tree: An Index Structure for Graph Queries,” In: Proceedings of the International Conference on Data Engineering; 2006:38.
2. Sayan Ranu and Ambuj K. Singh, “GraphSig: A Scalable Approach to Mining Significant Subgraphs in Large Graph Databases,” In: Proceedings of the International Conference on Data Engineering; 2009:844-855.
