Chemists and pharmacologists have a multitude of software tools for searching and mining chemical compound repositories. These tools promise to reduce drug pipeline costs and time-to-market by shifting much of the labor-intensive in vitro and in vivo testing to the computer.
Current in silico tools transform chemical compounds into an abstract representation such as a fingerprint (Figure 1). A given chemical compound is split into fragments, and each fragment is encoded as a bit in a bitstring. Searching then compares the query bitstring with each database bitstring.
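The fingerprint idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's implementation: fragments are assumed to be given as plain strings, the hash-based bit assignment (`zlib.crc32` folded into `n_bits` positions) is a hypothetical choice, and the Tanimoto coefficient is used as the comparison measure because it is a common one for bitstrings; the text above does not name a specific measure.

```python
import zlib


def fingerprint(fragments, n_bits=64):
    """Fold fragment strings into an n_bits-wide bitstring, stored as an int."""
    bits = 0
    for frag in fragments:
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        bits |= 1 << (zlib.crc32(frag.encode("utf-8")) % n_bits)
    return bits


def tanimoto(a, b):
    """Shared bits divided by bits set in either fingerprint."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # assumption: two empty fingerprints are treated as dissimilar
    return bin(a & b).count("1") / union


def search(query_fp, database):
    """Compare the query with every stored fingerprint; best matches first."""
    scored = [(tanimoto(query_fp, fp), name) for name, fp in database.items()]
    return sorted(scored, reverse=True)
```

Because different fragments can hash to the same bit, and because a bit records only the presence of a fragment and not its position in the molecule, the bitstring cannot be mapped back to the original structure.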
While searching and mining in such an abstract space is very efficient, it has several disadvantages. Much information about the global chemical structure is lost, so structure-based searching is less accurate. Working in the abstract space is not “natural” to chemists, who would rather think in terms of atoms and bonds. Querying and mining are also not very flexible because the abstractions are typically tailored to specific query operations.
These disadvantages can be avoided by performing all searching and mining operations directly in the chemical space. Instead of transforming compounds into abstractions, Acelot, Inc.’s suite of drug discovery tools stores them as a data structure consisting of labeled atoms and the bonds between them. Searching on this structure is then performed by finding the best match between the query compound and each compound stored in the repository (Figure 2).
The colors in Figure 2 indicate which substructures are matched. In contrast to fingerprints, such a matching considers the global structure of compounds and can rank globally similar structures high even though they deviate locally. Each matching results in a similarity score that is based on the number of atoms and bonds matched and the preservation of the global structure.
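To make the structure-based view concrete, the following Python sketch stores a compound as labeled atoms plus bonds and scores a match by counting matched atoms and bonds. It is an illustration only and not Acelot's algorithm: the `Molecule` class, the exhaustive search in `best_match`, and the normalization in `similarity` are all assumptions made for this example, and the brute-force search is practical only for very small molecules.

```python
class Molecule:
    """A compound as labeled atoms plus undirected bonds between atom indices."""

    def __init__(self, atoms, bonds):
        self.atoms = list(atoms)                    # e.g. ["C", "C", "O"]
        self.bonds = {frozenset(b) for b in bonds}  # e.g. [(0, 1), (1, 2)]

    def size(self):
        return len(self.atoms) + len(self.bonds)


def best_match(query, target):
    """Try every label-preserving partial mapping of query atoms onto distinct
    target atoms; return (matched atoms + matched bonds, best mapping)."""
    best_score, best_mapping = 0, {}

    def score(mapping):
        matched_bonds = 0
        for bond in query.bonds:
            a, b = tuple(bond)
            if (a in mapping and b in mapping
                    and frozenset((mapping[a], mapping[b])) in target.bonds):
                matched_bonds += 1
        return len(mapping) + matched_bonds

    def extend(i, mapping, used):
        nonlocal best_score, best_mapping
        if i == len(query.atoms):
            s = score(mapping)
            if s > best_score:
                best_score, best_mapping = s, dict(mapping)
            return
        extend(i + 1, mapping, used)  # option 1: leave query atom i unmatched
        for j, label in enumerate(target.atoms):
            if j not in used and label == query.atoms[i]:
                mapping[i] = j        # option 2: match atom i to target atom j
                used.add(j)
                extend(i + 1, mapping, used)
                used.remove(j)
                del mapping[i]

    extend(0, {}, set())
    return best_score, best_mapping


def similarity(query, target):
    """Assumed normalization: matched parts over the size of the larger compound."""
    denom = max(query.size(), target.size())
    if denom == 0:
        return 0.0
    score, _ = best_match(query, target)
    return score / denom
```

The mapping returned by `best_match` records exactly which query atoms correspond to which target atoms, which is the information needed to highlight the matched substructure in a result.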
In addition, since the matching happens in chemical space, the exact substructures that contribute to the matching can be highlighted to simplify interpretation of the result. The speed of structure matching can be improved significantly with an index structure such as Closure-Tree [1], which can prune parts of the search space that cannot contain any matchable compounds.
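The pruning principle can be illustrated with a much simpler filter than Closure-Tree. The sketch below is not the Closure-Tree index; it is a flat, hypothetical pre-filter that assumes each compound is summarized by its list of atom labels. The number of atoms that can possibly be matched is bounded by the per-label overlap of the two compounds, so any compound whose bound falls below the required match size can be skipped without running the expensive structure matching.

```python
from collections import Counter


def atom_upper_bound(query_atoms, target_atoms):
    """Upper bound on matched atoms: per-label overlap of the two compounds."""
    q, t = Counter(query_atoms), Counter(target_atoms)
    return sum(min(count, t[label]) for label, count in q.items())


def candidates(query_atoms, repository, min_matched_atoms):
    """Keep only compounds that could still reach the required match size."""
    return [
        name
        for name, atoms in repository.items()
        if atom_upper_bound(query_atoms, atoms) >= min_matched_atoms
    ]
```

An index tree applies the same kind of bound to whole groups of compounds at once, so that an entire subtree can be discarded with a single check.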
Figure 3 shows the quality of this chemical space-based searching (integrated in Acelot’s SimFinder product) in comparison to fingerprints. The graph compares the ROC100 values for both approaches on a variety of activity classes from the MDL Drug Data Report (MDDR), a database of 100,000 biologically relevant compounds and derivatives managed by Accelrys and Thomson Reuters. ROC100 measures how many active compounds are returned in the top 100 ranked results. SimFinder outperforms fingerprints for all activity classes. In addition, the Venn diagram in Figure 4 shows that SimFinder can provide the chemist with new results not previously found by fingerprints.
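Following the description above, the measure can be computed as a simple count over a ranked result list. The function name `actives_in_top_k` and the use of compound identifiers as strings are assumptions made for this sketch.

```python
def actives_in_top_k(ranked_ids, active_ids, k=100):
    """Count how many known actives appear among the first k ranked results."""
    actives = set(active_ids)
    return sum(1 for compound_id in ranked_ids[:k] if compound_id in actives)
```

Running the same function on the ranked lists of two search methods for the same query and activity class gives directly comparable numbers.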
Besides approximate matching of similar structures or substructures, working in the chemical space also enables interesting data mining applications. The GraphSig algorithm [2] discovers significant substructures in a test database of compounds. A substructure is deemed significant if it is statistically over- or under-represented compared to a distribution of background compounds. The algorithm achieves this by computing the distribution of all fragments in the background and test datasets, then identifying the fragments with low p-values, and finally extracting frequent subgraphs from a neighborhood of those fragments. In Figure 5, the upper row shows substructures that are statistically over-represented among blood-brain barrier (BBB) permeable compounds. The lower row shows substructures that are rare among BBB permeables. Both results can be valuable for better understanding of activity and for lead development.
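The notion of a statistically over- or under-represented fragment can be sketched with a simple binomial test. This is not the GraphSig algorithm itself: it assumes each compound is given as a set of fragment strings, it assumes a non-empty background set, it uses a binomial model and an `alpha` threshold chosen only for illustration, and it leaves out the subsequent frequent-subgraph extraction entirely.

```python
from math import comb


def binomial_pmf(i, n, p):
    """Probability of exactly i successes in n trials with success rate p."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


def fragment_significance(fragment, test_set, background, alpha=0.05):
    """Classify a fragment as 'over', 'under', or 'neither' represented in
    test_set, relative to its frequency in the background compounds."""
    p = sum(fragment in c for c in background) / len(background)
    n = len(test_set)
    k = sum(fragment in c for c in test_set)
    # Chance of seeing at least k (or at most k) occurrences by luck alone.
    p_over = sum(binomial_pmf(i, n, p) for i in range(k, n + 1))
    p_under = sum(binomial_pmf(i, n, p) for i in range(0, k + 1))
    if p_over <= alpha and p_over <= p_under:
        return "over", p_over
    if p_under <= alpha:
        return "under", p_under
    return "neither", min(p_over, p_under)
```

A fragment that occurs in 10% of background compounds but in four out of five test compounds would be flagged as over-represented, while one that is common in the background and nearly absent from the test set would be flagged as under-represented.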
Existing algorithms can also be enhanced by working in the chemical space. For example, current ADME/Tox prediction algorithms mostly return only labels with scores. By building ADME/Tox models directly on molecular fragments, the contribution of each fragment to the overall activity can be determined. This knowledge can then be used to color-code the corresponding parts of a molecule, as done by Acelot’s ActiPred product. Figure 6 shows some molecules classified as BBB permeable by ActiPred together with the red-colored substructures that contribute most to permeability.
In conclusion, working directly in the space of chemical compounds has many advantages. It can provide higher-quality and complementary results, enable new structure-based discovery algorithms, and enhance existing algorithms by providing additional insights. Advances in algorithm design can make these new approaches tractable even on normal desktop computers.
References
1. Huahai He and Ambuj K. Singh, “Closure-Tree: An Index Structure for Graph Queries”, In: Proceedings of the International Conference on Data Engineering; 2006:38.
2. Sayan Ranu and Ambuj K. Singh, “GraphSig: A Scalable Approach to Mining Significant Subgraphs in Large Graph Databases”, In: Proceedings of the International Conference on Data Engineering; 2009:844-855.

