Mike May
Mining large databases effectively drives better decision making to help yield drugs more quickly.
To some people outside drug development, data mining sounds like a gold rush for pharmaceutical companies: a technique that digs out rare but powerful therapeutics from mountains of possibilities. Data mining, however, cannot stand alone. "Data mining applies to so many areas that it's hard to imagine data mining, even used broadly, is likely to be the critical factor for moving a whole drug program ahead or even making a drug come out sooner," says John Hill, PhD, executive director of exploratory development informatics at Bristol-Myers Squibb (BMS), Princeton, N.J. Instead, Hill sees data mining as supporting a wide range of steps in drug development. "Data mining gives you a better understanding of a problem," he says.
Simply put, data mining analyzes large databases for patterns. "At a very general level," says Hill, "data mining is used across all of drug development." As an example, he points out the use of data mining for tracking the safety of a drug in development. "If there's an unexpected toxicology finding, you might do some data mining around that finding to see if there are historical examples of similar reactions with similar compounds." Ongoing advances in data mining can also explore many other drug-development issues: combining data from the many stages involved in creating a drug, searching for compound names in patents, and so on.
Sometimes, data mining raises very serious issues. "If you have a compound in testing," says Hill, "some result from data mining might lead to early termination." That could happen if, for example, some tests revealed undesirable properties of a compound. That finding could impact the development of future drugs. "Data mining around those results might give you a heads-up that the problem could be a drug-class issue, instead of just an issue for the specific compound being studied."
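The kind of historical lookup Hill describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the compound IDs, the structural feature labels and the 0.5 threshold are all hypothetical, and the Tanimoto coefficient is one common similarity measure chosen here for the example, not a method the article attributes to BMS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoricalFinding:
    compound: str         # hypothetical compound ID
    features: frozenset   # hypothetical structural feature labels
    finding: str          # toxicology observation recorded for the compound


def tanimoto(a, b):
    """Shared features divided by all features present in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_findings(query_features, finding, history, threshold=0.5):
    """Past compounds with the same finding, most structurally similar first."""
    hits = [
        (rec.compound, tanimoto(query_features, rec.features))
        for rec in history
        if rec.finding == finding
    ]
    hits = [(compound, score) for compound, score in hits if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up historical records, for illustration only.
history = [
    HistoricalFinding("cmpd-001", frozenset({"ring_a", "amide", "halogen"}), "liver enzyme elevation"),
    HistoricalFinding("cmpd-002", frozenset({"ring_a", "amide", "ester"}), "liver enzyme elevation"),
    HistoricalFinding("cmpd-003", frozenset({"ring_b", "sulfonyl"}), "liver enzyme elevation"),
    HistoricalFinding("cmpd-004", frozenset({"ring_a", "amide", "halogen"}), "no finding"),
]

query = frozenset({"ring_a", "amide", "halogen", "methyl"})
matches = similar_findings(query, "liver enzyme elevation", history)
print(matches)  # [('cmpd-001', 0.75)]
```

If several structurally close compounds turn up with the same finding, that is the sort of pattern that might hint at a drug-class issue rather than a problem with one compound.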
In fact, pharmaceutical companies are always looking for faster ways to detect toxicity. Scott Kuzdzal, PhD, strategic collaborations scientist at PerkinElmer Inc., Boston, says scientists could use his company's expression kits to find a marker for a toxic event. These kits, however, work at high throughput, which generates a great deal of data. To handle it, PerkinElmer worked with Nonlinear Dynamics Inc., Durham, N.C., to develop software that mines the results. Later in development, the same expression kits could be used to compare responders with non-responders in early human studies.
A variety of data-mining techniques could provide early indications of human toxicity. In the December 2003 issue of the Journal of Computer-Aided Molecular Design, scientists from Accelrys Inc., San Diego, discussed the danger of drug-induced hepatotoxicity, a particular concern because the liver tends to be especially sensitive to drugs. "Understandably, liver toxicity is one of the most important dose-limiting considerations in the drug development cycle, yet there remains a serious shortage of methods to predict hepatotoxicity from chemical structure," the article states. The researchers developed an in silico model that they said can predict the incidence of dose-dependent human hepatotoxicity with greater than 80% accuracy. Even more important, data mining can be combined with this model to predict danger from specific compounds.
Seeking new structure
Scientists might also mine more than numbers, which brings up a growing distinction: structured versus unstructured mining. Traditionally, scientists think of data mining as searching through arrays of numbers, or structured data. Searching analyst presentations, on the other hand, involves looking through articles or websites for specific concepts, which is unstructured data. "Currently, the tools for structured data are much more robust than those for unstructured data," says BMS's Hill.
Figure: The Semantic Web connects pieces of data in triplets: a subject, predicate and object (top). Then, pertinent triplets can be easily linked (bottom). (Source: Oracle)
Nonetheless, mining of unstructured data continues to grow. For example, many scientists at IBM study ways to mine unstructured data to improve products like the Business Insights Workbench. One of IBM's tools, for instance, searches text for mentions of specific compounds. Searching for the word "valium" in a US patent, for example, sounds easy enough, but the reality is that valium has 150 different names, says Jeff Kreulen, PhD, senior manager of services-oriented technologies at IBM's Almaden Research Center, San Jose, Calif. "Lots of chemicals have so many variants that keyword searching does not work." Instead of looking for keywords, IBM developed software that converts the word to a chemical structure and then compares it with the search compound's structure.
This technique could become quite useful during drug development. A company might want to mine all of the intellectual property (IP) on a compound because, as Kreulen notes, "pharma is very IP driven." So the IBM software could look for every mention of a compound or its variants in all US patents or in all articles on Medline.
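The difference between keyword search and structure-based search can be shown with a small Python sketch. It is a stand-in, not IBM's software: a real system would convert each name into a chemical structure, whereas this example assumes a hypothetical lookup table (`SYNONYM_TO_STRUCTURE`) that maps names to made-up structure keys.

```python
import re

# Hypothetical table: every known name for a compound maps to one structure key.
# The keys are placeholders; a real system would use a canonical structure
# representation derived from the name itself.
SYNONYM_TO_STRUCTURE = {
    "valium": "STRUCT-0001",
    "diazepam": "STRUCT-0001",
    "aspirin": "STRUCT-0002",
    "acetylsalicylic acid": "STRUCT-0002",
}


def structure_key(name):
    """Resolve a compound name to its structure key, or None if unknown."""
    return SYNONYM_TO_STRUCTURE.get(name.strip().lower())


def find_mentions(text, query_name):
    """Names in `text` that resolve to the same structure as `query_name`."""
    target = structure_key(query_name)
    if target is None:
        return []
    lowered = text.lower()
    found = []
    for name, key in SYNONYM_TO_STRUCTURE.items():
        if key == target and re.search(r"\b" + re.escape(name) + r"\b", lowered):
            found.append(name)
    return sorted(found)


# Made-up sentence standing in for patent text.
patent_text = "The composition comprises diazepam in a sustained-release matrix."
print("valium" in patent_text.lower())       # False: the keyword search misses it
print(find_mentions(patent_text, "Valium"))  # ['diazepam']
```

Comparing structure keys instead of strings is what lets the search catch a document that never uses the name the searcher typed.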
Data mining is also evolving in other ways. Colin Hill, MS, CEO of Gene Network Sciences (GNS), Ithaca, N.Y., says, "A new level of data mining, though, is more process driven. For example, we reverse engineer models from data to determine drug efficacy." Scientists at GNS build causal models from mined data. Sergej Aksenov, PhD, director of modeling at GNS, says, "Our models are multivariate probability models that can look at what happens if I perturb a gene or a protein."
This kind of technology provides many advantages for drug development. For example, a company might already have several approved drugs for a disease, but wants to know how the drugs might work together. The GNS scientists could model the disease and then apply the drugs to the model. "You can probe the system to see if a drug has any adverse effects in known areas, such as for the cardiovascular system," Aksenov says. (See "Probing the Pump," below.)
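To make the idea of perturbing a probability model concrete, here is a toy Python sketch. It is not the GNS platform: the three-node chain (gene, protein, adverse response) and every probability in it are hypothetical values chosen for illustration, and the model is evaluated by simple enumeration.

```python
from itertools import product

# Hypothetical probabilities for a three-node causal chain:
# gene activity -> protein level -> adverse response.
P_GENE_ON = 0.5
P_PROTEIN_HIGH = {True: 0.75, False: 0.25}  # keyed by whether the gene is on
P_RESPONSE = {True: 0.5, False: 0.125}      # keyed by whether the protein is high


def prob_response(gene_fixed=None):
    """Probability of the adverse response.

    With gene_fixed=None the gene follows its natural distribution.
    Passing True or False clamps the gene to that state, which is
    the perturbation.
    """
    total = 0.0
    for gene_on, protein_high in product([True, False], repeat=2):
        if gene_fixed is None:
            p_gene = P_GENE_ON if gene_on else 1 - P_GENE_ON
        else:
            p_gene = 1.0 if gene_on == gene_fixed else 0.0
        p_high = P_PROTEIN_HIGH[gene_on]
        p_protein = p_high if protein_high else 1 - p_high
        total += p_gene * p_protein * P_RESPONSE[protein_high]
    return total


baseline = prob_response()
knocked_down = prob_response(gene_fixed=False)
activated = prob_response(gene_fixed=True)
print(baseline, knocked_down, activated)  # 0.3125 0.21875 0.40625
```

In this made-up example, switching the gene off lowers the chance of the adverse response and switching it on raises it. That is the kind of what-if question Aksenov describes, though real models involve far more variables and are learned from data.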
Sidebar: Probing the Pump
Gene Network Sciences, Ithaca, N.Y., is focused on creating a platform for rapid reconstruction of networks from data and the rapid simulation of the resulting models, says Colin Hill, MS, the company's CEO. He and his colleagues see this approach as an extension of experimentation. "This now offers a way of discovering new biology at computational speed," says Hill.
As an example, they recently developed a model of the electrical properties of human-heart tissue. This model can reveal mechanisms of toxicity for drugs in development. "A number of drugs can induce heart attacks when they block certain ion channels," says Hill. Their model simulates the electrical processes in individual heart cells and then combines them to generate the electrical wave that makes a heart beat. The system can then be perturbed by adding a test compound. If that compound blocks ion channels, it could also disturb the heart's overall wave of conduction, thereby causing a heart attack.
As Hill and his colleagues add more data to the model, it should grow increasingly powerful. "You can imagine a computer model of your heart and how powerful that would be if driven by enough data." Such heart simulations could also be used in personalized medicine. The model could be modified with a patient's particular genes to see how specific drugs impact the electrocardiogram. "It is amazing how efficiently and quickly the simulations are starting to generate results," says Hill.
Hill of GNS describes the reverse-engineering process as taking a Swiss watch and smashing it. Researchers then look at the pieces, what they are and where they landed, and try to reconstruct the watch. In biological scenarios, there can be added twists: The scientists do not know what the "watch" looked like before being smashed, and the parts can look different from one moment to the next. The findings from models and simulations can then be fed into future experiments. In this way, scientists can integrate the experimental and computational processes.
Currently, pharmaceutical or biotechnology companies come to GNS for modeling and simulations. "For example," says Hill, "Johnson & Johnson had a drug in preclinical cancer studies and came to us to study how the drug works." A GNS team models the company's data and returns the results from a simulation, a process that can be interactive, he says. "For instance, we designed new sets of experiments with Johnson & Johnson to further define models and find the mechanism of action for the drug."
Having such information proves especially valuable before moving to clinical trials. "Targeted medicine for individuals, especially in cancer, is something the industry will come to accept. Data mining plays a key role in this," says Hill.
Academic scientists such as Andrew Kusiak, PhD, a professor of mechanical and industrial engineering at the University of Iowa, also explore data mining for developing cancer drugs. In an online article which appeared this April in Computers in Biology and Medicine, Kusiak reported on data mining for cancer genes. "Analysis of gene expression data leads to cancer identification and classification, which will facilitate proper treatment selection and drug development," the article notes. Using a variety of techniques including genetic algorithms and data mining, these scientists classified the genes most significant to various cancers, including ovarian, prostate and lung. "Mapping of genotype information to the phenotype parameters will ultimately reduce the cost and complexity of cancer detection and classification," they concluded.
Stitching silos together
For any kind of data mining, getting the data together poses a fundamental obstacle. Every department, whether animal testing, chemoinformatics, formulations or another group, generates grains of data and shovels them into various silos, the industry name for data warehouses. If a device is built to find only the green grains, for example, finding them all means running it through every silo one by one. Scientists would rather mine all the data simultaneously.
David Lowis, DPhil, senior director of product management at Tripos Inc., St. Louis, Mo., says his company has recently worked with Wyeth to help them consolidate data sources. "We are solving issues of scientists being unable to access data that they need," he says. For example, Lowis says there is usually a data disconnect between discovery, development and clinical trials. "Each are in different silos of information, and the real value of data mining will be when those silos break down across the entire process of bringing a new drug to market. . . . The real key of applying data mining is making the data accessible and then allowing researchers to access it in a way that will help them, instead of dumping it all on their desktop and expecting them to make sense of it."
Other companies also see great potential in combining data for drug development. "At the moment, companies in pharma have many kinds of data, and assessing a compound's safety profile is very difficult because of so many silos," says Susie Stephens, PhD, principal product manager for life sciences at Oracle Corp., Redwood Shores, Calif. "Our Semantic Web enables data to be stored in triples, which is a similar approach to relational databases. However, the Semantic Web allows data to be modeled in a flexible graph representation. . . . It can also more easily combine data from two heterogeneous sources."
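A minimal Python sketch shows how subject-predicate-object triples from two separate sources can be merged and then linked. It is an illustration only and does not use Oracle's product or API; the `TripleStore` class, the `follow` helper, and the compound, protein and tissue identifiers are all hypothetical.

```python
class TripleStore:
    """A tiny in-memory set of (subject, predicate, object) triples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def merge(self, other):
        """Copy every triple from another store into this one."""
        self._triples |= other._triples

    def objects(self, subject, predicate):
        """All objects linked to `subject` by `predicate`."""
        return {o for s, p, o in self._triples if s == subject and p == predicate}


def follow(store, start, *predicates):
    """Walk a chain of predicates from `start`, returning the nodes reached."""
    frontier = {start}
    for predicate in predicates:
        frontier = {o for s in frontier for o in store.objects(s, predicate)}
    return frontier


# Two hypothetical silos describing different things about related entities.
chemistry = TripleStore()
chemistry.add("compound:X1", "inhibits", "protein:P9")

biology = TripleStore()
biology.add("protein:P9", "expressed_in", "tissue:liver")

combined = TripleStore()
combined.merge(chemistry)
combined.merge(biology)

# Which tissues express a protein that compound X1 inhibits?
print(follow(combined, "compound:X1", "inhibits", "expressed_in"))  # {'tissue:liver'}
```

Because both sources use the same three-part shape, merging them needs no schema mapping, and the shared identifier `protein:P9` is what links a fact from one silo to a fact from the other.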
As testing goes on with a compound, comparing various forms of data grows even more important. "It could be important, for example, to relate a response in animal testing to properties of the compound," says Ton van Daelen, PhD, director of Pipeline Pilot at SciTegic Inc., San Diego, a wholly owned subsidiary of Accelrys. Whether the response is positive or negative, developers want to know. "Maybe some structural features make a molecule toxic in specific cases," says van Daelen. SciTegic's Pipeline Pilot allows researchers to develop statistical models to make such predictions.
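One very simple way to relate a response to structural features is to compare how often compounds with and without each feature showed the response. The Python sketch below does that on made-up data; it is not Pipeline Pilot, and the feature labels and records are hypothetical. A real statistical model would also need significance testing and far more data.

```python
def rate(flags):
    """Fraction of True values in a list, or 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def feature_rate_differences(records):
    """Rank features by (toxic rate with the feature) minus (toxic rate without).

    `records` is a list of (features, toxic) pairs, where `features` is a
    set of labels and `toxic` is True or False.
    """
    all_features = set().union(*(features for features, _ in records))
    diffs = []
    for feature in all_features:
        with_f = [toxic for features, toxic in records if feature in features]
        without_f = [toxic for features, toxic in records if feature not in features]
        diffs.append((feature, rate(with_f) - rate(without_f)))
    # Largest difference first; ties are broken alphabetically.
    return sorted(diffs, key=lambda d: (-d[1], d[0]))


# Made-up animal-testing results, for illustration only.
records = [
    ({"feature_a", "feature_b"}, True),
    ({"feature_a"}, True),
    ({"feature_a", "feature_c"}, True),
    ({"feature_a", "feature_b"}, False),
    ({"feature_b"}, False),
    ({"feature_c"}, False),
    ({"feature_b", "feature_c"}, True),
    ({"feature_c"}, False),
]

ranked = feature_rate_differences(records)
print(ranked)  # [('feature_a', 0.5), ('feature_b', 0.0), ('feature_c', 0.0)]
```

Here compounds carrying `feature_a` were toxic 75% of the time, against 25% for the rest, so that feature would be the first one to investigate.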
BMS's Hill expects that data mining will grow increasingly useful to the pharmaceutical industry. He believes the biggest advances are in mining unstructured data, using data mining in concert with modeling and simulation, and linking information from disparate areas. "With improved tools, drug developers will build contextually richer models and information sets that will allow them to make better decisions," he says.
Likewise, some advances in data mining down the road might look a bit backward. Today, many scientists discuss taking advances from the bench to the bedside. In the future, Stephens of Oracle hopes for equally useful advances from the bedside to the bench. Data mining should prove invaluable in taking lessons from the clinic and using them to make improved next-generation drugs.
About the Author
Mike May, PhD, is a publishing consultant for science and technology based in Minnesota.
This article was published in Drug Discovery & Development magazine: Vol. 9, No. 7, July, 2006, pp. 44-49.


