Managing Data in the Cloud Age
As next-generation technology ratchets the price of sequencing lower and lower, users from academic labs to Big Pharma are finding themselves drowning in data. What used to be gigabytes worth of information has become terabytes or petabytes. At the same time, the cost crunch brought on by the global recession has made researchers leery of unnecessary capital spending. The result is more and more users moving their data management to the cloud or outsourcing it entirely.
Whereas large pharma companies may have the funds and infrastructure to maintain dedicated servers for storage and analysis of sequencing data, small companies—especially those that don’t sequence continuously—are leading the migration to the cloud, and service providers are springing up to meet the demand. Cost, security, and convenience top the list of concerns for researchers looking for a place to unload reams of data. However, once that transition is made, features like collaborative data sharing, access to third-party analysis apps, and patient privacy become more important.
Expression Analysis, a Quintiles Co., Durham, N.C., provides genomic services to the pharma and biotech industry, as well as academic, government, and foundation laboratories doing research in molecular biology and genetics. It provides cloud computing services through a partnership with Golden Helix Inc., Bozeman, Mont. Its clients require computation-intensive services for generating the initial RNA or DNA sequence and also for cleaning up, aligning, and analyzing the sequence.
According to Expression Analysis, a typical sequencing project for 100 RNA samples would generate 300 to 400 GB of compressed data, or 700 GB to 1 TB of data in total, and that is just for one experiment. Across multiple experiments, the volume of data quickly becomes enormous.
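To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the ranges quoted by Expression Analysis; the assumptions that data volume scales linearly with the number of experiments, that 1 TB equals 1,000 GB, and the example count of 50 experiments are illustrative and not from the company.

```python
# Figures quoted in the article for one 100-sample RNA sequencing project.
SAMPLES_PER_PROJECT = 100
COMPRESSED_GB = (300, 400)   # (low, high) compressed data
TOTAL_GB = (700, 1000)       # (low, high) total data; 1 TB taken as 1000 GB


def per_sample_gb(range_gb, samples=SAMPLES_PER_PROJECT):
    """Average data per sample for the low and high ends of a range."""
    low, high = range_gb
    return low / samples, high / samples


def projected_total_tb(range_gb, experiments):
    """Total data in TB, assuming every experiment is the same size."""
    low, high = range_gb
    return low * experiments / 1000, high * experiments / 1000


print(per_sample_gb(COMPRESSED_GB))                   # (3.0, 4.0) GB per sample
print(per_sample_gb(TOTAL_GB))                        # (7.0, 10.0) GB per sample
# Hypothetical lab running 50 such experiments:
print(projected_total_tb(TOTAL_GB, experiments=50))   # (35.0, 50.0) TB
```

Under these assumptions, a few dozen experiments are enough to reach tens of terabytes.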
Some applications, such as analyzing cancer samples, are even more data intensive, because of the depth of coverage and the need to sample multiple cells in the tumor.
“The cloud offers a full environment in order to do analysis on a large number of samples simultaneously,” said Wendell Jones, PhD, vice president of statistics and bioinformatics for Expression Analysis.
That computing power becomes a commodity for the customer, replacing expensive on-site server infrastructure. The data is instead accessed through a browser, and there is no need to upload or download huge files. “You can leave them on the cloud and access in a streaming fashion via the cloud,” Jones said.
For small companies, the cloud-based service offers additional advantages beyond saving on hardware and real estate. Startup companies may not have the structure in place to operate Linux-based genome software applications. A cloud-based storage and analysis service allows those companies to use their own local Windows or Macintosh desktop operating systems.
There are some advantages to maintaining a physical server. “You have the option of having lower redundancy ... and faster data access times. You can choose to take your old data and unplug it. You don’t have to pay for power,” explained Jonathan Bingham, product manager for informatics and software for Menlo Park, Calif.-based Pacific Biosciences, a provider of genomics services through its SMRT platform technology and hosted cloud-based storage and analysis service.
On the other hand, that means taking responsibility for managing the hardware, Bingham added, such as replacing failed drives. That burden of ownership and maintenance is not right for every company.
Jones explained that cloud computing is ideal for research groups that have “bursty” computing needs, meaning that generating and analyzing sequence data is an intermittent need.
“The cloud in some sense is cheap, in the sense that it’s cheaper to rent a vacation home than buy it and only use it two or three weeks a year. If you’re constantly at your vacation home, it’s just better to buy it.”
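The vacation-home analogy amounts to a break-even calculation, which can be sketched in a few lines of Python. Every price below is a hypothetical placeholder chosen for illustration; none comes from Expression Analysis or any provider, and the model ignores factors such as staff time, storage, and data transfer.

```python
def yearly_cost_renting(rate_per_hour, hours_used):
    """Pay-as-you-go: cost scales with the hours actually used."""
    return rate_per_hour * hours_used


def yearly_cost_owning(purchase_price, lifetime_years, upkeep_per_year):
    """Ownership: a fixed cost each year, regardless of utilization."""
    return purchase_price / lifetime_years + upkeep_per_year


def break_even_hours(rate_per_hour, purchase_price, lifetime_years, upkeep_per_year):
    """Hours per year above which owning becomes cheaper than renting."""
    owning = yearly_cost_owning(purchase_price, lifetime_years, upkeep_per_year)
    return owning / rate_per_hour


# Hypothetical placeholder figures, for illustration only.
RATE = 2.0          # dollars per hour of rented compute
PRICE = 30_000.0    # dollars to buy an equivalent server
LIFETIME = 3        # years of useful life
UPKEEP = 2_000.0    # dollars per year for power, space, and repairs

bursty_hours = 3 * 7 * 24  # "two or three weeks a year" -> 504 hours

print(yearly_cost_renting(RATE, bursty_hours))           # 1008.0
print(yearly_cost_owning(PRICE, LIFETIME, UPKEEP))       # 12000.0
print(break_even_hours(RATE, PRICE, LIFETIME, UPKEEP))   # 6000.0
```

With these placeholder numbers, renting wins for a group that computes only a few weeks a year, while a group running close to year-round (a year has 8,760 hours) would pass the break-even point and do better by owning.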
Cost is a major concern at Illumina (San Diego, Calif.) as well. A giant in the sequencing industry, Illumina holds 70% of the sequencing market. Illumina can sequence an entire human genome in a day, and it offers its cloud-computing solution, BaseSpace, through Amazon Web Services (Seattle), the world’s largest cloud hosting service. Recently, Amazon announced a service providing reliable data storage starting at $0.01 per gigabyte per month.
Although that is a very economical rate for data storage by any standard, for long-term storage of hundreds or thousands of complete genomes, many experts agree it is better to store the data in the original tissue. In other words, if the raw data is needed again in the future, it is cheaper to regenerate the sequence from an archived sample.
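The store-versus-regenerate trade-off can be framed as a simple comparison, sketched below in Python. The only figure taken from the article is the quoted starting price of $0.01 per gigabyte per month; the dataset size and the cost of regenerating the sequence are hypothetical placeholders, and the sketch ignores retrieval, transfer, and sample-archiving costs.

```python
# Quoted starting price: $0.01 per GB per month, held in cents to keep
# the arithmetic exact.
PRICE_CENTS_PER_GB_MONTH = 1


def monthly_storage_usd(dataset_gb, cents_per_gb_month=PRICE_CENTS_PER_GB_MONTH):
    """Monthly storage bill in dollars for a dataset of the given size."""
    return dataset_gb * cents_per_gb_month / 100


def months_until_storage_exceeds(regeneration_usd, dataset_gb,
                                 cents_per_gb_month=PRICE_CENTS_PER_GB_MONTH):
    """Months of storage after which the cumulative bill passes the
    one-time cost of regenerating the data from an archived sample."""
    monthly = monthly_storage_usd(dataset_gb, cents_per_gb_month)
    if monthly <= 0:
        raise ValueError("monthly storage cost must be positive")
    return regeneration_usd / monthly


# Hypothetical placeholders: a 1,000 GB dataset and a $1,200 rerun.
print(monthly_storage_usd(1000))                  # 10.0 dollars per month
print(months_until_storage_exceeds(1200, 1000))   # 120.0 months
```

The result depends entirely on the placeholder inputs: the cheaper resequencing becomes relative to storage, the sooner regenerating from tissue comes out ahead.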
Illumina offers an even better deal to its customers. “We’ve picked the ultimate pricing strategy which is free,” said Alex Dickinson, senior vice president of cloud genomics for the company. “Customers get a free terabyte of data storage, enough for 10 years of typical usage of MiSeq. We do the secondary processing, alignment, and variant calling. We also do that for free,” Dickinson said.
MiSeq is Illumina’s “personal sequencer,” a next-generation sequencing system suitable for applications such as multiplexed PCR amplicon sequencing, targeted resequencing, small RNA sequencing, and so forth.
Illumina’s choice to offer free service reflects how researchers compare the company’s offerings with the infrastructure already available in their own facilities. Although that infrastructure is never truly “free,” because of the cost of housing it, its use often doesn’t come out of an individual laboratory’s budget. “If you try to charge for basic service, they try to compare that to free,” Dickinson said.
Rather than charging customers directly, Illumina channels revenue through third-party service providers, who will offer genomic analysis apps within the sequencing environment. The application programming interface (API) for BaseSpace will be open to partner companies, whose applications will be available in an app store. An initial block of 14 companies has already signed up to offer those apps.
Although Amazon cloud services provide an ideal solution for research, the rapidly emerging market for clinical sequencing comes with tougher regulatory requirements, chief among them compliance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA).
Amazon cloud services are not currently HIPAA compliant, and according to Richard Resnick, CEO of GenomeQuest Inc. (Westborough, Mass.), they are very unlikely to become compliant any time soon. Resnick said that the cloud is composed of three components: application, platform, and hardware. Achieving HIPAA compliance requires control of all three of those components. A service that is designed around coordination of many third-party providers, such as Amazon, would have a hard time ever validating full compliance for the entirety of its applications, platform, and hardware.
“What we’re doing is thinking about how to connect different parts of the health care ecosystem through next-generation sequencing and cloud-based genomics,” said Resnick.
GenomeQuest offers a secure HIPAA-compliant cloud designed for large scale analysis of whole genomes and gene panel samples from clinical laboratories.
Resnick said that unlike research labs, clinical laboratories can’t tolerate problems like noise and false positives in their data. “You can’t do that because there’s a real patient at the end of the day.”
So in addition to security and data privacy standards, cloud services for clinical sequencing applications have a higher bar to achieve for quality.
“There are still many uncertainties around the regulatory requirements for using cloud and hosted IT services in genomic medicine trials, so it was important for us to work with a company that really understands the healthcare IT space,” said Spyro Mousses, PhD, director of the Center for BioIntelligence at The Translational Genomics Research Institute (TGen) in Phoenix, Ariz.
In November 2011, TGen partnered with Dell to support the world’s first personalized medicine trial for pediatric cancer, and to leverage cloud computing resources donated by Dell. The Dell Giving commitment includes multi-year grant funding to support the clinical trial, as well as major hardware, software, and services contributions.
Focusing initially on neuroblastoma, the trials will leverage high-performance computing to dramatically accelerate the progression from processing sequencing information obtained from patient tumors to predicting the optimal treatment for each patient. As would be required of any trial under U.S. Food and Drug Administration (FDA) regulations, the cloud solution will be compatible with both FDA and HIPAA compliance requirements.
The KIDS Cloud, as TGen terms it, “will provide a hybrid-cloud platform for securely storing and exchanging genomic data and clinical information across multiple collaborating organizations,” according to Mousses.
TGen is also participating in several other large personalized medicine trials and hopes that this kind of cloud-enabled computational infrastructure can serve as a national model for collaborative personalized medicine. “It takes a village to cure a kid with cancer,” Mousses said.
With the advent of next-generation sequencing technology, the emphasis has shifted from bringing the cost of sequencing down to addressing the cost of analysis. “The bottleneck now is being able to effectively analyze the data,” said Marc Olsen, president and COO of DNAnexus (Mountain View, Calif.), a provider of cloud-based data management and analysis. Those challenges include not only the cost of storing and managing quantities of data that could fill thousands and thousands of PCs, but also questions of how to transfer data and how to share and collaborate while still maintaining security and privacy. The industry is currently seeking answers to those emerging problems, and in some cases is already moving toward some degree of standardization.
About the Author
Catherine Shaffer is a freelance science writer specializing in biotechnology and related disciplines with a background in laboratory research in the pharmaceutical industry.