1000 Genomes Project data available on Amazon Cloud

The world's largest set of data on human genetic variation, produced by the international 1000 Genomes Project, is now publicly available on the Amazon Web Services (AWS) cloud, according to the National Institutes of Health (NIH).

The public-private collaboration demonstrates the kind of solutions that may emerge from the Big Data R&D Initiative announced last week by the White House Office of Science and Technology Policy.

"The explosion of biomedical data has already significantly advanced our understanding of health and disease. Now we want to find new and better ways to make the most of these data to speed discovery, innovation and improvements in the nation's health and economy," said NIH director Francis S. Collins, M.D., Ph.D. Collins was among agency leaders speaking in support of the initiative at the launch event.

The Big Data initiative will initially engage at least six federal science agencies (including the NIH, the National Science Foundation, the Department of Defense and the Department of Energy), committing more than $200 million to a collaborative effort to develop core technologies and other resources needed by researchers to manage and analyze enormous data sets.

Among the NIH components participating in the Big Data initiative are the National Human Genome Research Institute (NHGRI) and the NIH National Center for Biotechnology Information (NCBI) — a division of the National Library of Medicine. NHGRI played a lead role in organizing and funding the international 1000 Genomes Project. NCBI, along with the European Bioinformatics Institute of Hinxton, England, began making 1000 Genomes Project data freely available to researchers in 2008.

Since the project's launch in 2008, the data set has grown enormously. At 200 terabytes, the equivalent of 16 million file cabinets filled with text or more than 30,000 standard DVDs, the current 1000 Genomes Project data set is a prime example of big data: it has become so massive that few researchers have the computing power to use it.

To help solve that problem, AWS has just posted the 1000 Genomes Project data for free as a public data set, providing a centralized repository on the Amazon Simple Storage Service. The data can be seamlessly accessed through services such as Amazon Elastic Compute Cloud and Amazon Elastic MapReduce, which provide organizations with the highly scalable resources needed to power the big data and high-performance computing applications common in research. Researchers pay only for the additional AWS resources they need to further process or analyze the data.

The public-private collaboration to store the data in the AWS cloud allows any researcher to access and analyze the data at a fraction of the cost it would take for their institution to acquire the needed internet bandwidth, data storage and analytical computing capacity.

"Improving access to data from this important project will accelerate the ability of researchers to understand human genetic variation and its contribution to health and disease," said NHGRI director Eric D. Green, M.D., Ph.D. NHGRI is a major funder of the 1000 Genomes Project, along with Wellcome Trust of London and BGI-Shenzhen of China.

Cloud access also enables users to analyze the data much more quickly, because it eliminates download time and lets users run their analyses over many servers at once. "Putting the data in the cloud provides a tremendous opportunity for researchers around the world who want to study large-scale human genetic variation but lack the computer capability to do so," said Richard Durbin, Ph.D., co-director of the 1000 Genomes Project and joint head of human genetics at the Wellcome Trust Sanger Institute in Hinxton, England.

Paul Flicek, D.Sci., co-leader of the 1000 Genomes Project Data Coordination Center (DCC), added that the new venue "fulfills a central goal of the 1000 Genomes Project to make the data as widely available as possible to accelerate medical discoveries."