NIH invests $32 million to increase utility of biomedical research data

The NIH has invested in new strategies to analyze and leverage increasingly complex biomedical data sets, often referred to as Big Data. The multi-institute awards constitute an initial investment of nearly $32 million by NIH’s Big Data to Knowledge (BD2K) initiative, which is projected to have a total investment of nearly $656 million through 2020, pending available funds.

With the advent of transformative technologies for biomedical research, such as DNA sequencing and imaging, biomedical data generation is exceeding researchers’ ability to capitalize on the data. The BD2K awards will support the development of new approaches, software, tools and training programs to improve access to these data and the ability to make new discoveries using them. Investigators hope to explore novel analytics to mine large amounts of data, while protecting privacy, for eventual application to improving human health. Examples include an improved ability to predict who is at increased risk for breast cancer, heart attack and other diseases and conditions, and better ways to treat and prevent them.

“Data creation in today’s research is exponentially more rapid than anything we anticipated even a decade ago,” said Francis S. Collins, M.D., Ph.D., NIH director. “Mammoth data sets are emerging at an accelerated pace in today’s biomedical research and these funds will help us overcome the obstacles to maximizing their utility. The potential of these data, when used effectively, is quite astounding.”

The funding will establish 12 centers, each of which will tackle specific data science challenges. The awards will also support a consortium that will cultivate a scientific community-based approach to developing a data discovery index, and will support data science training and workforce development.

Studies generating large amounts of data continue to proliferate, from imaging projects to epidemiological studies examining thousands of participants to large disease-oriented efforts such as the Cancer Genome Atlas, which examines the genomic underpinnings of more than 30 types of cancer, and the ENCODE Project, which seeks to identify all functional elements in the human genome. Such efforts have generated billions of data points and provide opportunities for the original researchers and other investigators to use these results in their own work to advance our knowledge of biology and biomedicine.

“The future of biomedical research is about assimilating data across biological scales from molecules to populations,” said Philip E. Bourne, Ph.D., NIH associate director for data science. “As such, the health of each one of us is a big data problem. Ensuring that we are getting the most out of the research data that we fund is a high priority for NIH.”

Bourne, in calling for the establishment of a “digital ecosystem” for biomedical research, said the new BD2K programs are at the forefront of NIH’s efforts to increase the efficiency and cost effectiveness of scientific discovery.

Challenges in making the best use of such biomedical information are many. They include problems of locating data and the appropriate software tools to access and analyze them, lack of data standards for many types of data and the low adoption of data standards across the research community.

There also is a need for new policies to facilitate data sharing while protecting privacy. A lack of standards and an unwillingness to make data available to colleagues hamper efforts to make data fully useful to the broad research community. Large data sets are relatively expensive to generate, and the return on investment increases when they are shared and used widely.

Many scientists also do not have the opportunity or facility to use big data. While teams at large research universities and academic medical centers may have bioinformatics and data infrastructure, individual scientists in the biomedical research community may not. Regardless of their research facilities, many scientists have not been trained in the computational skills needed to access and analyze large data sets.

The BD2K initiative, launched in December 2013, is a trans-NIH program with funding from all 27 institutes and centers, as well as the NIH Common Fund. NIH’s effort is being developed in the context of a number of related projects elsewhere in the world, including those under development in the U.K. and Australia, and by the E.U.