Data-Driven Science: How AI and Open Data will Revolutionize Scientific Discovery

Scientists have long been perceived and portrayed in film as old people in white lab coats perched at a bench full of bubbling fluorescent liquids. The present-day reality is quite different. Scientists are increasingly data jockeys in hoodies sitting before monitors analyzing enormous amounts of data. Modern-day labs are more likely composed of sterile rows of robots doing the manual handling of materials, and lab notebooks are now electronic, stored in massive data centers holding vast quantities of information. Today, scientific input comes from data pulled from the cloud, with algorithms fueling scientific discovery the way Bunsen burners once did.

Advances in technology, and especially instrumentation, enable scientists to collect and process data at an unprecedented scale. As a result, scientists are now faced with massive datasets that require sophisticated analysis techniques and computational tools to extract meaningful insights. This also presents significant challenges — how do you store, manage, and share these large datasets, as well as ensure that the data is of high quality and reliable?

The Impact of Big Data on Science

This growth in data is transforming the way scientists conduct research and enabling new discoveries across many fields, especially in the areas of genome and protein research. It has fostered the emergence of whole new types of scientist: bioinformaticians and data scientists who work hands-on with big data by developing and applying algorithms. In fact, “data scientist” has been at the top of the list of desirable jobs on career sites for the past few years. However, while the demand for professionals adept at working with big data is at an all-time high, there is a significant lack of skilled talent.

In medicine, as in other fields, it’s not just the volume and velocity of data generation that are increasing, but also the variety of data being collected to answer research questions. For instance, flow cytometry data is fundamentally different from DNA sequence data, which in turn is entirely different from 3D models of proteins. The tools and algorithms that work for one data type are not suited for another. Flexibility in data storage and modeling is crucial for repurposing data. This is especially true for predictive science, where integration needs to occur between data and data types unrelated to the hypotheses of any of the original studies.

Turning to Machine Learning and AI

Technology can act like a powerful flashlight, illuminating hidden patterns and insights that exist in vast amounts of data and allowing us to see and understand things that were previously hidden in the dark. That’s why, even as new AI platforms like ChatGPT generate headlines and stoke fear about potential risks, drug discovery is one setting where artificial intelligence (AI) and machine learning (ML) are poised to make a significant, positive impact.

For example, during the pandemic, I had the opportunity to collaborate with the team behind the EVE Online video game to create Project Discovery, a free minigame that enabled tens of thousands of gamers to become citizen scientists. Using data from cell samples of patients with COVID-19 and other immune system diseases, players were trained to identify different cell patterns generated using a technology known as flow cytometry. The game was incentivized with rewards and rankings to make it fun and challenging, but many players also expressed a desire to participate in scientific research and satisfaction in doing so, especially as it related to their own experience.

To date, players have solved millions of puzzles representing hundreds of years of effort. All data from the project will be freely available for open science. Companies like Dotmatics will be able to use the data to develop ML approaches to flow cytometry data analysis, leading to exponentially faster, less expensive and more significant medical breakthroughs.

Today, both ML and AI are being used around the world in many research labs and universities to expedite discoveries. The National Cancer Institute’s (NCI) Center for Cancer Research has developed deep learning algorithms to improve cancer detection in people who have symptoms. For example, one model can function as “a virtual expert,” reviewing MRIs in hard-to-detect cancer types, guiding less-experienced radiologists and minimizing error rates. Similarly, AI is used by the University of Toronto to predict Alzheimer’s risk, by Rutgers University to predict cardiovascular disease and by hundreds of startups to design cheaper, safer drugs with fewer side effects.

Complexities of Big Data and Making it FAIR

Despite these advances, the complexity of the data and the heterogeneity of the tools required to analyze it can make it difficult for researchers to collaborate effectively to generate the big datasets that AI requires. Efforts such as the FAIR Guiding Principles for scientific data management and stewardship provide guidelines to improve the Findability, Accessibility, Interoperability and Reuse (FAIR) of digital assets. They are increasingly being adopted and are even being mandated by granting agencies.

While the old way of science using beakers and chemistry is still important, tomorrow’s scientists will be able to explore and understand the world around us and scale ambitious research into areas that are presently economically prohibitive. However, to truly harness the power of AI, we must invest in further improvements to the infrastructure supporting the integration, analysis and reuse of data that have already become the new frontier of scientific discovery.

Access more information on Project Discovery at https://bit.ly/3Xc4cPD.

Ryan Brinkman is vice president and research director at Dotmatics.

The opinions expressed here are those of the author and do not necessarily reflect the views of CenterWatch Weekly.