Leveraging data ecosystems to foster African research networks

September / October 2020 | Volume 19, Number 5

September / October 2020 Global Health Matters newsletter

By Susan Scutti

Africa’s data science ecosystem was examined during the Harnessing Data Science for Health Discovery and Innovation in Africa (DS-I Africa) conference as participants considered the open data science platform component of the program. The successful applicant for that component will develop and maintain a data-sharing gateway and provide the organizational framework for directing and managing the initiative’s common activities. Cloud computing, data security and interoperability were among the issues considered.

“We've been making amazing progress in generating increasingly vast datasets that offer tremendous research capabilities,” said Dr. Benedict Paten, of the University of California, Santa Cruz. Yet massive amounts of information mean “it's no longer possible to take the data and bring it to you, bring it to your laptop or bring it to your institution,” he said. Continuing the familiar, outmoded ways of handling data not only leads to great expense but also causes security problems and sharing difficulties. Data scientists need to “invert the model” by creating data biospheres that are modular, community-focused, open and standards-based, said Paten.

In a well-structured data science project, researchers will come to the data, which will live “in situ on a cloud,” while applications, resources and services are provided alongside the data. This model not only reduces redundancy, it also creates “layers of security and threat detection around the data so that we can know who's accessing it and make sure that appropriate responsible research is happening,” said Paten. An added benefit when scientists design open infrastructures is that data becomes “more accessible to researchers who don't have large institutional computer infrastructure,” said Paten. And, since the cloud is “very elastic,” this will also “potentially enable larger scale analyses that just weren't possible in the past.”

Paten emphasized that the components of underlying software infrastructure need to be designed using existing standards and application programming interfaces. “By doing that, we're going to make it easier and simpler for other groups to reuse what we've created and indeed interoperate with us by standing up their own services and their own systems and allowing data to be sort of analyzed across these platforms,” said Paten. “That's where we want to get to, not a siloed world, but a world in which things interoperate.” The Human Cell Atlas Program and other recent NIH projects exemplify this new paradigm, he said.

Crowdsourcing is another promising aspect of data science, said Dr. Geoffrey Siwo of the University of Notre Dame. He helped organize the Malaria DREAM Challenge, a project to identify problems in malaria that could be solved using genomic data; it brought together 360 participants from 31 countries. The primary lesson learned, he said, was “you get a huge diversity of solutions that you as an individual or your lab or your company could not have imagined” when you open a data science problem to “basically anyone in the world.”

Though he champions building out data science infrastructure in Africa, Siwo recognizes there are hurdles. These include unaligned datasets, which hinder artificial intelligence, as well as privacy and security issues or commercial interests that preclude data sharing. Creating frameworks that can encrypt data and then analyze it without the need for decryption is one solution, he suggested. “If we do this, then it means that a wide number of people can actually have access to these datasets without threatening the security or the privacy of those datasets.”

To better coordinate data science efforts across Africa and provide a discussion forum, the African Open Science Project was established and is managed by the Academy of Science of South Africa (ASSAf). “We did find that there were many open science, open access and open data projects on the continent,” said ASSAf’s Susan Veldsman, but most projects are decentralized and not very visible. Along with taking a big-picture view, Veldsman and her colleagues studied process and methodology for the project, which ended in 2019. “The biggest missing link in all the work we did is the fact that researchers were not necessarily included when setting up the infrastructures,” said Veldsman, who added that the requirements of researchers need to be addressed. “We need strong leadership, and we need to build on the knowledge that has been gained over the last couple of years.” The future of data science depends on collaboration among countries, institutions, projects, researchers and funders, with resource sharing a must, she said.

Cooperation, particularly on a global scale, is second nature to Professor Russ Taylor of the University of Cape Town. As director of the Inter-University Institute for Data Intensive Astronomy, he is part of a wider movement towards “large, global, mega science unique projects.” Large datasets and distributed global research teams mean “we're facing a different dynamic in terms of how to do science, both in terms of the technology and the sociology,” said Taylor. “We no longer have the old model of hierarchical, in-house solutions to problems. But we now can harness the diverse talents of the global community if we can provide them the means to work with the data and to bring their knowledge and expertise to bear on the problems.”