Synthetic data allows for safe sharing in low-resource settings
January/February 2026 | Volume 25 Number 1
Photo courtesy of Dorcas Mwigereri
Dorcas Mwigereri, a data science researcher at Aga Khan University
The Kaloleni-Rabai Health and Demographic Surveillance System (KRHDSS) is embedded in seven rural and three peri-urban community health units centered around Mariakani township, Kenya. Set up by Aga Khan University (AKU) in 2017, KRHDSS holds information on more than 103,000 residents. The beauty of such a large dataset is that it collects data over time, so it can reveal otherwise undetectable health patterns that affect a community, says Dorcas Mwigereri, a research fellow at AKU. “We can study separate diseases, comorbidities, and also look at how one disease leads to the development of another.”
Unfortunately, accessing, using, and sharing medical data is restricted by the necessary regulations to protect patient privacy, and this constrains the development and deployment of new technologies within health systems, says Mwigereri. “How do we solve this problem? That’s where synthetic data comes in—synthetic data creates a dataset with the same statistical properties as the original data yet with minimal privacy risks.”
One way to create synthetic data is with a generative adversarial network (GAN), a type of machine learning model that can anonymize information in datasets with complex structures. So which GAN would work best in the Kenyan context? Mwigereri and her colleagues evaluated three open-source GANs on fidelity (how well a model reproduces the statistical patterns of the original data), utility (how well a model supports analysis and prediction), and privacy (how well a model protects confidential data), and found that CTGAN performed best overall.
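To make the three evaluation criteria concrete, here is a minimal Python sketch using deliberately simple stand-in measures. The article does not say which metrics the researchers used, so everything here is an assumption for illustration: the function names, the toy records, and the choice of total variation distance (fidelity), a train-on-synthetic, test-on-real lookup classifier (utility), and an exact-copy rate (privacy).

```python
from collections import Counter, defaultdict


def total_variation_distance(real_values, synthetic_values):
    """Fidelity proxy: 0.0 means identical category frequencies, 1.0 means no overlap."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real_values) - synth_freq[c] / len(synthetic_values))
        for c in categories
    )


def tstr_accuracy(synthetic_rows, real_rows):
    """Utility proxy: fit a lookup classifier on synthetic rows, score it on real rows.

    Each row is a (feature, label) pair. The classifier predicts the most
    common synthetic label seen for a given feature value.
    """
    votes = defaultdict(Counter)
    for feature, label in synthetic_rows:
        votes[feature][label] += 1
    # Fallback for feature values that never appear in the synthetic data.
    fallback = Counter(label for _, label in synthetic_rows).most_common(1)[0][0]
    correct = 0
    for feature, label in real_rows:
        if feature in votes:
            predicted = votes[feature].most_common(1)[0][0]
        else:
            predicted = fallback
        correct += predicted == label
    return correct / len(real_rows)


def exact_copy_rate(real_rows, synthetic_rows):
    """Privacy proxy: share of synthetic rows that reproduce a real row verbatim."""
    real_set = set(real_rows)
    return sum(1 for row in synthetic_rows if row in real_set) / len(synthetic_rows)


# Hypothetical toy records: (age band, has condition). Not real KRHDSS data.
real = [("young", "no"), ("young", "no"), ("old", "yes"), ("old", "no")]
synthetic = [("young", "no"), ("young", "no"), ("old", "yes"), ("old", "yes")]

fidelity_gap = total_variation_distance(
    [label for _, label in real], [label for _, label in synthetic]
)
utility = tstr_accuracy(synthetic, real)
copy_rate = exact_copy_rate(real, synthetic)
```

A lower fidelity gap and a lower copy rate are better, while a higher utility score is better. Practical evaluations generally use richer measures than these, but the trade-off is the same: a generator that simply memorizes real records can score well on fidelity and utility while failing on privacy.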
Good performance within a specific context is crucial when creating synthetic data, says Mwigereri. She recalls using Teladoc, an automated, AI-enabled health care service, while studying in the U.S. “The feedback was ‘we're not able to understand what you're saying, please get in touch with the facility.’ My accent is different. Clearly this model was not trained with data from my context—an African context.”
Photo courtesy of the UZIMA-DS research hub
UZIMA data science research hub trainees attend classes
In Kenya, there are 47 tribes. Other African nations similarly include different populations. Meanwhile, individual countries do not always share a unifying language. “African researchers need to collect enough data from our people to create technologies that fit our societies, so that we can then co-create solutions with researchers in the U.S., in the UK, wherever.”
As she completes her PhD, Mwigereri continues working on two additional DS-I Africa projects in Kenya. One uses AI with data collected across five facilities to identify healthcare workers prone to depression. The other relies on electronic health records (EHRs) to distinguish women in danger of developing gestational diabetes mellitus. The data is there in the EHRs, but it was collected for clinical purposes, not research, so it’s not yet accessible to researchers, says Mwigereri.
“If we sort out the issues around data access, Africa will see improvements in the healthcare sector… he who owns the data, owns the insights.”
Article:
Artificial intelligence and machine learning for early detection and diagnosis of colorectal cancer in sub-Saharan Africa
Publication:
Gut (the journal of the British Society of Gastroenterology), 2022
DS-I Africa Research Roundup
Examining alcohol use and stroke risk
Stroke is a leading cause of death and disability in many sub-Saharan African countries. Most studies that explore interactions between alcohol and stroke focus on other regions, due to a dearth of relevant data on the continent. To address this issue, researchers conducted a multicenter study in Nigeria and Ghana, comparing people who’d recently experienced a first stroke with stroke-free adults. The researchers examined different patterns of alcohol consumption, ranging from lifetime abstinence to heavy drinking. Most participants, particularly women, were lifetime abstainers; current drinkers were more often younger men and more likely to smoke. The findings indicate that moderate, binge, and heavy drinking are linked to higher odds of stroke.
Article:
Association between alcohol consumption and stroke in Nigeria and Ghana: A case-control study
Publication:
International Journal of Stroke, 2024
Genomics, AI, and the fight against drug-resistant TB
Antimicrobial resistance (AMR) threatens the effective treatment of infectious diseases, particularly in low- and middle-income countries. Tuberculosis (TB), an infectious disease that mainly affects the lungs, contributes to AMR, with drug-resistant TB complicating control efforts and requiring longer treatments. Traditional diagnostic methods often fail to detect TB resistance, so this study explored the use of machine learning to predict resistance to four first-line TB drugs. The researchers combined whole-genome sequencing data with clinical information from Ugandan patients and then evaluated 10 machine learning models. Logistic regression, gradient boosting, and XGBoost models performed best overall, often outperforming standard tools on the Ugandan dataset. However, model accuracy dropped when tested on South African data, highlighting challenges in generalizing predictions across regions and bacterial lineages.
Article:
Machine learning-based prediction of antibiotic resistance in Mycobacterium tuberculosis clinical isolates from Uganda
Publication:
BMC Infectious Diseases, 2024
Stronger viral surveillance networks: Lessons from Nigeria’s COVID-19 genomes
Nigeria played a key role in tracking the COVID-19 pandemic. The country’s researchers analyzed 7,759 SARS-CoV-2 genomes collected between February 2020 and March 2023, revealing patterns of viral diversity and the spread of variants. Yet gaps in Nigeria’s access to sequencing facilities were evident from the fact that most samples came from the Southwest and North Central regions. Early in the pandemic, national institutions led sample collection and sequencing, but over time, regional hospitals and healthcare facilities increasingly contributed; sequencing shifted from research laboratories to government institutes. The study emphasizes the need for a coordinated national strategy, including standardized protocols and a network of sequencing labs across all geopolitical zones. Sustaining and expanding genomic surveillance will strengthen Nigeria’s and Africa’s capacity to respond to future outbreaks, improve public health decision-making, and provide globally relevant data.
Article:
Genomic diversity and surveillance of SARS-CoV-2 in Nigeria
Publication:
BMC Genomics, 2025
“I am because we are”: Ubuntu in the age of Big Data
Data-driven health research and precision medicine are spreading rapidly across Africa, fueled by the continent’s rich genetic diversity and growing investments in genomics. Questions about how personal health and genetic data are collected, shared, and used must be addressed, say the authors; a new ethics framework that is grounded in African philosophies is needed. They propose shifting research from a transactional model—where people simply provide data—to a participatory one that enlists people and communities as active partners. Recommendations include involving communities in setting research priorities, sharing power between data providers and users, providing public education about genetics, and giving people greater control over their data through dynamic consent. Ethically robust, culturally grounded governance is essential to build trust, prevent exploitation, and ensure that genetic research benefits all, the authors conclude.
Article:
Genomics and Health Data Governance in Africa: Democratize the Use of Big Data and Popularize Public Engagement
Publication:
Hastings Center Report, 2024
Reviewing past uses of AI in support of early childhood development research
How has machine learning been used to support early childhood development (ECD) research? To map the existing literature, the authors reviewed 27 studies that applied machine learning techniques to developmental outcomes in children ages 0–8 years. Most studies came from high-income countries, with none from sub-Saharan Africa. Machine learning approaches, mainly supervised learning and deep learning, were most often used to predict cognitive, language, and motor development, typically in children older than 2. Common data sources included images, videos, and sensor data, while socially and environmentally relevant information was used less often. Although many models showed good predictive performance, few were externally validated, explained their predictions, or were integrated into real-world settings.
Article:
Application of machine learning in early childhood development research: a scoping review
Publication:
BMJ Open, 2025
Updated February 13, 2026