African Next Voices: RobotsMali Releases Major Bambara Dataset, Models & Research Tools

African Next Voices: Major Bambara Dataset, Models, Human Evaluation, and Code

612 Hours of Speech
18 African Languages
9,000 Target Hours

RobotsMali AI4D Lab has released 612 hours of transcribed Bambara speech, together with fine-tuned models, evaluation datasets, data collection and processing code, and a human-evaluation study, as part of the African Next Voices project. The complete project report, "Dealing with the Hard Facts of Low-Resource African NLP", is available as an arXiv preprint.

This is the largest annotated collection of Bambara speech data assembled to date. The African Next Voices project is a multi-country effort that aims to gather 9,000 hours of speech across 18 African languages. Its goal is to build high-quality datasets that empower local communities, support inclusive AI research, and provide strong foundations for ASR in underrepresented languages.

The project also released several models fine-tuned on a subset of the dataset. These monolingual models, based on NVIDIA's Parakeet and QuartzNet, have been deployed on-device in An Be Kalan, a Bambara learning app for iOS and Android.

Also released is Nyana-Eval, a compact, stratified evaluation subset for benchmarking Automatic Speech Recognition (ASR) models in Bambara. It consists of 45 audio recordings selected to represent real-world linguistic and acoustic challenges in low-resource Bambara speech. The subset is derived from the larger RobotsMali/Bam_ASR_Eval_500 corpus and is designed for quick, reproducible human evaluation.
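As an illustration of what "stratified" means when drawing a small subset from a larger corpus, here is a minimal Python sketch of proportional stratified sampling. It is not the procedure used to build Nyana-Eval, which is not described here. The function name, the `condition` field, and the three condition labels are hypothetical placeholders chosen only to make the example runnable.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, total, seed=0):
    """Draw `total` items so each stratum is represented in proportion to its size."""
    if total > len(items):
        raise ValueError("cannot sample more items than are available")
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible

    # Group the items by stratum label.
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    # Integer share per stratum, plus the remainder used to hand out leftover slots.
    n = len(items)
    quotas, remainders = {}, {}
    for name, group in strata.items():
        quotas[name], remainders[name] = divmod(total * len(group), n)

    # Largest-remainder rule: leftover slots go to the strata closest to one more item.
    leftover = total - sum(quotas.values())
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[name] += 1

    sample = []
    for name, group in strata.items():
        sample.extend(rng.sample(group, quotas[name]))
    return sample


# Hypothetical 500-item corpus with made-up condition labels (not real metadata).
corpus = (
    [{"id": i, "condition": "clean"} for i in range(250)]
    + [{"id": i, "condition": "noisy"} for i in range(250, 400)]
    + [{"id": i, "condition": "code_switched"} for i in range(400, 500)]
)
subset = stratified_sample(corpus, key=lambda rec: rec["condition"], total=45)
```

With a fixed seed, the same 45 items are selected on every run, which is one simple way to keep a small evaluation subset reproducible.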

Nyana-Eval was used for a human evaluation of the AFVoices fine-tuned models and for a comparison of automated metrics against human judgments. The results are reported in the study by Madani Amadou Tall, Analyse Comparative Humaine des Modèles ASR Bambara de RobotsMali (Comparative Human Analysis of RobotsMali's Bambara ASR Models).
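For readers unfamiliar with automated ASR metrics, the sketch below computes word error rate (WER), a widely used one. Which automated metrics the study actually compared with human evaluation is not stated here, so WER is an assumption chosen for illustration. The function name and the placeholder strings are likewise hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not real transcripts: one substitution and one deletion.
score = word_error_rate("a b c d", "a x c")
```

Here `score` is 0.5, because two errors are counted against a four-word reference. A metric like this scores every word error equally, which is one reason a comparison with human judgments can be informative.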

We are also releasing the source code for our mobile data collection app, a minimalist Flutter-based tool with a simple user interface designed to minimize the training time required for data collection agents.

Try An Be Kalan

Experience our ASR models in action with our Bambara learning app

Our Mission

This work reflects RobotsMali's broader mission to advance AI and NLP research for Malian languages, with a long-term focus on improving education, access, and technology across Mali and the wider Manding linguistic region.
