RobotsMali AI4D Lab Unveils Two Open-Source Speech Recognition Models for Bambara

1. Introduction

Automatic speech recognition (ASR) is a growing field, but low-resource languages such as Bambara remain largely absent from its advances. Bambara is a major lingua franca in West Africa, used daily by millions of speakers. Yet the lack of efficient and accessible ASR models limits its integration into modern technology. To address this gap, RobotsMali AI4D Lab has launched a series of initiatives and experiments, and the first results have now been published: two open-source ASR models optimized for real-world use:

Soloni TDT-CTC (114 M): a hybrid model with two independent decoders, a Token-and-Duration Transducer (TDT) and a Connectionist Temporal Classification (CTC) decoder, fine-tuned from Nvidia's Parakeet-tdt-ctc-110m model.
QuartzNet-15×5 (19 M): the smallest ASR model for Bambara to date, also fine-tuned from Nvidia's open-source model of the same name.

Both models were trained on approximately 35 hours of semi-professionally transcribed Bambara audio. A report on the experiments that led to these models has also been published on Weights & Biases (https://wandb.ai/yacoudiarra-wl/bam-asr-nemo-training/reports/Draft-Technical-Report-V1–VmlldzoxMTIyOTMzOA). The aim of this release is to stimulate research on speech recognition for low-resource Malian languages, Bambara in particular, and to offer considerably better-performing models for Bambara transcription than those currently available. Thanks to their small size, the two models are better suited to resource-constrained settings while still offering performance that is of interest for research.

2. Summary of Model Performance

The table and figure below give the results of the two published models on the test set of RobotsMali's bam-asr-all dataset.

The metric used is Word Error Rate (WER), the standard metric for evaluating speech recognition models.
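
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between the reference transcript and the model output (substitutions, deletions and insertions), divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the evaluation code behind the reported figures; the function name and the placeholder tokens are assumptions, and any text normalization applied before scoring is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not real Bambara: one substitution and one deletion
# over four reference words.
example_wer = word_error_rate("w1 w2 w3 w4", "w1 wX w3")
print(f"WER = {example_wer:.1%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.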

Model	Size (millions of parameters)	WER, TDT decoder (%)	WER, CTC decoder (%)
Soloni-114M TDT-CTC	114	66	40.6
QuartzNet-15×5	19	–	46.5

Soloni-114M with its CTC decoder stands out for the best transcription accuracy, which comes at the cost of greater complexity and lower interpretability compared with the fine-tuned QuartzNet.
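
To put that trade-off in numbers, the short calculation below uses only the CTC-decoder WER and size values from the table above; the variable names are illustrative.

```python
# Values copied from the results table (CTC decoder).
soloni_wer, soloni_params_m = 40.6, 114
quartznet_wer, quartznet_params_m = 46.5, 19

absolute_gain = quartznet_wer - soloni_wer        # WER points
relative_reduction = absolute_gain / quartznet_wer  # fraction of QuartzNet's WER
size_ratio = soloni_params_m / quartznet_params_m

print(f"Absolute WER gain: {absolute_gain:.1f} points")
print(f"Relative WER reduction: {relative_reduction:.1%}")
print(f"Soloni is {size_ratio:.0f}x larger than QuartzNet")
```

In other words, Soloni's CTC decoder lowers WER by 5.9 points (about 12.7% relative) while using six times as many parameters.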

3. Open Source Access and Collaboration

These models and all associated resources are open-source to encourage research and innovation:

Soloni-114M TDT-CTC model: Hugging Face (https://huggingface.co/RobotsMali/soloni-114m-tdt-ctc)
QuartzNet-15×5 model: Hugging Face (https://huggingface.co/RobotsMali/stt-bm-quartznet15x5)
Code and configurations: GitHub (https://github.com/diarray-hub/bambara-asr)
Experimental report: W&B (https://wandb.ai/yacoudiarra-wl/bam-asr-nemo-training/reports/Draft-Technical-Report-V1–VmlldzoxMTIyOTMzOA)

We invite researchers and developers to test, refine and enrich these models in order to continuously improve ASR for Bambara and other African languages.
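
As a starting point for testing, here is a minimal sketch of how the models might be loaded. It assumes the Hugging Face repositories contain NeMo checkpoints (both models are fine-tuned from Nvidia NeMo models) and that the NeMo toolkit is installed with its ASR collection. The repository IDs come from the links above; the `transcribe` helper, the dictionary keys and the audio path are hypothetical, and the model cards remain the authoritative usage reference.

```python
# Hugging Face repository IDs, taken from the model links above.
MODEL_IDS = {
    "soloni": "RobotsMali/soloni-114m-tdt-ctc",
    "quartznet": "RobotsMali/stt-bm-quartznet15x5",
}


def transcribe(audio_paths, model_key="soloni", use_ctc_decoder=False):
    """Transcribe audio files with one of the two models (hypothetical helper)."""
    # Imported lazily so the module can be read or imported without NeMo.
    import nemo.collections.asr as nemo_asr

    # Assumption: each repository holds a NeMo checkpoint that
    # ASRModel.from_pretrained can resolve from the Hugging Face Hub.
    model = nemo_asr.models.ASRModel.from_pretrained(model_name=MODEL_IDS[model_key])

    if use_ctc_decoder and model_key == "soloni":
        # NeMo hybrid transducer-CTC models decode with the transducer branch
        # by default; this call switches decoding to the CTC branch.
        model.change_decoding_strategy(decoder_type="ctc")

    # The return type (plain strings or hypothesis objects) depends on the
    # installed NeMo version.
    return model.transcribe(audio_paths)


# Example call, with a hypothetical audio file:
#   transcribe(["sample.wav"], model_key="soloni", use_ctc_decoder=True)
```

Selecting the CTC decoder for Soloni corresponds to the lower of its two WER figures in the results table.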