This is the story of a home-grown machine translation initiative working on an under-resourced language in a near-zero-resource environment, equipped with the bare minimum of science but with intimate knowledge of our languages and communities and the desire to perform miracles. We have big ambitions: the languages we work with count some 50 million speakers, and we believe that NLP in the languages that people speak could be the key to cutting our illiteracy rate of 70% in half within 5 years. As far-reaching as that goal might be, there is a broader one beyond it: full participation in the connected world, full access to the global knowledge repository in the languages people speak. To take first steps towards these goals, we created the first Bambara-French transformer, collected text and voice machine-learning datasets, and used AI in the process of creating 180 children's books. Like Ghana NLP and Keoni Mahelona & Te Hiku Media, we were heroes, rewarded with our fleeting 15 minutes of fame.
Good, but what can we really achieve? We had not gotten very good quality out of our Bambara-French transformer when Google came out with a 32-layer, 6B-parameter multilingual machine translation model that included Bambara, which immediately consigned our home-grown model to the dust bin. It was a similar story with ASR, where our home-grown effort was obsoleted by OpenAI's Whisper, trained on 680,000 hours of data. Fundamental work in NLP appears to be within reach of only the very well-resourced few; we can no more contribute at this level than we could run high-energy physics experiments on a supercollider. It is good that the AI juggernauts are creating open foundational models to which we can turn for fine-tuning, but the critical point is that the future of NLP for low-resource languages is entirely in the hands of these beneficent giants. The question imposes itself: while learned and well-intentioned researchers who care about the next 1000 languages currently have corporate sanction to experiment with multilingual models, what priority is given to the full development of quality NLP for low-resource languages? In science, pathways to great things often emerge in the normal meandering course of research, but realization always requires intention. We need to form a global intention around low-resource language NLP, and we need a mechanism to drive our objective into reality.
There is no shortage of examples of concerted activity towards specific goals on a global scale. One that seems particularly relevant is the Unicode Consortium. Today, the script of every written language can be displayed in a web browser and in the vast majority of computer applications of every kind, and this would likely not be the case without the existence of the Unicode Consortium. Before Unicode, there was considerable skepticism that all the world's writing systems could be represented by a universal character encoding standard, along with concerns about the cost in terms of development (virtually all widely used software needed to be modified) and even of bandwidth and storage utilization. Fortunately, the juggernauts of the day saw the long-term interest and, with the participation of language experts and enthusiasts worldwide, gave us the multilingual internet that exists today. NLP for the world's languages needs an analogous effort: a Universal NLP Consortium.
The core principle of the Universal NLP Consortium might be expressed as follows: "Every language community that is willing to do the work to create the data needed to train NLP systems will have their language included in global-scale NLP systems." There are many difficulties that will have to be resolved in this declaration, but major tasks around developing guidelines, standards, and commitments are clear consequences of it: measurements of the quantity and quality of the data needed; common tools for developing datasets, including crowd-sourcing, correction, validation, and labelling; the framework that will allow datasets to be plugged into the next wave of NLP models; a definition of what it means to have a language included in NLP systems, including measurements of quality for the variety of NLP tasks; methods for feedback and progressive improvement of the quality of NLP systems; coordination of language communities; and enabling exchange between language communities and providers of foundational models. Finally, there is the work of imagining the global transformation that will occur when supporting the languages that people speak ends the knowledge access gap between the Global North and the Global South.
AI4D is necessarily a political as well as a scientific paradigm shift. It is pan-African, an expression of a common will to develop sustainable African capacity to improve the lives of Africans. Born of international cooperation, and focused on the use of core technologies that are not African in origin, it is not isolationist, but it does strongly assert that imposed barriers to the scientific advancement of Africa, advancement that can lead to social and economic development, must be dismantled, and that access to investment, resources, and opportunity must be supported. AI4D is not asking; it is offering. Offering to contribute to a world where Africa is not a case apart and the benefits of AI reach everyone in ways that matter.