Language Litanies: Training AI on Diverse Languages

Languages need constant use to survive: they live through everyday circulation among ordinary speakers, who deploy them to construct their shared social reality. Even as English dominates as the world's lingua franca, the ease of communication it enables comes with its own set of problems, the most serious of which is the disappearance of many of the world's languages. This happens for a wide variety of reasons, chief among them the loss of a language's perceived value over successive generations, as bilingual speakers gradually shift to dominant languages in the face of institutional neglect.

Here, Meta's Massively Multilingual Speech (MMS) model holds tremendous potential to expand the boundaries of text-to-speech and speech-to-text across a panoply of languages numbering upward of 1,100, a steep jump from the roughly 100 languages covered previously. Beyond helping these languages survive by keeping them in common parlance and everyday discourse, MMS also propels them forward by capturing the new vocabulary and syntax that emerges from ordinary usage. Further, it uniquely addresses the discomfort of millions of users with English and other dominant languages, who now have avenues to explore the internet and create content in the languages they are most proficient in. Not only does this lead to a colossal surge in the number of people able to access the internet, it also makes the internet a more robust and dynamic platform that caters to the differential needs of its users. These benefits are amplified manifold because the model is open source, fostering collaborative research by the world community to keep augmenting it and plugging its failures.

Training MMS models requires feeding them huge amounts of audio data in thousands of languages, and the primary challenge is the availability of such data. Meta overcame this hurdle by using texts like the Bible, which have been translated into many languages and for which there is existing research comparing the different translations.

Google India too is undertaking research in this domain, working closely with the Indian Institute of Science (IISc) on Project Vaani to create AI models capable of understanding and generating text in Indian languages. Countries like India present a unique challenge to AI models through the sheer number of languages and dialects spoken in the subcontinent, which can change every couple of kilometers. Tracking this diversity and training AI algorithms on these constant shifts in inflection and grammar is a mammoth challenge. Google's model is called Multilingual Representations for Indian Languages (MuRIL), and the project currently aims to cover any language or dialect spoken by more than 100,000 people.

These projects hold immense importance in our globalized, multilingual society, in which inclusion and diversity must be the values on which we base our progress. As AI models have often been criticized for reproducing language bias and worsening social inequalities, these efforts are a step in the right direction.

Sources:

https://blog.pipplet.com/openai-linguistic-diversity-bias-reduction-ai-language-testing/

https://about.fb.com/news/2023/05/ai-massively-multilingual-speech-technology/

https://www.livemint.com/companies/start-ups/google-taps-ai-to-grasp-india-s-language-diversity-11671466688191.html
