AI as Amanuensis: Understanding Speech-to-Text Technologies

For most of history, people have relied on other individuals to transcribe their words into text. This person, known as an “amanuensis”, took dictation from writers to help them commit their works to the written form. Thus, people in high positions dictated to their secretaries, and writers suffering from loss of vision employed amanuenses to jot down their words for them.

The 21st century witnessed a revolution with the rise of speech recognition technologies that transcribe a user's words automatically. Speech recognition enables computer applications to translate human speech into text by analyzing voice and language and identifying individual words in the correct sequential order.

And yet, despite this promise, these technologies long fell short of accurate transcription, precisely because human speech is not articulated perfectly. Speech is marked with inflections such as stops, stutters, laughs, and hesitations, which makes transcribing it an incredible challenge. Add to this the human tendency to elide parts of speech and verbalize only fragments of words, and the task becomes even more herculean for AI. For the longest time, therefore, speech-to-text technologies were woefully inadequate, replete with errors and missing links owing to the quirks that all human speech is inevitably laced with.

However, with the emergence of AI and predictive analysis, speech-to-text technologies have begun to close these gaps and become dramatically better at such tasks. The latest systems, such as the one developed by researchers at the Karlsruhe Institute of Technology (KIT), have performed impressively in transcribing spontaneous speech, achieving low error rates on several international benchmark tests. Not only do these new technologies transcribe speech with very few errors, they also do so almost instantaneously. This is due to the application of what is termed “predictive modeling”: a process wherein AI analyzes the words used most often to gauge their meanings in different contexts, and thereby determines the most likely word choice and sequence, minimizing the possibility of error.
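To illustrate the idea behind this kind of context-based prediction, here is a minimal sketch of a bigram language model that scores competing transcription hypotheses. This is a toy illustration of the general principle, not the method KIT or any specific product uses: the tiny corpus, the hypothesis strings, and add-one smoothing are all assumptions made for the example.

```python
from collections import defaultdict

# Toy corpus standing in for the large body of text a real system learns from.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word (unigram) and word pair (bigram) occurs.
bigram = defaultdict(int)
unigram = defaultdict(int)
for w1, w2 in zip(corpus, corpus[1:]):
    bigram[(w1, w2)] += 1
    unigram[w1] += 1

def score(words):
    """Product of conditional bigram probabilities, with add-one smoothing
    so unseen word pairs get a small but nonzero probability."""
    vocab = len(unigram)
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= (bigram[(w1, w2)] + 1) / (unigram[w1] + vocab)
    return p

# Two acoustically similar hypotheses the recognizer might produce; the
# language model prefers the word sequence it has seen used in context.
hyp_a = "the cat sat on the mat".split()
hyp_b = "the cat sat on the matte".split()
best = max([hyp_a, hyp_b], key=score)  # hyp_a wins: "mat" follows "the" in the corpus
```

Production systems use far richer models over vastly more context, but the principle is the same: among candidate transcriptions, pick the word sequence that is most probable given how words are actually used.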

These abilities drastically outperform human transcription, and as such have valuable uses across several industries, most notably banking, telecommunications, and media and marketing. Further, this promises to be a groundbreaking intervention for disabled people, particularly those with visual impairments, who will find it significantly easier to navigate their daily lives with this technology at their disposal.

The most pressing concern, however, is data privacy: the risk that individuals' audio samples could be leaked and accessed by anybody, anywhere, for any purpose. Hence, we need comprehensive policies that usher in this radically promising technology while eliminating the associated risks.

Sources:

  1. https://www.gnani.ai/resources/blogs/ai-speech-recognition-what-is-it-and-how-it-works/

  2. https://dsc.uci.edu/at/speech-recognition/
