What Lies Beyond the Text? Integrating Text with Video and Sound

Artificial intelligence has advanced at a dizzying pace in recent years, prompting us to wonder what possibilities the technology might open up across different fields and how radically it could change our world.

Multimodal AI refers to AI that can work with information in several forms of media at once. By combining these forms, it can build a more comprehensive understanding of the context of an input and generate higher-quality outputs. Multimodal AI thus transcends the limitations of text: it can work with images, sound, video, and even ambient information about one’s environment, such as temperature, humidity, and air pressure.

Some of the technologies that multimodal AI draws on include computer vision (such as object recognition, semantic segmentation, age recognition, and facial recognition) and speech processing (such as acoustic model creation, noise cancellation, and trigger word detection), among others.

The real world rarely functions through a single form of information. Real-world situations are complex precisely because they involve different senses and contexts. This makes it necessary to create technologies in which the computer can parse all of these senses and contexts to generate meaningful responses. The alternative is to require coders to reduce this complex multimodal information to text, which would inevitably lose important details and nuance.

One can therefore think of multimodal AI as a dynamic mix of everything that the human senses can engage with. Here are some examples of what it can do. If you give it a picture, it can generate a detailed description of that picture and interpret the event it depicts. Conversely, you can describe a scene and it can generate a picture based on your description. In a car, it can process information about the surroundings, including traffic, noise, the distance to the nearest car, and weather conditions, to improve the driving experience.

Multimodal AI is therefore a game-changer that has greatly expanded the possibilities of what AI can do. The future looks all the more hopeful!
