A humanoid robot hands its operator a lush red apple after he asks for something to eat. This isn’t science fiction—it’s a scene from a demonstration by Figure AI, flexing one of the latest capabilities of artificial intelligence (AI): multimodal AI.
When the robot handed its operator the apple, its multimodal models were interpreting multiple inputs at once: the voice command, the sight of the apple, and the operator himself. Humanoid robots are the flashiest use case for this technology, but multimodal AI also powers autonomous vehicles, interactive virtual characters, AI assistants, and search tools like Google Lens.
In ecommerce, multimodal AI enables visual search, augmented reality try-ons, and advanced customer support. Here’s what you need to know to unlock its potential for your business.
What is multimodal AI?
Multimodal AI refers to machine learning models built to take in, interpret, and process multiple forms of data simultaneously. Rather than being limited to a single data type, these models can receive input and create output across different modalities, including text, images, audio, video, numerical data, and sensor data such as GPS coordinates or accelerometer readings.
For instance, given a text prompt and a source image, multimodal AI can generate a video, setting the image in motion or reinterpreting it (depending on the prompt).
Three defining characteristics of multimodal AI are:
- Heterogeneity. This term describes diverse data types. For example, an AI-generated video output is different from the text input that prompted it.
- Connections. Multimodal models create connections between different modalities. For example, Figure AI’s humanoid linked multiple pieces of sensor data: visual data (an apple), language (“apple”), and auditory cues (a verbal request) to form a coherent understanding.
- Interactions. This refers to how different modalities respond to one another when brought together. In a self-driving car, the visual input of a stop sign interacts with the presence of an approaching intersection in its GPS input, reinforcing the decision to stop (see the sketch after this list).
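To make the idea of interactions concrete, here is a deliberately simplified Python sketch of two modalities reinforcing a single decision. The function, thresholds, and confidence values are hypothetical and not drawn from any real driving system.

```python
# Illustrative only: two modality signals reinforcing one decision.
# The function name, thresholds, and confidence values are hypothetical.

def should_stop(stop_sign_confidence: float, map_reports_intersection: bool) -> bool:
    """Return True when the combined evidence favors stopping."""
    # A confident visual detection of a stop sign is enough on its own.
    if stop_sign_confidence > 0.9:
        return True
    # A weaker detection is reinforced by map data showing an intersection ahead.
    if stop_sign_confidence > 0.5 and map_reports_intersection:
        return True
    return False

print(should_stop(0.6, map_reports_intersection=True))   # True: the modalities reinforce each other
print(should_stop(0.6, map_reports_intersection=False))  # False: vision alone is inconclusive
```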
Multimodal vs. unimodal vs. generative AI
Multimodal AI systems are capable of integrating multiple types of data simultaneously and creating multiple types of output. When given text prompts, audio, video, and images, a multimodal model can interpret, analyze, and respond to multiple inputs at once (provided the model is designed to handle those modalities)—much like how the human brain synthesizes what it sees, hears, and feels in real time. This versatility makes multimodal AI more flexible than unimodal AI, and often more accurate on tasks that span formats.
Unimodal AI is limited to understanding a single type of data, usually textual data or visuals. While still powerful—natural language processing and unimodal neural networks used in chatbots are strong examples—it lacks the holistic comprehension that multimodal systems offer.
Generative AI learns patterns from a vast dataset and uses them to generate new content. It can be either unimodal (like MusicLM, which generates music from text) or multimodal (like GPT-4o). Multimodal generative AI models can generate visual content, translate between modalities, and respond creatively to real-world scenarios.
How does multimodal AI work?
Multimodal AI relies on three primary components to process multiple types of data simultaneously: the input module, fusion module, and output module.
Input module
The process starts with the input module, where raw data from various data types—textual data, customer service audio, product images, and videos—is ingested and processed by unimodal neural networks. These networks are often built on large-scale transformer architectures, which excel at spotting patterns and relationships within sequences, and each network is specialized for its own modality. Collectively, they allow the AI system to interpret multimodal data, making it possible to work across diverse inputs from multiple sources.
For example, a text model might predict missing words in a sentence, while a vision model can perform tasks like image inpainting to infer missing parts of an image.
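To ground this, here is a minimal PyTorch sketch of an input module with two toy unimodal encoders, one for text and one for images, each producing a fixed-size embedding. The architectures and dimensions are illustrative stand-ins for the much larger transformer encoders production systems use.

```python
# A toy input module: separate unimodal encoders that map raw inputs to embeddings.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=10_000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, token_ids):                # (batch, seq_len)
        hidden = self.encoder(self.embed(token_ids))
        return hidden.mean(dim=1)                # (batch, dim) text embedding

class ImageEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.project = nn.Linear(64, dim)

    def forward(self, images):                   # (batch, 3, H, W)
        return self.project(self.features(images))  # (batch, dim) image embedding

tokens = torch.randint(0, 10_000, (1, 12))       # a fake tokenized sentence
pixels = torch.rand(1, 3, 64, 64)                # a fake RGB image
print(TextEncoder()(tokens).shape, ImageEncoder()(pixels).shape)  # both (1, 256)
```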
Fusion module
The fusion module combines and aligns relevant data. The data is transformed into numerical representations—called embeddings—so that different modalities can communicate using the same language. This process involves translating words into tokens, images into visual embeddings, and audio into frequency-based features.
The fusion module aligns diverse data types either through early fusion, where embeddings from each modality are combined at the start, or late fusion, where they’re integrated after being processed independently. In early fusion, for instance, a model might learn what an apple looks like and how the spoken word sounds, all paired with the text “apple.” Early fusion helps models develop a richer, more holistic understanding of the concept at hand.
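Here is a minimal sketch contrasting the two approaches over precomputed embeddings; the random vectors and classifier heads are illustrative, not drawn from any particular model.

```python
# Early vs. late fusion over precomputed text and image embeddings (toy sizes).
import torch
import torch.nn as nn

dim = 256
text_vec = torch.rand(1, dim)    # text embedding from the input module
image_vec = torch.rand(1, dim)   # image embedding from the input module

# Early fusion: concatenate the modalities first, then process them jointly.
early_head = nn.Linear(2 * dim, 10)
early_logits = early_head(torch.cat([text_vec, image_vec], dim=-1))

# Late fusion: process each modality independently, then combine the results.
text_head, image_head = nn.Linear(dim, 10), nn.Linear(dim, 10)
late_logits = (text_head(text_vec) + image_head(image_vec)) / 2

print(early_logits.shape, late_logits.shape)   # both torch.Size([1, 10])
```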
Output module
After integrating the data, the output module uses the system’s underlying neural network—often a transformer decoder—to synthesize insights and produce responses. Responses vary widely but include generative content, predictions, or decisions.
To refine performance and reduce harmful outputs, models go through fine-tuning using methods like reinforcement learning from human feedback (RLHF) and red teaming (an adversarial testing exercise). This helps ensure the system performs well in real-world scenarios and responds with more accuracy, safety, and contextual awareness across all supported data modalities.
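Below is a toy sketch of the decoding step only: a small transformer decoder attends to the fused multimodal representation while predicting the next output token. All sizes and tensors are illustrative placeholders, and fine-tuning is not shown.

```python
# A toy output module: a transformer decoder conditioned on fused multimodal embeddings.
import torch
import torch.nn as nn

dim, vocab_size = 256, 10_000
token_embed = nn.Embedding(vocab_size, dim)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
to_vocab = nn.Linear(dim, vocab_size)

fused = torch.rand(1, 2, dim)                      # fused text + image embeddings ("memory")
generated = torch.randint(0, vocab_size, (1, 5))   # output tokens produced so far

hidden = decoder(tgt=token_embed(generated), memory=fused)
next_token_logits = to_vocab(hidden[:, -1])        # scores for the next output token
print(next_token_logits.shape)                     # torch.Size([1, 10000])
```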
Examples of multimodal AI
If you’re online in any capacity, you’re likely interacting with large multimodal models more often than you realize. Examples include:
Gemini
Google’s multimodal generative AI platform, Gemini, combines vision, audio, and text to complete complex tasks. It can, for example, describe a sales funnel diagram and an accompanying video aloud, interpreting multiple visual inputs at once.
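As a rough illustration, here is a minimal sketch of sending a text prompt plus an image to a Gemini model, assuming the google-generativeai Python SDK; the API key, model name, and file path are placeholders, not details from Google’s own demos.

```python
# A minimal sketch of a combined text + image request to a Gemini model.
# Assumes the google-generativeai SDK; key, model name, and file are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

diagram = Image.open("sales_funnel.png")          # the diagram to describe
response = model.generate_content(
    ["Describe this sales funnel diagram in plain language.", diagram]
)
print(response.text)                              # the model's text description
```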
GPT-4o
OpenAI’s large multimodal model can perform tasks based on textual and visual prompts. For example, GPT-4o can generate an image of an action figure in your likeness by ingesting a text prompt describing your accessories and qualities (text modality) along with an attached photo (visual modality).
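Here is a minimal sketch of sending a combined text-and-image prompt to GPT-4o, assuming the official openai Python SDK and an API key in the environment; the photo URL and prompt are placeholders, and this sketch returns a text response rather than a generated image.

```python
# Sending a text prompt plus an image to GPT-4o via the openai SDK (placeholder URL).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe an action figure styled after the person in this photo."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```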
Amazon StyleSnap
StyleSnap is an AI tool that uses computer vision technologies—which can analyze and interpret user-uploaded images—and NLP to suggest fashion items. The intuitive ecommerce tool makes connections between uploaded photos and Amazon’s extensive inventory of clothing items.
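Under the hood, visual search tools like this typically compare an embedding of the uploaded photo against precomputed embeddings of catalog items. The sketch below uses random vectors and cosine similarity purely for illustration; it is not Amazon’s implementation.

```python
# Illustrative visual search: rank catalog items by cosine similarity to a query embedding.
import numpy as np

rng = np.random.default_rng(0)
product_embeddings = rng.normal(size=(1_000, 256))   # one vector per catalog item (stand-ins)
query_embedding = rng.normal(size=256)               # embedding of the uploaded photo (stand-in)

def cosine_similarity(matrix, vector):
    matrix_norm = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    vector_norm = vector / np.linalg.norm(vector)
    return matrix_norm @ vector_norm

scores = cosine_similarity(product_embeddings, query_embedding)
top_matches = np.argsort(scores)[::-1][:5]           # indices of the five closest items
print(top_matches)
```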
PathAI
PathAI supports diagnostics using multimodal AI models to interpret medical images, electronic medical records, and clinical data. For example, its PathAssist Derm tool helps doctors quickly identify skin malignancies, addressing a pressing dermatologist shortage and improving patient outcomes.
Waymo
Waymo’s self-driving cars integrate multiple sensors like cameras and radar for real-time driving decisions. Its end-to-end multimodal model reacts to sensor data like the visual cues of lane markers, radar-detected distances, and contextual map data to safely navigate dynamic environments.
Benefits and challenges of multimodal AI
One of the most compelling advantages of multimodal AI is its ability to generate rich, context-aware content across formats. These systems can craft videos from text prompts, narrate images, and deliver insights that blend language with visuals or sound.
Multimodal AI models interpret multiple data types simultaneously, enabling more human-like interactions and more accurate outcomes. In fields like health care, education, and autonomous systems, combining data from diverse sources enhances decision making and problem solving.
However, building robust multimodal systems poses significant challenges. Aligning data from disparate modalities—like matching a voice clip with a facial expression—can be technically complex. Ensuring the model fully grasps the semantics of each data type and how they connect is equally difficult. Reasoning across diverse data sources, especially when interpreting things like intent or emotion, remains an evolving capability.
And then there’s the issue of data: Training these models requires high-quality, representative, and ethically sourced multimodal data—something not always readily available. Missing data, biased samples, or poor data quality can all weaken performance and trust in the system.
What is multimodal AI FAQ
What does multimodal in AI mean?
Multimodal AI refers to systems that simultaneously process and understand information from multiple data types, such as text, images, audio, and video.
Is ChatGPT a multimodal model?
Yes. ChatGPT is powered by multimodal models like GPT-4o, which can interpret both text and images.
What is multimodal conversational AI?
Multimodal conversational AI combines text, voice, visuals, or other inputs to enable richer, more human-like interactions with users.
What is a multimodal example?
An example of multimodal AI is a virtual assistant that analyzes spoken language and facial expressions to understand and respond to a user.