Not sure which tool in your toolbox is the Allen wrench? ChatGPT can identify it from a picture of your tool set; in fact, it can tell you which tool is right for the job and how to use it. OpenAI’s surprise upgrade showcases ChatGPT’s latest abilities, which go beyond text-based conversations. Not only can you ask it for original bedtime stories, but you can also have them read aloud in its own AI voice.
From image analysis to voice responses, ChatGPT’s powerful new features represent the future of multimodal AI models.
What does it mean for AI to go multimodal?
Multimodal AI represents simplicity at its best. While existing AI models for images, video, and voice are remarkable on their own, finding the appropriate model for each task is time-consuming, and transferring data between models is cumbersome. By addressing these issues, multimodal AI marks the next generation of these large models.
Users can now interact with these AI agents via various media formats and effortlessly transition between image, text, and voice prompts within the same conversation.
A noteworthy example of multimodal AI is the latest ChatGPT upgrade. Instead of relying on a single AI model tailored for a specific type of input, multiple models collaborate to form a more integrated and cohesive AI system.
What can multimodal ChatGPT do?
ChatGPT comes with three specific multimodal features:
- Users can upload images to the chatbot as prompts.
- Users can use voice prompts.
- Users can choose to receive responses in one of the five available AI-generated voices.
While the image prompt feature is available on all platforms, voice is limited to the Android and iOS apps.
Voice integration enhances ChatGPT’s conversational capabilities, and image support empowers it with functionality akin to Google Lens. One can effortlessly snap a picture and add it to the chat with a follow-up query, and ChatGPT will analyze the image in the context of the accompanying text to generate a response. It can even continue the conversation with the user around that topic.
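At the API level, an image-plus-text prompt of this kind is typically expressed as a single chat message that combines a text part with an inline image. The sketch below follows the "content parts" message shape used by vision-capable GPT-4 models in OpenAI's Chat Completions API; the helper function and sample question are illustrative, not part of any official SDK:

```python
import base64


def build_image_prompt(question: str, image_bytes: bytes, mime: str = "image/jpeg") -> list:
    """Build a chat message pairing a text question with an inline image.

    Uses the Chat Completions "content parts" format for vision-capable
    models; the exact schema may evolve, so treat this as a sketch.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    # Images can be passed inline as a base64 data URL.
                    "image_url": {"url": f"data:{mime};base64,{encoded}"},
                },
            ],
        }
    ]


# Example: ask about a snapshot of a tool set (fake bytes stand in for a photo).
messages = build_image_prompt("Which of these is the Allen wrench?", b"\xff\xd8fake-jpeg")
```

A payload like this would then be sent to a vision-capable model, which analyzes the image in the context of the question, just as the app does behind the scenes.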
OpenAI has highlighted that the voice conversation capabilities will enable users to engage in discussions about a wide range of topics simply by speaking aloud. Users can choose from five voice options and express their prompts verbally, and ChatGPT will use the chosen voice to deliver its response.
OpenAI delivers these features by employing speech-to-text and text-to-speech models that operate in near real-time. The process involves converting spoken input into text, feeding that text into OpenAI’s core language model, GPT-4, to generate a response, and then synthesizing that response as speech in the user-selected voice. The company has collaborated with several voice artists to craft voices that closely resemble human speech for this synthesis.
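The speech-to-text, language-model, text-to-speech loop described above can be sketched as three chained stages. The stage interfaces and the stubbed models below are assumptions for illustration (in practice, each stage would be a real transcription, chat, and synthesis model, such as OpenAI's Whisper and TTS endpoints):

```python
from typing import Callable


def voice_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],        # speech-to-text stage
    respond: Callable[[str], str],             # core language model (e.g., GPT-4)
    synthesize: Callable[[str, str], bytes],   # text-to-speech stage
    voice: str = "juniper",                    # illustrative voice name
) -> bytes:
    """One conversational turn: spoken input in, spoken reply out."""
    text_prompt = transcribe(audio_in)     # 1. convert speech to text
    text_reply = respond(text_prompt)      # 2. generate a text response
    return synthesize(text_reply, voice)   # 3. render the reply in the chosen voice


# Stubbed example showing only the data flow; real models would replace these lambdas.
reply_audio = voice_turn(
    b"fake-audio",
    transcribe=lambda audio: "tell me a bedtime story",
    respond=lambda text: f"Once upon a time... (answering: {text})",
    synthesize=lambda text, voice: f"[{voice}] {text}".encode(),
)
```

Keeping the three stages separate is what lets a text-only model like GPT-4 power a voice conversation: only the input and output layers change.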
Another perk of ChatGPT is its simple and user-friendly interface. An image prompt is as simple as opening the app and clicking an icon to take a photo. These new, powerful upgrades will allow users to have fluid back-and-forth conversations with ChatGPT and ask it to analyze and react to any image they upload.
ChatGPT’s latest capabilities aren’t entirely new; they are powered by OpenAI’s proprietary vision, speech recognition, and speech synthesis models. In March 2023, GPT-4 introduced the ability to understand image prompts, which some OpenAI partners, including Microsoft’s Bing Chat, implemented. However, accessing these features typically required API access, making them primarily available to partners and developers.
Now, these capabilities are rolling out to a broader audience through the ChatGPT Plus and Enterprise offerings over the next two weeks. The company plans to extend access to developers and other groups of users in the near future.
Real-world applications of multimodal ChatGPT
The immediacy and conversational flow of the upgraded ChatGPT open up far more real-world applications than its text-only predecessor offered. ChatGPT can suggest recipes simply by looking at what’s in your fridge, identify travel locations from photos, recommend appropriate travel tips or checklists, help optimize a process from a diagram of its workflow, or assist students with complex homework by reading problems aloud — the options are limitless.
The future of multimodal AI
ChatGPT’s integration of image and voice support serves as a glimpse of what the future holds. Introducing image and voice input is a logical starting point for ChatGPT’s multimodal capabilities, given that it’s a user-facing app, and these are two of the most frequently used data formats. However, there is no inherent limitation preventing an AI model from being trained to address other data formats, such as Excel spreadsheets, 3D models, or photographs with depth data.
Fan, a researcher at Nvidia, notes, “While there aren’t any good models for it right now, in principle, you can give it 3D data, or even something like digital smell data, and it can output images, videos, and even actions. I do research on game AI and robotics, and multimodal models are critical for these efforts.”
Nevertheless, building multimodal AI systems comes with substantial challenges, with one of the most significant being the extensive amount of data required to train a roster of AI models.
Ethical considerations in multimodal ChatGPT
The new capabilities significantly enhance the functionality of ChatGPT. What’s noteworthy is OpenAI’s decision to introduce these features now rather than waiting for the release of more powerful models like GPT-4.5 or GPT-5.
To mitigate the potential misuse of its voice synthesis capabilities, which could be exploited for fraudulent purposes, OpenAI has limited their use to voice chat and specific approved partnerships. One notable partnership is with Spotify, where the music platform helps podcasters translate their content into other languages while preserving their own voices.
To address privacy and accuracy concerns related to image recognition, OpenAI has imposed restrictions on the bot’s ability to analyze and make direct statements about individuals in input images.
The company has, additionally, carried out various tests to uncover potential risks in sensitive domains and has put in place technical measures around image analysis. OpenAI’s commitment to transparency regarding the model’s limitations is intended to prevent misuse, given the risks associated with image analysis and synthetic media.
This cautious approach allows OpenAI to refine its protective measures while gradually expanding access to a broader user base. It reflects the company’s commitment to developing AI that is both highly capable and beneficial.
OpenAI’s unveiling of voice and image functionalities brings ChatGPT closer to the forefront of AI innovation, all the while maintaining a strong focus on ethical considerations.
Learn about the interesting evolution of GPT-1 to GPT-4.