OpenAI took the wraps off its latest AI model, GPT-4o, which is designed to “reason across audio, vision, and text in real time.”
OpenAI held a livestreamed event Monday afternoon to unveil its latest AI model. Some had speculated the company would announce its rumored search engine or GPT-5. While neither materialized, OpenAI’s latest innovation was no less impressive.
GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is notably better at vision and audio understanding than existing models.
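For developers, that multimodal capability is exposed through the API. As a rough sketch (assuming the `openai` Python SDK, the `gpt-4o` model name, and an `OPENAI_API_KEY` in the environment; the image URL is a placeholder), a mixed text-and-image request might look like this:

```python
# Minimal sketch: sending text plus an image to GPT-4o via the
# OpenAI Python SDK. The image URL below is a hypothetical placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What equation is written on this page?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/handwritten-math.jpg"},
                },
            ],
        }
    ],
)

# The model replies with text describing what it sees in the image.
print(response.choices[0].message.content)
```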
The three-person panel showed off ChatGPT’s new GPT-4o-powered features, including the app’s ability to use the camera to recognize objects, decipher math equations written on paper, and evaluate a person’s mood. ChatGPT showed an impressive understanding of context and was able to pick up on different emotional states.
The panelists asked the AI to tell a story, then kept adding constraints, such as asking it to narrate more dramatically or in a robot voice.
When looking at the math equation, ChatGPT was instructed not to divulge the answer, but to coach one of the panelists as they worked through the problem and to provide hints and feedback. The AI performed admirably, asking leading questions, offering hints, and providing positive reinforcement.
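That coaching behavior came from the instructions given to the model rather than any special mode. Here is a rough sketch of how a similar constraint could be expressed as a system prompt through the API (the prompt wording is illustrative, not the panel’s actual instructions):

```python
# Illustrative tutoring prompt modeled on the demo; the wording is an
# assumption, not OpenAI's actual instructions from the event.
from openai import OpenAI

client = OpenAI()

tutor_prompt = (
    "You are a patient math tutor. Never reveal the final answer. "
    "Guide the student with leading questions, small hints, and "
    "encouragement as they work through the problem themselves."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": tutor_prompt},
        {"role": "user", "content": "Help me solve 3x + 1 = 4."},
    ],
)

print(response.choices[0].message.content)
```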
GPT-4o is an impressive step forward, with the panelists demonstrating some of the novel ways ChatGPT can be used in practical applications.