Understanding GPT-4o: The New Game-Changer, Explained

GPT-4o is the third major iteration of OpenAI’s large multimodal model, expanding on GPT-4 with Vision. The new model can talk, see, and interact with users through the ChatGPT interface more seamlessly than previous versions.

OpenAI highlighted GPT-4o’s ability for “much more natural human-computer interaction” in their announcement. This article will cover what GPT-4o is, how it differs from previous models, its performance, and its use cases.

I. What is GPT-4o?

GPT-4o is OpenAI’s latest flagship model. The ‘o’ in GPT-4o stands for “omni,” Latin for “all” or “every.” A single model handles prompts that mix text, audio, images, and video, whereas ChatGPT previously relied on separate models for different content types.

For instance, in Voice Mode, ChatGPT converted speech to text with Whisper, generated a text response with GPT-4 Turbo, and then converted that response back to speech with a TTS model.
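For illustration, here is a minimal sketch of that three-step pipeline using OpenAI’s Python SDK. The file names are invented for the example, and this is an approximation of the idea rather than ChatGPT’s actual internals:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Step 1: transcribe the user's speech with Whisper
with open("user_question.mp3", "rb") as audio_file:  # hypothetical input file
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: generate a text reply with GPT-4 Turbo
reply = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": transcript.text}],
)

# Step 3: convert the reply back to speech with a TTS model
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
speech.write_to_file("assistant_reply.mp3")
```

Every hand-off in this chain adds latency and discards information such as tone and background sound, which is exactly what GPT-4o’s single-model design removes.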

Similarly, working with images in ChatGPT involved two models: GPT-4 Turbo to interpret images and DALL-E 3 to generate them.


Using a single model for all content types promises faster and higher-quality results, a simpler interface, and new use cases.

In a GPT-4o vs. GPT-4 comparison, GPT-4o is twice as fast. It is also 50% cheaper, at $5 per million input tokens and $15 per million output tokens, and has five times the rate limit, handling up to 10 million tokens per minute. GPT-4o offers a 128K context window with a knowledge cut-off of October 2023.
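To make the pricing concrete, here is a rough back-of-the-envelope calculation using the per-token prices quoted above; the monthly token volumes are invented example numbers:

```python
# GPT-4o launch pricing quoted above (USD per 1M tokens)
GPT4O_INPUT, GPT4O_OUTPUT = 5.00, 15.00
# GPT-4 Turbo was roughly twice as expensive per token
GPT4T_INPUT, GPT4T_OUTPUT = 10.00, 30.00

# Hypothetical monthly usage for an application
input_tokens, output_tokens = 20_000_000, 4_000_000

def monthly_cost(in_price: float, out_price: float) -> float:
    """Total cost for the assumed monthly input and output token volumes."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

print(f"GPT-4o:      ${monthly_cost(GPT4O_INPUT, GPT4O_OUTPUT):,.2f}")   # $160.00
print(f"GPT-4 Turbo: ${monthly_cost(GPT4T_INPUT, GPT4T_OUTPUT):,.2f}")   # $320.00
```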


II. What does GPT-4o do better than GPT-4?

1. Tone of voice is now considered, facilitating emotional responses

Previously, OpenAI’s system chained Whisper, GPT-4 Turbo, and TTS in a pipeline, so GPT-4 only ever received the transcribed words. This approach ignored tone of voice, background noise, and the presence of multiple speakers. As a result, GPT-4 Turbo couldn’t express responses with different emotions or speaking styles.

With a single model that processes text and audio, GPT-4o can use rich audio information to provide higher-quality responses and a greater variety of speaking styles.

2. Lower latency enables real-time conversations

The previous three-model pipeline caused a delay (“latency”) between speaking to ChatGPT and getting a response. OpenAI reported that Voice Mode’s average latency was 2.8 seconds for GPT-3.5 and 5.4 seconds for GPT-4. In contrast, GPT-4o’s average latency is 0.32 seconds, making it nine times faster than GPT-3.5 and 17 times faster than GPT-4.

This reduced latency, close to the average human response time of 0.21 seconds, is crucial for conversational use cases with frequent back-and-forth exchanges. One viable use case with GPT-4o’s decreased latency is real-time speech translation. OpenAI showed a scenario where an English speaker and a Spanish speaker communicated by having GPT-4o translate their conversation.


3. Better tokenization for non-Roman alphabets provides greater speed and value for money

In the LLM workflow, the prompt text is converted into tokens, units of text the model can understand. In English, a token is usually one word or a piece of punctuation, though some words can be split into multiple tokens. On average, three English words use about four tokens.

Using fewer tokens means fewer calculations, which speeds up text generation. Also, since OpenAI charges for its API per token input or output, fewer tokens result in lower costs for API users.

GPT-4o has an improved tokenizer that needs fewer tokens to represent the same text, especially for languages that don’t use the Roman alphabet.

For example, Indian languages have shown significant token reductions: Hindi, Marathi, Tamil, Telugu, and Gujarati use 2.9 to 4.4 times fewer tokens. Arabic has a 2x token reduction, and East Asian languages like Chinese, Japanese, Vietnamese, and Korean have token reductions between 1.4x and 1.7x. 
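You can check this yourself with OpenAI’s tiktoken library: GPT-4o uses a newer encoding than GPT-4 Turbo, so the same non-English text usually maps to fewer tokens. The sample sentences below are arbitrary illustrations, and the exact counts will differ from the averages quoted above:

```python
import tiktoken  # pip install tiktoken

old_enc = tiktoken.encoding_for_model("gpt-4-turbo")  # cl100k_base
new_enc = tiktoken.encoding_for_model("gpt-4o")       # o200k_base

samples = {
    "English": "How is the weather today?",
    "Hindi": "आज मौसम कैसा है?",
    "Japanese": "今日の天気はどうですか？",
}

for language, text in samples.items():
    old_count = len(old_enc.encode(text))
    new_count = len(new_enc.encode(text))
    print(f"{language}: {old_count} -> {new_count} tokens "
          f"({old_count / new_count:.1f}x fewer)")
```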

4. Video Capabilities of GPT-4o

Important note from the API release notes regarding video use: “GPT-4o in the API supports understanding video (without audio) via vision capabilities. Specifically, videos must be converted to frames (2-4 frames per second, either sampled uniformly or via a keyframe selection algorithm) to input into the model.” Use the OpenAI cookbook for vision to understand how to use video as an input and the limitations of the release.
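Following that cookbook’s approach, a rough sketch looks like this: sample frames with OpenCV, base64-encode them as JPEGs, and pass them as image inputs to a chat completion. The frame-sampling rate and file name here are assumptions for the example:

```python
import base64
import cv2  # pip install opencv-python
from openai import OpenAI

def sample_frames(path: str, every_n: int = 15) -> list[str]:
    """Grab every Nth frame (~2 fps for 30 fps video) as base64-encoded JPEGs."""
    video = cv2.VideoCapture(path)
    frames, index = [], 0
    while True:
        ok, frame = video.read()
        if not ok:
            break
        if index % every_n == 0:
            _, buffer = cv2.imencode(".jpg", frame)
            frames.append(base64.b64encode(buffer).decode("utf-8"))
        index += 1
    video.release()
    return frames

client = OpenAI()
frames = sample_frames("demo_clip.mp4")  # hypothetical video file

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what happens in this video."},
            *[{"type": "image_url",
               "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
              for f in frames],
        ],
    }],
)
print(response.choices[0].message.content)
```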

GPT-4o can view and reason about the content of an uploaded video by processing sampled frames. In the API, this covers the visual track only (the video’s audio is not used), and GPT-4o does not generate video output.

In the initial demo, GPT-4o was asked multiple times to comment on or respond to visual elements. Similar to our observations of Gemini, the demo didn’t clarify whether the model was receiving continuous video or triggering an image capture whenever it needed real-time information. At one point, GPT-4o appeared not to trigger a fresh capture and instead responded based on a previously captured image.

5. Rollout to the free plan

Previously, users had to pay to access GPT-4 Turbo, available only on the Plus and Enterprise plans. OpenAI is now changing this by making GPT-4o available on the free plan. Plus users will still get five times as many messages as free plan users. The rollout will be gradual. Red team testers get immediate access, with more users gaining access over time.

6. Launch of the ChatGPT desktop app

OpenAI also announced the release of the ChatGPT desktop app. This update, along with the improvements in latency and multimodality, means changes in how we use ChatGPT. For example, OpenAI demonstrated an augmented coding workflow using voice and the ChatGPT desktop app. Scroll down in the use-cases section to see the demo in action.

III. What Are GPT-4o Use-Cases?

Real-time Computer Vision Use Cases

The new speed improvements and visual and audio capabilities enable real-time use cases for GPT-4o, especially in computer vision. Using a real-time view and speaking to a GPT-4o model allows for quick intelligence gathering and decision-making. This is useful for navigation, translation, guided instructions, and understanding complex visual data.

Interacting with GPT-4o at human-like speeds means less time typing and more time engaging with your environment while the AI assists you.


One-device Multimodal Use Cases

Using GPT-4o from one interface across desktop, mobile, and potentially wearables like the Apple Vision Pro allows you to troubleshoot many tasks in one place. Instead of typing text prompts, you can show your desktop screen. Instead of copying and pasting content into the ChatGPT window, you pass visual information along while asking questions. This reduces the need to switch screens and models, creating an integrated experience.

GPT-4o’s single multimodal model reduces friction, increases speed, and streamlines device input connections, making interaction easier.

General Enterprise Applications

With more modalities in one model and improved performance, GPT-4o fits certain enterprise application pipelines that don’t need fine-tuning on custom data. Though more expensive than open-source models, its faster performance makes GPT-4o useful for building custom vision applications.

You can use GPT-4o where open-source or fine-tuned models aren’t available, and then switch to your custom models for other steps to augment GPT-4o’s knowledge or reduce costs. This allows quick prototyping of complex workflows without being blocked by model capabilities.
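As a simple illustration of that pattern, the routing logic can be a thin wrapper that prefers an in-house model when one exists for a task and falls back to GPT-4o otherwise. The task names and fine-tuned model ID below are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical registry of tasks already covered by in-house, fine-tuned models
CUSTOM_MODELS = {"invoice_extraction": "ft:your-custom-invoice-model"}  # placeholder ID

def run_task(task: str, prompt: str) -> str:
    """Route to a custom model when available, otherwise fall back to GPT-4o."""
    model = CUSTOM_MODELS.get(task, "gpt-4o")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# New workflow step with no custom model yet: prototype it on GPT-4o first
print(run_task("contract_summarization", "Summarize the key obligations in ..."))
```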

IV. What Does GPT-4o Mean for the Future?

There are two views on AI’s future. One is that AI should become more powerful and handle a wider range of tasks. The other is that AI should get better at specific tasks as cheaply as possible.

OpenAI aims to create artificial general intelligence (AGI) and follows the first view. GPT-4o is another step towards this goal of more powerful AI.


This is the first generation of a new model architecture for OpenAI. The company has a lot to learn and optimize in the coming months.

In the short term, expect new quirks and hallucinations. In the long term, expect performance improvements in speed and output quality.

The timing of GPT-4o is interesting. As tech giants reassess voice assistants like Siri, Alexa, and Google Assistant, OpenAI aims to make AI talkative again. This could lead to new use cases for generative AI. At the very least, you can now set a timer in any language.

V. GPT-4o Limitations & Risks

Regulation for generative AI is still in its early stages. The EU AI Act is the only notable legal framework so far. This means AI companies need to decide what constitutes safe AI.

OpenAI uses a preparedness framework to determine if a new model is ready for public release. The framework tests four areas of concern:

  1. Cybersecurity: Can AI increase cybercriminal productivity and help create exploits?
  2. CBRN: Can AI help create chemical, biological, radiological, or nuclear threats?
  3. Persuasion: Can AI create content that persuades people to change their beliefs?
  4. Model autonomy: Can AI perform actions with other software as an agent?

Each area is graded Low, Medium, High, or Critical. The model’s score is the highest grade across these categories.

Under this framework, only models whose post-mitigation score is Medium or below can be deployed; a Critical rating corresponds to risks that could upend human civilization. GPT-4o was rated Medium overall, so it clears that bar.

Imperfect Output

As with all generative AIs, GPT-4o doesn’t always behave as intended. Computer vision is not perfect, so interpretations of images or videos are not guaranteed to be accurate.

Similarly, speech transcriptions are rarely 100% correct, especially if the speaker has a strong accent or uses technical words.

OpenAI provided a video showing some outtakes where GPT-4o did not work as intended.

Notably, translation between two non-English languages was one of the failures. Other issues included an unsuitable tone of voice (being condescending) and speaking the wrong language.

Accelerated Risk of Audio Deepfakes

OpenAI acknowledges that “GPT-4o’s audio modalities present various novel risks.” GPT-4o can accelerate the rise of deepfake scam calls, where AI impersonates celebrities, politicians, and people’s friends and family. This problem will likely worsen before it improves, and GPT-4o can make these scam calls more convincing.

To mitigate this risk, audio output is limited to preset voices.

Technically-minded scammers could use GPT-4o to generate text output and then apply their own text-to-speech model. However, it is unclear whether this would retain GPT-4o’s latency and tone-of-voice benefits.

Conclusion

GPT-4o represents a significant leap forward in AI technology, integrating multiple modalities into a single, efficient model. This advancement promises faster, higher-quality results and opens new possibilities for real-time interactions, computer vision, and more.

For businesses looking to harness the full potential of GPT-4o and other cutting-edge AI technologies, partnering with experienced AI service providers is essential.

Don’t just stay ahead of the curve—define it. Partner with TECHVIFY to leverage GPT-4o’s unparalleled capabilities and transform your business. Our team of experts will guide you through every step, ensuring you maximize the benefits while mitigating risks.

Contact TECHVIFY today and elevate your AI strategy to the next level!

