At its spring update event, OpenAI introduced GPT-4o, a new artificial intelligence model that can reason in real time across audio, image and text. The capabilities this new model offers, and how they will show up in everyday life, are quite impressive. Here are the main features GPT-4o offers:
Memory Ability: GPT-4o can learn from users by remembering previous conversations with them. This personalizes the user experience and allows the model to become smarter and more responsive with every interaction.
Real-Time Translation: The model offers translation support across 50 different languages, with the ability to translate instantly. This allows users to understand content in other languages quickly and accurately (a rough API sketch of such a request follows this feature list).
Solving Math Problems by Voice and Private Tutoring: GPT-4o can walk users through math problems conversationally, explaining the solution steps fluently. This is a great advantage both for students and for professionals who work with mathematics.
Speaking Ability: With its voice communication capability, GPT-4o creates the feeling that you are talking to a real person. Its ability to distinguish differences in vocal intonation provides a more natural and fluent speaking experience.
Multimedia Analysis: By analyzing images and text together, it can relate these data to one another. This allows visual and textual content to be evaluated jointly, producing a more comprehensive analysis.
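The translation capability above becomes tangible when the model is called programmatically. As a rough sketch only (assuming the "gpt-4o" model identifier in the official openai Python client, with a sample sentence invented for illustration rather than taken from OpenAI's announcement), an instant-translation request could look like this:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask GPT-4o to translate a short passage into English.
# The model name "gpt-4o" and the sample sentence are assumptions
# made for illustration; they are not part of OpenAI's announcement.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Translate the user's message into English."},
        {"role": "user", "content": "Merhaba, bugün hava çok güzel."},
    ],
)

print(response.choices[0].message.content)
```

The same request shape extends to any of the 50 supported languages; only the prompt changes.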
These features reflect GPT-4o's wide range of capabilities for interacting with users and performing various tasks. The model will be able to improve itself by continuing to learn from its experiences. GPT-4o will be free for all ChatGPT users, but the company has not yet made a clear statement about when free access will begin. The only announcement from CEO Sam Altman was that “the new voice mode will be released for Plus users in the coming weeks.” You can find the details on this topic in the “Model availability” section at the end of this article.
So what innovations are GPT-4o's capabilities based on? Let's take a look together at the technical explanations OpenAI has shared…
GPT-4o matches GPT-4's level of intelligence but is much faster:
GPT-4o accepts any combination of text, audio and image as input and can produce any combination of text, audio and image as output. It can respond to audio inputs in as little as 232 milliseconds, which is close to human response time in a conversation. As a result, it creates an experience very similar to a dialogue between two people. Before GPT-4o, you could talk to ChatGPT using Voice Mode with average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4). GPT-4 could not directly observe tone of voice, multiple speakers, or background sounds, and it could not produce laughter, singing or expressions of emotion. The reason was that it relied on a pipeline of three separate models: one model converted the audio to text, another took that text in and produced a text response, and a third simple model converted the text back to audio. This process meant that GPT-4, the main source of intelligence, lost a lot of information. With GPT-4o, a single model is trained end to end across text, image and audio, which means that all inputs and outputs are processed by the same neural network. “Since GPT-4o is our first model to combine all of these modalities, we are still only scratching the surface of exploring what the model can do and what its limits are,” says OpenAI.
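To make the old three-model pipeline concrete, here is a rough sketch of what such a cascade looks like when assembled from separate speech-to-text, text, and text-to-speech models. The publicly available whisper-1, gpt-4 and tts-1 endpoints of the openai Python client are used purely as stand-ins; OpenAI has not said which internal models the previous Voice Mode actually chained together. The point is that the text model in the middle only ever sees a transcript, so tone of voice, multiple speakers and background sounds never reach it.

```python
from openai import OpenAI

client = OpenAI()

# Step 1: a speech-to-text model reduces the audio to a plain transcript.
# Tone of voice, multiple speakers and background sounds are discarded here.
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )

# Step 2: the text model (the "main source of intelligence") sees only text.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = reply.choices[0].message.content

# Step 3: a simple text-to-speech model converts the text back to audio.
# It cannot add laughter or emotion that was never present in the transcript.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
with open("reply.mp3", "wb") as f:
    f.write(speech.read())
```

GPT-4o removes the hand-offs between these three stages by processing audio, text and images inside one neural network.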
The new model matches GPT-4 Turbo's performance on English text and code:
GPT-4o also shows a significant improvement in languages other than English. Compared with existing models, GPT-4o is notably better at understanding images and audio. It is also much faster in the API and 50% cheaper.
Model evaluations:
Measured on traditional benchmarks, GPT-4o achieves GPT-4 Turbo-level performance in text, reasoning and coding intelligence, while raising the bar in multilingual, audio and vision capabilities, taking artificial intelligence technology to a new level.
Model safety and limitations:
OpenAI offered a long and detailed explanation on safety: “GPT-4o has safety built in by design across modalities, through techniques such as filtering training data and refining the model's behavior through post-training. We have also created new safety systems to provide guardrails on audio outputs. We evaluated GPT-4o according to our Preparedness Framework and in line with our voluntary commitments. Our evaluations of cybersecurity, CBRN, persuasion and model autonomy show that GPT-4o does not score above Medium risk in any of these categories. This assessment involved running a suite of automated and human evaluations throughout the model training process. We tested both pre-safety-mitigation and post-safety-mitigation versions of the model, using custom fine-tuning and prompts, to better elicit the model's capabilities. GPT-4o has also undergone extensive external red teaming with more than 70 external experts in areas such as social psychology, bias and fairness, and misinformation, to identify risks that are introduced or amplified by the newly added modalities. We used these learnings to build out our safety interventions and improve the safety of interacting with GPT-4o. We will continue to mitigate new risks as they are discovered.”
The description continues as follows:
“We recognize that GPT-4o's audio modalities present a variety of new risks. Today, we are publicly releasing text and image inputs and text outputs. Over the coming weeks and months, we will work on the technical infrastructure, post-training usability, and the safety necessary to release the other modalities. For example, at launch, audio outputs will be limited to a selection of preset voices and will comply with our existing safety policies. We will share further details in the upcoming system card, which covers all of GPT-4o's modalities. Through testing and iteration of the model, we have observed several limitations that exist across all of the model's modalities, a few of which are shown in the video below. We would welcome feedback that helps identify tasks where GPT-4 Turbo still outperforms GPT-4o, so that we can continue to improve the model.”
Model availability:
“GPT-4o is the latest step we have taken to push the boundaries of deep learning, this time in the direction of practical usability. We have spent a great deal of effort over the last two years working on efficiency improvements at every layer of the stack. As a first fruit of this research, we are able to make a GPT-4-level model much more widely available. GPT-4o's capabilities will be rolled out iteratively (with expanded red team access starting today).”
“GPT-4o's text and image capabilities are starting to roll out in ChatGPT today. We are making GPT-4o available in the free tier, and to Plus users with message limits up to 5 times higher. In the coming weeks, we will roll out a new version of Voice Mode with GPT-4o in alpha within ChatGPT Plus. Developers can also now access GPT-4o in the API as a text and image model.”
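Since the announcement says developers can already call GPT-4o as a text and image model through the API, a minimal sketch of such a request with the official openai Python client might look like the following. The image URL is a placeholder, and the message format follows the client library's general vision-request conventions rather than anything spelled out in the announcement itself.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# One request that mixes text and an image; the URL below is a placeholder.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is shown in this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```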