At its spring update event, OpenAI introduced GPT-4o, a new artificial intelligence model that can reason in real time across audio, image and text. The capabilities this new model offers, and how they will show up in everyday life, are quite impressive. Here are the main features GPT-4o offers:
Memory Ability: GPT-4o can learn from users by remembering previous conversations with them. This personalizes the user experience and allows the model to become smarter and more responsive with every interaction.
Real-Time Translation: The model offers translation support across 50 different languages, with the ability to translate instantly. This allows users to understand content in other languages quickly and accurately (a rough API sketch of such a request follows this feature list).
Solving Math Problems by Voice and Private Tutoring: GPT-4o can walk users through math problems conversationally, explaining the solution steps fluently. This is a great advantage both for students and for professionals who work with mathematics.
Speaking Ability: With its voice communication capability, GPT-4o creates the feeling that you are talking to a real person. Its ability to distinguish differences in vocal intonation provides a more natural and fluent speaking experience.
Multimedia Analysis: By analyzing images and text together, it can relate these data to one another. This allows visual and textual content to be evaluated jointly, producing a more comprehensive analysis.
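The translation capability above becomes tangible when the model is called programmatically. As a rough sketch only (assuming the "gpt-4o" model identifier in the official openai Python client, with a sample sentence invented for illustration rather than taken from OpenAI's announcement), an instant-translation request could look like this:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask GPT-4o to translate a short passage into English.
# The model name "gpt-4o" and the sample sentence are assumptions
# made for illustration; they are not part of OpenAI's announcement.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Translate the user's message into English."},
        {"role": "user", "content": "Merhaba, bugün hava çok güzel."},
    ],
)

print(response.choices[0].message.content)
```

The same request shape extends to any of the 50 supported languages; only the prompt changes.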
These features reflect GPT-4o's wide range of capabilities for interacting with users and performing various tasks. The model will be able to improve itself by continuing to learn from its experiences. GPT-4o will be free for all ChatGPT users, but the company has not yet made a clear statement about when free access will begin. The only announcement from CEO Sam Altman was that “the new voice mode will be released for Plus users in the coming weeks.” You can find the details on this topic in the “Model availability” section at the end of this article.
So what innovations are GPT-4o's capabilities based on? Let's take a look together at the technical explanations OpenAI has shared…
GPT-4o matches GPT-4's level of intelligence but is much faster:
GPT-4o accepts any combination of text, audio and image as input and can produce any combination of text, audio and image as output. It can respond to audio inputs in as little as 232 milliseconds, which is close to human response time in a conversation. As a result, it creates an experience very similar to a dialogue between two people. Before GPT-4o, you could talk to ChatGPT using Voice Mode with average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4). GPT-4 could not directly observe tone of voice, multiple speakers, or background sounds, and it could not produce laughter, singing or expressions of emotion. The reason was that it relied on a pipeline of three separate models: one model converted the audio to text, another took that text in and produced a text response, and a third simple model converted the text back to audio. This process meant that GPT-4, the main source of intelligence, lost a lot of information. With GPT-4o, a single model is trained end to end across text, image and audio, which means that all inputs and outputs are processed by the same neural network. “Since GPT-4o is our first model to combine all of these modalities, we are still only scratching the surface of exploring what the model can do and what its limits are,” says OpenAI.
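To make the old three-model pipeline concrete, here is a rough sketch of what such a cascade looks like when assembled from separate speech-to-text, text, and text-to-speech models. The publicly available whisper-1, gpt-4 and tts-1 endpoints of the openai Python client are used purely as stand-ins; OpenAI has not said which internal models the previous Voice Mode actually chained together. The point is that the text model in the middle only ever sees a transcript, so tone of voice, multiple speakers and background sounds never reach it.

```python
from openai import OpenAI

client = OpenAI()

# Step 1: a speech-to-text model reduces the audio to a plain transcript.
# Tone of voice, multiple speakers and background sounds are discarded here.
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )

# Step 2: the text model (the "main source of intelligence") sees only text.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = reply.choices[0].message.content

# Step 3: a simple text-to-speech model converts the text back to audio.
# It cannot add laughter or emotion that was never present in the transcript.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
with open("reply.mp3", "wb") as f:
    f.write(speech.read())
```

GPT-4o removes the hand-offs between these three stages by processing audio, text and images inside one neural network.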
The new model matches GPT-4 Turbo's performance on English text and code:
GPT-4o also shows a significant improvement in languages other than English. Compared with existing models, GPT-4o is notably better at understanding images and audio. It is also much faster in the API and 50% cheaper.
Model evaluations:
Measured on traditional benchmarks, GPT-4o achieves GPT-4 Turbo-level performance in text, reasoning and coding intelligence, while raising the bar in multilingual, audio and vision capabilities, taking artificial intelligence technology to a new level.
Model safety and limitations:
OpenAI offered a long and detailed explanation on safety: “GPT-4o has safety built in by design across modalities, through techniques such as filtering training data and refining the model's behavior through post-training. We have also created new safety systems to provide guardrails on audio outputs. We evaluated GPT-4o according to our Preparedness Framework and in line with our voluntary commitments. Our evaluations of cybersecurity, CBRN, persuasion and model autonomy show that GPT-4o does not score above Medium risk in any of these categories. This assessment involved running a suite of automated and human evaluations throughout the model training process. We tested both pre-safety-mitigation and post-safety-mitigation versions of the model, using custom fine-tuning and prompts, to better elicit the model's capabilities. GPT-4o has also undergone extensive external red teaming with more than 70 external experts in areas such as social psychology, bias and fairness, and misinformation, to identify risks that are introduced or amplified by the newly added modalities. We used these learnings to build out our safety interventions and improve the safety of interacting with GPT-4o. We will continue to mitigate new risks as they are discovered.”
The description continues as follows:
“We recognize that GPT-4o's audio modalities present a variety of new risks. Today, we are publicly releasing text and image inputs and text outputs. Over the coming weeks and months, we will work on the technical infrastructure, post-training usability, and the safety necessary to release the other modalities. For example, at launch, audio outputs will be limited to a selection of preset voices and will comply with our existing safety policies. We will share further details in the upcoming system card, which covers all of GPT-4o's modalities. Through testing and iteration of the model, we have observed several limitations that exist across all of the model's modalities, a few of which are shown in the video below. We would welcome feedback that helps identify tasks where GPT-4 Turbo still outperforms GPT-4o, so that we can continue to improve the model.”
Model availability:
“GPT-4o is the latest step we have taken to push the boundaries of deep learning, this time in the direction of practical usability. We have spent a great deal of effort over the last two years working on efficiency improvements at every layer of the stack. As a first fruit of this research, we are able to make a GPT-4-level model much more widely available. GPT-4o's capabilities will be rolled out iteratively (with expanded red team access starting today).”
“GPT-4o's text and image capabilities are starting to roll out in ChatGPT today. We are making GPT-4o available in the free tier, and to Plus users with message limits up to 5 times higher. In the coming weeks, we will roll out a new version of Voice Mode with GPT-4o in alpha within ChatGPT Plus. Developers can also now access GPT-4o in the API as a text and image model.”
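Since the announcement says developers can already call GPT-4o as a text and image model through the API, a minimal sketch of such a request with the official openai Python client might look like the following. The image URL is a placeholder, and the message format follows the client library's general vision-request conventions rather than anything spelled out in the announcement itself.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# One request that mixes text and an image; the URL below is a placeholder.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is shown in this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```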