GPT-4o's voice is so good, OpenAI warns, that it has the potential to create an "emotional attachment" to the user.

OpenAI has released a "system card" for ChatGPT's popular GPT-4o model, outlining the areas of safety concern identified during testing. One of those concerns is the risk of people becoming emotionally attached to the artificial intelligence while using it in voice mode.

The AI lab stated, "Users may form social relationships with the AI, reducing their need for human interaction." GPT-4o was released at OpenAI's Spring Update event in May and is the startup's first truly native multimodal model. This means it can take almost any media as input and output more or less any media, including audio, images, and text.

This native speech synthesis capability powers ChatGPT's Advanced Voice feature, which is now finally rolling out to Plus subscribers.

Although the release was deemed safe, OpenAI says certain aspects of GPT-4o's voice still pose risks, including its impact on human interaction. This draws parallels with the film "Her," in which Theodore Twombly, played by Joaquin Phoenix, falls in love with an AI voiced by Scarlett Johansson.

The system card outlines the areas of risk posed by the new model and helps determine whether it is safe for OpenAI to release it to the public. It includes a framework in which models are scored as "low," "medium," "high," or "critical" for risks related to cybersecurity, biological threats, persuasiveness, and model autonomy. A score of "high" or "critical" in any of these categories would rule out a public release.

GPT-4o scored low in every category except persuasiveness, which was borderline medium, and only because of its ability to synthesize speech (released as Advanced Voice).
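To make the release rule concrete, here is a minimal sketch of the gating logic the framework describes: each risk category gets a score, and any "high" or "critical" score blocks public release. This is purely illustrative Python, not anything OpenAI publishes; the category names and example scores simply follow what the article reports.

```python
from enum import IntEnum

class RiskLevel(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3

def clears_release_gate(scores: dict) -> bool:
    # A model is eligible for public release only if no category
    # reaches "high" or "critical".
    return all(level < RiskLevel.HIGH for level in scores.values())

# Illustrative scores matching what the article reports for GPT-4o:
# low everywhere except persuasiveness, which was borderline medium.
gpt4o_scores = {
    "cybersecurity": RiskLevel.LOW,
    "biological_threats": RiskLevel.LOW,
    "persuasiveness": RiskLevel.MEDIUM,
    "model_autonomy": RiskLevel.LOW,
}

print(clears_release_gate(gpt4o_scores))  # True -> eligible for release
```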

The risk lies in how natural the voice sounds. It can mirror, or even play off, the emotional cues in a human speaker's voice. In demo videos, it can sound as if it is on the verge of crying. Users can interrupt it simply by speaking, and it pauses naturally, as if taking a breath.

During testing, the model occasionally behaved inappropriately, producing erotic, violent, or anxious-sounding outputs. In one instance, it shouted "No!" in the middle of a conversation and then continued speaking in a realistic clone of the user's own voice.

OpenAI says it has addressed these issues, blocking erotic and violent speech, voice cloning, and the generation of copyrighted material, but notes that underlying risks related to persuasiveness and human-like speech remain.

The risk of people treating an AI as if it were human is already present in text-based models, but according to OpenAI, GPT-4o's voice capabilities make it even greater. As the company explains, "During early testing, including red teaming and internal user testing, we observed users using language that might indicate forming connections with the model."

AI models themselves do not feel or experience emotions; they are language models trained on human data. OpenAI even says that GPT-4o is no more agentic or self-aware than any previous model. Rather, its speech synthesis has become so realistic that the problem lies in the emotional states humans perceive in it.

The company warns that extended interaction with its models may even influence social norms. It added, "Our models are deferential, allowing users to interrupt and 'take the mic' at any time, which, while expected for an AI, would be anti-normative in human interactions."

According to OpenAI, this is not all bad, as omni models like GPT-4o have the ability to "complete tasks for the user, while also storing and 'remembering' key details and using those in the conversation."

It will be impossible to get a true picture of the impact on individuals, and on society as a whole, until more people have access. The feature will not be widely available, including on free plans, until next year, and OpenAI says it intends to "further explore the potential for emotional dependence and the ways in which deeper integration of voice modalities with the many features of our models and systems may drive behavior."

When preparing new models for release, AI companies enlist outside red teamers as well as security experts. These people are experts in artificial intelligence and related fields, recruited to push a model to its limits and make it behave in unexpected ways.

Several groups were enlisted to test various aspects of GPT-4o, including whether it could create an unauthorized clone of someone's voice, generate violent content, or reproduce copyrighted material from its training data if pushed, among other risks. In a statement, the company said: "The risks we assessed include speaker identification, unauthorized voice generation, the potential generation of copyrighted content, ungrounded inference, and disallowed content."

The company added that this allowed it to put safeguards and guardrails in place at both the system and model level, such as restricting voice output to a set of pre-approved, authorized voices.
