Moshi Chat's Advanced Voice Competitor GPT-4o Responds - OpenAI Still Has Nothing to Worry About

Moshi Chat is a new native speech AI model from French startup Kyutai that promises an experience similar to GPT-4o: it can understand tone of voice and be interrupted mid-response.

Unlike GPT-4o, Moshi is a smaller model that can be installed locally and run offline. If its responsiveness can be improved, it could be ideal for future smart appliances.

I had several conversations with Moshi, each lasting about five minutes in the current online demo, but every one of them ended with the model repeating the same words over and over and losing coherence.

In one conversation, Moshi began to argue with me: it outright refused to tell me a story, demanded that I ask for facts instead, and would not relent until I said, "Give me the facts."

This all seems to be an issue of context window size and computational resources, easily solved with time. OpenAI does not need to worry about competing with Moshi yet, but just as Luma Labs, Runway, and others are closing in on Sora's quality, competitors are showing they are catching up on voice too.

Moshi Chat is the brainchild of the Kyutai lab, built from scratch over the past six months by a team of eight researchers. While the goal is to stay open and build new models over time, this is the first openly accessible native speech AI.

"This new type of technology will enable smooth, natural, and expressive communication with AI for the first time," the company said in a statement.

Its core functionality is similar to OpenAI's GPT-4o, but in a smaller model. It is also available today, whereas GPT-4o's Advanced Voice will not be widely available until the fall.

The team suggests that Moshi could be used in role-play scenarios or as a coach to spur you on during training. By working with the community and being open, the plan is to allow others to build on the AI and tweak it further.

Moshi is built on Helium, a 7B-parameter multimodal model trained on text and audio codecs. It is natively speech-in, speech-out, and runs on Nvidia GPUs, Apple's Metal, or the CPU.
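To make that backend list concrete, here is a minimal sketch of how a locally installed speech model might pick between those backends in PyTorch. The device-selection calls are standard PyTorch; the commented-out load_moshi loader and generate call are hypothetical placeholders, not Kyutai's actual API.

```python
import torch

def pick_device() -> torch.device:
    """Prefer an Nvidia GPU (CUDA), then Apple's Metal (MPS), then the CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
print(f"Running locally on: {device}")

# Hypothetical loader and call, shown only to illustrate the native
# speech-in, speech-out flow the article describes:
# model = load_moshi("kyutai/moshi-7b", device=device)  # placeholder name
# audio_out = model.generate(audio_in)  # audio tensor in, audio tensor out
```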

Kyutai hopes that community support will help enhance Moshi's knowledge base and factuality. Both are limited in a lightweight base model, but the hope is that extending them in combination with native speech will result in a powerful assistant.

The next step is to further refine and extend the model to allow for more complex and longer-form conversations with Moshi.

After trying it out and watching the demo, I found it incredibly fast and responsive for the first minute or so, but the longer the conversation went on, the more incoherent it became. Its lack of knowledge is also apparent, and when I pointed out its mistakes, it got upset and went into a loop of "I'm sorry, I'm sorry, I'm sorry."

This is not yet a direct competitor to OpenAI's GPT-4o Advanced Voice. However, providing an openly available model that runs locally, with the potential to work in much the same way, is an important step forward for open-source AI development.
