Moshi Chat is a new native speech AI model from French startup Kyutai that promises an experience similar to GPT-4o: it can understand tone of voice and be interrupted mid-response.
Unlike GPT-4o, Moshi is a smaller model that can be installed locally and run offline. If its responsiveness can be improved, it could be ideal for future smart appliances.
I had several conversations with Moshi, each lasting about five minutes in the current online demo, but in every case the model ended up repeating the same words over and over and losing coherence.
In one conversation, Moshi began arguing with me: it outright refused to tell me a story, demanded that I ask for facts instead, and would not relent until I said, "Give me the facts."
This all seems to come down to context window size and computational resources, problems that time can solve. OpenAI does not need to worry about Moshi as a competitor yet, but just as Luma Labs, Runway, and others are closing in on Sora's quality, rivals here are showing that they are catching up.
Moshi Chat is the brainchild of the Kyutai lab, built from scratch in six months by a team of eight researchers. While the goal is to stay open and release new models over time, this is the lab's first openly accessible native speech AI.
"This new type of technology will enable smooth, natural, and expressive communication with AI for the first time," the company said in a statement
Its core functionality is similar to OpenAI's GPT-4o, but in a smaller model. It is also available today, whereas GPT-4o's Advanced Voice mode will not be widely available until the fall.
The team suggests Moshi could be used in role-play scenarios or as a coach to spur you on during a workout. By staying open and working with the community, the plan is to let others build on the AI and tweak it further.
Moshi is a 7B-parameter multimodal model called Helium, trained on text and audio codecs. It is natively speech-in, speech-out, and runs on Nvidia GPUs, Apple's Metal, or a CPU.
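To make the "native speech-in, speech-out" design concrete, here is a minimal conceptual sketch of the streaming loop such a model implies: audio tokens go in and audio tokens come out, with no separate speech-to-text or text-to-speech stages. All names here (`AudioCodec`, `SpeechLM`, the frame size) are hypothetical illustrations, not Kyutai's actual API, which has not been published.

```python
import numpy as np

# Hypothetical sketch of a native speech-in, speech-out loop.
# AudioCodec, SpeechLM, and FRAME_SAMPLES are illustrative stand-ins,
# not Kyutai's real interfaces.

FRAME_SAMPLES = 1920  # assumed: ~80 ms of audio at 24 kHz per frame


class AudioCodec:
    """Dummy stand-in for a neural audio codec: waveform <-> discrete tokens."""

    def encode(self, frame: np.ndarray) -> np.ndarray:
        # A real codec quantizes each frame into a small grid of token IDs.
        return (frame[:8] * 1024).astype(np.int64)

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        # A real codec reconstructs audio from tokens; here, silence.
        return np.zeros(FRAME_SAMPLES, dtype=np.float32)


class SpeechLM:
    """Dummy stand-in for the 7B model: maps incoming audio tokens to reply tokens."""

    def step(self, in_tokens: np.ndarray) -> np.ndarray:
        # A real model predicts reply tokens autoregressively, frame by frame.
        return in_tokens


def converse(mic_frames, codec: AudioCodec, model: SpeechLM):
    """Full-duplex loop: no ASR or TTS stage between user and model."""
    for frame in mic_frames:
        in_tokens = codec.encode(frame)      # waveform -> discrete tokens
        out_tokens = model.step(in_tokens)   # model replies directly in tokens
        yield codec.decode(out_tokens)       # tokens -> waveform to play back


if __name__ == "__main__":
    frames = [np.random.randn(FRAME_SAMPLES).astype(np.float32) for _ in range(3)]
    for out in converse(frames, AudioCodec(), SpeechLM()):
        print(out.shape)  # one audio frame out per frame in
```

Operating directly on audio tokens is what allows this kind of model to react within a single frame and be interrupted mid-utterance, rather than waiting for a full transcription round trip.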
Kyutai hopes community support will help enhance Moshi's knowledge base and factuality. As a lightweight base model, both are limited, but the hope is that extending them in combination with native speech will yield a powerful assistant.
The next step is to further refine and extend the model to allow for more complex, longer-form conversations with Moshi.
After trying it out and watching the demo, I found it incredibly fast and responsive for the first minute or so, but the longer the conversation went on, the more incoherent it became. Its lack of knowledge is also apparent, and when I pointed out its mistakes, it got flustered and fell into a loop of "I'm sorry, I'm sorry, I'm sorry."
Moshi is not yet a direct competitor to OpenAI's GPT-4o Advanced Voice. However, an open model that runs locally and has the potential to work in much the same way is an important step forward for open-source AI development.