TechingToday

Microsoft has developed an AI voice generator so convincing that it is too dangerous to release


Many companies are working hard to develop models that can understand and reproduce natural speech patterns. While something like ChatGPT Voice could change storytelling forever, Microsoft claims to have reached the pinnacle of speech generation.

In fact, according to the company's researchers, their VALL-E 2 text-to-speech (TTS) generator is so advanced that it would be irresponsible and dangerous to make it public. According to a research paper spotted by our sister site LiveScience, the generator needs only a few seconds of audio to reproduce a voice that is indistinguishable from a human's.

With this in mind, Microsoft's scientists say that speech generated by VALL-E 2 matches or exceeds the quality of human speech when compared with samples from the LibriSpeech and VCTK speech libraries.

"VALL-E 2 is the latest advance in neural codec language modeling and a milestone in zero-shot text-to-speech (TTS) synthesis, achieving human-like quality for the first time," the researchers wrote.

"Furthermore, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that were previously difficult because of their complexity and repetitive phrasing."

Although the researchers have not made their model public (more on that later), they have made several audio samples available in a blog post about the project. There you can hear speaker prompts taken from LibriSpeech alongside the output of both the VALL-E and VALL-E 2 generators, which speak completely new (and complex) sentences.

And while the first-generation model sounds stilted, there is no denying that VALL-E 2 does an exceptional job of copying the speaker's resonance and articulation.

Microsoft's VALL-E 2 TTS generator relies on two specific features to achieve its impressive results: "Repetition Aware Sampling" and "Grouped Code Modeling."

The first is designed to make the output sound more fluid by addressing problems caused by the repetition of small units of words and phrases (known as tokens), which can trip up the AI.
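For readers who want a feel for the idea, here is a minimal Python sketch of repetition-aware sampling. Microsoft has not released VALL-E 2's code, so everything in it is an assumption made for illustration: the function names, the `window` and `max_ratio` parameters, and the fallback rule itself (draw a token with nucleus sampling, then redraw from the full distribution if that token already dominates the recent history).

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample from the smallest set of tokens whose total probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


def repetition_aware_sample(probs, history, top_p=0.8, window=10,
                            max_ratio=0.5, rng=random):
    """Hypothetical decoding step; `window` (>= 1) and `max_ratio` are illustrative."""
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        # How much of the recent output is already this same token?
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # Assumed fallback: sample from the full distribution so that
            # less likely tokens get a chance to break the loop.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token
```

When the history is stuck on a single token, the fallback gives lower-probability tokens a chance to appear, which is one way a decoder could escape a loop without giving up the stability of nucleus sampling the rest of the time.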

The second feature improves efficiency by reducing the number of individual tokens that the model processes in a single input sequence.

"VALL-E 2 outperforms previous zero-shot TTS systems in speech robustness, naturalness, and speaker similarity," the researchers wrote in a blog post.

"VALL-E 2 is able to produce accurate and natural speech in the original speaker's voice, rivaling human performance."
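The grouping idea behind Grouped Code Modeling, described above, can be illustrated in a few lines of Python. This is only a sketch under stated assumptions, not Microsoft's implementation: `group_codes`, the fixed `group_size`, and the choice to pad the final group with `pad_token` are all hypothetical and exist only for the example.

```python
def group_codes(codes, group_size, pad_token=0):
    """Split a flat token sequence into fixed-size groups.

    Each group is treated as one step, so the sequence the model has to
    walk through becomes roughly `group_size` times shorter.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    remainder = len(codes) % group_size
    # Assumed behavior: pad the tail so every group has the same length.
    padding = [pad_token] * ((group_size - remainder) % group_size)
    padded = list(codes) + padding
    return [tuple(padded[i:i + group_size])
            for i in range(0, len(padded), group_size)]


codes = [11, 42, 42, 7, 19, 3, 3, 28]
steps = group_codes(codes, group_size=2)
print(len(codes), "tokens ->", len(steps), "steps")  # prints: 8 tokens -> 4 steps
```

Halving the number of steps in this way is what makes the shorter sequence cheaper to process; how the real model predicts all the codes inside a group is not covered by this sketch.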

Microsoft says there are applications for AI voice generators capable of this level of output, such as generating speech for people with aphasia or amyotrophic lateral sclerosis (ALS), but the company is currently limiting the model's use to research only.

"There are currently no plans to incorporate VALL-E 2 into products or expand its availability to the public," the scientists write. This is partly because of the potential for abuse if VALL-E 2 were made available to the world. In an ethics statement at the end of their post, the researchers write that their creation "may involve potential risks in misuse of the model, such as spoofing of voice identification or spoofing of specific speakers."

Microsoft is not alone in this. OpenAI, the developer of ChatGPT, has also placed restrictions on some of its speech technology and has built a deepfake detector that lets users identify images created using AI.

Whether VALL-E 2 (or its successor) will stay under wraps remains to be seen. As the AI race intensifies in the coming months and years, companies and scientists will undoubtedly feel pressure to push the envelope.