Meet Voicebox, Meta's new AI model for speech generation

Written by Akash Pandey June 17, 2023 | 12:46 pm 3 min read

Voicebox is trained with 50,000+ hours of pre-recorded speech and transcripts (Photo credit: Meta)

Meta has unveiled Voicebox, a cutting-edge AI model that can perform speech generation tasks such as editing, sampling, and stylizing. Voicebox can generate high-quality audio clips and edit pre-recorded audio, for example by removing car horns or a dog's bark, while preserving the audio's style. It is a multilingual model, capable of producing speech in six languages. Here are more details.

Similar to generative systems for images and text, Voicebox generates outputs in a wide range of styles. However, instead of a picture or a passage of text, it produces high-quality audio clips. The AI tool can either create outputs from scratch or modify a sample given to it. It can help with speech synthesis, audio editing, noise removal, diverse sample generation, and style conversion.

Voicebox is based on the Flow Matching technique

Previously, generative AI models for speech required task-specific training on carefully prepared data. Voicebox instead learns solely from raw audio and the accompanying transcription, and it can modify any part of a given sample. It is based on a technique known as Flow Matching, which, according to Meta, has been shown to improve on diffusion models.
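For readers curious about what Flow Matching asks a model to learn, here is a minimal sketch in plain Python. It assumes the straight-line probability path commonly used in the Flow Matching literature, where a noise sample is moved toward a data sample and the model is trained to predict the velocity along that path. The function names, the `sigma_min` value, and the toy numbers are illustrative assumptions, not Meta's code.

```python
import random


def flow_matching_pair(x0, x1, t, sigma_min=1e-5):
    """Build one training example: the point x_t and its target velocity.

    x0 is a noise sample, x1 is a data sample (lists of floats), t is in [0, 1].
    Assumed path: x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1.
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    # The target is the time derivative of x_t along the assumed path.
    target = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, target


def mean_squared_error(prediction, target):
    """Regression loss between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(prediction, target)) / len(target)


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # stand-in for real audio features
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise of the same shape
t = rng.random()                         # random time step in [0, 1)
x_t, target = flow_matching_pair(x0, x1, t)

# A real model would predict the velocity from x_t, t, and its conditioning
# inputs; an all-zero prediction is used here only as a placeholder.
loss = mean_squared_error([0.0] * len(target), target)
```

At generation time, a trained model's predicted velocities would be integrated step by step from noise toward a sample, which is one reason this family of methods can need fewer steps than typical diffusion sampling.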

The AI model beats VALL-E and YourTTS

On zero-shot text-to-speech, Voicebox surpasses the current state-of-the-art English model, VALL-E, in both intelligibility (a word error rate of 1.9% vs. VALL-E's 5.9%, where lower is better) and audio similarity (0.681 vs. VALL-E's 0.580, where higher is better), while being up to 20 times faster. Voicebox also outshines YourTTS for cross-lingual style transfer, lowering the average word error rate from 10.9% to 5.2% and improving audio similarity from 0.335 to 0.481.
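Word error rate is the standard intelligibility measure behind figures like these: the number of word substitutions, deletions, and insertions needed to turn a transcription of the generated speech into the intended text, divided by the number of words in the intended text. In evaluations of this kind, the generated audio is typically transcribed by a speech recognizer first. The sketch below shows the calculation itself; the function name and example sentences are illustrative and not taken from Meta's evaluation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of six gives a word error rate of about 16.7%.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```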

It can synthesize speech across six languages

According to Meta, Voicebox is trained on 50,000+ hours of recorded speech and transcripts from public-domain audiobooks in English, French, Spanish, German, Polish, and Portuguese. It is trained to predict a speech segment when given the surrounding speech and the transcript of that segment. As a result, the tool can infill speech from context and generate a segment in the middle of an audio recording without having to re-create the entire input.
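The infilling idea can be pictured as cutting a span out of a recording, generating only that span from what surrounds it, and stitching the result back in. The sketch below is purely illustrative: `infill` and `linear_fill` are hypothetical names, audio is reduced to a list of numbers, and the placeholder "model" just interpolates between the neighbouring frames. The real Voicebox model also conditions on the transcript, which this toy version ignores.

```python
def infill(frames, start, end, predict_segment):
    """Regenerate frames[start:end] from the surrounding context only."""
    if not 0 <= start <= end <= len(frames):
        raise ValueError("span must lie inside the recording")
    left, right = frames[:start], frames[end:]
    generated = predict_segment(left, right, end - start)
    if len(generated) != end - start:
        raise ValueError("prediction must match the masked span length")
    # Untouched context is kept as-is; only the masked span is replaced.
    return left + generated + right


def linear_fill(left, right, length):
    """Placeholder predictor: interpolate between the boundary frames."""
    a = left[-1] if left else 0.0
    b = right[0] if right else a
    return [a + (b - a) * (k + 1) / (length + 1) for k in range(length)]


# Frames 1..3 stand in for a noisy stretch that should be replaced.
noisy = [0.0, 9.0, 9.0, 9.0, 4.0]
repaired = infill(noisy, 1, 4, linear_fill)
```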

Use cases of Voicebox

Voicebox can take an audio sample and replicate its style for text-to-speech generation. It can restore a section of speech interrupted by noise or replace mispronounced words. Given a sample of a person's speech and a passage of text in any of the six supported languages, the AI model can produce a reading of the text in that person's voice. Meta says the resulting speech is more representative of how people actually talk.

The AI tool could have several applications in the future

Voicebox is a multipurpose generative AI model, which could give natural-sounding voices to future virtual assistants or non-player characters in the Metaverse. Future applications of this technology could include making audio track editing simple for creators, enabling people to speak any foreign language in their own voice, and allowing visually impaired people to hear written messages read aloud by AI in their friend's voice.

Voicebox isn't currently accessible to the general public

Despite these intriguing applications, neither the Voicebox model nor its code is publicly available at the moment because of the potential risks of misuse. Meta has shared only audio samples and a research paper outlining the methodology and the results achieved with its latest AI model.