Google's 'Translatotron' AI can translate in speaker's voice: Here's how

By Shubham Sharma

May 16, 2019

01:35 pm

What's the story

For years, we have used Google Translate (among other tools) to convert speech in one language into another. The technology works pretty well, but Google is now pushing it forward with a project it calls 'Translatotron'. It is an AI-based system that translates speech directly into speech while retaining the vocal characteristics of the speaker. Here's what that means.

Current tech

Conventional translation involves text

If you have used Google Translate or any other voice translation tool, you probably know that current translation systems work in multiple stages. The words you speak are first recognized and converted into text, that text is then translated into the language you prefer, and finally, the translated text is synthesized as speech, which sounds like a robot talking.
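The three stages can be sketched as a chain of functions. This is only an illustration: the function names and the tiny lookup tables below are hypothetical stand-ins for trained models, not how Google Translate or any real product is built.

```python
# Toy lookup tables standing in for trained models (illustrative assumptions).
TOY_RECOGNIZER = {"audio:hola": "hola"}
TOY_TRANSLATOR = {"hola": "hello"}


def recognize(audio_id: str) -> str:
    """Stage 1: speech recognition, audio -> source-language text."""
    return TOY_RECOGNIZER[audio_id]


def translate(text: str) -> str:
    """Stage 2: machine translation, source text -> target-language text."""
    return TOY_TRANSLATOR[text]


def synthesize(text: str) -> str:
    """Stage 3: text-to-speech, target text -> synthetic audio."""
    return f"synthetic-audio:{text}"


def cascade_translate(audio_id: str) -> str:
    # Each stage consumes the previous stage's output, so a mistake made
    # early on is passed down the chain.
    return synthesize(translate(recognize(audio_id)))


result = cascade_translate("audio:hola")
```

Note that the speaker's own voice is lost at the first stage, since only text is handed to the next one.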

Tech

Now, Google is bringing 'Translatotron'

In a bid to simplify this multi-stage process, Google has announced its experimental Translatotron system. The company says the tech uses machine learning to convert words and sentences spoken in one language into speech in another language, without any text step in between. According to Google, this makes translation faster and reduces the risk of errors compounding from one stage to the next.

Working

How AI translates speech into speech directly

After years of work on speech-to-speech translation, Google brought Translatotron to life by using machine learning to convert spectrograms of speech in one language into spectrograms in another. For the uninitiated, a spectrogram is a visual representation of the frequencies in an audio signal as they vary over time; spectrograms are also called sonographs, voiceprints, or voicegrams. Working on these representations is what lets the system translate directly.
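To make the idea of a spectrogram concrete, here is a minimal sketch that computes one with only Python's standard library. It is not Google's code: the frame size, hop length, window, and test tone are all assumptions chosen for illustration, and the article does not say what settings Translatotron uses.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Split audio into overlapping frames and return, for each frame,
    the magnitude of every non-negative frequency bin."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window tapers the frame edges to reduce spectral leakage.
        windowed = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        # Naive discrete Fourier transform (slow, but easy to read).
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# Assumed test signal: a 1,000 Hz tone sampled at 8,000 Hz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]

spec = spectrogram(tone)
# The strongest bin in the first frame, converted back to hertz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / 64
```

Each row of `spec` is one slice of time and each column is one frequency band, which is the grid a speech-to-speech model can learn to map from one language to another.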

Key advantage

Plus, it retains the vocal characteristics of the speaker

The process, as we said, is faster than the conventional technique, but more importantly, it preserves expression. Essentially, the system relies on a neural vocoder and a speaker encoder, which help it retain the speaker's vocal characteristics in the translated speech. So, instead of an expressionless robotic voice, the translation comes out in the tone and voice of the original speaker.

Issue

However, the system isn't perfect yet

While Translatotron could shape the future of machine-based speech translation, it is not perfect yet. Google says the system's translations aren't as accurate as those from conventional translation technologies. It still needs to evolve, but in case you want to hear Translatotron speaking, head over to https://bit.ly/2EdKsUb. It is not half bad.