Whatever you may think of the robotic voices foisted upon the world thanks to Google Voice Search and Siri, you’re unlikely to mistake them for human voices. For years, the state of the art in computer speech synthesis has been stuck at a fairly low level. However, new software called WaveNet, from the brainiacs at DeepMind, is setting a high-water mark in the field of speech synthesis and giving AI a voice eerily similar to that of a human.
For years, roboticists have spoken about something called the uncanny valley — the creepy feeling one gets when observing a robot that is too humanlike to read as distinctly mechanical, but not quite human enough to be mistaken for a person.
Perhaps one reason there has been no parallel concept for robotic speech is that, to date, no speech synthesizer has attained a quality close enough to a human voice to be disturbing. With DeepMind’s WaveNet, we may be witnessing the emergence of something like an uncanny waveform: a robotic voice close enough to our own to be distinctly creepy. Or, like me, you may just rejoice that there’s finally hope for an ebook reader that doesn’t sound like the reanimated corpse of a 1980s Commodore computer.
The secret sauce behind this new standard in robotic speech, ironically enough, is artificial intelligence — albeit with a little help from some smart software engineers along the way.
We may as well get used to this state of affairs, as it looks increasingly likely that advances in fields like robotics and AI will be realized with the help of artificial intelligence itself. While this virtuous feedback loop still includes human intermediaries, a trend toward self-improving AI may be in the offing — along with all the concomitant existential risks this betokens. Regardless, let’s take a closer look at WaveNet and see how artificial intelligence has enabled and is, indeed, the backbone behind DeepMind’s new speech synthesizer.
To date, most speech synthesizers have been of two types: concatenative text-to-speech and parametric text-to-speech. Concatenative text-to-speech is the method behind the so-called “high quality” synthesizers used by Google Voice and Siri. It produces a more realistic sound by drawing on large audio files of real people’s voices, chopped up and reorganized to form whatever word the computer is enunciating. The downside is that it is difficult to color the speech with changes of emotion or emphasis.
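The essence of the concatenative approach can be sketched in a few lines. This is a toy illustration, not any real TTS engine: the “recordings” here are placeholder sine bursts standing in for segmented diphones, and the unit names and functions are invented for the example.

```python
import math

SAMPLE_RATE = 16000

def fake_unit(freq_hz, ms=80):
    """Placeholder 'recording': a short sine burst standing in for a stored speech unit."""
    n = SAMPLE_RATE * ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

# Hypothetical unit inventory; a real system segments hours of recorded speech.
unit_db = {"HH": fake_unit(200), "EH": fake_unit(300),
           "L": fake_unit(250), "OW": fake_unit(350)}

def concatenate(units):
    # The core of concatenative TTS: look up each unit's stored waveform and
    # splice them end to end (real systems also smooth the joins).
    out = []
    for u in units:
        out.extend(unit_db[u])
    return out

audio = concatenate(["HH", "EH", "L", "OW"])  # "hello", crudely
```

Because the output is stitched from fixed recordings, changing emotion or emphasis would require a whole new inventory of recorded units — hence the limitation noted above.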
The alternative, parametric text-to-speech, uses a rule-based system derived by applying statistical models to speech patterns. The stilted, robotic-sounding synthesizers are mostly of this latter type, since they rely on the computer to generate the audio signal rather than on recordings of real human voices.
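The parametric approach works the other way around: a statistical model predicts acoustic parameters (pitch, loudness, and so on) from text, and a vocoder renders audio from those parameters. A minimal sketch, with a made-up parameter track and a single-sine “vocoder” standing in for the real thing:

```python
import math

SAMPLE_RATE = 16000

def render_frame(f0_hz, amplitude, ms=10):
    """Toy 'vocoder': render one short frame of audio from acoustic parameters."""
    n = SAMPLE_RATE * ms // 1000
    return [amplitude * math.sin(2 * math.pi * f0_hz * i / SAMPLE_RATE)
            for i in range(n)]

# In a real parametric system, a statistical model predicts these parameter
# tracks from the input text; here they are invented: pitch falling and
# amplitude fading over a half-second utterance.
frames = []
for k in range(50):
    frames.extend(render_frame(f0_hz=220 - k, amplitude=1.0 - k / 100))
```

Since every sample is computed from a handful of parameters rather than drawn from a human recording, the result is flexible but tends toward the buzzy, robotic timbre described above.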
The WaveNet system can be thought of as an improvement upon concatenative text-to-speech, in that it still employs recordings of real human voices. But instead of chopping these up and reorganizing them in the old way, it uses an artificial neural network to generate synthetic utterances based on the voices it was trained with. The downside is that this system is computationally intensive. Modeling raw audio typically requires 16,000 samples per second, with each sample being influenced by all the previous ones. This is well beyond the processing power of a typical smartphone, but not unthinkable for hardware like Nvidia’s DGX-1 deep-learning supercomputer.
DeepMind has some audio samples posted on its WaveNet page if you want to hear what it sounds like. For the time being, you’re unlikely to encounter WaveNet out in the wild, but it’s not unthinkable that this system will someday power the voice on your ebook reader or a smart home console — that is, if a recursively self-improving AI hasn’t obliterated humankind first.