Voice synthesis is the next generation of audio editing. It has been called “the Photoshop of voice” and is a rapidly emerging technology that lets software convert text into synthesized speech in a voice that is nearly indistinguishable from the real thing. It allows anyone to edit a recording so that it sounds as if the speaker actually said the edited words, or even to create entirely artificial sound-alike recordings of anyone in the world.
Voice synthesis software takes sound clips of the person you want to copy as input and uses them to generate any sound or combination of sounds in that voice (i.e., words and full sentences). Examples of this new type of software include Adobe Voco, WaveNet, and the recently launched Descript.
My concern is with the ethical issues and the next level of fake news this type of technology will give rise to. It is extremely scary to think that anyone could make it sound like world and corporate leaders have said anything they want. The same goes for creating songs and albums that sound like they were sung by your favorite celebrity, without that celebrity's consent. Even worse, the output is nearly indistinguishable from real voice recordings, and there is no reliable way to verify whether a given piece of audio is real or fake.
There are many legitimate use cases for this type of software, especially for members of the media and the record industry, but I am very concerned that voice synthesis technology will enable a more insidious and dangerous form of fake news and manipulation. Being able to make anyone appear to say anything, with no way to prove otherwise, is a scary prospect with a number of potentially malicious uses.
You can see this technology in action on Barack Obama below, along with a demonstration of Descript's software.