Some Thoughts on Voice


Lately, I've been thinking about the rapidly emerging voice technologies, a subset of artificial intelligence and machine learning.  Most of the big technology companies now offer a voice assistant through a mobile phone or smart speaker: Amazon with Alexa, Apple with Siri, Microsoft with Cortana, etc.  Forecasts predict that 50% of all online searches will be done by voice by 2020 (driven by mobile), and an estimated 13% of US households own a smart speaker today.

My take is that voice is another interface, not a platform in and of itself or a new paradigm in computing.  Voice is inherently a low-bandwidth medium, which makes it strong for certain types of activities, specifically giving commands and getting answers to basic questions: “Alexa, play music” or “Google, what is the weather going to be today?”  This is borne out by the plethora of studies released lately examining how consumers use voice technologies.  The most common uses of voice today by far are asking for directions, asking a quick question, calling someone, checking the time, or playing music.

I expect to see the voice interface dominate certain use cases such as home automation, but I suspect it’s overhyped for a lot of other common digital use cases.  A big one in this bucket would be online shopping.  Humans are inherently visual creatures, and the ability to see a product and read about it will trump being able to purchase purely through a voice dialog.  The only exceptions would be repeat purchases of the same product or a really basic item whose brand you don’t care about (“Alexa, order me a stapler”).  Obviously, Amazon disagrees with me, having stated that it has over 5,000 employees currently working on Alexa and Echo technologies, but I remain bearish on voice for eCommerce.

In terms of search, I doubt voice will be very useful for discovery in general or for new customer acquisition by brands and advertisers, again because of its low bandwidth.  It’s just not a good interface for comparison or discovery.  That of course won’t stop Google and Amazon from offering a new type of ad unit in the very near future, where you can pay these companies to be the featured product in a voice search.

Noted technology pundit Scott Galloway has been predicting the death of brands due to voice for the last year, but I have to disagree with him here; I think he is greatly exaggerating the impact this technology will have.  Voice will simply become one of several interfaces users have access to as it seeps into common usage, but it certainly won’t replace touch, keyboards, etc.

Emerging Trend: Voice Synthesis and the Next Generation of Fake News

Voice synthesis is the next generation of audio editing.  It has been called “the Photoshop of voice” and is a rapidly emerging technology that allows software to convert text into synthesized speech in a voice that is nearly indistinguishable from the real thing.  It allows anyone to edit a recording so that it sounds like the person actually said the edited words, or even to create entirely artificial sound-alike voice recordings of anyone in the world.

Voice synthesis software works by taking in sound clips of the person you want to copy as inputs, which the software can then convert into any sound or combination of sounds (i.e. words and full sentences).  Examples of these new types of software include Adobe Voco, WaveNet, and the recently launched Descript.

My concern is around the ethical issues and the next level of fake news this type of technology will give rise to.  It is extremely scary to think about anyone having the ability to make it sound like world and corporate leaders have said anything they want.  It also extends to creating songs and albums that sound like they were sung by your favorite celebrity, without their consent.  Even worse, the output is nearly indistinguishable from real voice recordings, with no way to verify whether the audio is real or fake.

There are many legitimate use cases for this type of software, especially for members of the media and the record industry, but I do have a lot of concern that voice synthesis technology will enable a more insidious and dangerous form of fake news and manipulation.  Being able to make anyone say anything, with no way to prove the recording is fake, is a scary prospect with a number of potentially malicious use cases.

You can see this technology in action on Barack Obama below as well as a demonstration of Descript's software.

Crypton Future Media



Crypton Future Media is a startup that makes sound- and music-related software, including sound synthesizers and music library manipulation tools.  The company licences its software to large and small companies across a number of industries, including everyone from small music stores to Nintendo, Apple, government organizations, traditional media companies, etc.  In addition, the company has an eCommerce business for mobile devices, including ringtones.

Most noteworthy of their products is Hatsune Miku, a software singing voice synthesizer that has morphed into a community-constructed digital persona and cyber celebrity throwing sold-out live concerts everywhere from Los Angeles to Tokyo.  To really grasp the concept, see this amazing video of a packed concert in which a piece of voice software attracts cheering legions of fans.  Hatsune Miku has over 100,000 unique songs, millions of social media fans, and hundreds of thousands of music videos, all created by a global creative fanbase.  With the smashing success of turning a piece of software into a celebrity persona, Crypton has created more cyber celebrities around their other voicebank products and software, with similar results and fanbases.  Amazingly, this allows Crypton to monetize not just by selling their software, but also through concert ticket sales and character licensing of their products.

The team is based in Sapporo, Japan.

Why I Like Them

I like Crypton because they have a fascinating, almost science-fiction-like business: by giving characterization to software, they inadvertently created the first ever cyber celebrity.  The company has even gone as far as assigning a gender (female), an age (a perpetual 16), height, weight and blood type to make "her" more relatable to fans.  The idea of making a software product into a humanoid figure (a celebrity no less) with human traits boggles my mind every time I think about it.  The fact that it worked beyond their wildest expectations further blows my mind.  This humanization of a product (an intangible product like software no less) is brilliant and could very well be the next generation of marketing.  You aren't buying a product; you are buying a piece of, and an interaction with, a celebrity.

The iPhone is the most successful consumer product in history and even it is not given a persona or celebrityhood status by its fans.  Crypton has broken innovative new ground on what could very much be a new type of product and marketing strategy - digital celebrities wholly owned by corporations but with their content created by a global fan community.  I can't wait to see what happens when this concept meets more traditional AI technologies!

Disclosure:  All information is from publicly available sources; I have not had any contact with a member of the company or its investors.

Hatsune Miku, Crypton Future Media's most successful product and the world's first digital celebrity.