Apple looking to add character to text-to-speech voices

AppleInsider · October 18, 2012 6:17AM

An Apple patent application discovered on Thursday outlines an invention that uses metadata from emails, texts and other communications to determine how a synthesized voice sounds in a text-to-speech (TTS) system.

Source: USPTO

The filing, titled "Voice assignment for text-to-speech output," looks to create "speaker profiles" which can change the voice characteristics of TTS output to match parsed-out metadata like age, sex, dialect and other variables.

As noted by the application, many systems exist today to aid the visually impaired, including the system on Apple's iPhone. However, most TTS engines "generate synthesized speech having voice characteristics of either a male speaker or a female speaker. Regardless of the gender of the speaker, the same voice is used for all text-to-speech conversion regardless of the source of the text being converted." Apple's invention proposes a different solution.

Instead of hearing the same voice for every message, the invention obtains metadata "directly from the communication or from a secondary source identified by the directly obtained metadata" to create the most suitable speaker profile.

According to the patent filing, "Providing a speech output that is associated with a speaker profile allows speaker recognition while providing a more enjoyable and entertaining experience for the listener."

An example is provided in which a user receives a message from "Charles Prince," who has an email address of [email protected], regarding a party for "Albert." In this case, the system could use the ".uk" address as primary metadata. Secondary metadata can be gathered if a contact card is attached to the message, or if Charles Prince's information is already in the user's address book.
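The two-stage lookup in this example can be illustrated with a minimal Python sketch. It is not taken from the filing: the `Contact` fields, the function names, the address `charles@example.co.uk` and the birth year are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contact:
    """Hypothetical address-book entry; the field names are assumptions."""
    name: str
    email: str
    birthday_year: Optional[int] = None
    sex: Optional[str] = None


def primary_metadata(sender_email: str) -> dict:
    """Metadata read directly from the message: here, only the address suffix."""
    email = sender_email.lower()
    domain = email.rsplit("@", 1)[-1]
    return {"email": email, "top_level_domain": domain.rsplit(".", 1)[-1]}


def secondary_metadata(primary: dict, address_book: list) -> dict:
    """Metadata from a secondary source found by way of the primary metadata."""
    for contact in address_book:
        if contact.email.lower() == primary["email"]:
            return {
                "name": contact.name,
                "birthday_year": contact.birthday_year,
                "sex": contact.sex,
            }
    return {}  # sender is not in the address book


def gather_metadata(sender_email: str, address_book: list) -> dict:
    """Merge primary metadata with whatever the secondary source adds."""
    primary = primary_metadata(sender_email)
    return {**primary, **secondary_metadata(primary, address_book)}


# Hypothetical address book and sender, for illustration only.
book = [Contact("Charles Prince", "charles@example.co.uk", 1948, "male")]
meta = gather_metadata("Charles@example.co.uk", book)
print(meta["top_level_domain"], meta["name"])  # prints: uk Charles Prince
```

If the sender is missing from the address book, only the primary metadata is returned, which mirrors the article's point that more metadata allows a more refined result.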

Metadata samples.

The data from the text and the corresponding metadata are then fed into a TTS engine, which assigns a speaker profile to convert the text into speech.

After converting each word and its phonetic transcription into the distinct sounds that make up a given language, the TTS engine divides the text into rhythmic units such as phrases, clauses and sentences and marks them.

In some implementations, speech can be created by piecing together pre-recorded voice fragments, including sounds, entire words or even sentences, that are stored on a mobile device or in an off-site database.
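As a rough illustration of piecing speech together from stored fragments, the Python sketch below plans playback by preferring the longest recorded phrase available at each position. The greedy longest-match strategy, the file names and the `<synthesize:...>` fallback marker are assumptions made for this example; the filing does not specify how fragments are chosen.

```python
def plan_fragments(text: str, fragment_store: dict) -> list:
    """Return the stored recordings to play, in order, for the given text."""
    words = text.lower().split()
    plan = []
    i = 0
    while i < len(words):
        # Try the longest phrase starting at word i first, then shorter ones.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in fragment_store:
                plan.append(fragment_store[phrase])
                i = j
                break
        else:
            # No recording covers this word: flag it for some other method.
            plan.append(f"<synthesize:{words[i]}>")
            i += 1
    return plan


# Hypothetical fragment store mapping phrases to recorded audio files.
store = {
    "happy birthday": "happy_birthday.wav",
    "happy": "happy.wav",
    "albert": "albert.wav",
}
print(plan_fragments("Happy birthday Albert", store))
# prints: ['happy_birthday.wav', 'albert.wav']
```

Preferring longer fragments keeps more of the natural rhythm of the original recording, at the cost of needing a larger store.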

In other implementations, the TTS engine can include a synthesizer that "incorporates a model of the human vocal tract or other human voice characteristics to create a synthetic speech output according to the speaker profile."

One of the most interesting implementations notes that "a speaker's voice can be recorded and analyzed to generate voice data."

From the patent filing's description:

For example, the speaker's voice can be recorded by a recording application running on the device or during a telephone call (with permission). The voice characteristics of the speaker can be obtained using known voice recognition techniques. In this implementation, a speaker profile may not be necessary as the speaker's name can be directly associated with voice data stored in voice database.

As for output, the system may use the ".uk" email address as primary metadata and draw on contact card information, such as a birthday, to determine sex and age, then select a speaker profile matching an older male with a British accent. Charles Prince's physical address, phone number, or picture can also be used to determine a speaker profile. The more metadata available, the more refined the output.
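The selection step can be sketched in Python as a simple mapping from metadata to profile attributes. Everything specific here is an assumption for illustration: the `SpeakerProfile` fields, the domain-to-accent table, the age threshold of 60, and the choice to read sex from its own contact field are not details given in the filing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeakerProfile:
    """Hypothetical profile; a real system would carry far more detail."""
    accent: str
    sex: str
    age_group: str


# Illustrative mapping only; the filing does not list specific domains.
ACCENT_BY_TLD = {"uk": "British", "us": "American", "au": "Australian"}


def choose_profile(metadata: dict, current_year: int) -> SpeakerProfile:
    """Pick voice characteristics from whatever metadata is available."""
    accent = ACCENT_BY_TLD.get(metadata.get("top_level_domain"), "default")
    sex = metadata.get("sex") or "unspecified"
    birth_year = metadata.get("birthday_year")
    if birth_year is None:
        age_group = "unspecified"
    elif current_year - birth_year >= 60:  # assumed cutoff for "older"
        age_group = "older"
    else:
        age_group = "adult"
    return SpeakerProfile(accent, sex, age_group)


# Hypothetical metadata for the Charles Prince example.
charles = {"top_level_domain": "uk", "sex": "male", "birthday_year": 1948}
print(choose_profile(charles, 2012))
# prints: SpeakerProfile(accent='British', sex='male', age_group='older')
```

With no metadata at all, the function falls back to a default profile, which corresponds to the single-voice behavior the filing aims to improve on.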

Flowchart of TTS system.

It is unclear whether Apple plans to deploy such a system, but the company currently has a similar, albeit less advanced, system in place with Siri. While the feature is limited to certain regions, Siri has an option to choose dialects like "English (United States)" or "English (United Kingdom)" to recognize incoming voice commands, as well as provide responses in the selected accent.

lightknight · October 18, 2012 7:53AM

"Hello, sweetheart. I'm so excited by your picture, please write me asap on sexygirl.apple.xxx" ?

clemynx · October 18, 2012 8:07AM

Very clever invention! That's a patent I'd love to see working! Adapting voice synthesis from metadata could come first; then, after people get used to the tech, voice recording for synthesis could be used. The phone would need to be extremely secure though. I wouldn't want speech patterns for all my contacts to get lost in the wild.

solipsismx · October 18, 2012 8:22AM

clemynx wrote: »

Very clever invention! That's a patent I'd love to see working! Adapting voice synthesis from metadata could come first; then, after people get used to the tech, voice recording for synthesis could be used. The phone would need to be extremely secure though. I wouldn't want speech patterns for all my contacts to get lost in the wild.

This is great and all but it seems very "sci-fi". I'd expect to see many other changes in how text-to-speech works long before this patent gets implemented.

For starters, I hate that artists in my Music are spoken incorrectly when the name is well known. The system should have a digital phonetic spelling for all artists so that it can be as accurate as possible.

Next, I'd like for the system to allow me to record the names of people in my contacts. Not to have my recording played back to me when Siri reads it off, but so that the pattern I use can be processed and used to get a playback from the system. For instance, the name Jim is being pronounced as |gim| by Siri. But even names it does get right for the masses might be unique for different dialects or other languages or cultures, and it would be nice if Siri tried to learn the proper one once corrected. This is much like the first one except it's more individual and therefore would be harder to implement.

Finally, I'd like for Apple to get with linguists to create a paragraph that details all phonemes of a language so that when you first sign up for Siri it will have you speak each sentence and will record every part of your voice which it will then process and store with your on-line profile so that it will better understand your accent, your dialect, and/or any speech aberrations you may have.

ireland · October 18, 2012 8:28AM

Speech to text?

notscott · October 18, 2012 8:34AM

It would be wonderful to have "speaker profiles" for everyone I know, so I can send "voice" messages from them to others, destroying their relationships and lives. That's going to be awesome.

clemynx · October 18, 2012 9:01AM

Quote:

Originally Posted by SolipsismX

Finally, I'd like for Apple to get with linguists to create a paragraph that details all phonemes of a language so that when you first sign up for Siri it will have you speak each sentence and will record every part of your voice which it will then process and store with your on-line profile so that it will better understand your accent, your dialect, and/or any speech aberrations you may have.

That would be a way to do it, but it's not very elegant. The user shouldn't notice when Siri is learning about his speech characteristics.

Quote:

Originally Posted by NotScott

It would be wonderful to have "speaker profiles" for everyone I know, so I can send "voice" messages from them to others, destroying their relationships and lives. That's going to be awesome.

"Darling, I'm leaving you, I think our sex life is no longer what it used to be"

Sent from granma to granpa.

MacPro · October 18, 2012 9:14AM

notscott wrote: »

It would be wonderful to have "speaker profiles" for everyone I know, so I can send "voice" messages from them to others, destroying their relationships and lives. That's going to be awesome.

It's a shame Michael Crichton left us. He always seemed to get a novel out about the latest tech advance early enough for it to seem like sci-fi. I'm sure he'd have had fun with this topic as he did with images that could be manipulated ... OMG ... at the pixel level!!! (Remember when that seemed sci-fi?).

notscott · October 18, 2012 11:02AM

"My wife's real voice isn't as sexy as the speaker profile I created for her. So... phone sex it is!"

Marvin · October 18, 2012 12:40PM

It would be cool if they turned recordings of Steve Jobs into an iPhone voice. In Maps:

"The road ahead is a well-worn path, take the next exit."
"There's an insanely great restaurant coming up in 5 miles."
"You're driving it wrong. Make a U-turn."
"You can't connect the dots looking forward but stop checking your makeup in the mirror and keep your eyes on the road ahead"
"Innovation distinguishes between a leader and a follower so innovate and overtake that bus in front of you or you're going to be late."
"Being the richest man in the cemetery doesn't matter. Going to bed at night saying you've done something wonderful is what matters so slow down, you are breaking the speed limit."
"Boom. You have reached your destination."

tallest skil · October 18, 2012 12:50PM

Originally Posted by Marvin

It would be cool if they turned recordings of Steve Jobs into an iPhone voice. In Maps:

"The road ahead is a well-worn path, take the next exit."

"There's an insanely great restaurant coming up in 5 miles."

"You're driving it wrong. Make a U-turn."

"You can't connect the dots looking forward but stop checking your makeup in the mirror and keep your eyes on the road ahead"

"Innovation distinguishes between a leader and a follower so innovate and overtake that bus in front of you or you're going to be late."

"Being the richest man in the cemetery doesn't matter. Going to bed at night saying you've done something wonderful is what matters so slow down, you are breaking the speed limit."

"Boom. You have reached your destination."

"You know… I think there's a better way to go." (for when there's traffic and it's rerouting you)

"The back of your car looks better than the front of theirs!" (at random times)

mstone · October 18, 2012 2:02PM

The thing about TTS for blind users is that a synthesized voice which has no inflections or tonal variations can be understood at very high rates, up to 400 words per minute. If a user had to listen to a different voice for each email, they would need to slow the rate down to a normal speaking pace. That is OK if you want to listen to a voice for the sake of entertainment, but for pure transmission of information this new patent would not be practical.
