TELL ME WHY - Speech Synthesis on iPhone OS X

Cleverboy · Jul 24, 2008

Here's what I'm wondering. Maybe someone can tell me.

I thinking that someone should be working on a type of product and it doesn't exist yet, so I'm worried they're not.

Speech Synthesis on the iPhone. Where is it? Here's the deal. I realize the problem. Speech synthesis can be very intensive for the CPU, and the raw data takes some amount of space, etc, etc. But, here's the thing... I'd love to have just THIN client speech on the iPhone. For instance, I'd like my iPhone to have an app that let's me browse to the web URL, then shows me a line that is essentially "start speaking here" that I can move up and down the page, then I tap the "speech" button (like a "Play" button, with full "Pause" and "Stop" functionality.) --The trick? Just have the server do the processing, and send the data back as a low-bandwidth audio stream. We all know that "speaking" can be compressed the most compared to music. --Just have the server accept the raw text, process the speech and on-the-fly, shoot it back to me as an audiostream.

I mean, you take a Mac server and Apache and some Applescripting, and I'm assuming you're really not too far off from a solution. You could even do it with an IIS server and a nice custom ISAPI filter. Both platforms have easy to use text-to-speech facilities ready to be thrown to the net.

I dig iPhone apps like Jott, that do the reverse, and send a stream of NEW audio to their server, and process it into text. We're just talking about reversing that process here.... so I'm not sure why it would be that difficult (peak usage scenarios not included).

~ CB

subcritical71 · Jul 24, 2008

I just saw a news article on this yesterday...

http://blog.wired.com/gadgets/2008/07/att-developing.html

Cleverboy · Jul 24, 2008

jjaromin said:
I just saw a news article on this yesterday...
http://blog.wired.com/gadgets/2008/07/att-developing.html

I saw that too. Seems similar to what SpeechCloud appears to be doing with "voice dialing", but extending it. Good stuff. It's still going in the opposite direction though. As it stands, Jott, SpeechCloud, MiDoMi, and even AT&T *seem* to be going in the direction of taking in audio sending the result to a server and sending back "data" or "information". I'd love for someone to take data/information, and send/stream back audio to the client.

It'll be even better if the conversation it TWO-WAY. But, right now, we seem one-sided. Audio UP -> Information DOWN. I mean, imagine a fully realized TWO-WAY system. You could be talking to a 3D animated avatar that is doing all sorts of things in the background based on voice command, getting status and reporting back to you audibly. You'd say, "Read me the morning news". And it would configure a custom stream for you, synthesizing the morning news in your chosen voice. It's REALLY close right now. Closest thing to teleconferencing... but in many cases, even better. Imagine an avatar interface to MobileMe or Google.

Right now, Google 411 let's you call them up, it processes your voice... does a Google search... and reads the results back to you interactively. Right now, the iPhone is capable of this OVER the NET, without the use of dialing... yet, Google hasn't hooked it up like that yet. Maybe the next upgrade to the free Google will process audio and read back search results. That would be a great start.

This would do a LOT for making the iPhone a more accessible device for the blind. It just gets zany from there.

~ CB

firewood · Jul 24, 2008

Apple first introduced Speech Synthesis running on MacOS 1.0 way back in 1984. MacinTalk ran on a Mac 128k using only an 8 MHz 68000 CPU, and it was easy to comprehend. The iPhone has a 50X faster clocked ARM RISC CPU and 1000 times more RAM memory.

Cleverboy · Jul 24, 2008

firewood said:
Apple first introduced Speech Synthesis running on MacOS 1.0 way back in 1984. MacinTalk ran on a Mac 128k using only an 8 MHz 68000 CPU, and it was easy to comprehend. The iPhone has a 50X faster clocked ARM RISC CPU and 1000 times more RAM memory.

This is true. The strange thing is that we observers have a tendency to think that as long as the iPhone CPU is capable of doing something, that the only hurdle has been passed. When I say something is intensive for the CPU, I don't mean to imply that the CPU is incapable of handling it. I'm referring to the battery life. I'd love to see comparisons between say... streaming audio... and voice synthesis for CPU usage. Or better, battery usage measurements between accessing the cellular radio (3G/Wi-Fi/EDGE) and CPU usage for a task comparable to speech synthesis.

One problem with comparing OLD speech technologies to MODERN DAY speech technologies, is that while us nerds and geeks may be more than happy with our retro-tech speech systems, I think the world has moved on. Give the world Agnes instead of Alex, and I think the mockery and criticism with be swift and sharp.

I remember Talking Moose. My first "real" Mac was the Quadra 840av. I even remember reading when Apple had acquired the company whose technologies moved them to the next step. I know they have the technology. Apple is also mastered its execution and all of the many considerations. I think my suggestion is a sound one though. If the "SkyFire" web browser/service can pre-process every web page from its users, and refactor them for their phones... SOMEONE can do an text-to-speech "offloading" service for the iPhone that doesn't require new technologies on the phone. I mean, just tag a tone and an underwriting message to the end of each request over 20 seconds. Done!

~ CB

kdarling · Jul 24, 2008

There's so much that needs to be done.

Someone should do an interface like the (sexy) artificial personal secretary for WinMo devices:

Sapie secretary

Apple or Google needs to do a client app like WinMo has for MS Live Search. It's very neat to be able to click a button and say "Seafood restaurants near Nyack" and get back a list to call or map.

All of this is pretty easy to program, especially with backend helpers like TellMe Studios or the Watson project shown in that video, etc. Just a matter of how you plan to fund it.

senagbe · Aug 14, 2008

We are waiting on Apple keys to start a private beta of something in this space. Drop me an email at info@imkon.com if you would like to be in the public beta.
S

Cleverboy · Oct 27, 2008

Cleverboy said:
I dig iPhone apps like Jott, that do the reverse, and send a stream of NEW audio to their server, and process it into text. We're just talking about reversing that process here.... so I'm not sure why it would be that difficult (peak usage scenarios not included).

Ha Ha! Okay, so... apparently Google has DONE IT! Google has added a SPEECH capability to YouTube, that consists of sending TEXT to its webserver, and giving a REPLY that is STREAMING AUDIO of a speech-synthesized voice READING the text! Woo-hoo! Still seems like its being "tested" though. Someone's not quite sure how they want to roll with it I guess.

http://blag.xkcd.com/2008/10/08/youtube-audio-preview/

Wow. It seems someone at YouTUBE took this comic seriously and decided to add an “Audio Preview” feature. Now you can hear your comments read aloud to you.

By examining my headers, I was able to discover the actual request being sent from my browser. This is the link here.

What does this mean? Well, considering you can embed a webkit browser into ANY iPhone application, you could create an iPhone app called "Web Companion", and "add" bonus features, that let you literally browse to a page, and then PLAY the audio stream version of the page ON-THE-FLY.

One of my BIGGEST gripes on the iPhone, is that I couldn't have news articles READ to me. I would love to be able to "tag" a collection of news articles, and then have the phone "download" the audio streams for each article, and read them to me back-to-back.

Okay... I'm so clearly going to try to program this tonight.

Next up, Microsoft will combine THIS capability, with its "Tell Me" service, and allow fully interactive conversation with their existing open standard for "voice" applications. You mark my words. 😉 Apple needs to just give everyone a nice "speech" service API for Phone apps and make all of this easier.

~ CB

stewacide · Feb 6, 2009

I normally employ the OSX speech service when reading long articles on my Mac; it'd be even better if I could - for example - have the iPhone/iPod Touch read Mobile Safari articles to me.

On a side not, I really wish they'd get the service menu working with Firefox. I have to use Safari or Camino (for Services, Speech, Start Speaking Text), neither of which I like otherwise.

Does anyone know if there's a service out there that can do something like this using 'the cloud' a la the YouTube comment reader?

Diseal3 · Feb 6, 2009

I'm unsure of how blackberry handles their voicedialing but i have to say it is one hell of a feature. It works fast and was accurate almost every time (Excluding Loud background sound.) Wish they can "Borrow" the concept in developing one built into the firmware for reliable and fast voice dialing.

mkrishnan · Feb 6, 2009

Cleverboy said:
This is true. The strange thing is that we observers have a tendency to think that as long as the iPhone CPU is capable of doing something, that the only hurdle has been passed. When I say something is intensive for the CPU, I don't mean to imply that the CPU is incapable of handling it. I'm referring to the battery life.

I would tend to disagree on the same basis. If basically passable speech synthesis was possible on the 68000 (not only on Macs but also on Amigas and, I think, on the Atari ST as well) as well as other very basic platforms, it just doesn't take much computational horsepower. Not only does it not take computational horsepower, and it certainly isn't producing a lot of audio volume, either. I would be really shocked if synthesized speech at the quality level of the 1980's Amigas and Macs would even be as much power consumption as playing MP3s -- I'd guess it would be even slightly lower.

EDIT: As for speech recognition -- I do agree -- I really like the idea of the voice dialing that doesn't have to be taught that is offered on phones from LG and others....

Cleverboy · Feb 6, 2009

mkrishnan said:
I would tend to disagree on the same basis. If basically passable speech synthesis was possible on the 68000 (not only on Macs but also on Amigas and, I think, on the Atari ST as well) as well as other very basic platforms, it just doesn't take much computational horsepower. Not only does it not take computational horsepower, and it certainly isn't producing a lot of audio volume, either. I would be really shocked if synthesized speech at the quality level of the 1980's Amigas and Macs would even be as much power consumption as playing MP3s -- I'd guess it would be even slightly lower.

I think it goes back to what you'll settle for. I'd like BETTER voice synthesis, like "Alex", than to jump in the time machine simply to save CPU cycles. --But, maybe there IS not real trade-off. I just remember the base sampling data sizes being large for the modern speech synthesis engines compared to the older ones.

mkrishnan said:
EDIT: As for speech recognition -- I do agree -- I really like the idea of the voice dialing that doesn't have to be taught that is offered on phones from LG and others....

I look forward to Google doing more integration on that end with its GMAPS api. I believe Apple's answer is more processing power at lower power consumption. Hopefully that pans out.

~ CB

ppc750fx · Feb 7, 2009

I'm unsure of how blackberry handles their voicedialing but i have to say it is one hell of a feature. It works fast and was accurate almost every time (Excluding Loud background sound.) Wish they can "Borrow" the concept in developing one built into the firmware for reliable and fast voice dialing.

"Voice dialing" is a gimmick. The only reason people claim it's useful is that the address book interface for most phones is a POS. On any phone with a sanely designed contacts listing I've never found it to be faster than manual selection. Not once.

cks · Feb 24, 2009

Flite seems to work...

Flite (an Open Source speech synthesis engine) will work on the iPhone with a bit of fiddling: http://artofsystems.blogspot.com/2009/02/speech-synthesis-on-iphone-with-flite.html

There are also some commercial solutions. At some point Apple will presumably port over the built-in OS X speech generation framework, but until then Flite seems like a good compromise.

harry454 · Feb 24, 2009

its do able

Masquerade · Feb 25, 2009

i would love to see the ipod with speech synth! as option for sure, for random music - ie: saying the song title 2 seconds before start playing, so you can skip it through the headphones.

cellocello · Feb 25, 2009

kdarling said:
There's so much that needs to be done.

Someone should do an interface like the (sexy) artificial personal secretary for WinMo devices:

Sapie secretary

All of this is pretty easy to program, especially with backend helpers like TellMe Studios or the Watson project shown in that video, etc. Just a matter of how you plan to fund it.

Yikes. Rare lapse of judgment from our dear kdarling.

You really believe phones need something like this? Just like Windows needed 'Bob'? Or how Office needed 'clippy'? Or Windows search needed that dog?

"Apple or Google needs to do a client app like WinMo has for MS Live Search. It's very neat to be able to click a button and say "Seafood restaurants near Nyack" and get back a list to call or map."

What's wrong with Google Voice search on the iPhone?

Cleverboy · Feb 26, 2009

cellocello said:
Yikes. Rare lapse of judgment from our dear kdarling. You really believe phones need something like this? Just like Windows needed 'Bob'? Or how Office needed 'clippy'? Or Windows search needed that dog? What's wrong with Google Voice search on the iPhone?

I think it is YOU who are gravely mistaken in your notion that "Bob" or "Clippy" in ANY way represents the potential of an AI agent to carry out a collection of macros, whether its on a phone or a desktop system. The biggest usage, would of course be for accessibility, but a secondary usage would indeed be for non-handicapped users to launch powerful commands without a lot of typing or fiddling.

Indeed, Google Mobile's voice search IS the model, but it can be extended throughout an operating system... not simply used in one piece of software. The voice search in Google needs to appear in Google Maps as well... specifically because you would be able to "gang up" actions that separately might take time to execute. Like: "Directions from 123 Simple Lane, Boston, MA to 345 Elegant Ave, Cambridge"... would think, and give you results. Imagine if you could tap a "voice command" button from your iPhone home screen, and say "Google search cold remedies", "Google Maps, Find pizza 02472", or "Safari Go to www.apple.com". Imagine doing so through a bluetooth headset.

The notion of an avatar would be such that the system would communicate BACK to you what it was doing, without you needing to look at the screen. Moreover, it could ask you questions to clarify a command. Microsoft avatars thus far have been useless, annoying, and counter productive... inserting themselves into processes they were never invited to.

Well done intelligent agents would be MUCH more effective and hardly a "gimmick".

~ CB

cellocello · Feb 26, 2009

Cleverboy said:
Well done intelligent agents would be MUCH more effective and hardly a "gimmick".

Well, the day a "well done intelligent agent" comes out for my phone I'll change my mind. See you in 2021!

But more realistically. Simple apps will come out and acheive the functionality you're looking for. Text to speech GPS exists. Vocie recognition exists. This stuff is real, and will improve. I agree. But to expect an AI (like Cortana in Halo, or HAL in 2001, or KITT in Knight Rider, or Tony Stark's computer in Iron man, and so on and so forth) that isn't just a silly gimmick that gets in the way more than it helps, is, well, just silly pipe dream right now. IMO.

Maybe in 10 years, but I still doubt it. Turn on Speech Control on your Mac (which has more plenty more power than an iPhone). Try using that as a 'secretary' you can talk to. See how well that works out for ya.

Cleverboy · Feb 27, 2009

cellocello said:
Well, the day a "well done intelligent agent" comes out for my phone I'll change my mind. See you in 2021!

"Well done" does not mean "complex" or "advanced", it simply means "well executed". It could be released today if someone wanted. Time has nothing to do with it. We could wait 100 years, and people still might get it wrong.

cellocello said:
I agree. But to expect an AI (like Cortana in Halo, or HAL in 2001, or KITT in Knight Rider, or Tony Stark's computer in Iron man, and so on and so forth) that isn't just a silly gimmick that gets in the way more than it helps, is, well, just silly pipe dream right now. IMO.

The reason we're waiting so long is precisely because of preconceived notions like you've given examples of. I would not want Kitt OR HAL. I just want the proper auditory cues without having the agent degenerate into "gimmick" by trying to be a "KITT" or a "HAL". A Kitt or "HAL" would be something to shoot for AFTER the laconic, minimum verbosity aspect of this technology is released to the consumer sphere.

cellocello said:
Maybe in 10 years, but I still doubt it. Turn on Speech Control on your Mac (which has more plenty more power than an iPhone). Try using that as a 'secretary' you can talk to. See how well that works out for ya.

If you follow Apple, you know Apple does things often in terms of "we saw an opportunity". I think desktop computers are precisely the WRONG venue to perfect this type of technology. Unless your handicapped, using an intelligent agent through voice commands is mostly LAZINESS. It's MUCH easier to use a handheld gadget like an Apple remote or an iPod Touch. The iPhone on the otherhand (and smart phones in general), are communication devices. The natural interface IS voice. That's why Google's voice commands are so nice to use (I use them regularly). Would I use Google Voice on the desktop computer, if implemented? --Probably NOT. Ergo, the iPhone is an excellent platform to explore this on, whereas the Mac is not. Voice commands will go further on the desktop when it proceeds further into controlling more "real world" things. Right now, it is FASTER to do almost anything on your desktop by clicking, tapping, or typing... than voice control is.

--On your phone however, the inputs are MUCH smaller, so "voice" is of much, much more assistance. Again, examples: getting directions, doing a search, calling a phone number, finding a location, etc.

If you've ever used Google 411... and liked it, you'd agree that a "hands-free" artificial directory assistance service is useful. As long as its reach doesn't exceed its grasp, this applies to many functions on the phone. I mean, think of it... combine iTunes music store with Shazam or Midomi and voice control, and you can imagine this exchange:

SCENARIO 1:
> [user clicks "command" button on home screen]
> User says: IDENTIFY MUSIC
> iPhone says: "Listening" (or plays an acknowledgement tone)
> iPhone says: "Finished"
> iPhone launches iTunes Music Store with song page up

SCENARIO 2:
> [user clicks "command" button on home screen]
> User says: SEARCH hairstyles wikipedia.org
> iPhone launches Safari with Google search keywords searched

SCENARIO 3:
> [user clicks "command" button on home screen]
> User says: LOCATE 123 Simple St., My City, My State
> iPhone launches Google Maps
> iPhone says: Did you mean, 123 Simple Ave?
> User says: Yes
> iPhone says: Found it. What would you like to do?
> User says: Directions to location from here.
> iPhone says: Done. You are now on Prospect St. You will need to take a right in 4.1 miles onto 35 Arsenal Boulevard. (displays list of directions)

This isn't "conversation". This isn't a turing test. It's basic common sense and anticipating common user responses and providing likely options... much like recent voice driven directory assistance solutions. The technology exists today. The ONLY thing I'm thinking that gets in the way... are patents and vision.

~ CB

Cleverboy · Oct 4, 2011

SCENARIO 1:
> [user clicks "command" button on home screen]
> User says: IDENTIFY MUSIC
> iPhone says: "Listening" (or plays an acknowledgement tone)
> iPhone says: "Finished"
> iPhone launches iTunes Music Store with song page up

SCENARIO 2:
> [user clicks "command" button on home screen]
> User says: SEARCH hairstyles wikipedia.org
> iPhone launches Safari with Google search keywords searched

SCENARIO 3:
> [user clicks "command" button on home screen]
> User says: LOCATE 123 Simple St., My City, My State
> iPhone launches Google Maps
> iPhone says: Did you mean, 123 Simple Ave?
> User says: Yes
> iPhone says: Found it. What would you like to do?
> User says: Directions to location from here.
> iPhone says: Done. You are now on Prospect St. You will need to take a right in 4.1 miles onto 35 Arsenal Boulevard. (displays list of directions)

While "SCENARIO 1" only applies to music playing on the device (no automatic integration with Shazam or Soundhound for external music tagging), I think with the iPhone 4S, Apple has finally accomplished the conversational capability I've wanted for so long.

http://www.engadget.com/2011/10/04/iphone-4s-hands-on/

The most impressive part was the demo of Siri, the new assistant that lets you do just about anything you can do on your phone -- but with your voice. We tried to psych it out with a bunch of random requests, including the history of Chester, Vermont (a lovely town) and the best Ramen places in San Francisco. Siri never faltered, never missed a beat. It worked as well as Scott's demo up on the stage. There's nothing better to say than that. We even sent ourselves a few text messages, which Siri transcribed to a T. Of course, the lady on the other end still sounds eerily robotic, but we're hoping for smoother responses from the alien within in a future update.

Absolutely awesome. Bravo, Apple. Bravo.

~ CB

mkrishnan · Oct 4, 2011

Cleverboy said:
Absolutely awesome. Bravo, Apple. Bravo.

Yeah, we'll all have to wait to be able to play with it, but it looks like they've gotten a very, very good start on your list. 🙂

TELL ME WHY - Speech Synthesis on iPhone OS X

macrumors 65816

macrumors member

macrumors 65816

macrumors G3

macrumors 65816

Contributor

macrumors newbie

macrumors 65816

macrumors regular

macrumors 65816

Moderator emeritus

macrumors 65816

macrumors 65816

macrumors newbie

macrumors 6502

macrumors 6502a

macrumors 68000

macrumors 65816

macrumors 68000

macrumors 65816

macrumors 65816

Moderator emeritus

Our Staff