I agree, at least partly.

For instance, when I heard that the 3G (128MB RAM) wasn't going to get the voice control of the 3GS (256MB RAM), my first thought was that they were reserving at least some of the extra memory for the voice control. That way, they wouldn't have to dynamically allocate the memory.

OTOH, my older WinMo phones with only 128MB RAM could recognize far more voice commands than iOS, so perhaps not that much memory was needed after all :)

(I wrote my first FFT spectrum analyzer for voice recognition in 6800 machine code around 1979 on a homebrew computer with about 4K RAM. So between that and the WinMo history, I'm not convinced that the above would be a great excuse, but it's a possible one depending on how they're using that memory.)

Regards.

By performance I mean speed and accuracy. I used to be a WinMo user myself. In my experience the voice recognition wasn't all that great. You had to use very strict strings of words, and even then it got it wrong most of the time.

I don't have any experience with voice recognition on a 1979 computer, but considering how flaky it was on 90s computers, I can only imagine it didn't work very well at all.
 
I'm more willing to accept the memory excuse. Speech recognition is now far more than simple spectrum analysis of discrete words, and far more context is needed. With continuous speech, even the word boundaries are often ambiguous, and even when the boundaries are clear, the initial signal analysis typically comes up with multiple candidates for each word. That creates a search space which has to be explored (and/or constrained as it is being built) using anything from simple measures like word proximity through to syntactic and, in some cases, even semantic analysis. This can take huge amounts of memory in the most sophisticated systems.

Also, as a few previous posters have already mentioned, there's speech recognition and then there's natural language understanding, and the latter is a whole different ballgame. Even correctly resolving pronoun references in the assistant will be a complicated task.
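To make that search-space point concrete, here's a toy sketch in Python (the candidate words and all scores are invented for illustration; real recognizers are vastly more elaborate). Each stretch of audio yields several candidate words, and the decoder scores paths through the resulting lattice, pruning to a beam so time and memory stay bounded:

```python
# Toy word-lattice decoding with beam search.
# Candidates, acoustic scores, and bigram scores are all invented.
lattice = [
    [("recognize", 0.7), ("wreck a nice", 0.6)],  # candidates for the first stretch
    [("speech", 0.8), ("beach", 0.5)],            # candidates for the second stretch
]

# Stand-in for a language model: how plausible is word B right after word A?
bigram = {
    ("recognize", "speech"): 0.9,
    ("recognize", "beach"): 0.2,
    ("wreck a nice", "beach"): 0.8,
    ("wreck a nice", "speech"): 0.1,
}

def decode(lattice, beam_width=3):
    beams = [([], 1.0)]  # partial hypotheses: (words so far, cumulative score)
    for candidates in lattice:
        expanded = []
        for words, score in beams:
            for word, acoustic in candidates:
                lm = bigram.get((words[-1], word), 0.05) if words else 1.0
                expanded.append((words + [word], score * acoustic * lm))
        # Pruning here is exactly the "constrained as it is being built" part.
        beams = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam_width]
    return beams

for words, score in decode(lattice):
    print(" ".join(words), round(score, 3))
```

Even in this two-slot toy the raw path count multiplies with every added stretch of audio; without pruning and higher-level constraints, memory and time blow up quickly.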

It is my understanding that Apple's aspirations for this project are far beyond simple speech recognition and that they aspire to some level of basic natural language understanding. Presumably the ultimate (and very possibly unattainable) objective is to make the system good enough to be just like talking to one's real-life executive assistant (and at that point the awkwardness that some people here have been talking about in being seen to speak to one's phone would presumably go away, since people talk to their real-life executive assistants on the phone all the time).

My concern (for Apple) is that this assistant has the potential to be another Newton handwriting recognition moment, with the first release of the technology falling too far short of the mark as far as real-world usability is concerned, putting users off and stalling future development of the project. Did this technology need to stay in the lab for another 5 or 10 years before being released to the public? Until we see more than controlled demos and video mockups we can't know, and I'd be more than happy to be pleasantly surprised.

- Julian
 
Thing is, it would have to be really REALLY accurate. It only has to screw up a few words to be effectively useless. Given that the current voice recognition is often waaay off, I'm pretty skeptical. But then again, I'd buy the new one just for the better processor/RAM/battery/camera, so this is just icing (if it works well).

Go download Dragon Dictation and the Siri app. Dragon Dictation is fantastically accurate, I have used it a good bit, and Siri is quite amazing as well. Combining the two would give you a very powerful tool for voice commands/control.
 
As usual, Apple doesn't really invent innovative technology, but they find a way to integrate all these great technologies into one seamless package that "just works" for the customer.
 
I'm more willing to accept the memory excuse.

I am willing, depending on how it's done. For example, is it all decoded locally on the device, or by sending a recording to a server farm? Local could easily take a few hundred MB to store all the Markov statistical paths.
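Back-of-the-envelope (every figure below is an invented round number, not anything Apple or Nuance has published), local model storage adds up quickly:

```python
# Rough sizing of on-device recognition models; all numbers are invented
# round figures for illustration only.
stored_ngrams = 20_000_000   # pruned n-gram entries in a language model
bytes_per_ngram = 12         # packed word IDs plus a quantized probability
lm_bytes = stored_ngrams * bytes_per_ngram

# Acoustic model: states x mixture components x feature dims x 4-byte floats.
acoustic_bytes = 2_000 * 40 * 32 * 4

print((lm_bytes + acoustic_bytes) / 1e6, "MB")  # ~250 MB, before any search workspace
```

And that's before the decoder allocates anything for the live search itself, which is why the local-versus-server question matters.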

It is my understanding that Apple's aspirations for this project are far beyond simple speech recognition and that they aspire to some level of basic natural language understanding.

Not sure what you mean? AI?

Today's possible interactions are scripted forms with data that needs to be filled in before taking the next step. Filling in can be done by voice or keyboard.

Most likely there will be a set of common action scenarios chosen after keywords are recognized. E.g. ordering takeout, or getting a reservation for a dinner or show or a flight. With enough preloaded scenes, and possibly storing our food/seating preferences for later use, it'll look like AI magic.
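A crude sketch of that kind of keyword-to-scenario dispatch (the scenario names, slots, and preference store are all invented for illustration, not anything Apple has described):

```python
# Toy keyword-driven scenario dispatch with slot filling.
SCENARIOS = {
    "reservation": {"keywords": {"book", "reserve", "table"},
                    "slots": ["restaurant", "time", "party_size"]},
    "takeout":     {"keywords": {"order", "takeout", "delivery"},
                    "slots": ["restaurant", "dishes"]},
}

# Remembered preferences that pre-fill slots, which is what would make
# the scripted form look like AI magic.
PREFERENCES = {"party_size": 2}

def dispatch(recognized_text):
    words = set(recognized_text.lower().split())
    for name, scenario in SCENARIOS.items():
        if words & scenario["keywords"]:
            slots = {s: PREFERENCES.get(s) for s in scenario["slots"]}
            return name, slots
    return None, {}

print(dispatch("Reserve a table for dinner tonight"))
# ('reservation', {'restaurant': None, 'time': None, 'party_size': 2})
```

The remaining empty slots would then be filled in by voice or keyboard, exactly like the scripted forms described above.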

Time to dig out the old Apple Knowledge Navigator video for those who've never seen it. Very much voice driven.
 
From what I've gathered over the years, it's processor intensive and has less to do with actual memory. This is why Android phones send the data over the air and have it processed there.

If you have ever used Dragon Dictation, you'll know that it uses a nice bit of CPU when you're using it heavily. And that's on a full-blown computer.

I don't remember what event this was, but one of the guys from Google talked about it when he demonstrated it over the air.
 
Dragon Dictation started becoming popular and useful on older PCs with less CPU horsepower than the A5 will bring to the table. However, the algorithms did eat a pile of memory and slowed the PC's responsiveness down to a crawl. More RAM and the dual-core A5 could help solve those problems.

Enough local performance could possibly improve on the added latency of a network round trip to a more powerful server farm in the cloud.
 
I think as long as you have an internet connection (3G or Wi-Fi), the Assistant will delegate to iCloud for the raw processing of speech recognition. If you lose connectivity, then does it really matter?
 
Enough local performance could possibly improve on the added latency of a network round trip to a more powerful server farm in the cloud.

*shakes head* The process is still too much to implement on the actual phone itself. Which is my point. Which is why it isn't down to having "more memory", which is what was stated up above, or implied at least.

You could have just agreed with the point I was making instead of trying to pick it apart and accomplishing nothing.
 
Not sure what you mean? AI?

...

Time to dig out the old Apple Knowledge Navigator video for those who've never seen it. Very much voice driven.

Yes, I did mean AI, although these terms (AI, Natural Language Understanding) are both pretty vague. The video on the old Siri home page (http://siri.com/about/) shows some very basic capabilities that I would say use some AI/NLU techniques, e.g. using knowledge of the recent conversation to influence search parameters such as, when asked to search for a restaurant, assuming that it is for a meal immediately after the just-discussed movie. There is also some fairly robust parsing, e.g. "Take me drunk home" (or something like that) getting taxi options. OK, it's not the NLU that most researchers are still chasing, but it's way beyond the "What is the <X>", where <X> must be one of "time", "date", "day" or "temperature", type of constrained and rigid dialogs of older first-attempt voice control stuff.
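The restaurant-after-the-movie trick can be sketched in a few lines (all names and fields here are invented; this is just the general shape of the idea, not Siri's internals): earlier turns deposit context that later, underspecified requests inherit as defaults.

```python
# Toy dialog context: earlier turns leave defaults for later requests.
context = {}

def discuss_movie(title, showtime, neighborhood):
    context.update({"time": showtime, "near": neighborhood})
    return f"{title} is showing at {showtime} in {neighborhood}."

def find_restaurant(cuisine, time=None, near=None):
    # Unspecified parameters fall back to what was just talked about.
    time = time or context.get("time")
    near = near or context.get("near")
    return f"Searching for {cuisine} restaurants near {near}, after {time}."

print(discuss_movie("Vertigo", "7pm", "North Beach"))
print(find_restaurant("Italian"))  # inherits 7pm / North Beach from the movie turn
```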

Regarding memory usage, I tentatively withdraw my previous suspicions on that. The Siri video says explicitly that the voice input is uploaded to servers to be turned into text. There's no guarantee that Apple will keep it this way, but given that the action resulting from the voice request will almost certainly involve internet searches, relying on a network connection doesn't seem an inappropriate constraint.

That Knowledge Navigator video is interesting. I hope that dream is still alive within Apple because the assistant in that imagined scenario is most definitely displaying very substantial AI/NLU capabilities. Sadly with Apple's secrecy we don't see what's going on. Microsoft and IBM tend to say quite a lot about their research projects (e.g. IBM's Watson). I assume that a company the size of Apple has a genuine blue-sky research department but maybe I'm wrong; I certainly never hear anything about it and have always assumed that that is down to Apple's culture of secrecy.

- Julian

----------

*shakes head* the process is still too much to implement on the actual phone itself. Which is my point. Which is why it isn't down to having "more memory", which is what was stated up above, or implied at least.

As I just posted in a reply to kdarling, the original Siri promo video here (http://siri.com/about/) says explicitly that Siri "sends <the> words up to be interpreted as text", which seems pretty clear evidence that voice files are uploaded to back-end servers for processing, and I'd be very surprised if Apple hasn't kept it this way.

- Julian
 
As I just posted in a reply to kdarling, the original Siri promo video here (http://siri.com/about/) says explicitly that Siri "sends <the> words up to be interpreted as text", which seems pretty clear evidence that voice files are uploaded to back-end servers for processing, and I'd be very surprised if Apple hasn't kept it this way.

And as I posted in a reply to firewood, THAT IS WHAT I WAS SAYING.

And again, *shakes head*.

Jesus Christ.
 
Great new feature, but what will be funny is riding along watching a million or so people talking to themselves :)

Apple has done what Microsoft has talked about for years, and hopefully this works better.
 
Of course Android had it first ;) But Apple will probably make it better. Which will only give Google the gentle push it needs :) LOVE competition.
 
Yep, Android does it first, Apple does it right, Android will flood the market with phones that try (poorly) to emulate Apple's implementation, and several years later Android apologists will say what Apple did would have happened anyway. Never-ending cycle.
 
Doing it right is subjective... I can do pretty much everything I want to do with Android's voice recognition. It is amazingly accurate and very easy to use. But Apple will do it differently, and they will improve on what Android already does. Then Android will counter.

Android fans will feel it is better, Apple fans will disagree, and they will come to a forum like this to argue like children over what is a largely subjective feature.

Ahhh the joys of the internet.
 
... Local could easily take a few hundred MB to store all the Markov statistical paths.

Will the idle memory utilization of the feature be rather small? In other words, will those who do not use it still pretty much get the benefit of 1GB of RAM over 512MB for multitasking, compared to the iPhone 4?
 
And as I posted in a reply to firewood, THAT IS WHAT I WAS SAYING.

Blimey. All I was trying to do was support what you were saying and reinforce your point by supplying a link to a very explicit statement from the creator of the technology showing that you are almost certainly correct. I guess I should learn not to help people unless I'm invited to. I didn't mean to annoy you; I was trying to help.

- Julian
 
Have they solved the problem of feeling like a complete idiot when you use these sorts of interfaces?

Give it time and it will be second nature. I remember not too many years ago people were self-conscious about making a phone call on their mobile in public.

I kid you not, kids :)
 
For the $80 billion MSFT spent on R&D over the last decade, they couldn't come up with something as good as this? I guess their PhDs are not very bright.
 
A lot of blah, blah, blah over voice recognition. It doesn't take much to get the people in AppleLand all lathered up.

World Changing... must be quite boring in the walled garden if this is all it takes to excite the herd. :rolleyes:
 
You sound awfully jealous that your junky Android phone won't have a virtual AI.
 
Yes, in Android land, all it takes to get excited is news of a new iPhone, apparently.
 