The original Siri video was still on the Siri site until this week, but Apple have now redirected siri.com to the Apple web site. That video had someone from Siri demonstrating the app and talking through what it was doing. He clearly said, within about the first 30 seconds of the video, that the voice data was uploaded to the servers for processing.
From the videos that I've seen, Siri is able to interpret quite carelessly spoken sentences, with the speaker making no attempt to talk slowly and clearly for the computer, and sometimes with a few "um"s and "err"s in there as well. Recognising genuine conversational speech is difficult, especially when no training is required first to optimise the recognition for a specific speaker's voice. I suspect that just the basic conversion of the speech into text needs more processing power and memory than is available on the iPhone, even the 4S, which is why they have to upload the voice data to the servers. I assume that they compress the voice data before uploading it in order to minimise network traffic, possibly with very specialised compression, since they know which specific features of the input will be needed for the subsequent analysis.
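To put a rough number on that last guess, here is a back-of-envelope sketch comparing raw audio with a per-frame feature stream of the kind speech recognisers commonly work from. Every figure in it (16 kHz 16-bit mono audio, 100 frames per second, 13 values per frame, 4 bytes per value) is an assumed, typical value chosen for illustration, and the function names are hypothetical; none of it comes from Apple or from the Siri video.

```python
# Back-of-envelope estimate only: all constants are assumed, typical values,
# not anything Apple or Siri has published.

def raw_bytes_per_second(sample_rate_hz=16_000, bytes_per_sample=2):
    """Data rate of uncompressed mono PCM audio."""
    return sample_rate_hz * bytes_per_sample


def feature_bytes_per_second(frames_per_second=100,
                             features_per_frame=13,
                             bytes_per_feature=4):
    """Data rate if only a small feature vector is sent for each frame."""
    return frames_per_second * features_per_frame * bytes_per_feature


raw = raw_bytes_per_second()
features = feature_bytes_per_second()
ratio = raw / features

print(f"raw audio: {raw} bytes/s")
print(f"features:  {features} bytes/s")
print(f"reduction: about {ratio:.1f}x")
```

With those assumed numbers, sending features instead of raw samples cuts the traffic by roughly a factor of six before any further compression is applied, which is the sort of saving that would make specialised compression worth the effort.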
- Julian