Plus, speech interaction often pales in comparison to direct manipulation. How do you crop a photo through a speech interface? "A little wider. A little more. No, too wide. Oh, bugger." It's much easier just to reach out and grab the dang thing. So if you can only have one, choose touch. (Not that we necessarily can only have one, but right now we basically have neither, so we have to pick which one to develop first. The one that's more beneficial wins.)
<snip>
My point is simply that we don't have to go back very far in time to see that "this sucks because it's different" isn't a good way to judge how things will evolve in the future. We went from goose quills and ink pots to typewriters to desktop computers to laptop computers, and it's not at all easy to predict where we'll go from here. It's fine to look at a given interaction model and be skeptical, but it's kind of foolish to dismiss anything out of hand. The future has this nasty habit of surprising us.
I notice how you go out of your way to emphasize the verbal without taking into consideration the gestural in context. While I would agree that telling the computer, "put that thing there," is essentially impractical when taken in the verbal-only context, adding the gestural as in "*touch* put this, *touch2* there," instantly tells the computer the object and the destination with the verbal telling it what to do with the object.
On the other hand, as you say, trying to tell the computer verbally how to edit a photo can be either simple or complex, depending on the context. It would be quite simple to touch-select an image and say, "Increase brightness by one f-stop; increase contrast by 20%; adjust white balance to Tungsten; use unsharp mask; export to email." In a matter of moments you have taken an image, edited it for clarity and told the computer to prepare an email with that image attached. Using a mouse alone or mouse and keyboard would typically require dozens of clicks, moving the mouse to find the pointer and hit the proper menu items and adjustment settings. Conceivably, the image could have been imported, adjusted and emailed out using only one touch and a series of spoken commands in less than a minute, vs. the 5 minutes or more by mouse and keyboard alone.
People need to get away from the idea that interaction has to be one thing or another--it should be as natural as talking to your partner or your neighbor. In fact, by combining all of the above (speech, gestures, text and even motion sensing, such as what Nintendo has for the Wii and Microsoft is adopting for the Xbox), computing can be much faster and easier than anything we have today.
However, that concept is still in our future. Were it not for people like Steve Jobs and the many different science fiction writers who can truly imagine how certain technologies can be used, we'd probably still be stuck using the old, heavy, cast-iron typewriters and relying on teletype and messengers for communications.