I was out of the loop for that time. The short pipelines should so much more of a benefit. HT on the P4 showed some decent improvements once they tuned the OS schedulers to grok the difference between physical and virtual processors. It wasn't so much a throughput thing as a responsiveness thing -- if you had a CPU-bound task permanently pegging the CPU, the system didn't bog down as much. I recall some benchmarks of MP3 encoding at the same time as gaming showing a decent speedup, too...
I wasn't sure this would be as effective in a dual-core world, purely because instead of a virtual core taking up the slack on unused resources for responsiveness, you've got a real core doing it... so, if the actual throughput change is negligible and the responsiveness isn't really improved...? I guess it could help if you now have two processes/threads that peg both CPUs, but it seems that unless it improves throughput, it's a diminishing-returns thing...
Of course, HT was there to mask the huge latencies introduced by that massive 30+ stage pipeline, which Merom (and presumably Penryn?) isn't lumbered with...
Northwood was surprisingly capable in retrospect. Prescott was a dead end with the heat generation issues.