Intel was using 2MB/core of cache with the C2D FSB architecture to compensate for the constrained memory bandwidth available on that platform. Such an enormous cache gave the architecture much better performance than it deserved, giving it longevity well past its best-before date. Now, with the move to an on-die memory controller (and thus the elimination of a serious bottleneck), they still continue to use 2MB/core of cache, which is really unnecessary. To use a water-reservoir analogy, it's like enlarging the inlet to the reservoir but keeping the reservoir so large that the extra inflow is almost irrelevant. Intel could easily move to 1.5MB/core or even 1MB/core without impacting data access, and spend that massive number of transistors on higher-level logic (or reduce die size and TDP, perhaps enabling higher clocks).
I understand what you're getting at. But as you noticed, it did give the architecture additional lifespan (carried over from year to year), which saved them the development costs of totally new designs.
To me, this is part of the reasoning for keeping the cache sizes the same. But it also helps make up for the additional latency of DDR3. Either way, you keep the core running rather than waiting for data, which is critical in high-performance server systems (some workstation use as well). At the moment the cache ratio may be a little high, but I'd think it will be utilized better in future parts (it's effectively baked in, since the controller design is already implemented).
Keep in mind that AMD's architectures have been using on-die memory controllers since the X2 days (5+ years?) with much smaller caches, and they've always been regarded as having superior memory performance for a given memory technology (DDR/DDR2/etc.).
Yes, they have. Intel carried their old FSB for quite some time, as they plan for the long term from what I understand. Just look at how long Core 2 survived, and it's not totally dead. The cores themselves are essentially the same in Nehalem; they only dumped the FSB and the related chipsets (the changes they did make were significant, though). And the LGA1156 parts retain DMI as well.
Intel's move to tri-channel DDR3 offers little real-world benefit over even dual-channel DDR2 on an FSB, because the over-the-top caches on their old processors already masked the bandwidth constraint. Yet they haven't reduced them at all with this new architecture.
There's a little software that can utilize it, but not much. Of what I have direct experience with, some simulation software can. I'd think VMs could as well, or SMP-based applications, more commonly. Essentially all of it is on the enterprise side. For consumer software (home users), though, no; it's overkill. But as enterprise gear trickles down into consumer parts, it's inevitable that such parts will gain the new tech. Then users need software to catch up. As usual.
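For a rough sense of the raw numbers behind the tri-channel vs. dual-channel comparison above, here's a back-of-the-envelope sketch of peak theoretical bandwidth only. The DDR2-800, DDR3-1066, and 1333 MT/s FSB figures are illustrative assumptions on my part, and real-world throughput is lower:

```python
# Peak theoretical bandwidth = channels x transfer rate (MT/s) x 8 bytes per transfer.
# Figures below are illustrative assumptions, not measured numbers.
def peak_bw_gb_s(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000  # GB/s

ddr2_dual = peak_bw_gb_s(2, 800)    # dual-channel DDR2-800  -> 12.8 GB/s
ddr3_tri  = peak_bw_gb_s(3, 1066)   # tri-channel DDR3-1066  -> ~25.6 GB/s
fsb_cap   = 1333 * 8 / 1000         # a 1333 MT/s FSB caps the old platform at ~10.7 GB/s
print(ddr2_dual, ddr3_tri, fsb_cap)
```

On paper the gap is large; the point being made above is that the big caches hide much of it in everyday software.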
They've built the architecture to carry them a while again, so there's plenty of headroom to push performance over successive parts (they never give us everything the first time around).

We'll see more cores per CPU and higher memory bandwidth in the parts to come, just as with the architectures released before it.
I don't see a reason to doubt the bandwidth figures from that article for the 5500. You'll find the same figures in other publications from IBM, Dell, and Sun that looked into memory optimization. At the same RAM frequency, the 5500 is 2.6 times faster than the 5400.
It's just not matching what I seem to recall it being capable of, so I'll have to do some looking to make sure I haven't gotten something crossed.
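As a quick sanity check on the 2.6x figure, peak numbers alone get close, assuming triple-channel DDR3-1333 on the 5500 and a 1600 MT/s FSB bottleneck on the 5400 (both of those are assumptions; measured results will differ):

```python
# Peak per-socket bandwidth, illustrative figures only.
xeon_5500 = 3 * 1333 * 8 / 1000   # triple-channel DDR3-1333 -> ~32 GB/s
xeon_5400 = 1600 * 8 / 1000       # 1600 MT/s FSB bottleneck -> 12.8 GB/s
print(xeon_5500 / xeon_5400)      # -> ~2.5x, so a measured 2.6x isn't implausible
```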
- Apple stole 20% of the bandwidth of my advanced CPU by locking the multiplier at 8 instead of 10.
- Apple is systematically forcing people towards octads and higher DIMM density due to the failure to provide 6 RAM slots/socket.
Apple definitely raked users over the coals here. There wasn't really any need, IMO, to fix the memory multiplier rather than use SPD as every other system does (EFI doesn't make SPD unavailable). So either it was a ROM capacity issue, or they intentionally crippled the system. They definitely made the DIMM slot situation intentional, as the slots could have been arranged differently to allow a couple more per CPU (6 per socket would have been sufficient). But that would have limited sales of the DP systems, which seems like the most likely reason for doing it.
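For what it's worth, the 20% figure checks out arithmetically, assuming the usual ~133 MHz Nehalem base clock (an assumption on my part):

```python
# Memory speed = base clock x memory multiplier; figures are assumptions for illustration.
bclk = 133.33               # MHz, Nehalem base clock (assumed)
locked   = 8  * bclk        # ~1066 MT/s with the fixed multiplier
unlocked = 10 * bclk        # ~1333 MT/s the DIMMs/controller could otherwise run
print(f"{1 - locked / unlocked:.0%}")  # -> 20%
```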