http://realworldtech.com/page.cfm?ArticleID=RWT040208182719 has all the info you want and more. Nehalem makes quite significant changes to the cache structure, instruction latencies, memory system, and other parts of the chip, so it's understandably significantly faster.
As has been mentioned before, there's very little there (or in most chips) that an operating system can be "tuned" to take advantage of. Hyperthreading-aware scheduling and the few cases where SSE4.2 is relevant are the only ones I can think of. Mostly it just runs the same code faster.
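To make "hyperthreading-aware scheduling" a bit more concrete, here is a minimal sketch of the idea: fill every physical core with one thread before doubling up on any of them. The function name `ht_aware_order` and the topology dictionary are made up for illustration; a real scheduler gets the sibling information from the hardware and weighs many other factors.

```python
from collections import defaultdict

def ht_aware_order(cpu_to_core):
    """Return logical CPUs in the order a simple HT-aware placer would fill them.

    cpu_to_core maps a logical CPU id to its physical core id (hypothetical input).
    Every core receives one thread before any core receives a second one.
    """
    siblings = defaultdict(list)
    for cpu in sorted(cpu_to_core):
        siblings[cpu_to_core[cpu]].append(cpu)

    order = []
    rank = 0
    while True:
        # Take the rank-th hardware thread of every core that has one.
        picked = [cpus[rank] for _, cpus in sorted(siblings.items()) if rank < len(cpus)]
        if not picked:
            break
        order.extend(picked)
        rank += 1
    return order

# Assumed topology: 4 cores, 2 threads each, logical CPUs 2n and 2n+1 share core n.
topology = {cpu: cpu // 2 for cpu in range(8)}
print(ht_aware_order(topology))  # [0, 2, 4, 6, 1, 3, 5, 7]
```

With four runnable threads this placement uses CPUs 0, 2, 4 and 6, so no two threads compete for the same core's execution units.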
The pipelines are significantly different (based on your link). Compilers could easily be tuned to take advantage of larger speculative buffers (e.g., the re-order buffer, the retirement register file, etc.). I'm a former AMD (K8) CPU designer, so I don't know the ins and outs of the microarchitecture of the various Intel chips, but typically what you'd find from our stuff is that from uArch to uArch we'd change:
- the number of things that, at peak, could be done simultaneously. For example, we might have three ALUs in one chip, and three-and-a-half in the next. Or we might increase the number of registers, the depth of re-order buffers, etc.
- the number of cycles that it takes to do something. This is simplest to understand in terms of math functions - in an earlier chip it might take 6 cycles to do a 32-bit integer multiply, and in the next we might do a 64-bit integer multiply in 4 cycles. Another place this often showed up is in memory latency, cache miss penalties, etc.
- the quality of speculation. Branch prediction algorithms would change, cache algorithms would be tweaked, etc., to try and increase the likelihood that data would be where we wanted it, and decrease the likelihood of having to pay a big penalty. Trace caches, etc. also fall into this category. Somewhat related is eliminating the cost of bad speculation. HT is one example.
- the amount of time it takes to do one cycle, i.e., clock speed.
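As a rough illustration of how those four knobs combine, here is a back-of-the-envelope model built on the usual relation: run time = instructions x cycles per instruction / clock rate. Everything in it is an assumption for the sake of the example: the `Core` class, its field names, and all the numbers are invented, and real cores overlap stalls with useful work, so don't read the output as a prediction for any actual chip.

```python
from dataclasses import dataclass

@dataclass
class Core:
    issue_width: float       # peak instructions per cycle (how much runs at once)
    stall_penalty: float     # cycles lost per stall event (miss or mispredict)
    stalls_per_instr: float  # how often speculation or caching fails
    clock_hz: float          # cycles per second

def run_time_seconds(core, instructions):
    """Toy estimate: peak-width CPI plus a flat stall term, divided by clock rate."""
    base_cpi = 1.0 / core.issue_width
    stall_cpi = core.stalls_per_instr * core.stall_penalty
    cycles = instructions * (base_cpi + stall_cpi)
    return cycles / core.clock_hz

# Two hypothetical generations; none of these figures describe a real chip.
old = Core(issue_width=2, stall_penalty=50, stalls_per_instr=0.01, clock_hz=2.0e9)
new = Core(issue_width=4, stall_penalty=40, stalls_per_instr=0.005, clock_hz=2.5e9)

work = 1_000_000_000  # instructions in the assumed workload
print(f"old: {run_time_seconds(old, work):.3f} s")
print(f"new: {run_time_seconds(new, work):.3f} s")
```

Changing one field at a time shows how much each knob contributes on its own; in this toy setup the stall term dominates once the issue width gets large, which is why so much design effort goes into speculation quality.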
Nehalem seems to have changed all of these things, and thus I would expect it is quite possible to tweak an OS to take advantage of it. A recompile alone would probably have a 10% effect. (CPU architects always say everything has a 10% effect.)
