Rincewind42 said: How do you come to this conclusion? For the most part, the 970, like most other CPUs, executes simple integer instructions in one cycle. Where you get a two-cycle latency is if the instructions are dependent, but the compiler or software writer can usually produce code that avoids the latency (or at least masks it). Obviously, code that was optimized for non-970 PowerPCs may have code sequences that expose this issue, but code optimized specifically for the 970 should be fine.
After reading this from Ars Technica (Bad Andy):
me (I have inserted some small clarifying parenthetical remarks):
That 2-cycle {simple integer latency} comes from the "extra cycle of cross-over latency" imposed throughout the design, due to the duplicated register-file/execution unit systems.
IBM has clearly stated (somewhere, I can't find it) that the fundamental simple-integer pipeline latency is one (and by god at these frequencies it would be really, really embarrassing if the simple integer pipeline latency WEREN'T one). The extra cycle is the cost of communicating the pipeline result to the OTHER register-file/execution system in the pair. IBM keeps the latency fixed at this worst case (whether the dependent instruction is in the same unit's issue queue or not) as a matter of issue-logic simplicity (it also avoids what otherwise would be another compiler optimization). IBM stated publicly that they studied the cost of allowing dependent instructions in the same queue to issue on the next cycle, and decided "it wasn't worth it."
Of course "wasn't worth it" depends on who is doing the looking and what their metric is. It is really goddam clear how badly that 2-cycle latency impacts serialized small-integer code... worst-case the 970 becomes a 1/2 IPC processor. This just slays some types of common code (simple ill-optimized compilers and interpreters particularly)... and we see this in performance scores.
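To put a number on that worst case, here is a back-of-the-envelope sketch. It is only a toy model, not anything from IBM: the function name chain_ipc and the assumption that every instruction in the chain waits for the previous result are mine.

```python
def chain_ipc(n_instructions, latency):
    """IPC of a fully serialized chain: each instruction needs the
    result of the one before it (toy model, hypothetical helper)."""
    # Every instruction has to wait `latency` cycles for its input,
    # so the whole chain takes n_instructions * latency cycles.
    total_cycles = n_instructions * latency
    return n_instructions / total_cycles


print(chain_ipc(1000, 1))  # 1-cycle dependent latency
print(chain_ipc(1000, 2))  # 2-cycle dependent latency, as on the 970
```

With a 1-cycle dependent latency the chain sustains 1 IPC; with the 2-cycle latency it drops to 1/2 IPC no matter how many integer units are sitting idle.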
With POWER5 IBM uses SMT to further bury this (two serially-dependent threads running simultaneously at least manage 1 IPC ... but more fundamentally it is statistically far less probable to have two serially-dependent code sections working at the same time, and more generally the integer section is often a work-rate limiter for other poorly optimized codes (even FP!) ... so if one thread is serially-dependent there is a good chance the other thread can take advantage of the 1.5 integer IPC available!)
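The SMT point can be checked with a small cycle-by-cycle simulation. Again this is only a sketch under my own assumptions: smt_ipc is a made-up name, each thread is one fully serialized chain, and there are enough integer units for every ready thread to issue in the same cycle (two units for two threads).

```python
def smt_ipc(n_per_thread, latency, threads):
    """Combined IPC of `threads` serially-dependent chains sharing the
    core (toy model; assumes one free integer unit per ready thread)."""
    ready = [0] * threads            # cycle at which each thread may issue next
    remaining = [n_per_thread] * threads
    cycle = 0
    issued = 0
    while any(remaining):
        for t in range(threads):
            if remaining[t] and ready[t] <= cycle:
                remaining[t] -= 1
                ready[t] = cycle + latency   # dependent op must wait `latency`
                issued += 1
        cycle += 1
    # Count the cycles until the last result is actually available.
    total_cycles = cycle + latency - 1
    return issued / total_cycles


print(smt_ipc(1000, 2, 1))  # one serialized thread
print(smt_ipc(1000, 2, 2))  # two serialized threads under SMT
```

One serialized thread alone gets 1/2 IPC in this model, and two of them together get 1 IPC, because each thread issues in the cycles where the other would otherwise leave its unit waiting on the 2-cycle latency.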
And it would seem like you were right about the simple integer bit; looks like I remembered wrong.