Short Pipeline is better. Long pipeline is better.


acj
Aug 2, 2003, 10:12 PM
How do you know what to believe, and when?

The G4's 4-stage, and later 7-stage, pipeline was always advertised as a strong point compared to the Pentium 4, but now the G5 has a 16-stage integer pipeline and a 21-stage FP pipeline, and up to 25 stages for the Velocity Engine.

Now they mention over 200 simultaneous instructions compared to the P4's 126, but they never mentioned this with the G4's measly 16. Obviously the P4 wasn't over 7 times faster than the G4.

It really seems that nothing really matters and if speed is all that's important to you, you need to compare computers side by side with what YOU do and see what's best for yourself.

iJon
Aug 2, 2003, 10:33 PM
Ah, come on, we all know this. The whole MHz myth was just supposed to be music to the ears of customers who wanted a computer for speed when Apple was getting their butts kicked by Intel and AMD.

iJon

Powerbook G5
Aug 2, 2003, 10:49 PM
Exactly, all you need to know is that the G5 is one higher than the P4, why would you want a mere 4 when you can have a 5? :D

bousozoku
Aug 2, 2003, 11:00 PM
When the 604 and the P2 went head-to-head, the 604 was better by quite a margin. When the P3 arrived, the margin narrowed for the 604e.

The G4 is just a poorly designed economy processor with an above average SIMD unit.

A short pipeline is better when the processor doesn't guess a branch correctly, because it has to unload everything and start over, and a short pipeline has less to throw away. If it guesses correctly, obviously a long pipeline is going to help because all the instructions/data are available and can keep the processor going at full pace.

simX
Aug 2, 2003, 11:27 PM
Originally posted by acj
How do you know what to believe, and when?

The G4's 4-stage, and later 7-stage, pipeline was always advertised as a strong point compared to the Pentium 4, but now the G5 has a 16-stage integer pipeline and a 21-stage FP pipeline, and up to 25 stages for the Velocity Engine.

Now they mention over 200 simultaneous instructions compared to the P4's 126, but they never mentioned this with the G4's measly 16. Obviously the P4 wasn't over 7 times faster than the G4.

It really seems that nothing really matters and if speed is all that's important to you, you need to compare computers side by side with what YOU do and see what's best for yourself.

It all really depends, and your last statement is pretty much right on target.

From my (very limited) understanding, longer pipelines let you have more instructions in flight at once, but they are a drawback when "bubbles" appear in the pipeline. "Bubbles" are what you get when a certain instruction requires the results of another instruction, so it has to wait for that other instruction to finish... so a long pipeline means that each bubble wastes more cycles.

But like you said, these drawbacks can be overcome by other design considerations, so it's best just to compare real-world performance in applications you use.
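
To make the "bubble" idea concrete, here is a rough Python sketch of a toy in-order pipeline that issues at most one instruction per cycle. Everything in it is an assumption made for illustration: the function name count_bubbles, the deps list, and result_latency (how many cycles after issue a result becomes usable, which tends to be larger on a deeper pipeline). It ignores forwarding tricks and out-of-order execution, so it overstates the problem on real chips.

```python
def count_bubbles(deps, result_latency):
    """Count stall cycles ("bubbles") in a toy in-order pipeline.

    deps[i] is the index of the earlier instruction that instruction i
    must wait for, or None if it is independent.
    result_latency is the (assumed) number of cycles between issuing an
    instruction and its result being usable by a dependent one.
    """
    issue = []  # cycle in which each instruction is issued
    for dep in deps:
        # Without any stall, issue one cycle after the previous instruction.
        earliest = issue[-1] + 1 if issue else 0
        if dep is not None:
            # Must also wait until the producer's result is ready.
            earliest = max(earliest, issue[dep] + result_latency)
        issue.append(earliest)
    if not issue:
        return 0
    ideal_last = len(deps) - 1  # last issue cycle if nothing ever stalled
    return issue[-1] - ideal_last


if __name__ == "__main__":
    # Instruction 1 needs the result of 0, and instruction 3 needs 2.
    program = [None, 0, None, 2]
    for latency in (1, 3):
        print(latency, count_bubbles(program, latency))
```

With the same four instructions, a result latency of 1 cycle gives no bubbles, while a latency of 3 cycles costs 2 stall cycles per dependency, which is the sense in which a longer pipeline makes each bubble hurt more.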

MrMacMan
Aug 2, 2003, 11:29 PM
Yes, this is why you can't properly calculate whether a shorter or longer pipeline is better.

Powerbook G5
Aug 2, 2003, 11:50 PM
I wonder how the branch prediction on the G5 will affect its longer pipeline. I know Steve and the IBM guy both said it predicts correctly a good 90% of the time or so, but for that 1 in 10, that error is going to be felt more on a longer pipeline, isn't it?
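
A back-of-the-envelope way to put numbers on that question, as a hedged Python sketch. The 90% accuracy figure and the 7-stage and 16-stage depths come from this thread; the 20% branch fraction, the base CPI of 1.0, and the idea that a wrong guess costs roughly one pipeline's worth of cycles are assumptions made here for illustration, not measurements of any real chip.

```python
def average_cpi(branch_fraction, accuracy, flush_penalty, base_cpi=1.0):
    """Average cycles per instruction once mispredictions are included.

    branch_fraction: share of instructions that are branches (assumed).
    accuracy:        fraction of branches predicted correctly.
    flush_penalty:   cycles lost per wrong guess; assumed here to be
                     about the pipeline depth.
    base_cpi:        cycles per instruction with perfect prediction.
    """
    mispredicts_per_instruction = branch_fraction * (1.0 - accuracy)
    return base_cpi + mispredicts_per_instruction * flush_penalty


if __name__ == "__main__":
    for depth in (7, 16):
        cpi = average_cpi(branch_fraction=0.2, accuracy=0.9, flush_penalty=depth)
        print(f"{depth}-stage pipeline: about {cpi:.2f} cycles per instruction")
```

Under these assumptions the 7-stage pipeline averages about 1.14 cycles per instruction and the 16-stage one about 1.32, so yes, the same 1-in-10 miss rate costs more on the longer pipeline, at least before clock speed is taken into account.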

Catfish_Man
Aug 2, 2003, 11:50 PM
Pipeline depth IS important, but it's not the final word in performance. When balanced out by a good cache, good out of order execution (oooe), and good branch prediction (like a P4 or G5), lengthening the pipeline is an effective way of increasing performance (by raising the clock frequency). In a processor like the G4, with little to no oooe (because embedded programs tend to be hand scheduled, and oooe increases power consumption), and only mediocre branch prediction (because of the short pipeline), lengthening the pipeline would probably have been a bad idea. It would have DEFINITELY been a bad idea for its target market (high end embedded), which is notoriously latency and power conscious. The G4 and G4+ served their intended purpose quite well, although the massively delayed transition to .13 micron, and the lack of an on chip memory controller are beginning to hamper them. The fact that they made pretty decent P3 killers was (mostly) just an added bonus.
The G5, on the other hand, seems squarely targeted at Xeons and Opterons (and to a lesser extent P4/AthlonXP/Athlon64), which is just about perfect for Apple's purposes. It's designed in a fairly similar way to them, in certain respects. It has a long pipeline, with extensive oooe, and large caches. This allows it to scale much higher than the G4, and makes it better suited to running poorly optimized code (which is what most code is). The downside is that it has significantly higher power consumption and manufacturing cost than a G4 made on the same manufacturing process.
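
The clock-frequency side of that tradeoff can be sketched with the usual run time = instructions x cycles per instruction / clock rate relation. This Python snippet is only an illustration: the function names are made up for this example, and the CPI and clock values are hypothetical round numbers, not figures for the G4, G5, or P4.

```python
def run_time_seconds(instructions, cpi, clock_hz):
    """Run time = instruction count * cycles per instruction / clock rate."""
    return instructions * cpi / clock_hz


def break_even_clock_ratio(cpi_short, cpi_long):
    """How much faster the long pipeline must clock just to tie the short one."""
    return cpi_long / cpi_short


if __name__ == "__main__":
    work = 1e9  # one billion instructions (hypothetical workload)

    # Hypothetical short pipeline: fewer wasted cycles, lower clock.
    short = run_time_seconds(work, cpi=1.14, clock_hz=1.0e9)
    # Hypothetical long pipeline: more wasted cycles, higher clock.
    long_ = run_time_seconds(work, cpi=1.32, clock_hz=2.0e9)

    print(f"short pipeline: {short:.2f} s, long pipeline: {long_:.2f} s")
    print(f"break-even clock ratio: {break_even_clock_ratio(1.14, 1.32):.2f}")
```

With these made-up numbers the longer pipeline still finishes first, because its clock advantage is larger than its extra cycles per instruction. If the extra depth did not buy at least the break-even clock ratio, the shorter pipeline would win instead.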

Fender2112
Aug 3, 2003, 10:48 AM
One analogy that stuck with me described the 970 like this: The P4 has a long and narrow pipeline. The G4 has a short and wide pipeline. By comparison, the G5 (970) has a long and wide pipeline. I don't know what this means in terms of stages or in-flight instructions. This description gave me a mental image that makes the G5 seem like the best of both designs.

acj
Aug 3, 2003, 01:40 PM
Fender:

I think that's fairly accurate. Time will tell. Most of us haven't actually used a G5.

Catfish_Man
Aug 3, 2003, 01:58 PM
Originally posted by Fender2112
One analogy that stuck with me described the 970 like this: The P4 has a long and narrow pipeline. The G4 has a short and wide pipeline. By comparison, the G5 (970) has a long and wide pipeline. I don't know what this means in terms of stages or in-flight instructions. This description gave me a mental image that makes the G5 seem like the best of both designs.

This is true, but a number of compromises elsewhere in the design were made to achieve this. Tracking the execution of 200+ instructions would be prohibitively difficult, so they divided them into groups of 5 and tracked 40 groups instead. This allows them to keep the complexity down to a manageable level, but adds a number of restrictions to how instructions can be dispatched and retired. Overall, I think this was a good tradeoff (2 integer, 2 floating point, 2 load store, and 4 vector, with a long pipeline, is very impressive), but it's going to be a bitch for the compiler writers.

Mav451
Aug 3, 2003, 02:06 PM
Hey Fender: so what kind of pipeline do the AthlonXP and Opteron have?

I'm just wondering since I went to that Ars Technica site and didn't understand a single word of what they said :(

Fender2112
Aug 3, 2003, 06:28 PM
Originally posted by Mav451
Hey Fender: so what kind of pipeline do the AthlonXP and Opteron have?

I'm just wondering since I went to that Ars Technica site and didn't understand a single word of what they said :(

Those are clogged pipelines. You want to borrow some of my Drano? :) Seriously though, I don't know. This was an analogy that helped explain the difference between PPC and x86. That Ars Technica article is a bit above my head, but I did follow the gist of it.