G5 with L3 cache...

mgargan1 · Jul 5, 2004

why do you think IBM hasn't added more cache to their 970? if they made it with more l2 cache, or even some l3.. man that would be fast. Or even apple could add some L3 cache to it, cause that's what they did to the G4... i'd love to see a g5 with 2mb of L3 cache... imagine how fast your porn would download then!!

Sun Baked · Jul 5, 2004

I'll stand by and let somebody else say that the L3 cache was added to the G4 to compensate for a problem the G5 doesn't have (ie, a shared and sucky/slow FSB.)

jackieonasses · Jul 5, 2004

mgargan1 said:
why do you think IBM hasn't added more cache to their 970? if they made it with more l2 cache, or even some l3.. man that would be fast. Or even apple could add some L3 cache to it, cause that's what they did to the G4... i'd love to see a g5 with 2mb of L3 cache... imagine how fast your porn would download then!!

ya the g4 had like 167 mhz fsb and that is a bottleneck if i ever heard one!

JFreak · Jul 5, 2004

the bottleneck is always somewhere, but with the G5 powermac it is not the bus speed. adding L3 cache would be waste of money.

naturally, if apple has to implement a slower bus for the powerbooks to keep the heat down, the situation will not be the same. they may very well have to - for example 1.6GHz cpu and a 400MHz bus - but let's hope for the best. i'd like to be able to buy such a powerbook within a year that had a bus as fast as my current G4 powerbook's cpu (1.25ghz). man, that would rock

of course, that is overly optimistic and will not happen.

topicolo · Jul 5, 2004

mgargan1 said:
why do you think IBM hasn't added more cache to their 970? if they made it with more l2 cache, or even some l3.. man that would be fast. Or even apple could add some L3 cache to it, cause that's what they did to the G4... i'd love to see a g5 with 2mb of L3 cache... imagine how fast your porn would download then!!

Well, if the Athlon 64 is any indication, increasing the L2 cache from 512kb to 1MB nets only around a 3% performance boost at the same clockspeed. Not really worthwhile, considering the die size of the chip will increase noticeably (higher costs).

mgargan1 · Jul 6, 2004

yea, i guess you guys are right... maybe i'm not always right... as i like to think i am.

invaLPsion · Jul 6, 2004

mgargan1 said:
yea, i guess you guys are right... maybe i'm not always right... as i like to think i am.

Yeaaahhh.....

I don't think your porn will download any faster with more L2 or L3 cache...

mgargan1 · Jul 6, 2004

invaLPsion said:
Yeaaahhh.....

I don't think your porn will download any faster with more L2 or L3 cache...

yea, i know it wont... it was a joke...

invaLPsion · Jul 6, 2004

mgargan1 said:
yea, i know it wont... it was a joke...

A bad one.

ddtlm · Jul 6, 2004

Sun Baked:

I'll stand by and let somebody else say that the L3 cache was added to the G4 to compensate for a problem the G5 doesn't have (ie, a shared and sucky/slow FSB.)

It drives me nuts when people say this. Look at the 2.5ghz G5's scaling vs a 2.0ghz G5 and tell me that it doesn't have a problem. Apple's best PR benchmark shows a 14% performance gain on a 25% clock speed and FSB speed boost. (Of course that front-page Cinebench confused the situation by running different code on machines with different video cards. Not a meaningful test!)

I'm not kidding, go look at Apple's PR tests and run the numbers. They do a decent job of making it non-obvious by making everything a %-age of P4 performance, but the numbers are right there. I think "Bibble" is their best one, where a 2.5ghz G5 is charted at 150% over a P4 and a 2.0ghz G5 at 119% over. Look like a 31% speedup? Nope, the baseline is 100%, so it's 250% P4-speed vs 219% P4-speed for a 14% speedup.
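The arithmetic in that Bibble example can be checked in a few lines. This is only a sketch: the 150 and 119 figures are the ones quoted in the post, read as percentage gains over a P4 baseline of 100, and the function name is made up for illustration.

```python
def relative_speedup(gain_fast_pct, gain_slow_pct, baseline_pct=100.0):
    """Speedup of the fast machine over the slow one, when both are
    reported as percentage gains over a common baseline."""
    fast = baseline_pct + gain_fast_pct   # 250% of P4 speed
    slow = baseline_pct + gain_slow_pct   # 219% of P4 speed
    return fast / slow - 1.0

speedup = relative_speedup(150, 119)      # figures quoted in the post
clock_gain = 2.5 / 2.0 - 1.0              # 2.0ghz -> 2.5ghz

print(f"speedup: {speedup:.1%}, clock gain: {clock_gain:.0%}")
```

250/219 works out to roughly a 14% gain against a 25% clock and FSB bump, which is the gap the post is complaining about.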

http://www.apple.com/powermac/performance/

So yeah, that 1.25ghz FSB is not going to save the day. Latency is important too.

JFreak:

the bottleneck is always somewhere, but with the G5 powermac it is not the bus speed. adding L3 cache would be waste of money.

I'm sure you've heard of a P4EE. Those 2MB added a lot, in some tests, and the G5 has a worse problem with memory latency. That fancy shmancy async (vs RAM), packet based FSB is not good for latency when compared to a P4's synced FSB and RAM, or an A64's on-die memory controllers.

Actually, I'd go so far as to say Apple really can't design a decent system controller. I have yet to see the DDR G4 chipset show a performance edge despite the 166mhz FSB, quite an embarrassment for Apple, IMHO. The G5 chipset appears to be causing the G5 to experience poor performance gains as it clocks up. If it gains 14% or less going from 2.0ghz to 2.5ghz, then at 3.0ghz it's going to have less than a 25% edge over 2.0ghz. Yuk.

topicolo:

Well, if the Athlon 64 is any indication, increasing the L2 cache from 512kb to 1MB nets only around a 3% performance boost at the same clockspeed. Not really worthwhile, considering the die size of the chip will increase noticeably (higher costs).

The Athlon has far lower latency main memory compared to the G5.

mgargan1:

yea, i guess you guys are right... maybe i'm not always right... as i like to think i am.

L3 would help the G5 more than it helped the P4. It might cost a bit, but the G5 is so tiny its a real shame they didn't put more than 512k of cache on it.

keysersoze · Jul 6, 2004

JFreak said:
the bottleneck is always somewhere, but with the G5 powermac it is not the bus speed.

Just curious, but where IS the bottleneck on a G5?

ddtlm · Jul 6, 2004

keysersoze:

Just curious, but where IS the bottleneck on a G5?

Almost certainly in main memory latency as viewed from the CPUs. Kinda funny to imagine that the bottleneck is not due to there being a flow restriction... it's due to the round-trip time.
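A rough way to see why round-trip time can matter more than flow rate: model one dependent memory request as a fixed latency plus a transfer time. The sketch below does that. The cache-line size, latency, and bandwidth figures are illustrative assumptions, not measured G5 numbers, and the function name is hypothetical.

```python
def fetch_time_ns(bytes_requested, latency_ns, bandwidth_gb_per_s):
    """Time for one dependent memory request: fixed round trip plus transfer."""
    transfer_ns = bytes_requested / bandwidth_gb_per_s  # 1 GB/s == 1 byte/ns
    return latency_ns + transfer_ns

LINE_BYTES = 128        # assumed cache-line size
LATENCY_NS = 130.0      # assumed round-trip latency, illustrative only

slow_bus = fetch_time_ns(LINE_BYTES, LATENCY_NS, 3.2)   # assumed bandwidth
fast_bus = fetch_time_ns(LINE_BYTES, LATENCY_NS, 6.4)   # double the bandwidth

print(f"{slow_bus:.0f} ns vs {fast_bus:.0f} ns per line")
```

Under these assumed numbers, doubling the bandwidth trims only about 12% off each fetch, because the fixed round trip dominates the total.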

Mac_Max · Jul 6, 2004

I'm sure you've heard of a P4EE. Those 2MB added a lot, in some tests, and the G5 has a worse problem with memory latency. That fancy shmancy async (vs RAM), packet based FSB is not good for latency when compared to a P4's synced FSB and RAM, or an A64's on-die memory controllers.

Actually the 2MB L3 really helped with the Pipeline misses. With a ~30 stage pipeline any sort of pipeline miss will cost a large amount of lost cycles (one stage = one cycle & one cycle takes one hz). Even with the optimizations they've done where they send more than one operation through the execution units (I'm sure everyone has seen the picture that illustrates this), if you mess up on the first stage you've just lost 30 cycles of computing there. If you have two misses it's 60 cycles, etc. If you have a computation that takes 120,000 cycles & you don't have a large cache to save all of this data you could lose all the work you've done at the 119,999th cycle & have to do it over again. The P4 based Celeron with a 128KB cache drags even compared to a 1.8GHz Applebred Duron because of those pipeline misses. If you can save the data that you had before the miss there isn't too much of a problem, but if you don't have the cache for it you're in trouble (unless of course the CPU has stored some of the data into memory, which is much slower than an L2 or L3 cache). The lesson hidden here is basically that the more stages you have in your pipeline, the more common problems like pipeline misses become, whereas in a CPU like the <10 stage PPC 750FX a pipeline miss is rather rare. Of course because of its short pipeline the 750FX doesn't clock nearly as fast as the G4, G5, Athlon, or P4.

topicolo · Jul 6, 2004

ddtlm said:
topicolo:

The Athlon has far lower latency main memory compared to the G5.

That's totally right. The on-chip memory controller has got to be one of the coolest ideas I've read about in terms of microprocessors. If the G5 had something similar, its performance would be scaling much more in sync with its frequency increases (like the new athlons).

Nevertheless, the speed gains from getting more L2 cache in a G5 would be much lower than the gains from getting more L2 cache in a G4.

ddtlm · Jul 6, 2004

topicolo:

Nevertheless, the speed gains from getting more L2 cache in a G5 would be much lower than the gains from getting more L2 cache in a G4.

Why on earth does everyone believe this? Wave "1.25ghz" in front of people's faces and it's clockspeed worship time. Latency is the hidden enemy of the G5, the real enemy. I know people have noticed that a G4 without L3 on a 166mhz FSB has similar clock-for-clock performance to a G5, in many things. Obviously in those cases that bandwidth a G5 has at its disposal is utterly wasted.

I say the G5 needs cache MORE than the G4 did.
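The claim that a high-latency system gains more from extra cache can be sketched with the standard average memory access time formula (hit time plus miss rate times miss penalty). Every number below is an assumption picked for illustration, including the idea that a bigger cache halves the miss rate; none of it is measured G4 or G5 data.

```python
def amat_ns(hit_ns, miss_rate, miss_penalty_ns):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_ns + miss_rate * miss_penalty_ns

def gain_from_bigger_cache(mem_latency_ns, hit_ns=5.0,
                           miss_before=0.04, miss_after=0.02):
    """Fractional drop in AMAT when a bigger cache halves the miss rate."""
    before = amat_ns(hit_ns, miss_before, mem_latency_ns)
    after = amat_ns(hit_ns, miss_after, mem_latency_ns)
    return 1.0 - after / before

low_latency = gain_from_bigger_cache(60.0)    # assumed fast memory path
high_latency = gain_from_bigger_cache(140.0)  # assumed slow memory path

print(f"low-latency system: {low_latency:.0%}, "
      f"high-latency system: {high_latency:.0%}")
```

With the same cache change, the machine with the longer trip to main memory sees the larger improvement, which is the shape of the argument above.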

Frohickey · Jul 6, 2004

You have the assembly line (execution units).
You have the assembly line workers and stockroom (pipeline and caches).
You have the trucks and orders coming in and out (system bus and instruction/data).

Where is the bottleneck? If you just increased the trucks by 25%, and you only got a 14% increase, I think you need to add more assembly lines.

I think that more assembly lines (execution units) are needed here, but doing so necessarily means a complete redesign of the chip, and you would need to recode your applications to take advantage of it.

Every processor is a tradeoff.

ddtlm · Jul 6, 2004

Mac_Max:

Actually the 2MB L3 really helped with the Pipeline misses.

Probably you meant to say "pipeline bubbles caused by cache misses", cause pipelines themselves don't have misses. I argue that the G5 fears bubbles as much as a P4 does though; the faster the chip is demanding data the more robust the memory system needs to be to feed the chip. I would argue that the G5 core is capable of consuming data faster than the P4 core.

It's true that if a P4 suffers an important cache miss it will have a long series of idle cycles while waiting for something to do, and it'll take a while to get results out of its pipe even once it gets the data it needed. In the case of a G5 the number of idle cycles probably won't be as long, nor will the delay after data is provided, but it'll have more width, so to speak, cause the G5 has more processing units that would sit around doing nothing. The G5 needs to be feeding more units than the P4 in order to compete with it, since it clocks lower. So in either case a lot of potential is wasted by a cache miss.

Note that I say a G5 would "probably" sit idle fewer cycles, because we don't actually know that for sure. I'm claiming that the G5's memory controller has a much slower turn-around time than good P4 controllers, so in the end the G5 could suffer as many or more idle cycles.

I'll agree that the P4's pipeline causes it to need cache more than an Athlon, and if the memory controller feeding a G5 was equal to the one feeding a P4, I would expect the P4 to like cache more than the G5. However in a case where the G5 has a poor memory controller, as I believe Apple's systems do, then the G5 needs that cache to avoid the main memory as much as it can.
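The "idle cycles times width" point can be put into rough numbers. This is a sketch only: the stall times, clock speeds, and issue widths are illustrative assumptions rather than measured values for either chip, and the function name is hypothetical.

```python
def lost_issue_slots(stall_ns, clock_ghz, issue_width):
    """Instruction slots that go unused while the core waits on memory."""
    stall_cycles = stall_ns * clock_ghz      # ns * cycles-per-ns
    return stall_cycles * issue_width

# Every number here is an illustrative assumption, not a measured value.
narrow_fast = lost_issue_slots(stall_ns=100.0, clock_ghz=3.4, issue_width=3)
wide_slow = lost_issue_slots(stall_ns=130.0, clock_ghz=2.5, issue_width=5)

print(f"narrow/fast core: {narrow_fast:.0f} slots, "
      f"wide/slow core: {wide_slow:.0f} slots")
```

Under these assumptions the wider, lower-clocked core idles for fewer cycles per miss but still throws away more issue slots, so a slow memory controller hurts it at least as much.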

topicolo:

The on-chip memory controller has got to be one of the coolest ideas I've read about in terms of microprocessors. If the G5 had something similar, its performance would be scaling much more in sync with its frequency increases (like the new athlons).

Heck, I'd even buy a G5 if it had on-die memory controllers.

Looking at the AXP -> A64 transition a G5 with on-die controllers would ROCK.

Frohickey:

Where is the bottleneck? If you just increased the trucks by 25%, and you only got a 14% increase, I think you need to add more assembly lines.

To add to your analogy, I'd argue that the problem is that they keep needing "random" parts at the factory that they have to drive across town to get. It would help to be able to get those parts next door.

mklos · Jul 6, 2004

keysersoze said:
Just curious, but where IS the bottleneck on a G5?

I'd say the major bottleneck in the G5 is the hard drive! The RAM goes at over 6 GB/sec while the hard drive (SATA) will only go at 150MB/sec. There isn't any real way to fix that though, as the speed of the Serial ATA (SATA) interface is 150MB/sec, if only the drive could spit out data that fast. Even a 10,000 RPM drive would still be the bottleneck in the G5.

We have all of this technology and nobody has figured out how to make an extremely fast hard drive.

As for the L3 Cache, well its not supported on the PPC970 processor for one thing. Another thing is that L3 Cache is very, very expensive and would only raise the cost of the PowerMac G5 for I'd say not very noticeable results! So to me, its not worth IBM's time and energy to try and put L3 Cache on the PPC970 or even the PPC975/980 processors.

ddtlm · Jul 6, 2004

mklos:

I'd say the major bottleneck in the G5 is the hard drive!

For a few things it is.

Another thing is that L3 Cache is very, very expensive and would only raise the cost of the PowerMac G5 for I'd say not very noticeable results! So to me, its not worth IBM's time and energy to try and put L3 Cache on the PPC970 or even the PPC975/980 processors.

What's all this talk about "try" and "expensive"? Direct your attention to Intel's P-M with 2MB of L2, IBM could do it easily. The 970fx is a small processor compared to any of its competition.

Edit:

Here's a 2MB 1.7ghz P-M for $300:

http://www.newegg.com/app/ViewProductDesc.asp?description=19-111-159&depa=1

And lets just say that Intel is making a nice margin on those.

Jo-Kun · Jul 7, 2004

keysersoze said:
Just curious, but where IS the bottleneck on a G5?

as mentioned here, of course 'slow' harddisks are one... get 2 10000rpm drives and put them in raid so the speed goes up too... maybe that will help...

also one thing: osx not being native 64bit yet... helps too on slowing down... I guess... but hey don't shoot me I'm a photographer, not an engineer ;-)

J

JFreak · Jul 7, 2004

hard disk is not always even a part of the equation. take a real-time audio processing for example, imagine a live foh mixing situation where the band plays and there's a G5 powermac with a studio-grade multi-channel audio interface as a mixing console. if the gig isn't going to be recorded, the hard drive sits idle. audio goes in to the system via pci or firewire, gets processed, then goes out via pci or firewire. no hard drive bottleneck there. that audio stream goes straight through memory controller (3GBps bandwidth for 100MBps stream) which isn't a bottleneck either. the data can reach the cpu(s) fast enough.

the interesting part is the phase where the real-time data gets processed. that job is highly cpu intensive and uses a lot of memory. there is a lot of traffic between the cpu(s) and the memory, because that 100MB will not fit into the cpu(s) internal caches at once. putting a few MB of cache between cpu(s) and the memory controller will not help as the data needs to travel all the way to the main memory. and that, my friend, is the slowest part of such a system. the bus between the memory controller and the memory storage. sure, it's the same 3GBps bandwidth as before, but as the data needs to be moved back and forth multiple times per second, the bandwidth is consumed faster than you think. the cpu(s) have only about 30 chances to calculate each audio data chunk and that will be a problem given the complexity of some audio processing algorithms. (of course, this is where the altivec comes in and virtually reduces the memory bottleneck by doing more work simultaneously, giving cpu(s) more chances to process. but still, that is the problem.)
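The "about 30 chances" figure falls straight out of the two numbers in the post. A minimal sketch, taking "3GBps" as 3000 MB/s and treating each read-modify-write pass as two bus crossings (that second part is an assumption added here, not something the post states):

```python
def passes_per_second(memory_bandwidth_mb_s, stream_rate_mb_s):
    """How many times per second the full stream could cross the memory bus."""
    return memory_bandwidth_mb_s / stream_rate_mb_s

BANDWIDTH_MB_S = 3000.0   # "3GBps" from the post, taken as 3000 MB/s
STREAM_MB_S = 100.0       # "100MBps" stream from the post

budget = passes_per_second(BANDWIDTH_MB_S, STREAM_MB_S)
# Assumption: each read-modify-write of the data costs two crossings.
round_trips = budget / 2

print(budget, round_trips)
```

So the bus allows about 30 one-way crossings of the stream per second, or roughly half that many full read-and-write-back passes if every pass has to go out to main memory.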

the G5(s) are not able to do the fullest because they have to wait for the data to travel to memory and back. and while the L3 would certainly help with this, the amount of fast memory required would be a big multiple of that 2MB apple has put into the G4 powermacs. anybody know how much ibm puts into its powerX servers? they put in a whopping 36MB per cpu. oh yes. more than 100 MB total for the whole system. that would kick some ass

of course, a second is like an infinity for a cpu. nevertheless it is justified to analyze this situation as per-second basis, because the stream is continuous and can last for many hours. and all that data will have to be processed without a single hiccup, in real-time. likely in front of a crowd of a few thousand people.

Frohickey · Jul 7, 2004

JFreak said:
the G5(s) are not able to do the fullest because they have to wait for the data to travel to memory and back. and while the L3 would certainly help with this, the amount of fast memory required would be a big multiple of that 2MB apple has put into the G4 powermacs. anybody know how much ibm puts into its powerX servers? they put in a whopping 36MB per cpu. oh yes. more than 100 MB total for the whole system. that would kick some ass

Do you know how much the IBM servers cost?
Also, the IBM servers do not use the 970, they use its big brother, the Power4. And, of course, there is going to be the new kid on the block soon, the Power5.

blutfink · Jul 7, 2004

ddtlm said:
Latency is the hidden enemy of the G5, the real enemy.

[...]

I say the G5 needs cache MORE than the G4 did.

Finally someone states it. The mighty-mighty PowerPC 970 is a latency snail -- in comparison to AMD's chips.

In many cases, bandwidth isn't everything. Code optimization for memory latency is often very tedious. For Apple and Adobe it's alright to do that once and for all in their high-volume apps (or in the Linear Algebra benchmarks at Virginia Tech for that matter). But for many developers out there (especially in the field of science and research), it's hard work to get maximum performance out of their implementations on a G5.

And in some cases (truly "random" memory access/non-streaming algorithms) it's almost impossible to keep up with the Athlons/Opterons.

Timelessblur · Jul 7, 2004

I might like to point out that with each cache you add, performance gains go down. L1 is the fastest and each level after it has less and less gain. as for clock speed, the higher the clock speed, the less there is gained over before. There is a reason AMD has chips that run at 2ghz that are outdoing 3.2-3.4 ghz pentiums. Clock speed is not everything.

Also you need to remember when you overclock a chip there is less gained out of it than designing a chip that can go that speed without overclocking it (aka the 2.5Ghz G5 are just overclocked chips). Besides, when Jobs says we made these huge speed improvements in the mac I will say blah. There is proof sitting out there that speed means a lot less than they make it out to be. (points to AMDs with their 2.0-2.4 ghz chips) I feel it is sad that apple has now joined the ghz bandwagon....

ddtlm · Jul 7, 2004

Timelessblur:

Also you need to remember when you overclock a chip there is less gained out of it than designing a chip that can go that speed without overclocking it

Any chip X running at Y ghz with Z memory system will perform the same regardless of overclocking or the lack of it.

(aka the 2.5Ghz G5 are just overclock chips).

The G5 is perfectly capable of running at 2.5ghz without it being an overclock, if it's even possible for a factory-warrantied chip at factory specs to be overclocked. Overclocking is arguably what customers do, not manufacturers.

I feel it is sad that apple has now joined the ghz bandwagon....

Since when have they been mhz nutsos? Never. A G5 on 90nm can barely clock higher than a 130nm A64 at this time. With an A64's memory system I'm confident the G5 would be superior clock-for-clock to the A64 at the same process tech, and anyone who has read the things I post knows I'm no Apple worshipper.
