New barefeats article proves G4s sometimes faster, sometimes slower than AMD/Pentiums

nixd2001 · Oct 12, 2002

Originally posted by ddtlm
Wow I missed a lot by spending all of Friday away from this board. I am way behind in posts here, and I'm sure I'll miss a lot of things worth comment. But anyway, the code fragment:

Is a very poor benchmark. Compilers may be able to really dig into that and make the resulting executable perform the calculate radically different. In fact, I can tell you the answer outright: x1=20000, x2=20000, x3 = 400000000. It took me 2 seconds or so. Does this mean that I am a better computer than a G4 and a P4? No, it means I realized that the loop can be reduced to simple data assignments. I have a better compiler, thats it.

I'll see about adding more thoughts later.

there is a lot a compiler could do to this - by us all (well, those who have the interest in the assembler output of a compiler at least) having a look at what the respective compilers have done, we can form more of an informed opinion of what works out to the benefit of the P4 for this case. This might all be a bit geeky, but I am intersted at least.

ddtlm · Oct 12, 2002

nixd2001, others:

Please note I am editing my previos post (last one on page 7) to address the issue.

ddtlm · Oct 12, 2002

OK, lets look at this code again. I'll write some x86 assembly to do it. Not the best in the world, but we'll get an idea whats going on. Also I need to do this to help my memory.

int x1,x2,x3;
for (x1=1; x1<=20000; x1++) {
for(x2=1; x2<=20000; x2++) {
x3 = x1*x2;
}
}

Ok, lets do it the stupidest way possible in x86 NASM:

OK:

segment .data

segment .bss

segment .text
global asm_func1, asm_func2
asm_func1:
enter 4,0

mov [ebp-4], dword 1
mov ebx, 20001
outloop1
mov ecx, 1
inloop1
mov eax, ecx
mul dword [ebp-4]

inc ecx
cmp ecx, ebx
jne inloop1

inc dword [ebp-4]
cmp [ebp-4], ebx
jne outloop1

leave
ret

asm_func2:
enter 0,0

mov edx, 1
outloop2
mov ecx, 1
xor eax, eax
inloop2
add eax, edx
add eax, edx
add eax, edx
add eax, edx
add ecx, 4

cmp ecx, 20001
jne inloop2

inc edx
cmp edx, 20001
jne outloop2

leave
ret

asm_func3:
enter 0,0

mov edx, 1
outloop3
mov ecx, 1
xor eax, eax
xor ebx, edx
inloop3
add eax, edx
shl edx, 1
add ebx, edx
add eax, edx
add ebx, edx
add eax, edx
add ebx, edx
add eax, edx
add ebx, edx
shr edx, 1
add eax, edx
add ecx, 8

cmp ecx, 20001
jne inloop3

inc edx
cmp edx, 20001
jne outloop3

leave
ret

The register eax is x3 in #1 and #2, and in #3 I am storing x3 into eax and eab alternatively.

Func #1 runs in 2.467 seconds on my Xeon 700 in Linux. I don't even want to look for where others posted number way back there, does anyone know numbers off the top of their head? I'll now do a version where I optimize it.

Function #2 runs in 0.667 seconds on the same machine. Wanna see if I can do better? Note that func #2 still does calculate every single value of x3, even though none are stored.

Function #3 runs in 0.547 seconds on the same machine, which as we can see is about a 20% speedup over #2. Note I am still doing every calculation of x3. We haven't even began to consider what a compiler could do if it saw that x3 didn't need to be computed.

Now I'll do some C compiler versions to see what happens:

unsigned int C_func1( )
{
unsigned int i, j, x3;

for(i=1;i<40001;++i)
for(j=1;j<40001;++j)
x3 = i * j;

return x3;
}

gcc driver.c -o exe && time ./exe
15.976

gcc -O driver.c -o exe && time ./exe
7.094

gcc -O2 driver.c -o exe && time ./exe
4.642

gcc -O3 driver.c -o exe && time ./exe
6.966

gcc -O2 -funroll-loops driver.c -o exe && time ./exe
0.240

gcc -O2 -funroll-all-loops driver.c -o exe && time ./exe
0.232

This is using GCC 2.96. Note how much difference I can generate even within a single compiler just by changing simple compiler options?

javajedi · Oct 12, 2002

Originally posted by ddtlm
OK, lets look at this code again. I'll write some x86 assembly to do it. Not the best in the world, but we'll get an idea whats going on. Also I need to do this to help my memory.

Ok, lets do it the stupidest way possible in x86 NASM:

I'll be back. Watch this space, I will write it up to make sure it runs.

ddtlm: I didn't know if you downloaded FPTest.java, but basically the only difference there was it was done with 2x precision fp, and doing square roots. BTW: I think I mentioned this in one of my previous post, but for the Mac OS X version, I compiled it with GCC 3.1, then ran both tests on the iBook and PowerBook G4.

C for Mac OS X:

double x1,x2,x3
for (x1=1; x1<=20000; x1++) {
for(x2=1; x2<=20000; x2++) {
x3 = sqrt(x1*x2);
}
}

Java:

double x1,x2,x3
for (x1=1; x1<=20000; x1++) {
for(x2=1; x2<=20000; x2++) {
x3 = Math.sqrt(x1*x2);
}
}

One more question i have for you while you are responding: What you suggested may very well be accurate, the compiler is making some really poor decisions, however if this were the case, what about javac? The 5.9 second score of the P4 was running under the JVM. Obviously the JVM is making some smart decisions and simplifications to the above code. I find it hard to believe these same economies of scale and efficiencies are not present in the Apple JVM (but still possible).

Lastly, lets say when we throw out the java one and go with C and GCC 3.1, lets take the scenario it is in fact, not making the best of the situation as you described. I would be disappointed if this was the case, but regardless, the same compiled code ran on the iBook, ran faster for some reason then the 800MHz G4.

Lastly, I dont know if this well help clear up anything, but Im going to ask my friend who ran it on a G4 with Yellow Dog, to run the Java version. This way we can eliminate the Apple JVM and GCC.

I think it's possible that GCC3.1 and specifically Apple's JVM could be causing this problem, but highly improbable.
Food for thought..

MacCoaster · Oct 12, 2002

javajedi: Well, well... I finally figured out GNUstep and ported your Cocoa program to it--works 100%. Funny thing it's slower than the Java one, but it might be the extra crap I put in there (menus, etc.). 10 seconds compared to 7 seconds with Java. But that's still faster than 70 seconds on a G4. I'll be making a pure C port if anyone hasn't.

JustAGuy · Oct 12, 2002

Hi all, just thought that I'd compile and run the tests on my G4/450 and PIII/733 for comparison. VERY interesting results. I had to change the i value from 20,000 down to 5,000 to save time...

In any event, the results are 15s for the G4/450 and, get this, 55s for the PIII/733.

Further compounding these results was the fact that the G4 was running setiathome with OSX's lousy priority scheduling (nice 20 usually takes up no less than 15% CPU) and the PIII was devoting 100% of it's processor resources to the task.

The best part about one-off, anecdotal evidense is that it is just that

(gcc 2.95 - cygwin - on the PC, gcc 3.1 on OSX) I'll get the java version and give it a whirl...

ddtlm · Oct 12, 2002

JustAGuy:

You should try those tests with some of the compiler flags that I used in my post a few posts up, which I have been editing.

Right now I am looking at the assembly that gcc is generating. It looks like gcc gets the answer in a very strange way.

javajedi:

One more question i have for you while you are responding: What you suggested may very well be accurate, the compiler is making some really poor decisions, however if this were the case, what about javac?

I don't have an answer to that at this time, but it seems to me that we are looking at different quality of JVM's. I could see a P4 beat a G4 by a fair amount, but lets be realistic... the G4 is not so slow as the numbers here have been suggesting.

I wish I knew some PPC assembly.

I would code up some stuff for that too, and I bet the nubmer of registers would help a lot. Registers are great for loop unrolling.

Anyway, some time ago you asked how the G4 has better scalar units than the G3. Basically the FP units are similar but the G4 unit has a lower instruction latency when doing double precision (in the G3 doubles take one more cycle than singles, on the G4 they are the same). Also, the G4 has 4 integer units where the G3 has only two. This is not always useful, but in this problem if I could do PPC assembly I could easily overwelm all 4 of them.

MacCoaster · Oct 12, 2002

JustAGuy: Okay, I modified that for 5000 and compiled on my Athlon-Tbird. Runs in about one second on average.

In fact, put back the 20000 values in both and compile it using:

gcc -mcpu=7450 -O2 -pipe -fsigned-char -maltivec -mabi=altivec -mpowerpc-gfxopt -funroll-loops -o benchmarker benchmarker.c

Or hell, use this C code:

Code:

#include <stdio.h>

int main()
{
	double x1, x2, x3;
	int result, startTime, finishTime;

	startTime = time(NULL);

	for (x1 = 1; x1 <= 20000; x1++)
	{
		for (x2 = 1; x2 <= 20000; x2++)
		{
			x3 = sqrt(x1*x2);
		}
	}

	finishTime = time(NULL);

	result = finishTime - startTime;

	printf("This computer processed the double precision test in %d seconds.\n", result);
	return 0;
}

And also, ddtlm, PLEASE tell us how you compiled your asm files and such so we can duplicate the results.

UnixMac · Oct 12, 2002

You guys lost me and prolly (I like that, Prolly) about 90% of this forum....

have fun, and lets see how many pages you can get this thread to go to? I predict, 12.

ryme4reson · Oct 12, 2002

1more question

Can some1 run this from within VPC. I believe that VPC is supposed to emulate the 486, so I am interested in finding out if they process is handled different, even though its a G4. Sure it will not be fast (emulatin) but i would be interested in seeing the results.

EDIT: ddtlm, are you interested in helping me with X86 assembly? I would be willing to pay for your time. Email me at jamesk777@mac.com or IM me at ryme4reson (AOL) Thanks

ddtlm · Oct 12, 2002

MacCoaster:

Missed your request for ASM directions for a sec there.

Anyway, I use NASM. Available here:

http://sourceforge.net/projects/nasm

I do my assembly in a .asm file, and use a C program as a wrapper to make things easy. C program, including my C loops. Notice that is't ugly and I manually change it to test different things, but hey it works. You can do better Im sure.

#include <math.h>

unsigned int asm_func1( );
unsigned int asm_func2( );
unsigned int asm_func3( );

unsigned int C_func1( )
{
unsigned int i, j, x3;

for(i=1;i<40001;++i)
for(j=1;j<40001;++j)
x3 = i * j;

return x3;
}

unsigned int C_func2( )
{
double i, j, x3;

for(i=1;i<20001;++i)
for(j=1;j<20001;++j)
x3 = sqrt(i * j);

return x3;
}

int main()
{
/* unsigned int cnt; */
double cnt;

cnt = C_func2();

/* printf("%u\n",cnt); */

printf("%d\n",cnt);

return 0;
}

So anyway, now I have my driver.c and my stuff.asm. I "gcc -c driver.c" which should make a .o file. I then "nasm -f elf stuff.asm && gcc *o -o exe" which will make my .asm into a .o and then link the .o's. (BTW, I assume you are familiar with GCC already.)

This can also be done under VisualC if anyone wants to.

http://www.cs.uaf.edu/~cs301/usingnasm.htm

Docs (available all over probably):

http://www.cs.uaf.edu/~cs301/doc/nasmdoc0.html

PCUser · Oct 12, 2002

MacCoaster, wouldn't it be more accurate to use clock() instead of time()? Here's with that change:

Code:

#include <stdio.h>
#include <time.h>
#include <math.h>

int main()
{
  clock_t cStart, cFinish;

  double x1, x2, x3;

  cStart = clock();
  for (x1=1; x1<=20000; x1++) {
    for(x2=1; x2<=20000; x2++) {
      x3 = sqrt(x1*x2);
    }
  }
  cFinish = clock();

  float result = (float)(cFinish - cStart) /  (float)CLOCKS_PER_SEC;

  printf("This computer processed the double precision test in %f seconds.\n", result);

  return 0;
}

I'm not sure if gcc 2.96 supports it. I know gcc 3.2 does, since that's what I'm using.

The test took 32.31 seconds on my Athlon XP 1.53GHz in Linux with no optimization, and 8.86 seconds with "-O2 -funroll-all-loops -march=athlon-xp". (It took the same amount when I condensed it into one loop of 400,000,000.)

ddtlm · Oct 12, 2002

Sheesh, where does the OSX 10.2 developer tools CD install gcc to, or under what name? The older dev tools gave me a compiler. Grumble.

nixd2001 · Oct 12, 2002

Just to keep the numbers rolling:

Code:

                                User    Elap   User   Elap 
                        XEON    G4      G4     G4     G4
gcc			15.97   6.500   6.59   76.34  1:19.92
-O                      7.094   1.130   1.28   68.71  1:12.43
-O2                     4.642   0.770   0.84   68.68  1:11.67
-O3                     6.966   0.900   0.84   68.28  1:11.56
-O2 -funroll-loops      0.240   0.030   0.04   68.83  1:12.10
-O2 -funroll-all-loops  0.232   0.020   0.04   67.45  1:12.76
lots of flags                                  67.63  1:11.70
                        INT     INT     INT    FP     FP

gcc --version gives "gcc (GCC) 3.1 20020420 (prerelease)". This is the version that came with Jaguar in the developers stuff, so it is a bit later than the 2.9 you used. xeon numbers are for 700Mhz system, generated by ???. G4 numbers are my DP 1GHz windy tunnels (ie 167Mhz DDR, for what it's worth). Lots of flags is "-mcpu=7450 -O2 -pipe -fsigned-char -maltivec -mabi=altivec -mpowerpc-gfxopt -funroll-loops". Tests run in an emacs shell from terminal within Jaguar and with iTunes playing in the background. If any of this really impacts the performance, there's something a bit screwed with the scheduling, so I'm going to ignore it. I note that my CPU meter tended to show about 60% for both processors, whereas I'd expect one CPU to be flat out. It's also interesting how GCC is not capable of optimising the FP variant beyond the level achieved with minimal optimisations, whilst the INT version is makes a major difference when unrolling the loops.

All in all, GCC seems to be be able to optimise very simple case well when working with integers, but it's optimisations seem pretty incapable of contributing when using FP. So I'd conclude GCC on PCC with FP is naff (on this test case!), but I have no more direct comparison.

Oh yeah, the INT and FP code is the simple nested 20K loops given above.

MacCoaster · Oct 12, 2002

PCUser:

Thanks! Didn't think about clock()!

Though, that gives me 100.8 seconds (assuming 10.08 seconds) when it ran in 10 seconds. Didn't you mean to divide by ten?

MacCoaster · Oct 12, 2002

ddtlm:
Thanks. I do know gcc a bit, but I really need complete instructions...

i.e. What to do with the .asm. What to do with the .c. What to do with them both to finally bind those. The linker ld? The only time I've ever used ld was in my little OS development... it's been months since I've touched that.

nixd2001 · Oct 12, 2002

Originally posted by MacCoaster
ddtlm:
Thanks. I do know gcc a bit, but I really need complete instructions...

i.e. What to do with the .asm. What to do with the .c. What to do with them both to finally bind those. The linker ld? The only time I've ever used ld was in my little OS development... it's been months since I've touched that.

Dunno about the asm files without delving deeper.

But imagine you've copied the benchmark code to mr2.c - then try

gcc -O2 -funroll-all-loops -o mr2 mr2.c

the -O2 and -funroll-all-loops are optimisation flags. The -o mr2 says to create an output file called mr2. GCC will work out this isn't an object file and manage the linking for you. The mr2.c on the end specifies the input file.

More?

ddtlm · Oct 12, 2002

nixd2001:

Those score I posted earlier were from the integer version of the loop that I was ripping on as meaningless. The float version is not quite so meaningless because you can't just unroll the thing, because floats get different results if the ops are even done in different orders. For the benefit of people who may not know it, with floating point numbers often 4x != x + x + x.

Anyway, my P3 Xeon 700 sports this compiler:

gcc version 2.96 20000731 (Red Hat Linux 7.3 2.96-112)

Results for the exact loop posted by PCUser are:

gcc -O driver.c -o exe && time ./exe
38.858

gcc -O2 -funroll-loops driver.c -o exe && time ./exe
38.818

On a side note, I also found gcc on my Mac after relogging into the terminal so that things were added to the path. Funny that the finder's find cannot see tools like gcc. I'll get results for that posted soon.

ddtlm · Oct 12, 2002

MacCoaster:

Ok, here we go. You have a program.c so compile it into compiler.o like this:

gcc -c program.c

You may place flags such as -O before -c, or maybe even after it. But certainly before it. Anyway, you have some asm_func.asm, so compile it into asm_func.o like this:

nasm -f elf asm_func.asm

Now, you can link these two .o files like this:

gcc *o -o exe

Which makes an executable named exe (which of course you can change to be whatever you want).

Anyway, do note that the ASM funcs do the integer "benchmark" and not the float one. Also, I think because I overwrite ebx when I am not supposed to, the asm routines tend to cause program segaults after they exit.

But they still provide a valid result. I could fix that, but whatever.

ddtlm · Oct 12, 2002

The result for my OSX 10.2 DP 800 G4 on the floating test is 85.56 seconds. I used -O and -funroll-loops as flags.

So this is about 45% the speed of my P3-Xeon 700. Not very good at all, but it falls within the ream of believeability.

nixd2001 · Oct 12, 2002

Originally posted by ddtlm
The result for my OSX 10.2 DP 800 G4 on the floating test is 85.56 seconds. I used -O and -funroll-loops as flags.

So this is about 45% the speed of my P3-Xeon 700. Not very good at all, but it falls within the ream of believeability.

Other than a -O to enable/disable any optimisations at all, what effect can you achieve with the remaining optimistion flags to GCC? I'm more surprised by the lack of variation they achieve on PPC than the actual relative performance - having looked at the PPC code briefly, it looks like I'd expect it to be slow

ddtlm · Oct 12, 2002

nixd2001:

The flags don't do anything to my x86 results either. This loop is just hard to optimize. I did manual unrolling, replaced mults with adds (which we can actually do safely since the float values in the loop controlls are not factions), and even replaced one of the loop counters with an int in conjuntion with the other two above (in such a way that I needed no typecaseing)... and the resukts inproved maybe 5% on the Mac and none on the PC.

javajedi · Oct 12, 2002

ddtlm check this out, this may suprise you:

I ran the double precision test (sqtrt()) for the first time today as a c program. I compiled on the same machine as I ran the java version, with gcc version 2.95.3-5 (cygwin didn't come with 3.x).
Here are the parameters to gcc:

$ gcc -march=i686 -O3 -pipe -mpreferred-stack-boundary=2 -fforce-mem -fforce-addr -fexpensive-optimizations -funroll-loops -fomit-frame-pointer

Using this, the C program does it in 7.01 seconds. The same code, in java does it in 5.9. The javac, or the jvm seems to better be able to tear apart the loop. I think Java being "slow" is another common misconception that people have

Oh well...

Meanwhile on the PPC side of things, I compiled the fp test against:

mcpu=7450 -O2 -pipe -fsigned-char -maltivec -mabi=altivec -mpowerpc-gfxopt -funroll-loops

Ofcourse this is running in 10.2, and I'm still stuck at around 90 seconds.

Is there anything else you think we can do aside from vectorizing it? Lastly, now that we're all on the same page now on how we are compiling this, I reran the silly single percision int test, and my powerbook looses out to the 750FX. Same platform, same code and everything, but heck?

ddtlm · Oct 12, 2002

Anyway I've had my fun here for now. I think it is settled that the G4 does poorly at this particular float test. I've done everything I can think of and gone though all sorts of variations of the loop trying to increase the IPC but I could never make significant headway on either the PC or the Mac.

That said, this test is essentialy a test where we do 400000000 double precision square roots which we don't even store and nothing else. There are no memory access, only very predictable branches. I have radically changed the loop and compiler flags and essentially nothing besides the sqrt() makes any difference.

I do not regard this test as important in the overall picture. It does not illustrate anything important to anyone, unless someone sits around doing square roots all day.

I might also add that designing a meaningful benchmark is very hard. I think SPEC is about as good as it gets, and yes the G4 looses in floats there too.

ddtlm · Oct 12, 2002

javajedi:

Sheesh, I have no idea how Java is defeating C... and those scores are still bizzarre. However PCUser did get 8.86 seconds on an Athlon 1533 with the right compiler flags. Looking at that, I wonder if the compiler flags are the cause here. Since this whole thing is essentially sqrt(), I wonder if the newer x86 chips are packing some strange special sqrt() assembly instruction that makes this huge difference. Hmmm. Otherwise I wonder how an Athlon at a little more than twice my clock speed (compared to the Xeon) can post results that are more than 4 times as fast.

Anyway this is it for me, since this is the weekend. I'll look for some x86 fast sqrt function Monday at work (I am pretty sure that such a thing exists, and if so it may be used in this test).

New barefeats article proves G4s sometimes faster, sometimes slower than AMD/Pentiums

macrumors regular

macrumors 65816

macrumors 65816

macrumors member

macrumors 6502a

macrumors member

macrumors 65816

macrumors 6502a

macrumors 6502

macrumors 6502

macrumors 65816

macrumors regular

macrumors 65816

macrumors regular

macrumors 6502a

macrumors 6502a

macrumors regular

macrumors 65816

macrumors 65816

macrumors 65816

macrumors regular

macrumors 65816

macrumors member

macrumors 65816

macrumors 65816

Our Staff