OK, lets look at this code again. I'll write some x86 assembly to do it. Not the best in the world, but we'll get an idea whats going on. Also I need to do this to help my memory.
int x1,x2,x3;
for (x1=1; x1<=20000; x1++) {
for(x2=1; x2<=20000; x2++) {
x3 = x1*x2;
}
}
Ok, lets do it the stupidest way possible in x86 NASM:
OK:
segment .data
segment .bss
segment .text
global asm_func1, asm_func2
asm_func1:
enter 4,0
mov [ebp-4], dword 1
mov ebx, 20001
outloop1
mov ecx, 1
inloop1
mov eax, ecx
mul dword [ebp-4]
inc ecx
cmp ecx, ebx
jne inloop1
inc dword [ebp-4]
cmp [ebp-4], ebx
jne outloop1
leave
ret
asm_func2:
enter 0,0
mov edx, 1
outloop2
mov ecx, 1
xor eax, eax
inloop2
add eax, edx
add eax, edx
add eax, edx
add eax, edx
add ecx, 4
cmp ecx, 20001
jne inloop2
inc edx
cmp edx, 20001
jne outloop2
leave
ret
asm_func3:
enter 0,0
mov edx, 1
outloop3
mov ecx, 1
xor eax, eax
xor ebx, edx
inloop3
add eax, edx
shl edx, 1
add ebx, edx
add eax, edx
add ebx, edx
add eax, edx
add ebx, edx
add eax, edx
add ebx, edx
shr edx, 1
add eax, edx
add ecx, 8
cmp ecx, 20001
jne inloop3
inc edx
cmp edx, 20001
jne outloop3
leave
ret
The register eax is x3 in #1 and #2, and in #3 I am storing x3 into eax and eab alternatively.
Func #1 runs in
2.467 seconds on my Xeon 700 in Linux. I don't even want to look for where others posted number way back there, does anyone know numbers off the top of their head? I'll now do a version where I optimize it.
Function #2 runs in
0.667 seconds on the same machine. Wanna see if I can do better? Note that func #2 still does calculate every single value of x3, even though none are stored.
Function #3 runs in
0.547 seconds on the same machine, which as we can see is about a 20% speedup over #2. Note I am still doing every calculation of x3. We haven't even began to consider what a compiler could do if it saw that x3 didn't need to be computed.
Now I'll do some C compiler versions to see what happens:
unsigned int C_func1( )
{
unsigned int i, j, x3;
for(i=1;i<40001;++i)
for(j=1;j<40001;++j)
x3 = i * j;
return x3;
}
gcc driver.c -o exe && time ./exe
15.976
gcc -O driver.c -o exe && time ./exe
7.094
gcc -O2 driver.c -o exe && time ./exe
4.642
gcc -O3 driver.c -o exe && time ./exe
6.966
gcc -O2 -funroll-loops driver.c -o exe && time ./exe
0.240
gcc -O2 -funroll-all-loops driver.c -o exe && time ./exe
0.232
This is using GCC 2.96. Note how much difference I can generate even within a single compiler just by changing simple compiler options?