Re: 32 vs. 64 bit apps
Originally posted by Dave Marsh
I've heard several people say that 64-bit apps would run a bit slower on the 970 than comparable 32-bit apps, which I find a bit confusing. If the processor was built to run 64-bit apps, and isn't just a jury-rigged 32-bit design finagled to run 64-bit apps (which we know ISN'T the case, since it's based on the POWER4 design), I don't understand why this would be the case.
What I mean is, hasn't the processor been designed to handle PURE 64-bit apps, with all the appropriate circuitry to accommodate that? And, when running 32-bit apps, doesn't that just mean that some of that additional circuitry just isn't being used (or perhaps is just being filled with leading zeroes)? If that's the case, and the processor is just moving data according to its clock rate, why would an app using only part of the design run faster?
Here is an example on IBM's site:
http://www-106.ibm.com/developerworks/linux/library/l-ppc/
To initialise a register with a pointer, you need 2 instructions on 32-bit PowerPC vs. 5 instructions on 64-bit PowerPC.
This is a side effect of the instruction size, which remains 32 bits in 64-bit mode, so the amount of immediate data you can store in one instruction is still the same: 16 bits (you'll find more explanation on IBM's page; just scroll to the bottom).
Using more instructions not only takes more time, it also takes up more L1 instruction cache space.
PowerPC 32:
lis 4,msg@ha # load top 16 bits of &msg
addi 4,4,msg@l # load bottom 16 bits
PowerPC 64:
lis 4,msg@highest # load msg bits 48-63 into r4 bits 16-31
ori 4,4,msg@higher # load msg bits 32-47 into r4 bits 0-15
rldicr 4,4,32,31 # rotate r4's low word into r4's high word
oris 4,4,msg@h # load msg bits 16-31 into r4 bits 16-31
ori 4,4,msg@l # load msg bits 0-15 into r4 bits 0-15
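If it helps to see why those sequences end up with the full address, here is a rough C sketch that replays them step by step on plain integers. It is only a model under my own assumptions: the function names (rldicr, load_address32, load_address64) are made up for illustration, r4 is just a local variable, and the address pieces are computed the way I understand the @ha, @l, @highest, @higher and @h operators to work.

```c
#include <assert.h>
#include <stdint.h>

/* Model of rldicr: rotate left by sh, then keep bits 0..me in IBM
   numbering (bit 0 is the most significant bit) and clear the rest. */
uint64_t rldicr(uint64_t x, unsigned sh, unsigned me)
{
    assert(sh < 64 && me < 64);
    uint64_t rotated = sh ? (x << sh) | (x >> (64 - sh)) : x;
    uint64_t mask = ~UINT64_C(0) << (63 - me);
    return rotated & mask;
}

/* 32-bit case: 2 instructions (lis + addi). */
uint32_t load_address32(uint32_t addr)
{
    /* @ha is the high half adjusted by 0x8000, because addi
       sign-extends its 16-bit immediate before adding. */
    uint16_t ha = (uint16_t)((addr + 0x8000u) >> 16);
    uint16_t l  = (uint16_t)(addr & 0xFFFFu);
    int32_t  lo = (l & 0x8000u) ? (int32_t)l - 0x10000 : (int32_t)l;

    uint32_t r4 = (uint32_t)ha << 16;   /* lis  4,msg@ha   */
    r4 += (uint32_t)lo;                 /* addi 4,4,msg@l  */
    return r4;
}

/* 64-bit case: 5 instructions, each carrying at most 16 bits. */
uint64_t load_address64(uint64_t addr)
{
    uint16_t highest = (uint16_t)(addr >> 48);  /* bits 48-63 */
    uint16_t higher  = (uint16_t)(addr >> 32);  /* bits 32-47 */
    uint16_t h       = (uint16_t)(addr >> 16);  /* bits 16-31 */
    uint16_t l       = (uint16_t)addr;          /* bits 0-15  */

    /* lis: immediate shifted left by 16, sign-extended to 64 bits */
    uint64_t r4 = (uint64_t)highest << 16;
    if (highest & 0x8000u)
        r4 |= UINT64_C(0xFFFFFFFF00000000);

    r4 |= higher;                 /* ori    4,4,msg@higher */
    r4 = rldicr(r4, 32, 31);      /* rldicr 4,4,32,31      */
    r4 |= (uint64_t)h << 16;      /* oris   4,4,msg@h      */
    r4 |= l;                      /* ori    4,4,msg@l      */
    return r4;
}
```

Calling load_address64 with any 64-bit value should hand back that same value, which is the point: every instruction can only contribute 16 bits, so a 64-bit address needs four loads plus the rotate.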
Another point is that you need more memory to store pointers (since they are now 64 bits instead of 32). This means you can keep only half as many in the L1 cache, for instance, which could lead to more memory accesses.
If you have an array of 2000 pointers you want to copy, instead of moving 8,000 bytes you have to move 16,000 bytes.
Inside the CPU, working on 32-bit or 64-bit chunks has no impact once they are in registers, but when you store or load 64-bit pointers in main memory or in the caches, they take more space and put more stress on the memory subsystem.
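The arithmetic for the pointer array can be written down as a tiny C sketch. The helper names are hypothetical, and the 32 KB cache size used in the note after the code is just an illustrative figure I picked, not a measurement of any particular chip.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes that must be moved to copy `count` pointers of the given width. */
size_t pointer_array_bytes(size_t count, size_t pointer_bytes)
{
    return count * pointer_bytes;
}

/* How many pointers fit in a cache of `cache_bytes`, ignoring
   everything else that competes for the same cache lines. */
size_t pointers_per_cache(size_t cache_bytes, size_t pointer_bytes)
{
    assert(pointer_bytes > 0);
    return cache_bytes / pointer_bytes;
}
```

With 4-byte pointers, pointer_array_bytes(2000, 4) gives 8000, and with 8-byte pointers it gives 16000. Likewise, an assumed 32 KB data cache would hold 8192 of the 32-bit pointers but only 4096 of the 64-bit ones.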