Rincewind42 said:
Correct me if I'm wrong, but the last time I heard about SSE and dealing with 128-bit vectors, it implemented the functionality by processing the two 64-bit vector halves in serial. So this would mean that even though SSE can handle 2 doubles in one instruction, it still works as if you're only processing one of them per cycle.
That's correct. My understanding is also that when SSE instructions are executed they block execution of normal floating point instructions, so theoretically you only save on the number of instructions you use, which can be important. But then again, the instructions you use have opcodes almost twice as long (the normal float FADD is "DC /0", while the SSE float ADDPD is "66 0F 58 /r"), so you gain nothing. I don't have any numbers on how this turns out in practice, though.
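
For what it's worth, here's a rough sketch of the packed-double path using compiler intrinsics rather than raw opcodes (my own illustration, assuming SSE2 and <emmintrin.h>; the _mm_add_pd maps to that ADDPD):

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Add two pairs of doubles; _mm_add_pd compiles to a single ADDPD. */
void add_pairs(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);             /* loads a[0] and a[1] */
    __m128d vb = _mm_loadu_pd(b);             /* loads b[0] and b[1] */
    _mm_storeu_pd(out, _mm_add_pd(va, vb));   /* two adds in one instruction */
}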
 
daveL said:
I'm still looking for the IA64 Windows XP release. You continue to put forward a Windows OS release ahead of Mac OS X Tiger, because that's what you've been communicating in your posts.

Info on 64-bit Windows OS right here:

http://www.microsoft.com/windowsserver2003/64bit/default.mspx

IA64 XP was only available on HP's workstations; since HP discontinued those, there does not appear to be any current way to get IA64 XP except through MSDN, where you can see:

Microsoft has now introduced Windows® XP 64-Bit Edition Version 2003, with support for the Itanium 2 processor, to meet the demands of technical workstation users who require large amounts of memory and floating point performance in areas such as mechanical design and analysis, 3D animation, video editing and composition, and scientific and high-performance computing applications.

The XP 64-bit home page is now focused on the x64 (AMD64/EM64T) version; the server pages cover both IA64 and x64.


daveL said:
You also haven't answered my question as to whether you actually write code but, no matter, I wouldn't believe you, regardless of what you say.

Thirdly, you've given no response to the delta between Altivec and SSE, as mentioned above concerning 128-bit logical operations. You make assertions, people counter with facts, and you just move on to your next unsupported rant. Dennis Miller with no conscience.

bool aiden_codes = true;

I replied to a comment about using AltiVec for wide masks, and pointed out a similar capability in SSE. I didn't intend a tangential debate on VMX vs SSE, so I didn't follow that shift in direction.
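
To be concrete about the capability I meant: a 128-bit wide mask can be applied in one instruction with SSE2 as well. A minimal sketch (my own example, assuming <emmintrin.h>):

#include <emmintrin.h>

/* One PAND applies a full 128-bit mask to 128 bits of data. */
__m128i apply_mask(__m128i data, __m128i mask)
{
    return _mm_and_si128(data, mask);   /* 128-bit wide logical AND */
}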

Many people here don't realize that Microsoft has been shipping (and yes, selling) 64-bit versions of Windows for over 3 years. 64-bit Linux has been around even longer. They think Apple is leading in the 64-bit evolution, not playing catch-up to everyone else. It seems that some people have learned everything they know about computers from Apple's adverts, and aren't aware of what's really available.

If you think that it's only pedantic drivel, I invite you to skip over my posts. On the other hand, if you want some facts and opinions from a different viewpoint, please read.
 
gekko513 said:
That's correct. My understanding is also that when SSE instructions are executed they block execution of normal floating point instructions, so theoretically you only save on the number of instructions you use, which can be important. But then again, the instructions you use have opcodes almost twice as long (the normal float FADD is "DC /0", while the SSE float ADDPD is "66 0F 58 /r"), so you gain nothing. I don't have any numbers on how this turns out in practice, though.


Your understanding of the blocked execution is wrong - it was true for MMX, but SSE uses a separate set of registers and can execute SSE *and* (MMX *or* FP) at the same time.

Reference: http://developer.intel.com/technology/itj/q21999/articles/art_2c.htm (page 3, middle)


The longer opcode is a very minor issue. The Pentium 4 instruction decoder converts the x86 instruction stream into uops, and the decoded uops are cached in a special cache (the trace cache). In a loop, you'd only decode once.


The stream does contain two 64-bit uops. The SIMD FP unit can do one 64-bit float or two 32-bit float operations at once - so at the micro-architectural level it does take 2 cycles to start the pair of double float operations (or 2 cycles to start 4 single float ops).

This really isn't part of the SSE architecture, of course. A chip could have a 4-way 32-bit (and 2-way 64-bit) SIMD FP unit for SSE. The Intel engineers had to make tradeoffs as to where to spend the transistor budget - and for overall performance apparently a wider SIMD FP unit didn't make the cut.

If you really need floating point performance, Itanium has 4 FP units...
 
Some clarifications on both sides here:

64-bit Solaris (SPARC) = Released (years ago)
64-bit Solaris (x86-64) = Released (now)

64-bit Windows (Itanium) = Released (years ago)
64-bit Windows (x86-64) = Beta (free download)

64-bit OS X (PPC) = Beta (download through ADC)

The Itanium code is not the same as the x86 code, and is no more or less available than OS X 10.4. Neither Windows x86-64 nor Tiger is ready to use in real-world environments.

Aiden, you bring up Win2k3 for x86-64... I recently installed the latest public build. It is not ready to go.

Point is, when you go on about Windows being out for years, this is as misleading as saying that Iraq has weapons of mass destruction. Yes, there has been a 64-bit Windows out for a long time, but it is as specialized as Solaris. It is not the same code and creature as the version of 64-bit Windows people are expecting and will generally get in the popular (mass) market.

Max.
 
AidenShaw said:
Your understanding of the blocked execution is wrong - it was true for MMX, but SSE uses a separate set of registers and can execute SSE *and* (MMX *or* FP) at the same time.
You're right. So the P4 can theoretically execute 2 double-precision floating point operations per cycle, one SSE and one regular. That's the same number as the G5 can do (2 regular), so the P4 wins in theory when it comes to double-precision flops because it has a higher clock speed.

The only disadvantage for the P4 is that you have to mix SSE and regular float instructions to reach the full potential.
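
Something along these lines, I suppose. A toy sketch of mixing the two (my own example; it assumes a compiler that keeps long double arithmetic on the x87 unit, as gcc on x86 does, while the intrinsics go to the SSE unit):

#include <emmintrin.h>

/* Toy example: sum one array with SSE and another with x87, so both
   floating point units have independent work in flight. */
void mixed_sums(const double *a, const long double *b, int n,
                double *sse_sum, long double *x87_sum)
{
    __m128d vsum = _mm_setzero_pd();
    long double s = 0.0L;
    for (int i = 0; i + 1 < n; i += 2) {
        vsum = _mm_add_pd(vsum, _mm_loadu_pd(&a[i]));   /* ADDPD on the SSE unit */
        s += b[i] + b[i + 1];                           /* FADDs on the x87 stack */
    }
    double tmp[2];
    _mm_storeu_pd(tmp, vsum);
    *sse_sum = tmp[0] + tmp[1];
    *x87_sum = s;
}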
 
Try to do a better job staying on topic in the news threads, folks, and take large discussions like this to new threads on your own.

Thanks
 
gekko513 said:
You're right. So the P4 can theoretically execute 2 double-precision floating point operations per cycle, one SSE and one regular. That's the same number as the G5 can do (2 regular), so the P4 wins in theory when it comes to double-precision flops because it has a higher clock speed.

The only disadvantage for the P4 is that you have to mix SSE and regular float instructions to reach the full potential.

This is also complicated by the fact that the x86 FPU is stack based, so it is not as easy to work with as more modern FPUs. That is one of the primary motivations behind SSE*: to make floating point on x86 easier to use. So while you may in theory be able to run the FPU & SSE at the same time, it is not necessarily the most straightforward thing to do.

Whereas with the G5 the optimization advice is to pretend that you have a single FPU with a 12 cycle latency when scheduling (even though you actually have 2 FPUs with a 6 cycle latency). I dare say that I don't even want to think about how you would optimize the use of a stack-based FPU with a register-based FPU for similar scheduling gains.
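
For what that advice looks like in practice, here's a rough sketch I put together (nothing official): keep several independent accumulators going so there's always an FADD that doesn't depend on a result still working its way through the pipeline.

/* Illustrative only: four independent accumulators keep independent adds
   in flight while earlier ones are still in the ~6-cycle FPU pipeline.
   Real code would unroll far enough to cover the full effective latency. */
double sum_unrolled(const double *a, int n)   /* assumes n is a multiple of 4 */
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}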
 
Rincewind42 said:
This is also complicated by the fact that the x86 FPU is stack based, so it is not as easy to work with as more modern FPUs. That is one of the primary motivations behind SSE*: to make floating point on x86 easier to use. So while you may in theory be able to run the FPU & SSE at the same time, it is not necessarily the most straightforward thing to do.

Whereas with the G5 the optimization advice is to pretend that you have a single FPU with a 12 cycle latency when scheduling (even though you actually have 2 FPUs with a 6 cycle latency). I dare say that I don't even want to think about how you would optimize the use of a stack-based FPU with a register-based FPU for similar scheduling gains.

Easier on AMD's chips or the P3/PM since FXCH is 0 cycles. I don't remember what the latency is on the P4, but it's definitely a pain. Also... am I wrong that no x86 chip so far has had a separate SSE unit? SSE and MMX together I could see, but I definitely recall reading that SSE is done in the FP pipeline.
 