SSE 3, SSE 4 and AVX in one application

Discussion in 'Mac Programming' started by silvercircle, Apr 6, 2014.

  1. silvercircle macrumors member

    Joined:
    Nov 18, 2010
    #1
    How do I support SSE 3, SSE 4 and AVX in one application/bundle?

    Do I check at launch what options (processor) are supported and then run a specific application from within the bundle? Are there other options to accomplish this? And how can I check which option is supported?

    When I select SSE 4.2 on my Mid 2010 Mac Pro, the program runs a lot faster than when I select SSE 3. I want to offer the best and fastest option to every user.
     
  2. gnasher729 macrumors P6

    Joined:
    Nov 25, 2005
    #2
    The official way to check what is supported is to call sysctl. I haven't used code that checks the CPU type, but this example, which reads the processor, core, and thread counts, shows the call pattern:

    Code:
    // Needs <sys/types.h> and <sys/sysctl.h>.
    // Get the number of processors, cores, and threads by calling sysctl. If a call to
    // sysctlbyname fails, then assume there is one processor, one core per processor, and one
    // thread per core.
    size_t len;
    unsigned int procCount;
    unsigned int coreCount;
    unsigned int threadCount;

    // len is in/out: set it to the buffer size before every call.
    len = sizeof (procCount);
    if (sysctlbyname ("hw.packages", &procCount, &len, NULL, 0) != 0)
        procCount = 1;

    len = sizeof (coreCount);
    if (sysctlbyname ("hw.physicalcpu", &coreCount, &len, NULL, 0) != 0)
        coreCount = procCount;

    len = sizeof (threadCount);
    if (sysctlbyname ("hw.logicalcpu", &threadCount, &len, NULL, 0) != 0)
        threadCount = coreCount;
    		
    
    I'd probably put the performance-critical code into a class (C++ or Objective-C) with subclasses that are compiled with different compiler options, keeping the source code as identical as possible, and have a factory method return an instance of the right subclass for the processor you are running on.
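
    To feed a factory like that, the same sysctlbyname pattern can be pointed at the CPU feature flags. The following is a minimal sketch, not tested on every Mac: the key names (hw.optional.sse3, hw.optional.sse4_1, hw.optional.sse4_2, hw.optional.avx1_0) are the ones I would expect an Intel Mac to expose, so check them against the output of `sysctl hw.optional` first. The helper names cpu_has and best_simd_level are made up for this example, and on non-Apple systems the sketch simply reports no support.

```c
#include <stddef.h>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Hypothetical helper: returns 1 if the named sysctl key exists and is
   nonzero, otherwise 0. Outside macOS this sketch always returns 0. */
int cpu_has(const char *key) {
#ifdef __APPLE__
    int value = 0;
    size_t len = sizeof(value);
    if (sysctlbyname(key, &value, &len, NULL, 0) != 0)
        return 0;               /* unknown key: treat as unsupported */
    return value != 0;
#else
    (void)key;
    return 0;
#endif
}

/* Hypothetical helper: names the best level this sketch knows about,
   checking the newest instruction set first. Key names are assumptions. */
const char *best_simd_level(void) {
    if (cpu_has("hw.optional.avx1_0")) return "AVX";
    if (cpu_has("hw.optional.sse4_2")) return "SSE4.2";
    if (cpu_has("hw.optional.sse4_1")) return "SSE4.1";
    if (cpu_has("hw.optional.sse3"))   return "SSE3";
    return "baseline";
}
```

    Call best_simd_level() once at launch and hand the result to whatever chooses the implementation. Treating a failed lookup as "not supported" keeps the fallback path safe on older systems that do not know a given key.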
     
  3. MorphingDragon, Apr 6, 2014
    Last edited: Apr 6, 2014

    MorphingDragon macrumors 603

    Joined:
    Mar 27, 2009
    Location:
    The World Inbetween
    #3
    If you're doing SIMD via intrinsics or assembly, the usual approach is to write multiple code paths for the kernels that need SIMD, then choose the right path at runtime. More advanced applications use runtime code generation. As gnasher mentioned, this is usually wrapped in an application-layer class that abstracts away the details.

    The code is untested; consider it C-style pseudocode.
    Code:
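    #include <stddef.h>

    enum { SSE3, SSE4, AVX, FMA };

    // Placeholder helpers, not defined here: read the processor information
    // and map it to one of the constants above.
    int ReadProc(void);
    int GetSIMDType(int procInfo);

    // The (data, n) parameter list stands in for whatever the kernel needs.
    void Kernel_SSE3(float *data, size_t n) {
       // SSE3 code
    }

    void Kernel_SSE4(float *data, size_t n) {
       // SSE4 code
    }

    void Kernel_AVX(float *data, size_t n) {
       // AVX code
    }

    void Kernel_FMA(float *data, size_t n) {
       // FMA code
    }

    // Set once at startup; every later call goes through this pointer.
    void (*g_KernelFunction)(float *data, size_t n) = NULL;

    int main(void) {
        int simdType = GetSIMDType(ReadProc());
        switch (simdType) {
            case SSE3: g_KernelFunction = Kernel_SSE3; break;
            case SSE4: g_KernelFunction = Kernel_SSE4; break;
            case AVX:  g_KernelFunction = Kernel_AVX;  break;
            case FMA:  g_KernelFunction = Kernel_FMA;  break;
        }
        return 0;
    }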
    
    If you're letting the compiler do the SIMD code generation, keep two things in mind. A) Most compilers don't give you that much granularity, at least not easily. B) Don't rely on the compiler to output SIMD code. Hand-written code and your brain are much better at that kind of optimization. Even the Intel compiler is terrible at SIMD optimization, because it's impossible to get the necessary context at compile time.
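
    The runtime dispatch described in this post can be sketched in a form that actually compiles. This is only an illustration under some assumptions: it relies on the GCC/Clang extensions `__attribute__((target("avx")))` and `__builtin_cpu_supports`, which are not part of standard C, and the names SumFn, sum_scalar, sum_avx and select_sum_kernel are invented for the example. The kernel (summing a float array) is a stand-in for real work.

```c
#include <stddef.h>

typedef float (*SumFn)(const float *data, size_t n);

/* Portable fallback; also the only path on non-x86 builds. */
float sum_scalar(const float *data, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX_PATH 1

/* Built with AVX enabled for this function only, so the rest of the
   file can still target an older baseline. */
__attribute__((target("avx")))
float sum_avx(const float *data, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)                  /* 8 floats per step */
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (int k = 0; k < 8; ++k) total += lanes[k];
    for (; i < n; ++i) total += data[i];        /* leftover elements */
    return total;
}
#endif

/* Chooses a kernel once; callers keep the returned pointer. */
SumFn select_sum_kernel(void) {
#ifdef HAVE_AVX_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx")) return sum_avx;
#endif
    return sum_scalar;
}
```

    Typical use is `SumFn sum = select_sum_kernel();` at startup, then `sum(data, n)` in the hot path. Note that the AVX path adds the elements in a different order than the scalar loop, so results can differ in the last bits for general floating-point input.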
     
  4. Dranix macrumors 6502a

    Joined:
    Feb 26, 2011
    Location:
    Gelnhausen, Germany
    #4
    Honestly, why care? All currently supported CPUs have at least SSE4.1, so simply compile for it.

    Or, if you do care, you could simply use OpenCL in CPU mode - the OpenCL compiler generates extremely nice SSE code.
     
  5. subsonix macrumors 68040

    Joined:
    Feb 2, 2008
    #5
    Depending on what you are doing, look into Apple's Accelerate framework. It picks the best option for the hardware you are running on, across all systems.
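
    As a small illustration of what that looks like, here is a sketch that adds two float arrays with vDSP_vadd from Accelerate. The wrapper name add_arrays is made up for this example, and the plain loop in the #else branch is only there so the file also builds where Accelerate is not available. On a Mac, link with -framework Accelerate.

```c
#include <stddef.h>
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

/* Hypothetical wrapper: element-wise out[i] = a[i] + b[i]. */
void add_arrays(const float *a, const float *b, float *out, size_t n) {
#ifdef __APPLE__
    /* Stride 1 on every array; Accelerate chooses the code path
       for the CPU it is running on. */
    vDSP_vadd(a, 1, b, 1, out, 1, (vDSP_Length)n);
#else
    /* Fallback so the sketch compiles without Accelerate. */
    for (size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
#endif
}
```

    The point is that the calling code contains no CPU checks at all; the framework handles the selection.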
     
  6. MorphingDragon, Apr 6, 2014
    Last edited: Apr 6, 2014

    MorphingDragon macrumors 603

    Joined:
    Mar 27, 2009
    Location:
    The World Inbetween
    #6
    Not always.

    AFAIK, OpenCL can't tell whether arrays alias, so only the vector arithmetic is sped up and loops are not optimized. There are other issues too, such as memory alignment.

    It depends on what he's trying to achieve. You shouldn't use loops in OpenCL anyway, since the kernel may end up running on the GPU if you just use the default device.
     
