macOS vDSP multiplication slower than loop? (Situationally)

Qaanol · Aug 5, 2014

Ive found a situation where a loop is faster than certain vectorized vDSP functions for double precision values, at least on my late-2013 15.4″ base-model rMBP running OS X 10.9.4; with single-precision, the vDSP functions are faster though.

I am using the Box-Muller transform to generate normally-distributed values. Ill attach my full code, but here is what transpired. I wrote a straightforward, non-vectorized implementation, and it works fine. Then I wanted to see if I could speed it up with vectorized functions, so I wrote that version, and it is indeed faster (it runs in about three-fourths the time).

For the heck of it, I decided to try replacing some vectorized function calls, notably multiplication, with loops instead. I kept the vectorized versions of complicated things like square roots and natural logs, but did the multiplication in loops. Sort of a hybrid between the two implementations. And strangely enough the hybrid version actually runs faster (about 10-15% less time) than the vectorized version.

Finally, I re-implemented all three versions using single-precision floats rather than doubles. But with floats, the vectorized version runs fastest (about 3-5% less time than the hybrid version). So I am not really sure what is going on with the double-precision case. Here are some typical run-times for generating 2^24 (approx. 16 million) normal values:

Float
Looped: 0.290083
Vector: 0.161175
Hybrid: 0.166553

Double
Looped: 0.461659
Vector: 0.332181
Hybrid: 0.298434

Any explanation for why vDSP multiplication of doubles runs slower than a loop in this situation?

Here is the code for the simple, non-vectorized, looped version:

Code:

#define SCALE_D 0x1p-53
#define RAND_53 ((random() << 22) | (random() >> 9))
#define RAND_D ((double)RAND_53)
#define EXP_D (-log(SCALE_D * (double)(RAND_53 + 1L)))

double randNormD(double * const x) {
    // Box-Muller transform
    const double r = sqrt(2.0 * EXP_D);
    const double t = (2.0 * (double)M_PI * SCALE_D) * RAND_D;
    *x = r * cos(t);
    return r * sin(t);
}

void loopNormD(double * const v, const int n) {
    for (int i = 0; i < n-1; i += 2) v[i] = randNormD(&v[i+1]);
    if ((n % 2) && (n > 0)) v[n-1] = randNormD(&v[n-1]);
}

The vectorized version:

Code:

#define BATCH_SIZE 1024
#define HALF_BATCH  512

void vecNormD(double * v, const int n) {
    for (int i = n/BATCH_SIZE; i-->0; v += BATCH_SIZE) vecNormHelpD(v);
    loopNormD(v, n%BATCH_SIZE);
}

void vecNormHelpD(double * const v) {
    const int h = HALF_BATCH;
    const double s = SCALE_D;
    const double p = 2.0 * M_PI * SCALE_D;
    const double negtwo = -2.0;
    double temp[h];
    
    for (int i = 0; i < h; i++) temp[i] = RAND_D;
    vDSP_vsmsaD(temp, 1, &s, &s, temp, 1, h);
    vvlog(temp, temp, &h);
    vDSP_vsmulD(temp, 1, &negtwo, temp, 1, h);
    vvsqrt(temp, temp, &h);
    
    for (int i = 0; i < h; i++) v[i] = RAND_D;
    vDSP_vsmulD(v, 1, &p, v, 1, h);
    vvsincos(v, v, v+h, &h);
    vDSP_vmulD(v, 1, temp, 1, v, 1, h);
    vDSP_vmulD(v+h, 1, temp, 1, v+h, 1, h);
}

And the hybrid version:

Code:

void vecNormD2(double * v, const int n) {
    for (int i = n/BATCH_SIZE; i-->0; v += BATCH_SIZE) vecNormHelpD2(v);
    loopNormD(v, n%BATCH_SIZE);
}

void vecNormHelpD2(double * const v) {
    const int h = HALF_BATCH;
    const double s = SCALE_D;
    const double p = 2.0 * (double)M_PI * SCALE_D;
    double * const w = v + h;
    double temp[h];
    
    for (int i = 0; i < h; i++) temp[i] = RAND_D;
    vDSP_vsmsaD(temp, 1, &s, &s, temp, 1, h);
    vvlog(temp, temp, &h);
    for (int i = 0; i < h; i++) temp[i] *= -2.0;
    vvsqrt(temp, temp, &h);
    
    for (int i = 0; i < h; i++) v[i] = p * RAND_D;
    vvsincos(v, v, v+h, &h);
    for (int i = 0; i < h; i++) {
        v[i] *= temp[i];
        w[i] *= temp[i];
    }
}

Full code attached. I linked against the Accelerate framework and compiled in Xcode 5.1.1 as a command-line tool with the default optimizations for a release version, and ran it in Terminal.

Qaanol · Aug 7, 2014

Update: I made a few tweaks, flagged the pointers as “restrict”, rearranged some arithmetic operations, and now the hybrid versions are about 10% faster than the vector versions for both single and double precision. I still have no idea why the vDSP multiplication functions are slower than loops. Anyone have insight? A hunch? Wild speculation?

I did a number of tests replacing some or all of the vector function calls with loops, and the hybrid versions here were the fastest:

Hybrid float:

Code:

void vecNormHelpF2(float * const restrict v) {
    const int h = HALF_BATCH;
    const float p = 2.0f * (float)M_PI * SCALE_F;
    float * const restrict w = v + h;
    float temp[h];
    
    for (int i = 0; i < h; i++) temp[i] = SCALE_F * RAND_F + SCALE_F;
    vvlogf(temp, temp, &h);
    for (int i = 0; i < h; i++) temp[i] *= -2.0f;
    vvsqrtf(temp, temp, &h);
    
    for (int i = 0; i < h; i++) v[i] = p * RAND_F;
    vvsincosf(v, v, w, &h);
    for (int i = 0; i < h; i++) {
        v[i] *= temp[i];
        w[i] *= temp[i];
    }
}

Hybrid double:

Code:

void vecNormHelpD2(double * const restrict v) {
    const int h = HALF_BATCH;
    const double p = 2.0 * (double)M_PI * SCALE_D;
    double * const restrict w = v + h;
    double temp[h];
    
    for (int i = 0; i < h; i++) temp[i] = SCALE_D * RAND_D + SCALE_D;
    vvlog(temp, temp, &h);
    for (int i = 0; i < h; i++) temp[i] *= -2.0;
    vvsqrt(temp, temp, &h);
    
    for (int i = 0; i < h; i++) v[i] = p * RAND_D;
    vvsincos(v, v, w, &h);
    for (int i = 0; i < h; i++) {
        v[i] *= temp[i];
        w[i] *= temp[i];
    }
}

Vectorized float (inexplicably slower than hybrid):

Code:

void vecNormHelpF(float * const restrict v) {
    const int h = HALF_BATCH;
    const float s = SCALE_F;
    const float p = 2.0f * (float)M_PI * SCALE_F;
    const float negtwo = -2.0f;
    float temp[h];
    
    for (int i = 0; i < h; i++) temp[i] = RAND_F;
    vDSP_vsmsa(temp, 1, &s, &s, temp, 1, h);
    vvlogf(temp, temp, &h);
    vDSP_vsmul(temp, 1, &negtwo, temp, 1, h);
    vvsqrtf(temp, temp, &h);
    
    for (int i = 0; i < h; i++) v[i] = RAND_F;
    vDSP_vsmul(v, 1, &p, v, 1, h);
    vvsincosf(v, v, v+h, &h);
    vDSP_vmul(v, 1, temp, 1, v, 1, h);
    vDSP_vmul(v+h, 1, temp, 1, v+h, 1, h);
}

And vectorized double (also inexplicably slower than hybrid):

Code:

void vecNormHelpD(double * const restrict v) {
    const int h = HALF_BATCH;
    const double s = SCALE_D;
    const double p = 2.0 * M_PI * SCALE_D;
    const double negtwo = -2.0;
    double temp[h];
    
    for (int i = 0; i < h; i++) temp[i] = RAND_D;
    vDSP_vsmsaD(temp, 1, &s, &s, temp, 1, h);
    vvlog(temp, temp, &h);
    vDSP_vsmulD(temp, 1, &negtwo, temp, 1, h);
    vvsqrt(temp, temp, &h);
    
    for (int i = 0; i < h; i++) v[i] = RAND_D;
    vDSP_vsmulD(v, 1, &p, v, 1, h);
    vvsincos(v, v, v+h, &h);
    vDSP_vmulD(v, 1, temp, 1, v, 1, h);
    vDSP_vmulD(v+h, 1, temp, 1, v+h, 1, h);
}

JustMartin · Aug 8, 2014

ok - wild speculation only. I suspect that either the compiler or the processor (or both) has spotted an optimisation route. Loops are relatively easy to optimise with techniques like predictive branching and speculative execution. Why this wouldn't work for the vector routines, I'm not sure, but perhaps they're coded in a relatively complex way that means the optimiser doesn't spot the opportunity.

Qaanol · Aug 8, 2014

JustMartin said:
ok - wild speculation only. I suspect that either the compiler or the processor (or both) has spotted an optimisation route. Loops are relatively easy to optimise with techniques like predictive branching and speculative execution. Why this wouldn't work for the vector routines, I'm not sure, but perhaps they're coded in a relatively complex way that means the optimiser doesn't spot the opportunity.

Ah yes, right you are. Looking at my code here in isolation, I think I see what's going on: the hybrid version saves 3 passes through the data, and lets the compiler heavily optimize the multiplication by -2. So the advantage really is specific to the particular sequence of calculations I'm performing.

The first operation, multiplying by and adding s, would normally be faster with vDSP_vsmsa than a separate loop. But in the hybrid version it doesn't actually require its own loop because it gets folded into the existing one that generates the uniform random values, so there is one fewer pass through the data. And the compiler is able to optimize the multiply-and-add in the loop.

Then there is multiplication by -2.0, which the compiler presumably replaces with some really fast bitwise operations. There is no way the general-purpose vectorized multiplication function can match that, since it can't "see" the -2.0 until runtime.

The multiplication by p gets folded into another loop, reducing the number of passes through the data.

And then multiplying v and w by temp goes in a single loop because temp gets reused, so the compiler can optimize it. If it were vectorized there would need to be two passes.

The log, square root, and trig functions stay vectorized because they are highly optimized and the compiler couldn't match that if they were folded into loops.

It all makes sense now, and I'm reassured that the vDSP functions really are faster than loops, as indeed they proved to be when I tested them in isolation. It was just the particular circumstances of my program that hit upon a situation where loops can do multiple things quickly in a single pass.

Search

Search

macOS vDSP multiplication slower than loop? (Situationally)

Qaanol

macrumors 6502a

Attachments

Qaanol

macrumors 6502a

JustMartin

macrumors 6502a

Qaanol

macrumors 6502a

Our Staff