macOS dispatch_apply creating inconsistent results?

exscape · Oct 10, 2009

I want to apologize in advance for the extremely ugly typecasting in the isPrime for loop...

I don't feel like fixing it right now. You'll see...
Anyway. I'm trying to learn GCD, but I'm stuck. I figured I'd try to parallelize a simple program that checks whether a number is prime or not.
It calls isPrime() in a loop a certain number of times and reports the number of primes found.

Using a for loop, the number returned is always the same (6542 for the first 2^16 numbers) - as expected. Using dispatch_apply(), the number changes between approx. 2700-4700 (a few runs showed 3442, 3593, 4033, 4336, 2701, 3571). I have no clue as to why! Here's the code:

Code:

#include <stdio.h>
#include <math.h>
#include <dispatch/dispatch.h>

#define NONPRIME 0
#define PRIME 1

short isPrime(unsigned long long);
int main()
{
    __block unsigned long primes = 0;
    unsigned long long max = (1ULL << 16ULL);

/* to switch from GCD, uncomment these two lines, comment the dispatch_apply() 
 * and }); lines, and the } ending the for loop. */
    //unsigned long long num = 0;
    //for (num=0; num < max; num++) {

    dispatch_apply(max, dispatch_get_global_queue(0, 0), ^(size_t num) {
        primes += isPrime(num);
        });
    //}
    printf("%lu primes found among the first %llu numbers\n", primes, max);

return 0;
}

short isPrime(unsigned long long n)
{
    unsigned long long j;
    if (n < 2)
        return NONPRIME;
    if (n == 2)
        return PRIME;
    if (n % 2 == 0)
        return NONPRIME;

    for (j=3; j<=((unsigned long long)((double)(sqrt((double)n)))+1); j+=2) {
        if (n % j == 0)
            return NONPRIME;
    }
    return PRIME;
}

What am I doing wrong?
Thanks in advance.

Catfish_Man · Oct 10, 2009

Try changing

Code:

primes += isPrime(num);

to

Code:

if (isPrime(num)) {
    OSAtomicIncrement64((volatile int64_t *)&primes);
}

(you'll need to include <libkern/OSAtomic.h>, and change primes to be an int64_t)

What's happening is that some of your threads will read a value N for primes, increment it to N', and then write out N' to memory without checking first to see if another thread has already incremented it. The second thread's increment therefore gets lost.

<edit>
Oh, I almost forgot: Welcome to multithreading hell! It'll be good to have some company here

</edit>

<edit2>
A somewhat less scary looking alternative is to create a dispatch_queue_t to guard the shared counter. You just have each iteration do something like this:

Code:

if (isPrime(num)) {
    dispatch_async(counterQueue, ^{ primes += 1; });
}

I'm not convinced it's really any clearer than the OSAtomic solution, but it does illustrate the technique.
</edit2>

exscape · Oct 10, 2009

Ah, yes, atomicity... Thanks! Works like a charm.
Too bad that GCD doesn't magically solve problems like these, though. I'm a novice when it comes to threaded programming. I've only used them for doing background operations, never as "worker threads" (although I've never really written any software that needed such a thing).

Anyway, as expected, the performance was almost doubled on my dual-core computer.

(The numbers are when max = 2^23, by the way.)

Code:

serenity@macbookpro ~ $ for i in {1..5}; do /usr/bin/time ./a.out 2>&1 | grep real; done
       12.22 real        12.06 user         0.05 sys
       11.99 real        11.87 user         0.05 sys
       11.99 real        11.87 user         0.04 sys
       11.98 real        11.87 user         0.04 sys
       11.97 real        11.87 user         0.03 sys
serenity@macbookpro ~ $ vim isprime.c 
serenity@macbookpro ~ $ gcc -Wall isprime.c -O2
serenity@macbookpro ~ $ for i in {1..5}; do /usr/bin/time ./a.out 2>&1 | grep real; done
        6.36 real        12.17 user         0.03 sys
        6.38 real        12.15 user         0.03 sys
        6.34 real        12.15 user         0.03 sys
        6.36 real        12.15 user         0.03 sys
        6.37 real        12.14 user         0.03 sys

Edit: I couldn't quite get your second idea to work, and it was a bit slower than the atomic add, at that. I guess dispatching 8 million blocks takes a bit of CPU time.

Here's what I tried, barely reading the docs:

Code:

/*
 * NOTE: This doesn't work!
 */

    dispatch_queue_t counterQueue = dispatch_queue_create("org.exscape.isprime.atomic", NULL);

    dispatch_apply(max, dispatch_get_global_queue(0, 0), ^(size_t num) {
        if (isPrime(num)) {
            dispatch_async(counterQueue, ^{ primes += 1; });
        }
    });
    printf("%lld primes found among the first %llu numbers\n", primes, max);

It occasionally prints the correct answer (564163), and is often very close, but that's obviously not good enough.

Catfish_Man · Oct 10, 2009

I *think* I see what's going on there. Your printf() is being hit before all the dispatch_async blocks finish. You can use dispatch_group_async and dispatch_group_wait to ensure that they're all done before you print the result. Alternatively, just use dispatch_sync for the +1. The added synchronization shouldn't cost too much parallelism with most of the work still being done async, it might even be cheaper.

You can probably get rid of some of the performance hit by copying the +1 block yourself before the dispatch_apply; that way dispatch_async won't be creating a new copy every time.

exscape · Oct 10, 2009

Catfish_Man said:
I *think* I see what's going on there. Your printf() is being hit before all the dispatch_async blocks finish. You can use dispatch_group_async and dispatch_group_wait to ensure that they're all done before you print the result. Alternatively, just use dispatch_sync for the +1. The added synchronization shouldn't cost too much parallelism with most of the work still being done async, it might even be cheaper.

You can probably get rid of some of the performance hit by copying the +1 block yourself before the dispatch_apply; that way dispatch_async won't be creating a new copy every time.

Right-o again.

Here's what I ended up with; performance is about 7.5-8 seconds per run.

Code:

  __block int64_t primes = 0;
    unsigned long long max = (1ULL << 23ULL);

    dispatch_queue_t counterQueue = dispatch_queue_create("org.exscape.isprime.atomic", NULL);
    dispatch_group_t asyncGroup = dispatch_group_create();

    dispatch_apply(max, dispatch_get_global_queue(0, 0), ^(size_t num) {
        if (isPrime(num)) {
            dispatch_group_async(asyncGroup, counterQueue, ^{ primes++; });
        }
    });
    dispatch_group_wait(asyncGroup, DISPATCH_TIME_FOREVER);
    printf("%lld primes found among the first %llu numbers\n", primes, max);

I tried for a while to make the ^{ primes++ } a block saved as a variable with little success, though. I haven't exactly mastered function pointers yet so I guess that is to expect

Catfish_Man · Oct 10, 2009

How about this:

Code:

__block int64_t primes = 0;
    unsigned long long max = (1ULL << 23ULL);

    dispatch_queue_t counterQueue = dispatch_queue_create("org.exscape.isprime.atomic", NULL);
    dispatch_block_t plusOne = [^{ primes++; } copy];
    void (^work)(size_t) = [^(size_t num) {
        if (isPrime(num)) {
            dispatch_sync(counterQueue, plusOne);
        }
    } copy];
    dispatch_apply(max, dispatch_get_global_queue(0, 0), work);
    [plusOne release];
    [work release];
    dispatch_release(counterQueue);
    printf("%lld primes found among the first %llu numbers\n", primes, max);

I'd also be interested to see whether dispatch_sync or dispatch_group_async end up winning in terms of performance for this

<edit> [foo copy] and [foo release] will need to be Block_copy() and Block_release() in a .c file; too used to objc

</edit>

Catfish_Man · Oct 10, 2009

Wow. So, I tried the dispatch_sync() with +1 block vs the OSAtomicIncrement (all blocks pre-copied in both cases). They're both almost exactly the same speed. Within 0.02 seconds of each other. I'm really impressed.

exscape · Oct 11, 2009

Catfish_Man said:
Wow. So, I tried the dispatch_sync() with +1 block vs the OSAtomicIncrement (all blocks pre-copied in both cases). They're both almost exactly the same speed. Within 0.02 seconds of each other. I'm really impressed.

Nice! Would you mind posting the full source for the dispatch_sync() version?

Catfish_Man · Oct 11, 2009

(this needs to be a .m file; I compiled with gcc -O2 -framework Foundation primes.m -o primes)

Code:

#include <stdio.h>
#include <math.h>
#include <dispatch/dispatch.h>
#include <libkern/OSAtomic.h>
#import <Foundation/Foundation.h>

#define NONPRIME 0
#define PRIME 1

#define USE_DISPATCH_SYNC 1

short isPrime(unsigned long long);
int main()
{
__block int64_t primes = 0;
    unsigned long long max = (1ULL << 23ULL);
#if USE_DISPATCH_SYNC
	dispatch_queue_t counterQueue = dispatch_queue_create("org.exscape.isprime.atomic", NULL);
    dispatch_block_t plusOne = [^{ primes++; } copy];
#endif
    void (^work)(size_t) = [^(size_t num) {
        if (isPrime(num)) {
#if USE_DISPATCH_SYNC
			dispatch_sync(counterQueue, plusOne);
#else
        	OSAtomicIncrement64((volatile void *)&primes);
#endif
        }
    } copy];
    dispatch_apply(max, dispatch_get_global_queue(0, 0), work);
    [work release];
#if USE_DISPATCH_SYNC
    [plusOne release];
	dispatch_release(counterQueue);
#endif
    printf("%lld primes found among the first %llu numbers\n", primes, max);
return 0;
}

short isPrime(unsigned long long n)
{
    unsigned long long j;
    if (n < 2)
        return NONPRIME;
    if (n == 2)
        return PRIME;
    if (n % 2 == 0)
        return NONPRIME;

    for (j=3; j<=((unsigned long long)((double)(sqrt((double)n)))+1); j+=2) {
        if (n % j == 0)
            return NONPRIME;
    }
    return PRIME;
}

gnasher729 · Oct 11, 2009

Have a look what's hidden in

<libkern/OSAtomic.h>

Whenever you are programming with multiple threads (and that's what happens when you dispatch blocks to various queues), you have to be careful when multiple threads want to access the same data. In your case, where you are just adding up numbers, using the atomic operations like OSAtomicAdd32 and OSAtomicAdd64 is all you need, but you also _do_ need them, unless you make your code quite complicated.

There are three important groups of functions in that file:

1. Atomic operations. These functions do very simple things like adding one number to another number, but they are thread-safe: If one thread uses them to add X to a variable, and another thread uses them to add Y to exactly the same variable, the total effect will be that (X + Y) will be added to the variable, as it should be.

2. Compare-and-swap operations: These are need if you want to change a value in a more complicated way. You read an old value yourself, then you calculate what the new value should be, then you call the function which will store the new value, but only if the old value is still unchanged. If not, you have to try again.

3. Atomic enqueue and dequeue: These functions let you add an element to the front of a single linked list, or remove an element from the front of a single linked list, in a thread-safe manner.

If these functions are enough to handle what you need, then they are the easiest and most efficient way to achieve this. So in your case, calling AtomicIncrement every time you found a prime number is a very simple, safe and efficient way of counting the prime numbers.

exscape · Oct 11, 2009

gnasher729 said:
If these functions are enough to handle what you need, then they are the easiest and most efficient way to achieve this. So in your case, calling AtomicIncrement every time you found a prime number is a very simple, safe and efficient way of counting the prime numbers.

They have downsides, though: they're OS dependant, GCD is not (since it's open sourced). There's already implementations of libdispatch and clang running on FreeBSD.

Krevnik · Oct 12, 2009

A couple comments about optimization:
- Be careful about what you take as a cross-platform optimization. What seems to be good for one platform might not actually be good for another, even if the code doesn't change.
- Some APIs do have direct counterparts on other platforms, and can easily be turned into cross-platform macros (the atomic increment/decrement/exchange methods are actually good candidates here). Since macros resolve at compile-time into the platform-dependent code, you don't take a perf hit for this sort of abstraction.

Another thing to look at is to cut down on the number of entries you add to the queue. Any work you can avoid doing the better. Since you know the status of 1 and 2 on if they are prime or not, you can skip them, and start at 3... (with the count set to 1) and you can skip all the evens because you know they can't be prime.

This can easily be done by doing something similar to a stride in the block. Pass ((max - 3)/2) into dispatch_apply, and then reverse the math in the loop itself: ((num * 2) + 3). That might get you even more perf out of the app as GCD will be queueing half the jobs, and you don't even have to evaluate them.

gnasher729 · Oct 12, 2009

exscape said:
They have downsides, though: they're OS dependant, GCD is not (since it's open sourced). There's already implementations of libdispatch and clang running on FreeBSD.

GCD doesn't run on the operating system that most Macintosh users are using. Which is MacOS X 10.5 "Leopard". Which is of course irrelevant because you are using it (or not) in the middle of GCD code...

But being serious, the atomic add functions do exactly what you wanted and nothing else; they are guaranteed to be present and working on all Mac hardware present and future (because they are created specifically for your machine when it is booting), and you can't do thread-safe programming without some kind of atomic operations, so you can assume that something similar is available on any OS, so you might as well live with them.

exscape · Oct 12, 2009

gnasher729 said:
GCD doesn't run on the operating system that most Macintosh users are using. Which is MacOS X 10.5 "Leopard". Which is of course irrelevant because you are using it (or not) in the middle of GCD code...

But being serious, the atomic add functions do exactly what you wanted and nothing else; they are guaranteed to be present and working on all Mac hardware present and future (because they are created specifically for your machine when it is booting), and you can't do thread-safe programming without some kind of atomic operations, so you can assume that something similar is available on any OS, so you might as well live with them.

Very true, but as you said, when we have GCD already... Plus, the atomic operations will invariably be different between OSes, eventually requiring #ifdef blocks for each supported OS - very ugly. Hopefully, GCD will make become rather large outside OS X, but I guess that depends quite a bit on the developments of LLVM/Clang.

At some point or another, most Macs will run 10.6. Granted, it WILL take a while, especially with the whole PPC deal, but GCD seems to be the future for multithreaded programming on OS X; at least, that seems to be Apple's goal.

BTW, you could use the same arguments to not use Core Data or NSOperation since there are still are Panther and Tiger users, respectively. At one point or another, you have to make a choice whether to use the new features or not, and leave some users out.

(FWIW, my only released Mac app uses Obj-C 2.0 and as such requires Leopard; I first released it around April 2008, ~6 months after Leopard's release.)

Krevnik · Oct 12, 2009

exscape said:
Very true, but as you said, when we have GCD already... Plus, the atomic operations will invariably be different between OSes, eventually requiring #ifdef blocks for each supported OS - very ugly. Hopefully, GCD will make become rather large outside OS X, but I guess that depends quite a bit on the developments of LLVM/Clang.

Ah, but the ifdef blocks can be pushed into headers, where the atomic operation is mapped to a macro. Ugly, maybe, but then your code is a lot more readable, and the cross-platform layer can live on its own and be maintained separately as well.

As I've already said, don't take the perf for granted. Who knows, your GCD shortcut here might not be so performant on BSD if you choose to pursue that.

Search

Search

macOS dispatch_apply creating inconsistent results?

exscape

macrumors member

Catfish_Man

macrumors 68030

exscape

macrumors member

Catfish_Man

macrumors 68030

exscape

macrumors member

Catfish_Man

macrumors 68030

Catfish_Man

macrumors 68030

exscape

macrumors member

Catfish_Man

macrumors 68030

gnasher729

Suspended

exscape

macrumors member

Krevnik

macrumors 601

gnasher729

Suspended

exscape

macrumors member

Krevnik

macrumors 601

Our Staff