View Full Version : parallel code




farmerdoug
Dec 7, 2011, 07:50 AM
I have a new Mac with two quad-core processors. I was wondering how well it uses the eight cores.
If I have a for loop

for (i = 0; i < 16; i++)
{
    x[i] = i*i;
}

would it handle it as well as

for (i = 0; i < 9; i += 8)
{
    x[i] = i*i;
    x[i+1] = (i+1)*(i+1);
    x[i+2] = (i+2)*(i+2);
    /* etc. */
}

thanks



robbieduncan
Dec 7, 2011, 07:55 AM
Unless you have an auto-parallelising compiler neither of your solutions will use more than one core. You have to split your code into discrete work units and run them across the cores, either managing the threads manually or using something like Grand Central Dispatch.

farmerdoug
Dec 7, 2011, 08:01 AM
I read somewhere that my latter approach would work. You're saying no, huh?
The code manipulates data from a 4-million-pixel focal plane and uses non-parallel routines from an external library, but much of the code is just loops over the pixels. Do you think that Grand Central Dispatch would help?
Thanks.

robbieduncan
Dec 7, 2011, 08:14 AM
I read somewhere that my latter approach would work. You're saying no, huh?
The code manipulates data from a 4-million-pixel focal plane and uses non-parallel routines from an external library, but much of the code is just loops over the pixels. Do you think that Grand Central Dispatch would help?
Thanks.

It might help an auto-parallelising compiler parallelise the loop content. But without such a compiler it will make no difference. Parallelising 16 multiply operations is unlikely to be worth the effort anyway. Unless you are doing hundreds of these loops within an outer loop, the cost of thread management/dispatch will probably outweigh the time saved.

farmerdoug
Dec 7, 2011, 08:18 AM
For example (NCHAN is at least 28, CUBEWIDTH = 256):

for(w = 0; w < NCHAN; w++)
{
    lambda = LAMBDA_MIN + lambdainc*w;
    for(row = 1; row < CUBEWIDTH - 1; row++)
    {
        for(col = 1; col < CUBEWIDTH - 1; col++)
        {
            here = cube[w*CUBEWIDTH*CUBEWIDTH + row*CUBEWIDTH + col];
            if(lookuptable[col][row][2] > 0 && here != 0.00)
            {
                Xref = lookuptable[col][row][0] + Token.xshift;
                Yref = lookuptable[col][row][1] + Token.yshift;
                height = heightmap[col*CUBEWIDTH + row];
                vertscale = (1760 - 1100)/height;
                tilt = tiltmap[row*CUBEWIDTH + col];
                crudeXcntr = (int)lroundf(Xref + (LAMBDA_REF - lambda)/vertscale*tilt);
                crudeYcntr = (int)lroundf(Yref + (LAMBDA_REF - lambda)/vertscale);
                if(crudeXcntr > BORDER && crudeXcntr < FOCPLWIDTH - BORDER - 1 && crudeYcntr > BORDER && crudeYcntr < FOCPLWIDTH - BORDER - 1)
                {
                    /* if the current lenslet/wavelength combination is extracted,
                       form an array of the surrounding ones and check for a low pixel */
                    surround_count = 0;

                    if(lookuptable[col][row+1][2] > 0)
                    {   /* above */
                        surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row+1)*CUBEWIDTH + col];
                        surround_count++;
                    }
                    if(lookuptable[col][row-1][2] > 0)
                    {   /* below */
                        surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row-1)*CUBEWIDTH + col];
                        surround_count++;
                    }
                    if(lookuptable[col-1][row][2] > 0)
                    {   /* left */
                        surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + row*CUBEWIDTH + col-1];
                        surround_count++;
                    }
                    if(lookuptable[col+1][row][2] > 0)
                    {   /* right */
                        surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + row*CUBEWIDTH + col+1];
                        surround_count++;
                    }
                    if(lookuptable[col-1][row+1][2] > 0)
                    {   /* above left */
                        surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row+1)*CUBEWIDTH + col-1];
                        surround_count++;
                    }
                    if(lookuptable[col+1][row+1][2] > 0)
                    {   /* above right */
                        surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row+1)*CUBEWIDTH + col+1];
                        surround_count++;
                    }
                    if(lookuptable[col-1][row-1][2] > 0)
                    {   /* below left */
                        surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row-1)*CUBEWIDTH + col-1];
                        surround_count++;
                    }
                    if(lookuptable[col+1][row-1][2] > 0)
                    {   /* below right */
                        surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row-1)*CUBEWIDTH + col+1];
                        surround_count++;
                    }

                    if(surround_count > 2)
                    {
                        qsort(surround, surround_count, sizeof(float), compare_floats);
                        if(surround_count % 2 == 0)
                        {
                            medneighb = (surround[surround_count/2 - 1] + surround[surround_count/2])/2;
                        }
                        else
                        {
                            medneighb = surround[surround_count/2];
                        }
                        if(here < medneighb - thresh && (here < 0 || here < lfr*medneighb))
                        {   /* replace low pixel */
                            /* begin debug */
                            /*
                            if(w == 0)
                            {
                                printf("Lenslet %d, %d: neighb arr:", col, row);
                                for(i = 0; i < surround_count; i++)
                                {
                                    printf(" %0.2f", surround[i]);
                                }
                                printf(";\there = %0.2f, median = %0.2f, thresh = %0.2f\n", here, medneighb, thresh);
                            }
                            */
                            /* end of debug */
                            cube[w*CUBEWIDTH*CUBEWIDTH + row*CUBEWIDTH + col] = medneighb;
                            low_count++;
                        }
                    }
                }
            }
        }
    }
}

robbieduncan
Dec 7, 2011, 08:26 AM
I am not going to spend time and energy attempting to read that. If each iteration of the outer loop is independent (so running two or more at once would not cause them to impact each other) I would look into using GCD to schedule each one, and then you will see some improvement from the threading. Essentially this means that each iteration of the loop must write only to memory that it owns: either array indexes different from any other loop iteration's, or loop-local variables. If you need to write to shared variables (in your code as-is, surround_count would not be loop-local but could be made so, and your writes to cube look very much like they would clash) you need to use semaphores or other locks to ensure atomicity of writes.

Note that threading can be a minefield of complexity and debugging. Unless you are 100% sure the operations you parallelise are truly atomic, you will have significant, non-repeatable problems.

farmerdoug
Dec 7, 2011, 08:32 AM
I didn't expect you to read it; just get some idea of the complexity of the code.
Thanks for your help. Time to find an intern.

Mac_Max
Dec 7, 2011, 09:26 AM
The best-case scenario for multithreading is something where you can easily use a divide-and-conquer strategy. For example, video encoding is great because you can pass off sets of frames to each thread. If you can't do that, you can often still use threads for multiple jobs, so you don't have to wait for one process to finish before starting another. There can be little annoyances to watch out for when employing this, especially if you're doing asynchronous IO.

ghellquist
Dec 7, 2011, 12:05 PM
It will not happen by itself. You need some help from the compiler, and the compiler will need some help from you.

I believe OpenMP is one direction to look; it is supported in the Mac compilers.
http://openmp.org/wp/

A very different option is to use the graphics processor; after all, GPUs are made for handling pixels.

CUDA is the place to start, I believe.
http://en.wikipedia.org/wiki/CUDA

// Gunnar

gnasher729
Dec 7, 2011, 03:14 PM
There are many ways to make your code run a lot faster.

1. Array access should be in sequential order in memory. Your inner loop runs over col, so if at all possible, rearrange the data in lookuptable and heightmap so that you access lookuptable[row][col] and heightmap[row*CUBEWIDTH + col].

2. Simplify the calculations. For example let's change this one bit by bit:

crudeXcntr = (int)lroundf(Xref + (LAMBDA_REF - lambda)/vertscale*tilt);
if(crudeXcntr > BORDER && crudeXcntr < FOCPLWIDTH - BORDER - 1 ...)


lroundf is a rather expensive function, so we change this to

float crudeXcntr = Xref + (LAMBDA_REF - lambda)/vertscale*tilt;
if(crudeXcntr > BORDER + 0.5 && crudeXcntr < FOCPLWIDTH - BORDER - 1.5 ...)


Now you calculated vertscale = (1760 - 1100)/height. Instead of dividing by vertscale, we multiply by height / (1760 - 1100), so we get

crudeXcntr = Xref + (LAMBDA_REF - lambda)/(1760 - 1100) * height*tilt;

(LAMBDA_REF-lambda)/(1760 - 1100) can be calculated outside the loop, so we got rid of one function call and two divisions, which are quite slow.

3. I would think that the slowest bit by far is calling qsort to find the median. You then check this:

if(here < medneighb - thresh && (here < 0 || here < lfr*medneighb))

The values here, thresh and lfr are all known before the median is computed, so you should be able to calculate lower and upper bounds and check lower < medneighb && medneighb < upper. Then, instead of sorting the array, which is very expensive, count how many values are <= lower and collect the ones between lower and upper. Say you have 7 values: if there are no values between lower and upper, then the median isn't between lower and upper; if four or more values are < lower, the median is < lower; and if three or fewer values are < upper, the median is >= upper.

lloyddean
Dec 7, 2011, 06:13 PM
Doug, may I ask what you're doing your programming on?

Will this program be running on your computer only?

If not what other machines might it possibly be run on?

gnasher729
Dec 7, 2011, 06:31 PM
There are many ways to make your code run a lot faster.


4. Since the average is so much easier and faster to find than the median, and usually has better mathematical properties, is there any particular reason to use the median and not the average?

farmerdoug
Dec 7, 2011, 07:50 PM
Lloyd: This code will only be run on Macs and is being programmed on a Mac.
Gnasher: The median is the correct average for focal-plane noise.
Thanks for the programming tips.
ghellquist: You may be right about a graphics processor, but to say we are processing images is to greatly simplify the code.
Max: If I have several images, they have to be processed in sequence, but one of my libraries supposedly cannot be used in parallel code.

lloyddean
Dec 7, 2011, 11:33 PM
In that case I would suggest OpenCL instead of CUDA. CUDA may have a slight speed advantage, but I think OpenCL is now available on more platforms.

If coded to OpenCL, the code could make use of both the CPU and as many GPUs as are available.

Either way, if I've got Doug figured correctly, it's an "intern" project shortly. No slight intended; it's just more work than I think he wishes to expend.

farmerdoug
Dec 8, 2011, 07:10 AM
Lloyd.
You got that right. I have a fairly full plate at the moment; the code normally runs in about 45 secs, but sometimes we take images every 1.5 secs. I expect a large improvement (x2-3) just from replacing the old iMac with a newer machine. I might try implementing some of the coding suggestions that were posted. I don't suppose you want to volunteer to work on the search for exoplanets?

Anybody?

lloyddean
Dec 8, 2011, 11:57 AM
While I would love to take on a project like this, a neck injury with upper-torso spasms doesn't allow me the concentration necessary to guarantee I could finish any such project in a reasonable amount of time.

You're going to have to look elsewhere!

farmerdoug
Dec 8, 2011, 12:04 PM
Lloyd

I hope you are kidding about the neck injury, but I have the feeling you aren't. Hope you recover quickly.

However, if the delivery date is your only worry, it's not an issue. You really don't think this hacker is going to get it done any sooner, do you?

Doug.
We could take this offline if I knew how.

robbieduncan
Dec 8, 2011, 12:10 PM
We could take this offline if I knew how.

PM each other to exchange email addresses? If you click on lloyddean a menu should drop down with "Send a private message to lloyddean". Of course he might have that turned off for non-Moderators/Admins so it is possible you won't see that...

farmerdoug
Dec 8, 2011, 12:11 PM
Thanks

gnasher729
Dec 8, 2011, 07:34 PM
Lloyd.
You got that right. I have a fairly full plate at the moment; the code normally runs in about 45 secs, but sometimes we take images every 1.5 secs. I expect a large improvement (x2-3) just from replacing the old iMac with a newer machine. I might try implementing some of the coding suggestions that were posted. I don't suppose you want to volunteer to work on the search for exoplanets?

The code that you posted shouldn't take 45 seconds. With NCHAN = 28 and CUBEWIDTH = 256, 45 seconds would be about 24,000 nanoseconds per pixel. This code shouldn't take anywhere near that long. There are a few things we cannot see, like the data types and the compare_floats function, but what I see shouldn't take more than a second.

Have you run your code under the profiler? (In Xcode 4: Product->Perform Action->Profile.)

farmerdoug
Dec 8, 2011, 08:47 PM
That was just a snippet. The full code is attached.

lloyddean
Dec 8, 2011, 08:54 PM
How about a sample "image" to process?

farmerdoug
Dec 8, 2011, 09:11 PM
A single image is 16 MB, a little over the forum limit, and in an unacceptable format. I would send you the paper describing the code, but it's nearly 3 MB. Here's the abstract; questions welcome. I'll see if I can send a jpeg of an image. I can send the paper in a private email if you are interested.

PUBLICATIONS OF THE ASTRONOMICAL SOCIETY OF THE PACIFIC, 123:746–763, 2011 June © 2011. The Astronomical Society of the Pacific. All rights reserved. Printed in U.S.A.
A Data-Cube Extraction Pipeline for a Coronagraphic Integral Field Spectrograph
NEIL ZIMMERMAN,1,2 DOUGLAS BRENNER,2 BEN R. OPPENHEIMER,1,2 IAN R. PARRY,3 SASHA HINKLEY,4,5 STEPHANIE HUNT,3 AND ROBIN ROBERTS2
Received 2010 July 2; accepted 2011 April 26; published 2011 May 24
ABSTRACT. Project 1640 is a high-contrast near-infrared instrument probing the vicinities of nearby stars through the unique combination of an integral field spectrograph with a Lyot coronagraph and a high-order adaptive optics system. The extraordinary data-reduction demands, similar to those that several new exoplanet imaging instruments will face in the near future, have been met by the novel software algorithms described herein. The Project 1640 Data Cube Extraction Pipeline (PCXP) automates the translation of 3.8 × 10^4 closely packed, coarsely sampled spectra to a data cube. We implement a robust empirical model of the spectrograph focal-plane geometry to register the detector image at subpixel precision, and we map the cube extraction. We demonstrate our ability to accurately retrieve source spectra based on an observation of Saturn's moon Titan.

farmerdoug
Dec 8, 2011, 09:28 PM
Try this. The center picture is part of an image. The focal plane has 4 million pixels; you are seeing 10000 of them. The location of the spectra changes with telescope position and possibly other factors. Taking this into account is one of the things that the code does.

gnasher729
Dec 9, 2011, 04:13 AM
Just an example how to use Grand Central Dispatch in a command line tool:

#include <dispatch/dispatch.h>
#include <stdio.h>
#include <stdlib.h>

static void do_all_work (dispatch_group_t group, dispatch_queue_t main_queue, dispatch_queue_t bg_queue)
{
    for (int i = 1; i <= 10; ++i) {
        dispatch_group_async (group, main_queue, ^{
            printf ("Starting task #%d\n", i);
            dispatch_group_async (group, bg_queue, ^{
                double sum = 0;
                double limit = i * 10000;
                for (double k = 1.0; k <= limit; k += 1) sum += 1.0 / k;
                dispatch_group_async (group, main_queue, ^{
                    printf ("Task #%d: result = %.3f\n", i, sum);
                });
            });
        });
    }
}

int main (int argc, const char * argv[])
{
    dispatch_queue_t bg_queue = dispatch_get_global_queue (DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_queue_t main_queue = dispatch_get_main_queue ();
    dispatch_group_t group = dispatch_group_create ();

    do_all_work (group, main_queue, bg_queue);

    dispatch_async (bg_queue, ^{
        dispatch_group_wait (group, DISPATCH_TIME_FOREVER);
        dispatch_release (group);
        dispatch_async (main_queue, ^{ exit (0); });
    });
    dispatch_main ();
    return 0;
}


You'd copy the main () as it is, and change do_all_work as you need. Things that can be run in a background thread are put inside

dispatch_group_async (group, bg_queue, ^{
    // Your code goes here
});


Things that have to run on the main thread are put inside


dispatch_group_async (group, main_queue, ^{
    // Your code goes here
});


So as an example, calling


for (w = 0; w < NCHAN; ++w)
    dispatch_group_async (group, bg_queue, ^{
        call_some_function (w);
    });


will put each call to call_some_function (w) on a separate thread. (I had to change this example; the code before the change ran all calls on a single background thread.)

farmerdoug
Dec 9, 2011, 10:15 AM
Definitely looks above my present abilities. I may have a new student who can take it on, but my offer to Lloyd extends to you.
Thanks

jared_kipe
Dec 9, 2011, 10:30 AM
For implementing a for loop style there is a more specific function:

dispatch_apply(size_t iterations, dispatch_queue_t queue, void (^block)(size_t));

This does 'iterations' dispatches, passing 0 through (iterations - 1) to the block or function so it can keep track of which iteration it is on.

GCD will automatically optimize thread creation based on system resources.
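A hypothetical application of dispatch_apply to the outer channel loop might look like the sketch below. process_channel is a made-up stand-in for the per-channel body, whose iterations must be independent (no writes to shared memory); the code is macOS-specific and needs <dispatch/dispatch.h>:

```c
#include <dispatch/dispatch.h>

#define NCHAN 28

extern void process_channel(int w);   /* hypothetical per-channel body */

void extract_all_channels(void)
{
    dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    /* Runs NCHAN block invocations across the available cores and
       returns only after every iteration has finished. */
    dispatch_apply(NCHAN, q, ^(size_t w) {
        process_channel((int)w);      /* must not write memory shared with other channels */
    });
}
```

Unlike a hand-rolled dispatch_group loop, dispatch_apply blocks until all iterations are done, so no separate wait step is needed.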