macOS parallel code

farmerdoug · Dec 7, 2011

I have a new mac with two quad core processors. I was wondering how well it uses the eight cores.
If I have a for loop
for (i = 0; i < 16; i++)
{
x = i*i;
}

would it handle it as well as
for (i = 0; i < 9; i +=8)
{
x = i*i;
x[i+1] = (i+1)*(i+1);
x[i +2] = (i+2)*(i+2);
etc
}
thanks

robbieduncan · Dec 7, 2011

Unless you have an auto-parallelising compiler neither of your solutions will use more than one core. You have to split your code into discrete work units and run them across the cores, either managing the threads manually or using something like Grand Central Dispatch.

farmerdoug · Dec 7, 2011

I read somewhere that my latter approach would work. You're saying no huh?
The code manipulates data from a 4 million pixel focal plane and uses non-parallel routines from an external library but much of the code is just loops over the pixels. Do you think that Grand Central Dispatch would help?
thanks.

robbieduncan · Dec 7, 2011

farmerdoug said:
I read somewhere that my latter approach would work. You're saying no huh?
The code manipulates data from a 4 million pixel focal plane and uses non-parallel routines from an external library but much of the code is just loops over the pixels. Do you think that Grand Central Dispatch would help?
thanks.

It might help an auto-parallelising compiler parallelise the loop content. But without such a compiler it will make no difference. Parallelising 16 multiply operations is unlikely to be worth the effort anyway. Unless you are doing hundreds of these loops within an outer loop then the cost of thread management/dispatch will probably outweigh the time saved

farmerdoug · Dec 7, 2011

for example
NCHAN is at least 28
CUBEWIDTH = 256

Code:

for(w=0; w<NCHAN; w++)
	{
		lambda = LAMBDA_MIN + lambdainc*w;
		for(row=1; row<CUBEWIDTH-1; row++)
		{
			for(col=1; col<CUBEWIDTH-1; col++)
			{
				here = cube[w*CUBEWIDTH*CUBEWIDTH + row*CUBEWIDTH + col];
				if(lookuptable[col][row][2] > 0 && here != 0.00)
				{
					Xref = lookuptable[col][row][0] + Token.xshift;
					Yref = lookuptable[col][row][1] + Token.yshift;
					height = heightmap[col*CUBEWIDTH + row];
					vertscale = (1760 - 1100)/height;
					tilt = tiltmap[row*CUBEWIDTH + col];
					crudeXcntr = (int)lroundf(Xref + (LAMBDA_REF - lambda)/vertscale*tilt);
					crudeYcntr = (int)lroundf(Yref + (LAMBDA_REF - lambda)/vertscale);
					if(crudeXcntr > BORDER && crudeXcntr < FOCPLWIDTH - BORDER - 1 && crudeYcntr > BORDER && crudeYcntr < FOCPLWIDTH - BORDER - 1)
					{
						//if current lenslet, wavelength combination is extracted, form array of surrounding ones and check for low pixel
						surround_count = 0;

						if(lookuptable[col][row+1][2] > 0)
						{//above
							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row+1)*CUBEWIDTH + col];
							surround_count++;
						}
						if(lookuptable[col][row-1][2] > 0)
						{//below
							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row-1)*CUBEWIDTH + col];
							surround_count++;
						}
						if(lookuptable[col-1][row][2] > 0)
						{//left
							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + row*CUBEWIDTH + col-1];
							surround_count++;
						}
						if(lookuptable[col+1][row][2] > 0)
						{//right
							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + row*CUBEWIDTH + col+1];
							surround_count++;
						}
						if(lookuptable[col-1][row+1][2] > 0)
						{//above left
							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row+1)*CUBEWIDTH + col-1];
							surround_count++;
						}
						if(lookuptable[col+1][row+1][2] > 0)
						{//above right
							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row+1)*CUBEWIDTH + col+1];
							surround_count++;
						}
						if(lookuptable[col-1][row-1][2] > 0)
						{//below left
							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row-1)*CUBEWIDTH + col-1];
							surround_count++;
						}
						if(lookuptable[col+1][row-1][2] > 0)
						{//below right
							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row-1)*CUBEWIDTH + col+1];
							surround_count++;
						}
						
						if(surround_count > 2)
						{	
							qsort(surround, surround_count, sizeof(float), compare_floats);
							if(surround_count % 2 == 0)
							{
								medneighb = (surround[surround_count/2 - 1] + surround[surround_count/2])/2;
							}
							else
							{
								medneighb = surround[surround_count/2]/2;
							}
							if(here < medneighb - thresh && (here < 0 || here < lfr*medneighb))
							{//replace low pixel
								/* begin debug */
								/*
								if(w == 0)
								{
									printf("Lenslet %d, %d: neighb arr:", col, row);
									for(i = 0; i < surround_count; i++)
									{
										printf(" %0.2f", surround[i]);
									}
									printf(";\there = %0.2f, median = %0.2f, thresh = %0.2f\n", here, medneighb, thresh);
								}
								*/
								/* end of debug */
								cube[w*CUBEWIDTH*CUBEWIDTH + row*CUBEWIDTH + col] = medneighb;
								low_count++;
							}
						}
					}
				}
			}
		}
	}

robbieduncan · Dec 7, 2011

I am not going to spend time and energy attempting to read that. If each iteration of the outer loop is independent (so running two or more at once would not cause them to impact each other) I would look into use GCD to schedule each one and then you will see some improvement from the threading. Essentially this means that each iteration of the loop must write to areas of memory that it owns: either different array indexes from any other loop iteration of loop-local variables. If you need to write to shared variables (in your code as is surround count wound not be loop-local but could be made so, your writes to cube look very like they would clash) you need to use semaphores or other locks to ensure atomicity of writes.

Note threading can be a minefield of complexity and debugging. Unless you are 100% sure the operations you parallelise are truly atomic you will have significant, non-repeatable, problems.

farmerdoug · Dec 7, 2011

I didn't expect you to read it; just get some idea of the complexity of the code.
Thanks for your help. Time to find an intern.

Mac_Max · Dec 7, 2011

The best case scenario for multitasking is something you can easily use a divide and conquer strategy. For example, video encoding is great because you can pass off sets of frames to each thread. Often though if you can't do that, you can use threads for multiple jobs. That way you don't have to wait for one process to finish before starting another. There can be little annoyances you have to watch out for when employing this, especially if you're doing some asynchronous IO.

ghellquist · Dec 7, 2011

It will not happen by itself. You need to get some help from the compiler. And the compiler will need some help from you.

I believe openmp is one direction to look. I believe it is supported in the mac compilers.
http://openmp.org/wp/

A very different option is to use the graphics processor, after all they are made for handling pixels.

Cuda is the place to start, I believe.
http://en.wikipedia.org/wiki/CUDA

// Gunnar

gnasher729 · Dec 7, 2011

There are many ways to make your code run a lot faster.

1. Array access should be in sequential order in memory. If at all possible, rearrange the data in lookuptable and tiltmap so that you access lookuptable [row][col] and tiltmap [col*CUBEWIDTH + row].

2. Simplify the calculations. For example let's change this one bit by bit:

Code:

crudeXcntr = (int)lroundf(Xref + (LAMBDA_REF - lambda)/vertscale*tilt);
if(crudeXcntr > BORDER && crudeXcntr < FOCPLWIDTH - BORDER - 1 ...)

lroundf is a rather expensive function, so we change this to

Code:

float crudeXcntr = Xref + (LAMBDA_REF - lambda)/vertscale*tilt;
if(crudeXcntr > BORDER + 0.5 && crudeXcntr < FOCPLWIDTH - BORDER - 1.5 ...)

Now you calculated vertscale = (1760 - 1100)/height. Instead of dividing by vertscale, we multiply by height / (1760 - 1100), so we get

Code:

crudeXcntr = Xref + (LAMBDA_REF - lambda)/(1760 - 1100) * height*tilt;

(LAMBDA_REF-lambda)/(1760 - 1100) can be calculated outside the loop, so we got rid of one function call and two divisions, which are quite slow.

3. I would think that the slowest bit by far is calling qsort to find the median. You then check this:

Code:

if(here < medneighb - thresh && (here < 0 || here < lfr*medneighb))

The values here, thresh and lfr seem to be constant, so you should be able to calculate some lower and upper bounds so you can check lower < medneighb && medneighb < upper. Now instead of sorting the array, which is very expensive, count how many values are <= lower, and collect the ones between lower and upper. Let's say you have 7 values. If there are no values between lower and upper, then the median isn't between lower and upper. If there are four or more values < lower then the median is less than lower, and if three or fewer values are < upper, the median is > upper.

lloyddean · Dec 7, 2011

Doug, may I ask what you're doing your programming on?

Will this program be running on your computer only?

If not what other machines might it possibly be run on?

gnasher729 · Dec 7, 2011

gnasher729 said:
There are many ways to make your code run a lot faster.

4. Since the average is so much easier and faster to find than the median, and usually has better mathematical properties, is there any particular reason to use the median and not the average?

farmerdoug · Dec 7, 2011

Lloyd: This code will only be run on MAC's and is being programmed on a MAC.
Gnasher: Median is the correct average for focal plane noise.
Thanks for the programming tips.
ghellquist. You may be right about a graphic processor but to say we are processing images is to greatly simply the code.
Max. If I have several images, they have to be processed in sequence but one of my libraries supposedly can not be used in parallel code.

lloyddean · Dec 7, 2011

In that case I would suggest OpenCL instead of CUDA. CUDA may have the slightest speed advantage but I think OpenCL is now available on more platforms.

If coded to OpenCL the code could make use of both CPU and as many GPU's as are available.

Either way if I've got Doug figured correctly it's an "Intern" project shortly. No slight intended it's just more work than I think he wishes to expend.

farmerdoug · Dec 8, 2011

Lloyd.
You got that right. I have a fairly full plate at the moment; the code normally runs in about 45 secs but sometimes we take images every 1.5 secs. I expect a large improvement (x2 - 3) just replacing the old IMAC with a newer machine. I might try implementing some of coding suggestions that were posted. I don't suppose you want to volunteer to work on the search for exoplanets.

Anybody?

lloyddean · Dec 8, 2011

While I would love to take on a project like this a neck injury with upper torso spasms don't allow me the concentration necessary to guarantee I could finish any such project in any reasonable amount of time.

You're going to have to look elsewhere!

farmerdoug · Dec 8, 2011

Lloyd

I hope you are kidding about the neck injury but I have the feeling you aren't. Hope you recover quickly.

However, if delivery date is your only worry. Its not an issue. You really don't think this hacker is going to get it done any sooner, do you?

Doug.
we can take this offline if I knew how.

robbieduncan · Dec 8, 2011

farmerdoug said:
we can take this offline if I knew how.

PM each other to exchange email addresses? If you click on lloyddean a menu should drop down with "Send a private message to lloyddean". Of course he might have that turned off for non-Moderators/Admins so it is possible you won't see that...

farmerdoug · Dec 8, 2011

Thanks

gnasher729 · Dec 8, 2011

farmerdoug said:
Lloyd.
You got that right. I have a fairly full plate at the moment; the code normally runs in about 45 secs but sometimes we take images every 1.5 secs. I expect a large improvement (x2 - 3) just replacing the old IMAC with a newer machine. I might try implementing some of coding suggestions that were posted. I don't suppose you want to volunteer to work on the search for exoplanets.

The code that you posted shouldn't take 45 seconds. With NCHAN = 28, CUBEWIDTH = 256, that would be about 24,000 nanoseconds per pixel. This code shouldn't take anywhere near that long. There are a few things we cannot see, like the data types, and the compare_floats function, but what I see shouldn't take more than a second.

Have you run your code under the profiler? (In XCode 4, Product->Perform Action->Profile)

farmerdoug · Dec 8, 2011

The was just a snippet. The full code is attached.

lloyddean · Dec 8, 2011

How about a sample "image" to process?

farmerdoug · Dec 8, 2011

A single image is 16Mbytes- a little over the forum limit - an in an unacceptable format. I would sent you the paper describing the code but its nearly 3 MB. I'll see if I can send a jpeg of an image.
Here's the abstract. Questions welcome. I'll see if I can send a jpeg of an image. I can send the paper in a private email if you are interested.

PUBLICATIONS OF THE ASTRONOMICAL SOCIETY OF THE PACIFIC, 123:746763, 2011 June © 2011. The Astronomical Society of the Pacific. All rights reserved. Printed in U.S.A.
A Data-Cube Extraction Pipeline for a Coronagraphic Integral Field Spectrograph
NEIL ZIMMERMAN,1,2 DOUGLAS BRENNER,2 BEN R. OPPENHEIMER,1,2 IAN R. PARRY,3 SASHA HINKLEY,4,5 STEPHANIE HUNT,3 AND ROBIN ROBERTS2
Received 2010 July 2; accepted 2011 April 26; published 2011 May 24
ABSTRACT. Project 1640 is a high-contrast near-infrared instrument probing the vicinities of nearby stars through the unique combination of an integral field spectrograph with a Lyot coronagraph and a high-order adaptive optics system. The extraordinary data-reduction demands, similar to those that several new exoplanet imaging in- struments will face in the near future, have been met by the novel software algorithms described herein. The Project 1640 Data Cube Extraction Pipeline (PCXP) automates the translation of 3:8 × 104 closely packed, coarsely sampled spectra to a data cube. We implement a robust empirical model of the spectrograph focal-plane geometry to register the detector image at subpixel precision, and we map the cube extraction. We demonstrate our ability to accurately retrieve source spectra based on an observation of Saturns moon Titan.

farmerdoug · Dec 8, 2011

Try this. The center picture is part of an image. The focal plane has 4 million pixels; you are seeing 10000 of them. The location of the spectra changes with telescope position and possibly other factors. Taking this into account is one of the things that the code does.

gnasher729 · Dec 9, 2011

Just an example how to use Grand Central Dispatch in a command line tool:

Code:

#include <dispatch/dispatch.h>

static void do_all_work (dispatch_group_t group, dispatch_queue_t main_queue, dispatch_queue_t bg_queue)
{
	for (int i = 1; i <= 10; ++i) {
		dispatch_group_async (group, main_queue, ^{
			printf ("Starting task #%d\n", i);
			dispatch_group_async (group, bg_queue, ^{
				double sum = 0; 
				double limit = i * 10000;
				for (double k = 1.0; k <= limit; k += 1) sum += 1.0 / k;
				dispatch_group_async (group, main_queue, ^{
					printf ("Task #%d: result = %.3f\n", i, sum);
				});
			});
		});
	}
}

int main (int argc, const char * argv[])
{
	dispatch_queue_t bg_queue = dispatch_get_global_queue (DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
	dispatch_queue_t main_queue = dispatch_get_main_queue ();
	dispatch_group_t group = dispatch_group_create();

	do_all_work (group, main_queue, bg_queue);

	dispatch_async (bg_queue, ^{
		dispatch_group_wait (group, DISPATCH_TIME_FOREVER);
		dispatch_release (group);
		dispatch_async(main_queue, ^{ exit (0); });
	});
	dispatch_main ();
	return 0;
}

You'd copy the main () as it is, and change do_all_work as you need. Things that can be run in a background thread are put inside

Code:

			dispatch_group_async (group, bg_queue, ^{
				// Your code goes here
			});

Things that have to run on the main thread are put inside

Code:

			dispatch_group_async (group, main_queue, ^{
				// Your code goes here
			});

So as an example, calling

Code:

			for (w = 0; w < NCHAN; ++w)
					dispatch_group_async (group, bg_queue, ^{
						call_some_function (w);
				});

will put each call to call_some_function (w) on a separate thread. (Had to change this example, the code before the change ran all calls on a single background thread)

macOS parallel code

macrumors 6502a

Moderator emeritus

macrumors 6502a

Moderator emeritus

macrumors 6502a

Moderator emeritus

macrumors 6502a

macrumors 6502

macrumors regular

Suspended

macrumors 65816

Suspended

macrumors 6502a

macrumors 65816

macrumors 6502a

macrumors 65816

macrumors 6502a

Moderator emeritus

macrumors 6502a

Suspended

macrumors 6502a

Attachments

macrumors 65816

macrumors 6502a

macrumors 6502a

Attachments

Suspended

Our Staff