parallel code

Discussion in 'Mac Programming' started by farmerdoug, Dec 7, 2011.

  1. macrumors 6502a

    Joined:
    Sep 16, 2008
    #1
    I have a new mac with two quad core processors. I was wondering how well it uses the eight cores.
    If I have a for loop
    for (i = 0; i < 16; i++)
    {
    x = i*i;
    }

    would it handle it as well as
    for (i = 0; i < 9; i +=8)
    {
    x = i*i;
    x[i+1] = (i+1)*(i+1);
    x[i +2] = (i+2)*(i+2);
    etc
    }
    thanks
     
  2. Moderator emeritus

    robbieduncan

    Joined:
    Jul 24, 2002
    Location:
    London
    #2
    Unless you have an auto-parallelising compiler neither of your solutions will use more than one core. You have to split your code into discrete work units and run them across the cores, either managing the threads manually or using something like Grand Central Dispatch.
     
  3. thread starter macrumors 6502a

    Joined:
    Sep 16, 2008
    #3
    I read somewhere that my latter approach would work. You're saying no huh?
    The code manipulates data from a 4 million pixel focal plane and uses non-parallel routines from an external library but much of the code is just loops over the pixels. Do you think that Grand Central Dispatch would help?
    thanks.
     
  4. Moderator emeritus

    robbieduncan

    Joined:
    Jul 24, 2002
    Location:
    London
    #4
    It might help an auto-parallelising compiler parallelise the loop content. But without such a compiler it will make no difference. Parallelising 16 multiply operations is unlikely to be worth the effort anyway. Unless you are doing hundreds of these loops within an outer loop then the cost of thread management/dispatch will probably outweigh the time saved
     
  5. thread starter macrumors 6502a

    Joined:
    Sep 16, 2008
    #5
    for example
    NCHAN is at least 28
    CUBEWIDTH = 256
    Code:
    for(w=0; w<NCHAN; w++)
    	{
    		lambda = LAMBDA_MIN + lambdainc*w;
    		for(row=1; row<CUBEWIDTH-1; row++)
    		{
    			for(col=1; col<CUBEWIDTH-1; col++)
    			{
    				here = cube[w*CUBEWIDTH*CUBEWIDTH + row*CUBEWIDTH + col];
    				if(lookuptable[col][row][2] > 0 && here != 0.00)
    				{
    					Xref = lookuptable[col][row][0] + Token.xshift;
    					Yref = lookuptable[col][row][1] + Token.yshift;
    					height = heightmap[col*CUBEWIDTH + row];
    					vertscale = (1760 - 1100)/height;
    					tilt = tiltmap[row*CUBEWIDTH + col];
    					crudeXcntr = (int)lroundf(Xref + (LAMBDA_REF - lambda)/vertscale*tilt);
    					crudeYcntr = (int)lroundf(Yref + (LAMBDA_REF - lambda)/vertscale);
    					if(crudeXcntr > BORDER && crudeXcntr < FOCPLWIDTH - BORDER - 1 && crudeYcntr > BORDER && crudeYcntr < FOCPLWIDTH - BORDER - 1)
    					{
    						//if current lenslet, wavelength combination is extracted, form array of surrounding ones and check for low pixel
    						surround_count = 0;
    
    						if(lookuptable[col][row+1][2] > 0)
    						{//above
    							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row+1)*CUBEWIDTH + col];
    							surround_count++;
    						}
    						if(lookuptable[col][row-1][2] > 0)
    						{//below
    							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row-1)*CUBEWIDTH + col];
    							surround_count++;
    						}
    						if(lookuptable[col-1][row][2] > 0)
    						{//left
    							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + row*CUBEWIDTH + col-1];
    							surround_count++;
    						}
    						if(lookuptable[col+1][row][2] > 0)
    						{//right
    							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + row*CUBEWIDTH + col+1];
    							surround_count++;
    						}
    						if(lookuptable[col-1][row+1][2] > 0)
    						{//above left
    							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row+1)*CUBEWIDTH + col-1];
    							surround_count++;
    						}
    						if(lookuptable[col+1][row+1][2] > 0)
    						{//above right
    							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row+1)*CUBEWIDTH + col+1];
    							surround_count++;
    						}
    						if(lookuptable[col-1][row-1][2] > 0)
    						{//below left
    							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row-1)*CUBEWIDTH + col-1];
    							surround_count++;
    						}
    						if(lookuptable[col+1][row-1][2] > 0)
    						{//below right
    							surround[surround_count] = cube[w*CUBEWIDTH*CUBEWIDTH + (row-1)*CUBEWIDTH + col+1];
    							surround_count++;
    						}
    						
    						if(surround_count > 2)
    						{	
    							qsort(surround, surround_count, sizeof(float), compare_floats);
    							if(surround_count % 2 == 0)
    							{
    								medneighb = (surround[surround_count/2 - 1] + surround[surround_count/2])/2;
    							}
    							else
    							{
    								medneighb = surround[surround_count/2]/2;
    							}
    							if(here < medneighb - thresh && (here < 0 || here < lfr*medneighb))
    							{//replace low pixel
    								/* begin debug */
    								/*
    								if(w == 0)
    								{
    									printf("Lenslet %d, %d: neighb arr:", col, row);
    									for(i = 0; i < surround_count; i++)
    									{
    										printf(" %0.2f", surround[i]);
    									}
    									printf(";\there = %0.2f, median = %0.2f, thresh = %0.2f\n", here, medneighb, thresh);
    								}
    								*/
    								/* end of debug */
    								cube[w*CUBEWIDTH*CUBEWIDTH + row*CUBEWIDTH + col] = medneighb;
    								low_count++;
    							}
    						}
    					}
    				}
    			}
    		}
    	}
    
     
  6. Moderator emeritus

    robbieduncan

    Joined:
    Jul 24, 2002
    Location:
    London
    #6
    I am not going to spend time and energy attempting to read that. If each iteration of the outer loop is independent (so running two or more at once would not cause them to impact each other) I would look into use GCD to schedule each one and then you will see some improvement from the threading. Essentially this means that each iteration of the loop must write to areas of memory that it owns: either different array indexes from any other loop iteration of loop-local variables. If you need to write to shared variables (in your code as is surround count wound not be loop-local but could be made so, your writes to cube look very like they would clash) you need to use semaphores or other locks to ensure atomicity of writes.

    Note threading can be a minefield of complexity and debugging. Unless you are 100% sure the operations you parallelise are truly atomic you will have significant, non-repeatable, problems.
     
  7. thread starter macrumors 6502a

    Joined:
    Sep 16, 2008
    #7
    I didn't expect you to read it; just get some idea of the complexity of the code.
    Thanks for your help. Time to find an intern.
     
  8. macrumors 6502

    Joined:
    Mar 8, 2004
    #8
    The best case scenario for multitasking is something you can easily use a divide and conquer strategy. For example, video encoding is great because you can pass off sets of frames to each thread. Often though if you can't do that, you can use threads for multiple jobs. That way you don't have to wait for one process to finish before starting another. There can be little annoyances you have to watch out for when employing this, especially if you're doing some asynchronous IO.
     
  9. macrumors regular

    Joined:
    Aug 21, 2011
    Location:
    Stockholm Sweden
    #9
    It will not happen by itself. You need to get some help from the compiler. And the compiler will need some help from you.

    I believe openmp is one direction to look. I believe it is supported in the mac compilers.
    http://openmp.org/wp/

    A very different option is to use the graphics processor, after all they are made for handling pixels.

    Cuda is the place to start, I believe.
    http://en.wikipedia.org/wiki/CUDA

    // Gunnar
     
  10. gnasher729, Dec 7, 2011
    Last edited: Dec 7, 2011

    macrumors G5

    gnasher729

    Joined:
    Nov 25, 2005
    #10
    There are many ways to make your code run a lot faster.

    1. Array access should be in sequential order in memory. If at all possible, rearrange the data in lookuptable and tiltmap so that you access lookuptable [row][col] and tiltmap [col*CUBEWIDTH + row].

    2. Simplify the calculations. For example let's change this one bit by bit:

    Code:
    crudeXcntr = (int)lroundf(Xref + (LAMBDA_REF - lambda)/vertscale*tilt);
    if(crudeXcntr > BORDER && crudeXcntr < FOCPLWIDTH - BORDER - 1 ...)
    
    lroundf is a rather expensive function, so we change this to

    Code:
    float crudeXcntr = Xref + (LAMBDA_REF - lambda)/vertscale*tilt;
    if(crudeXcntr > BORDER + 0.5 && crudeXcntr < FOCPLWIDTH - BORDER - 1.5 ...)
    
    Now you calculated vertscale = (1760 - 1100)/height. Instead of dividing by vertscale, we multiply by height / (1760 - 1100), so we get

    Code:
    crudeXcntr = Xref + (LAMBDA_REF - lambda)/(1760 - 1100) * height*tilt;
    (LAMBDA_REF-lambda)/(1760 - 1100) can be calculated outside the loop, so we got rid of one function call and two divisions, which are quite slow.

    3. I would think that the slowest bit by far is calling qsort to find the median. You then check this:

    Code:
    if(here < medneighb - thresh && (here < 0 || here < lfr*medneighb))
    The values here, thresh and lfr seem to be constant, so you should be able to calculate some lower and upper bounds so you can check lower < medneighb && medneighb < upper. Now instead of sorting the array, which is very expensive, count how many values are <= lower, and collect the ones between lower and upper. Let's say you have 7 values. If there are no values between lower and upper, then the median isn't between lower and upper. If there are four or more values < lower then the median is less than lower, and if three or fewer values are < upper, the median is > upper.
     
  11. macrumors 6502a

    Joined:
    May 10, 2009
    Location:
    Des Moines, WA
    #11
    Doug, may I ask what you're doing your programming on?

    Will this program be running on your computer only?

    If not what other machines might it possibly be run on?
     
  12. macrumors G5

    gnasher729

    Joined:
    Nov 25, 2005
    #12
    4. Since the average is so much easier and faster to find than the median, and usually has better mathematical properties, is there any particular reason to use the median and not the average?
     
  13. thread starter macrumors 6502a

    Joined:
    Sep 16, 2008
    #13
    Lloyd: This code will only be run on MAC's and is being programmed on a MAC.
    Gnasher: Median is the correct average for focal plane noise.
    Thanks for the programming tips.
    ghellquist. You may be right about a graphic processor but to say we are processing images is to greatly simply the code.
    Max. If I have several images, they have to be processed in sequence but one of my libraries supposedly can not be used in parallel code.
     
  14. macrumors 6502a

    Joined:
    May 10, 2009
    Location:
    Des Moines, WA
    #14
    In that case I would suggest OpenCL instead of CUDA. CUDA may have the slightest speed advantage but I think OpenCL is now available on more platforms.

    If coded to OpenCL the code could make use of both CPU and as many GPU's as are available.

    Either way if I've got Doug figured correctly it's an "Intern" project shortly. No slight intended it's just more work than I think he wishes to expend.
     
  15. thread starter macrumors 6502a

    Joined:
    Sep 16, 2008
    #15
    Lloyd.
    You got that right. I have a fairly full plate at the moment; the code normally runs in about 45 secs but sometimes we take images every 1.5 secs. I expect a large improvement (x2 - 3) just replacing the old IMAC with a newer machine. I might try implementing some of coding suggestions that were posted. I don't suppose you want to volunteer to work on the search for exoplanets.

    Anybody?
     
  16. macrumors 6502a

    Joined:
    May 10, 2009
    Location:
    Des Moines, WA
    #16
    While I would love to take on a project like this a neck injury with upper torso spasms don't allow me the concentration necessary to guarantee I could finish any such project in any reasonable amount of time.

    You're going to have to look elsewhere!
     
  17. thread starter macrumors 6502a

    Joined:
    Sep 16, 2008
    #17
    Lloyd

    I hope you are kidding about the neck injury but I have the feeling you aren't. Hope you recover quickly.

    However, if delivery date is your only worry. Its not an issue. You really don't think this hacker is going to get it done any sooner, do you?

    Doug.
    we can take this offline if I knew how.
     
  18. Moderator emeritus

    robbieduncan

    Joined:
    Jul 24, 2002
    Location:
    London
    #18
    PM each other to exchange email addresses? If you click on lloyddean a menu should drop down with "Send a private message to lloyddean". Of course he might have that turned off for non-Moderators/Admins so it is possible you won't see that...
     
  19. thread starter macrumors 6502a

    Joined:
    Sep 16, 2008
  20. macrumors G5

    gnasher729

    Joined:
    Nov 25, 2005
    #20
    The code that you posted shouldn't take 45 seconds. With NCHAN = 28, CUBEWIDTH = 256, that would be about 24,000 nanoseconds per pixel. This code shouldn't take anywhere near that long. There are a few things we cannot see, like the data types, and the compare_floats function, but what I see shouldn't take more than a second.

    Have you run your code under the profiler? (In XCode 4, Product->Perform Action->Profile)
     
  21. thread starter macrumors 6502a

    Joined:
    Sep 16, 2008
    #21
    The was just a snippet. The full code is attached.
     

    Attached Files:

  22. macrumors 6502a

    Joined:
    May 10, 2009
    Location:
    Des Moines, WA
  23. thread starter macrumors 6502a

    Joined:
    Sep 16, 2008
    #23
    A single image is 16Mbytes- a little over the forum limit - an in an unacceptable format. I would sent you the paper describing the code but its nearly 3 MB. I'll see if I can send a jpeg of an image.
    Here's the abstract. Questions welcome. I'll see if I can send a jpeg of an image. I can send the paper in a private email if you are interested.

    PUBLICATIONS OF THE ASTRONOMICAL SOCIETY OF THE PACIFIC, 123:746–763, 2011 June © 2011. The Astronomical Society of the Pacific. All rights reserved. Printed in U.S.A.
    A Data-Cube Extraction Pipeline for a Coronagraphic Integral Field Spectrograph
    NEIL ZIMMERMAN,1,2 DOUGLAS BRENNER,2 BEN R. OPPENHEIMER,1,2 IAN R. PARRY,3 SASHA HINKLEY,4,5 STEPHANIE HUNT,3 AND ROBIN ROBERTS2
    Received 2010 July 2; accepted 2011 April 26; published 2011 May 24
    ABSTRACT. Project 1640 is a high-contrast near-infrared instrument probing the vicinities of nearby stars through the unique combination of an integral field spectrograph with a Lyot coronagraph and a high-order adaptive optics system. The extraordinary data-reduction demands, similar to those that several new exoplanet imaging in- struments will face in the near future, have been met by the novel software algorithms described herein. The Project 1640 Data Cube Extraction Pipeline (PCXP) automates the translation of 3:8 × 104 closely packed, coarsely sampled spectra to a data cube. We implement a robust empirical model of the spectrograph focal-plane geometry to register the detector image at subpixel precision, and we map the cube extraction. We demonstrate our ability to accurately retrieve source spectra based on an observation of Saturn’s moon Titan.
     
  24. thread starter macrumors 6502a

    Joined:
    Sep 16, 2008
    #24
    Try this. The center picture is part of an image. The focal plane has 4 million pixels; you are seeing 10000 of them. The location of the spectra changes with telescope position and possibly other factors. Taking this into account is one of the things that the code does.
     

    Attached Files:

  25. gnasher729, Dec 9, 2011
    Last edited: Dec 9, 2011

    macrumors G5

    gnasher729

    Joined:
    Nov 25, 2005
    #25
    Just an example how to use Grand Central Dispatch in a command line tool:

    Code:
    #include <dispatch/dispatch.h>
    
    static void do_all_work (dispatch_group_t group, dispatch_queue_t main_queue, dispatch_queue_t bg_queue)
    {
    	for (int i = 1; i <= 10; ++i) {
    		dispatch_group_async (group, main_queue, ^{
    			printf ("Starting task #%d\n", i);
    			dispatch_group_async (group, bg_queue, ^{
    				double sum = 0; 
    				double limit = i * 10000;
    				for (double k = 1.0; k <= limit; k += 1) sum += 1.0 / k;
    				dispatch_group_async (group, main_queue, ^{
    					printf ("Task #%d: result = %.3f\n", i, sum);
    				});
    			});
    		});
    	}
    }
    
    int main (int argc, const char * argv[])
    {
    	dispatch_queue_t bg_queue = dispatch_get_global_queue (DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    	dispatch_queue_t main_queue = dispatch_get_main_queue ();
    	dispatch_group_t group = dispatch_group_create();
    
    	do_all_work (group, main_queue, bg_queue);
    
    	dispatch_async (bg_queue, ^{
    		dispatch_group_wait (group, DISPATCH_TIME_FOREVER);
    		dispatch_release (group);
    		dispatch_async(main_queue, ^{ exit (0); });
    	});
    	dispatch_main ();
    	return 0;
    }
    
    You'd copy the main () as it is, and change do_all_work as you need. Things that can be run in a background thread are put inside

    Code:
    			dispatch_group_async (group, bg_queue, ^{
    				// Your code goes here
    			});
    
    Things that have to run on the main thread are put inside


    Code:
    			dispatch_group_async (group, main_queue, ^{
    				// Your code goes here
    			});
    
    So as an example, calling


    Code:
    			for (w = 0; w < NCHAN; ++w)
    					dispatch_group_async (group, bg_queue, ^{
    						call_some_function (w);
    				});
    
    will put each call to call_some_function (w) on a separate thread. (Had to change this example, the code before the change ran all calls on a single background thread)
     

Share This Page