
View Full Version : The Basic Idea Behind Multi-Core Coding Is ... ?




astrostu
Jul 26, 2007, 03:15 PM
I'm writing a lot of Java code for my research work. I'm using a machine that has 4 cores, and a lot of the code loops through lists of data points, compares each one to a number, and, depending on the result, either copies the value to another array or moves on to the next one (grossly simplified, but you get the idea).

My question is, to make this take advantage of four processors, is the basic idea that I create 4 threads, have the first thread loop through the first 25% of the values, the second thread the next 25%, and so on, with each one automatically running on a different processor? (And then I combine the lists afterwards.)

Or is it more complicated?



iSee
Jul 26, 2007, 05:46 PM
Yes, that's the right idea. You don't really get a guarantee that each thread will run on a separate processor core, but you can't take advantage of the extra cores without threads (or processes) to run on them.

Before you go to the trouble, though, check one thing to make sure this will really help (maybe you've already determined this):

When you run the same computations using your single-threaded solution, is the CPU usage near 100%? (Activity Monitor, in the Utilities subfolder of the Applications folder, gives you an easy way to monitor this, if you didn't know.)

If so, the multiple threads will probably help you. As you describe it, the threads can really churn away independently until the end, so they should help a lot.

savar
Jul 26, 2007, 09:12 PM
I'm no expert, but I'll throw some wood on the fire:

Multicore is at first glance no different than multi-thread or multi-process. The OS scheduler will distribute threads and processes across the multiple cores as efficiently as it knows how to.*

The most significant difference that I can think of is that each core has its own cache. So whereas two threads running on one core can share the same data in the cache, if the threads are on separate cores then each must access the memory separately.

So to amend my first statement, you want to create threads or processes which operate on logically separable chunks of data. So yeah, chunking the data into fourths will probably use the cores pretty efficiently.

What's more, if two threads are writing to the same part of memory, then each core is going to be constantly invalidating the other core's cache.

So I think that what seems like a simple problem actually becomes fairly complex, even if you ignore the inherent problems with multithreading. Luckily there will be multicore APIs coming out soon...these should abstract out most of the complexity of writing efficiently parallelizable code.
*Well...actually it doesn't. The OS X scheduler is actually still very naive. But it's a good bet that it will be much improved in Leopard.

lazydog
Jul 27, 2007, 03:50 AM
Hi
I'm not an expert either, but I've got a suggestion. Imagine that the first 3 threads finish processing their data much faster than the fourth, for whatever reason. Your program will then have to wait until the fourth one finishes, during which time your 3 other cores will be sitting around doing nothing. So I think it might be better to divide your data set into more segments than you have cores and then have a scheduler that allocates segments to threads. When a thread finishes a segment, it asks the scheduler for another segment.

I think this would work quite well if you can use JDK 5.0. Like I said, I'm not an expert on this sort of thing, but I understand that JDK 5.0 has CAS (compare-and-set) types which do not need synchronisation. This would let you implement a scheduler that allocates segments without needing to synchronise its methods.
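A minimal sketch of that idea, assuming JDK 5.0's java.util.concurrent.atomic package: the scheduler is just an AtomicInteger handing out segment indices (getAndIncrement is built on compare-and-set, so there are no synchronised methods). The class name SegmentDispenser and the segment count are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Lock-free "scheduler": hands out segment indices to worker threads.
// getAndIncrement uses CAS under the hood, so no synchronisation needed.
public class SegmentDispenser {
    private final AtomicInteger next = new AtomicInteger(0);
    private final int totalSegments;

    public SegmentDispenser(int totalSegments) {
        this.totalSegments = totalSegments;
    }

    // Returns the next unclaimed segment index, or -1 when none remain.
    // Safe to call from any number of threads concurrently.
    public int nextSegment() {
        int s = next.getAndIncrement();
        return s < totalSegments ? s : -1;
    }
}
```

Each worker would then just loop `while ((s = dispenser.nextSegment()) >= 0) { process(s); }`, so a fast thread naturally picks up more segments than a slow one.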

hope this helps

b e n

Flynnstone
Jul 27, 2007, 07:42 AM
You might even want to consider breaking into 8 "chunks" (or 16). Eight cores are likely to become more common.
The best way to determine is to test. Try 1 through 16 chunks and time it.

robbieduncan
Jul 27, 2007, 08:09 AM
As alluded to above, there are a lot of pitfalls to multi-threaded processing. I know: I am about to start acceptance testing of a system that includes some Java I wrote that uses a massive number of threads (over 120 in one configuration).

The biggest killer in terms of performance is cross-thread synchronisation. You want to avoid this at all costs.

For example, instead of having one global array that the threads copy the "good" items into, have one per thread. Then merge these at the end. This will be much, much faster.
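A sketch of that layout, roughly in the shape of the original poster's problem: each thread filters its own chunk into a private list (no synchronisation in the hot loop), and the lists are merged once at the end. The data, the "good item" test (> 0), and the class name are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// One result list per thread, merged after all threads join.
public class PerThreadResults {
    public static List<Integer> collectPositives(final int[] data, int nThreads) {
        final List<List<Integer>> perThread = new ArrayList<List<Integer>>();
        List<Thread> threads = new ArrayList<Thread>();
        int chunk = (data.length + nThreads - 1) / nThreads;  // ceiling division
        for (int t = 0; t < nThreads; t++) {
            final int start = t * chunk;
            final int end = Math.min(start + chunk, data.length);
            final List<Integer> mine = new ArrayList<Integer>(); // private to this thread
            perThread.add(mine);
            threads.add(new Thread(new Runnable() {
                public void run() {
                    for (int i = start; i < end; i++)
                        if (data[i] > 0) mine.add(data[i]);      // no locking here
                }
            }));
        }
        for (Thread t : threads) t.start();
        try {
            for (Thread t : threads) t.join();                   // wait for all chunks
        } catch (InterruptedException ex) {
            throw new RuntimeException(ex);
        }
        List<Integer> merged = new ArrayList<Integer>();         // single merge at the end
        for (List<Integer> part : perThread) merged.addAll(part);
        return merged;
    }
}
```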

Oh, and if you're using Java 1.4 (not sure about 1.5) add this to the command line: -XX:+UseConcMarkSweepGC For me on Solaris this results in a 100% or better performance gain.

astrostu
Jul 27, 2007, 03:06 PM
When you run the same computations using your single-threaded solution, is the CPU usage near 100%?

Yes.


If so, the multiple threads will probably help you. As you describe it, the threads can really churn away independently until the end, so they should help a lot.

That's the hope.


... if two threads are writing to the same part of memory, then each core is going to be constantly invalidating the other core's cache.

The plan is to create a dummy array that each thread will write to (one per thread) and then when they're done to combine them into the final array, avoiding the issue of overwriting.


Imagine that the first 3 threads finish processing their data much faster than the fourth, for whatever reason. Your program will then have to wait until the fourth one finishes, during which time your 3 other cores will be sitting around doing nothing. So I think it might be better to divide your data set into more segments than you have cores and then have a scheduler that allocates segments to threads. When a thread finishes a segment, it asks the scheduler for another segment.

I don't think that will be an issue because I'd effectively be dividing a circle in half (for two threads) or into quadrants (for four threads), and hence each should take the same amount of time.


You might even want to consider breaking into 8 "chunks" (or 16). Eight cores are likely to become more common.
The best way to determine is to test. Try 1 through 16 chunks and time it.

See the above comment. This is something I'll probably try timing and see what happens.


The biggest killer in terms of performance is cross-thread synchronisation. You want to avoid this at all costs.

For example, instead of having one global array that the threads copy the "good" items into, have one per thread. Then merge these at the end. This will be much, much faster.

Yep, that's the plan.


Oh, and if you're using Java 1.4 (not sure about 1.5) add this to the command line: -XX:+UseConcMarkSweepGC For me on Solaris this results in a 100% or better performance gain.

What does that do? I know next to nothing about compiling options.

lazydog
Jul 27, 2007, 04:15 PM
Yes.
I don't think that will be an issue because I'd effectively be dividing a circle in half (for two threads) or into quadrants (for four threads), and hence each should take the same amount of time.


Aren't you assuming that the 4 threads will be running, all the time, on separate cores? It might even be that having more threads than cores improves times. I guess some tests would answer all this.

b e n

Krevnik
Jul 27, 2007, 05:30 PM
The most significant difference that I can think of is that each core has its own cache. So whereas two threads running on one core can share the same data in the cache, if the threads are on separate cores then each must access the memory separately.

Not entirely accurate. The Intel Core line shares cache between the cores on each die. The catch is that Core 2 Duos, for example, have a single cache, while Core 2 Quads (Kentsfield, Clovertown) and dual-processor Xeon systems have two sets of cache, as you actually have two dies on the chip (early Core 2 Quads are two Core 2 Duos in a single package).

So, while your comments should be /assumed/ for the sake of design purposes, it isn't the case in hardware. (But since you can't tell if your cache is shared or not, you should assume it isn't)

ChrisA
Jul 27, 2007, 06:31 PM
This problem has a classic solution. One name for it is "boss and workers". Workers ask the boss for something to do; the boss gives them some data (or just tells them what data to work on, and the worker goes and gets it itself). When a job is done, the worker sends the result to the boss. The key here is to decide how big each job needs to be. You can also adjust the number of workers. The workers can be threads, or they can be processes running on networked computers. The prime example of this is SETI@home, but a lot of animated films are rendered this way too.
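The boss-and-workers pattern can be sketched with JDK 5.0's ExecutorService: the boss submits small jobs to a fixed pool, idle workers pull the next job, and the boss collects results via Futures. The job here (summing a chunk of an array) and the class name are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// "Boss and workers" via a thread pool: the boss hands out chunk-sized
// jobs; whichever worker is free takes the next one.
public class BossWorkers {
    public static long sumInChunks(final int[] data, int chunkSize, int nWorkers) {
        ExecutorService pool = Executors.newFixedThreadPool(nWorkers); // the workers
        try {
            List<Future<Long>> results = new ArrayList<Future<Long>>();
            for (int start = 0; start < data.length; start += chunkSize) {
                final int s = start;
                final int e = Math.min(start + chunkSize, data.length);
                results.add(pool.submit(new Callable<Long>() {   // boss hands out a job
                    public Long call() {
                        long sum = 0;
                        for (int i = s; i < e; i++) sum += data[i];
                        return sum;
                    }
                }));
            }
            long total = 0;
            for (Future<Long> f : results) total += f.get();     // boss collects results
            return total;
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        } finally {
            pool.shutdown();
        }
    }
}
```

Making chunkSize smaller than dataLength / nWorkers is what gives the load balancing: a worker that finishes early simply takes another job instead of idling.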

robbieduncan
Jul 28, 2007, 02:54 AM
What does that do? I know next to nothing about compiling options.

It's a runtime option (you add it to the java command, not javac) that enables the concurrent, multi-threaded garbage collector instead of the default non-concurrent, single-threaded one. What I was seeing with the normal GC was that whenever it ran, all my threads got paused whilst it executed, which resulted in noticeable stalls in processing. With the concurrent GC my threads do not get paused, so processing continues whilst garbage collection occurs.
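Concretely, the flag goes on the launch command like this ("MyAnalysis" is a hypothetical main class standing in for whatever you normally run):

```shell
# Runtime flag for the java launcher (not javac): use the concurrent
# mark-sweep collector so GC pauses don't stall the worker threads.
java -XX:+UseConcMarkSweepGC MyAnalysis
```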

netwalker
Jul 28, 2007, 07:31 AM
Read and learn about concurrency, especially the new java.util.concurrent framework in Java 1.5.

A starting point:
http://java.sun.com/docs/books/tutorial/essential/concurrency/index.html

stadidas
Jul 29, 2007, 08:16 AM
*Well...actually it doesn't. The OS X scheduler is actually still very naive. But it's a good bet that it will be much improved in Leopard.

One of the new developer builds that came out recently had a newly rewritten low-level scheduler to take advantage of multi-core systems. Hopefully massive speed gains will ensue.