CUDA performance change after the OSX 10.8.2 update?




cornelius1
Sep 22, 2012, 09:34 PM
On my retina MacBook Pro with GT 650M (with CUDA driver 5.0.24), I'm seeing 3-4x slower CUDA performance after the OSX 10.8.2 update. For example, with CUDA-Z I'm getting a single-precision float performance of 95 Gflop/s on OSX 10.8.2, compared to the 321 Gflop/s result on OSX 10.8.1 listed here: http://www.barefeats.com/rogue02.html. Also, CUDA tasks in BOINC seem to be 3-4x slower than on OSX 10.8.1. I'm wondering if it's only me. Is anyone else seeing a significant change in CUDA performance after the 10.8.2 update?

Also, what "single-precision float" result do you get on CUDA-Z (under the Performance tab), and which Nvidia GPU, CUDA driver version, and OSX version are you using?

CUDA-Z (beta version) can be found here: http://sourceforge.net/projects/cuda-z/files/cuda-z/
Mac CUDA drivers can be found here: http://www.nvidia.com/object/mac-driver-archive.html
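
For anyone comparing their own numbers, the "3-4x slower" figure is just the ratio of the two CUDA-Z readings quoted above. A minimal sketch in Python (the function and variable names are only illustrative, not part of CUDA-Z):

```python
# Single-precision CUDA-Z readings quoted in the post above, in Gflop/s.
baseline_gflops = 321.0   # OS X 10.8.1 result listed on barefeats.com
observed_gflops = 95.0    # OS X 10.8.2 result reported in this post

def slowdown(baseline, observed):
    """Return how many times slower `observed` is than `baseline`."""
    if observed <= 0:
        raise ValueError("observed throughput must be positive")
    return baseline / observed

factor = slowdown(baseline_gflops, observed_gflops)
print(f"{factor:.2f}x slower than the 10.8.1 baseline")
```

Swapping in your own CUDA-Z reading for `observed_gflops` gives a number that is directly comparable with the ones in this thread.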



macrons
Sep 23, 2012, 10:22 AM
I see roughly 92 Gflop/s.

I'd be interested if anybody who's still on 10.8.1 could run the same test.

andy318
Sep 24, 2012, 12:35 AM
I installed the latest version of the CUDA driver and CUDA-Z, and single-precision performance ranges from 295 to 320 Gflop/s while CUDA-Z is running.

My rMBP is running 10.8.2.

cornelius1
Sep 24, 2012, 09:17 PM
I see roughly 92 Gflop/s.

I'd be interested if anybody who's still on 10.8.1 could run the same test.
Is that on a retina MBP?

macrons
Sep 25, 2012, 01:31 PM
Here's my complete result:

CUDA-Z Report
=============
Version: 0.6.156 SVN Built Sep 21 2012 09:52:54
http://cuda-z.sf.net/
OS Version: Mac OS X 10.8.2 12C54 (base model, retina Macbook Pro)
Driver Version: 8.0.61 295.30.20f02
Driver Dll Version: 5.0
Runtime Dll Version: 4.20

Core Information
----------------
Name: GeForce GT 650M
Compute Capability: 3.0
Clock Rate: 900 MHz
Multiprocessors: 2
Warp Size: 32
Regs Per Block: 65536
Threads Per Block: 1024
Threads Dimensions: 1024 x 1024 x 64
Grid Dimensions: 2147483647 x 65535 x 65535
Watchdog Enabled: Yes
Integrated GPU: No
Concurrent Kernels: Yes
Compute Mode: Default

Memory Information
------------------
Total Global: 1023.69 MiB
Shared Per Block: 48 KiB
Pitch: 2048 MiB
Total Constant: 64 KiB
Texture Alignment: 512 B
Texture 1D Size: 65536
Texture 2D Size: 65536 x 65536
Texture 3D Size: 4096 x 4096 x 4096
GPU Overlap: Yes
Map Host Memory: Yes
Error Correction: No

Performance Information
-----------------------
Memory Copy
Host Pinned to Device: 4866.34 MiB/s
Host Pageable to Device: 4577.5 MiB/s
Device to Host Pinned: 4822.49 MiB/s
Device to Host Pageable: 4606.67 MiB/s
Device to Device: 10.0094 GiB/s
GPU Core Performance
Single-precision Float: 95.3403 Gflop/s <<<<<<
Double-precision Float: 7057.53 Mflop/s
32-bit Integer: 28.7314 Giop/s
24-bit Integer: 28.6809 Giop/s

Generated: Wed Sep 26 02:30:48 2012

cornelius1
Sep 25, 2012, 11:44 PM
Here's my complete result: ...

Thanks. Those performance numbers are almost the same as mine. (Base retina MBP model here too).

I installed the latest version of the CUDA driver and CUDA-Z, and single-precision performance ranges from 295 to 320 Gflop/s while CUDA-Z is running.

My rMBP is running 10.8.2.
That is interesting. I wonder why we get such low scores compared to yours on 10.8.2. Can you post your complete CUDA-Z result, too? I'm just curious about what could be different with your system. ("Export to text" is under the Performance tab). Thanks!

cornelius1
Sep 26, 2012, 07:28 PM
This is quite weird. CUDA performance seems to be back to normal with OS X 10.8.2 for no apparent reason. CUDA-Z now shows a single precision performance that varies between 256 and 334 Gflop/s (as opposed to the earlier average of 95 Gflop/s).

However, "Memory Copy" performance seems to be significantly lower than before. I wonder if there is some sort of dynamic resource shifting going on between "Memory Copy" and "GPU Core" operations that results in this kind of trade-off between the two groups of performance values.

The full report looks like this now:
CUDA-Z Report
=============
Version: 0.6.156 SVN Built Sep 21 2012 09:52:54
http://cuda-z.sf.net/
OS Version: Mac OS X 10.8.2 12C54 (base model, retina Macbook Pro)
Driver Version: 8.0.61 295.30.20f02
Driver Dll Version: 5.0
Runtime Dll Version: 4.20

Core Information
----------------
Name: GeForce GT 650M
Compute Capability: 3.0
Clock Rate: 900 MHz
Multiprocessors: 2
Warp Size: 32
Regs Per Block: 65536
Threads Per Block: 1024
Threads Dimensions: 1024 x 1024 x 64
Grid Dimensions: 2147483647 x 65535 x 65535
Watchdog Enabled: Yes
Integrated GPU: No
Concurrent Kernels: Yes
Compute Mode: Default

Memory Information
------------------
Total Global: 1023.69 MiB
Shared Per Block: 48 KiB
Pitch: 2048 MiB
Total Constant: 64 KiB
Texture Alignment: 512 B
Texture 1D Size: 65536
Texture 2D Size: 65536 x 65536
Texture 3D Size: 4096 x 4096 x 4096
GPU Overlap: Yes
Map Host Memory: Yes
Error Correction: No

Performance Information
-----------------------
Memory Copy
Host Pinned to Device: 2885.88 MiB/s
Host Pageable to Device: 2833.43 MiB/s
Device to Host Pinned: 2933.67 MiB/s
Device to Host Pageable: 2827.69 MiB/s
Device to Device: 8317.46 MiB/s
GPU Core Performance
Single-precision Float: 320.324 Gflop/s <<<<<<<
Double-precision Float: 23.0491 Gflop/s
32-bit Integer: 86.0745 Giop/s
24-bit Integer: 85.8875 Giop/s

Generated: Wed Sep 26 18:53:37 2012
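
Comparing the two reports by eye is error-prone because CUDA-Z switches units between them (Mflop/s vs Gflop/s, MiB/s vs GiB/s). Here is a rough sketch of how the "Performance Information" lines could be normalized before comparing. The parser, its names, and the choice of which lines to include are my own assumptions; the values are copied from the two reports in this thread.

```python
import re

# Scale factors that bring every CUDA-Z figure to one unit per kind:
# bandwidth -> MiB/s, floating point -> Gflop/s, integer -> Giop/s.
UNIT_SCALE = {
    "MiB/s": 1.0, "GiB/s": 1024.0,
    "Mflop/s": 1e-3, "Gflop/s": 1.0,
    "Giop/s": 1.0,
}

# Matches lines such as "Device to Device: 10.0094 GiB/s".
LINE_RE = re.compile(r"^\s*(?P<name>[^:]+):\s*(?P<value>[\d.]+)\s*(?P<unit>\S+/s)")

def parse_performance(report_text):
    """Return {metric name: value in normalized units} for recognized lines."""
    results = {}
    for line in report_text.splitlines():
        m = LINE_RE.match(line)
        if m and m.group("unit") in UNIT_SCALE:
            name = m.group("name").strip()
            results[name] = float(m.group("value")) * UNIT_SCALE[m.group("unit")]
    return results

# Excerpts copied from the slow and fast reports posted in this thread.
SLOW_REPORT = """\
Host Pinned to Device: 4866.34 MiB/s
Device to Device: 10.0094 GiB/s
Single-precision Float: 95.3403 Gflop/s <<<<<<
Double-precision Float: 7057.53 Mflop/s
"""
FAST_REPORT = """\
Host Pinned to Device: 2885.88 MiB/s
Device to Device: 8317.46 MiB/s
Single-precision Float: 320.324 Gflop/s <<<<<<<
Double-precision Float: 23.0491 Gflop/s
"""

slow = parse_performance(SLOW_REPORT)
fast = parse_performance(FAST_REPORT)
for name in slow:
    print(f"{name}: fast/slow = {fast[name] / slow[name]:.2f}")
```

With the units aligned, the compute figures come out more than 3x higher in the fast report while the memory-copy figures come out lower, which is the trade-off described above.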

qianyizh
Sep 26, 2012, 09:12 PM
I have the same CUDA performance issue here. I hate Apple (seriously I do). I am in the middle of developing a CUDA-based project, and now my program runs at about 1/3 of the speed it had before the 10.8.2 update.

I ran a few tests.
My CUDA-Z results are exactly the same as what macrons posted.
I have a bootcamp install, so I ran CUDA-Z on Win7 bootcamp too.
Single-precision float is around 100 Gflop/s.
But the weird thing is that the clock rate shows only 405 MHz instead of 900 MHz (which is the clock rate for the 650M). So I think CUDA-Z may not support bootcamp very well.
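
One way to put the two clock readings in context is a back-of-the-envelope peak estimate. This is only a sketch: the 192 cores per multiprocessor and 2 flops per core per clock are assumptions taken from the published Kepler (compute capability 3.0) specs, not from anything CUDA-Z reports, and the function name is made up.

```python
# Assumptions (not from the thread): a compute capability 3.0 multiprocessor
# has 192 single-precision cores, and each core can do one fused multiply-add
# (counted as 2 flops) per clock.
CORES_PER_SM = 192
FLOPS_PER_CORE_PER_CLOCK = 2

def peak_sp_gflops(multiprocessors, clock_mhz):
    """Theoretical single-precision peak in Gflop/s."""
    flops_per_clock = multiprocessors * CORES_PER_SM * FLOPS_PER_CORE_PER_CLOCK
    return flops_per_clock * clock_mhz / 1000.0

# The GT 650M reports 2 multiprocessors; 900 MHz and 405 MHz are the two
# clock rates mentioned above.
for clock_mhz in (900, 405):
    print(f"{clock_mhz} MHz -> {peak_sp_gflops(2, clock_mhz):.1f} Gflop/s peak")
```

Under these assumptions the measured ~100 Gflop/s sits below the ceiling for either clock, so the estimate alone cannot tell which clock the GPU was really running at. It only shows that a 405 MHz clock would roughly halve the ceiling.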

I ran tests with convolutionFFT2D from the CUDA SDK.
On Win7 bootcamp, the three test results are:
142.358473 MPix/s (28.098082 ms)
158.115622 MPix/s (25.297943 ms)
193.503693 MPix/s (20.671440 ms)
On Mountain Lion 10.8.2, the three test results are:
42.398032 MPix/s (94.344002 ms)
81.539466 MPix/s (49.056000 ms)
98.313914 MPix/s (40.686001 ms)

I ran convolutionFFT2D several times. The results are quite stable on Win7 bootcamp, but on Mountain Lion they are not: the first test result in particular ranges from 20 to 80 MPix/s. Either way, Win7 is roughly twice as fast as Mountain Lion, or better.
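
To make the comparison concrete, here is a small sketch that computes the per-test speedup from the six results listed above and cross-checks each line by multiplying throughput by elapsed time, which should give the same amount of work for every run. The variable names are illustrative only.

```python
# (throughput in MPix/s, elapsed time in ms), copied from the results above.
win7 = [(142.358473, 28.098082), (158.115622, 25.297943), (193.503693, 20.671440)]
osx = [(42.398032, 94.344002), (81.539466, 49.056000), (98.313914, 40.686001)]

def megapixels(rate_mpix_s, time_ms):
    """Work done in one run: MPix/s multiplied by seconds."""
    return rate_mpix_s * time_ms / 1000.0

speedups = []
for test_no, ((w_rate, w_ms), (m_rate, m_ms)) in enumerate(zip(win7, osx), start=1):
    speedup = w_rate / m_rate
    speedups.append(speedup)
    print(
        f"test {test_no}: Win7 is {speedup:.2f}x faster "
        f"(work: {megapixels(w_rate, w_ms):.2f} MPix on Win7, "
        f"{megapixels(m_rate, m_ms):.2f} MPix on OS X)"
    )
```

Every run works out to the same pixel count, so the MPix/s and ms columns are consistent with each other. The speedup is well above 3x for the first test and just under 2x for the other two.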

Can anyone run the same convolutionFFT2D test?

And does anyone know whether reinstalling CUDA could solve the performance issue?
I am going to try reinstalling tonight, and if that does not work, my only option is to reinstall Mountain Lion. Apple is making my life really hard.

qianyizh
Sep 26, 2012, 09:22 PM
An update:

Here are my memory bandwidth results (using bandwidthTest from the SDK).
You can see that Mountain Lion 10.8.2 degraded the host-to-device and device-to-device bandwidth.

On Mountain Lion 10.8.2:
[bandwidthTest] starting...

./bandwidthTest Starting...

Running on...

Device 0: GeForce GT 650M
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1760.5

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2830.4

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 11348.9

On Win7 Bootcamp
[bandwidthTest.exe] starting...

bandwidthTest.exe Starting...

Running on...

Device 0: GeForce GT 650M
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3636.6

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3533.4

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 19768.2
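
For a side-by-side view, here is a rough sketch that pulls the three bandwidth figures out of each bandwidthTest listing and prints the Win7/OS X ratio. The parser is my own and assumes the quick-mode layout shown above (a section header line, then a column header, then one "size bandwidth" row).

```python
import re

HEADER_RE = re.compile(r"^(Host to Device|Device to Host|Device to Device) Bandwidth")
ROW_RE = re.compile(r"^(\d+)\s+([\d.]+)$")

def parse_bandwidth(output):
    """Return {direction: bandwidth in MB/s} from quick-mode bandwidthTest text."""
    results = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            continue
        row = ROW_RE.match(line)
        if row and current:
            results[current] = float(row.group(2))
            current = None
    return results

# Trimmed copies of the two listings above.
OSX_OUTPUT = """\
Device 0: GeForce GT 650M
Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1760.5
Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2830.4
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 11348.9
"""
WIN7_OUTPUT = """\
Device 0: GeForce GT 650M
Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3636.6
Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3533.4
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 19768.2
"""

osx_bw = parse_bandwidth(OSX_OUTPUT)
win7_bw = parse_bandwidth(WIN7_OUTPUT)
for direction in osx_bw:
    ratio = win7_bw[direction] / osx_bw[direction]
    print(f"{direction}: Win7 {win7_bw[direction]} vs OS X {osx_bw[direction]} MB/s ({ratio:.2f}x)")
```

On these numbers, host-to-device is hit hardest (about 2x), device-to-device is in between, and device-to-host is the least affected.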

qianyizh
Sep 26, 2012, 10:57 PM
For the record:
I uninstalled CUDA and then reinstalled CUDA 4.2: still the same.
I uninstalled CUDA and then installed CUDA 5.0 RC: bandwidth got a bit better (1602/3169/13246), while convolutionFFT2D got worse (the third result dropped below 50 MPix/s).

qianyizh
Sep 27, 2012, 11:16 AM
Hi cornelius1, can you give me any hint about how your CUDA performance got back to normal?
Did you do something, or did you do nothing and just leave your MacBook idle all day?
There are definitely many people experiencing this, e.g., this post:
http://www.primegrid.com/forum_thread.php?id=4553&nowrap=true

I just want to get CUDA working again. :(

cornelius1
Sep 27, 2012, 08:47 PM
Hi cornelius1, can you give me any hint about how your CUDA performance got back to normal?
Did you do something, or did you do nothing and just leave your MacBook idle all day?
There are definitely many people experiencing this, e.g., this post:
http://www.primegrid.com/forum_thread.php?id=4553&nowrap=true

I just want to get CUDA working again. :(
Hi qianyizh. Sorry, I was the one who started the thread there as well (so I was the only one with this problem in that thread). And I wish I knew how it went back to normal. In fact, today it went back to being 3x as slow once again.
I still have CUDA driver version 5.0.24, and I have not installed or removed anything that affects the GPU. I didn't change anything in my daily usage pattern either. So, I have no idea what's causing these changes in CUDA performance.

cornelius1
Sep 28, 2012, 11:59 AM
And it's back to running fast once again. Btw, while a CUDA task is running in the background, CUDA-Z seems to give even better numbers (for both "Memory Copy" and "GPU Core Performance"):
Performance Information
-----------------------
Memory Copy
Host Pinned to Device: 6030.91 MiB/s <<<<<<<<<
Host Pageable to Device: 4129.53 MiB/s
Device to Host Pinned: 5859.16 MiB/s
Device to Host Pageable: 5175.31 MiB/s
Device to Device: 20.0776 GiB/s
GPU Core Performance
Single-precision Float: 400.925 Gflop/s <<<<<<<<<
Double-precision Float: 28.766 Gflop/s
32-bit Integer: 114.831 Giop/s
24-bit Integer: 114.584 Giop/s

Generated: Fri Sep 28 11:51:11 2012
So, OS X 10.8.2 might be doing some dynamic (frequency?) scaling (based on the GPU load?). I guess that sort of scaling but in the opposite direction (maybe due to high temperatures?) could be causing the huge drop in CUDA performance as well.

qianyizh
Sep 28, 2012, 02:06 PM
cornelius1, thanks for sharing your experience.

An update from my side:
CUDA started working normally again just now.
Last night I did not shut down OS X; I just closed the lid.
This morning when I woke the MacBook up, CUDA was fine again.
(CUDA-Z now shows ~300 Gflop/s, and my program runs at normal speed.)

My suspicion is that when you shut down 10.8.2 and then start it up (I have to do this a lot since I need to switch to Win7 bootcamp frequently), it slows CUDA down. But after a sleep/wake cycle, it can recover.

Can anyone confirm my suspicion?

cornelius1
Sep 28, 2012, 05:28 PM
My suspicion is that when you shut down 10.8.2 and then start it up (I have to do this a lot since I need to switch to Win7 bootcamp frequently), it slows CUDA down. ...

Can anyone confirm my suspicion?
I tried shutting down and restarting, and CUDA did not slow down. So, if that happens, it's either not a consistent behavior, or it depends on some other factor (such as the GPU temperature being high).

Also, when it slowed down yesterday, I had not restarted OS X, I had just closed and reopened the lid (that is, did a sleep-resume cycle). So, restarting OS X cannot be the only trigger (if it is a trigger at all).