I think the performance comparison between Intel Macs and Apple Silicon may also be related to the optimization of the underlying frameworks. Some time ago I started to develop code using the Metal Performance Shaders (MPS) framework. As a simple test, I wrote a program using the MPSMatrixMultiplication shader to multiply two large matrices of size [8192 x 8192]. This matrix multiplication requires (2 * 8192 - 1) x 8192 x 8192 = 1'099'444'518'912 floating-point operations, i.e. roughly a tera of floating-point operations. The amount of data transferred to the GPU is 2 x 8192 x 8192 x 4 bytes = 536'870'912 bytes = 0.537 GB. If this code runs in 1 second, the performance of the machine would be 1.1 TFlops. On my Mac Pro 2019 with an AMD Vega II, I got 3.5 TFlops prior to macOS Big Sur 11.3 and Xcode 12.5. After the update to Big Sur 11.3 and Xcode 12.5, the performance dropped to 0.119 TFlops! On the current macOS the performance is back up to 1.452 TFlops, still more than a factor of 2 less than prior to macOS 11.3. Note that this performance decrease was the result of simply recompiling the exact same code after upgrading macOS and Xcode. Nothing in the code changed.
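Just to make the arithmetic explicit, here is a tiny Swift snippet (a verification sketch only, separate from the benchmark program below) that reproduces the operation count, the data volume and the TFlops figure for a given runtime:
-----------------
import Foundation

let n = 8192
// (2n - 1) multiply/add operations per output element, n * n output elements
let flops = (2 * n - 1) * n * n                        // 1'099'444'518'912 operations
// two n x n Float32 input matrices are uploaded to the GPU
let bytes = 2 * n * n * MemoryLayout<Float32>.stride   // 536'870'912 bytes ≈ 0.537 GB
// performance for a hypothetical runtime of 1 second
let seconds = 1.0
print(flops, bytes, String(format: "%.3f TFlops", 1e-12 * Double(flops) / seconds))
-----------------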
I investigated this to the point where I could see that the performance penalty was clearly coming from the data transfer to the GPU; the calculation on the GPU itself, i.e. the MPSMatrixMultiplication kernel, was not affected. I assume that the underlying Metal framework, mainly the way data is transferred to the GPU, changed dramatically. Perhaps this was due to optimizations done for Apple Silicon affecting the data transfer model on Intel machines.
So the performance gains claimed for Apple Silicon Macs may also be due to underperforming (non-optimized) frameworks on the Intel Macs!
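If you want to see for yourself how the time splits between the GPU kernel and everything around it (encoding, scheduling and the managed-buffer transfers), one rough approach (just a sketch, not the measurement method I used) is to compare the wall-clock time of a complete encode/commit cycle with the GPU execution window that the command buffer itself reports:
-----------------
import Metal
import MetalPerformanceShaders
import Foundation

// Hypothetical helper: multiplies two already prepared MPSMatrix objects and
// prints the wall-clock time of the whole cycle next to the time the GPU
// actually spent executing the command buffer (matrix multiplication + blit).
func timeMatMul(device: MTLDevice, a: MPSMatrix, b: MPSMatrix, c: MPSMatrix) {
    let queue = device.makeCommandQueue()!
    let cb = queue.makeCommandBuffer()!
    let matMul = MPSMatrixMultiplication(device: device,
                                         resultRows: c.rows,
                                         resultColumns: c.columns,
                                         interiorColumns: a.columns)
    let wallStart = CFAbsoluteTimeGetCurrent()
    matMul.encode(commandBuffer: cb, leftMatrix: a, rightMatrix: b, resultMatrix: c)
    let blit = cb.makeBlitCommandEncoder()!
    blit.synchronize(resource: c.data)   // copy the managed result buffer back to the CPU
    blit.endEncoding()
    cb.commit()
    cb.waitUntilCompleted()
    let wallTime = CFAbsoluteTimeGetCurrent() - wallStart
    let gpuTime = cb.gpuEndTime - cb.gpuStartTime
    print("wall-clock: \(wallTime) s, GPU execution: \(gpuTime) s")
}
-----------------
The difference between the two numbers gives a rough idea of how much of the elapsed time goes to the actual calculation versus the transfer and scheduling overhead around it.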
I reported this dramatic performance drop in the Apple Developer Forums (see "Bad MPSMatrixMultiplication performance in Big Sur 11.3"). The discussion ended with the following statement from the development team:
---------------
"Thanks for the info maccan!
Some MPS engineers were able to reproduce the problem and have already made some progress in investigating a fix."
---------------
If you want to try it yourself, copy the code below into a file called "matrixMul.swift" and compile it by executing the following command in a Terminal: "swiftc -O matrixMul.swift".
Of course you have to have Xcode installed!
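For example (assuming the default output name, which swiftc derives from the source file name):
-----------------
swiftc -O matrixMul.swift
./matrixMul
-----------------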
-----------------
import Metal
import MetalPerformanceShaders
import Foundation
import CoreGraphics // for MTLCreateSystemDefaultDevice
// formatting numbers
let numberFormatter = NumberFormatter()
numberFormatter.numberStyle = .decimal // Set defaults to the formatter that are common for showing decimal numbers
numberFormatter.usesGroupingSeparator = true // Enable the grouping separator
numberFormatter.groupingSeparator = "'" // Use "'" as the separator (e.g. 1000000 = 1'000'000)
numberFormatter.groupingSize = 3
// Calculates matrix multiplication floating point operations
func getFops(matrixDim: Int) -> Int {
    return (2 * matrixDim - 1) * matrixDim * matrixDim
}
// Reports the mean flops in units of Teraflops
func getTflops(nFP: Int, time: Double) -> String {
    return String(format: "%.3f", 1e-12 * Double(nFP) / time)
}
// Get the device, commandQueue and commandBuffer
// let device = MTLCreateSystemDefaultDevice()!
// In Ventura, MTLCreateSystemDefaultDevice() seems to no longer work
let device = MTLCopyAllDevices().first!
let commandQueue = device.makeCommandQueue()!
let commandBuffer = commandQueue.makeCommandBuffer()!
// Matrix dimensions
let n = 8192
let rowsA = n
let colsA = n
let rowsB = n
let colsB = n
let rowsC = n
let colsC = n
// Set data for Matrix A, B
let a = UnsafeMutablePointer<Float32>.allocate(capacity: rowsA * colsA)
let arrayA = UnsafeMutableBufferPointer(start: a, count: rowsA * colsA)
arrayA.update(repeating: Float32(1.0))
print("Values in matrix A[\(rowsA) x \(colsA)]: \(arrayA[0]) uniformly")
let b = UnsafeMutablePointer<Float32>.allocate(capacity: rowsB * colsB)
let arrayB = UnsafeMutableBufferPointer(start: b, count: rowsB * colsB)
arrayB.update(repeating: Float32(2.0))
print("Values in matrix B[\(rowsB) x \(colsB)]: \(arrayB[0]) uniformly")
// 1. Prepare managed buffers
// matrix A
let rowBytesA = colsA * MemoryLayout<Float32>.stride
let bufferA = device.makeBuffer(bytes: arrayA.baseAddress!,
                                length: rowsA * rowBytesA, options: [.storageModeManaged])!
// matrix B
let rowBytesB = colsB * MemoryLayout<Float32>.stride
let bufferB = device.makeBuffer(bytes: arrayB.baseAddress!,
                                length: rowsB * rowBytesB, options: [.storageModeManaged])!
// matrix C
let rowBytesC = colsC * MemoryLayout<Float32>.stride
let bufferC = device.makeBuffer(length: rowsC * rowBytesC,
                                options: [.storageModeManaged])!
// 2. Prepare Matrices
let descrA = MPSMatrixDescriptor(rows: rowsA, columns: colsA, rowBytes: rowBytesA, dataType: .float32)
let descrB = MPSMatrixDescriptor(rows: rowsB, columns: colsB, rowBytes: rowBytesB, dataType: .float32)
let descrC = MPSMatrixDescriptor(rows: rowsC, columns: colsC, rowBytes: rowBytesC, dataType: .float32)
let matrixA = MPSMatrix(buffer: bufferA, descriptor: descrA)
let matrixB = MPSMatrix(buffer: bufferB, descriptor: descrB)
let matrixC = MPSMatrix(buffer: bufferC, descriptor: descrC)
// 3. Encode the MPS
//----------------------------------------------------
print("Starting calculation on \(device.name)\n...")
let startTime = CFAbsoluteTimeGetCurrent()
//----------------------------------------------------
let matMul = MPSMatrixMultiplication(device: device, resultRows: rowsC, resultColumns: colsC, interiorColumns: colsA)
matMul.encode(commandBuffer: commandBuffer, leftMatrix: matrixA, rightMatrix: matrixB, resultMatrix: matrixC)
// Create the blit encoder only after the MPS kernel has been encoded
// (only one command encoder may be active on a command buffer at a time)
let blitEncoder = commandBuffer.makeBlitCommandEncoder()!
blitEncoder.synchronize(resource: bufferC) // copy the managed result buffer back to CPU memory
blitEncoder.endEncoding()
// 4. Run command buffer, i.e. the GPU calculation
//-----------------------------------------------------
// print("Starting calculation on \(device.name)\n...")
// let startTime = CFAbsoluteTimeGetCurrent()
//-----------------------------------------------------
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
let elapsedTime = Double(Int(1000 * (CFAbsoluteTimeGetCurrent() - startTime))) / 1000
// Read results
let nC = rowsC * colsC
let resultPointer = bufferC.contents().bindMemory(to: Float.self, capacity: nC)
let result = UnsafeBufferPointer(start: resultPointer, count: nC)
// Check consistency of resulting matrix
var ok = true
for i in 1..<nC {
    if result[i] != result[0] {
        ok = false
    }
}
if ok {
    print("Values in matrix C = A * B: \(result[0]) uniformly")
    let fops = getFops(matrixDim: n)
    let tFlops = getTflops(nFP: fops, time: elapsedTime)
    print(numberFormatter.string(for: fops) ?? "", "floating point operations performed")
    print("Elapsed GPU time = \(elapsedTime) seconds -> \(tFlops) Teraflops")
} else {
    print("Error: Inconsistent calculation results")
}
----------------------