I think the performance comparison between Intel Macs and Apple Silicon may also be related to the optimization of the underlying frameworks. Some time ago I started to develop code using the Metal Performance Shaders (MPS) framework. As a simple test, I wrote a program using the MPSMatrixMultiplication shader to multiply two large matrices of size [8192 x 8192]. This matrix multiplication requires (2 * 8192 - 1) x 8192 x 8192 = 1'099'444'518'912 floating-point operations, i.e. roughly a tera of floating-point operations. The amount of data transferred to the GPU is 2 x 8192 x 8192 x 4 bytes = 536'870'912 bytes = 0.537 GB. If this code runs in 1 second, the performance of the machine would be 1.1 TFlops. On my Mac Pro 2019 with an AMD Vega II, I got 3.5 TFlops prior to macOS Big Sur 11.3 and Xcode 12.5. After the update to Big Sur 11.3 and Xcode 12.5, the performance dropped to 0.119 TFlops! On the current macOS the performance is back up to 1.452 TFlops, still more than a factor of 2 less than prior to macOS 11.3. Note that this performance decrease was the result of simply recompiling the exact same code after upgrading macOS and Xcode. Nothing in the code changed.
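Just to make the arithmetic explicit, here is a tiny Swift snippet (a verification sketch only, separate from the benchmark program below) that reproduces the operation count, the data volume and the TFlops figure for a given runtime:
-----------------
import Foundation

let n = 8192
// (2n - 1) multiply/add operations per output element, n * n output elements
let flops = (2 * n - 1) * n * n                        // 1'099'444'518'912 operations
// two n x n Float32 input matrices are uploaded to the GPU
let bytes = 2 * n * n * MemoryLayout<Float32>.stride   // 536'870'912 bytes ≈ 0.537 GB
// performance for a hypothetical runtime of 1 second
let seconds = 1.0
print(flops, bytes, String(format: "%.3f TFlops", 1e-12 * Double(flops) / seconds))
-----------------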
I investigated this to the point where I could see that the performance penalty was clearly coming from the data transfer to the GPU; the calculation on the GPU itself, i.e. the MPSMatrixMultiplication kernel, was not affected. I assume that the underlying Metal framework, mainly the way data is transferred to the GPU, changed dramatically. Perhaps this was due to optimizations done for Apple Silicon affecting the data transfer model on Intel machines.
So the performance gains claimed for Apple Silicon Macs may also be due to underperforming (non-optimized) frameworks on the Intel Macs!
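If you want to see for yourself how the time splits between the GPU kernel and everything around it (encoding, scheduling and the managed-buffer transfers), one rough approach (just a sketch, not the measurement method I used) is to compare the wall-clock time of a complete encode/commit cycle with the GPU execution window that the command buffer itself reports:
-----------------
import Metal
import MetalPerformanceShaders
import Foundation

// Hypothetical helper: multiplies two already prepared MPSMatrix objects and
// prints the wall-clock time of the whole cycle next to the time the GPU
// actually spent executing the command buffer (matrix multiplication + blit).
func timeMatMul(device: MTLDevice, a: MPSMatrix, b: MPSMatrix, c: MPSMatrix) {
    let queue = device.makeCommandQueue()!
    let cb = queue.makeCommandBuffer()!
    let matMul = MPSMatrixMultiplication(device: device,
                                         resultRows: c.rows,
                                         resultColumns: c.columns,
                                         interiorColumns: a.columns)
    let wallStart = CFAbsoluteTimeGetCurrent()
    matMul.encode(commandBuffer: cb, leftMatrix: a, rightMatrix: b, resultMatrix: c)
    let blit = cb.makeBlitCommandEncoder()!
    blit.synchronize(resource: c.data)   // copy the managed result buffer back to the CPU
    blit.endEncoding()
    cb.commit()
    cb.waitUntilCompleted()
    let wallTime = CFAbsoluteTimeGetCurrent() - wallStart
    let gpuTime = cb.gpuEndTime - cb.gpuStartTime
    print("wall-clock: \(wallTime) s, GPU execution: \(gpuTime) s")
}
-----------------
The difference between the two numbers gives a rough idea of how much of the elapsed time goes to the actual calculation versus the transfer and scheduling overhead around it.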
I reported this dramatic performance drop in the Apple Developer Forums (see "Bad MPSMatrixMultiplication performance in Big Sur 11.3"). The discussion ended with the following statement from the development team:
---------------
"Thanks for the info maccan!
Some MPS engineers were able to reproduce the problem and have already made some progress in investigating a fix."
---------------
If you want to try it yourself, copy the code below into a file called "matrixMul.swift" and compile it by executing the following command in a Terminal: "swiftc -O matrixMul.swift".
Of course you have to have Xcode installed!
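For example (assuming the default output name, which swiftc derives from the source file name):
-----------------
swiftc -O matrixMul.swift
./matrixMul
-----------------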
-----------------
import Metal
import MetalPerformanceShaders
import Foundation
import CoreGraphics // for MTLCreateSystemDefaultDevice
// formatting numbers
let numberFormatter = NumberFormatter()
numberFormatter.numberStyle = .decimal // Set defaults to the formatter that are common for showing decimal numbers
numberFormatter.usesGroupingSeparator = true // Enable the grouping separator
numberFormatter.groupingSeparator = "'" // Use "'" as the separator (e.g. 1000000 = 1'000'000)
numberFormatter.groupingSize = 3
// Calculates matrix multiplication floating point operations
func getFops(matrixDim: Int) -> Int {
    return (2 * matrixDim - 1) * matrixDim * matrixDim
}
// Reports the mean flops in units of Teraflops
func getTflops(nFP: Int, time: Double) -> String {
    return String(format: "%.3f", 1e-12 * Double(nFP) / time)
}
// Get the device, commandQueue and commandBuffer
// let device = MTLCreateSystemDefaultDevice()!
// In Ventura, MTLCreateSystemDefaultDevice() seems to no longer work
let device = MTLCopyAllDevices().first!
let commandQueue = device.makeCommandQueue()!
let commandBuffer = commandQueue.makeCommandBuffer()!
// Matrix dimensions
let n = 8192
let rowsA = n
let colsA = n
let rowsB = n
let colsB = n
let rowsC = n
let colsC = n
// Set data for Matrix A, B
let a = UnsafeMutablePointer<Float32>.allocate(capacity: rowsA * colsA)
let arrayA = UnsafeMutableBufferPointer(start: a, count: rowsA * colsA)
arrayA.update(repeating: Float32(1.0))
print("Values in matrix A[\(rowsA) x \(colsA)]: \(arrayA[0]) uniformly")
let b = UnsafeMutablePointer<Float32>.allocate(capacity: rowsB * colsB)
let arrayB = UnsafeMutableBufferPointer(start: b, count: rowsB * colsB)
arrayB.update(repeating: Float32(2.0))
print("Values in matrix B[\(rowsB) x \(colsB)]: \(arrayB[0]) uniformly")
// 1. Prepare managed buffers
// matrix A
let rowBytesA = colsA * MemoryLayout<Float32>.stride
let bufferA = device.makeBuffer(bytes: arrayA.baseAddress!,
                                length: rowsA * rowBytesA, options: [.storageModeManaged])!
// matrix B
let rowBytesB = colsB * MemoryLayout<Float32>.stride
let bufferB = device.makeBuffer(bytes: arrayB.baseAddress!,
                                length: rowsB * rowBytesB, options: [.storageModeManaged])!
// matrix C
let rowBytesC = colsC * MemoryLayout<Float32>.stride
let bufferC = device.makeBuffer(length: rowsC * rowBytesC,
                                options: [.storageModeManaged])!
// 2. Prepare Matrices
let descrA = MPSMatrixDescriptor(rows: rowsA, columns: colsA, rowBytes: rowBytesA, dataType: .float32)
let descrB = MPSMatrixDescriptor(rows: rowsB, columns: colsB, rowBytes: rowBytesB, dataType: .float32)
let descrC = MPSMatrixDescriptor(rows: rowsC, columns: colsC, rowBytes: rowBytesC, dataType: .float32)
let matrixA = MPSMatrix(buffer: bufferA, descriptor: descrA)
let matrixB = MPSMatrix(buffer: bufferB, descriptor: descrB)
let matrixC = MPSMatrix(buffer: bufferC, descriptor: descrC)
// 3. Encode the MPS
//----------------------------------------------------
print("Starting calculation on \(device.name)\n...")
let startTime = CFAbsoluteTimeGetCurrent()
//----------------------------------------------------
let matMul = MPSMatrixMultiplication(device: device, resultRows: rowsC, resultColumns: colsC, interiorColumns: colsA)
matMul.encode(commandBuffer: commandBuffer, leftMatrix: matrixA, rightMatrix: matrixB, resultMatrix: matrixC)
// Create the blit encoder only after the MPS kernel has been encoded
// (only one command encoder may be active on a command buffer at a time)
let blitEncoder = commandBuffer.makeBlitCommandEncoder()!
blitEncoder.synchronize(resource: bufferC) // copy the managed result buffer back to CPU memory
blitEncoder.endEncoding()
// 4. Run command buffer, i.e. the GPU calculation
//-----------------------------------------------------
// print("Starting calculation on \(device.name)\n...")
// let startTime = CFAbsoluteTimeGetCurrent()
//-----------------------------------------------------
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
let elapsedTime = Double(Int(1000 * (CFAbsoluteTimeGetCurrent() - startTime))) / 1000
// Read results
let nC = rowsC * colsC
let resultPointer = bufferC.contents().bindMemory(to: Float.self, capacity: nC)
let result = UnsafeBufferPointer(start: resultPointer, count: nC)
// Check consistency of resulting matrix
var ok = true
for i in 1..<nC {
    if result[i] != result[0] {
        ok = false
    }
}
if ok {
    print("Values in matrix C = A * B: \(result[0]) uniformly")
    let fops = getFops(matrixDim: n)
    let tFlops = getTflops(nFP: fops, time: elapsedTime)
    print(numberFormatter.string(for: fops) ?? "", "floating point operations performed")
    print("Elapsed GPU time = \(elapsedTime) seconds -> \(tFlops) Teraflops")
} else {
    print("Error: Inconsistent calculation results")
}
----------------------