Hey everybody! I'm trying to learn how to use Metal Performance Shaders to run matrix compute kernels on my GPU. I wrote a simple 'learning' program that sets up 1x1 matrices holding the values 1 to 4000 and sums them on the GPU, i.e. it computes the sum of all integers from 1 to 4000. The point is not performance (the matrices are created in a for loop on the CPU, which is slow); the point is just to learn how data moves around with Metal.
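For reference, the expected result is 8,002,000. A quick CPU check, separate from the program itself:

// CPU reference for the value the GPU should produce: 1 + 2 + ... + 4000
let expected = (1...4000).reduce(0, +)
print(expected) // 8002000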
Anyway, the code works as expected on my MacBook Pro (integrated GPU, which shares memory with the CPU) but prints [0.0] as the result on my iMac with a dGPU. Because of that, I suspect the memory management between the dGPU's GDDR and system DDR is the culprit, but I can't for the life of me figure out why. As far as I can tell, I'm retrieving the data from the GPU correctly. Am I missing something? (I've put a sketch of the one synchronization idea I had after the code below.) Here's the program; it's short and written in a Playground. Again, performance isn't key, and I know the for loop is a slow CPU process, it's just there to get some quick data in.
The prints were inserted so I can follow the data and hopefully spot the issue; only the print of output near the end shows the actual result.
Code:
import Cocoa
import PlaygroundSupport
import MetalPerformanceShaders
// Pick the first Metal device; the print shows every GPU the system reports
let device = MTLCopyAllDevices()[0]
print(MTLCopyAllDevices())
// Kernel that sums 4000 source matrices, each of them 1x1
let shaderKernel = MPSMatrixSum(device: device, count: 4000, rows: 1, columns: 1, transpose: false)
var matrixList: [MPSMatrix] = []
var GPUStorageBuffers: [MTLBuffer] = []
for i in 1...4000 {
    // Each source matrix is 1x1 and holds the single Float32 value i
    let descriptor = MPSMatrixDescriptor(rows: 1, columns: 1, rowBytes: 4, dataType: .float32)
    let b: [Float32] = [Float32(i)]
    // Managed storage: on a dGPU this buffer has separate CPU- and GPU-side copies
    let buffer = device.makeBuffer(bytes: b, length: 4, options: .storageModeManaged)!
    GPUStorageBuffers.append(buffer)
    let matrix = MPSMatrix(buffer: buffer, descriptor: descriptor)
    matrixList.append(matrix)
}
let matrices: [MPSMatrix] = matrixList
print(matrices.count)
print("\n")
print(matrices[4].debugDescription)
print("\n")
// Spot-check one source value on the CPU to confirm the data landed in its buffer
var printer: [Float32] = []
let pointer2 = matrices[4].data.contents()
let typedPointer2 = pointer2.bindMemory(to: Float32.self, capacity: 1)
let buffpoint2 = UnsafeBufferPointer(start: typedPointer2, count: 1)
buffpoint2.forEach { value in
    printer.append(value)
}
print(printer)
let CMDQue = device.makeCommandQueue()
let CMDBuffer = CMDQue!.makeCommandBuffer()
// Result matrix (1x1) that the kernel writes the sum into
let resultMatrix = MPSMatrix(device: device, descriptor: MPSMatrixDescriptor(rows: 1, columns: 1, rowBytes: 4, dataType: .float32))
shaderKernel.encode(to: CMDBuffer!, sourceMatrices: matrices, resultMatrix: resultMatrix, scale: nil, offsetVector: nil, biasVector: nil, start: 0)
print(CMDBuffer.debugDescription)
CMDBuffer!.commit()
print(CMDBuffer.debugDescription)
print(CMDQue.debugDescription)
let GPUStartTime = DispatchTime.now().uptimeNanoseconds // SYSTEM_CLOCK is just a mach clock-id constant, so use a real timestamp
CMDBuffer!.waitUntilCompleted()
// Read the result back on the CPU
var output = [Float32]()
let pointer = resultMatrix.data.contents()
let typedPointer = pointer.bindMemory(to: Float32.self, capacity: 1)
let buffpoint = UnsafeBufferPointer(start: typedPointer, count: 1)
buffpoint.forEach { value in
    output.append(value)
}
print(output) // correct sum on the MacBook Pro, [0.0] on the iMac
let finish = DispatchTime.now().uptimeNanoseconds - GPUStartTime
print("\n")
print(finish) // elapsed nanoseconds
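The one idea I had (mentioned above): since the buffers use .storageModeManaged, maybe the dGPU's writes to the result buffer are never synchronized back to system memory before I read it on the CPU. Would I need a blit encoder like the sketch below, encoded after the kernel and before the commit? This is just from my reading of the docs, I haven't confirmed it:

// Sketch: ask the GPU to copy the managed result buffer back to CPU-visible memory.
// This would go after shaderKernel.encode(...) and before CMDBuffer!.commit().
if let blit = CMDBuffer!.makeBlitCommandEncoder() {
    blit.synchronize(resource: resultMatrix.data)
    blit.endEncoding()
}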