Swift GPU Parallel Computing

Just GPU Parallel Computing…

0x00

No more talking, here is the original code:

reduce(A, n) {
    if (n == 1) {
        return A[0];
    }
    parallel begin:
        L = reduce(A, n / 2);
        R = reduce(A + n / 2, n - n / 2);
    parallel end;
    return L + R;
}

This image shows how it works:

1.png
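
For reference, here is a minimal Swift sketch of the same recursive reduction (my own illustration; the two halves are written as sequential calls just to show the structure of the recursion):

func reduce(_ a: ArraySlice<Float>) -> Float {
    // Base case: a single element reduces to itself.
    if a.count == 1 { return a[a.startIndex] }
    let mid = a.startIndex + a.count / 2
    // Conceptually these two halves run in parallel;
    // they are written sequentially here for clarity.
    let l = reduce(a[a.startIndex..<mid])
    let r = reduce(a[mid..<a.endIndex])
    return l + r
}

// let a: [Float] = [1, 2, 3, 4, 5, 6, 7, 8]
// reduce(a[...])  // 36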

But here is a problem: it is not appropriate to use this algorithm directly. In Metal, the resources to be calculated must be prepared on the CPU and submitted to the GPU together, and the results come back to the CPU only after the GPU finishes the calculation. So changing the recursion into iteration is a good idea.

0x01

A natural thought is to flatten the tree above and change it into an array.

e.g. declare an array a = [1,2,3,4,5,6,7,8], then declare a new array b = [a, nil, nil, nil, nil, nil, nil, nil] (nil means no element yet), where the seven nil slots will hold the seven internal nodes of the recursion tree, and here we go.

2.png

We can use b[index + a.count] = b[index * 2] + b[index * 2 + 1] to calculate the value of each node. If at least one of b[index * 2], b[index * 2 + 1] is nil, it means the earlier results have not been calculated yet. Use a while loop to wait; since the depth of the tree is logarithmic, no thread should have to wait longer than about 30 additions.
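
As a CPU-side illustration of this indexing (my own sketch, not the Metal kernel), filling the flattened tree sequentially looks like this:

let a: [Float] = [1, 2, 3, 4, 5, 6, 7, 8]
var b = [Float?](repeating: nil, count: 2 * a.count - 1)
for (i, value) in a.enumerated() { b[i] = value }

// Each "thread" index fills one internal node from its two children.
for index in 0..<(a.count - 1) {
    // On the CPU the children are always ready by the time we read them;
    // on the GPU they might not be, which is exactly what the busy-wait
    // in the kernel below tries (and fails) to handle.
    b[index + a.count] = b[index * 2]! + b[index * 2 + 1]!
}
// b[14]! == 36, the sum of a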

// First (broken) attempt: each thread busy-waits until both children are ready.
// `isnil` stands for an "is this slot still unfilled" check (e.g. a sentinel value);
// it is not a built-in Metal function.
kernel void array_sum(device float *array,
                      device int &len,
                      uint index [[ thread_position_in_grid ]]) {
    int i = (index << 1);
    int j = i + 1;
    int k;
    // Spin until both children have been written by other threads.
    while (isnil(array[i]) || isnil(array[j])) {
        k = i; i = j; j = k; // swap so we don't keep polling the same slot
    }
    array[index + len] = array[i] + array[j];
}

But this doesn't work: the while loop in the Metal shader never terminates, because the GPU gives no guarantee about the order in which threads run, so a thread can spin forever waiting for a value that has not been written yet. It is fine to have a loop in Metal code, but that really only works with a fixed number of iterations. Basically, we need to rethink our logic so that it can be processed correctly on a GPU.

We need to change the way we iterate:

// Working version: in each pass, thread `index` adds the element `interval` away
// into its own slot. The CPU doubles `interval` between passes.
kernel void array_sum(device float *array,
                      device int &interval,
                      uint index [[ thread_position_in_grid ]]) {
    int i = index * interval << 1;   // i = index * interval * 2
    array[i] += array[i + interval];
}

Here is the process diagram:

3.png

In this way, the CPU-side code becomes more complicated, because not only do the interval and the arrayCount need to be adjusted each pass, we also have to check whether the arrayCount is odd. The Metal code can only handle the case where the number of live elements is even; if it is odd, the last two elements need to be summed manually on the CPU before the buffer is sent to the GPU for the next pass.
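
To make the interval/arrayCount bookkeeping concrete, here is a plain Swift simulation of the passes (illustration only, no Metal involved), using an odd-length array so the fix-up path is exercised:

// CPU-only simulation of the reduction passes.
var array: [Float] = [1, 2, 3, 4, 5]   // 5 elements: odd count on the first pass
var interval = 1
var arrayCount = array.count

while arrayCount > 1 {
    if arrayCount & 1 == 1 {
        // Odd count: fold the last live element into the one before it.
        array[(arrayCount - 2) * interval] += array[(arrayCount - 1) * interval]
    }
    arrayCount >>= 1
    // This loop is what the GPU kernel does in parallel, one thread per `i`.
    for i in 0..<arrayCount {
        let dst = i * interval * 2
        array[dst] += array[dst + interval]
    }
    interval <<= 1
}
// array[0] == 15, the sum of 1...5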

0x02

import Cocoa
import Metal

let stride = MemoryLayout<Float>.stride

// One-time Metal setup: default device, the array_sum compute pipeline, and a command queue.
let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "array_sum")!
let pipeline = try! device.makeComputePipelineState(function: function)
let queue = device.makeCommandQueue()!

func sumRow(_ a: [Float]) -> Float {
    var a = a

    // Shared buffer holding the data; the reduction happens in place.
    let computeBuffer = device.makeBuffer(
        length: stride * a.count,
        options: []
    )!
    let pointer = computeBuffer.contents()
    memcpy(pointer, &a, stride * a.count)

    // Buffer for the interval parameter; a 32-bit value to match
    // `device int &interval` in the Metal kernel.
    let intervalBuffer = device.makeBuffer(
        length: MemoryLayout<Int32>.stride,
        options: []
    )!
    let interval = intervalBuffer.contents().bindMemory(
        to: Int32.self,
        capacity: 1
    )
    interval.pointee = 1

    var arrayCount = a.count
    while arrayCount > 1 {
        // The kernel only handles an even number of live elements: if the count
        // is odd, fold the last live element into the one before it on the CPU.
        if (arrayCount & 1) == 1 {
            let step = Int(interval.pointee)
            let dst = pointer.load(
                fromByteOffset: stride * (arrayCount - 2) * step,
                as: Float.self
            )
            let src = pointer.load(
                fromByteOffset: stride * (arrayCount - 1) * step,
                as: Float.self
            )
            pointer.storeBytes(
                of: dst + src,
                toByteOffset: stride * (arrayCount - 2) * step,
                as: Float.self
            )
        }
        arrayCount >>= 1

        // One GPU pass: arrayCount threads, each adding the element `interval` away.
        let commandBuffer = queue.makeCommandBuffer()!
        let encoder = commandBuffer.makeComputeCommandEncoder()!
        encoder.setComputePipelineState(pipeline)
        encoder.setBuffer(computeBuffer, offset: 0, index: 0)
        encoder.setBuffer(intervalBuffer, offset: 0, index: 1)
        let gridSize = MTLSizeMake(arrayCount, 1, 1)
        let threadSize = MTLSizeMake(min(arrayCount, 512), 1, 1)
        encoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadSize)
        encoder.endEncoding()
        commandBuffer.commit()
        commandBuffer.waitUntilCompleted()

        // Live elements are now twice as far apart.
        interval.pointee <<= 1
    }

    // After the last pass the total is in the first element.
    let sum = pointer.load(as: Float.self)
    return sum
}
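
A minimal call site might look like this (illustrative; the exact benchmark the author ran is not shown):

// Example usage.
let numbers = (0..<1_000_000).map { Float($0) }
let total = sumRow(numbers)
print(total)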

For summing the numbers from 0 to 1e8, the CPU needs 32 seconds, but with the GPU the time is shortened to 0.88 seconds.

Title: Swift GPU Parallel Computing

Author: Tuski

Published: 12/22/2019 - 18:01:33

Updated: 12/23/2019 - 11:35:21

Link: http://www.perphet.com/2019/12/Swift-GPU-Parallel-Computing/

License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). For reprinting, please keep the original link and author.

Thx F Sup