# Performance-Portable Autotuning of OpenCL Kernels for Convolutional Layers of Deep Neural Networks

Yaohung M. Tsai, Piotr Luszczek Jakub Kurzak, Jack Dongarra

> Nov. 14, 2016 MLHPC Workshop @ SC16





# **BLAS: The Core of Numerical Algorithms**

**BLAS: Basic Linear Algebra Subprograms** 

Developers are always trying to map the computation part of their algorithms into matrix operations to take advantage of highly optimized BLAS libraries:

- ATLAS: Automatically Tuned Linear Algebra Software
- GotoBLAS, OpenBLAS
- Intel Math Kernel Library
- AMD Core Math Library, cIBLAS
- Nvidia cuBLAS

They usually obtains 90% of theoretical peak performance with sufficient matrix size.



# What about Deep Neural Network?

- Training can easily take days or weeks.
- Nvidia cuDNN
- Intel MKL 2017, Intel MKL-DNN (Open source).
- So many frameworks.
- Fast growing / evolving area.
- New models / networks using the same or similar components / layers.

### Why don't we do auto tuning?



# **The Hardware Side**

| GPU Card<br>Name            | Peak Performance<br>(GFlop/s) | Price<br>(USD) | Performance/Price<br>(GFlops/s/USD) | Memory Size<br>(GB) | Memory Bandwidth<br>(GB/s) |
|-----------------------------|-------------------------------|----------------|-------------------------------------|---------------------|----------------------------|
| Nvidia Titan X<br>(Maxwell) | 6144                          | \$900          | 6.83                                | 12                  | 336                        |
| AMD Fury X                  | 8601                          | \$390          | 22.05                               | 4                   | 512 (HBM)                  |
| Nvidia Titan X<br>(Pascal)  | 10157                         | \$1200         | 8.46                                | 12                  | 480                        |
| Nvidia GTX1080<br>(Pascal)  | 7967                          | \$620          | 12.85                               | 8                   | 320                        |

Single precision performance.

Price is the market price in Nov. 2017.

No Nvidia Tesla P100, P4, P40 price, yet.



### **Background: Neural Network**





# Background: DNN, Training, Inference.



Alexnet for image classification.

Inference: Feed image into network and get the result likelihood vector.

Training: Back propagate the error and update the weights within layers.

Convolutional layers and fully connected layers are the compute intensive parts.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." *Advances in neural information processing systems*. 2012.



### **Discrete 2D Convolution**





### **Dimensions of Convolutional Layer**





### CUDA v.s. OpenCL

| NVIDIA CUDA                             | Khronos OpenCL                      |  |  |  |
|-----------------------------------------|-------------------------------------|--|--|--|
| Term or Syntax                          | Term or Syntax                      |  |  |  |
| Term of Synax                           | Term of Syntax                      |  |  |  |
| GPU Hardware Components                 |                                     |  |  |  |
| SM, SMX                                 | CU                                  |  |  |  |
| streaming multiprocessor                | compute unit                        |  |  |  |
| scalar core                             | processing element (PE)             |  |  |  |
| host thread                             | host program                        |  |  |  |
| thread block                            | work-group                          |  |  |  |
| thread                                  | work item                           |  |  |  |
| grid                                    | NDRange                             |  |  |  |
| shared (per-block) memory               | local memory                        |  |  |  |
| local memory                            | private memory                      |  |  |  |
| texture cache                           | image                               |  |  |  |
| kernel                                  | program                             |  |  |  |
| $\mathbf{PTX}^{\dagger}$                | IL <sup>‡</sup>                     |  |  |  |
| GPU Software Constructs                 |                                     |  |  |  |
| global void K ()                        | kernel void K ()                    |  |  |  |
| void K(float *X)                        | <pre>void K(global float *X)</pre>  |  |  |  |
| <pre>float *F;</pre>                    | global float *F;                    |  |  |  |
| shared float *B;                        | <pre>local float *B;</pre>          |  |  |  |
| int tx = threadIdx.x                    | <pre>int tx = get_local_id(0)</pre> |  |  |  |
| int bx = blockIdx.x                     | <pre>int bx = get_group_id(0)</pre> |  |  |  |
| syncthreads()                           | barrier(CLK_LOCAL_MEM_FENCE)        |  |  |  |
| <sup>†</sup> DTV is Derellel Thread Eve |                                     |  |  |  |
|                                         |                                     |  |  |  |

<sup>†</sup> PTX is Parallel Thread Execution

<sup>‡</sup> IL is Intermediate Language

TENNESSEE

# **Typical Im2col approach with SGEMM**



# **Proposed Local Memory Convolution Kernel**

Compute corresponding index and read from input images directly. Form the require submatrix inside local (shared) memory to reuse data within a work-group (thread block).



(0,2,3,2) (0,2,3,3)

## **Comparing SGEMM and Convolutional Layer**

#### **cIBLAS SGEMM**

```
1 for(int k=0; k<K; k+=16) {</pre>
    __local float* plA = lA + idx*97+idy;
    \_local float* plB = lB + idx*97+idy;
    barrier(CLK_LOCAL_MEM_FENCE);
    //Load next submatrix into local memory
9
10
11
12
13
14
15
    for(int i=0; i<96; i+=16)
      plA[i] = A[i];
16
    for(int i=0; i<96; i+=16)</pre>
17
      plB[i] = B[i*ldb];
18
19
20
21
22
    barrier(CLK_LOCAL_MEM_FENCE);
23
24
25
    //Inner computation loop
26
    . . .
27
    //Move to next submatrix
28
    A += 16*lda;
29
30
    B += 16;
31 }
```

#### **Our Kernel**

```
1 for(int k=0; k<K; k+=16) {</pre>
    __local float* plA = lA + idx*97+idy;
    __local float* plB = lB + idx*97+idy;
   int x = p * stride_u + r;
    int y = q * stride_v + s;
    barrier(CLK_LOCAL_MEM_FENCE);
    //Load next submatrix into local memory
   //Check if it's in the padding region
9
   if (x < pad_h || x >= H+pad_h ||
10
        y < pad_w || y >= W+pad_w)
11
12
      for(int i=0; i<96; i+=16)</pre>
        plA[i] = 0;
13
    else
14
15
      for(int i=0; i<96; i+=16)
        plA[i]=A[index+c*N*H*W+r*N*W+s*N+i];
16
    for(int i=0; i<96; i+=16)</pre>
17
      plB[i] = B[i*ldb];
18
19
   //Update indices for next submatrix
20
  s += 16; r += s/S; s = s/S;
21
22
  c += r/R; r = r/R;
   barrier(CLK_LOCAL_MEM_FENCE);
23
24
25
    //Inner computation loop
26
    . . .
27
    //Move to next submatrix
28
   A += 16*lda;
29
30
    B += 16;
31 }
```





# **Auto-tuning Parameters and Constraints**



- Algorithm level: Constraints from the kernel to remove boundary check.
- Software level: Problem sizes are divisible by tile sizes to remove imbalance workload.
- Hardware level: Ensure there are enough registers and local memory spaces with minimum occupancy.





### **Performance Results**

The tested GPU was the AMD Fury X with peak single-precision (FP32) performance of 8602 Gflop/s and core frequency of 1050 MHz.



The histogram of performances from 2056 kernels on Alexnet L2.

The best kernel achieved 66.7% of peak performance.



# The tuning result on Alexnet L2

The darker red indicate the better performance in this heat map.

76.7% (3314) of the kernels were invalid and pruned before code generation.

3.4% (145) of the kernels failed at runtime.



### **Result of 5 work group configurations on AlexNet L2**

Each subfigure is extracted from one column of previous heat map.



 $M_{THD}$ 

**�iCl** 

## **Comparison between Convolution and SGEMM**



**∲ICL** 

# Performance on Alexnet with batch size N=128

| Alexnet                    | Forward fe                                     | eed                              | Back propagat                        | e (input)                 | Back propagat                                  | e (filter)                        |
|----------------------------|------------------------------------------------|----------------------------------|--------------------------------------|---------------------------|------------------------------------------------|-----------------------------------|
| Neural<br>layer            | Performance<br>(Gflop/s)                       | Time<br>(ms)                     | Performance<br>(Gflop/s)             | Time<br>(ms)              | Performance<br>(Gflop/s)                       | Time<br>(ms)                      |
| L1<br>L2<br>L3<br>L4<br>L5 | 4972.0<br>5511.2<br>5493.5<br>5018.7<br>4983.2 | 4.5<br>10.4<br>5.2<br>7.6<br>5.1 | 4795.5<br>4936.4<br>4878.5<br>4964.6 | 12.0<br>5.8<br>7.8<br>5.2 | 4161.5<br>2174.9<br>3444.5<br>3658.4<br>3069.7 | 4.3<br>26.4<br>8.3<br>10.5<br>8.3 |
| Combined Forward           | 5238.2                                         | 32.8                             | Combined Ba                          | ckward                    | 3558.1                                         | 84.3                              |

The last column represents updating the filters, whose size K X C X R X S usually is much smaller than either the input or the output images. Hence, there is a trade-off between occupancy and data reuse in local memory as the kernels have to pass the data between each other.

**∆iCl** 

# **The Portable Performance**

| Alexnet      | AMD Fury X               |           | Nvidia GTX1080           |           |  |
|--------------|--------------------------|-----------|--------------------------|-----------|--|
| Forward feed | Performance<br>(Gflop/s) | % of peak | Performance<br>(Gflop/s) | % of peak |  |
| L1           | 4972.0                   | 57.8%     | 5279.2                   | 66.3%     |  |
| L2           | 5511.2                   | 64.0%     | 5553.9                   | 69.7%     |  |
| L3           | 5493.5                   | 63.9%     | 5595.8                   | 70.2%     |  |
| L4           | 5018.7                   | 58.3%     | 5163.5                   | 64.8%     |  |
| L5           | 4983.2                   | 57.9%     | 4732.5                   | 59.4%     |  |

Theoretical peak performance of AMD Fury X : 8602 Gflop/s.

Nvidia GTX1080 : 7967 Gflop/s.



## **The Portable Performance**

Theoretical peak performance of AMD Fury X : 8602 Gflop/s.

Nvidia GTX1080 : 7967 Gflop/s.



**�iCL** 

# **Conclusions and Future Work**

We proposed:

- An efficient implementation of convolutional layers which does not require extra memory space, which leads to larger batch size N and faster training process.
- Auto tuning approach to achieve high performance without digging into architecture detail.
- Portable performance cross different GPU vendors.

Future directions:

- Integrate our kernel generator and autotuner directly into a deep learning framework.
- Merging other layers like ReLU into single kernel.
- Fast and specialized algorithms like Winograd for 3-by-3 filters.







Same