Faster Deep Learning: Optimal DNN Primitives

By Kimberly Cook |Email | Jul 20, 2018 | 12357 Views

This is a paper in a "Seminal Papers in ML" series by MIT Machine Intelligence Community (MIC). MIT MIC aims to educate the community at-large about machine learning and lower the barriers to entry. To learn more, please visit or email

Deep Neural Networks (DNNs) provide unparalleled accuracy and performance in an increasingly wide range of industrial applications such as image recognition, natural language processing, and other complex problems like control of autonomous vehicles. Despite the massive result improvements over older machine learning algorithms, DNNs are very demanding in terms of computation, and require training on massive datasets, taking large amounts of time. Therefore, it makes sense that many efforts have been made to speed up both the training time as well as inference time (time taken to actually make a prediction given a trained model) of DNNs. This enables us to train on more data in lesser time, as well as to have faster inference on less powerful devices like mobile phones or embedded systems. In their paper, "Optimal DNN Primitive Selection with Partitioned Boolean Quadratic Programming (PBQP)", Andrew Anderson and David Gregg take a hybrid approach to optimizing DNN computations. They focus on finding a solution to the problem of DNN primitive selection, which can be described as deciding which algorithms and libraries to use to run each layer of a DNN the problem is explained in detail below. They also reduce the problem to a known NP-hard graph problem, PBQP, and use an off-the-shelf PBQP-solver to solve it.

In evaluating the results, particular emphasis is placed on the convolution operation which is extremely computationally intensive and is used in almost all image processing DNNs such networks are called Convolutional Neural Networks (CNNs).

DNN Primitives and their Importance
DNNs consist of a directed graph of layers, and it is on the directed edges between these layers that data flows occur. Each layer processes the data and consists of standard mathematical operators like convolution, activation, pooling, or fully-connected layers. The output of one layer feeds into the next one and this corresponds to the data flows. These layers are implemented as a set of primitive routines, where each primitive is an optimized implementation of these mathematical operations. There are many possible algorithms and implementations for a layer, which is the same as saying there are many choices of primitive routines available for use, consisting of both open source implementations and proprietary libraries. For example Winograd and FFT are two algorithms that can be used to implement convolution. It is important to note that these primitives perform differently in different scenarios, depending on the specifications of the layer. The aim is to select the best primitive routine for each layer, such that the overall running time of the whole network is minimized this problem is called optimal DNN primitive selection.

What are Convolution Layers?
Although the methodology used is general, and can easily be implemented and tested for any type of layer, the evaluation focus is kept on convolution layers only. These usually dominate the run time and are also used in many important CNNs for image processing. The input to a convolution layer is generally a tensor, usually a three or four dimensional representation of a 2-D image. For example, for a three dimensional representation of a typical colored 2-D image, the first two dimensions usually encode the horizontal and vertical positions of the pixel, and the third dimension generally stores the amount of red, green, and blue color in the pixel. The convolution layer processes this tensor by sliding a series of small "filters" or "kernels" across the image to perform the mathematical convolution operation. An outline is shown in the diagram below, where the image of size C is convolved with a kernel of size C to get a output image (called "feature map") of size H and since there are M such kernels, the output is represented as a tensor of size HM. Multiple kernels are significant here as they can enable the network to "learn" different features and patterns. For example, one kernel might try to detect the presence of a cat in an image, and another kernel might learn to detect dogs. The resulting tensor with outputs from all the kernels- could then be fed into a max-pooling layer which would detect which of these M patterns was most significant, for example whether the cat kernel found a better match than the dog kernel this enables high accuracy image classification.

Primitive Selection and Data Format Agreement in Convolution Layers
A large number of different primitives for convolution exist. Direct-loop-methods, im2, kn2, Winograd, and FFT (using Fast Fourier Transform algorithms), are all families of algorithms that can be used to implement convolution. As mentioned before, the performance of each algorithm depends on the parameters and specifications of the convolution layer. For example, direct loop methods perform well on strided convolution, in which the kernel skips over some of the input pixels to give a smaller output image, and the im2 family does well in this case too. However, direct loop methods are generally slower and im2 algorithms are very memory intensive. Winograd is very fast but is unpredictable. FFT is also an option with a balance of performance benefits, but is bad for small kernels; despite having a lower asymptotic complexity, it has a larger overhead, thus causing the lower complexity to only be beneficial when the inputs are large. Other parameters, like the number of channels (i.e. the value of C in the diagram above) also affect the optimal choice of primitive.

Since in a typical CNN, the parameters for each layer vary greatly, it would make sense to have a different choice of primitive for each layer in order to minimize the total run time of the computations on the whole network. At first glance, this seems like a very easy problem to solve: before actually running the whole network, we could simply try out each primitive on each layer and note the run time in a list (let's call this list a "cost vector"). For this trial run we can just use random input values since the run time does not really depend on what the values are themselves, just their total number. This actually is in fact what the authors do, to start with.

After sampling the run times, we can just pick the best algorithm for each layer by seeing which one performed the best in the trials problem solved! The process would look like this:

Here the convolution layers are represented as nodes: conv1, conv2, and conv3. And the three primitives are A, B, and C. The primitive cost vectors (determined by the trial runs) for each layer are shown, and we can see that optimizing the whole network is just picking the minimum cost primitive. In this case it corresponds to picking algorithm B for conv1, algorithm C for conv2, and algorithm B again for conv3. The total cost for the network is just the sum of the costs of each layer.

However, this naive solution is missing one crucial aspect. It does not take into account the fact that each of the primitives operates on data in different data formats. For example, some primitives take in and produce output values in 16-bit floating point format, and others use 32-bit floating points. Some primitives also require specific permutations of the dimensions of input images CHW (Channel: Height , Width) and HWC (Height Width  Channel) are two different data formats and specific primitives may operate on only one of the formats. Another example is, if one operates on signals, the data could be represented in frequency domain representation, or in time domain representation (these are both just different ways of representing the same signal) and different primitives might support only one of these formats. This must be taken into account as in the diagram above, it may be that conv1 uses a different data format than conv2, and the data will need to be converted from the format of conv1 to the conv2 format before it can pass between them this conversion costs time. Thus we need to take into account the cost of data format conversion in our optimization as well.

The resulting optimization problem now takes into account not only the run times of the primitives, but also the data format conversion times between primitives used in consecutive layers. The authors show that this is equivalent to a known NP-hard problem, Partitioned Boolean Quadratic Programming (PBQP). In PBQP, we are given a graph, and each node has to be assigned a label, from a given set of labels representing costs. Thus each node has a "cost vector" for all labels, and this is similar to problem in the previous graph. The cost of picking a label for a node adds the corresponding cost to the total cost of our objective, which is what we want to minimize. But in addition to the costs of the nodes, there is also a cost for each edge. The cost of an edge depends on both the choice of label for the source node, as well as the choice of label for the target node. Since it depends on both the connected nodes' labels, it can be thought of as a matrix (as it is indexed by two labels). This is illustrated in the diagram below:

We can see in the diagram above, the lowest cost primitive for conv1 is primitive b with a cost of 6, and the lowest cost primitive for conv2 is primitive c with a cost of 14. But now we also have to consider the edge costs and using this assignment gives an extra edge cost of 5 as given by indexing into the edge cost matrix appropriately. So actually choosing primitive c for conv1 is better as it gives and edge cost of zero. The total cost is then simply the sum of all the node and edge costs this is now the objective that we want to minimize.

Implementation and Testing
Now that we have the correct problem to solve, we can implement and test it. For this, we first need the costs of running the primitives and also the costs for conversions between data formats. The costs of the running the primitives are pre-computed by running the primitives for each layer for random inputs of the actual size, and this should be reasonably accurate since the run time of a layer depends on the input sizes rather than the actual values.

The data format conversion costs are also pre-computed by running the conversions and sampling the times in the case where no direct conversion is possible, the cost of lowest cost conversion path is used. This means if data format A, for example cannot be converted to data format B. But A can be converted to data format C which in turn can be converted into B, then the sum of the cost of the conversions is used (this is modelled as an All-Pairs-Shortest-Paths problem and solved easily).

Once the cost vectors and matrices are ready, an off-the-shelf PBQP-solver is used to get the optimal primitive selection for the DNN to be bench-marked. The performance was tested on three popular neural networks: AlexNet, VGG Model E, and GoogleNet. The runtimes are compared to the performance when using only a single primitive throughout, as well as to Caffe (a popular deep learning framework). The benchmark for Intel's MKL-DNN library is also included, which is an optimized JIT compiler for DNN primitives. The experiments were performed on different hardware architectures and the speed results for Intel Haswell are shown below (higher is better):

As we can see that on Intel Haswell, the PBQP primitive assignment outperforms all the other methods in the case of multiple threads, even Intel's own MKL-DNN neural network compiler. In the case of a single thread running, the MKL-DNN is slightly better in performance in the AlexNet and GoogleNet networks, but the PBQP method comes very close and still outperforms the rest of the methods. For VGG the PBQP method surpasses MKL-DNN as well.

The performance results for ARM Cortex are shown below (higher is better):

Here we can also see that the PBQP assignment outperforms every other implementation of the networks on the ARM Cortex-A57 architecture.

Discussion and Conclusion
It is clear from the experimental results that the PBQP formulation is extremely effective at selecting the optimal primitives to implement a DNN. Other DNN optimization frameworks exist, and some use similar techniques to enable faster DNN computation. Perhaps the fastest example is cuDNN by NVIDIA, which also uses the implementation/primitive that is most likely to be the fastest for a given layer. This is in contrast to the PBQP solver in that it does not take into account edge costs, i.e. the time for conversion of data formats. Tensorflow XLA is another framework which is basically an ahead-of-time compiler for DNNs. It computes costs of cross-layer transitions and data format conversions and tries to merge layers to avoid these costs, which is somewhat in parallel to the PBQP approach. The results show that the PBQP method improves DNN run time and thus it is perhaps well suited to be adopted by DNN optimization frameworks for improvement. The authors presented this paper at the International Symposium of Code Generation and Optimization (CGO) 2018, and the methods employed in this paper enable us to train DNNs faster, as well as perform deep learning and inference on an increasing number of inexpensive mobile and embedded processors.

The article was originally published here

Source: HOB