Neural Network Inference Optimization/Acceleration

Subrata Goswami
Sep 30, 2019


The following is an attempt to capture the essence of inference optimization. Being able to run inference as quickly as possible is very important for neural network based products. The sections below briefly describe some techniques, technologies, and frameworks currently available in various stages of maturation. Although all descriptions here are for Tensorflow running on various Nvidia GPUs (T4, RTX 2080, GTX 1080 Ti, etc.), there is no reason why the general ideas would not work in other frameworks such as PyTorch, MXNET, etc.

Optimization can be done along multiple dimensions. The most important dimensions for inference are throughput, memory, and energy. More often than not, they are intertwined. For example, with lower memory needs, more of the network can fit inside a chip's on-chip memory, reducing the need to access off-chip DRAM and thus increasing throughput. Lower memory use also reduces energy consumption. A few broad techniques have been used with various levels of success: layer and operator fusion, weight quantization, lower precision, weak connection elimination, fork-join parallelism, compute reuse, etc.

One of the earliest works on accelerating neural network processing is [1]. In the paper the authors describe three techniques that reduce memory footprint: pruning, quantization, and Huffman coding. The authors first prune the small-weight connections: all connections with weights below a threshold are removed, and the network is then retrained without the weak connections.
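As a rough illustration of magnitude pruning (a minimal numpy sketch, not the authors' implementation; the threshold value here is made up), the idea looks like this:

import numpy as np

def prune_small_weights(weights, threshold=0.05):
    # Zero out connections whose magnitude falls below the threshold.
    # In [1] the pruned network is then retrained with the removed
    # connections kept masked at zero.
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.random.randn(256, 256) * 0.1
pruned_w, mask = prune_small_weights(w, threshold=0.05)
print("fraction of connections removed:", 1.0 - mask.mean())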

Quantization involves placing each weight into one of a fixed number of bins. The bins are then labeled with a certain number of bits (e.g. 8 bits). The weight value represented by each bin is further refined by computing a centroid from the original values of all the weights that went into that bin. Huffman coding is a lossless compression method where more frequent symbols are assigned fewer bits, so by using Huffman codes to label the bins, a further reduction in memory footprint can be achieved.
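A rough, illustrative sketch of the binning and centroid refinement (a simplified 1-D k-means over the weights, not the paper's exact procedure) could look like the following; only the per-weight bin indices and the small codebook of centroids need to be stored.

import numpy as np

def quantize_weights(weights, bits=4, iters=10):
    # Cluster weights into 2**bits bins and refine each bin's centroid
    # as the mean of its member weights (1-D k-means).
    flat = weights.ravel()
    k = 2 ** bits
    centroids = np.linspace(flat.min(), flat.max(), k)   # initial bin values
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                centroids[j] = members.mean()
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return idx.reshape(weights.shape), centroids

indices, codebook = quantize_weights(np.random.randn(64, 64), bits=4)
reconstructed = codebook[indices]   # de-quantized weights used at inference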

Tensorflow creates a compute graph from the network definition (in Python, C++, etc.). During execution this graph is used to generate output from input. Because the graph is defined ahead of time, Tensorflow can take advantage of it for optimization.
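A minimal TF 1.x-style sketch of this define-then-run model (a toy graph, not any particular network) is shown below; the whole graph exists before any data flows through it, which is what graph optimizers such as Grappler operate on.

import numpy as np
import tensorflow as tf

# Define the graph once; nothing executes yet.
graph = tf.Graph()
with graph.as_default():
    x = tf.compat.v1.placeholder(tf.float32, shape=[None, 4], name="input")
    w = tf.Variable(tf.random.normal([4, 2]), name="weights")
    y = tf.nn.relu(tf.matmul(x, w), name="output")

# Execution replays the pre-defined graph on new inputs.
with tf.compat.v1.Session(graph=graph) as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    out = sess.run(y, feed_dict={x: np.ones((3, 4), np.float32)})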

For most real time applications, a GPU is very likely to be used. GPU maker NVIDIA currently supports three main optimization technologies: Automatic Mixed Precision (AMP), XLA (Accelerated Linear Algebra), and TensorRT.

AMP [2,3] is the easiest and most transparent way to accelerate inference (and training). However, it requires support from the underlying GPU in the form of Tensor Cores, compute elements designed for matrix multiply-accumulate linear algebra (e.g. AX+B). The most commonly used data type in neural networks is FP32 (32-bit floating point). AMP converts as many FP32 nodes as possible to FP16, with almost no loss in accuracy.
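As a reference, a minimal TF 1.14-era sketch of enabling the AMP graph rewrite through the session config is shown below (in NVIDIA's NGC containers the environment variable TF_ENABLE_AUTO_MIXED_PRECISION=1 achieves the same thing); exact option names can differ between TensorFlow versions.

import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

# Ask Grappler's auto_mixed_precision pass to rewrite eligible FP32 ops to FP16.
config = tf.compat.v1.ConfigProto()
config.graph_options.rewrite_options.auto_mixed_precision = (
    rewriter_config_pb2.RewriterConfig.ON)

with tf.compat.v1.Session(config=config) as sess:
    # ... import the graph and run inference as usual ...
    pass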

The following is a sample output from Tensorflow when AMP is enabled. At the end of the output, it displays how many nodes have been converted to FP16.

2019-09-26 15:44:02.110787: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1816] Running auto_mixed_precision graph optimizer
2019-09-26 15:44:02.115694: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1268] No whitelist ops found, nothing to do
2019-09-26 15:44:02.678007: I
I0926 15:44:03.035943 140424294598464 session_manager.py:500] Running local_init_op.
2019-09-26 15:44:03.069036: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1816] Running auto_mixed_precision graph optimizer
2019-09-26 15:44:03.069379: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1268] No whitelist ops found, nothing to do
I0926 15:44:03.091187 140424294598464 session_manager.py:502] Done running local_init_op.
2019-09-26 15:44:03.611920: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1268] No whitelist ops found, nothing to do
2019-09-26 15:44:03.817348: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1772] Converted 176/821 nodes to float16 precision using 2 cast(s) to float16 (excluding Const and Variable casts)
2019-09-26 15:44:08.844850: I tensorflow/stream_executor/cuda/ptxas_utils.cc:202]
2019-09-26 15:45:02.924840: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1816] Running auto_mixed_precision graph optimizer
2019-09-26 15:45:02.925272: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1268] No whitelist ops found, nothing to do

XLA [4, 5] is also very easy to enable in Tensorflow running on Nvidia GPUs. Tensorflow has multiple layers of graph optimization. The topmost one is called Grappler. Grappler outputs High Level Optimizer (HLO) IR (intermediate representation), which is the input to XLA. XLA first performs target-independent optimizations (e.g. operation fusion), followed by target-dependent HLO optimizations. After this comes target-specific code generation. For CPU and GPU this is LLVM IR, and LLVM is then invoked to generate machine code: NVPTX for Nvidia GPUs and the native ISA for CPUs.
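A minimal sketch of turning on XLA auto-clustering for a session is shown below; the same effect can usually be obtained without code changes by setting TF_XLA_FLAGS=--tf_xla_auto_jit=2.

import tensorflow as tf

# Enable XLA JIT compilation (auto-clustering) for eligible ops in the graph.
config = tf.compat.v1.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.compat.v1.OptimizerOptions.ON_1)

with tf.compat.v1.Session(config=config) as sess:
    # ... import the graph and run it; clusters of ops are compiled by XLA ...
    pass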

Enabling XLA seems to have a bigger impact on training than on inference. Surprisingly, in inference experiments with official Tensorflow models (and custom networks), a degradation in image/sec throughput was observed when XLA was enabled. One exception has been the benchmarks repository in the official Tensorflow github area [6]. The networks in this repository seem to be alternative implementations of some of the official Tensorflow models (e.g. ResNet, SSD, etc.).

Typical output from Tensorflow when XLA is enabled looks like the following.

2019-09-26 16:31:04.660067: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-09-26 16:31:04.691104: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-09-26 16:31:24.332694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-26 16:31:24.332735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-09-26 16:31:24.332741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-09-26 16:31:24.352039: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-26 16:31:24.352787: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-26 16:31:24.353565: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-26 16:31:24.381550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14132 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
2019-09-26 16:31:24.384187: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x562286f9b350 executing computations on platform CUDA. Devices:

For training, XLA seems to increase throughput (images/sec) significantly. On a single T4 GPU, Tensorflow Official ResNet50 training on the ImageNet 2012 data set without AMP and XLA takes about 3.87 hours per epoch. With just XLA enabled, that goes down to 3.48 hours. With both AMP and XLA enabled, it comes down significantly, to 1.14 hours. There are 1,281,167 images per epoch in ImageNet 2012, and a batch size of 64 was used for all the experiments.
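For reference, 1,281,167 images in 3.87 hours works out to roughly 92 images/sec, 3.48 hours to roughly 102 images/sec, and 1.14 hours to roughly 312 images/sec.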

TensorRT [7,8] is an optimized inference engine from Nvidia. TensorRT provides graph structure optimizations, precision optimizations, kernel auto-tuning, and memory reuse optimizations [14]. TensorRT's graph-based optimizations fall under two categories: vertical fusion and horizontal fusion. Vertical fusion merges sequential operators into a single combined operator. Horizontal fusion merges layers that are not necessarily sequential but share input data and filter size. Layer fusion can offer significant performance improvements because every operation requires a kernel launch, which is often slower than the actual kernel computation. By fusing layers into fewer kernels, kernel launches and their overhead are reduced. Fusing also reduces the cost of reading and writing intermediate data to memory.
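As a concrete illustration of vertical fusion (not TensorRT's actual implementation), the following numpy sketch folds a batch-normalization layer into the preceding convolution's weights and bias, so the two layers collapse into one:

import numpy as np

def fold_batchnorm_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    # Fold y = gamma * (conv(x, w) + b - mean) / sqrt(var + eps) + beta
    # into a single convolution with adjusted weights and bias.
    # w: (kh, kw, c_in, c_out); b, gamma, beta, mean, var: length c_out.
    scale = gamma / np.sqrt(var + eps)   # per-output-channel scale
    w_folded = w * scale                 # broadcasts over the last axis
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded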

TensorRT takes sub-graphs from a bigger graph and converts them into optimized nodes. These nodes show up as TRTEngineOp_x nodes in the Tensorboard graph display. Incorporating TensorRT is much more difficult than AMP or XLA, primarily because TensorRT has a number of hard limits, and secondarily because its integration with Tensorflow is not complete. TensorRT can be used either standalone or integrated with Tensorflow, which is called TF-TRT [8].

Standalone TensorRT is readily doable for straightforward networks (e.g. classification, SSD, etc.). It essentially requires a graph in some form (e.g. meta, frozen, or saved) with known inputs and outputs. For complex networks like FasterRCNN, TensorRT is more difficult to incorporate. Moreover, if multiple inputs/outputs are present and not all of them are active for every run, standalone TensorRT seems to produce erroneous results. This is unlike Tensorflow, where it is possible to have unconnected sub-graphs within a larger graph. It is also not possible in TensorRT to transform multiple sub-graphs individually and then connect them into a larger graph.
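For reference, a rough TensorRT 5/6-era Python sketch of building an engine from a UFF file (converted from a frozen Tensorflow graph) is shown below; the input/output names, shape, and file name are placeholders, and the API has changed in later TensorRT releases.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network() as network, \
     trt.UffParser() as parser:
    # Register the graph's inputs/outputs and parse the UFF model.
    parser.register_input("input", (3, 224, 224))
    parser.register_output("output")
    parser.parse("model.uff", network)

    builder.max_batch_size = 8              # hard upper limit fixed at build time
    builder.max_workspace_size = 1 << 30    # scratch GPU memory reserved up front
    builder.fp16_mode = True                # allow FP16 kernels where beneficial
    engine = builder.build_cuda_engine(network)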

TF-TRT is a more viable solution for a complex graph, as it allows both plain Tensorflow nodes and TensorRT nodes to reside in the same graph at the same time. However, here too it appears that only one part of the full graph can be transformed, primarily because there can only be one TRTEngineOp_0 node. Some of the hard limits of TensorRT are the maximum batch size, the reservation of a fixed amount of GPU memory, and the creation of a new set of nodes for each batch size. TF-TRT has been observed to offer more than a 2x throughput increase for Faster RCNN type object detectors.
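A minimal TF 1.14-era sketch of a TF-TRT conversion is shown below; the frozen graph and output node names are placeholders, and TrtGraphConverter (and later TrtGraphConverterV2) offers an alternative interface.

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert TensorRT-compatible sub-graphs of a frozen graph into TRTEngineOp
# nodes; unsupported ops remain as regular Tensorflow nodes.
converted_graph_def = trt.create_inference_graph(
    input_graph_def=frozen_graph_def,          # a tf.GraphDef of the frozen model
    outputs=["detection_boxes", "detection_scores"],
    max_batch_size=8,                          # hard limit baked into the engines
    max_workspace_size_bytes=1 << 30,          # GPU memory reserved for TensorRT
    precision_mode="FP16",
    minimum_segment_size=3)                    # smallest sub-graph worth converting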

The following chart shows some experimental runs using AMP, TF-TensorRT, and XLA on four different types of networks: ResNet101, Darknet53, FasterRCNN, and Yolo v3. All three do a good job with ResNet and FasterRCNN. However, on Darknet53 and Yolo v3, TF-TensorRT falls short of AMP and AMP+XLA.

Other ways to increase throughput are to reduce the input image size, if the resulting decrease in precision (mAP) is acceptable, or to use shallower networks, with the same caveat on mAP. The Yolo V2 work quantified some of these trade-offs [16]. In another experiment it was observed that on a Yolo V3 network, going from an image size of 640x480 to 320x240 increased throughput by 2.25x, not quite the 4x expected from the reduction in pixel count.

Beyond the above methods, inference optimization and acceleration is an active area of research [10, 11, 12].

In [10], the authors describe Deep Reuse. The basic idea of Deep Reuse is to leverage similarities among neuron vectors, so that computation results obtained for one neuron vector can be effectively reused for other neuron vectors during CNN inference. By identifying groups of similar neuron vectors, they were able to reuse computation. 2D convolution is a very widely used operation, with filters ranging from 1x1 to 7x7; a neuron vector in this context can be all the pixels covered by the filter at a given location (laid out in a row rather than a square). The authors use a hashing mechanism called Locality Sensitive Hashing (LSH) to cluster the vectors into similarity groups; a toy sketch of this bucketing appears just before the references. LSH hashes similar input items into the same "buckets" with high probability, which is very different from a cryptographic hash, where the goal is to randomize. There are also locality preserving hash functions, which maintain the ordering relationship of inputs.

In [11], the authors describe a DNN benchmark suite that can run on any platform that supports CUDA or OpenCL. In [12], the authors describe TapirXLA, a replacement for TensorFlow's XLA compiler that embeds recursive fork-join parallelism into XLA's low-level representation of code. ML computations and linear algebra routines often exhibit structured parallelism, specifically recursive fork-join parallelism, which includes loop parallelism. The idea of fork-join is that a larger task can be divided into smaller tasks whose solutions can then be combined; as long as the smaller tasks are independent, they can be executed in parallel. In [13], the authors apply common optimization techniques such as loop unrolling, vectorization, graph partitioning, and operator fusion to RNNs. In [14], the author systematically analyzes layer and operator fusion and compares their implementation with TensorRT's. In [15], the authors introduce a framework to systematically reason about operator fusion plans in a DAG, exploiting the elimination of materialized intermediates, temporal locality, multiple aggregates over common sub-expressions (CSEs), and sparsity.
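A toy numpy sketch of the random-hyperplane (sign) LSH bucketing idea follows; it only illustrates the clustering step, not Deep Reuse's full reuse machinery, and the vector size and bit count are arbitrary.

import numpy as np

def lsh_buckets(neuron_vectors, num_bits=10, seed=0):
    # Hash each vector with random hyperplanes; vectors whose sign patterns
    # match land in the same bucket and are treated as similar, so a result
    # computed for one representative can be reused for the rest.
    rng = np.random.RandomState(seed)
    hyperplanes = rng.randn(neuron_vectors.shape[1], num_bits)
    signs = (neuron_vectors @ hyperplanes) > 0           # (n, num_bits) booleans
    signatures = signs.dot(1 << np.arange(num_bits))     # pack bits into integers
    buckets = {}
    for i, sig in enumerate(signatures):
        buckets.setdefault(int(sig), []).append(i)
    return buckets

vectors = np.random.randn(1000, 27)   # e.g. 3x3x3 input patches laid out as rows
groups = lsh_buckets(vectors)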

References:

  1. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, https://arxiv.org/abs/1510.00149
  2. Automated Mixed-Precision for TensorFlow Training, GTC March 2019, https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s91029-automated-mixed-precision-tools-for-tensorflow-training-v2.pdf
  3. https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
  4. https://www.tensorflow.org/xla
  5. TensorFlow Graph Optimizations, https://web.stanford.edu/class/cs245/slides/TFGraphOptimizationsStanford.pdf
  6. https://github.com/tensorflow/benchmarks
  7. TensorRT Inference with TensorFlow , https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9431-tensorrt-inference-with-tensorflow.pdf
  8. How to Speed Up Deep Learning Inference Using TensorRT, https://devblogs.nvidia.com/speed-up-inference-tensorrt/
  9. Accelerating Inference in TensorFlow with TensorRT (TF-TRT), https://docs.nvidia.com/deeplearning/frameworks/pdf/TensorFlow-TensorRT-User-Guide.pdf
  10. Deep Reuse: Streamline CNN Inference On the Fly via Coarse-Grained Computation Reuse, https://people.engr.ncsu.edu/xshen5/Publications/ics19.pdf
  11. Tango: A Deep Neural Network Benchmark Suite for Various Accelerators, https://arxiv.org/abs/1901.04987
  12. TapirXLA: Embedding Fork-Join Parallelism into the XLA Compiler in TensorFlow Using Tapir, https://arxiv.org/abs/1908.11338
  13. Accelerating Recurrent Neural Networks through Compiler Techniques and Quantization, http://learningsys.org/nips18/assets/papers/30CameraReadySubmissionmodelcompiler_camera_ready_no_final_flag.pdf
  14. Exploring Novel Architectures For Serving Machine Learning Models, https://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-73.pdf
  15. On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML, https://arxiv.org/pdf/1801.00829.pdf
  16. Daniel Gordon, YOLO 9000: Better, Faster, Stronger, https://www.youtube.com/watch?v=GBu2jofRJtk
