Research Blog
Review of recent work by the broader Computer Vision community
Convolutional Neural Networks (CNNs) have been widely adopted in fields such as computer vision, speech processing, and natural language processing. However, deploying them on real-time, resource-constrained devices that operate under tight latency and compute budgets is not always practically feasible. Compression techniques, e.g. pruning and quantization, can be applied to pretrained models to deploy them efficiently on such devices. Quantizing the floating-point weights and activations of deep CNNs to low-bit integers reduces both the memory footprint and the inference time, and is a critical technology for broadening the applicability of such networks. This article mainly focuses on one such state-of-the-art quantization technique.
Quantization converts the floating-point (FP32) weights and activations of deep CNNs to low-bit integers. Among several variations, a popular choice is the asymmetric uniform quantizer, since non-uniform methods are typically not suitable for efficient hardware execution. Given a tensor \(\mathbf{x}\) in FP32 with \(\textit{l}\) and \(\textit{u}\) as its lower and upper bounds, its quantized version \(\mathcal{Q}(\mathbf{x})\) can be written as,
\[\mathcal{Q}(\mathbf{x}) = \mathrm{round}( \mathrm{clip}(\mathbf{x}, \mathit{l}, \mathit{u})/\Delta ),\]where \(\Delta\) denotes the scale factor that maps the FP32 values to fixed-point integers. \(\Delta\) can be computed as,
\[\Delta = (\mathit{u} - \mathit{l})/(2^b -1),\]where \(b\) denotes the low-bit integer precision.
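To make these two formulas concrete, here is a minimal NumPy sketch of the quantizer; note that practical integer kernels also fold in a zero-point offset so that the result lands exactly in \([0, 2^b - 1]\), which is omitted here for brevity:

```python
import numpy as np

def quantize(x, b=8):
    """Asymmetric uniform quantizer following the two formulas above."""
    l, u = float(x.min()), float(x.max())        # min-max range of the FP32 tensor
    delta = (u - l) / (2 ** b - 1)               # scale factor
    x_q = np.round(np.clip(x, l, u) / delta)     # Q(x) = round(clip(x, l, u) / delta)
    return x_q, delta

x = np.random.randn(4, 4).astype(np.float32)
x_q, delta = quantize(x, b=8)
x_hat = x_q * delta                              # dequantize for a quick sanity check
print(np.abs(x - x_hat).max())                   # reconstruction error is at most ~delta / 2
```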
The computation of \(\Delta\) relies on the min-max range of the input FP32 tensor \(\mathbf{x}\). For a given deep neural network, the range of the FP32 weights can be computed directly. For the activations, however, the original training samples (or a subset of them) are required: they are fed through the network to generate the activations, whose ranges are then recorded. Once this range calibration is complete, the corresponding \(\Delta\) for each activation can be computed as well. Finally, the recorded \(\Delta\) values of the weights and activations are used to quantize the entire network before inference (static mode). We encourage readers to refer to this article for a detailed analysis of various quantization schemes and their implementation.
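This calibration step can be sketched in PyTorch with forward hooks. In the sketch below, `calib_loader` is a hypothetical loader over the available calibration samples, and only ReLU outputs are tracked to keep the example short:

```python
import torch
import torchvision

# Calibration sketch: record the min-max range of every ReLU output.
model = torchvision.models.resnet18(pretrained=True).eval()
ranges = {}

def make_hook(name):
    def hook(module, inputs, output):
        lo, hi = output.min().item(), output.max().item()
        if name in ranges:
            lo, hi = min(lo, ranges[name][0]), max(hi, ranges[name][1])
        ranges[name] = (lo, hi)
    return hook

hooks = [m.register_forward_hook(make_hook(n))
         for n, m in model.named_modules() if isinstance(m, torch.nn.ReLU)]

with torch.no_grad():
    for images, _ in calib_loader:   # calib_loader: a subset of the original training data
        model(images)

for h in hooks:
    h.remove()

# Each recorded (l, u) pair then gives delta = (u - l) / (2**b - 1) for that activation.
```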
Access to the original training samples, in full or in part, may not be practically feasible for some tasks, e.g., medical imaging, where the user's privacy is prioritized above all. Therefore, quite a few data-free approaches have recently been proposed, in which the task is to perform range calibration of the activations without using any original training samples. We now discuss one of the earliest and most popular data-free quantization approaches, called ZeroQ.
A naive approach to this challenge is to draw random data (Fig 1) from a Gaussian distribution with zero mean and unit variance and feed it into the model. Once the activations are generated, their ranges can be recorded to compute \(\Delta\). However, this approach cannot capture the correct statistics of the activations produced by the original training data. ZeroQ instead uses a data distillation technique to generate synthetic samples (Fig 2). The generated synthetic samples preserve more fine-grained local structure than random Gaussian samples, and thus better approximate the activation statistics of the original training data.
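Under this naive scheme, the calibration loop from the earlier sketch simply runs on standard normal noise instead of real images, e.g.:

```python
with torch.no_grad():
    for _ in range(32):                        # a few synthetic "batches"
        noise = torch.randn(8, 3, 224, 224)    # N(0, 1) inputs at ImageNet resolution
        model(noise)                           # the hooks record ranges exactly as before
```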
ZeroQ solves the distillation optimization problem to learn an input data distribution that best matches the statistics encoded in the Batch-Normalization (BN) layers of the model. Mathematically, it can be written as,
\[\min_{x^r} \sum_{i=0}^L \|\tilde \mu_i^r - \mu_i\|_2^2 + \|\tilde \sigma_i^r - \sigma_i\|_2^2\]where \(x^r\) is the reconstructed (distilled) input data, \(\tilde\mu_i^r/\tilde\sigma_i^r\) are the mean/standard deviation of the distilled data distribution at layer \(i\), and \(\mu_i/\sigma_i\) are the corresponding mean/standard deviation parameters stored in the BN layer at layer \(i\). In other words, ZeroQ generates input data which, when fed into the model, produces activation statistics that closely match those stored in the BN layers, and hence those of the original training data. Consequently, the recorded ranges are close to the true ones and \(\Delta\) can be computed with a better approximation.
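A minimal PyTorch sketch of this BN-statistics matching is given below. It illustrates the idea rather than ZeroQ's exact implementation; the function name, hyperparameters, and the use of Adam are assumptions:

```python
import torch
import torch.nn as nn

def distill_data(model, batch_size=8, img_shape=(3, 224, 224), iters=500, lr=0.1):
    """Learn an input batch whose per-layer statistics match the running
    mean/std stored in the model's BatchNorm layers."""
    model.eval()
    captured = []

    def bn_hook(module, inputs, output):
        x = inputs[0]                                   # input to the BN layer
        captured.append((module, x.mean(dim=(0, 2, 3)), x.std(dim=(0, 2, 3))))

    handles = [m.register_forward_hook(bn_hook)
               for m in model.modules() if isinstance(m, nn.BatchNorm2d)]

    x_r = torch.randn(batch_size, *img_shape, requires_grad=True)  # start from Gaussian noise
    opt = torch.optim.Adam([x_r], lr=lr)

    for _ in range(iters):
        captured.clear()
        opt.zero_grad()
        model(x_r)
        loss = x_r.new_zeros(())
        for bn, mean_r, std_r in captured:              # sum of squared statistic mismatches
            loss = loss + ((mean_r - bn.running_mean) ** 2).sum() \
                        + ((std_r - bn.running_var.sqrt()) ** 2).sum()
        loss.backward()
        opt.step()

    for h in handles:
        h.remove()
    return x_r.detach()
```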
ZeroQ also supports mixed-precision quantization of deep CNNs. For an \(L\)-layer model with \(m\) possible precision options, the mixed-precision search space, denoted as \(S\), has an exponential size of \(m^L\). To reduce this exponential search space, the authors introduce a sensitivity metric, defined as
\[\Omega_i(k) = \frac{1}{N_{dist}} \sum_{j=1}^{N_{dist}} \texttt{KL}(\mathcal{M}(\theta; x_j), \mathcal{M}(\tilde\theta_i(\textit{k-bit}); x_j)),\]where \(\Omega_i(k)\) measures how sensitive the \(i\)-th layer is when quantized to \(k\)-bit precision, \(\tilde\theta_i(\textit{k-bit})\) refers to the model parameters with only the \(i\)-th layer quantized to \(k\)-bit precision, and \(x_j\) are the \(N_{dist}\) distilled samples. The main idea is to opt for a higher bit precision for layers that are more sensitive, and a lower bit precision for layers that are less sensitive. For this, the authors use a Pareto frontier approach and optimize the following problem
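The sensitivity computation can be sketched as follows; `quantize_layer` is a hypothetical helper that applies the uniform quantizer from above to a single layer's weights, and `distilled_batches` holds the distilled data:

```python
import copy
import torch
import torch.nn.functional as F

def layer_sensitivity(model, quantize_layer, layer_name, k, distilled_batches):
    """Average KL divergence between the original model's outputs and those of
    a copy in which only `layer_name` is quantized to k bits."""
    model_q = copy.deepcopy(model)
    quantize_layer(model_q, layer_name, k)            # hypothetical helper (see above)

    total, n = 0.0, 0
    with torch.no_grad():
        for x in distilled_batches:
            log_p = F.log_softmax(model(x), dim=1)    # reference distribution (log)
            log_q = F.log_softmax(model_q(x), dim=1)  # perturbed-model distribution (log)
            total += F.kl_div(log_q, log_p, log_target=True, reduction="batchmean").item()
            n += 1
    return total / n                                  # Omega_i(k)
```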
\[\min_{\{k_i\}_{i=1}^L} \Omega_{sum} = \sum_{i=1}^L \Omega_i(k_i)~~\mathrm{s.t.}~\sum_{i=1}^L P_i \cdot k_i \leq S_{target},\]where \(k_i\) is the quantization precision of the \(i\)-th layer, \(P_i\) is the parameter count of the \(i\)-th layer, and \(S_{target}\) is the target size of the quantized model. In simple words, among all configurations that meet the size budget, it chooses the bit-precision setting with the minimum overall sensitivity of the network.
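As an illustration of this size-constrained selection, the greedy heuristic below repeatedly lowers the precision of the layer whose sensitivity increase per bit saved is smallest until the budget \(S_{target}\) is met. It is a simple stand-in for, not a reproduction of, the paper's Pareto frontier computation:

```python
def choose_bitwidths(sensitivities, param_counts, size_target, bit_options=(8, 6, 4, 2)):
    """sensitivities[i][k] = Omega_i(k); param_counts[i] = P_i; size_target = S_target (bits)."""
    bits = {i: max(bit_options) for i in sensitivities}      # start every layer at full precision

    def model_size():
        return sum(param_counts[i] * bits[i] for i in bits)  # sum of P_i * k_i

    while model_size() > size_target:
        best_layer, best_cost = None, float("inf")
        for i in bits:
            lower = [k for k in bit_options if k < bits[i]]
            if not lower:
                continue                                     # already at the lowest precision
            k_next = max(lower)
            d_omega = sensitivities[i][k_next] - sensitivities[i][bits[i]]
            d_bits = param_counts[i] * (bits[i] - k_next)
            if d_omega / d_bits < best_cost:
                best_layer, best_cost = i, d_omega / d_bits
        if best_layer is None:
            break                                            # cannot shrink any further
        bits[best_layer] = max(k for k in bit_options if k < bits[best_layer])
    return bits
```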
The authors show that their results are superior to existing post-training quantization schemes on both classification and object detection tasks.