# Quantization Arithmetic

## Diving deep into the model compression technique

There are plenty of articles on quantization, but most cover only surface-level theory or give a simple overview. In this article, I’ll explain how quantization is actually implemented by deep learning frameworks.

Before getting into quantization, it’s good to understand the basic difference between two key concepts: float (floating-point) and int (fixed-point).

## Fixed-point vs Floating-point

### Fixed-point

Fixed-point means that a fixed number of bits is reserved for the integer part and the fractional part of a number. Integers have no fractional part, so every bit except one represents the integer, and the remaining bit is the sign bit. In ‘int8’, 7 bits represent the integer and one bit indicates whether the number is positive or negative. (In practice, signed integers are stored in two’s complement, which is why int8 spans -128 to 127 rather than -127 to 127.) In an unsigned integer there is no sign bit, so all bits represent the integer.

### Floating-point

Floating-point numbers don’t have a fixed number of bits assigned to the integer and fractional parts. Instead, the format reserves a certain number of bits for the significant digits of the number (called the mantissa) and a certain number of bits to say where within that number the binary point sits (called the exponent).

A number is represented by the formula $(-1)^s \times (1 + m) \times 2^{e - \text{bias}}$, where s is the sign bit, m is the mantissa, e is the exponent value, and bias is the bias number.
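To make the formula concrete, here is a small sketch that pulls the three fields out of an FP32 value and rebuilds the number from them. It assumes the standard FP32 layout (1 sign bit, 8 exponent bits, 23 mantissa bits, bias 127) and only handles normal numbers; `decode_fp32` is a made-up helper name.

```python
import struct

def decode_fp32(x):
    """Split a float32 into (s, e, m) and rebuild its value from the formula."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    s = bits >> 31                     # 1 sign bit
    e = (bits >> 23) & 0xFF            # 8 exponent bits
    m = (bits & 0x7FFFFF) / 2 ** 23    # 23 mantissa bits as a fraction in [0, 1)
    bias = 127                         # FP32 exponent bias
    # Valid for normal numbers only (not zero, subnormals, inf, or NaN).
    value = (-1) ** s * (1 + m) * 2.0 ** (e - bias)
    return s, e, m, value

print(decode_fp32(-6.5))  # (1, 129, 0.625, -6.5), since -6.5 = -1.625 * 2**2
```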

### Why does it matter?

There are two main attributes when discussing a numerical format. The first is its dynamic range, which refers to the range of representable numbers. The second one is how many values can be represented within the dynamic range, which in turn determines the precision/resolution of the format (the distance between two numbers).

For signed integer formats, the dynamic range is $-2^{n-1}$ to $2^{n-1}-1$, where n is the number of bits. So for INT8, the range is -128 to 127, and for INT4, the range is -8 to 7. The number of representable values is $2^n$.
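The integer ranges are easy to check directly. The helper below is only an illustration (the name `signed_int_range` is made up) and assumes two's-complement signed integers.

```python
def signed_int_range(n_bits):
    """Return (min, max, number of values) for an n-bit signed integer."""
    lo = -(2 ** (n_bits - 1))
    hi = 2 ** (n_bits - 1) - 1
    return lo, hi, 2 ** n_bits

for n in (4, 8, 16):
    print(n, signed_int_range(n))
# 4 (-8, 7, 16)
# 8 (-128, 127, 256)
# 16 (-32768, 32767, 65536)
```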

But in FP32 (floating-point), the dynamic range is $\pm(2 - 2^{-23}) \times 2^{127} \approx \pm 3.4 \times 10^{38}$, and approximately $4.2 \times 10^{9}$ values can be represented. For FP16, the dynamic range is ±65504.
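The floating-point limits can be reproduced from the bit widths. This sketch assumes the standard layouts (FP32: 23 mantissa bits, maximum exponent 127; FP16: 10 mantissa bits, maximum exponent 15) and cross-checks the FP32 result against the largest finite bit pattern, `0x7F7FFFFF`.

```python
import struct

# Largest finite value = (2 - 2**-mantissa_bits) * 2**max_exponent
fp32_max = (2 - 2 ** -23) * 2.0 ** 127
fp16_max = (2 - 2 ** -10) * 2.0 ** 15

# Cross-check: decode the largest finite FP32 bit pattern.
fp32_max_from_bits = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]

print(fp32_max)                        # roughly 3.4e38
print(fp16_max)                        # 65504.0
print(fp32_max == fp32_max_from_bits)  # True
```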

We can see that FP32 is much more versatile than int, in that it is able to represent a wide range of distributions accurately. This is a nice property for deep learning models, where the distributions of weights and activations are usually very different (at least in dynamic range). In addition, the dynamic range can differ between layers in the model.

Now that we have a solid understanding of these key concepts, let’s go back to our primary topic: Quantization.

## Quantization

Quantizing FP32 to int8 has many benefits. int8 takes 4x less storage and memory bandwidth than FP32. Integer arithmetic is also typically faster, more energy-efficient, and more area-efficient in hardware than floating-point arithmetic.

Quantization is mainly about mapping floats to ints. Specifically, it’s done by mapping the min/max of the tensor (weights or activations) to the min/max of the int range (-128 and 127 for int8).

There are two methods of quantization — symmetric and asymmetric. Let’s say we have to quantize tensor w.

In asymmetric quantization, we use the actual min/max values of the tensor, i.e. min, max = min(w), max(w). The range is therefore asymmetric with respect to zero, so we need an offset called the zero-point: the integer that the float value 0 maps to.

In symmetric mode, we use the absolute maximum value, i.e. min, max = -max(abs(w)), max(abs(w)). The range is therefore symmetric with respect to zero.
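Here is a minimal sketch of the asymmetric case, written in plain Python to stay dependency-free. The exact zero-point convention differs between frameworks; this version assumes a signed int8 target and defines the zero-point as the integer that float 0 lands on. The function names are made up for illustration, and the tensor is assumed to contain at least two distinct values.

```python
def asymmetric_quantize(w, n_bits=8):
    """Map [min(w), max(w)] onto the signed integer range using a zero-point."""
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    w_min, w_max = min(w), max(w)
    scale = (qmax - qmin) / (w_max - w_min)    # integer range / float range
    zero_point = round(qmin - scale * w_min)   # integer that float 0.0 maps to
    w_q = [min(qmax, max(qmin, round(scale * v) + zero_point)) for v in w]
    return w_q, scale, zero_point

def dequantize(w_q, scale, zero_point):
    """Approximate inverse of asymmetric_quantize."""
    return [(q - zero_point) / scale for q in w_q]

w = [-0.5, 0.0, 0.25, 1.5]
w_q, scale, zero_point = asymmetric_quantize(w)
print(w_q, zero_point)                     # [-128, -64, -32, 127] -64
print(dequantize(w_q, scale, zero_point))  # close to the original values
```

In symmetric mode the zero-point is simply 0, so the offset disappears.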

### Scale factor

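To map floats to ints, the float is multiplied by a numeric constant (the scale factor). The scale factor is calculated by dividing the range of the integers (2^n-1) by the range of floats in the given tensor (2*max(abs(x)) in symmetric mode). There are two possibilities: full range and restrictive range. Full range uses the entire signed integer range, $-2^{n-1}$ to $2^{n-1}-1$, which gives a scale factor of $(2^n - 1) / (2\max|x|)$. Restrictive range gives up the most negative integer so that the integer range is itself symmetric, $-(2^{n-1}-1)$ to $2^{n-1}-1$, which gives a scale factor of $(2^{n-1} - 1) / \max|x|$.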

PyTorch and ONNX use full-range quantization, while TensorFlow, NVIDIA TensorRT, and Intel DNNL use restrictive-range. Full-range is slightly more accurate in theory, but there’s really no significant difference in practice.

Now, our quantized value can be calculated using x_q = round(scale*x_f).
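As a rough sketch of symmetric quantization, the snippet below computes both variants of the scale factor and applies x_q = round(scale*x_f). It assumes the usual definitions (full range: (2^n - 1) / (2*max(abs(x))); restrictive range: (2^(n-1) - 1) / max(abs(x))) and clamps the result to the int range. The function names are illustrative, and Python's `round` rounds ties to even, which may differ from the rounding mode a given framework uses.

```python
def symmetric_scale(x, n_bits=8, restricted=True):
    """Scale factor for symmetric quantization of the values in x."""
    max_abs = max(abs(v) for v in x)
    if restricted:
        return (2 ** (n_bits - 1) - 1) / max_abs   # 127 / max|x| for int8
    return (2 ** n_bits - 1) / (2 * max_abs)       # 127.5 / max|x| for int8

def quantize(x, scale, n_bits=8):
    """x_q = round(scale * x_f), clamped to the signed n-bit range."""
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    return [min(qmax, max(qmin, round(scale * v))) for v in x]

x = [-1.0, -0.3, 0.0, 0.7, 1.0]
print(quantize(x, symmetric_scale(x, restricted=True)))   # [-127, -38, 0, 89, 127]
print(quantize(x, symmetric_scale(x, restricted=False)))  # uses -128 as well
```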

Let’s see how a fully-connected layer is quantized using the symmetric mode.

### Fully-connected equation

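Writing each float value as its quantized value divided by its scale factor, the float output of the layer is

$$y_f = \sum x_f w_f + b_f = \sum \frac{x_q}{q_x} \frac{w_q}{q_w} + \frac{b_q}{q_b} = \frac{1}{q_x q_w} \left( \sum x_q w_q + \frac{q_x q_w}{q_b} b_q \right)$$

where x_f is a float activation, x_q is a quantized activation, and q_x is the scale factor of the activations; the same notation applies to the weights (w), biases (b), and outputs (y). This equation is the same for both restrictive- and full-range quantization.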

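Solving further for y_q,

$$y_q = \mathrm{round}(q_y \, y_f) = \mathrm{round}\left( \frac{q_y}{q_x q_w} \left( \sum x_q w_q + \frac{q_x q_w}{q_b} b_q \right) \right)$$

q_y/(q_x*q_w) and (q_x*q_w)/q_b can be precomputed. We don’t need to compute them at runtime. All other calculations are integer operations on int8 values.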
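The following sketch runs a single-output fully-connected layer both in float and in quantized form, so the two results can be compared. Several things here are assumptions made for illustration: restrictive-range scale factors, an output range of [-1, 1] that would normally come from calibration, a bias quantized with its own scale q_b, and a rescaled bias that is rounded so the accumulation stays in integers. All names are hypothetical.

```python
def restricted_scale(values, n_bits=8):
    return (2 ** (n_bits - 1) - 1) / max(abs(v) for v in values)

def quantize(values, scale, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    return [max(-qmax, min(qmax, round(scale * v))) for v in values]

def quantized_fc(x_q, w_q, b_q, q_x, q_w, q_b, q_y):
    """y_q = round(q_y/(q_x*q_w) * (sum(x_q*w_q) + (q_x*q_w)/q_b * b_q))."""
    out_rescale = q_y / (q_x * q_w)    # precomputed offline
    bias_rescale = (q_x * q_w) / q_b   # precomputed offline
    acc = sum(x * w for x, w in zip(x_q, w_q))  # integer multiply-accumulate
    acc += round(bias_rescale * b_q)            # bias moved to accumulator scale
    return round(out_rescale * acc)

x_f, w_f, b_f = [0.5, -1.2, 0.8], [0.3, 0.7, -0.4], 0.1
y_f = sum(x * w for x, w in zip(x_f, w_f)) + b_f   # float reference

q_x, q_w, q_b = restricted_scale(x_f), restricted_scale(w_f), restricted_scale([b_f])
q_y = 127 / 1.0   # assumed: outputs calibrated to lie in [-1, 1]

y_q = quantized_fc(quantize(x_f, q_x), quantize(w_f, q_w),
                   quantize([b_f], q_b)[0], q_x, q_w, q_b, q_y)
print(y_f, y_q, y_q / q_y)   # float result, int8 result, dequantized result
```

The dequantized result should agree with the float result to within about one quantization step (1/q_y).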

### Avoiding Overflows

The result of multiplying two n-bit integers is, at most, a 2n-bit number. In convolution layers, such multiplications are accumulated ck² times, where c is the number of input channels and k is the kernel width (assuming a square kernel).

Hence, to avoid overflow, the accumulator should be at least 2n+M bits wide, where M is at least log2(c·k²), the number of bits needed to cover the ck² accumulated products. In practice, 32-bit accumulators are mostly used.
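A quick way to sanity-check the accumulator width is to compute it from the number of accumulated products, c·k². The helper name and the example layer (512 input channels, 3x3 kernel) are made up.

```python
import math

def min_accumulator_bits(n_bits, in_channels, kernel_width):
    """Bits needed to sum c*k^2 products of two n-bit integers without overflow."""
    num_products = in_channels * kernel_width ** 2
    return 2 * n_bits + math.ceil(math.log2(num_products))

print(min_accumulator_bits(8, 512, 3))   # 29, so a 32-bit accumulator is enough
```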

## Calibration

Earlier, to calculate the scale factor, we used the full range of the tensor, i.e. min(w) to max(w). The broader the range we use, the less precise the mapping to int will be. So, we can remove outliers to get a narrower min/max range.

Narrowing the range won’t hurt much because most of the weights are very close to zero. In the figure above, if we hard-clip the min and max to -0.4 and 0.4 respectively, we can represent almost all the weights with better precision. Only the few values outside [-0.4, 0.4] won’t be represented accurately. TensorRT, MXNet, and some other inference frameworks support this.
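The effect of clipping can be seen on a toy example. The weights below are made up: a dense bulk inside [-0.4, 0.4] plus two outliers. The sketch quantizes and dequantizes them once with the full range and once clipped to 0.4, then compares the round-trip error on the bulk only (the outliers themselves are, of course, represented worse after clipping).

```python
def quantize_dequantize(w, clip, n_bits=8):
    """Symmetric round trip with values outside [-clip, clip] saturated."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = qmax / clip
    return [max(-qmax, min(qmax, round(scale * v))) / scale for v in w]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

bulk = [i / 1000 for i in range(-400, 401)]   # weights close to zero
outliers = [-2.0, 1.8]                        # rare large weights
w = bulk + outliers

full = quantize_dequantize(w, clip=max(abs(v) for v in w))
clipped = quantize_dequantize(w, clip=0.4)

n = len(bulk)
err_full = mean_abs_error(bulk, full[:n])
err_clip = mean_abs_error(bulk, clipped[:n])
print(err_full, err_clip)   # the clipped version is several times more precise
print(clipped[n:])          # outliers saturate at roughly -0.4 and 0.4
```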

In actual implementations, more complex calibration methods are used, such as minimizing the KL divergence. The KL divergence measures the loss of information when one distribution (here, the quantized one) is used to approximate another (the original one).

This is how TensorRT does it:

The steps above are executed before deploying the model. Based on the results gathered, the scale factors are calculated and then fixed once the model is deployed. Any values encountered at runtime that fall outside the previously observed ranges will be clipped.
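To give a feel for KL-based calibration, here is a heavily simplified sketch. It is not TensorRT's actual implementation. It builds a histogram of absolute values, tries several clipping thresholds, and keeps the one whose quantized histogram is closest (in KL divergence) to the reference histogram. The bin count (2048), number of levels (128), candidate spacing, and the small floor used to avoid log(0) are all assumptions, and the function names are made up. It also assumes the input contains at least one non-zero value and that `num_bins` is a multiple of `num_levels`.

```python
import math
import random

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) for two equal-length histograms (normalized internally)."""
    p_sum, q_sum = sum(p), sum(q)
    if p_sum == 0 or q_sum == 0:
        return float("inf")
    kl = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            p_n = p_i / p_sum
            q_n = max(q_i / q_sum, eps)   # floor avoids log(0); a simplification
            kl += p_n * math.log(p_n / q_n)
    return kl

def calibrate_threshold(values, num_bins=2048, num_levels=128):
    """Return the clipping threshold with the lowest KL divergence."""
    max_abs = max(abs(v) for v in values)
    bin_width = max_abs / num_bins
    hist = [0] * num_bins
    for v in values:
        hist[min(num_bins - 1, int(abs(v) / bin_width))] += 1

    best_kl, best_threshold = float("inf"), max_abs
    # Candidates are multiples of num_levels bins, so bins merge evenly.
    for i in range(num_levels, num_bins + 1, num_levels):
        # Reference P: first i bins, with clipped outliers folded into the last bin.
        p = hist[:i]
        p[-1] += sum(hist[i:])
        # Candidate Q: merge the i bins into num_levels groups, then expand back,
        # spreading each group's count evenly over its non-empty bins.
        group = i // num_levels
        q = [0.0] * i
        for start in range(0, i, group):
            chunk = hist[start:start + group]
            nonzero = sum(1 for c in chunk if c > 0)
            if nonzero == 0:
                continue
            avg = sum(chunk) / nonzero
            for j, c in enumerate(chunk):
                if c > 0:
                    q[start + j] = avg
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_threshold = kl, i * bin_width
    return best_threshold

rng = random.Random(0)
activations = [rng.gauss(0.0, 0.1) for _ in range(5000)] + [2.0, -1.8]
print(calibrate_threshold(activations))   # chosen clipping threshold
```

The resulting threshold would then replace max(abs(x)) when computing the scale factor.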

### Scale factor scope

Until now, we have used one scale factor per layer. We can instead use one scale factor per channel. This can be beneficial if the weight distributions vary greatly between channels.
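A small made-up example shows why this helps. The two channels below have very different magnitudes; with a single per-layer scale, the small channel is squeezed into one or two integer levels, while per-channel scales give each channel the full int8 range. The helper names and the weight values are illustrative, and restrictive-range symmetric scaling is assumed.

```python
def restricted_scale(values, n_bits=8):
    return (2 ** (n_bits - 1) - 1) / max(abs(v) for v in values)

def round_trip_error(values, scale):
    """Mean absolute error after quantizing and dequantizing with this scale."""
    return sum(abs(round(scale * v) / scale - v) for v in values) / len(values)

weights = [
    [0.010, -0.020, 0.015, -0.005],   # channel 0: tiny weights
    [1.500, -2.000, 0.750, -1.250],   # channel 1: large weights
]

layer_scale = restricted_scale([v for channel in weights for v in channel])
channel_scales = [restricted_scale(channel) for channel in weights]

for idx, channel in enumerate(weights):
    per_layer = round_trip_error(channel, layer_scale)
    per_channel = round_trip_error(channel, channel_scales[idx])
    print(idx, per_layer, per_channel)
```

The channel with the largest weights sees no change, since it already sets the per-layer scale; the benefit goes to the channels with smaller weights.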