Neural network quantization is a process of reducing the precision of the weights in the neural network, thus reducing the memory, computation, and energy bandwidths.
Particularly when deploying NN models on mobile or edge devices, quantization, and model compression in general, is desirable and often the only plausible way to deploy a mobile model as the memory and computational budget of these devices is very limited.
Continue reading “Neural Network Quantization Research Review”