Demystifying Batch Normalization

The algorithm that (almost) every deep learning developer takes for granted


Data pre-processing is a necessary step before a neural network can successfully ingest and learn from its inputs. The methods used for this pre-processing are critical to the network’s performance.

Traditional data pre-processing methods include mean subtraction, normalization, and whitening. These have been around for a long time, well before batch normalization, the focus of this post, came into the picture. To start, let’s define these pre-processing methods.

  1. Mean subtraction: Involves subtracting the mean of each individual feature from the data. Geometrically, we can picture it as centering the cloud of data around the origin.
  2. Normalization: Refers to normalizing data dimensions so that they’re approximately of the same scale. This is achieved by dividing each dimension by its standard deviation or scaling it to the range [-1, 1].
  3. Whitening: This process is a bit more complex. First, the data is centered and normalized. Then the covariance matrix is computed, which describes the correlation structure: the (i, j)-th element gives the covariance between the i-th and the j-th dimensions of the data. Finally, an SVD factorization of the covariance matrix extracts the “eigenbasis” onto which the original zero-centered data is projected.
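These three methods can be sketched in a few lines of NumPy. The toy data, shapes, and the small stability constant are my own illustration, not from the post:

```python
import numpy as np

# Toy data: 100 samples, 3 features (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# 1. Mean subtraction: centre the data cloud around the origin.
X_centered = X - X.mean(axis=0)

# 2. Normalization: divide each dimension by its standard deviation.
X_normalized = X_centered / X_centered.std(axis=0)

# 3. Whitening: compute the covariance matrix, extract its eigenbasis
#    via SVD, project the centered data onto it, and rescale each
#    dimension to unit variance.
cov = np.cov(X_centered, rowvar=False)  # (i, j) entry = cov of dims i and j
U, S, _ = np.linalg.svd(cov)            # SVD gives the eigenbasis
X_whitened = (X_centered @ U) / np.sqrt(S + 1e-5)  # small constant for stability
```

After whitening, the features are decorrelated: the covariance matrix of `X_whitened` is (approximately) the identity.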

These normalization methods are generally applied at the input of the network. However, the seminal paper on batch normalization emphasizes why normalization is needed not only at the input, but in the layers deep inside the network, too.

We know that the change in the distribution of the layer’s inputs creates a problem, as the layer has to continuously adapt to the new distribution. This is generally managed by domain adaptation methods, which try to set up a common feature space for the source and the target domains.

We cannot, without domain adaptation, expect the network to handle many distributions all at once.

However, the input distribution properties that make training more efficient, such as having the same distribution between the training and test data, apply to training any sub-network (a layer or stack of layers inside the model) as well!

In other words, it’s unreasonable to expect the sub-network to handle data from different distributions. If we could ensure that the distribution of inputs to the sub-network remains more stable as the network trains, then the optimizer would be less likely to get stuck in saturated regions, and the training would accelerate.

Batch normalization attempts to solve this problem, which its authors define as internal covariate shift:

Reducing Internal Covariate Shift

During training, the network parameters are continuously changing. The internal nodes of a deep network are always updating to move the cost function to a more desirable value.

To normalize the outputs at every layer, we could consider whitening activations at every training step or at some interval — either by modifying the network directly or by changing the optimization algorithm parameters to depend on the network activation values.

However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step.

Batch normalization addresses this issue by ensuring that, for any parameter values, the network always produces activations with the desired distribution.

Thus, batch normalization ensures that gradient descent optimization takes into account that normalization is taking place.

One might think that blindly normalizing each layer’s inputs may change what the layer represents. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the non-linearity.
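To make the “linear regime” point concrete, here is a quick numerical check (my own illustration, not from the paper): near zero, the sigmoid is well approximated by the line 0.5 + z/4, its first-order Taylor expansion.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Normalized inputs cluster near zero, where sigmoid(z) ≈ 0.5 + z/4,
# i.e. the non-linearity behaves almost like a linear function.
z = np.linspace(-1.0, 1.0, 101)
linear_approx = 0.5 + z / 4.0
max_error = np.abs(sigmoid(z) - linear_approx).max()  # ≈ 0.02 for |z| <= 1
```

For large inputs (say z = 5), the approximation breaks down and the sigmoid saturates, which is exactly the regime that slows down gradient-based training.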

To address this, we must ensure that the transformation inserted in the network can represent the identity transform.

This is achieved with a pair of learnable parameters, trained along with the other parameters in the network, which can recover the original activations if that turns out to be the optimal thing to do.


Let’s check out the algorithm discussed in the paper for batch normalization over a single mini-batch.

The parameters to be learned (γ and β) are what allow the batch normalization algorithm to represent the identity function if needed. ϵ is a small constant added for numerical stability.
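The algorithm’s four steps (mini-batch mean, mini-batch variance, normalize, then scale and shift) can be sketched in plain NumPy. The function name and array shapes here are my own:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalizing transform over a single mini-batch.

    x: mini-batch of activations, shape (batch_size, features)
    gamma, beta: learned scale and shift, shape (features,)
    """
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize; eps for numerical stability
    return gamma * x_hat + beta            # scale and shift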

Summary of Batch Normalization

To briefly summarize:

Batch normalization is a differentiable transformation that introduces normalized activations into a neural network.

This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating training.

The learned parameters γ and β further allow the batch normalization transform to reduce to the identity function, recovering the original activations when that is optimal.


Let’s take a look at a short implementation of the batch normalization transform in PyTorch and Python:
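A minimal training-mode sketch of the transform as a PyTorch module might look like the following (the class name is mine; unlike torch.nn.BatchNorm1d, it keeps no running statistics for evaluation):

```python
import torch
import torch.nn as nn

class BatchNorm1dScratch(nn.Module):
    """From-scratch batch normalization for 2-D inputs (batch, features).

    Training-mode only, for illustration: statistics are always computed
    from the current mini-batch.
    """
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_features))   # learned scale
        self.beta = nn.Parameter(torch.zeros(num_features))   # learned shift

    def forward(self, x):
        mu = x.mean(dim=0)                  # mini-batch mean
        var = x.var(dim=0, unbiased=False)  # mini-batch (biased) variance
        x_hat = (x - mu) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

On a mini-batch in training mode, this produces the same output as torch.nn.BatchNorm1d at its default initialization (γ = 1, β = 0).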

Thus, the presence of batch normalization in most modern neural networks is a testament to its ability to help models converge quickly. Try it out, if you haven’t already 😉

Check out my blog for faster updates and subscribe for quality content 😀

Hmrishav Bandyopadhyay is a 2nd year Undergraduate at the Electronics and Telecommunication department of Jadavpur University, India. His interests lie in Deep Learning, Computer Vision, and Image Processing. He can be reached at [email protected].


