A Research Guide to Convolutional Neural Networks

Examining the advancements of CNN architecture over the past few years

Convolutional neural networks are a class of deep neural networks that are commonly used for visual imagery analysis. One of their main applications is in image classification.

What is Image Classification?

Image classification is the process of taking an image as input and outputting either its class (for example car, bicycle, or bird) or the probabilities of the classes that best describe the image.

When an image is input into a model, the model sees it as an array of pixel values whose size depends on the resolution of the image. Suppose we have a color JPG image that is 640 pixels wide and 640 pixels high. The corresponding input array will be 640*640*3, where 3 is the number of color channels (red, green, and blue).

Each number in this array is a value between 0 and 255 that describes the pixel intensity at that particular point. With such input, the computer then outputs numbers that describe the probability of the image belonging to each class, such as .75 for a dog, .15 for a bird, and .10 for a cat.
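To make the input and output concrete, here is a minimal sketch in plain Python. The 2*2 image and the raw class scores are made-up values, and softmax is assumed as the way scores are turned into probabilities; a real model would compute the scores from the pixels.

```python
import math

# A tiny stand-in for a 640*640*3 image: height x width x 3 (R, G, B),
# each entry an integer intensity between 0 and 255.
image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [128, 128, 128]],
]
shape = (len(image), len(image[0]), len(image[0][0]))  # (height, width, channels)


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for the classes dog, bird, and cat.
class_names = ["dog", "bird", "cat"]
probs = softmax([2.0, 0.4, 0.0])
prediction = class_names[probs.index(max(probs))]
print(shape, [round(p, 2) for p in probs], prediction)
```

The class with the largest probability is reported as the prediction.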

CNNs have a lot of applications in the real world such as face anonymization (deep fakes), handwriting analysis, medical image analysis, advertising, and more. They’re also being used for other computer vision tasks such as image segmentation and style transfer.

Structure of a Convolutional Neural Network (CNN)

Convolutional neural networks are built from four main types of layers:

1. Convolution

This is always the first layer of a CNN. It comprises a set of independent filters that are initialized randomly; their weights are the parameters that the network learns during training. In the first layer, these filters detect low-level features such as edges and curves.

2. ReLU

This layer is used with the convolution layer. It applies an activation function to the feature maps to increase non-linearity in the network, since images themselves are highly non-linear.

3. Pooling

The pooling layer reduces the spatial size of the input representation, which in turn reduces the number of parameters and computations in the network. This makes it easier to detect objects in an image no matter where they’re located. Pooling also helps control overfitting, so that the network does not capture unwanted specific details in an image.

4. Fully Connected

At this layer, we add a standard artificial neural network (one or more fully connected layers), which combines our image features into more attributes, making it easier to accurately predict the class of our image. During training, the prediction error measured at the output is backpropagated through the whole network, and the weights are adjusted to optimize the performance of our model.

A classic CNN structure would look like this:

Input -> Convolution -> ReLU -> Convolution -> ReLU -> Pooling -> ReLU -> Convolution -> ReLU -> Pooling -> Fully Connected
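The first three layer types can be chained on a toy input to see what each one does. This is a minimal sketch in plain Python, not an efficient implementation: the 6*6 patch and the hand-written vertical-edge filter are illustrative assumptions (a trained network would learn its filter values).

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Replace every negative response with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2, stride=2):
    """Keep the largest value in each size*size window."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(feature_map[i * stride + a][j * stride + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Grayscale patch with a bright vertical stripe in the middle.
patch = [[0, 0, 9, 9, 0, 0] for _ in range(6)]
# Responds positively to dark-to-bright edges, negatively to the reverse.
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

conv_out = conv2d(patch, vertical_edge)  # 4*4 map, rows of [27, 27, -27, -27]
activated = relu(conv_out)               # negatives clipped to 0
pooled = max_pool(activated)             # 2*2 map: [[27, 0], [27, 0]]
print(pooled)
```

The convolution marks where the edge is, ReLU discards the negative responses, and pooling halves each spatial dimension while keeping the strongest response.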

This is just a general overview of how a CNN works and its structure. We’ll now dive into the new developments and advancements in the field of computer vision and CNNs.

Our main focus will be to look at the most important papers that have been published over the last few years and discuss their relevance in depth.

AlexNet [2012]

AlexNet is a model architecture first described in a paper titled “ImageNet Classification with Deep Convolutional Neural Networks”, which is regarded as one of the most influential publications in the field of computer vision.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton developed a “Large, Deep Convolutional Neural Network” that has 60 million parameters and 650,000 neurons. It consists of 5 convolutional layers and 3 fully connected layers, ending in a 1000-way softmax layer that produces a distribution over 1000 class labels.

The network was trained on the ImageNet dataset, which contains over 15 million annotated images from roughly 22,000 categories; the ILSVRC subset used for the competition has about 1.2 million training images in 1000 categories. The authors used non-saturating neurons and a very efficient GPU implementation of the convolution operation to make model training faster.

A regularization method known as “dropout” was used to reduce overfitting in the fully connected layers. ReLU was used as the nonlinearity because it is non-saturating, which made training several times faster than with the conventional tanh function.
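Dropout can be sketched in a few lines. The version below is the "inverted" variant, which rescales the surviving units during training; that is an assumption made here for simplicity, since the AlexNet paper instead scales the outputs at test time. The drop probability of 0.5 matches the paper; the activations are made-up values.

```python
import random


def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p_drop while training
    and rescale the survivors so the expected activation is unchanged."""
    if not training:
        return list(activations)  # at test time every unit is used as-is
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)            # seeded so the run is repeatable
hidden = [0.2, 1.5, 0.7, 3.0]     # hypothetical fully connected activations
train_out = dropout(hidden, p_drop=0.5, training=True, rng=rng)
test_out = dropout(hidden, training=False)
print(train_out, test_out)
```

Because a different random subset of units is silenced on every training step, no unit can rely on the presence of particular other units, which is what reduces overfitting.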

The model won the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Challenge), where it achieved a top-5 test error rate of 15.3%.


This “Large Deep CNN” was a breakthrough for the popularity of CNNs in the computer vision field. With techniques such as data augmentation and dropout, this paper illustrated the importance of CNNs and their record-breaking results on a highly-challenging dataset using purely supervised learning.

ZF Net [2013]

The winner of the ILSVRC 2013 was a network called ZF Net, which was built by Matthew Zeiler and Rob Fergus. The model was trained on the same ImageNet 2012 dataset that AlexNet was trained on.

Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256*256 region, subtracting the per-pixel mean (across all images), and then using 10 different sub-crops of size 224*224.
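The crop positions can be worked out with simple arithmetic. The sketch below assumes the common scheme of four corner crops plus a center crop, each with and without a horizontal flip; the text only says that 10 sub-crops are used, so treat this breakdown as an assumption.

```python
def ten_crop_boxes(image_size=256, crop_size=224):
    """Return (left, top, flipped) for 4 corner crops and the center crop,
    each with and without a horizontal flip."""
    far = image_size - crop_size   # offset of the right and bottom crops
    mid = far // 2                 # offset of the center crop
    origins = [(0, 0), (far, 0), (0, far), (far, far), (mid, mid)]
    return [(left, top, flipped)
            for flipped in (False, True)
            for left, top in origins]


boxes = ten_crop_boxes()
for left, top, flipped in boxes:
    print(left, top, "flipped" if flipped else "original")
```

With a 256*256 image and 224*224 crops, the corner crops start at offsets 0 and 32 and the center crop starts at offset 16.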

The model achieved a test error rate of 14.8%. It also generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the state-of-the-art results of the time on the Caltech-101 and Caltech-256 datasets. The architecture is essentially a fine-tuned version of the AlexNet architecture.

Here the authors are addressing two main issues:

  • Why do large convolutional network models perform so well?
  • How can these networks be improved?

The paper introduces a novel visualization technique that gives insight into the function of the intermediate feature layers and the operation of the classifier. The technique reveals the input stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and diagnose potential problems with the model.

The visualization technique uses a multi-layered deconvolutional network (deconvnet) to project the feature activations back to input pixel space. An input image is fed into the CNN and activations are computed at each level (forward pass).

Suppose we want to examine the activation of a certain feature in the 5th convolution layer. To do this, we keep the activations of this one feature map but set all of the other activations in the layer to zero. We then pass the feature maps as input into the deconvnet, which has the same filters as the original CNN. This input goes through a series of unpool (reverse max pooling), rectify, and filter operations for each preceding layer until input space is reached.
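Max pooling discards positions, so the unpool step needs some extra bookkeeping. The sketch below assumes the approach of recording where each maximum came from during the forward pass (often called "switches") and writing the values back to those positions; the function names and the 4*4 feature map are illustrative, and the map dimensions are assumed to be divisible by the pooling size.

```python
def max_pool_with_switches(fmap, size=2):
    """Non-overlapping max pooling that also records the position of each max."""
    pooled, switches = [], []
    for i in range(0, len(fmap), size):
        pooled_row, switch_row = [], []
        for j in range(0, len(fmap[0]), size):
            cells = [(fmap[i + a][j + b], (i + a, j + b))
                     for a in range(size) for b in range(size)]
            value, position = max(cells)
            pooled_row.append(value)
            switch_row.append(position)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, height, width):
    """Place each pooled value back where it came from; everything else is 0."""
    restored = [[0] * width for _ in range(height)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (r, c) in zip(pooled_row, switch_row):
            restored[r][c] = value
    return restored


fmap = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled, switches = max_pool_with_switches(fmap)
restored = unpool(pooled, switches, height=4, width=4)
print(pooled)
print(restored)
```

The restored map is sparse: only the locations that won the pooling step carry a value, which preserves the structure of the stimulus on the way back to pixel space.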

To gain a deeper understanding of deconvnet, I would recommend this presentation by Zeiler. Through the visualization, we’re able to find a model architecture that outperforms AlexNet.

An ablation study is also performed to discover the performance contribution of different model layers. In addition, a sensitivity analysis is carried out by occluding portions of the input image, revealing which parts of the image are important for classification.


This is an 8-layer convnet model that takes 224 by 224 images (with 3 color planes) as input. The input is convolved with 96 different 1st layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y. The resulting feature maps are then passed through a rectified linear function, max pooled within 3*3 regions using stride 2, and contrast normalized across the feature maps to give 96 different 55 by 55 element feature maps.
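The spatial sizes quoted above follow from the usual output-size formula for convolution and pooling. The text does not state the padding, so the padding of 1 used below is an assumption, chosen because it reproduces the quoted 55 by 55 size.

```python
def output_size(input_size, kernel, stride, padding=0):
    """Spatial size after a convolution or pooling window is applied."""
    return (input_size + 2 * padding - kernel) // stride + 1


# First layer: 7*7 filters with stride 2 (padding of 1 is an assumption).
after_conv = output_size(224, kernel=7, stride=2, padding=1)
# Max pooling over 3*3 regions with stride 2 (padding of 1 is an assumption).
after_pool = output_size(after_conv, kernel=3, stride=2, padding=1)
print(after_conv, after_pool)  # 110 55
```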

Similar operations are repeated in the subsequent layers 2, 3, 4, and 5. The last 2 layers are fully connected, taking the features from the top convolutional layer as input in vector form (6*6*256 = 9216 dimensions). The final layer is a C-way softmax function, where C is the number of classes.

ZF Net provided a better understanding of the inner working of CNNs and illustrated more ways to improve performance. The visualization approach also has provided insight for improvements to network architectures more broadly.

VGG Net [2014]

The paper is titled “Very Deep Convolutional Networks For Large-Scale Image Recognition”. Karen Simonyan and Andrew Zisserman investigate the effects of a convolutional network’s depth on its accuracy in a large-scale image recognition setting.

They evaluate networks of increasing depth using an architecture with very small 3*3 convolution filters (stride 1, padding 1) along with 2*2 max-pooling layers with stride 2. Their results show that a significant improvement over earlier configurations can be achieved by pushing the depth to 16–19 weight layers.

The depth of the configurations increases from the left (A) to the right (E) as more layers are added (bold). The convolutional layer parameters are denoted as “Conv (receptive field size)-(number of channels)”.

As the spatial size of the input volume at each layer decreases, the depth of the volume increases due to the increased number of filters as you go deeper into the network. This reinforces the idea of shrinking spatial dimensions while also growing depth.
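This trade of spatial size for depth can be traced with a few lines of arithmetic. The sketch assumes a 224*224 input and five stages with 64, 128, 256, 512, and 512 filters, which is how the VGG configurations are usually described; the stride and padding values are the ones given above.

```python
def output_size(input_size, kernel, stride, padding=0):
    """Spatial size after a convolution or pooling window is applied."""
    return (input_size + 2 * padding - kernel) // stride + 1


size = 224                                     # assumed input resolution
channels_per_stage = [64, 128, 256, 512, 512]  # assumed filter counts
shapes = []
for channels in channels_per_stage:
    # A 3*3 convolution with stride 1 and padding 1 keeps the spatial size.
    size = output_size(size, kernel=3, stride=1, padding=1)
    # A 2*2 max pool with stride 2 halves it.
    size = output_size(size, kernel=2, stride=2)
    shapes.append((size, size, channels))

for shape in shapes:
    print(shape)
```

The volume goes from 112*112*64 after the first stage to 7*7*512 after the last: each stage halves the width and height while the number of channels grows or stays the same.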

VGG Net is a very important paper in that it showed that convolutional neural networks need a deep stack of layers for this hierarchical representation of visual data to work.

GoogLeNet (Inception) [2015]

This paper is titled “Going deeper with Convolutions”. GoogLeNet is a 22-layer CNN and was the winner of the ILSVRC 2014 with a top 5 error rate of 6.7%. The authors introduce an inception module that achieves the new state-of-the-art for classification and detection.

The hallmark of this architecture is the improved utilization of the computing resources inside the network (memory and power usage). The authors then increased the depth and width of the network while keeping the computational budget constant.


The bottom green box is our input and the top is our output. The inception module allows pooling and conv operations to run in parallel at each layer. This is the most basic way to explain the network architecture.

What is the relevance of the inception module architecture?

The module consists of a network-in-network layer (a 1*1 convolution), a medium-sized filter convolution (3*3), a large-sized filter convolution (5*5), and a pooling operation, all applied in parallel. The network-in-network conv is able to extract information about the very fine-grained details in the volume, while the 5*5 filter covers a large receptive field of the input and can extract its information as well. The pooling operation helps reduce spatial sizes and combats overfitting. On top of that, a ReLU follows each conv layer, which improves the non-linearity of the network.
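Because the branches run in parallel, their outputs have to share a spatial size so they can be stacked along the channel axis. The sketch below checks that with simple arithmetic. The 28*28 input, the padding values, and the per-branch channel counts are illustrative assumptions, not figures from the text.

```python
def output_size(input_size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling window is applied."""
    return (input_size + 2 * padding - kernel) // stride + 1


# branch name -> (kernel size, padding, output channels); all values assumed.
branches = {
    "1*1 conv": (1, 0, 64),
    "3*3 conv": (3, 1, 128),
    "5*5 conv": (5, 2, 32),
    "3*3 max pool": (3, 1, 32),
}

input_size = 28
sizes = {name: output_size(input_size, kernel, padding=pad)
         for name, (kernel, pad, _) in branches.items()}
out_channels = sum(channels for _, _, channels in branches.values())
print(sizes, out_channels)
```

With stride 1 and matching padding, every branch keeps the 28*28 size, so the module output is a single 28*28 volume whose depth is the sum of the branch depths.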

GoogLeNet models introduced the idea that CNN layers didn’t always have to be stacked up sequentially. With the inception module, the authors were able to prove that creative structuring of layers can lead to improved performance and computational efficiency.

Microsoft ResNet [2015]

The paper is titled “Deep Residual Learning for Image Recognition”. Microsoft Research Asia came up with a new 152-layer network in late 2015 called ResNet. ResNet won the ILSVRC 2015 with an incredible top-5 error rate of 3.6%.

Once deeper networks are able to start converging, a degradation problem is exposed: as network depth increases, accuracy saturates and then degrades rapidly. This degradation isn’t caused by overfitting, because adding more layers to a suitably deep model results in a higher training error. The degradation of training accuracy indicates that not all systems are similarly easy to optimize.

This paper addresses the degradation problem by introducing a deep residual learning framework.

Instead of hoping every few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. We denote the desired underlying mapping as H(X), and we let the stacked non-linear layers fit another mapping F(X)=H(X)-X. The original mapping is then recast into F(X)+X.

We then hypothesize that it’s easier to optimize the residual mapping than to optimize the original, unreferenced mapping. The formation of F(X)+X can be realized by feedforward neural networks with “shortcut connections” (those skipping one or more layers).

The shortcut connections simply perform identity mapping and their outputs are added to the outputs of the stacked layers. Identity shortcut connections add neither extra parameters nor computational complexity. The entire network can be trained end-to-end by SGD with backpropagation and can be easily implemented using common libraries such as Caffe.
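A residual block can be sketched with two small fully connected layers standing in for the stacked layers. This is a simplified illustration under assumptions: the real network uses convolutional layers, and the paper also applies a ReLU after the addition, which is left out here. The weights and input are made-up values.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]


def linear(weights, x):
    """Multiply a weight matrix (list of rows) by the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def residual_block(x, w1, w2):
    """Compute F(x) + x, where F is two stacked layers with a ReLU between."""
    f_x = linear(w2, relu(linear(w1, x)))      # the residual mapping F(x)
    return [f + xi for f, xi in zip(f_x, x)]   # identity shortcut adds x back


x = [1.0, -2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]
identity_out = residual_block(x, zeros, zeros)

small = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
adjusted_out = residual_block(x, small, small)
print(identity_out, adjusted_out)
```

With all weights at zero the block returns its input unchanged, so an extra block can always fall back to the identity mapping by driving F(x) toward zero. Small weights produce a small correction on top of x.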

Experiments on ImageNet demonstrate the degradation problem and show that:

i. Extremely deep residual nets are easy to optimize, whereas the counterpart “plain” nets (which simply stack layers) exhibit higher training error when depth increases.

ii. The deep residual net can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than the previous networks.

Region-Based CNNs (R-CNNs) [2013]

This paper is titled “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation”.

Object detection has had tremendous improvements over the past few years. Before R-CNN, the best performing methods were complex ensemble systems that combined low-level image features with high-level context.

The paper proposes a simple and scalable detection algorithm that improves mean Average Precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%.

The approach combines two key insights:

i. One can apply high-capacity convolutional neural networks to bottom-up region proposals in order to localize and segment objects.

ii. When labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.

The name R-CNN (Regions with CNN features) comes from combining region proposals with CNNs.


R-CNN takes an input image, extracts 2000 bottom-up region proposals, computes features for each proposal using a large CNN and then classifies each region using class-specific linear SVMs.

In general, R-CNN object detection consists of 3 modules:

i. Generation of category-independent region proposals. The proposals define the set of candidate detections available to the detector.

ii. A large CNN that extracts a fixed-length feature vector from each region.

iii. A set of class-specific linear SVMs.
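The data flow through the three modules can be sketched as below. Every function here is a hypothetical stand-in: a real system would use a proper region proposal method, run a large CNN on each warped region, and learn the SVM weights from data. Only the overall shape of the pipeline is taken from the text.

```python
def propose_regions(width, height):
    """Module i (stub): return candidate boxes as (x1, y1, x2, y2)."""
    half_w, half_h = width // 2, height // 2
    return [
        (0, 0, half_w, half_h),
        (half_w, half_h, width, height),
        (0, 0, width, height),
    ]


def extract_features(box):
    """Module ii (stub): a real system would run a CNN on the region.
    Here the fixed-length vector is just scaled width, scaled height, bias."""
    x1, y1, x2, y2 = box
    return [(x2 - x1) / 100.0, (y2 - y1) / 100.0, 1.0]


def classify(features, svm_weights):
    """Module iii: score the features with one linear SVM per class."""
    scores = {
        label: sum(w * f for w, f in zip(weights, features))
        for label, weights in svm_weights.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up SVM weights for two classes.
svm_weights = {"dog": [1.0, 1.0, -1.5], "bird": [-1.0, -1.0, 1.5]}
detections = [
    (box, *classify(extract_features(box), svm_weights))
    for box in propose_regions(100, 100)
]
for box, label, score in detections:
    print(box, label, score)
```

Note that the feature extractor runs once per proposal. With around 2000 proposals per image, this per-region cost is what makes the original pipeline slow.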

The research indicates that it’s highly effective to pre-train a network (with supervision) for image classification and then fine-tune it for the target tasks where the data is scarce.

These results were achieved by combining classical tools from computer vision with deep learning (bottom-up region proposals and CNNs).

Fast R-CNN [2015]

This paper by Ross Girshick, titled “Fast R-CNN (Fast Region-Based Convolutional Network)”, builds on R-CNN to efficiently classify object proposals using deep convolutional networks.

Fast R-CNN trains the VGG16 network 9 times faster than R-CNN, is 213 times faster at test time, and achieves a higher mAP of 66% (compared to 62% for R-CNN) on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3 times faster, tests 10 times faster, and is more accurate.

Fast R-CNN is implemented in Python and C++ using Caffe. The implementation code has been open-sourced under the MIT license and can be found in this GitHub repository.

Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy.

These improvements were motivated by 3 main drawbacks in the original R-CNN architecture:

i. Training is a multistage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss and then fits SVMs to the ConvNet features. Finally, the bounding box regressors are learned.

ii. Training is expensive in space and time.

For SVM and bounding box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, this process takes up to 2.5 GPU days, and the features require hundreds of gigabytes of storage.

iii. Object detection is slow.

Fast R-CNN was able to solve these problems by sharing the computation of conv layers between different proposals and swapping the order of generating region proposals and running the CNN.


An input image and multiple Regions of Interest (RoIs) are input into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by the fully connected layers (FCs). The network has 2 output vectors per RoI: softmax probabilities and per-class bounding box regression offsets.
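RoI pooling is what turns regions of different sizes into a fixed-size feature map. The following is a minimal sketch, assuming a single-channel feature map, RoIs given in feature-map coordinates as (top, left, bottom, right) with exclusive ends, and a floor/ceil rule for bin edges; real implementations differ in these details.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region into a fixed out_h * out_w grid."""
    top, left, bottom, right = roi
    height, width = bottom - top, right - left
    pooled = []
    for i in range(out_h):
        # Bin edges: round the start down and the end up so bins cover the RoI.
        r0 = top + (i * height) // out_h
        r1 = top + ceil_div((i + 1) * height, out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * width) // out_w
            c1 = left + ceil_div((j + 1) * width, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature_map = [
    [0, 1, 2, 3, 4, 5],
    [6, 7, 8, 2, 1, 0],
    [2, 4, 6, 1, 3, 1],
    [9, 0, 5, 2, 7, 4],
]
wide_roi = roi_max_pool(feature_map, (0, 1, 4, 6))   # a 4*5 region
small_roi = roi_max_pool(feature_map, (1, 0, 4, 3))  # a 3*3 region
print(wide_roi, small_roi)
```

Both regions come out as 2*2 grids even though their sizes differ, which is what lets the fully connected layers accept any RoI.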

The architecture is trained end-to-end with a multi-task loss.
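The multi-task loss combines the two outputs described above into one number. The sketch below follows the form used in the Fast R-CNN paper (log loss for the class plus a smooth L1 loss for the box offsets, with class 0 treated as background); the probabilities, offsets, and the weighting factor of 1.0 are made-up illustrative values.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def multi_task_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Classification log loss plus a box regression loss for non-background RoIs."""
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        return cls_loss  # background RoIs have no box to regress
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(box_pred, box_target))
    return cls_loss + lam * loc_loss


# RoI labelled as class 1 with slightly wrong box offsets.
object_loss = multi_task_loss([0.1, 0.9], 1, [0.5, 2.0, 0.0, 0.0], [0.0] * 4)
# RoI labelled as background: only the classification term counts.
background_loss = multi_task_loss([0.5, 0.5], 0, [0.5, 2.0, 0.0, 0.0], [0.0] * 4)
print(object_loss, background_loss)
```

Because both terms are summed into a single loss, one backward pass updates the classifier and the box regressor together, which is what removes the multistage pipeline.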

Faster R-CNN [2015]

The paper is titled “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” The paper introduces a Region Proposal Network (RPN) that shares full-image convolutional features with a detection network, which results in a nearly cost-free region proposal.

What is a Region Proposal Network (RPN)?

This is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end to end to generate high-quality region proposals, which are used by Fast R-CNN for detection.

RPN and Fast R-CNN are merged into a single network by sharing their convolutional features. For the VGG16 model, the object detection system has a frame rate of 5 fps on a GPU, while achieving state-of-the-art object detection accuracy on the PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO datasets with only 300 proposals per image. In the ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN were the foundations of the 1st-place winning entries in several tracks.

Python implementation code for Faster R-CNN has been made publicly available in this GitHub repository.

Faster R-CNN works to combat the complex training pipeline that both R-CNN and Fast R-CNN exhibited.


A Region Proposal Network (RPN) is inserted after the last convolutional layer. The network looks at the last convolutional feature map and produces region proposals from it. In the next stage, the same pipeline as Fast R-CNN is used (RoI pooling, FC, classification, and regression heads).
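One way to picture what the RPN predicts "at each position" is through reference boxes, usually called anchors, which the text above does not describe. The sketch below assumes 3 scales and 3 aspect ratios per position and a feature-map stride of 16 pixels, values commonly quoted for the VGG16 setup; the function names and the centering convention are illustrative.

```python
import math


def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Reference boxes (x1, y1, x2, y2) centered on (cx, cy).
    Each box has area scale**2 and height/width equal to ratio."""
    boxes = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def all_anchors(feat_w, feat_h, stride=16):
    """Anchors for every position of a feat_w * feat_h feature map."""
    return [
        box
        for j in range(feat_h)
        for i in range(feat_w)
        for box in anchors_at(i * stride + stride / 2, j * stride + stride / 2)
    ]


anchors = all_anchors(feat_w=4, feat_h=3)
print(len(anchors))  # 4 * 3 positions, 9 anchors each
```

Under these assumptions, the RPN outputs an objectness score and four box offsets for every anchor, and the highest-scoring refined boxes become the region proposals passed to the detection stage.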

Determining which object is in an image is one problem; determining where it is located is a further hurdle that detection models have to clear.

Faster R-CNN has become the standard for object detection applications.


By now you should be conversant with the advancements in CNN architecture over the past few years. New and improved CNN architectures are still being developed and tested to solve dynamic image classification, object detection, and other computer vision tasks.

