Understanding the Inception Module in GoogLeNet

GoogLeNet is a 22-layer deep convolutional network whose architecture was presented at the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2014 (main tasks: object detection and image classification). You can read the official paper here.

The main novelty in the architecture of GoogLeNet is the introduction of a particular module called Inception.

To understand why this introduction represented such an innovation, we should spend a few words on the architecture of standard Convolutional Neural Networks (CNNs) and the common trade-off practitioners have to make while building them. Since the following will be a very high-level summary of CNNs, if you are curious about this topic I recommend my former article about CNN architecture.

CNNs are made of the following components (a minimal sketch follows the list):

  • Convolutional stage (+ non-linear transformation via activation functions)
  • Pooling stage
  • Dense stage
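
To make these three stages concrete, here is a minimal Keras sketch of a standard CNN. All shapes, filter counts, and layer sizes are illustrative assumptions, not taken from any specific model:

```python
import tensorflow as tf
from tensorflow.keras import layers

# A minimal CNN showing the three stages; all sizes are illustrative
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    # Convolutional stage (+ non-linear transformation via the ReLU activation)
    layers.Conv2D(32, (3, 3), activation="relu"),
    # Pooling stage
    layers.MaxPooling2D((2, 2)),
    # Dense stage, after flattening the feature maps
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
```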

Basically, before the Dense layers (which are placed at the end of the network), each time we add a new layer we face two main decisions:

  • Deciding whether we want to go with a Pooling or Convolutional operation;
  • Deciding the size and number of filters to be convolved over the output of the previous layer.

Now the question is: what if we could try all these different options together in one single layer? To answer this question, Google researchers developed a new type of layer called, indeed, Inception.

The main idea of the Inception module is to run multiple operations (pooling, convolution) with multiple filter sizes (3x3, 5x5…) in parallel, so that we do not have to face any of those trade-offs.

Before having a look at the official 2014 GoogLeNet architecture, let’s understand how this new module actually works by considering the following layer:

As you can see, the initial input (which represents the stack of feature maps output by a previous layer) is a tensor of 64 feature maps, each one of size 32x32. Then, three operations are carried out in parallel (a Keras sketch of these branches follows the list):

  • a convolutional operation with 16 filters of size 1x1. The output tensor will be of size 32x32x16 (where the last number refers to the number of resulting feature maps, equal to the number of filters convolved over the input).
  • a convolutional operation with 32 filters of size 3x3. To maintain the output with the same width and height as the original feature maps, we can set padding=same and stride = 1 (to learn more about padding and strides to deal with border effects, you can read my former article here). The output tensor will be of size 32x32x32.
  • a max-pooling operation with a filter size of 3x3 (same reasoning for padding and stride as before). The output tensor will be of size 32x32x64 (in this case, since the pooling filter is passed over each feature map of the input tensor, the output tensor will have a depth equal to the original one, i.e. 64).
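
In the Inception module, the outputs of these parallel branches are then concatenated along the depth dimension, giving a 32x32x(16+32+64) = 32x32x112 tensor in our example. Here is a minimal Keras sketch of this naïve module; the input shape and filter counts match the example above, while everything else is an illustrative assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Input: 64 feature maps of size 32x32, as in the example above
inputs = tf.keras.Input(shape=(32, 32, 64))

# Branch 1: 1x1 convolution with 16 filters -> 32x32x16
branch1 = layers.Conv2D(16, (1, 1), padding="same", activation="relu")(inputs)

# Branch 2: 3x3 convolution with 32 filters, stride 1 -> 32x32x32
branch2 = layers.Conv2D(32, (3, 3), strides=1, padding="same", activation="relu")(inputs)

# Branch 3: 3x3 max pooling with stride 1 -> 32x32x64 (depth unchanged)
branch3 = layers.MaxPooling2D((3, 3), strides=1, padding="same")(inputs)

# Concatenate along the channel axis -> 32x32x(16+32+64) = 32x32x112
outputs = layers.Concatenate(axis=-1)([branch1, branch2, branch3])

model = tf.keras.Model(inputs, outputs)
model.summary()
```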

As you can imagine, by adding more operations in each layer, we are also making the model more expensive in terms of the number of parameters to be learned.

Luckily, the second version of the Inception module introduced a nice technique to reduce the dimensionality of the feature maps before running the parallel operations.

The idea is to apply a 1x1 convolution to the input feature maps before passing them to the parallel operations. By doing so, we reduce the depth of the tensor that stacks those maps; more specifically, we reduce the number of feature maps in the input stack.

Namely, focusing on the 3x3 convolutional branch, let’s imagine reducing the number of input maps from 64 to 16 via a 1x1 convolution with 16 filters.

By doing so, we convolve the 3x3 filters over a much shallower tensor, which is now of size 32x32x16 rather than 32x32x64. Hence, the number of parameters to be learned significantly decreases: each 3x3 filter now needs 3x3x16 weights instead of 3x3x64.
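
A short Keras sketch makes the saving concrete. The filter counts follow the running example, and the exact parameter arithmetic is spelled out in the comments:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(32, 32, 64))

# 1x1 "bottleneck": reduce the depth from 64 to 16 feature maps
reduced = layers.Conv2D(16, (1, 1), padding="same", activation="relu")(inputs)

# The 3x3 convolution now runs over a 32x32x16 tensor instead of 32x32x64
branch = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(reduced)

# Parameter count for the 3x3 convolution with 32 filters:
#   without reduction: 32 * (3*3*64 + 1 bias) = 18,464 parameters
#   with reduction:    32 * (3*3*16 + 1 bias) =  4,640 parameters
#   (the 1x1 bottleneck itself adds 16 * (1*1*64 + 1) = 1,040 parameters)
```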

The introduction of the Inception module was a great innovation in the Computer Vision field. To conclude this article, I will leave here the architecture of the original model. The first two pictures represent an Inception module without and with the 1x1 convolution technique to reduce dimensionality. The final one represents the whole GoogLeNet architecture.

Inception Module, version 1.0. Source: https://arxiv.org/abs/1409.4842
Inception Module, version 2.0. Source: https://arxiv.org/abs/1409.4842
Final Architecture. Source: https://arxiv.org/abs/1409.4842
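
As a recap, here is a hedged sketch of the full dimension-reduced module, combining all four branches shown in the paper’s figure. The filter counts passed in at the bottom are illustrative assumptions, not the values used in the actual GoogLeNet stages:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1, f3_red, f3, f5_red, f5, pool_proj):
    # Branch 1: plain 1x1 convolution
    b1 = layers.Conv2D(f1, (1, 1), padding="same", activation="relu")(x)
    # Branch 2: 1x1 dimensionality reduction, then 3x3 convolution
    b2 = layers.Conv2D(f3_red, (1, 1), padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3, (3, 3), padding="same", activation="relu")(b2)
    # Branch 3: 1x1 dimensionality reduction, then 5x5 convolution
    b3 = layers.Conv2D(f5_red, (1, 1), padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5, (5, 5), padding="same", activation="relu")(b3)
    # Branch 4: 3x3 max pooling, then 1x1 projection
    b4 = layers.MaxPooling2D((3, 3), strides=1, padding="same")(x)
    b4 = layers.Conv2D(pool_proj, (1, 1), padding="same", activation="relu")(b4)
    # Concatenate all branches along the channel axis
    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])

# Hypothetical filter counts, reusing the 32x32x64 input from the example
inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = inception_module(inputs, 16, 16, 32, 8, 16, 16)
model = tf.keras.Model(inputs, outputs)
```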

I hope you enjoyed the read!
