Layers of a Convolutional Neural Network (Part 1)

As mentioned in this article, the most common layers of a Convolutional Neural Network (CNN) are:

  1. Convolutional Layer
  2. Non-Linear Layer
  3. Pooling/Subsampling Layer
  4. Fully Connected Layer

In this article, I'm going to discuss the convolutional layer and how it works, using an example.

Convolutional Layer

The most important technique used in a CNN is convolution. In a CNN, convolution is performed by a filter that slides over the input image; at each position, the input values and the filter values are combined into a single value of the feature map. The filter is a matrix of learnable weights.

When we feed image data through a convolutional layer, the filters convolve over the entire image and produce the feature maps. For a given input size, the size of the feature maps (the output of the convolutional layer) is determined by the following four hyperparameters, which we have to choose explicitly.

  1. Filter size
  2. Filter count (Depth)
  3. Stride
  4. Zero padding
  • Filter size : The height and width of the filters used in the convolution process
  • Filter count : The number of filters used in the convolutional layer
  • Stride : The number of pixels the filter moves over the input image at each step
  • Zero padding : Padding the borders of the input image with zeros, so that the information at the edges of the image is captured

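The size of the output of the convolutional layer along each spatial dimension can be calculated by

$$O = \frac{W - F + 2P}{S} + 1$$

where $W$ is the input size (height or width), $F$ is the filter size, $P$ is the zero padding and $S$ is the stride.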
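As a rough illustration, the output-size calculation can be sketched in Python. The function name `conv_output_size` and its parameter names are hypothetical, and the floor division is an assumption for cases where the stride does not divide the padded input evenly.

```python
def conv_output_size(input_size: int, filter_size: int,
                     stride: int = 1, padding: int = 0) -> int:
    """Output size of a convolution along one spatial dimension.

    Assumes the commonly used formula floor((W - F + 2P) / S) + 1.
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    padded = input_size + 2 * padding
    if filter_size > padded:
        raise ValueError("filter does not fit inside the (padded) input")
    return (padded - filter_size) // stride + 1


# 9*9 image, 3*3 filter, stride 2, without and with zero padding of 2
print(conv_output_size(9, 3, stride=2))
print(conv_output_size(9, 3, stride=2, padding=2))
```

The same function is applied once per dimension, so a non-square image or filter would simply need two calls.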

To better understand how a CNN performs the convolution operation on an image, and what each of the terms above means, let's work through the following example.

The figure below represents the pixel values of a 9*9 image.

Let's define a filter of size 3*3.

The filter then starts to convolve over a 3*3 patch (the same size as the filter) of the image, as shown in the figure below.

The convolution operation is performed by taking the element-wise product of the image patch and the filter and then summing all of those products.
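A minimal NumPy sketch of this single step follows. The `patch` and `kernel` values are hypothetical and are not the ones from the figures, so the result differs from the value in the example.

```python
import numpy as np

# Hypothetical 3*3 image patch and filter (not the values from the figures).
patch = np.array([[1, 0, 2],
                  [0, 1, 3],
                  [2, 1, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

# Multiply matching positions, then add up all nine products.
products = patch * kernel
value = int(products.sum())

print(products)
print(value)
```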

Here 8 will be the first value of our feature map.

After that, if the stride is 2 (a value we choose explicitly), the filter moves two pixels to the right and performs convolution on the new image patch.

Likewise, the filter keeps moving two pixels at a time, across each row and then down, until it reaches the last patch of the input image, as shown below.

Since the filter takes 4 positions horizontally and 4 positions vertically, the size of the output (feature map) of the convolutional layer is 4*4.
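The sliding procedure can be sketched with two plain loops. This is an illustrative implementation, not an optimized one: `convolve2d` is a hypothetical helper, the image and filter values are random stand-ins for the figures, and, as is usual in CNNs, the filter is applied without flipping.

```python
import numpy as np


def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    image = np.asarray(image)
    kernel = np.asarray(kernel)
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    feature_map = np.zeros((out_h, out_w), dtype=np.result_type(image, kernel))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + k_h, left:left + k_w]
            # Element-wise product of patch and filter, then the sum.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map


rng = np.random.default_rng(0)
image = rng.integers(0, 10, size=(9, 9))    # hypothetical pixel values
kernel = rng.integers(-1, 2, size=(3, 3))   # hypothetical filter weights

feature_map = convolve2d(image, kernel, stride=2)
print(feature_map.shape)
```

With a 9*9 image, a 3*3 filter and a stride of 2, the printed shape matches the 4*4 feature map described above.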

We can also calculate the size of the output using the equation discussed above. With an input size of 9, a filter size of 3, no padding and a stride of 2, it gives $\frac{9 - 3 + 2 \cdot 0}{2} + 1 = 4$.

This is what our feature map looks like:

The example above shows how a single filter convolves with an input image. In a convolutional layer, however, we can use multiple filters to convolve with the image, so we also have to define the number of filters (the filter count, or depth). If we use 64 filters of size 3*3 (which can be represented as 3*3*64), the size of our feature maps will be 4*4*64, because each filter produces one feature map.
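The multi-filter case can be sketched as follows, assuming a single-channel 9*9 image and filters stored as a 3*3*64 array. All values are random placeholders and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((9, 9))         # hypothetical single-channel image
filters = rng.random((3, 3, 64))   # 64 filters of size 3*3, stored as 3*3*64
stride = 2

out_size = (image.shape[0] - filters.shape[0]) // stride + 1
feature_maps = np.zeros((out_size, out_size, filters.shape[2]))

for i in range(out_size):
    for j in range(out_size):
        top, left = i * stride, j * stride
        patch = image[top:top + 3, left:left + 3]
        # Broadcast the patch against all 64 filters,
        # then sum over the filter height and width.
        feature_maps[i, j, :] = np.sum(patch[:, :, None] * filters, axis=(0, 1))

print(feature_maps.shape)
```

Each slice `feature_maps[:, :, k]` is the feature map produced by filter `k`.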

If we are going to use zero padding, we have to define the padding as well. If we define the padding as 2, two rows or columns of zeros are added on every side, and our input image will look as follows.
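In NumPy, zero padding of this kind can be sketched with `np.pad`. The pixel values below are hypothetical placeholders for the ones in the figure.

```python
import numpy as np

image = np.arange(1, 82).reshape(9, 9)   # hypothetical 9*9 pixel values

# Add 2 rows/columns of zeros on every side of the image.
padded = np.pad(image, pad_width=2, mode="constant", constant_values=0)

print(padded.shape)
print(padded[:3, :5])   # top-left corner: zeros surrounding the first pixel
```

The padded image is 13*13, since 2 zeros are added on each side of each dimension.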

The filter then starts to convolve over a 3*3 patch (the same size as the filter) of the padded image, as shown in the figure below.

Likewise, the filter moves two pixels at a time (if stride = 2) until it reaches the last patch of the image. Since the filter takes 6 positions horizontally and 6 positions vertically, the size of the output (feature map) of the convolutional layer is 6*6. The equation gives the same result: with an input size of 9, a filter size of 3, a padding of 2 and a stride of 2, $\frac{9 - 3 + 2 \cdot 2}{2} + 1 = 6$.

If we use 64 filters of size 3*3, the size of our feature maps will be 6*6*64.

This is how a convolutional layer works and produces feature maps. In the next part, we will discuss the rest of the layers of a CNN.
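Putting padding, stride and multiple filters together, a vectorized sketch might look like the following. It assumes a NumPy version that provides `sliding_window_view` (1.20 or later), and the image and filter values are again random placeholders.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
image = rng.random((9, 9))         # hypothetical single-channel image
filters = rng.random((3, 3, 64))   # 64 hypothetical 3*3 filters
padding, stride = 2, 2

# Zero padding (np.pad uses mode="constant" with zeros by default).
padded = np.pad(image, padding)

# All 3*3 patches of the padded image, then keep every `stride`-th one.
windows = sliding_window_view(padded, (3, 3))[::stride, ::stride]

# For each patch and each filter: element-wise product, then sum.
feature_maps = np.einsum("ijkl,klf->ijf", windows, filters)

print(padded.shape, windows.shape, feature_maps.shape)
```

This avoids explicit Python loops, at the cost of being a little harder to read than the step-by-step version.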
