
I was watching a video about Convolutional Neural Networks: https://www.youtube.com/watch?v=SQ67NBCLV98. What I'm confused about is the arrangement of applying the filters' channels to the input image or even to the output of a previous layer.

Question 1 - Looking at the visual explanation of how one filter with 3 channels is applied to the input image (with 3 channels), each filter channel is applied to its corresponding input channel: Conv 2_D. Hence the output is 3 channels. Makes sense.

However, the second screenshot shows an example of the VGG network: VGG. The first layer (which I've delineated with a red frame) has 64 channels, while the input image contains only 3 channels. How does the output depth become 64? The only way I can think this would be possible is if you apply:

  • filter channel 1 to image channel 1
  • filter channel 2 to image channel 2
  • filter channel 3 to image channel 3
  • filter channel 4 to image channel 1
  • filter channel 5 to image channel 2
  • filter channel 6 to image channel 3

.. and so on.

The other possibility is that these numbers represent Conv layers with 64 filters, rather than a single filter with 64 channels. And that's precisely what I'm confused about here. In all the popular Convolutional networks, when we see these big numbers (64, 128, 256, etc.), are these Conv layers with 64 filters, or are they individual filters with 64 channels each?

Question 2 - Referring back to the second screenshot, look at the layer I've delineated with a blue frame (3x3x128). This Conv layer, as I understand it, takes the 64 max-pooled output channels and applies 128 Conv filters. But how does the output become 128? If we apply each filter to each max-pooled output channel, that's 64 x 128 = 8192 channels in the output shape. Clearly that's not what's happening, so I'm definitely missing something here. So, how are 128 filters applied to 64 input channels in a way that the output still has 128 channels? What's the arrangement?

Many thanks in advance.

Hazzaldo

1 Answer

OK, here's the breakdown:

The depth of an input to a convolutional layer is termed channels. The depth of a convolutional layer is the number of kernels (aka filters). The depth of a kernel is equal to the number of channels in the input.
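As a rough illustration of these three definitions, here is a minimal Python sketch that does the shape bookkeeping for one convolutional layer. The function name `conv_layer_shapes` is hypothetical, and the 224x224 input size, 3x3 kernels, and padding of 1 are assumptions chosen to resemble the VGG layers in the question, not values taken from the screenshot.

```python
def conv_layer_shapes(in_channels, num_filters, kernel_size,
                      height, width, padding=0, stride=1):
    """Return (weight_shape, output_shape) for one 2-D convolutional layer."""
    # Each filter spans every input channel, so its depth equals in_channels.
    weight_shape = (num_filters, in_channels, kernel_size, kernel_size)
    out_h = (height + 2 * padding - kernel_size) // stride + 1
    out_w = (width + 2 * padding - kernel_size) // stride + 1
    # Each filter produces exactly one output channel.
    output_shape = (num_filters, out_h, out_w)
    return weight_shape, output_shape


# Assumed sizes: a 3-channel 224x224 image fed to 64 filters of size 3x3.
first = conv_layer_shapes(in_channels=3, num_filters=64, kernel_size=3,
                          height=224, width=224, padding=1)
# Assumed sizes: 64 pooled channels of 112x112 fed to 128 filters of size 3x3.
second = conv_layer_shapes(in_channels=64, num_filters=128, kernel_size=3,
                           height=112, width=112, padding=1)
print(first)
print(second)
```

Under these assumptions, the output depth always equals the number of filters, and the filter depth always equals the input depth.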

See below:

Convolution

The input (of 7x7, pad of 1) has 3 channels. The convolutional layer has 2 kernels (or filters). Each filter has a depth of 3, equal to the number of channels in the input. Using the notation you used in your question:

  • Filter 1, channel 1 to input channel 1
  • Filter 1, channel 2 to input channel 2
  • Filter 1, channel 3 to input channel 3
  • Sum the results from all three channels of filter 1, then add the bias. This gives one value in output channel 1.

  • Filter 2, channel 1 to input channel 1
  • Filter 2, channel 2 to input channel 2
  • Filter 2, channel 3 to input channel 3
  • Sum the results from all three channels of filter 2, then add the bias. This gives one value in output channel 2.

These steps are repeated at each position the filter slides to over the input image.
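The steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not how any particular library implements it: the function name `conv2d` and the toy values are made up, padding is left out for brevity, and like most deep learning libraries it slides the kernel without flipping it.

```python
def conv2d(image, filters, biases, stride=1):
    """Naive multi-channel convolution.

    image:   nested list indexed [channel][row][col]
    filters: nested list indexed [filter][channel][row][col]
    biases:  one number per filter
    Returns a nested list indexed [filter][row][col].
    """
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    k = len(filters[0][0])  # kernel height/width (square kernels assumed)
    out_h = (height - k) // stride + 1
    out_w = (width - k) // stride + 1
    output = []
    for filt, bias in zip(filters, biases):
        # The filter depth must match the number of input channels.
        assert len(filt) == channels
        feature_map = [[0.0] * out_w for _ in range(out_h)]
        for i in range(out_h):
            for j in range(out_w):
                total = bias
                # Filter channel c is applied to input channel c, and the
                # per-channel results are summed into a single number.
                for c in range(channels):
                    for di in range(k):
                        for dj in range(k):
                            total += (filt[c][di][dj]
                                      * image[c][i * stride + di][j * stride + dj])
                feature_map[i][j] = total
        output.append(feature_map)  # one output channel per filter
    return output


# Toy data: a 3-channel 4x4 image of ones and two 3x3x3 filters.
image = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]
filters = [
    [[[1.0] * 3 for _ in range(3)] for _ in range(3)],
    [[[0.5] * 3 for _ in range(3)] for _ in range(3)],
]
biases = [1.0, 0.0]
out = conv2d(image, filters, biases)
print(len(out), len(out[0]), len(out[0][0]))  # 2 output channels, each 2x2
```

Two filters give two output channels, regardless of the input having three channels.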

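To answer question 2: an output depth of 128 simply means the layer has 128 filters. Each of those filters has a depth of 64, matching the 64 channels of its input, and its 64 per-channel results are summed into a single output channel. So the output has 128 channels, not 64 x 128 = 8192. The number of filters is a design choice; you can use as many as you like.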
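To make the arithmetic concrete, here is a small sketch that counts the learnable parameters of the blue-framed layer, assuming 3x3 kernels, 64 input channels, 128 filters, and one bias per filter. The helper name `conv_param_count` is hypothetical.

```python
def conv_param_count(in_channels, num_filters, kernel_size):
    """Count weights and biases in one 2-D convolutional layer."""
    # Every filter holds one k x k kernel per input channel.
    weights_per_filter = in_channels * kernel_size * kernel_size
    # One bias per filter.
    return num_filters * (weights_per_filter + 1)


blue_layer_params = conv_param_count(in_channels=64, num_filters=128, kernel_size=3)
output_channels = 128  # equal to the number of filters
print(blue_layer_params, output_channels)
```

The 64 x 128 product does show up, but in the number of 3x3 kernels stored in the weights (8192 of them), not in the number of output channels.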

EDIT:

Here's the link to the interactive graphic: http://cs231n.github.io/convolutional-networks/

Recessive