How is the depth of the input related to the depth of the output of a convolutional layer?

Question

Let's suppose I have an image with 16 channels that goes to a convolutional layer, which has 3 trainable $7 \times 7$ filters, so the output of this layer has depth 3.

How does the convolutional layer go from 16 to 3 channels? What mathematical operation is applied?

score 3 · Accepted Answer · edited Sep 26 '21 at 22:00

The reason why you go from 16 to 3 channels is that, in a 2d convolution, filters span the entire depth of the input. Therefore, your filters would actually be $7 \times 7 \times 16$ in order to cover all channels of the input.
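To answer the "what mathematical operation" part directly, here is one way to write it down. This is a sketch under assumptions that are not in the question: zero-based indices, stride $s$, no padding, no dilation, and an optional per-filter bias $b_f$. The names $x$, $w_f$, $y_f$ and $b_f$ are just notation chosen for illustration. For filter $f \in \{0, 1, 2\}$ and output location $(i, j)$:

$$y_f[i, j] = b_f + \sum_{c=0}^{15} \sum_{u=0}^{6} \sum_{v=0}^{6} w_f[c, u, v] \, x[c, \, s i + u, \, s j + v]$$

The sum over $c$ is what collapses the 16 input channels into one number, and having one such equation per filter is what gives 3 output channels.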

Detailed procedure

The output of the convolution automatically has a depth equal to the number of filters (so in your case this is $3$). To see why, unroll each filter into a row: this gives an $m \times k$ filter matrix, where $m$ is the number of filters and $k$ is the number of elements in one unrolled filter. In your case, $m = 3$ and $k = 7 \times 7 \times 16 = 784$, so the filter matrix is $3 \times 784$.

The input is usually unrolled according to the im2col procedure, where each tile corresponding to a single filter location is stretched into a column of length $k$ (the unrolled filter size). This is repeated for each filter location, so you end up with a very large matrix of size $k \times n$, where $k$ is the same as $k$ above in the filter matrix, and $n$ is the number of filter locations, which depends on the input size, your padding, and your stride.

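Multiplying the $m \times k$ filter matrix with the $k \times n$ input matrix gives you an $m \times n$ output matrix, where $m$ is the number of filters. Each row holds one output channel (its $n$ values are reshaped back into a 2D grid), so the output depth is $m$ regardless of the input depth.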
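The matrix view above can be sketched in NumPy. This is only an illustration, not how any particular library implements it: the $32 \times 32$ input size, stride 1, no padding, no bias, and the helper name `im2col` are all assumptions made for the example.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Unroll every C x kh x kw tile of x (shape C x H x W) into one column."""
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, s = i * stride, j * stride
            tile = x[:, r:r + kh, s:s + kw]          # spans all 16 channels
            cols[:, i * out_w + j] = tile.reshape(-1)
    return cols, out_h, out_w

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32, 32))        # assumed input: 16 channels, 32 x 32
filters = rng.standard_normal((3, 16, 7, 7)) # 3 filters, each 7 x 7 x 16

filter_matrix = filters.reshape(3, -1)       # m x k = 3 x 784
cols, out_h, out_w = im2col(x, 7, 7)         # k x n = 784 x (26 * 26)
out = (filter_matrix @ cols).reshape(3, out_h, out_w)

print(filter_matrix.shape, cols.shape, out.shape)
```

The input depth of 16 only shows up inside $k$, which is the dimension that disappears in the matrix product. That is why the output depth depends on the number of filters alone.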

We say that a filter is applied to a channel when it is superimposed on the image, starting at the top-left corner: the weights of the filter are multiplied with the corresponding values in the image, and the products are summed up to a single value. The filter then moves to the right (and down once it reaches the rightmost part) across the image, according to the stride of the filter.

When a filter, e.g. $f_1$, is applied to a single channel, say $c$, at one location, the result is a single value. Applying it to all 16 channels at that location gives 16 values, which are added up to a single value. $f_1$ is then moved according to the stride and the same operation is repeated, producing an output with a single channel (the number of rows and columns is determined by the input size, padding, stride, dilation, and kernel size of the filters).

The same process is carried out by each of the 3 filters, and each filter produces one output channel, giving 3 channels in total. In this way, the convolutional layer makes the input go from 16 to 3 channels.
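The channel-by-channel description can also be written as explicit loops. This is a minimal sketch, assuming stride 1, no padding, no dilation, no bias, and a small $12 \times 12$ input so the loops run quickly; `conv2d_direct` is a hypothetical name, and like most deep-learning libraries it does not flip the filter (strictly speaking, it is a cross-correlation).

```python
import numpy as np

def conv2d_direct(x, filters, stride=1):
    """Slide each filter over x, summing the per-channel results at each location."""
    c, h, w = x.shape
    m, fc, kh, kw = filters.shape
    assert fc == c, "each filter must span the full input depth"
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = np.zeros((m, out_h, out_w))
    for f in range(m):                       # one output channel per filter
        for i in range(out_h):
            for j in range(out_w):
                r, s = i * stride, j * stride
                # One value per input channel: 16 values for a 16-channel input.
                per_channel = [
                    np.sum(x[ch, r:r + kh, s:s + kw] * filters[f, ch])
                    for ch in range(c)
                ]
                # The per-channel values are added up to a single output value.
                out[f, i, j] = sum(per_channel)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 12, 12))        # assumed input: 16 channels, 12 x 12
filters = rng.standard_normal((3, 16, 7, 7)) # 3 filters, each 7 x 7 x 16
out = conv2d_direct(x, filters)
print(out.shape)
```

Summing the per-channel values is the same as multiplying the whole $7 \times 7 \times 16$ tile by the whole filter and summing everything at once, so this agrees with the matrix-multiplication view.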

More detailed explanations can be found here.
