
According to the original paper (page 4), a $224 \times 224 \times 3$ image is reduced to $112 \times 112 \times 64$ after a convolution with a $7 \times 7$ filter and stride $2$.

  • $n \times n = 224 \times 224$
  • $f \times f = 7 \times 7$
  • stride: $s = 2$
  • padding: $p = 0$

The output size of the convolution is $\lfloor (n+2p-f)/s \rfloor + 1$, so we have $n+2p-f = 224+0-7 = 217$; dividing by the stride gives $217/2 = 108.5$, taking the lower value gives $108$, and adding $1$ gives $108+1 = 109$.
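For reference, here is a minimal Python sketch of that formula (the function name `conv_output_size` is my own, not from the paper); it reproduces the $109$ above:

```python
import math

def conv_output_size(n: int, f: int, s: int, p: int) -> int:
    """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

# The question's values: 224x224 input, 7x7 filter, stride 2, no padding.
print(conv_output_size(n=224, f=7, s=2, p=0))  # 109, not 112
```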

How, then, do we get an output of size $112 \times 112$?

nbro

1 Answer


The padding is not zero* in the Inception CNN layers. In fact, it is deliberately chosen so that the convolution by itself produces an output of the same size as the input, i.e. $p=(f-1)/2$; in some libraries this is called "same" padding.

So $p = (7-1)/2 = 3$.

The stride is not $2$ either: it is $s=1$ for the convolution. The Inception CNN does not use strided convolutions; instead, the stride of $2$ belongs to a later max-pooling layer.

Therefore, using $((n+2p-f)/s)+1$ with the correct values: $((224 + 6 - 7)/1) + 1 = 224$.

Then apply max-pooling with stride $2$: $224/2 = 112$.
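As a sanity check, here is a minimal PyTorch sketch of these two steps (PyTorch is my choice for illustration, not something the paper or this answer specifies; the $2 \times 2$ pooling window is an assumption that matches the $224/2$ arithmetic above):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)  # one 224x224x3 image (NCHW layout)

# "same" padding p=3 with stride s=1 keeps the spatial size at 224.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, stride=1, padding=3)
# A stride-2 max-pool then halves the spatial size (2x2 window assumed).
pool = nn.MaxPool2d(kernel_size=2, stride=2)

y = conv(x)
print(y.shape)  # torch.Size([1, 64, 224, 224])
z = pool(y)
print(z.shape)  # torch.Size([1, 64, 112, 112])
```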


* Not to be confused with "zero padding", which means padding using $0$ as the value inserted into the new area. So you can have zero padding with $p=3$.

Neil Slater