
According to the original paper (page 4), a $224 \times 224 \times 3$ image is reduced to $112 \times 112 \times 64$ after a convolution with a $7 \times 7$ filter and stride $2$.

  • $n \times n = 224 \times 224$
  • $f \times f = 7 \times 7$
  • stride: $s = 2$
  • padding: $p = 0$

The output size of the convolution is $\lfloor (n+2p-f)/s \rfloor + 1$, so we have $n+2p-f = 224+0-7 = 217$; dividing by the stride gives $217/2 = 108.5$, taking the lower value gives $108$, and adding $1$ gives $108+1 = 109$.
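For reference, here is a minimal Python sketch of that formula (the function name `conv_output_size` is my own, not from the paper); it reproduces the $109$ above:

```python
import math

def conv_output_size(n: int, f: int, s: int, p: int) -> int:
    """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

# The question's values: 224x224 input, 7x7 filter, stride 2, no padding.
print(conv_output_size(n=224, f=7, s=2, p=0))  # 109, not 112
```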

How, then, do we get an output of size $112 \times 112$?

nbro

1 Answer


The padding is not zero* in the Inception CNN layers. In fact, it is deliberately chosen so that the convolution by itself produces an output of the same size as the input, i.e. $p=(f-1)/2$; in some libraries this is called "same" padding.

So $p = (7-1)/2 = 3$.

The stride is not $2$ either: it is $s=1$ for the convolution. The Inception CNN does not use strided convolutions; instead, the stride of $2$ belongs to a later max-pooling layer.

Therefore, using $((n+2p-f)/s)+1$ with the correct values: $((224 + 6 - 7)/1) + 1 = 224$.

Then apply max-pooling with stride $2$: $224/2 = 112$.
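As a sanity check, here is a minimal PyTorch sketch of these two steps (PyTorch is my choice for illustration, not something the paper or this answer specifies; the $2 \times 2$ pooling window is an assumption that matches the $224/2$ arithmetic above):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)  # one 224x224x3 image (NCHW layout)

# "same" padding p=3 with stride s=1 keeps the spatial size at 224.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, stride=1, padding=3)
# A stride-2 max-pool then halves the spatial size (2x2 window assumed).
pool = nn.MaxPool2d(kernel_size=2, stride=2)

y = conv(x)
print(y.shape)  # torch.Size([1, 64, 224, 224])
z = pool(y)
print(z.shape)  # torch.Size([1, 64, 112, 112])
```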


* Not to be confused with "zero padding", which means padding using $0$ as the value inserted into the new area. So you can have zero padding with $p=3$.

Neil Slater