So I was studying the MobileNetV2 architecture and came across this table from the original paper that describes it:

Table 2: MobileNetV2 : Each line describes a sequence of 1 or more identical (modulo stride) layers, repeated n times. All layers in the same sequence have the same number c of output channels. The first layer of each sequence has a stride s and all others use stride 1. All spatial convolutions use 3 × 3 kernels. The expansion factor t is always applied to the input size as described in Table 1.

However, I don't really understand why the last pointwise convolution (conv2d 1x1) is applied after the average pooling. Shouldn't the output from the avgpool go to a dense layer so that the network can then make its predictions?

What is the point of performing a convolution, more precisely a pointwise convolution, after pooling? Also, why is k not a fixed number of kernels/filters, like in the other layers?

Thanks in advance!

Rob
Blue Ross

1 Answer

This is largely a matter of semantics, I think. After the avg pool layer, the feature map becomes $1\times 1\times 1280$, as indicated in the table, so when you apply a pointwise convolution (a $1 \times 1$ conv), it essentially serves as a fully-connected layer operating on the channel dimension, since there are no spatial dimensions left to work with.

If you want, you can think of the last layer as a dense layer with weights shaped $1280 \times k$, where the $1280$ input channels are interpreted as input features to the dense layer, and where $k$ is the number of classes.
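
To make the equivalence concrete, here is a minimal pure-Python sketch (not taken from the paper or any library). The helper names `pointwise_conv` and `dense`, the random weights, and the choice of $k = 10$ classes are all assumptions made for illustration. It applies a $1 \times 1$ convolution to a $1 \times 1 \times 1280$ feature map and compares the result with a dense layer that uses the same $1280 \times k$ weights.

```python
import random


def pointwise_conv(feature_map, weights, bias):
    """1x1 convolution over an H x W x C_in feature map (nested lists).

    weights has shape C_in x C_out, bias has length C_out.
    Every spatial position is transformed independently.
    """
    c_out = len(bias)
    return [
        [
            [
                bias[o] + sum(pixel[i] * weights[i][o] for i in range(len(pixel)))
                for o in range(c_out)
            ]
            for pixel in row
        ]
        for row in feature_map
    ]


def dense(features, weights, bias):
    """Fully connected layer on a flat vector of length C_in."""
    return [
        bias[o] + sum(features[i] * weights[i][o] for i in range(len(features)))
        for o in range(len(bias))
    ]


random.seed(0)
C_IN, K = 1280, 10  # K = 10 classes is an arbitrary, hypothetical choice

weights = [[random.gauss(0, 0.01) for _ in range(K)] for _ in range(C_IN)]
bias = [random.gauss(0, 0.01) for _ in range(K)]

# Stand-in for the output of global average pooling: a 1 x 1 x 1280 map.
pooled = [[[random.random() for _ in range(C_IN)]]]

conv_out = pointwise_conv(pooled, weights, bias)  # shape 1 x 1 x K
dense_out = dense(pooled[0][0], weights, bias)    # shape K

# Largest elementwise gap between the two sets of K class scores.
max_diff = max(abs(a - b) for a, b in zip(conv_out[0][0], dense_out))
print(len(conv_out[0][0]), max_diff)
```

With a single spatial position, the convolution computes exactly the same weighted sums as the dense layer, so the two outputs should agree. This also shows why $k$ is left open in the table: it is just the number of output channels of this last layer, which you set to the number of classes in your dataset.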

PeaBrane