
I understand how depth, kernel size, stride, and padding work when dealing with filters in a spatial convolution layer.

What I don't understand is "the number of filters". Does that mean you're running the same kernel with the same stride over the same channels in the same direction, just with multiple copies of that process in parallel?

Would that not result in the filters converging to the same weights and producing the same feature maps?

nbro

1 Answer


Does that mean you're running the same kernel ... but just multiple copies of that process in parallel?
Would that not result in it converging to the same weights and getting the same feature maps?

I found a related question here about how filters are initialized. It looks like filters are randomly initialized. This page builds a CNN from scratch, and the paper All you need is a good init (Mishkin and Matas, 2016) discusses exactly this issue:

Layer-sequential unit-variance (LSUV) initialization – a simple method for weight initialization for deep net learning – is proposed. The method consists of the two steps. First, pre-initialize weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.
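As a rough illustration of the two quoted steps, here is a NumPy sketch reduced to a single fully connected layer (illustrative only, not the authors' implementation; the batch size, shapes, and tolerance are arbitrary choices):

```python
import numpy as np

def orthonormal(shape, rng):
    # Step 1: pre-initialize with an orthonormal matrix (QR of a Gaussian draw).
    a = rng.normal(size=shape)
    q, _ = np.linalg.qr(a)
    return q

def lsuv_scale(w, x, tol=0.05, max_iter=10):
    # Step 2: rescale the weights until the layer's output variance is close to 1.
    for _ in range(max_iter):
        var = (x @ w).var()
        if abs(var - 1.0) < tol:
            break
        w = w / np.sqrt(var)
    return w

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))   # a batch of inputs to this layer
w = orthonormal((64, 32), rng)   # step 1
w = lsuv_scale(w, x)             # step 2
print((x @ w).var())             # approximately 1.0
```

In the full method this is repeated from the first layer to the last, each layer being normalized on the output of the already-normalized layers before it.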

So it looks like filters can be initialized in a number of ways. In TensorFlow's Conv2D, the documentation specifies:

kernel_initializer | Initializer for the kernel weights matrix (see keras.initializers).
Defaults to 'glorot_uniform'.
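For instance, here is a minimal sketch (assuming TensorFlow 2.x) showing that each of the filters is a separate slice of the weight tensor with its own random draw from the initializer:

```python
import numpy as np
import tensorflow as tf

layer = tf.keras.layers.Conv2D(
    filters=8,                            # 8 kernels -> 8 output feature maps
    kernel_size=3,
    kernel_initializer="glorot_uniform",  # the documented default
)
layer.build(input_shape=(None, 32, 32, 3))

print(layer.kernel.shape)   # (3, 3, 3, 8) = (height, width, in_channels, filters)

f0 = layer.kernel[..., 0].numpy()
f1 = layer.kernel[..., 1].numpy()
print(np.allclose(f0, f1))  # False: the filters start from different weights
```

Because every filter starts from a different random point, gradient descent generally pushes them toward different features rather than identical ones.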

So you are able to specify how you would like this to happen. I found this discussion about filter convergence, and it seems to be an open research question. On that point, I did find a research paper that addresses exactly this issue: Reducing Duplicate Filters in DNN (RoyChowdhury et al., 2017).

5. Conclusion

We have shown that duplication of filters occurs more in MLPs than CNNs, and this appears to be an outcome of over-parameterization in the fully-connected MLP model. Increasing the number of filters at a layer results in more duplicates for MLP and is less marked for CNNs. A method to reduce a network based on near-duplicate filters is introduced and is shown to work well for fully-connected nets in the regime of moderate compression. For CNNs, a norm-based pruning strategy works as well or better, underlining their differences with MLPs in how parameter redundancy is manifested.

So filters can converge to near-duplicates, and the problem becomes more likely as the number of filters grows, but the combination of filters having different random initializations and the nature of CNNs seems to mitigate this issue.
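As a concrete way to see what "duplicate filters" means, here is an illustrative NumPy-only sketch (the 0.95 threshold and the helper name near_duplicate_pairs are my own choices, not taken from the paper) that flags pairs of filters whose flattened weights are nearly parallel:

```python
import numpy as np

def near_duplicate_pairs(kernels, threshold=0.95):
    # kernels has shape (height, width, in_channels, n_filters).
    n = kernels.shape[-1]
    flat = kernels.reshape(-1, n).T                       # one row per filter
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    cos = flat @ flat.T                                   # pairwise cosine similarities
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(cos[i, j]) > threshold]

rng = np.random.default_rng(0)
kernels = rng.normal(scale=0.1, size=(3, 3, 16, 32))  # a randomly initialized layer
print(near_duplicate_pairs(kernels))                  # typically an empty list
```

Running the same check on a trained network's weights would show whether any filters have actually collapsed onto each other.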

I hope this helps to clarify the issue :)

J.Kent