
I am asking this question for a better understanding of the concept of channels in images.

I am aware that a convolutional layer generates feature maps from a given image. We can adjust the size of the output feature map by proper padding and regulating strides.
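As a concrete sketch of that relationship, the output size of a convolution follows the standard formula `floor((in + 2*pad - kernel) / stride) + 1` (assuming a square kernel; the function name here is my own):

```python
# Sketch: spatial output size of a convolution, assuming a square kernel.
def conv_output_size(in_size, kernel, stride=1, padding=0):
    """floor((in + 2*pad - kernel) / stride) + 1"""
    return (in_size + 2 * padding - kernel) // stride + 1

# "Same" padding: a 3x3 kernel at stride 1 with padding 1 keeps 32x32 -> 32x32.
print(conv_output_size(32, kernel=3, stride=1, padding=1))  # -> 32
# A 5x5 kernel at stride 2 with no padding shrinks 28x28 -> 12x12.
print(conv_output_size(28, kernel=5, stride=2, padding=0))  # -> 12
```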

But I am not sure whether there exist kernels for a single convolution layer that can change an {RGBA, RGB, grayscale, binary} image into any other {RGBA, RGB, grayscale, binary} image.

For example, if I have a binary image of a cat, can such a layer convert it into an RGBA image of that same cat? If not, can it at least convert a binary cat image into some RGBA image?

I am asking only from a theoretical perspective.


1 Answer


No, because each output of a convolution layer only looks at a local region of the image. A convolution layer cannot perform global transformations, only local ones. Convolution layers are also translation-equivariant, which means that if the layer converts an eyeball to a tail at one position, it will convert the same eyeball to the same tail when it appears at a different position. If it is not overfitted, it will also convert similar eyeballs to similar tails. If you want only some eyeballs to become tails, you cannot achieve that without introducing overfitting, or without expanding the kernel size until the layer sees enough context to distinguish which eyeballs should become tails and which should not.
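The equivariance claim above can be checked numerically: shifting the input by some amount shifts the output by the same amount. A minimal sketch with a hand-rolled "valid" cross-correlation (no deep-learning library assumed):

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))

# A small pattern placed in an otherwise empty image, then shifted 2 px right.
img = np.zeros((10, 10))
img[2:5, 2:5] = rng.standard_normal((3, 3))
shifted = np.roll(img, shift=2, axis=1)  # pattern moves, content unchanged

out1 = conv2d_valid(img, kernel)
out2 = conv2d_valid(shifted, kernel)

# The layer's response to the pattern shifts by the same 2 px: equivariance.
print(np.allclose(out1[:, :-2], out2[:, 2:]))  # -> True
```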

If you want to change one specific image into another specific image, and don't care what happens to all other images, it may be possible to create a convolution layer that performs this transformation. The input image has to differ wherever the output image differs, or else the convolution layer won't be able to produce that difference in the output. You would effectively be teaching it to recognize the specific pixel patterns in the input image and generate the specific pixels of the output image. This would be an extreme case of overfitting and would not work for any other input image.
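One way to see this "memorizing a single image pair" is as a linear least-squares problem: each output pixel is a linear function of its local patch, so given one input image and one target output, you can solve for a kernel directly. A sketch under that framing (the target here is generated from a hidden kernel so an exact solution exists):

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.standard_normal((8, 8))
true_k = rng.standard_normal((3, 3))  # the "unknown" kernel to recover

def patches(img, k=3):
    """Stack every kxk neighborhood of img as a row vector."""
    H, W = img.shape
    return np.array([img[i:i + k, j:j + k].ravel()
                     for i in range(H - k + 1) for j in range(W - k + 1)])

A = patches(img)                 # one row per output pixel
target = A @ true_k.ravel()      # the specific output image we want

# Solve for a kernel that maps this one input to this one output exactly.
fit, *_ = np.linalg.lstsq(A, target, rcond=None)
print(np.allclose(A @ fit, target))  # -> True: perfect fit on this image
```

The fitted kernel reproduces the target for this one image only; on any other image its behavior is whatever the overfit weights happen to produce.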

The number of channels in the input and output images is irrelevant to this limitation: a convolution layer can map any number of input channels to any number of output channels. More channels simply mean the network has more data to learn from.
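For instance, changing the channel count alone is trivial: a 1x1 convolution is just a per-pixel linear map from input channels to output channels, so a 1-channel (binary) image becomes a 4-channel (RGBA) image in one step. A minimal sketch (shapes and weight values are illustrative assumptions):

```python
import numpy as np

# A 1x1 convolution: per-pixel linear map from C_in to C_out channels.
# Assumed layout: image is (H, W, C_in), weights are (C_in, C_out).
def conv1x1(img, weights, bias):
    return img @ weights + bias  # broadcasts over the H and W axes

# A 4x4 single-channel "binary" image (checkerboard of 0s and 1s).
binary = (np.indices((4, 4)).sum(axis=0) % 2).astype(float)[..., None]

W = np.array([[1.0, 0.5, 0.0, 0.0]])  # 1 input channel -> 4 output channels
b = np.array([0.0, 0.0, 0.2, 1.0])    # e.g. constant blue tint, opaque alpha

rgba = conv1x1(binary, W, b)
print(rgba.shape)  # -> (4, 4, 4): an RGBA-shaped output
```

This only changes the channel count, of course; it cannot invent colors that aren't locally determined by the input, which is exactly the limitation described above.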