
In the paper Salient Region Detection and Segmentation, I have a question pertaining to section 3 on the convolution-like operation being performed. I had already asked a few questions about the paper previously, for which I received an answer here. In the answer, the author (JVGD) mentions the following:

So for each pixel in the image you overlap on top $R_{1}$ and then $R_{2}$ on top of $R_{1}$, then you compute the distance $D$ for those 2 regions to get the saliency value of that pixel, then slide the $R_{1}$ and $R_{2}$ regions in a sliding window manner (which is basically telling you to implement it with convolution operation).

Regarding the above, I had the following question: If the region $R_{2}$ moves in a sliding window manner, won't the saliency map (mentioned in section 3.1) have a smaller size than the original image (like in convolution the output image is smaller)? If this is so, wouldn't it be impossible to add the saliency maps at different scales since they each have different sizes?

The following edit re-explains the question in more detail: [animation of a 3 × 3 filter convolving a 5 × 5 image]

In the animation above, you can see a filter running across an image. For each position of the filter, some calculation happens between the pixels of the filter and the image, and the result of each calculation becomes one pixel in the output image (denoted "convolved feature"). Here, the output image is smaller than the input image because there are only 9 positions for the filter. From what I understood of the salient region operation, a similar process is being followed, i.e., a filter runs across an image, some calculation happens, and the result of each calculation becomes one pixel in the output image (the saliency map). Hence, won't the saliency map be smaller than the original image? Furthermore, for a 5 × 5 input, when the filter size is 3 × 3 the output image size is 3 × 3, but if the filter size were 5 × 3, the output image size would only be 1 × 3. Clearly, the output image size differs for different filter sizes, which would make the output images (saliency maps) impossible to add. There is clearly something I am missing / misunderstanding here, and clarity on the same would be much appreciated.
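My size arithmetic above can be checked directly with scipy's 2-D convolution in "valid" mode (no padding, matching the animation), which is how I arrived at the shapes quoted:

```python
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(5, 5)  # the 5 x 5 input from the animation

# A 3 x 3 filter with no padding ("valid" mode) shrinks the output:
# (5 - 3 + 1) x (5 - 3 + 1) = 3 x 3, i.e. 9 filter positions.
out_3x3 = convolve2d(image, np.ones((3, 3)), mode="valid")
print(out_3x3.shape)  # (3, 3)

# A 5 x 3 filter shrinks it differently: (5 - 5 + 1) x (5 - 3 + 1) = 1 x 3,
# so different filter sizes give differently sized outputs.
out_5x3 = convolve2d(image, np.ones((5, 3)), mode="valid")
print(out_5x3.shape)  # (1, 3)
```

So under a plain "valid" convolution, the per-scale outputs really cannot be added elementwise.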

P.S. There is no indication of padding or any operation of that sort in the research paper, so I don’t want to assume anything because the calculations would then be wrong.

1 Answer


The paper you have provided discusses *detecting salient regions by using a contrast determination filter which operates at various scales to generate saliency maps containing "saliency values" per pixel.*

The contrast detection filter used in this paper is best explained by figure 2. Here $R_{1}$ is the innermost region, and the scale of $R_{2}$ is varied. Note that this is the filter, not the image itself.

[Figure 2: Contrast detection filter]

Figure 3 below, from section 3.1 of the paper, shows how the image appears when filters of varying scales are applied (the filtered images). When $R_{2}$ has the largest scale (i.e., maximum width), the background is also shown, that is, the non-salient parts are also taken into consideration. As $R_{2}$ takes up less width, the non-salient parts become almost invisible, which you can see in images 3 to 8 of figure 3. This helps to focus on what's really important in the image, which is the man riding the horse.

[Figure 3: Salient region detection]

If you look at figure 3, although the width of $R_{2}$ varies, each filtered image is the same size as the original image.
To answer your question: the saliency map is the same size as the original image, because only the filter is scaled, not the image itself.

This is also stated in the paper, section 3.1, page 4:

A change in scale is affected by scaling the region R2 instead of scaling the image.
Scaling the filter instead of the image allows the generation of saliency maps of the same size and resolution as the input image.
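To make this concrete, here is a minimal sketch of one scale of such a contrast filter, using mean filters over the inner and outer regions. It is not the paper's exact implementation: the paper works on CIELab color with a Euclidean distance, while this sketch uses a grayscale image with an absolute difference, and the `"nearest"` border handling is my assumption, since the paper does not spell out its boundary treatment. The point is only that every scale produces a map the same size as the input:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def saliency_map(image, r1=3, r2=9):
    """One scale of a contrast filter (sketch): per-pixel distance between
    the mean of the inner region R1 and the outer region R2.
    Border handling ("nearest") is an assumption, not from the paper."""
    mean_r1 = uniform_filter(image, size=r1, mode="nearest")
    mean_r2 = uniform_filter(image, size=r2, mode="nearest")
    # uniform_filter returns an array the same shape as its input,
    # so the map size never depends on the filter scale.
    return np.abs(mean_r1 - mean_r2)

img = np.random.rand(100, 100)
# Vary only the R2 scale, as the paper describes; the image is untouched.
maps = [saliency_map(img, r1=3, r2=r2) for r2 in (9, 17, 33)]
total = sum(maps)  # all maps share the input's shape, so they add directly
print(total.shape)  # (100, 100)
```

Because the filter (not the image) is scaled, and the filtering keeps the output the same size as the input, the per-scale maps can be summed elementwise without any size mismatch.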

Update: Let me give you a simpler explanation.
Suppose you, as a human, have a couple of reading glasses (of varying sizes) and you are trying to read a book.

  • The smallest reading glass helps you focus on a small part of the book, so you would often have to look at different parts of the book
  • A medium reading glass focuses on a wider part of the book
  • A large reading glass can cover the whole book

In the above example, the reading glass acts as the filter, the book acts as your input image, and the output image is what you see through the reading glass.

No matter the size of your reading glass (filter), the dimensions of the book haven't changed; only the focus changes, i.e., the part of the book you see through the reading glass is sharp and everything else is a blur.

So the output image will be the same size as the input image, with the out-of-focus parts appearing as a blur.

Archana David