
I am reading the RetinaFace paper:

RetinaFace: Single-stage Dense Face Localisation in the Wild, by Jiankang Deng, Jia Guo, Yuxiang Zhou, Jinke Yu, Irene Kotsia, Stefanos Zafeiriou

link: https://arxiv.org/abs/1905.00641

One of the questions we aim at answering in this paper is whether we can push forward the current best performance (90.3% [67]) on the WIDER FACE hard test set [60] by using extra supervision signal built of five facial landmarks.

[figure from the paper illustrating the supervision signals]

Question: what is meant here by extra supervision and self-supervision?

Also suggest some good resources for understanding this paper better.

mat
  • Without reading the paper, extra supervision probably only means that you use other supervisory signals, in addition to the existing ones, like the picture suggests. Now, the more interesting questions are - 1. what task(s) are we actually trying to solve here? 2. How are these different signals combined to train the model to solve those tasks? I'd recommend that you start by reading the answers in this post in order to get familiar with SSL, if you haven't already. – nbro Jul 05 '23 at 07:59
  • @nbro, thanks for the suggestion on the SSL resources; it would be great to know how extra supervision is used across tasks. – mat Jul 06 '23 at 03:01

1 Answer


Basically:

  • The extra supervision refers to a facial landmark regression loss, $L_\text{pts}$ (section 3.1): one branch of the network also predicts five facial landmarks (the two eye centres, the nose tip, and the two mouth corners), and a simple regression loss is applied to each point. Intuitively, the landmarks add information about the face pose, so predicting this extra information can help the network attain better performance.
  • The self-supervision part is not a typical SSL loss. Instead, there is an additional dense regression branch (section 3.2) that predicts a 3D mesh of the input face, and a pixel-wise loss ($L_\text{pixel}$) is applied to the 2D projection of that 3D mesh against the original image. You can still consider it "self-supervision" in the sense that there are no extra labels (so no additional human supervision): the 3D meshes are derived from the (unlabeled) input images themselves.

Lastly, all the losses are balanced with coefficients $\lambda = [0.25, 0.1, 0.01]$, as shown in equation 1.
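To make the weighting concrete, here is a minimal sketch of how the four per-anchor losses combine per equation 1. This is illustrative pseudocode in Python, not the authors' implementation; the function and argument names are my own, and it assumes the individual branch losses have already been computed elsewhere.

```python
def retinaface_loss(l_cls, l_box, l_pts, l_pixel, is_positive,
                    lambdas=(0.25, 0.1, 0.01)):
    """Combine RetinaFace's four losses for one anchor (eq. 1, sketch).

    l_cls       : face / not-face classification loss (always applied)
    l_box       : bounding-box regression loss
    l_pts       : five-landmark regression loss (the "extra supervision")
    l_pixel     : dense 3D reprojection loss (the "self-supervision")
    is_positive : whether the anchor matched a face; the three
                  regression losses only apply to positive anchors
    """
    l1, l2, l3 = lambdas
    p = 1.0 if is_positive else 0.0
    return l_cls + p * (l1 * l_box + l2 * l_pts + l3 * l_pixel)

# A positive anchor: regression terms are down-weighted by the lambdas.
total = retinaface_loss(0.5, 0.2, 0.3, 0.4, is_positive=True)
# 0.5 + 0.25*0.2 + 0.1*0.3 + 0.01*0.4 = 0.584
```

Note how small λ₃ = 0.01 is: the dense (self-supervised) branch contributes only weakly to the total, which matches its role as an auxiliary signal rather than the main objective.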

Luca Anzalone