
I am new to self-supervised learning and it all seems a little magical at the moment.

The only way I can get an intuitive understanding is to assume that, for real-world problems, features are still embedded at a per-object level.

For example, to detect cats in unseen images, the data my self-supervised network is trained on would still have to consist exclusively of cats.

So, if I had 100 images of cats and 100 images of dogs, then I thought self-supervised approaches would learn the features of the images. For example, if an image is rotated 90 degrees, it learns what was in the image that was rotated 90 degrees. However, if I wanted to classify just cats using this representation, then I wouldn't be able to do so without separating out what makes a cat a cat and a dog a dog.

Is my assumption correct?

nbro

2 Answers


I don't think your interpretation is correct. Take images as an example.

  • Supervised Learning

    e.g. classification (maybe use a CNN with a cross-entropy loss function)

    Assume you have many images, each with a label. You want to learn a function that approximates $y=f(x)$ from many sample pairs $(\hat x, \hat y)$.

  • Unsupervised Learning

    e.g. clustering (maybe use k-means)

    Assume you have many images but no labels, or you just want to see whether they fall into natural categories. So you cluster the images by some characteristic that isn't pre-defined.

  • Self-Supervised Learning

    e.g. super-resolution (maybe use a CNN with an L2 loss function)

    You have many high-resolution images without labels, and your goal is to train a model that upsamples a low-resolution image. So you use each high-resolution image as the target, downsize it to create the input, and train on the resulting (input, target) pairs. The target is not a manually tagged label; it is generated directly from the data.
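To make the super-resolution example concrete, here is a minimal sketch of how such training pairs could be produced. It uses plain Python lists as a stand-in for image tensors, and the helper names (`downsample`, `make_training_pair`) and the average-pooling choice are assumptions made for illustration, not part of any particular library.

```python
def downsample(image, factor=2):
    """Average-pool a 2-D grayscale image (list of rows) by `factor`."""
    height, width = len(image), len(image[0])
    low_res = []
    for top in range(0, height - height % factor, factor):
        row = []
        for left in range(0, width - width % factor, factor):
            block = [image[top + dy][left + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        low_res.append(row)
    return low_res


def make_training_pair(high_res, factor=2):
    """Return (input, target): the target is the untouched original image."""
    return downsample(high_res, factor), high_res


high_res = [
    [0, 2, 4, 6],
    [2, 4, 6, 8],
    [8, 6, 4, 2],
    [6, 4, 2, 0],
]
low_res, target = make_training_pair(high_res)
print(low_res)  # [[2.0, 6.0], [6.0, 2.0]]
```

Note that no human ever labels anything here: the "label" for each input is simply the image it was derived from.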

J3soon

Andrew Zisserman, who is a pioneer in the field of self-supervised learning, described self-supervised learning in a talk at ICML as:

Self-supervised Learning is a form of unsupervised learning where the data provides the supervision. In general, we withhold some part of the data and task the network with predicting it. The network is forced to learn what we really care about e.g. a semantic representation, in order to solve it.

Thus, self-supervised learning is a subset of unsupervised learning in which you generate the labels from the data itself. Research on self-supervised learning follows a few common patterns:

1. Reconstruction:
Here, the pretext task is to reconstruct part of the data from the rest: predicting the color image from its gray-scale version (Image Colorization), predicting the high-resolution image from a low-resolution version (Image Super-resolution), or removing part of the image and predicting the missing region (Image Inpainting).
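As a rough sketch of how colorization and inpainting pairs might be built from unlabeled images: the helper names below are hypothetical, pixels are plain `(R, G, B)` tuples, and the uniform channel average is a simplifying assumption (real pipelines typically use a weighted luminance formula).

```python
def to_grayscale(rgb_image):
    """Colorization input: average the R, G, B channels of each pixel."""
    return [[sum(pixel) / 3 for pixel in row] for row in rgb_image]


def mask_region(image, top, left, size, fill=0):
    """Inpainting input: copy the image and blank out a size x size square."""
    masked = [list(row) for row in image]
    for y in range(top, top + size):
        for x in range(left, left + size):
            masked[y][x] = fill
    return masked


rgb = [
    [(30, 60, 90), (0, 0, 0)],
    [(255, 255, 255), (10, 20, 30)],
]
# Each pair is (network input, reconstruction target).
colorization_pair = (to_grayscale(rgb), rgb)
inpainting_pair = (mask_region(rgb, 0, 0, 1, fill=(0, 0, 0)), rgb)
```

In both cases the target is the original image, so the supervision comes entirely from the data.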

2. Common Sense Reasoning:
You could split an image into a 3x3 grid of patches, shuffle the patches, and ask the network to predict the correct order (Jigsaw puzzle).

Similarly, you could take the center patch and some random patch and train a model to predict where the random patch is located relative to the center patch (context prediction).

In another approach, you rotate the image by an angle chosen at random from {0, 90, 180, 270} degrees and ask the model to predict which rotation was applied (Geometric Transformation Recognition).
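The rotation task is probably the easiest of these to sketch. In the snippet below the function names are made up for illustration and images are plain nested lists; the point is only that the label (the rotation index) is generated by the same code that creates the input.

```python
import random


def rotate_90(image):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_rotation_example(image, rng=random):
    """Rotate by a random multiple of 90 degrees; the label is that multiple."""
    label = rng.randrange(4)  # 0 -> 0, 1 -> 90, 2 -> 180, 3 -> 270 degrees
    rotated = [list(row) for row in image]
    for _ in range(label):
        rotated = rotate_90(rotated)
    return rotated, label


image = [
    [1, 2],
    [3, 4],
]
print(rotate_90(image))  # [[3, 1], [4, 2]]
rotated, label = make_rotation_example(image, random.Random(0))
```

A classifier trained on `(rotated, label)` pairs has to pick up on what objects look like when upright in order to do well, which is where the useful features come from.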

3. Clustering:

You could cluster the images into K categories and treat the cluster assignments as labels. A model is then trained on those pseudo-labels, which gives you representations. You can alternate clustering and model training for a few epochs. Papers using this approach include DeepCluster and Self-Labelling.
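The pseudo-labeling step can be sketched with a toy k-means. This is only an illustration under stated assumptions: `features` stands in for embeddings produced by the current network, the initialization is deliberately naive, and none of this reproduces the actual DeepCluster or Self-Labelling code.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_clusters(features, centroids):
    """Pseudo-label each feature vector with the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda k: squared_distance(f, centroids[k]))
            for f in features]


def update_centroids(features, labels, centroids):
    """Move each centroid to the mean of its members (keep it if empty)."""
    updated = []
    for k, old in enumerate(centroids):
        members = [f for f, label in zip(features, labels) if label == k]
        if members:
            updated.append([sum(dim) / len(members) for dim in zip(*members)])
        else:
            updated.append(list(old))
    return updated


def kmeans_pseudo_labels(features, k, iterations=10):
    centroids = [list(f) for f in features[:k]]  # naive init: first k points
    for _ in range(iterations):
        labels = assign_clusters(features, centroids)
        centroids = update_centroids(features, labels, centroids)
    return assign_clusters(features, centroids)


# Hypothetical 2-D embeddings of four images.
features = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
pseudo_labels = kmeans_pseudo_labels(features, k=2)
print(pseudo_labels)  # [0, 1, 0, 1]
```

In the full procedure, a classifier would be trained on `pseudo_labels`, the features recomputed with the updated network, and the clustering repeated.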

4. Contrastive Learning:

In this paradigm, two augmentations of the same image are pulled close together in representation space, while the image is pushed away from other, randomly chosen images. Papers using this approach include SimCLR and PIRL.
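One common way to express "pull positives together, push negatives apart" is an InfoNCE-style loss, which is similar in spirit to the objective used in SimCLR. The sketch below is a simplified single-anchor version in plain Python; the function names, the temperature value, and the toy embeddings are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(anchor, positive, negatives, temperature=0.5):
    """-log of the softmax probability assigned to the positive pair."""
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits))
    return log_sum_exp - logits[0]


anchor = [1.0, 0.0]                      # embedding of one augmented view
negatives = [[-1.0, 0.0], [0.0, -1.0]]   # embeddings of other images
good = contrastive_loss(anchor, [0.9, 0.1], negatives)   # positive is close
bad = contrastive_loss(anchor, [-0.9, 0.1], negatives)   # positive is far
print(good < bad)  # True
```

The loss is small when the two views of the same image have similar embeddings and large when they do not, so minimizing it shapes the representation without any class labels.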

Amit Chaudhary