What is self-supervised learning in machine learning? How is it different from supervised learning?
3 Answers
Introduction
The term self-supervised learning (SSL) has been used (sometimes differently) in different contexts and fields, such as representation learning [1], neural networks, robotics [2], natural language processing, and reinforcement learning. In all cases, the basic idea is to automatically generate some kind of supervisory signal to solve some task (typically, to learn representations of the data or to automatically label a dataset).
I will describe what SSL means more specifically in three contexts: representation learning, neural networks and robotics.
Representation learning
The term self-supervised learning has been widely used to refer to techniques that do not use human-annotated datasets to learn (visual) representations of the data (i.e. representation learning).
Example
In [1], two patches are randomly selected and cropped from an unlabelled image and the goal is to predict the relative position of the two patches. Of course, we know the relative position of the two patches once we have chosen them (i.e. we can keep track of their centers), so, in this case, this is the automatically generated supervisory signal. The idea is that, to solve this task (known as a pretext or auxiliary task in the literature [3, 4, 5, 6]), the neural network needs to learn features in the images. These learned representations can then be used to solve the so-called downstream tasks, i.e. the tasks you are interested in (e.g. object detection or semantic segmentation).
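To make the patch example concrete, here is a minimal sketch in Python (NumPy) of how the label comes for free once the patches are chosen. The 3x3 grid with eight possible neighbor positions, the name `sample_patch_pair` and its parameters are my assumptions for illustration, not necessarily the exact setup used in [1].

```python
import numpy as np

# Assumed layout: the second patch sits in one of the eight cells around
# the first one, given as (row offset, column offset). The index is the label.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_patch_pair(image, patch_size, rng):
    """Return (center_patch, neighbor_patch, label) from an unlabelled image.

    The image must be at least 3 * patch_size along both axes.
    """
    h, w = image.shape[:2]
    # Choose the top-left corner of the center patch so that every
    # neighboring cell still lies inside the image.
    top = int(rng.integers(patch_size, h - 2 * patch_size + 1))
    left = int(rng.integers(patch_size, w - 2 * patch_size + 1))
    # The label is generated by us, not by a human annotator.
    label = int(rng.integers(len(OFFSETS)))
    dr, dc = OFFSETS[label]
    n_top, n_left = top + dr * patch_size, left + dc * patch_size
    center = image[top:top + patch_size, left:left + patch_size]
    neighbor = image[n_top:n_top + patch_size, n_left:n_left + patch_size]
    return center, neighbor, label
```

A network would then be trained to predict `label` from the pair `(center, neighbor)`, which is an ordinary classification problem with automatically generated targets.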
So, you first learn representations of the data (by SSL pre-training), then you can transfer these learned representations to solve a task that you actually want to solve, and you can do this by fine-tuning the neural network that contains the learned representations on a labeled (but smaller) dataset, i.e. you can use SSL for transfer learning.
This example is similar to the example given in this other answer.
Neural networks
Some neural networks, for example, autoencoders (AE) [7], are sometimes called self-supervised learning tools. In fact, you can train AEs without images that have been manually labeled by a human. More concretely, consider a de-noising AE, whose goal is to reconstruct the original image when given a noisy version of it. During training, you actually have the original image, given that you have a dataset of uncorrupted images and you just corrupt these images with some noise, so you can calculate some kind of distance between the original image and the AE's reconstruction of it from the noisy version, where the original image is the supervisory signal. In this sense, AEs are self-supervised learning tools, but it's more common to say that AEs are unsupervised learning tools, so SSL has also been used to refer to unsupervised learning techniques.
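As a rough sketch of where the supervisory signal comes from in the de-noising case, assuming Gaussian noise and mean squared error as the distance (both are my choices for illustration, as are the function names):

```python
import numpy as np

def make_denoising_pair(clean, noise_std, rng):
    """Corrupt a clean image; the clean image itself is the training target."""
    noisy = clean + rng.normal(0.0, noise_std, size=clean.shape)
    return noisy, clean

def reconstruction_loss(reconstruction, target):
    """Mean squared error between the AE output and the clean image."""
    return float(np.mean((reconstruction - target) ** 2))
```

During training, the AE would receive `noisy` as input and `reconstruction_loss(ae_output, clean)` would be minimized, so no human labels are involved at any point.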
Robotics
In [2], the training data is automatically but approximately labeled by finding and exploiting the relations or correlations between inputs coming from different sensor modalities (and this technique is called SSL by the authors). So, as opposed to representation learning or autoencoders, in this case, an actual labeled dataset is produced automatically.
Example
Consider a robot that is equipped with a proximity sensor (which is a short-range sensor capable of detecting objects in front of the robot at short distances) and a camera (which is a long-range sensor, but which does not provide a direct way of detecting objects). You can also assume that this robot is capable of performing odometry. An example of such a robot is Mighty Thymio.
Consider now the task of detecting objects in front of the robot at longer ranges than the range the proximity sensor allows. In general, we could train a CNN to achieve that. However, to train such a CNN with supervised learning, we would first need a dataset of labelled images (or videos), where the labels could e.g. be "object in the image" or "no object in the image". In supervised learning, this dataset would need to be manually labelled by a human, which clearly would require a lot of work.
To overcome this issue, we can use a self-supervised learning approach. In this example, the basic idea is to associate the output of the proximity sensor at a time step $t' > t$ with the output of the camera at time step $t$ (a smaller time step than $t'$).
More specifically, suppose that the robot is initially at coordinates $(x, y)$ (on the plane), at time step $t$. At this point, we still do not have enough information to label the output of the camera (at the same time step $t$). Suppose now that, at time $t'$, the robot is at position $(x', y')$. At time step $t'$, the output of the proximity sensor will e.g. be "object in front of the robot" or "no object in front of the robot". Without loss of generality, suppose that the output of the proximity sensor at $t' > t$ is "no object in front of the robot". Then the label associated with the output of the camera (an image frame) at time $t$ will be "no object in front of the robot".
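The labelling step could be sketched as follows. Everything specific here is my assumption for illustration and is not taken from [2]: the `Record` structure, the `target_distance` parameter, and the rule of picking $t'$ as the first later time step at which odometry says the robot has moved at least `target_distance` away from $(x, y)$.

```python
from dataclasses import dataclass
from math import hypot

@dataclass
class Record:
    t: int           # time step
    x: float         # position from odometry
    y: float
    frame: object    # camera output at time t
    obstacle: bool   # proximity sensor output at time t

def self_label(records, target_distance):
    """Pair each camera frame at t with the proximity reading at a later t'."""
    dataset = []
    for i, now in enumerate(records):
        for later in records[i + 1:]:
            # Assumed rule: t' is the first step at which the robot has
            # travelled far enough from where the frame was taken.
            if hypot(later.x - now.x, later.y - now.y) >= target_distance:
                dataset.append((now.frame, later.obstacle))
                break
    return dataset
```

Frames near the end of the log, for which no suitable $t'$ exists yet, simply stay unlabelled. The resulting `(frame, label)` pairs can be used like any manually labelled dataset.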
 
    
Self-supervised learning is when you use some parts of the samples as labels for a task that requires a good degree of comprehension to be solved. I'll emphasize these two key points, before giving an example:
- Labels are extracted from the sample, so they can be generated automatically, with some very simple algorithm (maybe just random selection). 
- The task requires understanding. This means that, in order to predict the output, the model has to extract some good patterns from the data, generating in the process a good representation.
A very common case for self-supervised learning takes place in natural language processing, when you need to solve a task but have little labeled data. In such cases, you need to learn a good representation or language model, so you take sentences and give your network self-supervision tasks like these:
- Ask the network to predict the next word in a sentence (which you know because you took it away). 
- Mask a word and ask the network to predict which word goes there (which you know because you had to mask it). 
- Replace a word with a random one (that probably doesn't make sense) and ask the network which word is wrong.
As you can see, these tasks are fairly simple to formulate and the labels are part of the same sample, but they require a certain understanding of the context to be solved.
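Here's a minimal sketch of how the three tasks above can turn a plain sentence into an (input, label) pair. The function names, the `[MASK]` placeholder and the whitespace tokenization in the usage note are my own choices for illustration.

```python
import random

def next_word_sample(tokens):
    """Input: the sentence without its last word. Label: the removed word."""
    return tokens[:-1], tokens[-1]

def masked_word_sample(tokens, rng, mask="[MASK]"):
    """Input: the sentence with one word hidden. Label: the hidden word."""
    i = rng.randrange(len(tokens))
    masked = tokens[:i] + [mask] + tokens[i + 1:]
    return masked, tokens[i]

def replaced_word_sample(tokens, vocabulary, rng):
    """Input: the sentence with one word swapped. Label: the swapped position."""
    i = rng.randrange(len(tokens))
    # Make sure the replacement really differs from the original word.
    candidates = [w for w in vocabulary if w != tokens[i]]
    corrupted = tokens[:i] + [rng.choice(candidates)] + tokens[i + 1:]
    return corrupted, i
```

For example, with `tokens = "the cat sat on the mat".split()` and `rng = random.Random(0)`, each function returns a training pair whose label was never written by a human: it is just the piece of the sentence that you removed or changed.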
And it's always like this: alter your data in some way, generating the label in the process, and ask the model something related to that transformation. If the task requires enough understanding of the data, you'll have success.
 
    
Self-supervised visual recognition is often applied to representation learning. Here we first learn features on unlabeled data (representation learning), and then learn the real model on features extracted from the labeled data. This especially makes sense when we have a lot of unlabeled data and little labeled data.
The features can be learned by solving so-called pretext tasks. Examples of pretext tasks are to predict the rotation of a jittered image, to recognize jittered instances of the same image, or to predict the spatial relationship of image patches.
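As an illustration of the rotation task, here is a minimal sketch with NumPy. It assumes the rotation is one of four multiples of 90 degrees and leaves out the jittering step; the function name is hypothetical.

```python
import numpy as np

def rotation_sample(image, rng):
    """Rotate an unlabelled image by k * 90 degrees; k is the label."""
    k = int(rng.integers(4))         # k in {0, 1, 2, 3}
    rotated = np.rot90(image, k)     # counterclockwise rotation
    return rotated, k
```

The network sees only `rotated` and has to predict `k`, which is known for free because we applied the rotation ourselves.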
A nice overview and interesting results can be found in this recent paper.
 
    