Self-supervised learning algorithms generate their labels automatically from the data itself. But it is not clear what else is required for an algorithm to fall under the category "self-supervised":
Some say self-supervised learning algorithms learn on a set of auxiliary tasks [1], also called pretext tasks [2, 3], instead of the task we are actually interested in; examples are autoencoders [4] or word2vec [5]. Here it is sometimes mentioned that the goal is to "expose the inner structure of the data".
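For illustration, here is a minimal sketch (plain Python; the function names are mine, not from the cited papers) of how those two examples derive "labels" from the raw data alone:

```python
def autoencoder_pairs(xs):
    # Autoencoder: the target is the input itself, so every (x, x)
    # pair is a "labeled" example obtained without any annotation.
    return [(x, x) for x in xs]

def skipgram_pairs(tokens, window=2):
    # word2vec (skip-gram): (center, context) pairs are read straight
    # off the unlabeled text -- the surrounding words act as labels.
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

print(skipgram_pairs("the cat sat on the mat".split()))
```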
Others do not mention an auxiliary task at all, implying that an algorithm can be called "self-supervised" even if it directly learns the task we are interested in [6, 7].
Is the "auxiliary tasks" a requirement for a training setup to be called "self-supervised learning" or is it just optional?
Research articles mentioning the auxiliary / pretext task:
Revisiting Self-Supervised Visual Representation Learning, 2019, mentioned by [3]:
The self-supervised learning framework requires only unlabeled data in order to formulate a pretext learning task such as predicting context or image rotation, for which a target objective can be computed without supervision.
Unsupervised Representation Learning by Predicting Image Rotations, ICLR, 2018, mentioned by [2]:
a prominent paradigm is the so-called self-supervised learning that defines an annotation free pretext task, using only the visual information present on the images or videos, in order to provide a surrogate supervision signal for feature learning.
Unsupervised Visual Representation Learning by Context Prediction, ICCV, 2015, mentioned by [2]:
This converts an apparently unsupervised problem (finding a good similarity metric between words) into a “self-supervised” one: learning a function from a given word to the words surrounding it. Here the context prediction task is just a “pretext” to force the model to learn a good word embedding, which, in turn, has been shown to be useful in a number of real tasks, such as semantic word similarity.
Scaling and Benchmarking Self-Supervised Visual Representation Learning, 2019:
In discriminative self-supervised learning, which is the main focus of this work, a model is trained on an auxiliary or ‘pretext’ task for which ground-truth is available for free. In most cases, the pretext task involves predicting some hidden portion of the data (for example, predicting color for gray-scale images).
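To make the quoted pretext tasks concrete, here is a minimal sketch of the image-rotation task mentioned in the first two quotes (assuming square numpy image arrays; the function name is hypothetical). The labels are just the rotation indices, so ground truth is indeed available "for free":

```python
import numpy as np

def rotation_pretext_batch(images):
    # Turn unlabeled images into a labeled 4-way classification problem:
    # each image is rotated by 0/90/180/270 degrees and the label is the
    # rotation index -- a surrogate supervision signal derived from the
    # data itself, with no human annotation.
    inputs, labels = [], []
    for img in images:            # img: square (H, W, C) array
        for k in range(4):
            inputs.append(np.rot90(img, k))
            labels.append(k)
    return np.stack(inputs), np.array(labels)

# Usage: 8 "unlabeled" images become 32 labeled training examples.
unlabeled = np.random.rand(8, 32, 32, 3)
x, y = rotation_pretext_batch(unlabeled)
print(x.shape, y.shape)  # (32, 32, 32, 3) (32,)
```

Note that the rotation prediction itself is not the task we care about; it is only a pretext to force the model to learn useful features, which is exactly the distinction my question is about.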