
There are many ways to create a "deep embedding"---by which I mean to project an input data point into a vector in a feature space, where this projection is learnt. To be useful, this vector should encode something semantic about the data.

In various places I've seen the embedding referred to as an unconstrained value in Euclidean space:

$$z \in \mathbb{R}^D$$

If it were truly unconstrained there might be issues in training (e.g. divergence), so there will pretty much always be some kind of regularisation involved in learning the function.
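For concreteness, here is a minimal sketch of the unconstrained setup, assuming PyTorch; the architecture and hyperparameters are purely illustrative, with weight decay standing in for the regularisation:

```python
import torch
import torch.nn as nn

# Sketch only: an encoder producing an unconstrained embedding z in R^D.
# Nothing projects z onto a manifold; the main thing discouraging norm
# blow-up is the weight decay applied to the encoder's parameters.
D = 128
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, D))

x = torch.randn(32, 784)   # dummy batch of inputs
z = encoder(x)             # z lives in R^D, its norm is free to grow

# One common form of regularisation: L2 weight decay via the optimiser.
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-3, weight_decay=1e-2)
```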

Alternatively, the embedding is L2-normalised before any further use

$$z' = z / \|z\|_2$$

which is equivalent to saying the embedding represents data on the surface of a hypersphere

$$z' \in S^{D-1}$$
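Again assuming PyTorch, the normalised alternative is essentially a one-liner (a sketch, not taken from any particular codebase):

```python
import torch
import torch.nn.functional as F

# Sketch of the alternative: L2-normalise so z' lies on the unit
# hypersphere S^(D-1).
z = torch.randn(32, 128)               # unconstrained embeddings
z_prime = F.normalize(z, p=2, dim=-1)  # z' = z / ||z||_2

print(z_prime.norm(dim=-1))            # every row has (numerically) unit norm
```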

Which of these is common practice in current deep learning?

In other words, is it standard practice for people to L2-normalise their embeddings?

There are various topological or geometric ways to argue for one way or the other. I'd like to understand what is common practice now. (And, if possible, why that choice is dominant.)

Related literature: In the following examples of relevant papers, you'll see that hyperspherical embeddings seem to be common in some corners of AI work, but not all.

...and yet, see these very recent papers:

  • In this ICML 2024 poster the authors claim that "embeddings learning is mostly done in Euclidean space" and they consider hyperspheres instead.
  • "nGPT: Normalized Transformer with Representation Learning on the Hypersphere" (NVIDIA, 2024). Claiming some novelty in the way they use normed values throughout. -- They specifically point out that "the norms of embedding vectors in the original Transformer are unconstrained", thus I guess this is true for most transformers, as a result of the foundational work?
Dan Stowell
    Some models like ResNet use batch normalization, which maps the data to Euclidean space but constrains the data variance. – Curious Nov 08 '24 at 10:01

1 Answer


As you rightly noticed, normalized hypersphere embeddings are gaining popularity, partly because bounding the norm tends to stabilize training. They are especially common in contrastive/triplet and other metric-learning settings, where distances between embeddings are meant to carry meaning: the hypersphere constraint lets in-group similarity and out-group dissimilarity be handled purely through cosine similarity. Without normalization, vectors can differ in length as well as direction, which makes the similarity metric more complicated and potentially less stable in practice.
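As a minimal sketch of why normalization simplifies the metric (assuming PyTorch; the loss and the temperature value are illustrative rather than taken from any specific method):

```python
import torch
import torch.nn.functional as F

# With L2-normalized embeddings the dot product *is* the cosine similarity,
# so the training signal depends only on direction, not on length.
anchors   = F.normalize(torch.randn(32, 128), dim=-1)
positives = F.normalize(torch.randn(32, 128), dim=-1)

sim = anchors @ positives.T            # cosine similarities in [-1, 1]

# A simple InfoNCE-style contrastive loss: anchor i should match positive i.
temperature = 0.07                     # illustrative value
loss = F.cross_entropy(sim / temperature, torch.arange(32))
```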

Unconstrained Euclidean embeddings remain standard in most foundational transformer models, such as in language modeling and other generative tasks where no explicit similarity measure is required, because they are flexible and easy to implement. Recently, though, there has been interest in hyperspherical embeddings for GPT-style models, such as nGPT (see the sketch after the quote below):

We propose a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere... Experiments show that nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length.
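To make the quoted point about unconstrained norms concrete, here is a toy PyTorch sketch; it only illustrates the idea of re-normalizing embedding rows and is not nGPT's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# In a vanilla Transformer the token-embedding rows are ordinary parameters,
# so their norms are unconstrained and drift during training.
vocab_size, d_model = 50_000, 512
tok_emb = nn.Embedding(vocab_size, d_model)
print(tok_emb.weight.norm(dim=-1).std())   # norms differ from token to token

# A hypersphere-style variant re-normalizes the rows (e.g. after each
# optimizer step) so that every token embedding has unit norm.
with torch.no_grad():
    tok_emb.weight.copy_(F.normalize(tok_emb.weight, dim=-1))
print(tok_emb.weight.norm(dim=-1))         # all approximately 1.0
```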

cinch