
There are many ways to create a "deep embedding"---by which I mean to project an input data point into a vector in a feature space, where this projection is learnt. To be useful, this vector should encode something semantic about the data.

In various places I've seen the embedding referred to as an unconstrained value in Euclidean space:

$$z \in \mathbb{R}^D$$

If it were truly unconstrained there might be issues in training (e.g. divergence), so there will pretty much always be some kind of regularisation involved in learning the function.
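For concreteness, here is a minimal sketch of the unconstrained setup, assuming PyTorch; the architecture and hyperparameters are purely illustrative, with weight decay standing in for the regularisation:

```python
import torch
import torch.nn as nn

# Sketch only: an encoder producing an unconstrained embedding z in R^D.
# Nothing projects z onto a manifold; the main thing discouraging norm
# blow-up is the weight decay applied to the encoder's parameters.
D = 128
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, D))

x = torch.randn(32, 784)   # dummy batch of inputs
z = encoder(x)             # z lives in R^D, its norm is free to grow

# One common form of regularisation: L2 weight decay via the optimiser.
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-3, weight_decay=1e-2)
```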

Alternatively, the embedding is L2-normalised before any further use

$$z' = z / \|z\|_2$$

which is equivalent to saying the embedding represents data on the surface of a hypersphere

$$z' \in S^{D-1}$$
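Again assuming PyTorch, the normalised alternative is essentially a one-liner (a sketch, not taken from any particular codebase):

```python
import torch
import torch.nn.functional as F

# Sketch of the alternative: L2-normalise so z' lies on the unit
# hypersphere S^(D-1).
z = torch.randn(32, 128)               # unconstrained embeddings
z_prime = F.normalize(z, p=2, dim=-1)  # z' = z / ||z||_2

print(z_prime.norm(dim=-1))            # every row has (numerically) unit norm
```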

Which of these is common practice in current deep learning?

In other words, is it standard practice for people to L2-normalise their embeddings?

There are various topological or geometric ways to argue for one way or the other. I'd like to understand what is common practice now. (And, if possible, why that choice is dominant.)

Related literature: In the following examples of relevant papers, you'll see that hyperspherical embeddings seem to be common in some corners of AI work, but not all.

...and yet, see these very recent papers:

  • In this ICML 2024 poster the authors claim that "embeddings learning is mostly done in Euclidean space" and they consider hyperspheres instead.
  • "nGPT: Normalized Transformer with Representation Learning on the Hypersphere" (NVIDIA, 2024). Claiming some novelty in the way they use normed values throughout. -- They specifically point out that "the norms of embedding vectors in the original Transformer are unconstrained", thus I guess this is true for most transformers, as a result of the foundational work?
Dan Stowell
    Some models like ResNet use batch normalization, which maps the data to Euclidean space but constrains the data variance. – Curious Nov 08 '24 at 10:01

1 Answer


As you rightly noticed, normalized hypersphere embeddings are gaining popularity, partly because bounding the norm tends to stabilize training. They are especially common in contrastive/triplet and other metric-learning settings, where distances between embeddings are meant to carry meaning: the hypersphere constraint lets in-group similarity and out-group dissimilarity be handled purely through cosine similarity. Without normalization, vectors can differ in length as well as direction, which makes the similarity metric more complicated and potentially less stable in practice.
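As a minimal sketch of why normalization simplifies the metric (assuming PyTorch; the loss and the temperature value are illustrative rather than taken from any specific method):

```python
import torch
import torch.nn.functional as F

# With L2-normalized embeddings the dot product *is* the cosine similarity,
# so the training signal depends only on direction, not on length.
anchors   = F.normalize(torch.randn(32, 128), dim=-1)
positives = F.normalize(torch.randn(32, 128), dim=-1)

sim = anchors @ positives.T            # cosine similarities in [-1, 1]

# A simple InfoNCE-style contrastive loss: anchor i should match positive i.
temperature = 0.07                     # illustrative value
loss = F.cross_entropy(sim / temperature, torch.arange(32))
```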

Unconstrained Euclidean embeddings remain standard in most foundational transformer models, such as in language modeling and other generative tasks where no explicit similarity measure is required, because they are flexible and easy to implement. Recently, though, there has been interest in hyperspherical embeddings for GPT-style models, such as nGPT (see the sketch after the quote below):

We propose a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere... Experiments show that nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length.
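To make the quoted point about unconstrained norms concrete, here is a toy PyTorch sketch; it only illustrates the idea of re-normalizing embedding rows and is not nGPT's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# In a vanilla Transformer the token-embedding rows are ordinary parameters,
# so their norms are unconstrained and drift during training.
vocab_size, d_model = 50_000, 512
tok_emb = nn.Embedding(vocab_size, d_model)
print(tok_emb.weight.norm(dim=-1).std())   # norms differ from token to token

# A hypersphere-style variant re-normalizes the rows (e.g. after each
# optimizer step) so that every token embedding has unit norm.
with torch.no_grad():
    tok_emb.weight.copy_(F.normalize(tok_emb.weight, dim=-1))
print(tok_emb.weight.norm(dim=-1))         # all approximately 1.0
```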

cinch