
What is non-Euclidean data?

Here are some sub-questions:

  • Where does this type of data arise? I have come across this term in the context of geometric deep learning and graph neural networks.

  • Apparently, graphs and manifolds are non-Euclidean data. Why exactly is that the case?

  • What is the difference between non-Euclidean and Euclidean data?

  • What would a dataset of non-Euclidean data look like?


5 Answers


I presume this question was prompted by the paper Geometric deep learning: going beyond Euclidean data (2017). If we look at its abstract:

Many scientific fields study data with an underlying structure that is a non-Euclidean space. Some examples include social networks in computational social sciences, sensor networks in communications, functional networks in brain imaging, regulatory networks in genetics, and meshed surfaces in computer graphics. In many applications, such geometric data are large and complex (in the case of social networks, on the scale of billions), and are natural targets for machine learning techniques. In particular, we would like to use deep neural networks, which have recently proven to be powerful tools for a broad range of problems from computer vision, natural language processing, and audio analysis. However, these tools have been most successful on data with an underlying Euclidean or grid-like structure, and in cases where the invariances of these structures are built into networks used to model them.

We see that the authors use the term "non-Euclidean data" to refer to data whose underlying structure is non-Euclidean.

Since Euclidean spaces are prototypically defined by $\mathbb{R}^n$ (for some dimension $n$), 'Euclidean data' is data which is sensibly modelled as being plotted in $n$-dimensional linear space, for example image files (where the $x$ and $y$ coordinates refer to the location of each pixel, and the $z$ coordinate refers to its colour/intensity).
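
To make this concrete, here is a minimal sketch (plain NumPy, toy values) of how an image lives naturally in a Euclidean space:

```python
import numpy as np

# A tiny 3x3 grayscale "image": the domain is a regular grid in R^2
# (pixel coordinates) and the signal is the intensity at each pixel.
image = np.array([[0.0, 0.5, 1.0],
                  [0.2, 0.7, 0.9],
                  [0.1, 0.4, 0.8]])

# Equivalently, each pixel is a point (x, y, intensity) in R^3 --
# the sense in which image data is "Euclidean".
points = [(x, y, image[y, x]) for y in range(3) for x in range(3)]
print(points[:3])  # [(0, 0, 0.0), (1, 0, 0.5), (2, 0, 1.0)]
```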

However some data does not map neatly into $\mathbb{R}^n$, for example, a social network modelled by a graph. You can of course embed the physical shape of a graph in 3-d space, but you will lose information such as the quality of edges, or the values associated with nodes, or the directionality of edges, and there isn't an obvious sensible way of mapping these attributes to higher dimensional Euclidean space. And depending on the specific embedding, you may introduce spurious correlations (e.g. two unconnected nodes appearing closer to each other in the embedding than to nodes they are connected to).
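
As a sketch of what gets lost, consider a toy social network (all names and values invented for illustration); none of the attributes below survive a naive "nodes as 3-d coordinates" embedding:

```python
# Node values, edge weights and edge directions are all part of the
# data, but a plot of node positions in R^3 captures none of them.
nodes = {"alice": {"age": 34}, "bob": {"age": 27}, "carol": {"age": 41}}
edges = {
    ("alice", "bob"):   {"weight": 0.9},  # strong tie, alice -> bob
    ("bob", "alice"):   {"weight": 0.2},  # weak tie in the other direction
    ("carol", "alice"): {"weight": 0.5},
}
# Note the asymmetry between ("alice", "bob") and ("bob", "alice"):
# a symmetric Euclidean distance between node positions cannot express it.
```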

Methods such as Graph Neural Networks seek to adapt existing Machine Learning technologies to directly process non-Euclidean structured data as input, so that this (possibly useful) information is not lost in transforming the data into a Euclidean input as required by existing techniques.
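
As a rough illustration of the idea (a generic mean-aggregation step, not any specific published architecture), one message-passing layer of such a network can be sketched as: each node mixes its own features with the average of its neighbours' features through learned weight matrices.

```python
import numpy as np

def message_passing_step(adj, feats, w_self, w_neigh):
    """One simplified message-passing step: combine each node's features
    with the mean of its neighbours' features, then apply a nonlinearity."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)  # guard isolated nodes
    neigh_mean = (adj @ feats) / deg                  # aggregate neighbours
    return np.tanh(feats @ w_self + neigh_mean @ w_neigh)

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=float)  # a 3-node undirected graph
feats = rng.normal(size=(3, 4))           # 4 features per node
w_self, w_neigh = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
print(message_passing_step(adj, feats, w_self, w_neigh).shape)  # (3, 4)
```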

---

Non-Euclidean geometry can generally be boiled down to the phrase

the shortest path between 2 points isn't necessarily a straight line.

Or, put in a way that lends itself very much to machine learning,

things that are similar to each other are not necessarily close if one uses Euclidean distance on the raw data as the metric (measured this way, "semantic distance" does not behave like a proper metric; in particular, the triangle inequality can fail).

You mention graphs and manifolds as being non-Euclidean, but, really, the majority of problems being worked on don't have Euclidean data. Take the images below, for example:

Clearly, two of the images are more similar to each other than either is to the third, but if we look at the pixels alone, the Euclidean distance between the pixel values doesn't reflect this similarity.

[Image: 2 good boys and a rad hamster]

If there were a function, $F(\text{image})$, that mapped images to a space where similar images produce values that are close together, we could better understand the data, infer statistics about the distributions, and make predictions on data we have yet to see. This is what classic techniques of image recognition have done, and it is also what modern machine learning does: take data and map it to a space where distance behaves like a proper metric, so that closeness genuinely reflects similarity.
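
A hedged sketch of that idea, with a stand-in embed function (a real $F$ would be a trained encoder; the pixel-statistics placeholder here is purely hypothetical):

```python
import numpy as np

def embed(image):
    """Stand-in for a learned F(image); a trained encoder would go here.
    This placeholder just pools crude pixel statistics."""
    return np.array([image.mean(), image.std()])

def distance(a, b):
    return np.linalg.norm(a - b)  # Euclidean distance

dog_a, dog_b, hamster = (np.random.rand(64, 64) for _ in range(3))

# Raw pixel distance: generally fails to reflect semantic similarity.
print(distance(dog_a.ravel(), dog_b.ravel()))

# Distance in the learned space: for a good F, the two dogs would land
# close together and far from the hamster.
print(distance(embed(dog_a), embed(dog_b)))
```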

Let's look at a more concrete example: some points I drew in MSPaint. On the left is a space we are interested in, where points belong to two classes (red or blue). Even though some points are close to each other, they may have different colours/classes. Ideally, we would have a function that maps these points into some space where we can draw a line separating the two classes. In general there would be many such lines, or hyperplanes in higher dimensions, but the goal is to transform the data so that it becomes "linearly separable" (a concrete sketch of such a transformation follows the figure below).

[Image: some points drawn in MSPaint]
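
A classic concrete instance of such a transformation, with a hand-picked feature map rather than a learned one: two concentric circles of points are not linearly separable in the plane, but lifting each point $(x, y)$ to $(x, y, x^2 + y^2)$ makes a separating plane trivial to find.

```python
import numpy as np

rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, size=100)
inner = np.c_[np.cos(angles[:50]), np.sin(angles[:50])]      # class "red"
outer = 3 * np.c_[np.cos(angles[50:]), np.sin(angles[50:])]  # class "blue"

def lift(points):
    """Feature map phi(x, y) = (x, y, x^2 + y^2)."""
    return np.c_[points, (points ** 2).sum(axis=1)]

# In the lifted space, the plane z = 5 separates the classes exactly
# (inner points have z = 1, outer points have z = 9).
print((lift(inner)[:, 2] < 5).all(), (lift(outer)[:, 2] > 5).all())  # True True
```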

To conclude, non-Euclidean data is everywhere.

---

It's hard to say, because a Euclidean space is defined with respect to a particular kind of metric, so without a clearer exposition of the nature of the data/problem, the phrase itself may or may not be well defined.

A metric $d: A \times A \rightarrow \mathbb{R}$ is a function that defines the distance between any two points of the space and satisfies the following axioms:

1. Identity of indiscernibles: two points have zero distance iff they are the same, $d(a,b) = 0 \Leftrightarrow a = b$.
2. Symmetry: $d(a,b) = d(b,a)$.
3. Triangle inequality: $d(a,b) + d(b,c) \geq d(a,c)$.
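
As a quick numerical sanity check (a sketch, not a proof), these axioms can be spot-checked for the Euclidean distance on random points:

```python
import numpy as np

d = lambda a, b: np.linalg.norm(a - b)  # Euclidean metric on R^n

rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 5))  # three random points in R^5

assert d(a, a) == 0                  # zero distance from a point to itself
assert np.isclose(d(a, b), d(b, a))  # symmetry
assert d(a, b) + d(b, c) >= d(a, c)  # triangle inequality
```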

A Euclidean metric is a metric that additionally obeys the Pythagorean theorem, or at least: the distance from the origin to a point $(x,y) \in \mathbb{R}^2$ is $\sqrt{x^2 + y^2}$. You will find that all Euclidean spaces are isomorphic to $\mathbb{R}^n$, meaning that the two notions are in some sense identical.

Any data whose underlying structure does not "naturally come" from $\mathbb{R}^n$, or a graph that does not admit a natural embedding in $\mathbb{R}^n$, might therefore not be Euclidean, since $\mathbb{R}^n$ is isomorphic to any Euclidean space.

---

Where does this type of data arise?

In the context of learning on non-Euclidean data, the term was probably first coined by Prof. Bronstein, in the paper Geometric deep learning: going beyond Euclidean data cited in another answer. More recently, Prof. Bronstein, together with other leading authors in the field, published a book on geometric deep learning, which in essence presents a unified mathematical framework for symmetry-based learning and shows how concepts from classical ML/DL extend to geometric domains (chapter 4).

Apparently, graphs and manifolds are non-Euclidean data. Why exactly is that the case?

According to the unified mathematical framework presented in Prof. Bronstein's book, the blueprint of geometric DL consists of an underlying domain $\Omega$, signals $\mathcal{X}(\Omega)$ defined on it, and functions $\mathcal{F}(\mathcal{X}(\Omega))$ acting on those signals.

The domain is the geometric/algebraic structure behind any instance of that data type (basically, it describes how the composing features are arranged). Associated with the domain is a symmetry group (all the transformations under which the domain is invariant).

The signal is the observable state/value/quantity/features at the points of the domain.

The functions are the blocks/layers that we want to build as functions of the signals on the domain. This is where we can inject inductive biases.

Translation equivariance is a property tied to the symmetry group of the image domain (translations being among its symmetries). Convolutions, as functions of a signal on the domain $\mathbb{R}^2$, are translation equivariant: if you move an object within an image, applying the same kernel to both images still finds the same features, just at different locations. This is the "inductive bias" (geometric prior): the function is aware of the domain, knows its properties, and exploits them.
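
A small numerical check of this claim (using a circular shift so the image boundary does not interfere):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))

conv = lambda x: convolve2d(x, kernel, mode="same", boundary="wrap")
translate = lambda x: np.roll(x, shift=2, axis=0)  # move everything down 2 px

# Equivariance: convolving the shifted image equals shifting the
# convolved image -- the kernel finds the same features, elsewhere.
print(np.allclose(conv(translate(image)), translate(conv(image))))  # True
```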

The underlying domain of an image is $\mathbb{R}^2$ (the positions of the pixels in the $(x, y)$ plane). In this context, one can say that the signal of an image is simply the colour/intensity of its pixels. As pointed out in another answer, this domain admits the Euclidean distance between two points in the plane, $d(x, y) = \sqrt{\sum_{i=0}^{1} (x_i - y_i)^2}$.

In broad terms, non-Euclidean data is data whose underlying domain does not admit the Euclidean distance as a metric between its points. For ease of visualization, think of $\mathbb{Z}^2$ instead of $\mathbb{R}^2$: a "grid" of integer-valued points separated by a distance of 1, where the distance between points of the domain is easy to "visualize".

Now let's move to graphs. What is the underlying domain of a graph? It is a set of nodes $\mathcal{V}$ and a set of edges $\mathcal{E}$. There is no such thing as distance in the Euclidean sense between the nodes of a graph. If you ask "what is the distance between nodes A and B?", the answer is a function of the connectivity of the graph (which is part of the domain), for example the shortest-path length, and it cannot be read off from coordinates, because the nodes have none.
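
A small sketch of that point: on a graph, "distance" typically means something like shortest-path length, computed from connectivity rather than from coordinates.

```python
from collections import deque

# Adjacency list of a small undirected graph: A - B - C - D.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

def graph_distance(graph, start, goal):
    """Shortest-path length via breadth-first search."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return float("inf")  # disconnected nodes: no finite distance

print(graph_distance(graph, "A", "D"))  # 3 -- a property of connectivity
```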

What would a dataset of non-Euclidean data look like?

Usually, a data point (sample) consists of a domain and a signal.

  • The domain can be, for example, a weighted adjacency matrix of the graph, which lists the "distances" between nodes for the edges that exist.
  • Features (the signal) can be per node, per edge, or both.

As a concrete example, the domain can be a road-network graph giving the distances and connectivity between road intersections, and the signal can be a per-node feature indicating how many cars are at each intersection at a given point in time; a sketch of such a data point follows.
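
A minimal sketch of one such data point, with invented numbers (weighted adjacency as the domain, per-node car counts as the signal):

```python
import numpy as np

# Domain: weighted adjacency matrix over 4 road intersections.
# Entry (i, j) is the road distance in km between intersections i and j;
# 0 means there is no direct road. All values are made up.
adjacency = np.array([[0.0, 1.2, 0.0, 2.5],
                      [1.2, 0.0, 0.8, 0.0],
                      [0.0, 0.8, 0.0, 1.1],
                      [2.5, 0.0, 1.1, 0.0]])

# Signal: number of cars at each intersection at one point in time.
cars = np.array([12, 3, 7, 9])

data_point = {"domain": adjacency, "signal": cars}
```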

Of course, the domain can also differ between data points (e.g., in the PROTEINS dataset, each protein is a graph whose nodes are amino acids and whose edges connect them, and each protein has a different structure).

A non-Euclidean dataset simply consists of multiple such data points, which may or may not share the same underlying domain (e.g., the same graph structure defined by an adjacency matrix).

---

As far as I understand, the concept of a non-Euclidean space does not impose an ordinality or hierarchy among the features, in contrast to data formed in a Euclidean space.

The difference between the two approaches is not remarkable for discriminative tasks like classification. For generative modelling, however, non-Euclidean techniques help define a latent manifold for the given data distribution. This can further help in traversing that manifold (to generate similar samples from the same underlying manifold), even with $n$ degrees of freedom in the latent space. This is not possible with Euclidean techniques: one cannot fully traverse the manifold, or generate samples on or off it, with only minimal changes in the Euclidean space. More precisely, one can, but the result will only appear as noisy data.
