
Note: I am not asking what the geometric formulation of machine learning is. I am looking for alternative formulations that are not reducible to machine learning on manifolds.

Machine learning is usually formulated as a geometric problem (e.g. [1], [2]). The data are represented as points in some manifold whose coordinates are features, and the task is to find another geometric object related to the data. This object can be a line going through the data in the case of regression, or a hyper-surface separating subsets of the data in the case of classification.
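To make the framing I mean concrete, here is a minimal sketch with made-up toy data (not tied to any reference above): regression as fitting a line through points, and classification as finding a separating hyperplane.

```python
import numpy as np

# Toy 1-D regression: points (x, y) in the plane, fit y = w*x + b by least squares.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
A = np.stack([x, np.ones_like(x)], axis=1)      # design matrix [x, 1]
w, b = np.linalg.lstsq(A, y, rcond=None)[0]     # the "line going through the data"
print(f"fitted line: y = {w:.2f} x + {b:.2f}")

# Toy 2-D classification: a hyperplane w.x + b = 0 separating two point clouds.
X = np.array([[0, 0], [0, 1], [2, 2], [2, 3]], dtype=float)
labels = np.array([-1, -1, 1, 1])
# Perceptron-style updates find *some* separating hyperplane for separable data.
wvec, bias = np.zeros(2), 0.0
for _ in range(100):
    for xi, yi in zip(X, labels):
        if yi * (wvec @ xi + bias) <= 0:
            wvec += yi * xi
            bias += yi
print("separating hyperplane normal:", wvec, "bias:", bias)
```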

I could not find a formulation of machine learning that does not reduce to such geometry.

To me, it looks like the geometric formulation is not fundamental. The space in which the data are represented is not unique: it depends on the feature set created for the data. For the same data, many (potentially infinitely many) different feature sets can be created, so the same machine learning problem can be represented in different spaces of different dimensions.

For example, I work in chemistry and materials science. There is a lot of machine learning here, mapping compounds to points in the so-called 'chemical space', where most predictions fail. I keep asking in seminars and conferences what this chemical space is: its structure, dimension, topology. There seems to be no definite answer except 'it depends on the descriptor' (a feature set, a vector of numbers corresponding to a compound, etc.; graphs are used as well).

There are thousands of papers about different descriptors, and even multi-volume books devoted solely to the descriptors one can choose from. According to some descriptors, two molecules are very similar, yet one is benign and the other is a poison. According to another descriptor, they are different. That is why I say 'not fundamental': any feature set says very little about the chemical compounds it is supposed to represent.

Another issue is that the geometric formulation requires a notion of distance. Different types of distances can be used, which gives different, even contradictory, predictions. It is often unknown whether distance and other geometric constructs make sense at all.

Is the distance between graphene and amorphous silicon less than or greater than the distance between boron nitride and amorphous carbon? Does the triangle inequality hold in that space? Which is greater: the distance between copper oxide and molybdenum dichalcogenide plus the distance between molybdenum dichalcogenide and silicon carbide, or the distance between copper oxide and silicon carbide? I think I can construct features of these substances, and distances between them, such that the answer is anything you want.
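As a toy illustration of what I mean (all numbers below are made up for the sake of argument and are not real descriptors of any material), two different descriptor choices for the same four compounds can reverse which pair is closer:

```python
import numpy as np

def dist(a, b):
    """Plain Euclidean distance between two descriptor vectors."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# Two hypothetical descriptor sets for the same four compounds A, B, C, D.
descriptors_1 = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [5.0, 0.0], "D": [9.0, 0.0]}
descriptors_2 = {"A": [0.0, 5.0], "B": [0.0, 0.0], "C": [0.0, 1.0], "D": [0.0, 1.5]}

for name, d in [("descriptor set 1", descriptors_1), ("descriptor set 2", descriptors_2)]:
    ab, cd = dist(d["A"], d["B"]), dist(d["C"], d["D"])
    closer = "A-B" if ab < cd else "C-D"
    print(f"{name}: dist(A,B)={ab:.1f}, dist(C,D)={cd:.1f} -> closer pair: {closer}")
# Descriptor set 1 says A and B are the closer pair; descriptor set 2 says C and D are.
```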

One way to address this in chemistry is to keep searching for a better feature set. But then there are materials like VO2 and TiO2 whose electron densities are almost indistinguishable, so any set of features would map them to points next to each other, yet one of them is a conductor while the other is an insulator.

So, my question is: instead of searching for a better feature set that maps data to points in space and designing a distance function that measures similarity between data, could I use some other machine learning paradigm, one that does not reduce a machine learning problem to a geometric problem?

Is mapping data to geometric objects and making inferences based on their similarity the only possible paradigm in machine learning? Are there formulations of machine learning alternative to the geometric one?

1 Answer


Firstly, you claim that the geometric formulation of machine learning (ML) is not fundamental and that the same ML problem can be represented in different spaces of different dimensions. However, the essence of contemporary deep learning is built from two simple principles: feature learning and local gradient descent. Many high-dimensional learning tasks previously thought to be beyond reach, such as computer vision, playing Go, or protein folding, turn out to be feasible once learned features capture the appropriate notion of regularity for each task, such as semantic similarity (the pretrained token embeddings of GPT and of large BERT variants have different dimensions and representations, yet both capture semantic similarity) and group symmetry (the translational and rotational equivariance and invariance of CNNs).

Similarly for distance as a geometric feature: each ML task will encode an inductive bias for it if it is genuinely relevant given the domain knowledge. So although features can be represented in many seemingly different ways, for most real-world ML problems the predicted or generated data exhibit many intrinsic regularities that ensure stability and applicability.

Besides the geometric formulation, ML has many other formulations, such as the probabilistic (Bayesian) and PAC formulations, where sample data are drawn from some unknown probability distribution and ML becomes the task of inferring or approximating distributions rather than finding geometric shapes (e.g. normalizing flows), or the task of updating a prior (latent) distribution to a posterior using Bayes' theorem (e.g. VAEs).
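As a minimal sketch of this probabilistic view (a toy conjugate Beta-Binomial model, not the machinery of a VAE or a normalizing flow), learning here is nothing more than turning a prior distribution over a parameter into a posterior; no distance between data points is involved:

```python
def beta_binomial_posterior(alpha, beta, heads, tails):
    """Conjugate Bayesian update: a Beta(alpha, beta) prior combined with a
    Binomial likelihood gives a Beta(alpha + heads, beta + tails) posterior."""
    return alpha + heads, beta + tails

# Uniform prior Beta(1, 1) over the unknown bias theta of a coin,
# then observe 7 heads and 3 tails (toy data).
a_post, b_post = beta_binomial_posterior(1.0, 1.0, heads=7, tails=3)
posterior_mean = a_post / (a_post + b_post)
print(f"posterior: Beta({a_post}, {b_post}), posterior mean of theta = {posterior_mean:.2f}")
```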

For reinforcement learning problems, the sequential decision-theoretic formulation is standard, and in fact classification and regression problems can be viewed as one-shot decision problems as well, not just as geometric lines or hyper-surfaces going through or separating data, a view that is hardly applicable to cost-sensitive losses with imbalanced data, for instance. Traditional knowledge-base learning problems are usually formulated in purely symbolic rules and logic, such as Inductive Logic Programming (ILP), where symbolic rules are learned from samples. Finally, there is the information-theoretic Minimum Description Length (MDL) formulation of some ML problems, where the goal is to find the model that compresses the observed data most effectively (a sketch follows below).
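As a rough sketch of the MDL formulation (toy data and a crude BIC-like code-length approximation, used only for relative comparison and meant only to show the idea of learning as compression), model selection becomes picking the polynomial degree with the shortest two-part description length:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
# Toy data generated from a degree-2 polynomial plus noise.
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.1, size=x.size)

def description_length(x, y, degree):
    """Two-part code length: bits for the residuals given the model
    plus bits for the model's parameters (a BIC-like approximation)."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    n, k = len(y), degree + 1
    data_bits = 0.5 * n * np.log2(np.mean(residuals**2) + 1e-12)
    model_bits = 0.5 * k * np.log2(n)
    return data_bits + model_bits

for d in range(6):
    print(f"degree {d}: description length = {description_length(x, y, d):.1f} bits")
# The minimum is typically reached at degree 2, the model that compresses the data best.
```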

cinch