3

I'm not very knowledgeable in this field but I'm wondering if any research or information already exists for the following situation:

I have some data that may or may not look similar to each other. Each represents a node that represents a vector of size 128.

similar vectors

And they are inserted into a tree graph according to similarity.

better graph

and a new node is created with an edge connecting the most similar vertex node found in the entire tree graph.

Except I'm wasting a lot of time searching through the entire graph to insert each new node when I could narrow down my search according to previous information. Imagine a trunk node saying "Oh, I saw a node like you before, it went down this branch. And don't bother going down that other branch." I could reduce the cost of searching the entire tree structure if there was a clever way to remember if a similar node went down a certain path.

I've thought about some ways to use caching or creating a look-up table but these are very memory intensive methods and will become slower the longer the program runs on. I have some other ideas I am playing around with but I was hoping someone could point me in the right direction before I started trying out weird ideas.

Edit: added a better (more realistic) graph picture

JungleBird
  • 31
  • 2

2 Answers2

1

I am not sure if this could be a solution, but I had an idea that I thought it might be helpful for your case.

You might be able to use a balanced binary search trees like AVL trees or Red Black trees. It is a rooted tree. It can reduce your insertion time by O(lgn), where n = number of nodes.

The challenge is that you need a well ordered sorting order, that is you need a way to compare any two vectors. For example given $a \in R^{128}$ and $b \in R^{128}$, you need a way to check if $a < b$ or not.

One of the ways to handle this is to normalize each vector so that each vector has unit magnitude, $\hat{a} = \frac{a}{||a||}$, and then compare the dot product with one standard vector, or it can be x-axis. (First element 1 and all other elements are 0).

But then, this method will not work if the way you are measuring distance between two nodes are different (which I assume it is). Maybe you can take motivation from this answer and modify your distance metric in such a way such that BBST can be used.

UPDATE: I realized that binary trees can only have 2 connected nodes. It might not be useful for your use case.

0

It sounds like you may be looking for the A* search algorithm. It is a search algorithm, like DFS and BFS, but it will explore only the most promising branches based on a heuristic function you supply. The difficult part of implementing this is deciding on a low-cost, admissible heuristic.

Excited to reflect nbro's suggestion from comments.