I'm building a friend finder app and I need to match people based on a paragraph of text. Here is an example of what I mean:

Person A: "I love walking and going to the beach, I also love reading and keeping active. I'm very allergic to dogs, so I don't have any pets and have no intention of having any. I used to swim in college as a D1 athlete. My favourite movie is Finding Nemo, I hate horror films..."

Person B: "I'm a dog lover with 3 Labradors. I'm an extremely active person who loves to swim many days out of the week. I love going to the theatre and watching movies in IMAX. My job is a civil engineer, but I like to code games in my spare time..."

Match: 56%

What is the best way to calculate a score that accurately matches the similarity of interests/hobbies etc. between the two individuals?

I have looked at OpenAI embeddings API and storing them in a Pinecone database for retrieval. I have also looked at creating a 'Profile' using some kind of classification model that would give each user a score for various traits such as {active: 0.3, social: 0.9, ...} and then comparing results.

Many thanks for any help.

Dom

1 Answer

There are several ways to do this. The most straightforward would be to encode the two paragraphs as vectors (also called text embeddings) using a pretrained language model.

The idea is that the vectors representing the paragraphs encode the "semantic content" or the underlying "meaning" of the text, and the encoder model was trained such that a cheap similarity measure like dot product or cosine similarity between two vectors reflects how semantically close the texts are. The sentence-transformers library has several implementations and tutorials for this.
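
As a rough sketch of that comparison step (not a definitive implementation): the helper names below are made up for illustration, the model name "all-MiniLM-L6-v2" is just one commonly used sentence-transformers checkpoint and is an assumption, and clamping the cosine value to a 0-100 "match" percentage is my own choice rather than a standard.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction; treat as "no similarity"
    return dot / (norm_u * norm_v)


def match_percent(u, v):
    """Assumed mapping: clamp negative similarity to 0, then scale to 0-100."""
    return max(0.0, cosine_similarity(u, v)) * 100


def embed(texts, model_name="all-MiniLM-L6-v2"):
    """Encode texts with sentence-transformers (downloads the model on first use)."""
    # Imported here so the helpers above work without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts)  # one vector per input text
```

You would call something like `vec_a, vec_b = embed([person_a_text, person_b_text])` once per profile, store the vectors (for example in Pinecone, as you mention), and then `match_percent(vec_a, vec_b)` is cheap to compute for any pair.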

One caveat is that the type of similarity captured by these models may not match your specific use case, since they may have been trained on a different distribution of data (they are often trained for information retrieval). In that case you may want to fine-tune the model, or choose a model that was trained on a task closer to your target use case.

Another similar method is to use cross-encoders. Instead of encoding each paragraph separately and then comparing them with a cheap metric, cross-encoders are models that take both paragraphs as input and output a similarity score. This may give you better accuracy (again, depending on how it was trained), but it comes at the cost of speed, since you need to do an expensive forward pass for each pair of paragraphs and cannot precompute anything per user.
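
To make the cost concrete, here is a minimal sketch, assuming profiles are held in a dict of user id to text. The function names are hypothetical, and "cross-encoder/stsb-roberta-base" is only an example checkpoint; whether its notion of similarity suits friend matching is something you would have to check.

```python
from itertools import combinations


def score_all_pairs(profiles, score_pair):
    """Score every unordered pair of users; score_pair runs once per pair."""
    return {
        (a, b): score_pair(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


def make_cross_encoder_scorer(model_name="cross-encoder/stsb-roberta-base"):
    """Build a score_pair function backed by a sentence-transformers CrossEncoder."""
    # Imported here so score_all_pairs can be used (and tested) without the library.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_name)

    def score_pair(text_a, text_b):
        # One full forward pass over both paragraphs together.
        return float(model.predict([(text_a, text_b)])[0])

    return score_pair
```

With n users this makes n * (n - 1) / 2 model calls, whereas the embedding approach needs only n encoder calls plus cheap vector comparisons.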

I would also consider a more interpretable method of first extracting "properties" from each description, then comparing each of them separately. For example, it'd be reasonable to assume that most personal descriptions have hobbies, favorite movies, etc. You can try extracting spans of text corresponding to each of these properties separately, then use one of the methods described previously (or even just basic token-matching) for comparisons.
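
A minimal sketch of the comparison half of that idea, using basic token matching (Jaccard overlap). It assumes the properties have already been extracted into a dict per person; the extraction itself is the harder part and is not shown. The property names and helper functions are hypothetical.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(text_a, text_b):
    """Shared tokens divided by total distinct tokens, in [0, 1]."""
    ta, tb = tokens(text_a), tokens(text_b)
    if not ta and not tb:
        return 0.0  # nothing to compare
    return len(ta & tb) / len(ta | tb)


def compare_properties(props_a, props_b):
    """Per-property overlap for the properties both people have."""
    shared = props_a.keys() & props_b.keys()
    return {name: jaccard(props_a[name], props_b[name]) for name in sorted(shared)}


# Hand-written stand-ins for extracted properties (assumed, not model output).
person_a = {"hobbies": "walking beach reading swimming", "movies": "Finding Nemo"}
person_b = {"hobbies": "swimming coding games", "movies": "IMAX theatre"}
per_property = compare_properties(person_a, person_b)
```

Because each property gets its own score, you can show users why they matched, and you can swap `jaccard` for an embedding-based comparison on a per-property basis.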

Alexander Wan