
I have a database of books. Each book has a list of categories that describe the genre/topics of the book (I use Python models).

Most of the time, the categories in the list consist of 1-3 words.

Examples of book category lists:

['Children', 'Flour mills', 'Jealousy', 'Nannies', 'Child labor', 'Conduct of life'],
["Children's stories", 'Christian life'],
['Children', 'Brothers and sisters', 'Conduct of life', 'Cheerfulness', 'Christian life'],
['Fugitive slaves', 'African Americans', 'Slavery', 'Plantation life', 'Slaves', 'Christian life', 'Cruelty']

I want to create/use an algorithm that compares two books and measures their similarity, using NLP/machine learning models.

The categories are not well defined and tend to vary. For example, there can be a category called 'story' and another called 'stories', since the system does not store a fixed set of categories but uses an open text box.

So far I tried 2 algorithms:

  • Cosine similarity with WordNet - split each category into a bag of words and check whether each word has a synonym in the other book's category list.
  • Similarity using the NLP model of the spaCy library (Python) - its built-in vector distance/similarity measure. (A rough, simplified sketch of both approaches is shown below.)
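
Roughly, the kind of code I mean looks like this (a simplified sketch for illustration only, not my exact code; the spaCy model name and the example categories are assumptions):

import spacy
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet") once

nlp = spacy.load("en_core_web_md")  # a spaCy model that ships with word vectors

def wordnet_synonyms(word):
    # Collect all WordNet lemma names (synonyms) for a single word.
    return {lemma.name().lower() for synset in wn.synsets(word) for lemma in synset.lemmas()}

def share_synonym(category_a, category_b):
    # Bag-of-words check: does any word of one category have a synonym
    # (or the word itself) appearing in the other category?
    words_a = category_a.lower().split()
    words_b = set(category_b.lower().split())
    for word in words_a:
        if (wordnet_synonyms(word) | {word}) & words_b:
            return True
    return False

def spacy_similarity(category_a, category_b):
    # spaCy similarity between two category strings (averaged word vectors).
    return nlp(category_a).similarity(nlp(category_b))

print(share_synonym("Children's stories", "Christian life"))
print(spacy_similarity("Conduct of life", "Christian life"))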

So far I have used the WordNet model from the nltk package and spaCy's built-in model, but I had problems with both algorithms: when a compared category contains 2 or 3 words, the results were not accurate, and each approach had its own specific problems.

Which algorithms and models (in Python) can I use to compare books in a way that handles category strings of 2 or 3 words?

By the way, this is the first time I am asking here. If you need more details about the database or what I have done so far, please tell me.


1 Answer


You can use a model that creates rich embeddings, for example sentence transformers, and then use cosine similarity from sklearn with a threshold (for example at least 0.6) to create clusters of semantically close documents.
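
A minimal sketch of what this could look like (the model name "all-MiniLM-L6-v2", the 0.6 threshold, and the book-level score are just example choices, not the only way to do it):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice

book_a = ['Children', 'Flour mills', 'Jealousy', 'Nannies', 'Child labor', 'Conduct of life']
book_b = ['Children', 'Brothers and sisters', 'Conduct of life', 'Cheerfulness', 'Christian life']

# One embedding per category string; multi-word categories are encoded as a whole.
emb_a = model.encode(book_a)
emb_b = model.encode(book_b)

# Pairwise cosine similarity between every category of book A and every category of book B.
sim = cosine_similarity(emb_a, emb_b)

# Treat category pairs above a threshold as "the same" category.
threshold = 0.6
matching_pairs = (sim >= threshold).sum()

# One possible book-level score: for each category of A, take its best match in B, then average.
book_similarity = sim.max(axis=1).mean()
print(matching_pairs, book_similarity)

From the pairwise similarity matrix you can also build clusters of near-duplicate categories, for example with scikit-learn's AgglomerativeClustering on a precomputed distance matrix (1 - similarity), which helps with variants like 'story' vs. 'stories'.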
