
I have a database of books. Each book has a list of categories that describe the genre/topics of the book (I use Python models).

Most of the time, the categories in the list consist of 1-3 words.

Examples of book category lists:

['Children', 'Flour mills', 'Jealousy', 'Nannies', 'Child labor', 'Conduct of life'],
["Children's stories", 'Christian life'],
['Children', 'Brothers and sisters', 'Conduct of life', 'Cheerfulness', 'Christian life'],
['Fugitive slaves', 'African Americans', 'Slavery', 'Plantation life', 'Slaves', 'Christian life', 'Cruelty']

I want to create/use an algorithm that compares two books and measures their similarity, using NLP/machine learning models.

The categories are not well defined and tend to vary. For example, there can be a category called 'story' and another called 'stories', since the system does not store a fixed set of categories but uses an open text box.

So far I tried 2 algorithms:

  • Cosine similarity with WordNet - split each category into a bag of words and check whether each word has a synonym in the other book's category list.
  • Similarity using the NLP model of the spaCy library (Python) - its built-in vector distance/similarity measure. (A rough, simplified sketch of both approaches is shown below.)
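
Roughly, the kind of code I mean looks like this (a simplified sketch for illustration only, not my exact code; the spaCy model name and the example categories are assumptions):

import spacy
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet") once

nlp = spacy.load("en_core_web_md")  # a spaCy model that ships with word vectors

def wordnet_synonyms(word):
    # Collect all WordNet lemma names (synonyms) for a single word.
    return {lemma.name().lower() for synset in wn.synsets(word) for lemma in synset.lemmas()}

def share_synonym(category_a, category_b):
    # Bag-of-words check: does any word of one category have a synonym
    # (or the word itself) appearing in the other category?
    words_a = category_a.lower().split()
    words_b = set(category_b.lower().split())
    for word in words_a:
        if (wordnet_synonyms(word) | {word}) & words_b:
            return True
    return False

def spacy_similarity(category_a, category_b):
    # spaCy similarity between two category strings (averaged word vectors).
    return nlp(category_a).similarity(nlp(category_b))

print(share_synonym("Children's stories", "Christian life"))
print(spacy_similarity("Conduct of life", "Christian life"))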

So far I have used the WordNet model from the nltk package and spaCy's built-in model, but I had problems with both algorithms: when a compared category contains 2 or 3 words, the results were not accurate, and each approach had its own specific problems.

Which algorithms and models (in Python) can I use to compare books in a way that handles category strings of 2 or 3 words?

By the way, this is the first time I am asking here. If you need more details about the database or what I have done so far, please tell me.


1 Answer


You can use a model that creates rich embeddings, for example sentence transformers, and then use cosine similarity from sklearn with a threshold (for example at least 0.6) to create clusters of semantically close documents.
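
A minimal sketch of what this could look like (the model name "all-MiniLM-L6-v2", the 0.6 threshold, and the book-level score are just example choices, not the only way to do it):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice

book_a = ['Children', 'Flour mills', 'Jealousy', 'Nannies', 'Child labor', 'Conduct of life']
book_b = ['Children', 'Brothers and sisters', 'Conduct of life', 'Cheerfulness', 'Christian life']

# One embedding per category string; multi-word categories are encoded as a whole.
emb_a = model.encode(book_a)
emb_b = model.encode(book_b)

# Pairwise cosine similarity between every category of book A and every category of book B.
sim = cosine_similarity(emb_a, emb_b)

# Treat category pairs above a threshold as "the same" category.
threshold = 0.6
matching_pairs = (sim >= threshold).sum()

# One possible book-level score: for each category of A, take its best match in B, then average.
book_similarity = sim.max(axis=1).mean()
print(matching_pairs, book_similarity)

From the pairwise similarity matrix you can also build clusters of near-duplicate categories, for example with scikit-learn's AgglomerativeClustering on a precomputed distance matrix (1 - similarity), which helps with variants like 'story' vs. 'stories'.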
