
This is a conceptual gap that I have concerning embeddings, and would really appreciate some help closing it.

I understand that when you embed a corpus for, say, a question-and-answer task, you can then encode your query and find the documents in your corpus that are nearest to it in the embedding space (using something like cosine distance), retrieve an answer, and cite the matching corpus document as the source.

This, to me, is a "one-dimensional" embedding task: you chunk your corpus documents into lists of sentences and embed each chunk. The resulting list of embeddings is an (NxM) matrix, where N is the number of chunks you feed in and M is the dimensionality of the model's embedding vectors.
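
Concretely, the basic case I have in mind looks something like this (a minimal sketch; the model name is just one example of a sentence encoder, and the chunks and query are made-up placeholders):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # example encoder, M = 384

    chunks = [
        "Tesla designs and manufactures electric vehicles.",
        "Apple sells consumer electronics and services.",
    ]

    # (N x M) matrix: one row per chunk, M = the model's embedding dimension.
    doc_emb = model.encode(chunks)

    # Embed the query into the same space.
    query_emb = model.encode(["Who makes electric cars?"])[0]

    # Cosine similarity = dot product of L2-normalized vectors.
    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    scores = normalize(doc_emb) @ normalize(query_emb)
    print(chunks[int(np.argmax(scores))])  # nearest chunk = retrieved source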


However, for my project, I am trying to embed multiple text fields about public companies so I can later search and compare them: a long description (1-2 paragraphs), their industry (1-2 words), and their sector (1-2 words).

I would think this is a "3-dimensional" embedding task, because we have three separate text fields to embed. So for TSLA, we'd have an (NxMx3) tensor with embeddings for each of the three fields? How would semantic search work in that case?

When performing the semantic search, should I just search over all three groups of embeddings? If so, what might that look like? Or should I somehow concatenate the embeddings (a kind of "flattening")?
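
To make the question concrete, here's a rough sketch of what I'm picturing, with both options side by side; the model name, field values, and score weights are all made-up placeholders:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # example encoder

    companies = [
        {"ticker": "TSLA",
         "description": "Designs, manufactures, and sells electric vehicles.",
         "industry": "Automobile Manufacturers",
         "sector": "Consumer Discretionary"},
        {"ticker": "AAPL",
         "description": "Designs smartphones, personal computers, and services.",
         "industry": "Consumer Electronics",
         "sector": "Information Technology"},
    ]

    # One (N x M) matrix per text field.
    desc_emb = model.encode([c["description"] for c in companies])
    ind_emb = model.encode([c["industry"] for c in companies])
    sec_emb = model.encode([c["sector"] for c in companies])

    # The (N x M x 3) tensor from my question, stacking the three fields.
    tensor = np.stack([desc_emb, ind_emb, sec_emb], axis=-1)
    print(tensor.shape)  # e.g. (2, 384, 3)

    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    query = normalize(model.encode(["electric car maker"])[0])

    # Option A: search each field's embeddings separately and combine the
    # per-field cosine scores (these weights are arbitrary placeholders).
    scores_a = (0.6 * (normalize(desc_emb) @ query)
                + 0.3 * (normalize(ind_emb) @ query)
                + 0.1 * (normalize(sec_emb) @ query))

    # Option B: concatenate ("flatten") the field vectors into one (N x 3M)
    # matrix; the query then has to be tiled to length 3M to live in the
    # same space, which is part of what feels odd to me.
    flat = np.concatenate([desc_emb, ind_emb, sec_emb], axis=1)
    scores_b = normalize(flat) @ normalize(np.tile(query, 3))

    print(companies[int(np.argmax(scores_a))]["ticker"])
    print(companies[int(np.argmax(scores_b))]["ticker"])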

Thank you in advance for any help.

Ian Murray