9

BERT encodes a piece of text such that each token (usually a word or word piece) in the input maps to its own vector in the output. This means the length of the encoding varies with the length of the input text, which makes it cumbersome to use as input to downstream neural networks that expect fixed-size inputs.
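
To make the shape issue concrete, here is a minimal sketch using the Hugging Face transformers library (my choice for illustration; the question does not name a specific implementation): the sequence dimension of BERT's output grows with the number of input tokens.

from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

for text in ["A short sentence.",
             "A considerably longer sentence that gets split into many more word pieces."]:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state has shape [batch, num_tokens, hidden_size];
    # the num_tokens dimension depends on the input length.
    print(text, "->", tuple(outputs.last_hidden_state.shape))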

Are there any transformer-based neural network architectures that can encode a piece of text into a fixed-size feature vector more suitable for downstream tasks?

Edit: To illustrate my question: I’m wondering whether there is some framework whose input can be a sentence, a paragraph, an article, or a book, and which produces an output encoding in the same fixed-size format for all of them.

HelloGoodbye

2 Answers

0

One way to do this is with SentenceTransformers.

SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. The initial work is described in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.

To install it via pip:

pip install -U sentence-transformers

To generate sentence embeddings:

from sentence_transformers import SentenceTransformer

# We use the "paraphrase-MiniLM-L6-v2" model here; the SentenceTransformers
# documentation lists the other available pretrained models.
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# The sentences we want to encode
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of strings.',
             'The quick brown fox jumps over the lazy dog.']

# Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

# Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Keep in mind, though, that SentenceTransformers models have an input limit as well, usually 512 tokens or fewer; anything beyond the model's maximum sequence length is truncated, so for much longer texts this might not be a suitable method.
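
As a rough illustration of that limit (a sketch, assuming the same 'paraphrase-MiniLM-L6-v2' model as above), you can inspect the model's max_seq_length attribute; inputs longer than this are silently truncated, so the fixed-size vector only reflects the beginning of the text.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Inputs longer than this many word pieces are truncated before encoding.
print("max_seq_length:", model.max_seq_length)

long_text = " ".join(["word"] * 10000)  # far longer than the limit
embedding = model.encode(long_text)

# The output is still a single fixed-size vector, but only the first
# max_seq_length tokens of long_text contributed to it.
print("embedding shape:", embedding.shape)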

0

Building transformer models for retrieval over arbitrarily long contexts is difficult in general, because the architecture and training simply aren't designed to handle this. It is even more of an issue if you want to use the model to embed whole documents, as you typically need some specialized training to get meaningful embeddings.
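
To illustrate what "embedding a document" looks like without such specialized training, here is a sketch (my own illustration, not part of this answer) of the common mean-pooling baseline over vanilla BERT token vectors: it does produce a fixed-size vector, but without contrastive or siamese fine-tuning (as in Sentence-BERT) such vectors tend to be poor for similarity and retrieval tasks.

from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def mean_pool_embed(text: str) -> torch.Tensor:
    # Encode the text and average the token vectors, ignoring padding positions.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state    # [1, num_tokens, hidden]
    mask = inputs["attention_mask"].unsqueeze(-1)     # [1, num_tokens, 1]
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1)
    return (summed / counts).squeeze(0)               # fixed-size [hidden]

print(mean_pool_embed("An arbitrary paragraph of text.").shape)  # torch.Size([768])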

There has, however, been work on longer-context embedding models more generally: e.g., M2-BERT, with a maximum context of 32k tokens, i.e., it can embed documents of up to 32k tokens. M2-BERT, however, isn't a transformer. Compared with the other models in their paper, transformer-based embedding models only seem to reach a maximum context length of 8k tokens.

Alexander Wan