
A conversation through the OpenAI API looks something like this:

    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
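
Passing that list to the chat completions endpoint would look roughly like this (a minimal sketch using the pre-1.0 openai Python package; the model name is just an example):

    import openai

    # The model only sees what is in `messages`, so history, context and the
    # new question all have to fit into this one list.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # example model name
        messages=messages,
    )
    print(response["choices"][0]["message"]["content"])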

When I query a vector database, I get back related content ordered by similarity to the question.

Let's say I get back 10 chunks of around 500 characters each.
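
As a rough sketch of that step (the vector_store client and its query method here are hypothetical placeholders, not any specific library's API):

    # Hypothetical vector-store client; method and parameter names are placeholders.
    results = vector_store.query(embed(question), top_k=10)
    chunks = [r.text for r in results]   # ~500 characters per chunk
    context_str = "\n".join(chunks)      # roughly 5,000 characters of context in total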

Some frameworks (e.g. llama-index) set the context in the "system" role.

So that would look something like this:

    messages=[
        {"role": "system", "content": (
            "You are a helpful assistant. Context information is below.\n"
            "--------------------\n"
            "{context_str}\n"
            "--------------------\n"
        )},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]

where {context_str} will be replaced by the chunks retrieved from the similarity search.

Now, for the history, I can keep adding "user"/"assistant" role pairs, say up to a maximum of 10.
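
A simple way to enforce that cap (a sketch; system_prompt stands in for the context-bearing system message above):

    MAX_PAIRS = 10
    history = []  # user/assistant messages only, oldest first

    def add_turn(history, user_msg, assistant_msg):
        # Append the newest pair, then keep only the last MAX_PAIRS pairs.
        history.append({"role": "user", "content": user_msg})
        history.append({"role": "assistant", "content": assistant_msg})
        return history[-MAX_PAIRS * 2:]

    messages = [{"role": "system", "content": system_prompt}] + history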

What's a good strategy for making sure I don't overflow the data limit I can send to the LLM?

For example, I have GPT4All 7B with a limit of 2,000 tokens, while another model has a limit of 32K tokens.

How do I calculate how much to use?
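
For a rough sense of scale, 10 chunks of ~500 characters is about 5,000 characters, or very roughly 1,250 tokens at ~4 characters per token, which already eats most of a 2,000-token budget. One way to measure the actual payload is to count tokens with a tokenizer such as tiktoken and drop the oldest history pairs until the request fits, keeping some room for the reply (a sketch; tiktoken matches OpenAI models, so for GPT4All the count is only an approximation):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # approximation for non-OpenAI models

    def num_tokens(messages):
        # Rough count: tokens in each message's content plus a small
        # per-message overhead for role/formatting.
        return sum(len(enc.encode(m["content"])) + 4 for m in messages)

    CONTEXT_LIMIT = 2000      # e.g. the 2K model; 32000 for the larger one
    RESERVED_FOR_REPLY = 500  # leave head-room for the model's answer

    # Drop the oldest user/assistant pair (index 1 and 2, after the system
    # message) until the request fits the budget.
    while num_tokens(messages) > CONTEXT_LIMIT - RESERVED_FOR_REPLY and len(messages) > 2:
        del messages[1:3]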
