A conversation through the OpenAI API looks something like this:
messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"}
]
When I query a vector database, I get back related content ordered by similarity to the question.
Let's say I get back 10 chunks of around 500 characters each.
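For reference, the retrieval step looks roughly like this (I'm using Chroma purely as an illustration; the collection name and query are made up, and any vector store with a top-k query works the same way):

import chromadb

client = chromadb.Client()
collection = client.get_collection("my_docs")  # hypothetical collection name

# Top-10 most similar chunks, best match first
results = collection.query(query_texts=["Where was it played?"], n_results=10)
chunks = results["documents"][0]  # ~500-character strings

# Joined into a single block for the prompt
context_str = "\n\n".join(chunks)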
Some frameworks (e.g. llama-index) set the context in the "system" role.
So that would look something like this:
messages=[
    {"role": "system", "content": "You are a helpful assistant. "
        "Context information is below.\n"
        "--------------------\n"
        "{context_str}\n"
        "--------------------\n"},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"}
]
Here {context_str} will be replaced by the chunks retrieved from the similarity search.
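In code, the substitution I have in mind is just string formatting (SYSTEM_TEMPLATE is my own name for the template above, and context_str is the joined chunks from the retrieval step):

SYSTEM_TEMPLATE = (
    "You are a helpful assistant. Context information is below.\n"
    "--------------------\n"
    "{context_str}\n"
    "--------------------\n"
)

system_message = {"role": "system", "content": SYSTEM_TEMPLATE.format(context_str=context_str)}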
Now for the history, I can keep appending "user"/"assistant" message pairs, say up to a maximum of 10.
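Something like this (history and question are placeholder names for the running chat list and the new user input):

MAX_PAIRS = 10

def trim_history(history, max_pairs=MAX_PAIRS):
    # Keep only the most recent user/assistant pairs
    return history[-2 * max_pairs:]

messages = [system_message] + trim_history(history) + [{"role": "user", "content": question}]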
What's a good strategy for making sure I don't overflow the context window of the LLM?
For example, I have GPT4All 7B with a limit of 2,000 tokens, while another model has a limit of 32K tokens.
How do I calculate how much context and history I can include for a given model?
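The only concrete idea I have so far is to count tokens with tiktoken and drop the oldest history until the prompt fits under the model's limit minus some reserve for the answer. A rough sketch (cl100k_base is the encoding for recent OpenAI models; a GPT4All model would need its own tokenizer, and the 500-token reserve is an arbitrary guess):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(messages):
    # Rough count: ignores the few extra tokens of per-message overhead
    return sum(len(enc.encode(m["content"])) for m in messages)

def fit_to_budget(system_message, history, question, context_limit, reserve_for_answer=500):
    budget = context_limit - reserve_for_answer
    messages = [system_message] + history + [{"role": "user", "content": question}]
    # Drop the oldest user/assistant pair until the prompt fits
    while count_tokens(messages) > budget and len(history) >= 2:
        history = history[2:]
        messages = [system_message] + history + [{"role": "user", "content": question}]
    return messages

Is this a reasonable approach, or is there a better way to split the budget between retrieved context and chat history?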