I know of some benchmarks that LLMs are evaluated on, but I am no expert whatsoever. I think what I am wondering about is closest to TruthfulQA. The question came up when I heard about combining company data with LLMs by providing the internal, sensitive data in prompts à la RAG (of course you do not want to fine-tune the model every time, because the data may change daily or even hourly; instead you supply the data in the prompt).
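To be concrete, this is roughly the kind of prompt assembly I have in mind; a minimal sketch, where the function name, the instructions, and the example documents are all just made up for illustration:

```python
def build_rag_prompt(question: str, documents: list[str]) -> str:
    """Assemble a prompt that supplies internal data in-context
    and asks the model to answer only from that context."""
    # Label each retrieved document so the model (and a human reader)
    # can tell where each piece of context came from.
    context = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(documents)
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical internal data, refreshed on every request instead of
# being baked into the model via fine-tuning.
prompt = build_rag_prompt(
    "What was the Q3 revenue?",
    ["Q3 revenue was 4.2M EUR.", "Headcount grew to 120 in Q3."],
)
print(prompt)
```

My worry is precisely whether the model sticks to what is inside that `Context:` block.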
So the question is: how sure can one be about the correctness of the model's outputs on these kinds of prompts? I do not want it to hallucinate here at all; that would undermine the whole approach of using LLMs in this context. Are there benchmarks made specifically for this task (i.e., providing lots of information in the context and then asking about things that appear in that context)?
Thanks a lot!