I am trying to find out how I can teach the content of a whole book, several hundred pages long, to an LLM so that it "knows" all the details and can be queried, produce summaries, and so on. The book is one consistent story, is private, and has never been published. I thought training an LLM on a long book would be a common use case, but I found surprisingly little information about it.

Most use cases these days involving one's own content seem to be along the lines of "chat with your documents". But that seems much easier, given context-length limits and the lack of coherence between separate documents.
I am not an ML expert, but I know the basics of embeddings and fine-tuning. Is either of these approaches better suited? How could a raw book be turned into a proper training data set for fine-tuning a model? It could not be done manually, as the book is almost a million words long. Or could it work "simply" by splitting the text into chunks and embedding them?

dschuld

2 Answers

Using embeddings is an effective approach when you have a limited amount of data, such as a single book, and want to retrieve relevant context and related text for querying.

There is an example of this approach here: https://jameshwade.com/posts/2023-03-10_vectorstores.html.

The idea behind using embeddings is to represent chunks of the document as vectors that capture their semantic relationships, then use those vectors to retrieve the passages most relevant to a query so the model can give accurate responses.

Another option for improving model performance is fine-tuning, but it typically requires a larger amount of data, specifically examples of the desired prompts and corresponding answers. So if you don't have a lot of good prompt/response data, embeddings would be the way I'd suggest. A minimal sketch of the pipeline follows.
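
Here is a rough sketch of the chunk-embed-retrieve approach, assuming the OpenAI Python client and a plain-text copy of the book. The file name, chunk sizes, and model names are all illustrative; any embedding model would work the same way:

```python
# A sketch of the chunk-embed-retrieve pipeline (assumes the OpenAI Python
# client; file name, chunk sizes, and model names are illustrative).
from openai import OpenAI
import numpy as np

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def chunk_text(text, chunk_size=1000, overlap=200):
    """Split the book into overlapping chunks so ideas aren't cut mid-sentence."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def embed(texts, batch_size=100):
    """Embed strings in batches; a million-word book needs many API calls."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        resp = client.embeddings.create(
            model="text-embedding-3-small",
            input=texts[i:i + batch_size],
        )
        vectors.extend(d.embedding for d in resp.data)
    return np.array(vectors)

book_text = open("book.txt").read()  # hypothetical file name
chunks = chunk_text(book_text)
chunk_vectors = embed(chunks)

def retrieve(query, k=5):
    """Return the k chunks most similar to the query by cosine similarity."""
    q = embed([query])[0]
    sims = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

def answer(question):
    """Feed the retrieved passages to a chat model as context."""
    context = "\n---\n".join(retrieve(question))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided excerpts from the book."},
            {"role": "user",
             "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

In practice you would store the vectors in a vector database rather than in memory, but the mechanics are the same.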

Harsh Gill

There are a few options for this:

  • Fine-tune GPT: You can fine-tune a GPT model directly on data derived from the book and then use the fine-tuned model for prompting and responses. This is the best approach when you want the model to generate responses about similar content (a sketch of building such a dataset follows this list).

  • Chunking: You can split the whole book into chunks digestible by GPT or other LLMs, convert the chunks to embeddings, and later use a retrieval-based system to fetch the results most similar to the query or prompt. This does not require fine-tuning the LLM and is a good fit when you want to build an LLM-based search engine.

  • Agent-Based: You can build agents on top of the book: first compute embeddings for the whole book and store them in a vector DB, then build an agent that finds content similar to the query and feeds the query, along with the fetched context, to GPT. This is better when you want to create agents over multiple books and similar setups.
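
Since the question asks how a raw book could be turned into a training set without doing it by hand, one common trick for the fine-tuning option is to have an LLM generate the prompt/response pairs itself, chunk by chunk. Below is a rough sketch under that assumption, using the OpenAI Python client; the model name, prompt wording, and file names are illustrative, and the generated pairs would need spot-checking before training:

```python
# Sketch: generate question/answer pairs per chunk, then write them as
# chat-formatted JSONL for OpenAI's fine-tuning endpoint. Names are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

book_text = open("book.txt").read()  # hypothetical file name
# Fixed-size chunks for simplicity; in practice you would split on chapter
# or scene boundaries so each chunk is self-contained.
chunks = [book_text[i:i + 4000] for i in range(0, len(book_text), 4000)]

def make_qa_pairs(chunk):
    """Ask a chat model to write Q/A pairs grounded in one chunk."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Write three question/answer pairs about the following passage, "
                'as a JSON list of {"question": ..., "answer": ...} objects, '
                "and nothing else.\n\n" + chunk
            ),
        }],
    )
    # Real code should validate this; model output is not guaranteed
    # to be well-formed JSON.
    return json.loads(resp.choices[0].message.content)

# One training example per line, in the chat format the fine-tuning API expects.
with open("train.jsonl", "w") as f:
    for chunk in chunks:
        for pair in make_qa_pairs(chunk):
            f.write(json.dumps({"messages": [
                {"role": "user", "content": pair["question"]},
                {"role": "assistant", "content": pair["answer"]},
            ]}) + "\n")
```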

Hiren Namera