
Would it be possible to train my own LLM on a smaller corpus of text — let's say some coding documentation — that I then want to ask questions about using the model?


If so, are there any recommended ways of doing this? I.e., is there a prebuilt architecture or library I can use, where I just provide the corpus of text?

Dylan Dijk

1 Answer


Aside from common Retrieval-Augmented Generation (RAG) architectures and Agent memory offerings, you can try this:
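To make the RAG idea concrete, here is a minimal sketch of its retrieval step: rank documentation chunks by relevance to the question and prepend the best matches to the prompt. The toy word-overlap scoring and the sample `docs` below are illustrative only; a real pipeline would use embeddings and a vector store.

```python
import re

def tokens(text):
    """Lowercased word set for crude overlap scoring."""
    return set(re.findall(r"[a-z]+", text.lower()))

def score(question, chunk):
    """Fraction of question words that also appear in the chunk."""
    q, c = tokens(question), tokens(chunk)
    return len(q & c) / (len(q) or 1)

def retrieve(question, chunks, k=2):
    """Return the k chunks most relevant to the question."""
    return sorted(chunks, key=lambda ch: score(question, ch), reverse=True)[:k]

# Hypothetical documentation corpus, split into chunks.
docs = [
    "The client.connect() call opens a TCP connection to the server.",
    "Use client.close() to release the connection when finished.",
    "Authentication tokens are passed via the Authorization header.",
]

question = "How do I open a connection?"
context = retrieve(question, docs)

# The retrieved chunks become context for the LLM call.
prompt = "Answer using this documentation:\n" + "\n".join(context) + \
         "\nQuestion: " + question
```

The key point is that the base model is never retrained: relevant text is fetched at query time and supplied as context.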

If you have an OpenAI Pro account, you can define your own custom GPT that can perform certain operations, and you can also add documents under the "Knowledge" menu item in the sidebar of the custom GPT. Then,

If you upload files under Knowledge, conversations with your GPT may include file contents.

If your material is publicly accessible, here is an additional way to get started quickly:

  • Upload the documentation files (or your code) to a public GitLab/GitHub repository accessible from the public internet.
  • Study the GitHub/GitLab/... REST API and select the endpoints you might need to call in order to answer your questions about the docs under version control (e.g. the /commits endpoint, the /issues endpoint, etc.).
  • Define a GPT action that lets your GPT (and thus GPT-4o) call those REST API endpoints.
  • Open the ChatGPT interface and "start talking with your documents", leveraging the power of GPT-4o or GPT-4.
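The steps above boil down to plain REST calls. As a sketch, this is the kind of request a GPT action would issue on your behalf against the GitHub REST API's commits endpoint; the owner and repository names are placeholders for your own documentation repo.

```python
import json
import urllib.request

# Hypothetical public repository holding your documentation.
OWNER, REPO = "your-user", "your-docs-repo"

# GitHub REST API endpoint for listing commits (no token needed for public repos).
url = f"https://api.github.com/repos/{OWNER}/{REPO}/commits?per_page=5"
req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})

# Uncomment to actually call the API and print recent commit summaries:
# with urllib.request.urlopen(req) as resp:
#     for c in json.load(resp):
#         print(c["sha"][:7], c["commit"]["message"].splitlines()[0])
```

When you register such an endpoint as a GPT action, the model decides when to call it and weaves the JSON response into its answer.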

(Costs for tokens consumed may apply.)

knb