I am trying to create a kind of support bot that answers my clients' questions about specific technical details of the WordPress plugins I sell.
The goal is to feed the /completions API a prompt. Sometimes it will be something general, like a CSS styling change, which the davinci engine can answer without any specific data about my business. But a customer might also ask something specific, for which I have a data set of about 3,000 questions and answers (prompts/completions? input/output?) that the bot should draw on, exactly like this awesome example here.
I am a web developer with no experience in AI. I am just scratching the surface, trying to put this bot together while learning concepts like machine learning, training data, validation sets, plotting, and neural networks. So bear with me, because it's a lot to grasp.
So first of all, I did a lot of reading, and getting an API key from OpenAI was certainly the first step.
Then I told ChatGPT my story and what I was trying to achieve. I asked it to write in PHP, preferably, but it always ends up hallucinating, so I could not really use anything it generated without adjusting it. And the more I asked about specifics, the more it hallucinated.
So I read a lot of documentation and combined it with what I got from ChatGPT. I think there are three ways to achieve this:
- A fine-tuned model;
- Uploading a training set and a validation set;
- Embeddings API (which the example I linked uses)
Since most examples are in Python, I started with the GPT-3 Fine Tuning: Key Concepts and Use Cases tutorial, then used the fine_tunes.prepare_data tool on my data (DATA_UNDER_COMMENT) to turn it into a JSONL file, one line per pair, categorised into prompts and completions.
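For reference, each line of the prepared JSONL ends up looking roughly like this (the question and answer are made up, and the ` ->` separator and trailing `\n` are just the kind of suffixes the prepare tool suggests adding):

```json
{"prompt": "How do I change the button colour in the plugin? ->", "completion": " Go to Settings > Appearance and pick a new colour.\n"}
```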
Then I ran openai api fine_tunes.create -t with my prepared file to create my fine-tune. Now that I have the fine-tune, I run:
```python
import openai
response = openai.Completion.create(
    model=FINE_TUNED_MODEL,  # the model name returned by fine_tunes.create
    prompt=YOUR_PROMPT,
)
print(response["choices"][0]["text"])
```
This looked like the way to go, but even if you give it a basic question that was literally in the JSONL, it's as if the engine forgot how to talk and just outputs random characters.
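A side note, in case someone spots my mistake here: from the fine-tuning docs, my understanding is that the prompt at inference time has to end with the same separator the prepared JSONL prompts end with, and that you have to pass the matching stop sequence, otherwise the completion just runs on. A sketch of what I mean, where the ` ->` separator and `\n` stop are only placeholders for whatever the prepared file actually uses:

```python
import openai

response = openai.Completion.create(
    model=FINE_TUNED_MODEL,
    # must end with the same separator the training prompts end with
    prompt="How do I change the button colour in the plugin? ->",
    max_tokens=150,
    temperature=0,
    stop=["\n"],  # the same stop sequence the training completions end with
)
print(response["choices"][0]["text"].strip())
```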
Not getting anywhere with that, I tried another approach from the OpenAI Cookbook, following this example, which describes exactly what I want to achieve:
> The GPT models have picked up a lot of general knowledge in training, but we often need to ingest and use a large library of more specific information.
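In case it helps anyone following along, the raw Embeddings API call itself is tiny; this is a minimal sketch of turning one piece of text into a vector, assuming text-embedding-ada-002 (the model most current examples seem to use):

```python
import openai

text = "How do I change the button colour in the plugin?"
result = openai.Embedding.create(model="text-embedding-ada-002", input=text)
vector = result["data"][0]["embedding"]  # a list of 1536 floats for ada-002
```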
I tried to use the code there, but with my CSV hosted online, I got a 406 response when trying to load.
Then I stored the CSV locally, and it complained that a column (tokens) was not available to convert to an int.
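If I understand the notebook correctly, that column is just the token count of each row's text, so something like this should be able to add it before handing the DataFrame over (the file name is hypothetical, and I am assuming the text lives in a `content` column):

```python
import pandas as pd
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer matching the ada-002 embedding model

df = pd.read_csv("my_plugin_faq.csv")  # hypothetical local file
df["tokens"] = df["content"].apply(lambda text: len(enc.encode(str(text))))
```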
Then, from what I could understand, I switched from load_embeddings to compute_doc_embeddings, because the documentation says the embeddings for that example CSV have already been generated, while mine have not. I did that, but now it expects a JSON instead of a CSV.
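As far as I can tell, compute_doc_embeddings itself just loops over the DataFrame and makes one Embeddings API call per row, roughly like this (my simplified reading of the notebook, not the exact cookbook code, and `content` is again my guess at the column name):

```python
import openai

def get_embedding(text, model="text-embedding-ada-002"):
    return openai.Embedding.create(model=model, input=text)["data"][0]["embedding"]

def compute_doc_embeddings(df):
    # one vector per row, keyed by the DataFrame index
    return {idx: get_embedding(row["content"]) for idx, row in df.iterrows()}
```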
Of course, I can provide my data in any format, but when I tried to load my full data set, it said the 8,000-token limit was exceeded for the request.
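My assumption is that this 8,000-token limit applies per request, so embedding each row separately (and splitting any very long answer into chunks) should keep every call under it. A rough sketch of the chunking I have in mind, with an arbitrary chunk size:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def split_into_chunks(text, max_tokens=500):
    """Split a long document into pieces small enough to embed one by one."""
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]
```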
I then tried putting a small JSON inline, under a comment, and running a prompt against it. And, kind of amazingly, after hours of work, it seems to work: I ask a question that is in the data, but phrased differently, and it replies correctly, using different wording than the JSON data.
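For context, this is roughly the flow that now works for me, as I understand it from the cookbook (heavily simplified; `documents` is a dict of id to text, `doc_embeddings` the matching dict of id to vector from compute_doc_embeddings above, the prompt wording is mine, and text-davinci-003 is just the completion model the cookbook was using when I read it):

```python
import numpy as np
import openai

def get_embedding(text, model="text-embedding-ada-002"):
    return openai.Embedding.create(model=model, input=text)["data"][0]["embedding"]

def answer(question, documents, doc_embeddings):
    # rank the documents by similarity between their embedding and the question embedding
    q = np.array(get_embedding(question))
    best = max(doc_embeddings, key=lambda idx: np.dot(q, doc_embeddings[idx]))
    prompt = (
        "Answer the question using the context below.\n\n"
        f"Context: {documents[best]}\n\nQ: {question}\nA:"
    )
    completion = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=200, temperature=0
    )
    return completion["choices"][0]["text"].strip()
```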
It could not have known this from its general knowledge.
So this is what I want to achieve, but my data set is much larger.
I need help understanding whether my approach is correct. And if embeddings are the way to go, how do I feed the data to OpenAI and reference the embedding set when making API calls to completions? Ideally, I would have those embeddings stored somewhere, with the possibility of adding to them, just like I have fine-tune sets or files under my API account.
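To make that last part concrete, this is the kind of storage I have in mind if nothing hosted exists: just writing the vectors to a local file and reloading or appending later (purely a sketch of the idea, not something I have running):

```python
import pandas as pd

def save_embeddings(doc_embeddings, path="embeddings.csv"):
    # one row per document, one column per vector dimension
    pd.DataFrame.from_dict(doc_embeddings, orient="index").to_csv(path)

def load_saved_embeddings(path="embeddings.csv"):
    df = pd.read_csv(path, index_col=0)
    return {idx: row.tolist() for idx, row in df.iterrows()}
```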