
It seems that most open-source large language models (LLMs), such as Llama 2, were released as a model only, without the exact training procedure or the exact training data sources (specific data revisions) that one would need to fully re-create (re-train) the model from scratch.

So what are the most notable (largest?) LLMs that are truly 100% open source from beginning to end? I mean projects that have disclosed the exact training data revisions (e.g. the specific Common Crawl or Wikipedia dump revisions) and the exact training and tuning procedures (probably including code), so that the model can be independently verified and one can learn from the procedures in general.

Kozuch
