Is Llama3 fully open-source, including tokenizer, transformers, and other components needed to build a custom LLM?

Question

I'm trying to understand whether Llama 3 (or other open source models) is fully open-source. Specifically, I would like to know:

Is the source code for Llama 3 (including the tokenizer, transformers, and other components) available under an open-source license?
Does the available source code provide everything necessary to build a large language model (LLM) with custom data using Llama 3's architecture?

For example, if I wanted to train my own model like Llama 3, would I have access to all the underlying code (for tokenization, model architecture, etc.), or are there any proprietary components involved?

Aleph 0 · Accepted Answer · 2024-09-26T08:42:23.467

The main components that are not available are the large datasets of text used to pretrain and fine-tune the Llama models.

The components that are available are the software used to train and fine-tune models, the result of pretraining (the Llama base model), and the result of fine-tuning (the Llama instruct and chat models).

The hardware for pretraining is also extremely expensive. The hardware for fine-tuning, by contrast, can be rented.

So 'open source' is slightly misleading: Llama is not fully libre in the GNU sense. It is a 'weights available' machine learning model, and the source code for training it, fine-tuning it, and running inference is open source.

Note: to clarify, the exact source code Meta used to pretrain their models on their gigantic hardware infrastructure, with tens of thousands of GPU nodes, is very specific to their own setup, and this code is not available to us. But other open source code is available for pretraining models from scratch, including models with the Llama architecture.

Franck Dernoncourt · Answer 2 · 2024-10-07T05:41:00.537

It doesn't meet the typical definition of open source, such as the one given by the Open Source Initiative, since the Llama 3 license includes restrictive clauses such as:

If, on the Meta Llama 3 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.
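
To make the condition in that clause concrete, here is a minimal Python sketch of the threshold test as I read it. The function name and the single combined user count are my own assumptions for illustration. The clause does not say how users across a licensee and its affiliates should be counted, and this is not legal advice.

```python
# Threshold quoted in the license clause: 700 million monthly active users.
MAU_THRESHOLD = 700_000_000


def exceeds_mau_threshold(mau_preceding_month: int) -> bool:
    """Return True if the quoted clause would apply.

    `mau_preceding_month` is a hypothetical input: the monthly active users
    of the licensee's (and affiliates') products in the calendar month
    before the Llama 3 release date. How to combine those counts is an
    assumption left to the caller.
    """
    if mau_preceding_month < 0:
        raise ValueError("user count cannot be negative")
    # The clause says "greater than", so exactly 700 million does not trigger it.
    return mau_preceding_month > MAU_THRESHOLD


if __name__ == "__main__":
    for users in (50_000, 700_000_000, 700_000_001):
        print(users, exceeds_mau_threshold(users))
```

Note that the test is evaluated as of the release date, so under this reading a licensee that grows past the threshold later is not covered by this particular clause. Either way, a usage restriction of this kind is what puts the license outside the OSI definition.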
