
Some LLMs are trained on both CommonCrawl and Wikipedia/StackExchange. Why? Does CommonCrawl already contain Wikipedia/StackExchange?

E.g., from the LLaMA 1 paper:

[Screenshot: pre-training data mixture table from the LLaMA 1 paper]

and from https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T:

| Dataset       | Token Count  |
|---------------|--------------|
| Commoncrawl   | 878 Billion  |
| C4            | 175 Billion  |
| GitHub        | 59 Billion   |
| Books         | 26 Billion   |
| ArXiv         | 28 Billion   |
| Wikipedia     | 24 Billion   |
| StackExchange | 20 Billion   |
| Total         | 1.2 Trillion |

Looking at https://commoncrawl.github.io/cc-crawl-statistics/plots/domains, it seems that CommonCrawl includes the Wikipedia and StackExchange domains. But maybe its coverage of those domains is incomplete? (One way to probe this directly is sketched after the screenshots below.)

[Screenshots: Common Crawl per-domain crawl statistics from the page linked above]
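
To sanity-check the coverage myself, something along these lines could query the public Common Crawl CDX index. This is only a rough sketch: the crawl ID is an example (any crawl listed at https://index.commoncrawl.org/ would do), and it counts index result pages of captured URLs, not tokens.

```python
# Rough sketch: how large a footprint a domain has in one Common Crawl
# snapshot, via the public CDX index API at index.commoncrawl.org.
# The crawl ID below is an example; substitute any listed crawl.
import requests

CRAWL = "CC-MAIN-2024-10"  # example crawl ID, not a specific recommendation
INDEX = f"https://index.commoncrawl.org/{CRAWL}-index"

def index_pages(domain: str) -> int:
    """Number of index result pages for a domain (and its subdomains),
    a coarse proxy for how many captures the crawl holds."""
    resp = requests.get(
        INDEX,
        params={
            "url": domain,
            "matchType": "domain",   # include subdomains
            "output": "json",
            "showNumPages": "true",  # only ask for the page count
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["pages"]

for domain in ("en.wikipedia.org", "stackexchange.com"):
    print(domain, index_pages(domain))
```

Even so, this only tells me how many pages were captured per crawl, not how much of that text survives the extraction, deduplication, and quality filtering applied downstream.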

Franck Dernoncourt

1 Answer


Wikipedia is a vanishingly small % of the internet. According to the information in the screenshot you posted, Wikipedia comprises less than 0.2% of the Common Crawl. A random sample of Common Crawl is extremely unlikely to have a meaningful amount of Wikipedia text.
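To put rough numbers on that, using the RedPajama figures from your question (treating the "less than 0.2%" figure as a token share is a simplification, but it makes the point):

```python
# Back-of-the-envelope check using the figures from the question.
cc_tokens = 878e9          # Common Crawl tokens in RedPajama
wiki_share = 0.002         # "less than 0.2%" of Common Crawl is Wikipedia
wiki_in_cc = cc_tokens * wiki_share
dedicated_wiki = 24e9      # separate Wikipedia component in RedPajama

print(f"Wikipedia tokens hiding in the CC slice: <= {wiki_in_cc / 1e9:.1f}B")
print(f"Dedicated Wikipedia component:              {dedicated_wiki / 1e9:.0f}B")
# <= ~1.8B vs 24B: the dedicated component carries over 10x more Wikipedia text.
```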

Also, even if a Common Crawl segment does contain Wikipedia, it's still good to include it as a separate component. High-quality factual information should be upsampled to achieve the best performance, as repeated exposure to facts increases an LLM's ability to answer questions about them correctly.
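
To illustrate what upsampling a component looks like in practice, here is a minimal sketch. The corpus list and weights are illustrative assumptions, not the actual LLaMA or RedPajama proportions:

```python
# Minimal sketch of mixture sampling with upsampling. The weights below are
# illustrative assumptions, not the real LLaMA / RedPajama mixture.
import random

corpora = {
    # name: (token_count_in_billions, sampling_weight)
    "common_crawl":  (878, 0.67),
    "wikipedia":     (24,  0.045),  # large weight relative to its size => upsampled
    "stackexchange": (20,  0.02),
}

names = list(corpora)
weights = [w for _, w in corpora.values()]

def sample_source() -> str:
    """Pick which corpus the next training sequence is drawn from."""
    return random.choices(names, weights=weights, k=1)[0]

# Effective epochs over a 1T-token run: tokens drawn from a corpus divided by
# its size. Heavily weighted small corpora are traversed multiple times.
total_tokens = 1.0e12
norm = sum(weights)
for name, (size_b, w) in corpora.items():
    epochs = (w / norm) * total_tokens / (size_b * 1e9)
    print(f"{name:14s} ~{epochs:.2f} epochs")
```

With weights like these, the small Wikipedia component is seen several times per training run while Common Crawl is seen roughly once, which is the "repeated exposure" referred to above.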

Stella Biderman