
Some LLMs are trained on both CommonCrawl and Wikipedia/StackExchange. Why? Does CommonCrawl already contain Wikipedia/StackExchange?

E.g., from the LLaMA 1 paper:

[Screenshot: pre-training data mixture table from the LLaMA 1 paper]

and from https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T:

| Dataset       | Token Count  |
|---------------|--------------|
| Commoncrawl   | 878 Billion  |
| C4            | 175 Billion  |
| GitHub        | 59 Billion   |
| Books         | 26 Billion   |
| ArXiv         | 28 Billion   |
| Wikipedia     | 24 Billion   |
| StackExchange | 20 Billion   |
| Total         | 1.2 Trillion |

Looking at https://commoncrawl.github.io/cc-crawl-statistics/plots/domains, it seems that CommonCrawl includes the Wikipedia and StackExchange domains. But maybe its coverage of those domains is incomplete? (One way to probe this directly is sketched after the screenshots below.)

[Screenshots: Common Crawl per-domain crawl statistics from the page linked above]
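
To sanity-check the coverage myself, something along these lines could query the public Common Crawl CDX index. This is only a rough sketch: the crawl ID is an example (any crawl listed at https://index.commoncrawl.org/ would do), and it counts index result pages of captured URLs, not tokens.

```python
# Rough sketch: how large a footprint a domain has in one Common Crawl
# snapshot, via the public CDX index API at index.commoncrawl.org.
# The crawl ID below is an example; substitute any listed crawl.
import requests

CRAWL = "CC-MAIN-2024-10"  # example crawl ID, not a specific recommendation
INDEX = f"https://index.commoncrawl.org/{CRAWL}-index"

def index_pages(domain: str) -> int:
    """Number of index result pages for a domain (and its subdomains),
    a coarse proxy for how many captures the crawl holds."""
    resp = requests.get(
        INDEX,
        params={
            "url": domain,
            "matchType": "domain",   # include subdomains
            "output": "json",
            "showNumPages": "true",  # only ask for the page count
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["pages"]

for domain in ("en.wikipedia.org", "stackexchange.com"):
    print(domain, index_pages(domain))
```

Even so, this only tells me how many pages were captured per crawl, not how much of that text survives the extraction, deduplication, and quality filtering applied downstream.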

Franck Dernoncourt

1 Answer


Wikipedia is a vanishingly small % of the internet. According to the information in the screenshot you posted, Wikipedia comprises less than 0.2% of the Common Crawl. A random sample of Common Crawl is extremely unlikely to have a meaningful amount of Wikipedia text.
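To put rough numbers on that, using the RedPajama figures from your question (treating the "less than 0.2%" figure as a token share is a simplification, but it makes the point):

```python
# Back-of-the-envelope check using the figures from the question.
cc_tokens = 878e9          # Common Crawl tokens in RedPajama
wiki_share = 0.002         # "less than 0.2%" of Common Crawl is Wikipedia
wiki_in_cc = cc_tokens * wiki_share
dedicated_wiki = 24e9      # separate Wikipedia component in RedPajama

print(f"Wikipedia tokens hiding in the CC slice: <= {wiki_in_cc / 1e9:.1f}B")
print(f"Dedicated Wikipedia component:              {dedicated_wiki / 1e9:.0f}B")
# <= ~1.8B vs 24B: the dedicated component carries over 10x more Wikipedia text.
```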

Also, even if a Common Crawl segment does contain Wikipedia, it's still good to include it as a separate component. High-quality factual information should be upsampled to achieve the best performance, as repeated exposure to facts increases an LLM's ability to answer questions about them correctly.
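
To illustrate what upsampling a component looks like in practice, here is a minimal sketch. The corpus list and weights are illustrative assumptions, not the actual LLaMA or RedPajama proportions:

```python
# Minimal sketch of mixture sampling with upsampling. The weights below are
# illustrative assumptions, not the real LLaMA / RedPajama mixture.
import random

corpora = {
    # name: (token_count_in_billions, sampling_weight)
    "common_crawl":  (878, 0.67),
    "wikipedia":     (24,  0.045),  # large weight relative to its size => upsampled
    "stackexchange": (20,  0.02),
}

names = list(corpora)
weights = [w for _, w in corpora.values()]

def sample_source() -> str:
    """Pick which corpus the next training sequence is drawn from."""
    return random.choices(names, weights=weights, k=1)[0]

# Effective epochs over a 1T-token run: tokens drawn from a corpus divided by
# its size. Heavily weighted small corpora are traversed multiple times.
total_tokens = 1.0e12
norm = sum(weights)
for name, (size_b, w) in corpora.items():
    epochs = (w / norm) * total_tokens / (size_b * 1e9)
    print(f"{name:14s} ~{epochs:.2f} epochs")
```

With weights like these, the small Wikipedia component is seen several times per training run while Common Crawl is seen roughly once, which is the "repeated exposure" referred to above.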

Stella Biderman