
Common Crawl

From Wikipedia, the free encyclopedia
Common Crawl
Type of business: 501(c)(3) non-profit
Founded: 2007
Headquarters: San Francisco, California; Los Angeles, California, United States
Founder: Gil Elbaz
Managing director: Rich Skrenta
URL: commoncrawl.org
Content license: Apache 2.0 (software) [clarification needed]

The Common Crawl Foundation (Common Crawl) is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public.[1][2]

Common Crawl was founded by Gil Elbaz.[1][2]

The data was used mostly by researchers and some startups until the 2020s, when AI companies began training large language models on it.[3] In November 2025, an investigation by The Atlantic revealed that Common Crawl had misled publishers: it claimed to respect paywalls in its scraping, and it was not honoring requests from publishers to have their content removed from its databases.[3]

History


Common Crawl was founded in 2007 in San Francisco.[4] It began publishing its crawls in 2011.[5][additional citation(s) needed]

By 2013, companies such as TinEye were building their products on Common Crawl data.[6][7] The crawl reduces the reliance of companies and researchers on Google, which holds the largest web dataset.[6][7] Common Crawl was designed to offer more and fresher data, in a form more efficient to analyze and use, than the Wayback Machine created by the Internet Archive.[6][7]

By 2015, Common Crawl's archive held 1.8 billion webpages; the project started by crawling a list of URLs donated by the search engine Blekko.[8] It uses Amazon Web Services, which provides some of its services for free, keeping average computing costs to roughly $2,000–$4,000 per month.[8] At the time, the Common Crawl website listed 30 studies based on Common Crawl data.[8]

Before 2023, Common Crawl was not well known outside of the academic researchers who used the data.[4] Common Crawl received its first requests to redact information in 2023 and increasingly saw its crawler, CCBot, blocked.[4] That year, it began receiving significant financial support from AI companies, including Anthropic and OpenAI, each of which donated $250,000.[3] Its data was also used to train Google DeepMind's large language model Gemini.[9] By April 2023, Common Crawl was capturing 3.1 billion webpages, with an estimated 5% of pre-2021 pages containing hate speech or slurs.[10]
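Publishers typically block CCBot through the Robots Exclusion Protocol. A minimal robots.txt fragment illustrating such a block (CCBot is Common Crawl's published user-agent token; the rule below is an illustrative example, not taken from any particular site):

```text
# Served at https://example.com/robots.txt
# Disallow Common Crawl's crawler from the entire site
User-agent: CCBot
Disallow: /
```

As the 2025 Atlantic investigation noted, such rules only stop future crawls; they do not remove pages already present in published archives.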

As of 2024, Common Crawl had been cited in more than 10,000 academic studies.[11] By then, The Pile and Common Crawl were the two main datasets used to train AI models.[12][13]

In November 2025, an investigation by technology journalist Alex Reisner for The Atlantic revealed that Common Crawl had misled publishers by claiming that it respected paywalls in its scraping and that it was honoring requests from publishers to have their content removed from its databases.[3] The public search function on its website showed no entries for websites that had requested removal of their archives, even though those sites were still included in the scrapes used by AI companies.[3] As of 2025, Reisner found that CCBot was the bot most widely blocked by the top 1,000 websites.[3]

A 2026 article in LWN.net noted an advantage of services like Common Crawl: they can limit the scraping load on websites by letting companies and researchers download the data from Common Crawl instead of scraping sites themselves.[14]
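In practice, consumers locate archived pages through Common Crawl's public CDX index API (index.commoncrawl.org), which returns one record per capture with the WARC filename and byte offsets needed to fetch the raw data from the public dataset. A minimal sketch of building such a query; the crawl label CC-MAIN-2023-50 is an illustrative example of the per-crawl naming scheme:

```python
from urllib.parse import urlencode

def build_index_query(site: str, crawl: str = "CC-MAIN-2023-50") -> str:
    """Build a query URL for Common Crawl's CDX index API.

    Each JSON record in the response describes one capture of a page,
    so data can be pulled from the archive rather than re-scraped.
    """
    params = urlencode({"url": site, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl}-index?{params}"

# Query for every capture under example.com/ (the "*" wildcard)
query = build_index_query("example.com/*")
```

Fetching the resulting URL (not done here, to keep the sketch offline) yields newline-delimited JSON records that point into the crawl's WARC files.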

Organization


Peter Norvig and Joi Ito have served on the advisory board.[7] Rich Skrenta is the executive director.[3]

It received funding almost exclusively from the Elbaz Family Foundation Trust until 2023, when it started receiving donations from the AI industry.[3]

Refined versions


A number of organizations take raw Common Crawl data and refine it into datasets that exclude objectionable content or are otherwise higher quality for their purposes; examples include FineWeb, DCLM, and C4.[3]

Colossal Clean Crawled Corpus


Google's version of Common Crawl is called the Colossal Clean Crawled Corpus, or C4 for short. It was constructed in 2019 for training the T5 series of language models.[15] As of 2023, concerns had been raised over copyrighted as well as racist content in C4.[16][15] A 2024 study found that 45% of its content came from websites whose terms of service explicitly restricted the material from being used for purposes such as AI training by for-profit companies.[11]


References

  1. ^ a b Rosanna Xia (February 5, 2012). "Tech entrepreneur Gil Elbaz made it big in L.A." Los Angeles Times. Retrieved July 31, 2014. his nonprofit Common Crawl Foundation 'seeks to make a copy of the Web accessible to a data scientist, or to a start-up or to a researcher or to an analyst that just wants to improve the world.'
  2. ^ a b "Gil Elbaz and Common Crawl". NBC News. April 4, 2013. Archived from the original on April 13, 2013. Retrieved July 31, 2014.
  3. ^ a b c d e f g h i Reisner, Alex (2025-11-04). "The Company Quietly Funneling Paywalled Articles to AI Developers". The Atlantic. Retrieved 2025-11-14.
  4. ^ a b c Knibbs, Kate (June 13, 2024). "Publishers Target Common Crawl In Fight Over AI Training Data". Wired. ISSN 1059-1028. Retrieved 2025-12-10.
  5. ^ Potts, Jason; Torrance, Andrew; Harhoff, Dietmar; von Hippel, Eric (March 2024). "Profiting from Data Commons: Theory, Evidence, and Strategy Implications". Strategy Science. 9 (1): 1–17. doi:10.1287/stsc.2021.0080. ISSN 2333-2050.
  6. ^ a b c Brandom, Russell (2013-03-01). "Common Crawl: going after Google on a non-profit budget". The Verge. Retrieved 2025-12-10.
  7. ^ a b c d Tom Simonite (January 23, 2013). "A Free Database of the Entire Web May Spawn the Next Google". MIT Technology Review. Archived from the original on June 26, 2014. Retrieved July 31, 2014.
  8. ^ a b c Hayes, Brian (2015). "Computing Science: Crawling toward a Wiser Web". American Scientist. 103 (3): 184–187. ISSN 0003-0996.
  9. ^ Cuesta, Albert (15 November 2025). "L'FBI, a la caça del web arxivat que incomoda els mitjans". Ara (in Catalan). Archived from the original on 17 November 2025. Retrieved 2 March 2026.
  10. ^ Soos, Carlin; Haroutunian, Levon (2024). "On the Question of Authorship in Large Language Models". Knowledge Organization. 51 (2): 83–95. doi:10.5771/0943-7444-2024-2-83. ISSN 0943-7444.
  11. ^ a b Roose, Kevin (July 19, 2024). "The Data That Powers A.I. Is Disappearing Fast". New York Times.
  12. ^ Gilbertson, Annie (July 16, 2024). "Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI". Wired. ISSN 1059-1028. Retrieved 2026-01-29.
  13. ^ Anthony, Aubra; Sharma, Lakshmee; Noor, Elina (2024-04-30). "Advancing a More Global Agenda for Trustworthy Artificial Intelligence". Carnegie Endowment for International Peace. Retrieved 2026-01-29.
  14. ^ Alden, Daroc (2026-02-12). "Poisoning scraperbots with iocaine". LWN.net. Retrieved 2026-05-09.
  15. ^ a b Quach, Katyanna (2023-04-20). "Google's C4 ML training data drew from 4chan, racist sources". theregister. Retrieved 2026-05-09.
  16. ^ Hern, Alex (2023-04-20). "Fresh concerns raised over sources of training material for AI systems". The Guardian. ISSN 0261-3077. Retrieved 2023-04-21.