N-gram counts and language models from the common crawl

C Buck, K Heafield, B Van Ooyen - Proceedings of the Language …, 2014 - research.ed.ac.uk
… Finally, we investigate the relation between the amount of Common Crawl data used and …
cannot rule out the possibility that some of the segments appear in the Common Crawl data. …

What's in the box? a preliminary analysis of undesirable content in the common crawl corpus

AS Luccioni, JD Viviano - arXiv preprint arXiv:2105.02732, 2021 - arxiv.org
… This dwarfs other commonly used corpora such as English-… The Common Crawl has been
used to train many of the recent … an initial analysis of the Common Crawl, highlighting the pres…

Dirt cheap web-scale parallel text from the common crawl

JR Smith, H Saint-Amand, M Plamada, P Koehn… - 2013 - zora.uzh.ch
… parallel text, but crawling the entire web is impossible for all but … Common Crawl, a public
Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing more than a set of common

Introduction to common crawl datasets

JM Patel - Getting structured data from the internet: running web …, 2020 - Springer
… When we take the common crawl data cumulatively, across monthly crawls since 2008, it
represents one of the largest publicly accessible web crawl data corpuses on a petabyte …

Elastic chatnoir: Search engine for the clueweb and the common crawl

J Bevendorff, B Stein, M Hagen, M Potthast - European conference on …, 2018 - Springer
… reference corpora like the ClueWebs and the Common Crawl. ChatNoir is freely available
and … In the future, we plan to incorporate further versions of the Common Crawl, so that …

A critical analysis of the largest source for generative ai training data: Common crawl

S Baack - Proceedings of the 2024 ACM Conference on Fairness …, 2024 - dl.acm.org
… • In chapter 4, we discuss how LLM builders typically use Common Crawl and highlight
that the popularity of Common Crawl has shaped builders’ expectations regarding model …

CopyCat: Near-Duplicates within and between the ClueWeb and the Common Crawl

M Fröbe, J Bevendorff, L Gienapp, M Völske… - Proceedings of the 44th …, 2021 - dl.acm.org
… With the CopyCat resource, we provide lists of near-duplicates in the commonly used ClueWeb
and Common Crawl datasets and a software toolkit to conduct deduplication on arbitrary …

Nemotron-CC: Transforming Common Crawl into a refined long-horizon pretraining dataset

D Su, K Kong, Y Lin, J Jennings, B Norick… - arXiv preprint arXiv …, 2024 - arxiv.org
… We propose a method for transforming English Common Crawl into a 6.3T token long-horizon …
We release the dataset under the Common Crawl Terms of Use and a reference …

Comparison of common crawl news & gdelt

A El Ouadi, D Beskow - 2024 IEEE international systems …, 2024 - ieeexplore.ieee.org
Common Crawl data has proven valuable for research and training artificial intelligence, …
the Common Crawl dataset. CC-News, a derived corpus of news data from Common Crawl, …

Understanding regional context of World Wide Web using common crawl corpus

MA Mehmood, HM Shafiq… - 2017 IEEE 13th Malaysia …, 2017 - ieeexplore.ieee.org
… analyzed freely available crawl data of December 2016 by the Common Crawl Corpus [5] …
Note that crawling the whole web is very expensive, however, Common Crawl Corpus …