commoncrawl/cc-citations

Common Crawl Citations – BibTeX Database

BibTeX files are in bib/

Note: this is a work in progress and still contains only a fraction of recent articles.

Fields Specific to Common Crawl

The following non-standard fields are used to record how the publications relate to Common Crawl:

  • cc-author-affiliation: affiliation of the authors
  • cc-class: classification of the publication: domain of research, topics, keywords
  • cc-snippet: snippet citing Common Crawl
  • cc-dataset-used: subset of Common Crawl used, e.g., CC-MAIN-2016-07
  • cc-derived-dataset-about: the publication describes a dataset which has been derived from Common Crawl, e.g., GloVe-word-embeddings
  • cc-derived-dataset-used: a dataset has been used which is derived from Common Crawl, e.g., GloVe-word-embeddings
  • cc-derived-dataset-cited: a derived dataset is cited but not used
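
To illustrate how these fields might appear in practice, here is a minimal Python sketch that pulls the cc-* fields out of a single BibTeX entry. The entry itself (key, author, title, field values) is entirely hypothetical and not taken from bib/, and the parser assumes one `name = {value}` field per line, which real entries may not follow. For anything beyond a quick look, a full BibTeX parser would be the safer choice.

```python
import re

# Hypothetical entry for illustration only; it does not come from bib/.
EXAMPLE_ENTRY = """
@inproceedings{doe2016example,
  author = {Doe, Jane},
  title = {An Example Study of Web Text},
  year = {2016},
  cc-author-affiliation = {Example University},
  cc-class = {nlp, web-corpus},
  cc-snippet = {We use data from Common Crawl.},
  cc-dataset-used = {CC-MAIN-2016-07},
  cc-derived-dataset-used = {GloVe-word-embeddings},
}
"""

# Matches a single-line field of the form: name = {value},
# Assumption: each field sits on its own line and is wrapped in braces.
FIELD_RE = re.compile(
    r"^[ \t]*([A-Za-z][\w-]*)[ \t]*=[ \t]*\{(.*)\}[ \t]*,?[ \t]*$",
    re.MULTILINE,
)


def cc_fields(entry: str) -> dict[str, str]:
    """Return only the fields whose name starts with 'cc-'."""
    return {
        name.lower(): value
        for name, value in FIELD_RE.findall(entry)
        if name.lower().startswith("cc-")
    }


if __name__ == "__main__":
    for name, value in cc_fields(EXAMPLE_ENTRY).items():
        print(f"{name}: {value}")
```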

Formatting and Export of Citations

The Makefile contains targets that apply consistent formatting to the citations and export them. The following BibTeX tools are required: bibtex2html, bibclean, bibtool.

(Do not confuse bibclean with the PyPI package of the same name, which is entirely different. bibclean, bibtool, and bibtex2html are available as OS packages, at least in apt-based distros.)
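
Before running the Makefile, it can help to confirm that the three tools are on the PATH. The following is a small sketch, not part of the repository; the function name `missing_tools` is made up for this example. It only checks that an executable with each name can be found, so it cannot tell the BibTeX bibclean apart from an unrelated program of the same name.

```python
import shutil

# Tool names as listed above.
REQUIRED_TOOLS = ("bibtex2html", "bibclean", "bibtool")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required BibTeX tools were found.")
```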

Citations from Google Scholar Alerts

As an initial step, and to achieve higher coverage, citations are extracted from Google Scholar Alert e-mails received from April 2016 to date. See gscholar_alerts.

Updating the awesome graph that everyone loves

Uploading the raw data to Hugging Face

Google Scholar

This data is split by year to make it easier to explore.

Annotated Citations

This much smaller dataset has the extra fields mentioned above.

  • pull the updated repo
  • make tmp/commoncrawl_annotated.csv
  • TODO
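
Once the make target above has produced tmp/commoncrawl_annotated.csv, a quick sanity check before any upload might look like the sketch below. It is not part of the repository: the helper name `summarize_csv` is hypothetical, the file is assumed to be UTF-8 with a header row, and no particular column names are assumed since the steps above do not state them.

```python
import csv
from pathlib import Path

# Path taken from the make target named in the steps above.
ANNOTATED_CSV = Path("tmp/commoncrawl_annotated.csv")


def summarize_csv(path):
    """Return (column names, number of data rows) for a CSV with a header row.

    Assumption: the file is UTF-8 encoded.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)
        columns = list(reader.fieldnames or [])
    return columns, row_count


if __name__ == "__main__":
    if ANNOTATED_CSV.exists():
        columns, row_count = summarize_csv(ANNOTATED_CSV)
        print(f"{row_count} rows, columns: {', '.join(columns)}")
    else:
        print(f"{ANNOTATED_CSV} not found; run the make target first.")
```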

About

Scientific articles using or citing Common Crawl data
