Common Crawl Citations – BibTeX Database

BibTeX files are in bib/.

Note: this is a work in progress and still contains only a fraction of recent articles.

Fields Specific for Common Crawl

The following non-standard fields record how a publication relates to Common Crawl:

cc-author-affiliation: affiliation of the authors
cc-class: classification of the publication: domain of research, topics, keywords
cc-snippet: snippet citing Common Crawl
cc-dataset-used: subset of Common Crawl used, e.g., CC-MAIN-2016-07
cc-derived-dataset-about: the publication describes a dataset which has been derived from Common Crawl, e.g., GloVe-word-embeddings
cc-derived-dataset-used: a dataset has been used which is derived from Common Crawl, e.g., GloVe-word-embeddings
cc-derived-dataset-cited: a derived dataset is cited but not used
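As a minimal sketch of how these fields might appear in an entry and be read back, the Python snippet below extracts the cc-* fields from a made-up BibTeX entry. The entry key, author, title, affiliation, and classification values are placeholders, not a real citation. The regular expression only handles simple single-line `name = {value}` fields, so a full BibTeX parser would be the safer choice for the files in bib/.

```python
import re

# Hypothetical entry for illustration only; it is not a real publication.
EXAMPLE_ENTRY = """
@article{example2016placeholder,
  author = {Example Author},
  title = {A Placeholder Title},
  year = {2016},
  cc-author-affiliation = {Example University},
  cc-class = {example-domain, example-topic},
  cc-dataset-used = {CC-MAIN-2016-07},
  cc-derived-dataset-used = {GloVe-word-embeddings},
}
"""

# Matches simple single-line fields of the form: name = {value},
FIELD_RE = re.compile(r"^\s*([A-Za-z][\w-]*)\s*=\s*\{(.*)\}\s*,?\s*$", re.MULTILINE)


def extract_cc_fields(entry: str) -> dict:
    """Return only the non-standard cc-* fields of a single entry."""
    fields = {name.lower(): value for name, value in FIELD_RE.findall(entry)}
    return {name: value for name, value in fields.items() if name.startswith("cc-")}


cc_fields = extract_cc_fields(EXAMPLE_ENTRY)
for name, value in sorted(cc_fields.items()):
    print(f"{name}: {value}")
```

Standard fields such as author and title are parsed but filtered out, so only the Common Crawl annotations remain.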

Formatting and Export of Citations

The Makefile contains targets that apply consistent formatting to the citations, as well as targets that export them. The following BibTeX tools are required: bibtex2html, bibclean, bibtool.

(Do not confuse the required bibclean with the PyPI package of the same name; that is an entirely different tool. bibclean, bibtool, and bibtex2html are available as OS packages, at least in apt-based distros.)
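Before running the Makefile targets, it can help to confirm that the three tools are on the PATH. The following is a small sketch using Python's standard `shutil.which`; the function name `find_tools` is made up for this example. It only checks that an executable with the given name exists, so it cannot tell the required bibclean apart from a same-named script installed from PyPI.

```python
import shutil

# Tool names as listed in this README.
REQUIRED_TOOLS = ("bibtex2html", "bibclean", "bibtool")


def find_tools(names=REQUIRED_TOOLS) -> dict:
    """Map each tool name to its executable path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```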

Citations from Google Scholar Alerts

As an initial step, and to increase coverage, citations are extracted from Google Scholar Alert e-mails received from April 2016 to date. See gscholar_alerts.

Updating the awesome graph that everyone loves

Uploading the raw data to Hugging Face

Google Scholar

This data is split by year to make it easier to explore.

pull the updated repo
make gscholar-bib
look in tmp/ for the per-year files (2024.jsonl, etc.)
upload them at https://huggingface.co/datasets/commoncrawl/citations/tree/main
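The repository's split-jsonl.py is not reproduced here, so the following is only a sketch of what splitting by year could look like. It assumes that each JSONL line is a JSON object with a `year` key, which this README does not confirm. The function name and the `unknown` bucket are likewise hypothetical.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def split_jsonl_by_year(lines, out_dir):
    """Write records to <out_dir>/<year>.jsonl and return per-year counts.

    Assumes every non-blank line is a JSON object. Records without a
    "year" key go to unknown.jsonl. The "year" key is an assumption.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    by_year = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        by_year[str(record.get("year", "unknown"))].append(record)
    for year, records in by_year.items():
        with open(out_dir / f"{year}.jsonl", "w", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return {year: len(records) for year, records in by_year.items()}


if __name__ == "__main__":
    # Placeholder records, not real citation data.
    sample = ['{"title": "Placeholder A", "year": 2024}',
              '{"title": "Placeholder B", "year": 2023}']
    with tempfile.TemporaryDirectory() as tmp:
        print(split_jsonl_by_year(sample, tmp))
```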

Annotated Citations

This much smaller dataset has the extra fields mentioned above.

pull the updated repo
make tmp/commoncrawl_annotated.csv
TODO
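Once the CSV has been generated, a quick sanity check is to see how many rows actually carry each annotation. The sketch below assumes that the CSV columns are named after the cc-* fields described above, which is an assumption about the export and not confirmed here. The in-memory sample stands in for tmp/commoncrawl_annotated.csv and contains placeholder rows only.

```python
import csv
import io


def cc_column_coverage(csv_file) -> dict:
    """Count non-empty values for every column whose name starts with 'cc-'."""
    reader = csv.DictReader(csv_file)
    cc_columns = [c for c in (reader.fieldnames or []) if c.startswith("cc-")]
    counts = dict.fromkeys(cc_columns, 0)
    for row in reader:
        for column in cc_columns:
            if (row.get(column) or "").strip():
                counts[column] += 1
    return counts


# Placeholder data; column names are assumed, not taken from the real export.
sample = io.StringIO(
    "title,cc-dataset-used,cc-derived-dataset-used\n"
    "Placeholder A,CC-MAIN-2016-07,\n"
    "Placeholder B,,GloVe-word-embeddings\n"
)
coverage = cc_column_coverage(sample)
print(coverage)
```

To run it against the real file, an opened file object could be passed in place of the sample, for example `open("tmp/commoncrawl_annotated.csv", newline="", encoding="utf-8")`.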
