An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Updated Aug 23, 2025 - Python
Crawler for linguistic corpora
Data for the quantitative study of (Vedic) Sanskrit
Large silver-standard Russian corpus with NER, morphology, and syntax markup
A set of workflows for corpus building through OCR, post-correction and normalisation
Amharic-English Machine Translation Corpus prepared through website crawling and custom preprocessing.
CoNLL-U to pandas DataFrame
Yet another search platform for linguistic corpora.
Vietnamese Wikipedia Corpus
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts
Preprocessing and analysis for training SNOMED-CT concept embeddings from CORD-19 corpus
Simple bs4-based web crawler for a corpus in need of statistical machine translation
Scraper
Measure the similarity of text corpora for 74 languages
Filipino wordlist word-level
TextDirectory allows you to filter, transform, and combine multiple text files into one aggregated file.
Scripts for building a geo-located web corpus using Common Crawl data
Statistical association measures for Python pandas
Implementation of the term scoring algorithm in Tomokiyo & Hurst (2003), based on Kullback-Leibler Divergence (kldiv). Given a foreground and background corpus, it returns the most descriptive terms of the foreground corpus in the form of a termcloud
Tools and resources for the computational processing of Nheengatu (Modern Tupi)