Upgrade lang identifier model #375

bdewilde · 2023-04-02T20:53:22Z

Description

adds scripts to prepare training/evaluation data for a language identification model and then train the langid model
updates lang identifier class to use the new (v3) floret/fasttext model rather than the existing (v2) thinc/cld3 model
adds more datasets to the training data, for broader and better language coverage
updates a couple of existing datasets to newer versions
adds floret (the explosion folks' wrapper around fasttext) as a package dependency
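
For readers unfamiliar with the data format involved: fastText-style supervised training reads one example per line, with the class given as a `__label__`-prefixed token ahead of the text. The sketch below shows that preparation step under the assumption that the v3 scripts feed floret this standard format; the helper names (`to_fasttext_line`, `parse_label`, `write_dataset`) are hypothetical and are not taken from this PR.

```python
from __future__ import annotations

from typing import Iterable

# fastText's default label prefix (assumed unchanged for the v3 langid model)
LABEL_PREFIX = "__label__"


def to_fasttext_line(text: str, lang: str) -> str:
    """Format one (text, lang) example as a single supervised-training line."""
    # the format is one example per line, so collapse newlines and extra whitespace
    cleaned = " ".join(text.split())
    return f"{LABEL_PREFIX}{lang} {cleaned}"


def parse_label(label: str) -> str:
    """Strip the label prefix from a predicted label, e.g. '__label__en' -> 'en'."""
    return label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label


def write_dataset(examples: Iterable[tuple[str, str]], path: str) -> int:
    """Write (text, lang) pairs to `path`; return the number of lines written."""
    n_written = 0
    with open(path, "w", encoding="utf-8") as f:
        for text, lang in examples:
            if not text.strip():
                continue  # skip empty texts, which carry no signal
            f.write(to_fasttext_line(text, lang) + "\n")
            n_written += 1
    return n_written
```

`write_dataset` takes `path` as a plain `str` rather than a `pathlib.Path`, in line with the "Use str paths with floret" commit in this PR, so the same value could be handed on to the training call without conversion.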

Motivation and Context

I wanted to update my home-brewed language identification code to use something a bit more standard and, ideally, faster / more accurate. I also want to use floret for other purposes, so this use case brings it into textacy for later.

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation, and I have updated it accordingly.

bdewilde added 18 commits March 18, 2023 13:08

placeholder fixup

1dd9904

feat: Update univ dep dataset version

fbcff68

feat: Make char tok a subclass of official dummy

974b073

feat: Add script to prep langid datasets only

538211d

feat: Tweak data sizes for langid datasets

4a3640e

feat: Save a couple pipeline configs

deb427d

feat: Delete older model config

95df5b0

feat: Add Ted dataset for lang id

ff65066

fix: Fix stale link for udhr dataset

026992c

feat: Add SETimes dataset for langid

520008d

feat: Add script to prepare v3 langid dataset

9113ca1

feat: Update LangId class to v3 model

6104f4b

feat: Delete cfgs for failed experimental langids

99b38c9

feat: Delete chartokenizer for exp langid model

af26b40

fix: Use str paths with floret

a6bab11

feat: Add script to train v3 lang id model

7cae2fe

docs: Update lang id module docstring for v3

09e962d

build: Use v3 langid in CI

5a1664a

bdewilde changed the base branch from main to develop April 2, 2023 20:53

bdewilde added 2 commits April 2, 2023 16:54

feat: Remove old langid dataset script

dd951d6

build: Add missing floret pkg dep

e8ba446

bdewilde marked this pull request as ready for review April 2, 2023 21:04

bdewilde merged commit 1b8059b into develop Apr 2, 2023

bdewilde deleted the try-standard-lang-identifier-model branch April 2, 2023 21:07
