lukasgarbas
Collaborator

Added the Ukrainian NER dataset from the lang-uk project. The fixed train and test splits are taken from lang-uk/flair-ner:

from flair.datasets import NER_UKRAINIAN

corpus = NER_UKRAINIAN()

print(corpus)
# Corpus: 7886 train + 876 dev + 4045 test sentences

print(corpus.train[161])  # sentence example
# "І СхідSide втратив Дудка ..." → ["СхідSide"/ORG, "Дудка"/PERS]

Also added the Ukrainian Universal Dependencies treebank from UniversalDependencies:

from flair.datasets import UD_UKRAINIAN

corpus = UD_UKRAINIAN()

print(corpus)
# Corpus: 5521 train + 673 dev + 898 test sentences

print(corpus.train[9])  # sentence example
# "Бо самою авторкою всі акценти розставлено зовсім очевидно." → ["Бо"/бо/SCONJ/Css/mark, ...

@lukasgarbas
Collaborator Author

I also trained a few models on Ukrainian NER:

| embeddings | method | parameters | dev F1 (micro) | test F1 (micro) |
| --- | --- | --- | --- | --- |
| electra-base-ukrainian | fine_tune() | lr: 5e-5, batch: 16 | 95.02 | 88.39 |
| Flair uk-forward, Flair uk-backward | train() | default | 86.20 | 81.42 |
| electra-base-ukrainian | train() | default, fine_tune: False, layers: 'all', layer_mean: True | 92.87 | 87.38 |
| electra-base-ukrainian, Flair uk-forward, Flair uk-backward | train() | default, fine_tune: False, layers: 'all', layer_mean: True | 94.22 | 88.61 |
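
For readers unfamiliar with the metric, here is a minimal sketch of how a micro-averaged F1 over entity spans can be computed. It is not Flair's evaluation code. The `micro_f1` helper, the `(start, end, label)` span format, and the toy data (including the `LOC` label) are assumptions made for illustration only.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over entity spans.

    gold, pred: one set of (start, end, label) tuples per sentence.
    A predicted span counts as correct only if boundaries and label both match.
    (Hypothetical helper for illustration, not part of Flair.)
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # spans found with the right label
        fp += len(p - g)   # predicted spans that are not in gold
        fn += len(g - p)   # gold spans that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: two sentences, three gold entities in total.
gold = [{(1, 2, "ORG"), (3, 4, "PERS")}, {(0, 1, "LOC")}]
pred = [{(1, 2, "ORG"), (3, 4, "LOC")}, {(0, 1, "LOC")}]

score = micro_f1(gold, pred)
print(round(score, 4))
```

Counts are pooled over all sentences and labels before precision and recall are computed, so frequent entity types weigh more than rare ones. A macro average would instead compute F1 per label and average those values.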

@alanakbik
Collaborator

@lukasgarbas thanks for adding these datasets, and for posting these numbers!

@alanakbik alanakbik merged commit ff74a9f into flairNLP:master Jan 27, 2023