Added Ukrainian NER and UD datasets #3069

lukasgarbas · 2023-01-26T14:23:23Z

Added the Ukrainian NER dataset from the lang-uk project. The fixed train and test splits are taken from lang-uk/flair-ner:

from flair.datasets import NER_UKRAINIAN

corpus = NER_UKRAINIAN()

print(corpus)
# Corpus: 7886 train + 876 dev + 4045 test sentences

print(corpus.train[161])  # sentence example
# "І СхідSide втратив Дудка ..." → ["СхідSide"/ORG, "Дудка"/PERS]
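
For readers unfamiliar with how token-level tags turn into the entity list printed above, here is a minimal sketch. It assumes a BIO tagging scheme (`B-`/`I-`/`O` prefixes); the helper `bio_to_spans` and the tag sequence below are hypothetical illustrations, not part of Flair or of the dataset files.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity text, label) pairs from BIO-tagged tokens."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            # A new entity starts here; close any open one first.
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)  # continue the open entity
        else:
            # "O" tag: close any open entity.
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans


# Assumed tags for the first four tokens of the example sentence.
tokens = ["І", "СхідSide", "втратив", "Дудка"]
tags = ["O", "B-ORG", "O", "B-PERS"]
print(bio_to_spans(tokens, tags))
```

With the assumed tags, this yields the same two entities as the sentence example: `СхідSide` as ORG and `Дудка` as PERS.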

Also added the Ukrainian Universal Dependencies treebank from UniversalDependencies:

from flair.datasets import UD_UKRAINIAN

corpus = UD_UKRAINIAN()

print(corpus)
# Corpus: 5521 train + 673 dev + 898 test sentences

print(corpus.train[9])  # sentence example
# "Бо самою авторкою всі акценти розставлено зовсім очевидно." → ["Бо"/бо/SCONJ/Css/mark, ...
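
The slash-separated labels in the sentence example appear to correspond to CoNLL-U columns (form, lemma, UPOS, XPOS, dependency relation). The sketch below shows how one token line in that ten-column format splits into named fields. `parse_conllu_token` is a hypothetical helper, not Flair's reader, and the `id`, `head`, and `_` values in the sample line are assumed placeholders rather than values copied from the treebank.

```python
from typing import Dict

# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu_token(line: str) -> Dict[str, str]:
    """Split one tab-separated CoNLL-U token line into named fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))


# Illustrative line: form, lemma, UPOS, XPOS and deprel come from the example
# above; the id, head and "_" placeholders are assumptions, not treebank data.
line = "1\tБо\tбо\tSCONJ\tCss\t_\t6\tmark\t_\t_"
token = parse_conllu_token(line)
print("/".join(token[k] for k in ("form", "lemma", "upos", "xpos", "deprel")))
# prints: Бо/бо/SCONJ/Css/mark
```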

lukasgarbas · 2023-01-27T11:27:36Z

I also trained a few models on Ukrainian NER:

embeddings	method	parameters	dev F1 (micro)	test F1 (micro)
electra-base-ukrainian	fine_tune()	lr: 5e-5, batch: 16	95.02	88.39
Flair uk-forward, Flair uk-backward	train()	default	86.20	81.42
electra-base-ukrainian	train()	default, fine_tune: False, layers: 'all', layer_mean: True	92.87	87.38
electra-base-ukrainian, Flair uk-forward, Flair uk-backward	train()	default, fine_tune: False, layers: 'all', layer_mean: True	94.22	88.61
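
As a reminder of what the micro-averaged F1 columns measure, here is a minimal sketch of the usual definition over entity spans: true positives, false positives, and false negatives are pooled across all sentences before precision and recall are computed. This is an illustration only and is not necessarily identical to Flair's own evaluation code; the `Span` type and the toy data are made up for the example. The table reports the score as a percentage, while the function returns a fraction.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start token, end token, label)


def micro_f1(gold: List[Set[Span]], pred: List[Set[Span]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all sentences, then compute F1."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # spans predicted with the right boundaries and label
        fp += len(p - g)   # predicted spans that are not in the gold set
        fn += len(g - p)   # gold spans that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: two sentences, three gold spans, three predicted spans.
gold = [{(1, 2, "ORG"), (3, 4, "PERS")}, {(0, 1, "LOC")}]
pred = [{(1, 2, "ORG")}, {(0, 1, "LOC"), (2, 3, "PERS")}]
print(round(micro_f1(gold, pred), 4))  # prints 0.6667
```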

alanakbik · 2023-01-27T11:30:06Z

@lukasgarbas thanks for adding these datasets, and for posting these numbers!

lukasgarbas added 2 commits January 26, 2023 13:50

added Ukrainian NER and UD datasets

29f5bb8

Added Ukrainian UD corpus and fixed formatting

ce205db

alanakbik merged commit ff74a9f into flairNLP:master Jan 27, 2023
