Add support for new MasakhaNER v2 dataset #2971

@stefan-it

Hi,

MasakhaNER v2 was recently accepted at EMNLP 2022 and the new dataset is already available online here.

Preprint is available here.

It should be relatively easy to add this dataset.

The existing v1 implementation takes the following arguments:

class NER_MASAKHANE(MultiCorpus):
    def __init__(
        self,
        languages: Union[str, List[str]] = "luo",
        base_path: Union[str, Path] = None,
        in_memory: bool = True,
        **corpusargs,
    ):

I think we can simply add a version argument that defaults to v1 to ensure backward compatibility?

Version-dependent logic, such as the available languages and the GitHub folder paths, could then be added.
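
To make the idea concrete, here is a minimal standalone sketch of how the version-dependent logic might look. It is not the actual flair implementation: the helper name `resolve_data_paths`, the two lookup tables, the URLs (placeholders), and the language sets (small illustrative subsets, not the full lists) are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Union

# Illustrative subsets only (assumption); the real lists would be taken
# from the v1 and v2 dataset repositories.
SUPPORTED_LANGUAGES: Dict[str, Set[str]] = {
    "v1": {"luo", "hau", "yor"},
    "v2": {"luo", "hau", "yor", "zul"},
}

# Placeholder URLs (assumption), standing in for the GitHub folder paths.
DATA_ROOTS: Dict[str, str] = {
    "v1": "https://example.org/masakhaner/v1/data",
    "v2": "https://example.org/masakhaner/v2/data",
}


def resolve_data_paths(
    languages: Union[str, List[str]] = "luo",
    version: str = "v1",  # defaulting to v1 keeps existing calls working
) -> Dict[str, str]:
    """Map each requested language to its version-specific data folder."""
    if version not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unknown version {version!r}; expected one of {sorted(SUPPORTED_LANGUAGES)}"
        )

    # Accept a single language code as well as a list of codes.
    if isinstance(languages, str):
        languages = [languages]

    unsupported = [lang for lang in languages if lang not in SUPPORTED_LANGUAGES[version]]
    if unsupported:
        raise ValueError(f"Languages {unsupported} are not available in {version}")

    return {lang: f"{DATA_ROOTS[version]}/{lang}" for lang in languages}
```

Calling `resolve_data_paths()` with no arguments behaves as before (v1, "luo"), while `resolve_data_paths(["luo", "zul"], version="v2")` would pick up the v2 folders. Inside `NER_MASAKHANE.__init__`, the same lookup could be driven by a new `version: str = "v1"` keyword argument.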
