Improve parsing of Wikipedia articles with leading formatting boxes #1

@davidmezzetti

Description

Currently, a small subset of articles retains leading formatting boxes in the text. The main parser for the Wikipedia dataset filters most of these out, but for about 7K articles it doesn't. As a result, the logic that builds the abstract text pulls the wrong text.

This change will handle this case better.
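
As a rough illustration of the kind of handling this could involve, here is a minimal sketch, not the actual fix. It assumes the leftover formatting boxes show up as short leading paragraphs that don't end in sentence punctuation. The names `is_prose`, `build_abstract` and `MIN_WORDS` are hypothetical and not part of the existing code.

```python
import re

# Hypothetical threshold: paragraphs with fewer words are treated as box residue
MIN_WORDS = 8


def is_prose(paragraph: str) -> bool:
    """Heuristic check that a paragraph looks like a sentence rather than box text."""
    text = paragraph.strip()
    # Ignore trailing quotes and brackets before checking for sentence punctuation
    core = text.rstrip("\"')]")
    return len(text.split()) >= MIN_WORDS and core.endswith((".", "!", "?"))


def build_abstract(text: str) -> str:
    """Return the first prose-like paragraph, skipping leading formatting boxes."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    for paragraph in paragraphs:
        if is_prose(paragraph):
            return paragraph
    # Fall back to the first paragraph when nothing looks like prose
    return paragraphs[0] if paragraphs else ""


article = (
    "Part of a series on\nPhysics\n\n"
    "Physics is the natural science that studies matter, its motion "
    "and behavior through space and time."
)
print(build_abstract(article))
```

With this heuristic the leading "Part of a series on" box is skipped and the abstract starts at the first real sentence. The word threshold and punctuation check are guesses and would need tuning against the affected articles.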
