filter_with_english_letters #46

@rsnair2

This function is currently listed as a TO-DO. I was looking over the source from Mozilla, and it seems there could be a bug in it.

From what I can tell, the original intention of this function was to remove all markup tags. It's used in the LatinProber, and I imagine the idea is to strip markup tags, which will probably contain English letters and words, so that they don't incorrectly skew our confidence.

The current behavior, though, does not do that. A simple example:

<some tag> outside <some tag>

returns

tag outside tag

It includes parts of the text inside a tag whenever the tag contains multiple words separated by any kind of punctuation. I can look into this, but I wanted to know your thoughts on this first.
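To make the report concrete, here is a minimal Python sketch of logic that would produce the behavior described above, next to one possible fix. It is reconstructed from the example output only and is not the actual Mozilla or chardet code; the names `filter_buggy` and `filter_fixed` are hypothetical, and the sketch works on `str` rather than on raw bytes to keep it short.

```python
def filter_buggy(text: str) -> str:
    """Sketch of the reported behavior (assumed logic, not the real source).

    The in-tag flag is cleared on '>' *before* the pending run of letters
    is flushed, so the last word inside a tag leaks into the output.
    """
    out = []
    in_tag = False
    start = 0  # index where the current run of letters began
    for i, ch in enumerate(text):
        if ch == ">":
            in_tag = False
        elif ch == "<":
            in_tag = True
        # Any ASCII non-letter ends the current run of letters.
        if ch.isascii() and not ch.isalpha():
            if i > start and not in_tag:
                out.append(text[start:i])
                out.append(" ")
            start = i + 1
    if not in_tag and start < len(text):
        out.append(text[start:])
    return "".join(out)


def filter_fixed(text: str) -> str:
    """One possible fix: flush the pending run first, then update the flag."""
    out = []
    in_tag = False
    start = 0
    for i, ch in enumerate(text):
        if ch.isascii() and not ch.isalpha():
            # Decide using the state that applied to the run itself.
            if i > start and not in_tag:
                out.append(text[start:i])
                out.append(" ")
            start = i + 1
            if ch == "<":
                in_tag = True
            elif ch == ">":
                in_tag = False
    if not in_tag and start < len(text):
        out.append(text[start:])
    return "".join(out)


sample = "<some tag> outside <some tag>"
print(repr(filter_buggy(sample)))  # 'tag outside tag '
print(repr(filter_fixed(sample)))  # 'outside '
```

The only difference between the two functions is the order of the flag update and the flush. Whether the real function should also handle cases such as a literal `<` in plain text is a separate question that this sketch does not try to answer.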
