Closed
Description
Currently, this function is listed as a TO-DO. I was looking over the Mozilla source, and it seems there could be a bug in it.
From what I can tell, the original intention of this function was to remove all markup tags. It's used in the LatinProber, and I imagine the idea is to drop the contents of tags, which will probably contain English letters and words, so that they don't skew our confidence.
The current behavior though is not that. A simple example:
<some tag> outside <some tag>
returns
tag outside tag
It includes part of the text within a tag if the tag contains multiple words separated by whitespace or any kind of punctuation. I can look into this, but I wanted to know your thoughts on it first.
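To make the expected behavior concrete, here is a minimal sketch of what I would assume the function should do: drop everything between `<` and `>` and keep only the words outside tags. The name `filter_outside_tags` is hypothetical, and this is not the existing implementation or the Mozilla code, just an illustration of the intent as I understand it.

```python
def filter_outside_tags(buf: bytes) -> bytes:
    """Hypothetical sketch: keep only words that appear outside <...> tags.

    Words are runs of ASCII letters or high-bit bytes (>= 0x80); each kept
    word that is terminated by a delimiter is followed by a single space.
    """
    filtered = bytearray()
    in_tag = False
    word_start = 0  # index where the current candidate word begins

    for i in range(len(buf)):
        ch = buf[i:i + 1]

        if in_tag:
            # Skip everything inside a tag, including its closing '>'.
            if ch == b">":
                in_tag = False
            word_start = i + 1
            continue

        # Outside a tag: '<' and any other ASCII non-letter end the word.
        if ch == b"<" or (ch < b"\x80" and not ch.isalpha()):
            if i > word_start:
                filtered.extend(buf[word_start:i])
                filtered.extend(b" ")
            if ch == b"<":
                in_tag = True
            word_start = i + 1

    # Flush a trailing word, unless the buffer ends inside an unclosed tag.
    if not in_tag and word_start < len(buf):
        filtered.extend(buf[word_start:])
    return bytes(filtered)


print(filter_outside_tags(b"<some tag> outside <some tag>"))
```

With this sketch, the example above keeps only `outside` and nothing from inside either tag. Dropping text after an unclosed `<` is an assumption on my part; the right behavior for malformed markup is open to discussion.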