Modified filter_english_with_letters to mimic the behavior from Mozilla's version. #44
Conversation
…avior of mozillas version
Can you give an example of a string where the new function returns something different from the old version? As far as I can tell, it seems like you're still just looking for English letters and converting everything else to spaces.
@dan-blanchard, so from what I can understand, the original implementation from Mozilla seems to do three things:
So, overall I think that what this is doing is removing all English characters and words that contain English characters, and tokenizing the string to remove punctuation. What we should get at the end is a space-separated sequence consisting only of words in the other language. That being said, I have written a small test script that compares the output of the C++ implementation with both the current version and the one I am suggesting. I will attach it later tonight so you can check it out.
@dan-blanchard, I am not sure of the best way to share the test setup for the function with you, so I have attached it to Dropbox. You can view it at https://www.dropbox.com/s/l98bdpdg5oqo3xu/test.zip?dl=0. The test writes randomly generated files of bytes in range(0, 256), calls the C++ function (which I copy-pasted from Mozilla's version), and tests it against both the current version of the function and the updated one.
@rsnair2 thanks for putting together that test script! I just had to play around with it a bit, and I was really surprised to see how different what you're proposing is versus what we're currently doing. Then I looked back at the original Mozilla code and realized that our current approach is most definitely a mistake, but one that was due to poor naming in the original code. There is an all-important comment above
So, this isn't a filter that removes English characters, as I and others (probably even Mark Pilgrim) thought, but rather a filter that is only applied to sequences we think are from codecs that don't use English characters. What the filter actually does is replace every run of bytes below 128 with a single space, unless the run is a mixture of English letters and bytes above 127. For example: b'This is \x32 438 a 6 test \x80\x81 \x80\x81oo sdj \x80\x81 .\x80\x32' becomes: b'\x80\x81 \x80\x81oo \x80\x81 \x80 '
@rsnair2 I've tweaked your code a bit to make it more Pythonic (and potentially more efficient, since it's not creating temporary strings all over the place anymore).

```python
from io import BytesIO

from six import indexbytes


def filter_without_english_letters_proposed(aBuf):
    """
    Filter to be applied to bytes that we suspect contain non-English codecs.

    :returns: a copy of aBuf that contains only \\x80-\\xFF sequences with
              optional English letters in them, separated by spaces
    """
    newStr = BytesIO()
    prev = 0
    meetMSB = False
    for curr in range(len(aBuf)):
        ord_curr = indexbytes(aBuf, curr)
        if ord_curr >= 128:
            meetMSB = True
        # If current character is not an English letter...
        elif not (65 <= ord_curr <= 90 or   # A-Z
                  97 <= ord_curr <= 122):   # a-z
            # ... and we've seen a byte above 127
            if meetMSB and curr > prev:
                # Keep substring from right after previous non-letter byte
                # (below 128) to current
                newStr.write(aBuf[prev:curr])
                newStr.write(b' ')
                meetMSB = False
            # Update position
            prev = curr + 1
    # If we hit the end of the string while in a sequence we wanted
    # to keep...
    if meetMSB:
        # Keep everything after previous non-letter byte (below 128)
        newStr.write(aBuf[prev:])
    return newStr.getvalue()
```

Please update your PR to include the updated function, and I'll get this merged.
@dan-blanchard if we're going to use magic integers I'd much rather they be constants. I'm -1 on the code as you've written it.
@sigmavirus24 I agree. I only did it that way because it was simpler to write in a 2/3-compatible way. In that case, this alternative is probably more readable (and avoids magic altogether).

```python
import sys
from io import BytesIO

PY3 = sys.version_info >= (3, 0)


def filter_without_english_letters_proposed(aBuf):
    """
    Filter to be applied to bytes that we suspect contain non-English codecs.

    :returns: a copy of aBuf that contains only \\x80-\\xFF sequences with
              optional English letters in them, separated by spaces
    """
    newStr = BytesIO()
    prev = 0
    meetMSB = False
    for curr in range(len(aBuf)):
        # On Python 3, indexing bytes yields an int, so wrap it back up
        curr_byte = bytes((aBuf[curr],)) if PY3 else aBuf[curr]
        if curr_byte >= b'\x80':
            meetMSB = True
        # If current character is not an English letter...
        elif not (b'A' <= curr_byte <= b'Z' or
                  b'a' <= curr_byte <= b'z'):
            # ... and we've seen a byte above 127
            if meetMSB and curr > prev:
                # Keep substring from right after previous non-letter byte
                # (below 128) to current
                newStr.write(aBuf[prev:curr])
                newStr.write(b' ')
                meetMSB = False
            # Update position
            prev = curr + 1
    # If we hit the end of the string while in a sequence we wanted
    # to keep...
    if meetMSB:
        # Keep everything after prev
        newStr.write(aBuf[prev:])
    return newStr.getvalue()
```
I would rather keep the old code but define constants for the magic numbers. Ternary expressions based on the version of Python are just going to confuse new contributors.
The previous version did not work with Python 3, because when indexing a bytes object there you get an int rather than a one-byte string.
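A quick illustration of the Python 2 vs. 3 difference in byte indexing being discussed here:

```python
buf = b'\x80abc'

# Python 3: indexing a bytes object yields an int, not a length-1 bytes
assert isinstance(buf[0], int)
assert buf[0] == 0x80

# Slicing, by contrast, still yields bytes on both Python 2 and 3
assert buf[:1] == b'\x80'

# Consequently, comparisons like buf[0] >= b'\x80' that worked on
# Python 2 (str vs. str) raise TypeError on Python 3 (int vs. bytes).
```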
I'm honestly not a big fan of how C-like the whole function is, because it took me a while to grasp that sequences like |
@dan-blanchard, I couldn't agree more - this function is confusingly named, and it's pretty confusing about what it does. However, @sigmavirus24 did mention that regexes can be inefficient, and given that chardet is already slow, I don't know if this is the ideal thing to do. But then again, the string is filtered only once, so maybe the performance impact won't be large. P.S. Sorry, I accidentally closed this by clicking "Close and comment". I don't know how to undo this.
So this should stay open. Can we use the docstring to provide a very good explanation of what this function does? |
@rsnair2 All of the regex versions of this I tested that almost worked—they each had their own little corner cases that didn't perform exactly the same—were roughly 4 times faster when doing some basic testing with
@sigmavirus24 I attempted to do that with my updated version, but I can only try to describe what it does, not why it does it. |
I have yet to find a case where using a regular expression is the more performant option. In every case it comes down to being easier for me. There are also really good benchmarks and examples provided by the
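As a toy illustration of the kind of measurement being discussed (these are not the PR's actual functions, just a byte-by-byte loop versus a precompiled regex doing the same trivial letter-stripping), `timeit` makes the comparison straightforward:

```python
import re
import timeit

data = b'abc\x80\x81def ' * 10000

def loop_strip(buf):
    # Replace every ASCII letter with a space, one byte at a time
    return bytes(0x20 if (0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A) else b
                 for b in buf)

LETTERS = re.compile(b'[A-Za-z]')

def regex_strip(buf):
    # Same replacement, delegated to the regex engine in one call
    return LETTERS.sub(b' ', buf)

# Both approaches must agree before timing means anything
assert loop_strip(data) == regex_strip(data)

print('loop: ', timeit.timeit(lambda: loop_strip(data), number=20))
print('regex:', timeit.timeit(lambda: regex_strip(data), number=20))
```

On large inputs the regex version typically wins, since the inner loop runs in C rather than in Python bytecode, which matches the speedups reported elsewhere in this thread.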
… and comments. note - not compatible with P3
@dan-blanchard, @sigmavirus24 I updated the fork to use a regex instead of the for loop. I profiled it with timeit, and it seems to be around twice as fast for large strings with the regex. I also changed some of the variables around. It's still not ready though; it does not work with Python 3. I will work on that sometime today if I get the chance.
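The regex version that landed in the PR isn't quoted in this thread. As a sketch of what such a rewrite could look like on Python 3 (the pattern, function name, and structure here are assumptions, not the PR's actual code), the loop's behavior can be reproduced with `re.finditer`:

```python
import re

# Runs of English letters and/or high bytes; everything else is a delimiter
HIGH_RUN = re.compile(b'[a-zA-Z\x80-\xff]+')


def filter_without_english_letters(buf):
    out = []
    for m in HIGH_RUN.finditer(buf):
        word = m.group()
        if max(word) < 0x80:
            # A run of plain English letters with no high bytes: drop it
            continue
        out.append(word)
        if m.end() < len(buf):
            # A delimiter byte follows, so emit the separating space
            out.append(b' ')
    return b''.join(out)


sample = b'This is \x32 438 a 6 test \x80\x81 \x80\x81oo sdj \x80\x81 .\x80\x32'
print(filter_without_english_letters(sample))
```

The `m.end() < len(buf)` check preserves a subtlety of the loop version: a kept run followed by a delimiter gets a trailing space, while a run that ends exactly at the end of the buffer does not.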
Alright, I tested with Python 3 and 2. Seems good.
Thanks for this @rsnair2. This looks much better now! 👍 @sigmavirus24 I'm +1 to merge, how about yourself? |
I'd like the docstring written differently, but I'll submit a PR with how I'd like it formatted/written. |
This change helps pass three unit tests that were failing before. I have also tested the changes by comparing the output of this function with Mozilla's version over some large randomly generated byte strings, and so far so good.