
Modified filter_english_with_letters to mimic the behavior from Mozilla's version. #44


Merged
merged 6 commits on Dec 19, 2014

Conversation

rsnair2
Copy link
Contributor

@rsnair2 rsnair2 commented Dec 5, 2014

This change helps pass three unit tests that were failing before. I have also tested the changes by comparing the output of this function with Mozilla's version over some large randomly generated byte strings and so far so good.

@dan-blanchard
Copy link
Member

Can you give an example of a string where the new function returns something different from the old version? As far as I can tell, you're still just looking for English letters and converting everything else to spaces.

@rsnair2
Copy link
Contributor Author

rsnair2 commented Dec 6, 2014

@dan-blanchard, from what I can understand, the original Mozilla implementation seems to do three things:

  1. It removes byte sequences that start with a high byte and end with a valid ASCII English letter. Such a sequence cannot include low bytes that are non-alphabetical, but it can include high bytes, so some high bytes may be removed as well. Also, the alphabetical characters don't seem to be replaced with a space but are removed altogether.
  2. It keeps sequences that start with a high byte, end with a low ASCII byte that is not an English letter, and contain no English letters. That trailing low ASCII byte is replaced with a space.
  3. All other sequences are skipped.

So, overall, I think what this does is remove all English characters and all words that contain English characters, and tokenize the string to remove punctuation. What we should get at the end is a space-separated sequence comprising only words in the other language.

That being said, I have written a small test script that compares the output of the C++ implementation with both the current version and the one I am suggesting. I will attach it later tonight so you can check it out.

@rsnair2
Copy link
Contributor Author

rsnair2 commented Dec 6, 2014

@dan-blanchard, I am not sure of the best way to share the test setup for the function with you, so I have uploaded it to Dropbox: https://www.dropbox.com/s/l98bdpdg5oqo3xu/test.zip?dl=0

The test writes randomly generated bytes in range(0, 256) to a file, calls the C++ function (which I copy-pasted from Mozilla's version), and tests its output against both the current version of the function and the updated one.

@dan-blanchard
Copy link
Member

@rsnair2 thanks for putting together that test script! I just had to play around with it a bit, and I was really surprised to see how different what you're proposing is versus what we're currently doing. Then I looked back at the original Mozilla code and realized that our current approach is most definitely a mistake, but one that was due to poor naming in the original code.

There is an all-important comment above nsCharSetProber::FilterWithoutEnglishLetters in nsCharSetProber.cpp. It says:

This filter applies to all scripts which do not use English characters

So, this isn't a filter that removes English characters, as I and others (probably even Mark Pilgrim) thought, but rather a filter that is only applied to sequences we think come from codecs that don't use English characters.

What the filter actually does is replace all byte sequences less than 127 with a single space, unless the sequence is a mixture of English letters and bytes above 127.

For example:

b'This is \x32 438 a  6 test \x80\x81 \x80\x81oo  sdj   \x80\x81   .\x80\x32'

becomes:

b'\x80\x81 \x80\x81oo \x80\x81 \x80 '

@dan-blanchard
Copy link
Member

@rsnair2 I've tweaked your code a bit to make it a bit more Pythonic (and potentially more efficient, since it's not creating temporary strings all over the place anymore).

from io import BytesIO
from six import indexbytes


def filter_without_english_letters_proposed(aBuf):
    """
    Filter to be applied to bytes that we suspect contain non-English codecs.

    :returns: a copy of aBuf that contains only \\x80-\\xFF sequences with
              optional English letters in them, separated by spaces
    """
    newStr = BytesIO()
    prev = 0
    curr = 0
    meetMSB = False

    for curr in range(len(aBuf)):
        ord_curr = indexbytes(aBuf, curr)
        if ord_curr >= 128:
            meetMSB = True
        # If current character is not an English letter...
        elif not (65 <= ord_curr <= 90 or  # A-Z
                  97 <= ord_curr <= 122):  # a-z
            # ... and we've seen a byte above 127
            if meetMSB and curr > prev:
                # Keep substring from right after previous non-letter byte
                # (less than 127) to current
                newStr.write(aBuf[prev:curr])
                newStr.write(b' ')
                meetMSB = False
            # Update position
            prev = curr + 1

    # If we hit the end of the string while in a sequence we wanted
    # to keep...
    if meetMSB:
        # Keep everything after previous non-letter byte (less than 127)
        newStr.write(aBuf[prev:])

    return newStr.getvalue()

Please update your PR to include the updated function, and I'll get this merged.

@sigmavirus24
Copy link
Member

@dan-blanchard if we're going to use magic integers I'd much rather they be constants. I'm -1 on the code as you've written it

@dan-blanchard
Copy link
Member

@sigmavirus24 I agree. I only did it that way because it was simpler to write in a 2/3-compatible manner.

In that case, this alternative is probably more readable (and avoids magic numbers altogether).

import sys
from io import BytesIO

PY3 = sys.version_info >= (3, 0)


def filter_without_english_letters_proposed(aBuf):
    """
    Filter to be applied to bytes that we suspect contain non-English codecs.

    :returns: a copy of aBuf that contains only \\x80-\\xFF sequences with
              optional English letters in them, separated by spaces
    """
    newStr = BytesIO()
    prev = 0
    curr = 0
    meetMSB = False

    for curr in range(len(aBuf)):
        curr_byte = bytes((aBuf[curr],)) if PY3 else aBuf[curr]
        if curr_byte >= b'\x80':
            meetMSB = True
        # If current character is not an English letter...
        elif not (b'A' <= curr_byte <= b'Z' or
                  b'a' <= curr_byte <= b'z'):
            # ... and we've seen a byte above 127
            if meetMSB and curr > prev:
                # Keep substring from right after previous non-letter byte
                # (less than 127) to current
                newStr.write(aBuf[prev:curr])
                newStr.write(b' ')
                meetMSB = False
            # Update position
            prev = curr + 1

    # If we hit the end of the string while in a sequence we wanted
    # to keep...
    if meetMSB:
        # Keep everything after prev
        newStr.write(aBuf[prev:])

    return newStr.getvalue()

@sigmavirus24
Copy link
Member

I would rather keep the old code but with defined constants for the magic numbers. Ternary expressions based on the version of Python are just going to confuse new contributors.

@dan-blanchard
Copy link
Member

The previous version did not work with Python 3, because when indexing a bytes object in Python 3, you get an int. There's going to need to be a ternary expression based on the version number in there somewhere.
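The difference is easy to demonstrate (this snippet assumes it is run under Python 3):

```python
# Under Python 3, indexing a bytes object yields an int,
# while slicing still yields a bytes object of length 1.
buf = b'abc'
print(type(buf[0]))  # <class 'int'> under Python 3
print(buf[0])        # 97, the ordinal of 'a'
print(buf[0:1])      # b'a' -- slicing preserves the bytes type
```

Under Python 2, `buf[0]` would instead be the one-character string `'a'`, which is why comparisons like `curr_byte >= b'\x80'` need the version-dependent conversion.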

@dan-blanchard
Copy link
Member

I'm honestly not a big fan of how C-like the whole function is, because it took me a while to grasp that sequences like S\x84Z would be retained. It seems like it should be simple to make a regex that says "keep only sequences in the set [\x80-\xFFa-zA-Z], where there must be at least one character in the range [\x80-\xFF]", but I've failed enough times for today.
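For what it's worth, one candidate pattern along those lines could look like the sketch below (the names `INTERNATIONAL_WORD` and `filter_with_regex` are hypothetical, and it is not guaranteed to agree with the loop version on every corner case):

```python
import re

# Hypothetical sketch: find runs that contain at least one high byte
# (\x80-\xFF), with optional adjoining English letters on either side.
INTERNATIONAL_WORD = re.compile(b'[a-zA-Z]*[\x80-\xff]+[a-zA-Z]*')

def filter_with_regex(buf):
    """Join the kept runs with single spaces, with a trailing space to
    mirror the loop version's output."""
    words = INTERNATIONAL_WORD.findall(buf)
    return b' '.join(words) + (b' ' if words else b'')
```

On the example above this yields b'\x80\x81 \x80\x81oo \x80\x81 \x80 ', matching the loop's output, though interleaved runs like b'a\x80b\x81c' get split into two tokens where the loop would keep them as one.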

@rsnair2
Copy link
Contributor Author

rsnair2 commented Dec 9, 2014

@dan-blanchard, I couldn't agree more - this function is confusingly named and confusing in what it does. However, @sigmavirus24 did mention that regexes can be inefficient, and given that chardet is already slow, I don't know if this is the ideal thing to do. But then again, the string is filtered only once, so maybe the performance impact won't be large.

P.S. Sorry, I accidentally closed this by clicking "close and comment". I don't know how to undo this.

@rsnair2 rsnair2 closed this Dec 9, 2014
@dan-blanchard dan-blanchard reopened this Dec 9, 2014
@sigmavirus24
Copy link
Member

So this should stay open. Can we use the docstring to provide a very good explanation of what this function does?

@dan-blanchard
Copy link
Member

However, @sigmavirus24 did mention that regex can be inefficient, and given that chardet is already slow, I don't know if this is the ideal thing to do.

@rsnair2 All of the regex versions of this I tested that almost worked (they each had their own little corner cases where they didn't behave exactly the same) were roughly 4 times faster in some basic testing with %timeit in IPython. Regular expressions can be slower in Python, but usually only when what you're doing is so simple that you don't need them in the first place. For example, checking for the presence of a simple substring is much slower with a regular expression than with the in operator. In our case we're iterating through a string and performing a number of bookkeeping operations, which is almost certainly going to be slower than the equivalent regex (if we could formulate it).
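The substring point can be checked with a rough micro-benchmark (a sketch; exact numbers vary by machine, and the haystack and repeat count are arbitrary choices here):

```python
import re
import timeit

haystack = (b'spam ' * 10000) + b'needle'

# Simple membership test with the `in` operator.
t_in = timeit.timeit(lambda: b'needle' in haystack, number=200)

# The same check via a compiled regular expression.
pattern = re.compile(b'needle')
t_re = timeit.timeit(lambda: pattern.search(haystack) is not None,
                     number=200)

print('in operator: %.4fs, re.search: %.4fs' % (t_in, t_re))
```

For plain substring checks like this, the `in` operator is typically the faster of the two; the regex only starts to pay off once the match logic involves alternation or character classes.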

Can we use the docstring to provide a very good explanation of what this function does?

@sigmavirus24 I attempted to do that with my updated version, but I can only try to describe what it does, not why it does it.

@sigmavirus24
Copy link
Member

I am hard-pressed to find a case where using a regular expression is the more performant option. In every case, it comes down to regexes being easier for me. There are also really good benchmarks and examples provided by the re2 library that show how inefficient Python's standard re module actually is.

@dan-blanchard
Copy link
Member

re2 and regex are definitely both much more efficient than the standard re module in most cases, but that does not mean re can't be more efficient than a for loop. Loops have quite a bit of overhead in Python.

… and comments. note - not compatible with P3
@rsnair2
Copy link
Contributor Author

rsnair2 commented Dec 13, 2014

@dan-blanchard, @sigmavirus24 I updated the fork to use a regex instead of the for loop. I profiled it with timeit, and for large strings the regex seems to be around twice as fast. I also changed some of the variables around. It's still not ready, though; it does not work with Python 3. I will work on that sometime today if I get the chance.

@rsnair2
Copy link
Contributor Author

rsnair2 commented Dec 14, 2014

Alright, I tested with Python 3 and 2. Seems good.

@dan-blanchard
Copy link
Member

Thanks for this @rsnair2. This looks much better now!

👍

@sigmavirus24 I'm +1 to merge, how about yourself?

@sigmavirus24
Copy link
Member

I'd like the docstring written differently, but I'll submit a PR with how I'd like it formatted/written.

sigmavirus24 added a commit that referenced this pull request Dec 19, 2014
Modified filter_english_with_letters to mimic the behavior form Mozilla's version.
@sigmavirus24 sigmavirus24 merged commit 0ad6242 into chardet:master Dec 19, 2014
@dan-blanchard dan-blanchard mentioned this pull request Apr 11, 2017