
Modified filter_english_with_letters to mimic the behavior from Mozilla's version. #44


Merged
merged 6 commits on Dec 19, 2014

Conversation

rsnair2
Copy link
Contributor

@rsnair2 rsnair2 commented Dec 5, 2014

This change helps pass three unit tests that were failing before. I have also tested the changes by comparing the output of this function with Mozilla's version over some large randomly generated byte strings and so far so good.

@dan-blanchard
Copy link
Member

Can you give an example of a string where the new function returns something different from the old version? As far as I can tell, you're still just looking for English letters and converting everything else to spaces.

@rsnair2
Copy link
Contributor Author

rsnair2 commented Dec 6, 2014

@dan-blanchard, from what I can understand, the original Mozilla implementation seems to do three things:

  1. It removes byte sequences that start with a high byte and end with a valid ASCII English letter. Such a sequence cannot include low bytes that are non-alphabetical, but it can include high bytes, so some high bytes may be removed as well. Also, the alphabetical characters don't seem to be replaced with a space but are removed altogether.
  2. It keeps sequences that start with a high byte, end with a low ASCII byte that is not an English letter, and contain no English letters. That trailing low ASCII byte is replaced with a space.
  3. All other sequences are skipped.

So, overall, I think what this does is remove all English characters and all words that contain English characters, and tokenize the string to remove punctuation. What we should get at the end is a space-separated sequence comprising only words in the other language.

That being said, I have written a small test script that compares the output of the C++ implementation with both the current version and the one I am suggesting. I will attach it later tonight so you can check it out.

@rsnair2
Copy link
Contributor Author

rsnair2 commented Dec 6, 2014

@dan-blanchard, I am not sure of the best way to share the test setup for the function with you, so I have uploaded it to Dropbox: https://www.dropbox.com/s/l98bdpdg5oqo3xu/test.zip?dl=0

The test writes randomly generated bytes in range(0, 256) to a file, calls the C++ function (which I copy-pasted from Mozilla's version), and tests its output against both the current version of the function and the updated one.

@dan-blanchard
Copy link
Member

@rsnair2 thanks for putting together that test script! I just had to play around with it a bit, and I was really surprised to see how different what you're proposing is versus what we're currently doing. Then I looked back at the original Mozilla code and realized that our current approach is most definitely a mistake, but one that was due to poor naming in the original code.

There is an all-important comment above nsCharSetProber::FilterWithoutEnglishLetters in nsCharSetProber.cpp. It says:

This filter applies to all scripts which do not use English characters

So, this isn't a filter that removes English characters, as I and others (probably even Mark Pilgrim) thought, but rather a filter that is only applied to sequences we think come from codecs that don't use English characters.

What the filter actually does is replace all byte sequences less than 127 with a single space, unless the sequence is a mixture of English letters and bytes above 127.

For example:

b'This is \x32 438 a  6 test \x80\x81 \x80\x81oo  sdj   \x80\x81   .\x80\x32'

becomes:

b'\x80\x81 \x80\x81oo \x80\x81 \x80 '

@dan-blanchard
Copy link
Member

@rsnair2 I've tweaked your code a bit to make it a bit more Pythonic (and potentially more efficient, since it's not creating temporary strings all over the place anymore).

from io import BytesIO
from six import indexbytes


def filter_without_english_letters_proposed(aBuf):
    """
    Filter to be applied to bytes that we suspect contain non-English codecs.

    :returns: a copy of aBuf that contains only \\x80-\\xFF sequences with
              optional English letters in them, separated by spaces
    """
    newStr = BytesIO()
    prev = 0
    curr = 0
    meetMSB = False

    for curr in range(len(aBuf)):
        ord_curr = indexbytes(aBuf, curr)
        if ord_curr >= 128:
            meetMSB = True
        # If current character is not an English letter...
        elif not (65 <= ord_curr <= 90 or  # A-Z
                  97 <= ord_curr <= 122):  # a-z
            # ... and we've seen a byte above 127
            if meetMSB and curr > prev:
                # Keep substring from right after previous non-letter byte
                # (less than 127) to current
                newStr.write(aBuf[prev:curr])
                newStr.write(b' ')
                meetMSB = False
            # Update position
            prev = curr + 1

    # If we hit the end of the string while in a sequence we wanted
    # to keep...
    if meetMSB:
        # Keep everything after previous non-letter byte (less than 127)
        newStr.write(aBuf[prev:])

    return newStr.getvalue()

Please update your PR to include the updated function, and I'll get this merged.

@sigmavirus24
Copy link
Member

@dan-blanchard if we're going to use magic integers I'd much rather they be constants. I'm -1 on the code as you've written it

@dan-blanchard
Copy link
Member

@sigmavirus24 I agree. I only did it that way because it was simpler to write in a 2/3-compatible manner.

In that case, this alternative is probably more readable (and avoids magic numbers altogether).

import sys
from io import BytesIO

PY3 = sys.version_info >= (3, 0)


def filter_without_english_letters_proposed(aBuf):
    """
    Filter to be applied to bytes that we suspect contain non-English codecs.

    :returns: a copy of aBuf that contains only \\x80-\\xFF sequences with
              optional English letters in them, separated by spaces
    """
    newStr = BytesIO()
    prev = 0
    curr = 0
    meetMSB = False

    for curr in range(len(aBuf)):
        curr_byte = bytes((aBuf[curr],)) if PY3 else aBuf[curr]
        if curr_byte >= b'\x80':
            meetMSB = True
        # If current character is not an English letter...
        elif not (b'A' <= curr_byte <= b'Z' or
                  b'a' <= curr_byte <= b'z'):
            # ... and we've seen a byte above 127
            if meetMSB and curr > prev:
                # Keep substring from right after previous non-letter byte
                # (less than 127) to current
                newStr.write(aBuf[prev:curr])
                newStr.write(b' ')
                meetMSB = False
            # Update position
            prev = curr + 1

    # If we hit the end of the string while in a sequence we wanted
    # to keep...
    if meetMSB:
        # Keep everything after prev
        newStr.write(aBuf[prev:])

    return newStr.getvalue()

@sigmavirus24
Copy link
Member

I would rather keep the old code but with defined constants for the magic numbers. Ternary expressions based on the version of Python are just going to confuse new contributors.

@dan-blanchard
Copy link
Member

The previous version did not work with Python 3, because when indexing a bytes object in Python 3, you get an int. There's going to need to be a ternary expression based on the version number in there somewhere.
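The difference is easy to demonstrate (this snippet assumes it is run under Python 3):

```python
# Under Python 3, indexing a bytes object yields an int,
# while slicing still yields a bytes object of length 1.
buf = b'abc'
print(type(buf[0]))  # <class 'int'> under Python 3
print(buf[0])        # 97, the ordinal of 'a'
print(buf[0:1])      # b'a' -- slicing preserves the bytes type
```

Under Python 2, `buf[0]` would instead be the one-character string `'a'`, which is why comparisons like `curr_byte >= b'\x80'` need the version-dependent conversion.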

@dan-blanchard
Copy link
Member

I'm honestly not a big fan of how C-like the whole function is, because it took me a while to grasp that sequences like S\x84Z would be retained. It seems like it should be simple to make a regex that says "keep only sequences in the set [\x80-\xFFa-zA-Z], where there must be at least one character in the range [\x80-\xFF]", but I've failed enough times for today.
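For what it's worth, one candidate pattern along those lines could look like the sketch below (the names `INTERNATIONAL_WORD` and `filter_with_regex` are hypothetical, and it is not guaranteed to agree with the loop version on every corner case):

```python
import re

# Hypothetical sketch: find runs that contain at least one high byte
# (\x80-\xFF), with optional adjoining English letters on either side.
INTERNATIONAL_WORD = re.compile(b'[a-zA-Z]*[\x80-\xff]+[a-zA-Z]*')

def filter_with_regex(buf):
    """Join the kept runs with single spaces, with a trailing space to
    mirror the loop version's output."""
    words = INTERNATIONAL_WORD.findall(buf)
    return b' '.join(words) + (b' ' if words else b'')
```

On the example above this yields b'\x80\x81 \x80\x81oo \x80\x81 \x80 ', matching the loop's output, though interleaved runs like b'a\x80b\x81c' get split into two tokens where the loop would keep them as one.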

@rsnair2
Copy link
Contributor Author

rsnair2 commented Dec 9, 2014

@dan-blanchard, I couldn't agree more - this function is confusingly named and confusing in what it does. However, @sigmavirus24 did mention that regexes can be inefficient, and given that chardet is already slow, I don't know if this is the ideal thing to do. But then again, the string is filtered only once, so maybe the performance impact won't be large.

P.S. Sorry, I accidentally closed this by clicking "close and comment". I don't know how to undo this.

@rsnair2 rsnair2 closed this Dec 9, 2014
@dan-blanchard dan-blanchard reopened this Dec 9, 2014
@sigmavirus24
Copy link
Member

So this should stay open. Can we use the docstring to provide a very good explanation of what this function does?

@dan-blanchard
Copy link
Member

However, @sigmavirus24 did mention that regex can be inefficient, and given that chardet is already slow, I don't know if this is the ideal thing to do.

@rsnair2 All of the regex versions of this I tested that almost worked (they each had their own little corner cases where they didn't behave exactly the same) were roughly 4 times faster in some basic testing with %timeit in IPython. Regular expressions can be slower in Python, but usually only when what you're doing is so simple that you don't need them in the first place. For example, checking for the presence of a simple substring is much slower with a regular expression than with the in operator. In our case we're iterating through a string and performing a number of bookkeeping operations, which is almost certainly going to be slower than the equivalent regex (if we could formulate it).
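The substring point can be checked with a rough micro-benchmark (a sketch; exact numbers vary by machine, and the haystack and repeat count are arbitrary choices here):

```python
import re
import timeit

haystack = (b'spam ' * 10000) + b'needle'

# Simple membership test with the `in` operator.
t_in = timeit.timeit(lambda: b'needle' in haystack, number=200)

# The same check via a compiled regular expression.
pattern = re.compile(b'needle')
t_re = timeit.timeit(lambda: pattern.search(haystack) is not None,
                     number=200)

print('in operator: %.4fs, re.search: %.4fs' % (t_in, t_re))
```

For plain substring checks like this, the `in` operator is typically the faster of the two; the regex only starts to pay off once the match logic involves alternation or character classes.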

Can we use the docstring to provide a very good explanation of what this function does?

@sigmavirus24 I attempted to do that with my updated version, but I can only try to describe what it does, not why it does it.

@sigmavirus24
Copy link
Member

I am hard-pressed to find a case where using a regular expression is the more performant option. In every case, it comes down to regexes being easier for me. There are also really good benchmarks and examples provided by the re2 library that show how inefficient Python's standard re module actually is.

@dan-blanchard
Copy link
Member

re2 and regex are definitely both much more efficient than the standard re module in most cases, but that does not mean re can't be more efficient than a for loop. Loops have quite a bit of overhead in Python.

… and comments. note - not compatible with P3
@rsnair2
Copy link
Contributor Author

rsnair2 commented Dec 13, 2014

@dan-blanchard, @sigmavirus24 I updated the fork to use a regex instead of the for loop. I profiled it with timeit, and for large strings the regex seems to be around twice as fast. I also changed some of the variables around. It's still not ready, though; it does not work with Python 3. I will work on that sometime today if I get the chance.

@rsnair2
Copy link
Contributor Author

rsnair2 commented Dec 14, 2014

Alright, I tested with Python 3 and 2. Seems good.

@dan-blanchard
Copy link
Member

Thanks for this @rsnair2. This looks much better now!

👍

@sigmavirus24 I'm +1 to merge, how about yourself?

@sigmavirus24
Copy link
Member

I'd like the docstring written differently, but I'll submit a PR with how I'd like it formatted/written.

sigmavirus24 added a commit that referenced this pull request Dec 19, 2014
Modified filter_english_with_letters to mimic the behavior form Mozilla's version.
@sigmavirus24 sigmavirus24 merged commit 0ad6242 into chardet:master Dec 19, 2014
@dan-blanchard dan-blanchard mentioned this pull request Apr 11, 2017