Skip to content

Add Hypothesis based test of chardet #66

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Sep 19, 2015
Merged

Add Hypothesis based test of chardet #66

merged 4 commits into from
Sep 19, 2015

Conversation

DRMacIver
Copy link
Contributor

The concept here is pretty simple: This tries to test for the
invariant that if a string comes from valid unicode and is
encoded in one of the chardet supported encodings then chardet
should detect some encoding.

More nuanced tests are possible (e.g. asserting that the string
should be decodable from the detected encoding) but given
that even this test is demonstrating a bunch of bugs this seemed to
be a good starting point.

This is (more or less) the test that caught #65, #64 and #63. #62 had one extra line in it to try to reencode the data as the reported format.

Notes:

  1. This test is currently failing. I'm pretty sure this is because of issues it's finding in the code, not issues with the test.
  2. min_size=100 is to rule out bugs that come solely from the length, prompted by your saying that short strings aren't really supported. Anecdotally all of the bugs that have been found so far don't depend on the length and min_size=1 would have been fine (leaving min_size alone is also valid, but I assume '' having a None encoding is intended behaviour)

The concept here is pretty simple: This tries to test for the
invariant that if a string comes from valid unicode and is
encoded in one of the chardet supported encodings then chardet
should detect *some* encoding.

More nuanced tests are possible (e.g. asserting that the string
should be decodable from the detected encoding) but given
that even this test is demonstrating a bunch of bugs this seemed to
be a good starting point.
@DRMacIver
Copy link
Contributor Author

Oops. No, the tests are failing because I've failed to set up travis to install them correctly. One second...

@dan-blanchard
Copy link
Member

There are expected test failures at this point. We've got a handful of tests that still fail, and this is part of the reason we haven't released a new version yet.

@DRMacIver
Copy link
Contributor Author

Ah, apologies if I've then gone and made your life even more difficult.

@dan-blanchard
Copy link
Member

Ah, apologies if I've then gone and made your life even more difficult.

We just have too few maintainers for a complex, heavily used, project (mostly because people don't even realize things like requests rely on us). I wouldn't say you've made things more complex, you've just come up with a way to generate little tests that might actually help us establish why the other things are failing.

@sigmavirus24
Copy link
Member

So I disagree that some of these tests are actually useful. I was going to refrain from commenting on this until I had time to actually evaluate them, but padding arbitrary code-points with ascii does bias the detection (as you've been doing in other tests to argue that this isn't a length issue). The algorithms in this project have some shortcuts (based on BOM) for speed, but otherwise you wouldn't take SHIFT-JIS characters and pad them with ascii. That's just silly. Without having much time to dedicate to this right now, I'm not in favor of merging this just yet.

@dan-blanchard
Copy link
Member

@sigmavirus24, I completely agree that padding arbitrary code points with ASCII is a bad idea for some encodings, but these tests are restricted to encodings that are supersets of ASCII. They also don't test for correctness of detection, just that there was something detected. This is helping us find some weird edge cases where our shortcuts fail in such a way that no encoding is detected at all.

I certainly would not have expected any ASCII text that contains \x1b to fail to be recognized as anything. And now the same thing apparently happens with UTF8 as well.

@DRMacIver
Copy link
Contributor Author

Yeah, I deliberately avoided testing weirder encodings until I had a better handle on what would happen with the simple ones. it's easy to hit strange edge cases in heuristics, so the test in this PR is quite conservative and only covers the case where chardet has failed to find anything at all.

Incidentally, padding with ascii was simply chosen for convenience and does not seem to be required to argue that the examples found so far are not length based. e.g. I just checked #64 and it remains the case even if you just repeat the same string N times (also a non-breaking space is not an unreasonable character to have mixed in with ascii).

@landscape-bot
Copy link

Code Health
Code quality remained the same when pulling ff793af on DRMacIver:add-hypothesis into 9e419e9 on chardet:master.

This attempts to address some of @sigmavirus24's concerns over
the utility of this test: Instead of merely looking for strings
where chardet can't find an encoding, we look for prefixes that
prevent any string (or at least every string we checked) that
starts with that prefix from having a detected encoding.

This makes the test significantly more complicated, but should lead
to more robust examples.
This isn't strictly necessary, but the new test is quite
slow to minimize when a failure is found, which it typically
is right now. Caching the .hypothesis directory means that
examples will persist between runs, which both speeds up the
build and helps keep consistency between runs so that causes
to failures persist.
@DRMacIver
Copy link
Contributor Author

I've updated the test so that instead of looking simply for strings with no detected encoding it looks for prefixes which prevent any string starting with that prefix from having a detected encoding. Does that help address the concern?

@landscape-bot
Copy link

Code Health
Code quality remained the same when pulling c058f52 on DRMacIver:add-hypothesis into 9e419e9 on chardet:master.

@DRMacIver
Copy link
Contributor Author

What should I do with this? If it's just a question of not having time to review there's of course no rush and you can get to it whenever, but if you're flat out not interested I'll close the pull request.

@dan-blanchard
Copy link
Member

For me, it's a matter of not having the time to review this at the moment.

@sigmavirus24, are you still against this?

@sigmavirus24
Copy link
Member

I don't have the time to review it either. I wonder if we can use hypothesis to generate tests off of our known good fixtures.

@dan-blanchard
Copy link
Member

I finally got a chance to look over your changes, and this looks good to me. I think in the future we could use hypothesis even more than this, like @sigmavirus24 suggested, but this is good for now.

dan-blanchard added a commit that referenced this pull request Sep 19, 2015
Add Hypothesis based test of chardet
@dan-blanchard dan-blanchard merged commit cc9d6d2 into chardet:master Sep 19, 2015
@dan-blanchard dan-blanchard mentioned this pull request Apr 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants