Add Hypothesis based test of chardet #66
Conversation
The concept here is pretty simple: this tries to test the invariant that, if a string comes from valid Unicode and is encoded in one of the chardet-supported encodings, then chardet should detect *some* encoding. More nuanced tests are possible (e.g. asserting that the string should be decodable from the detected encoding), but given that even this test is demonstrating a bunch of bugs, this seemed like a good starting point.
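A minimal sketch of that invariant, using only the standard library so it stays self-contained: the real test drives this with Hypothesis's `@given` and calls `chardet.detect`, which are replaced here by a seeded random loop and a toy stand-in detector.

```python
# Sketch of the invariant: text encoded in a chardet-supported,
# ASCII-superset encoding should yield *some* detected encoding.
# NOTE: `detect` is a toy stand-in, not chardet's real heuristics,
# and Hypothesis's @given machinery is replaced by a random loop.
import random
import string

ASCII_SUPERSET_ENCODINGS = ["utf-8", "iso-8859-1", "windows-1252"]

def detect(data):
    """Stand-in for chardet.detect(); assumed to return a dict whose
    'encoding' key is None when nothing could be detected."""
    try:
        data.decode("ascii")
        return {"encoding": "ascii"}
    except UnicodeDecodeError:
        return {"encoding": "utf-8"}  # toy fallback, not a real guess

def check_detection_invariant(trials=200, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(1, 50)
        text = "".join(rng.choice(string.printable) for _ in range(length))
        for encoding in ASCII_SUPERSET_ENCODINGS:
            try:
                payload = text.encode(encoding)
            except UnicodeEncodeError:
                continue  # this string isn't representable; skip it
            assert detect(payload)["encoding"] is not None, (text, encoding)
    return True
```

With the real libraries, the explicit loop collapses to a `@given(st.text())` property over a strategy of encodings, with `chardet.detect` supplying the result.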
Oops. No, the tests are failing because I've failed to set up travis to install them correctly. One second...
There are expected test failures at this point. We've got a handful of tests that still fail, and this is part of the reason we haven't released a new version yet.
Ah, apologies if I've then gone and made your life even more difficult.
We just have too few maintainers for a complex, heavily used project (mostly because people don't even realize things like requests rely on us). I wouldn't say you've made things more complex; you've just come up with a way to generate little tests that might actually help us establish why the other things are failing.
So I'm not convinced that some of these tests are actually useful. I was going to refrain from commenting on this until I had time to actually evaluate them, but padding arbitrary code points with ASCII does bias the detection (as you've been doing in other tests to argue that this isn't a length issue). The algorithms in this project have some shortcuts (based on BOM) for speed, but otherwise you wouldn't take Shift-JIS characters and pad them with ASCII. That's just silly. Without much time to dedicate to this right now, I'm not in favor of merging this just yet.
@sigmavirus24, I completely agree that padding arbitrary code points with ASCII is a bad idea for some encodings, but these tests are restricted to encodings that are supersets of ASCII. They also don't test for correctness of detection, just that there was something detected. This is helping us find some weird edge cases where our shortcuts fail in such a way that no encoding is detected at all. I certainly would not have expected any ASCII text that contains
Yeah, I deliberately avoided testing weirder encodings until I had a better handle on what would happen with the simple ones. It's easy to hit strange edge cases in heuristics, so the test in this PR is quite conservative and only covers the case where chardet has failed to find anything at all. Incidentally, padding with ASCII was simply chosen for convenience and does not seem to be required to argue that the examples found so far are not length-based: I just checked #64, and the failure persists even if you just repeat the same string N times (also, a non-breaking space is not an unreasonable character to have mixed in with ASCII).
This attempts to address some of @sigmavirus24's concerns over the utility of this test: Instead of merely looking for strings where chardet can't find an encoding, we look for prefixes that prevent any string (or at least every string we checked) that starts with that prefix from having a detected encoding. This makes the test significantly more complicated, but should lead to more robust examples.
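A hedged sketch of that strengthened property, again with only the standard library: `detect` is a toy stand-in for `chardet.detect`, and the hard-coded UTF-8 non-breaking-space prefix is purely illustrative, not a real chardet failure.

```python
# Sketch of the strengthened property: rather than a single undetected
# string, look for a prefix that blocks detection for every sampled
# continuation. `detect` is a stand-in for chardet.detect().
import random

def detect(data):
    # Toy stand-in: "detects" an encoding unless the data starts with
    # a specific byte sequence, to illustrate a blocking prefix.
    if data.startswith(b"\xc2\xa0"):
        return {"encoding": None}
    return {"encoding": "ascii"}

def prefix_blocks_detection(prefix, samples=50, seed=0):
    """True if no sampled continuation of `prefix` yields a detection."""
    rng = random.Random(seed)
    for _ in range(samples):
        # Random printable-ASCII tail appended to the candidate prefix.
        tail = bytes(rng.randrange(0x20, 0x7F)
                     for _ in range(rng.randint(0, 30)))
        if detect(prefix + tail)["encoding"] is not None:
            return False  # some continuation is detected; prefix is fine
    return True
```

A failing example from this test is then a prefix for which `prefix_blocks_detection` holds across all sampled tails, which is a much more robust artifact than one unlucky string.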
This isn't strictly necessary, but the new test is quite slow to minimize when a failure is found (which it typically is right now). Caching the .hypothesis directory means that examples persist between runs, which both speeds up the build and keeps runs consistent, so the causes of failures carry over from one run to the next.
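Concretely, the Travis change amounts to a cache stanza along these lines (a sketch; it assumes Hypothesis's default example database in a .hypothesis directory at the repository root):

```yaml
# .travis.yml (excerpt): persist Hypothesis's example database between
# builds so previously found failing examples are replayed first.
cache:
  directories:
    - .hypothesis
```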
I've updated the test so that instead of looking simply for strings with no detected encoding, it looks for prefixes which prevent any string starting with that prefix from having a detected encoding. Does that help address the concern?
What should I do with this? If it's just a question of not having time to review, there's of course no rush and you can get to it whenever, but if you're flat-out not interested I'll close the pull request.
For me, it's a matter of not having the time to review this at the moment. @sigmavirus24, are you still against this?
I don't have the time to review it either. I wonder if we can use hypothesis to generate tests from our known-good fixtures.
I finally got a chance to look over your changes, and this looks good to me. I think in the future we could use hypothesis even more than this, like @sigmavirus24 suggested, but this is good for now.
Add Hypothesis based test of chardet
The concept here is pretty simple: This tries to test for the
invariant that if a string comes from valid unicode and is
encoded in one of the chardet supported encodings then chardet
should detect some encoding.
More nuanced tests are possible (e.g. asserting that the string
should be decodable from the detected encoding) but given
that even this test is demonstrating a bunch of bugs this seemed to
be a good starting point.
This is (more or less) the test that caught #65, #64 and #63. #62 had one extra line in it to try to re-encode the data as the reported format.
Notes: