Add Hypothesis based test of chardet #66
Conversation
The concept here is pretty simple: this tries to test the invariant that, if a string comes from valid Unicode and is encoded in one of the chardet-supported encodings, then chardet should detect *some* encoding. More nuanced tests are possible (e.g. asserting that the string should be decodable from the detected encoding), but given that even this test is demonstrating a bunch of bugs, this seemed like a good starting point.
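A minimal sketch of that invariant, using only the standard library so it stays self-contained: the real test drives this with Hypothesis's `@given` and calls `chardet.detect`, which are replaced here by a seeded random loop and a toy stand-in detector.

```python
# Sketch of the invariant: text encoded in a chardet-supported,
# ASCII-superset encoding should yield *some* detected encoding.
# NOTE: `detect` is a toy stand-in, not chardet's real heuristics,
# and Hypothesis's @given machinery is replaced by a random loop.
import random
import string

ASCII_SUPERSET_ENCODINGS = ["utf-8", "iso-8859-1", "windows-1252"]

def detect(data):
    """Stand-in for chardet.detect(); assumed to return a dict whose
    'encoding' key is None when nothing could be detected."""
    try:
        data.decode("ascii")
        return {"encoding": "ascii"}
    except UnicodeDecodeError:
        return {"encoding": "utf-8"}  # toy fallback, not a real guess

def check_detection_invariant(trials=200, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(1, 50)
        text = "".join(rng.choice(string.printable) for _ in range(length))
        for encoding in ASCII_SUPERSET_ENCODINGS:
            try:
                payload = text.encode(encoding)
            except UnicodeEncodeError:
                continue  # this string isn't representable; skip it
            assert detect(payload)["encoding"] is not None, (text, encoding)
    return True
```

With the real libraries, the explicit loop collapses to a `@given(st.text())` property over a strategy of encodings, with `chardet.detect` supplying the result.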
Oops. No, the tests are failing because I've failed to set up travis to install them correctly. One second...
There are expected test failures at this point. We've got a handful of tests that still fail, and this is part of the reason we haven't released a new version yet.
Ah, apologies if I've then gone and made your life even more difficult.
We just have too few maintainers for a complex, heavily used project (mostly because people don't even realize things like requests rely on us). I wouldn't say you've made things more complex; you've just come up with a way to generate little tests that might actually help us establish why the other things are failing.
So I'm not convinced that some of these tests are actually useful. I was going to refrain from commenting on this until I had time to actually evaluate them, but padding arbitrary code points with ASCII does bias the detection (as you've been doing in other tests to argue that this isn't a length issue). The algorithms in this project have some shortcuts (based on BOM) for speed, but otherwise you wouldn't take Shift-JIS characters and pad them with ASCII. That's just silly. Without much time to dedicate to this right now, I'm not in favor of merging this just yet.
@sigmavirus24, I completely agree that padding arbitrary code points with ASCII is a bad idea for some encodings, but these tests are restricted to encodings that are supersets of ASCII. They also don't test for correctness of detection, just that there was something detected. This is helping us find some weird edge cases where our shortcuts fail in such a way that no encoding is detected at all. I certainly would not have expected any ASCII text that contains
Yeah, I deliberately avoided testing weirder encodings until I had a better handle on what would happen with the simple ones. It's easy to hit strange edge cases in heuristics, so the test in this PR is quite conservative and only covers the case where chardet has failed to find anything at all. Incidentally, padding with ASCII was simply chosen for convenience and does not seem to be required to argue that the examples found so far are not length-based: I just checked #64, and the failure persists even if you just repeat the same string N times (also, a non-breaking space is not an unreasonable character to have mixed in with ASCII).
This attempts to address some of @sigmavirus24's concerns over the utility of this test: Instead of merely looking for strings where chardet can't find an encoding, we look for prefixes that prevent any string (or at least every string we checked) that starts with that prefix from having a detected encoding. This makes the test significantly more complicated, but should lead to more robust examples.
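A hedged sketch of that strengthened property, again with only the standard library: `detect` is a toy stand-in for `chardet.detect`, and the hard-coded UTF-8 non-breaking-space prefix is purely illustrative, not a real chardet failure.

```python
# Sketch of the strengthened property: rather than a single undetected
# string, look for a prefix that blocks detection for every sampled
# continuation. `detect` is a stand-in for chardet.detect().
import random

def detect(data):
    # Toy stand-in: "detects" an encoding unless the data starts with
    # a specific byte sequence, to illustrate a blocking prefix.
    if data.startswith(b"\xc2\xa0"):
        return {"encoding": None}
    return {"encoding": "ascii"}

def prefix_blocks_detection(prefix, samples=50, seed=0):
    """True if no sampled continuation of `prefix` yields a detection."""
    rng = random.Random(seed)
    for _ in range(samples):
        # Random printable-ASCII tail appended to the candidate prefix.
        tail = bytes(rng.randrange(0x20, 0x7F)
                     for _ in range(rng.randint(0, 30)))
        if detect(prefix + tail)["encoding"] is not None:
            return False  # some continuation is detected; prefix is fine
    return True
```

A failing example from this test is then a prefix for which `prefix_blocks_detection` holds across all sampled tails, which is a much more robust artifact than one unlucky string.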
This isn't strictly necessary, but the new test is quite slow to minimize when a failure is found (which it typically is right now). Caching the .hypothesis directory means that examples persist between runs, which both speeds up the build and keeps runs consistent, so the causes of failures carry over from one run to the next.
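Concretely, the Travis change amounts to a cache stanza along these lines (a sketch; it assumes Hypothesis's default example database in a .hypothesis directory at the repository root):

```yaml
# .travis.yml (excerpt): persist Hypothesis's example database between
# builds so previously found failing examples are replayed first.
cache:
  directories:
    - .hypothesis
```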
I've updated the test so that instead of looking simply for strings with no detected encoding, it looks for prefixes which prevent any string starting with that prefix from having a detected encoding. Does that help address the concern?
What should I do with this? If it's just a question of not having time to review, there's of course no rush and you can get to it whenever, but if you're flat-out not interested I'll close the pull request.
For me, it's a matter of not having the time to review this at the moment. @sigmavirus24, are you still against this?
I don't have the time to review it either. I wonder if we can use hypothesis to generate tests from our known-good fixtures.
I finally got a chance to look over your changes, and this looks good to me. I think in the future we could use hypothesis even more than this, like @sigmavirus24 suggested, but this is good for now.
Add Hypothesis based test of chardet
The concept here is pretty simple: This tries to test for the
invariant that if a string comes from valid unicode and is
encoded in one of the chardet supported encodings then chardet
should detect some encoding.
More nuanced tests are possible (e.g. asserting that the string
should be decodable from the detected encoding) but given
that even this test is demonstrating a bunch of bugs this seemed to
be a good starting point.
This is (more or less) the test that caught #65, #64 and #63. #62 had one extra line in it to try to re-encode the data as the reported format.
Notes: