@queeup queeup commented Jan 22, 2014

I didn't commit the README.rst changes; I'll leave those to the developers.

queeup added 4 commits January 22, 2014 19:38
I have no idea about side effects. With this value it works well. Anything more
than 0.6..... always detects Turkish (iso-8859-9) as iso-8859-2.
@hakanzy

hakanzy commented Jan 31, 2014

Merge?

@sigmavirus24
Member

@hakanzy I assume you're looking for the "Merge" button. If that's the case you need to be a maintainer on the repository. If instead you're giving an obscure direction, all pull requests need a proper amount of review before being accepted. The maintainers of chardet are busy right now and do not have the time to dedicate to review.

@dan-blanchard
Member

So I finally got a chance to look at this, and my initial thought was that things looked fine, but then I looked at the unit test results. We currently have the repo set up so that Travis builds always look like they pass (because of known failures), which makes it difficult for people to see when their changes introduce new bugs.

Anyway, in the case of this PR, we go from having 27 unit test failures (see #13), to 29. If you look at the failures:

$ python test.py
.................................FFFFFF...................................................................FF.FFF......................................FF..............................................F.FF.FFFFF..................F..............................................F........................................FF..F...F...............F..F...............................................
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected windows-1254, but got 'ISO-8859-9' in /home/travis/build/chardet/chardet/tests/windows-1254-turkish/_chromium_windows-1254_with_no_encoding_specified.html

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected latin1, but got 'ISO-8859-2' in /home/travis/build/chardet/chardet/tests/latin1/_ude_2.txt

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected latin1, but got 'TIS-620' in /home/travis/build/chardet/chardet/tests/latin1/_ude_4.txt

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected latin1, but got 'ascii' in /home/travis/build/chardet/chardet/tests/latin1/_mozilla_bug638318_text.html

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected latin1, but got 'ISO-8859-2' in /home/travis/build/chardet/chardet/tests/latin1/_ude_3.txt

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected latin1, but got 'IBM855' in /home/travis/build/chardet/chardet/tests/latin1/_ude_1.txt

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected windows-1250, but got 'windows-1255' in /home/travis/build/chardet/chardet/tests/windows-1250-hungarian/bbc.co.uk.hu.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected windows-1250, but got 'ISO-8859-7' in /home/travis/build/chardet/chardet/tests/windows-1250-hungarian/objektivhir.hu.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected windows-1250, but got 'ISO-8859-2' in /home/travis/build/chardet/chardet/tests/windows-1250-hungarian/bbc.co.uk.hu.forum.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected windows-1250, but got 'IBM855' in /home/travis/build/chardet/chardet/tests/windows-1250-hungarian/bbc.co.uk.hu.pressreview.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected windows-1250, but got 'ISO-8859-9' in /home/travis/build/chardet/chardet/tests/windows-1250-hungarian/bbc.co.uk.hu.learningenglish.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected windows-1252, but got 'ISO-8859-2' in /home/travis/build/chardet/chardet/tests/windows-1252/_mozilla_bug421271_text.html

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected windows-1252, but got 'ISO-8859-9' in /home/travis/build/chardet/chardet/tests/windows-1252/github_bug_9.txt

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected iso-8859-7, but got 'ISO-8859-9' in /home/travis/build/chardet/chardet/tests/iso-8859-7-greek/naftemporiki.gr.fin.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected iso-8859-7, but got 'ISO-8859-9' in /home/travis/build/chardet/chardet/tests/iso-8859-7-greek/naftemporiki.gr.mrk.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected iso-8859-7, but got 'ISO-8859-9' in /home/travis/build/chardet/chardet/tests/iso-8859-7-greek/naftemporiki.gr.mrt.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected iso-8859-7, but got 'ISO-8859-9' in /home/travis/build/chardet/chardet/tests/iso-8859-7-greek/naftemporiki.gr.bus.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected iso-8859-7, but got 'ISO-8859-9' in /home/travis/build/chardet/chardet/tests/iso-8859-7-greek/naftemporiki.gr.cmm.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected iso-8859-7, but got 'ISO-8859-9' in /home/travis/build/chardet/chardet/tests/iso-8859-7-greek/naftemporiki.gr.wld.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected iso-8859-7, but got 'ISO-8859-9' in /home/travis/build/chardet/chardet/tests/iso-8859-7-greek/naftemporiki.gr.spo.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected iso-8859-7, but got 'windows-1253' in /home/travis/build/chardet/chardet/tests/iso-8859-7-greek/disabled.gr.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected iso-8859-6, but got 'MacCyrillic' in /home/travis/build/chardet/chardet/tests/iso-8859-6-arabic/_chromium_ISO-8859-6_with_no_encoding_specified.html

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected windows-1256, but got 'MacCyrillic' in /home/travis/build/chardet/chardet/tests/windows-1256-arabic/_chromium_windows-1256_with_no_encoding_specified.html

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected iso-8859-2, but got 'ISO-8859-7' in /home/travis/build/chardet/chardet/tests/iso-8859-2-hungarian/escience.hu.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected iso-8859-2, but got 'windows-1251' in /home/travis/build/chardet/chardet/tests/iso-8859-2-hungarian/cigartower.hu.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected iso-8859-2, but got 'KOI8-R' in /home/travis/build/chardet/chardet/tests/iso-8859-2-hungarian/shamalt.uw.hu.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected iso-8859-2, but got 'ISO-8859-9' in /home/travis/build/chardet/chardet/tests/iso-8859-2-hungarian/shamalt.uw.hu.mv.xml

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected utf-8, but got 'ISO-8859-2' in /home/travis/build/chardet/chardet/tests/utf-8/bom-utf-8.srt

======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: Expected utf-8, but got 'ISO-8859-9' in /home/travis/build/chardet/chardet/tests/utf-8/_mozilla_bug306272_text.html

----------------------------------------------------------------------
Ran 389 tests in 162.607s

FAILED (failures=29)

you can see that with your change we would incorrectly predict ISO-8859-9 all over the place, so I think your threshold is too low.
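
Because the build is always green, one way to see which failures a change introduces is to diff the failure lines of two runs. The following is only a minimal sketch, assuming both logs use the `AssertionError: Expected ..., but got '...' in ...` format shown above; the function names are hypothetical and not part of chardet.

```python
import re
from collections import Counter

# Matches lines such as:
# AssertionError: Expected latin1, but got 'ISO-8859-2' in /path/to/file.txt
FAILURE_RE = re.compile(r"AssertionError: Expected (.+?), but got '([^']*)' in (\S+)")


def parse_failures(log_text):
    """Map each failing test file to an (expected, detected) pair."""
    failures = {}
    for match in FAILURE_RE.finditer(log_text):
        expected, detected, path = match.groups()
        failures[path] = (expected, detected)
    return failures


def new_failures(baseline_log, candidate_log):
    """Return failures present in the candidate run but not in the baseline."""
    baseline = parse_failures(baseline_log)
    candidate = parse_failures(candidate_log)
    return {path: result for path, result in candidate.items()
            if path not in baseline}


def detected_counts(log_text):
    """Count how often each encoding was wrongly detected."""
    return Counter(detected for _, detected in parse_failures(log_text).values())
```

Feeding it the output of `python test.py` from the target branch and from the PR branch would list only the regressions, and `detected_counts` makes a skew towards a single encoding easy to spot.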

@queeup
Author

queeup commented Sep 12, 2014

What do you suggest? What do you mean by a low threshold?

@dan-blanchard
Member

I believe the mTypicalPositiveRatio you changed in 1059680 is too low now, because it seems to predict Turkish for things that are not Turkish now.
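
To illustrate why a lower ratio can cause over-prediction, here is a simplified sketch of how a single-byte prober might turn sequence statistics into a confidence score. It is an assumption modeled loosely on my understanding of chardet's single-byte prober, not a copy of it: the function name, argument names, and all numbers are hypothetical, and the ratios used are not the values from this PR.

```python
def sketch_confidence(positive_seqs, total_seqs, freq_chars, total_chars,
                      typical_positive_ratio):
    """Simplified, hypothetical model of a single-byte prober's confidence."""
    if total_seqs <= 0 or total_chars <= 0:
        return 0.01  # floor used when there is no evidence yet
    # Share of "positive" character sequences, normalised by the ratio the
    # language model considers typical for genuine text in that language.
    r = (positive_seqs / total_seqs) / typical_positive_ratio
    # Scale by how many of the characters seen are frequent in the language.
    r *= freq_chars / total_chars
    # Cap the score so the prober never reports full certainty.
    return 0.99 if r >= 1.0 else r


# The same made-up evidence, scored with two hypothetical ratios.
evidence = dict(positive_seqs=450, total_seqs=1000,
                freq_chars=900, total_chars=1000)
high_ratio = sketch_confidence(typical_positive_ratio=0.9, **evidence)
low_ratio = sketch_confidence(typical_positive_ratio=0.5, **evidence)
print(high_ratio, low_ratio)
```

Since the ratio sits in the denominator, lowering it inflates the confidence for any input, including text that is not Turkish, which would let that prober outscore the correct one.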

@sigmavirus24
Member

Yeah, this is why I think we need to sync with the upstream version of universal chardet from Mozilla. I expect they have updated models and ratios that would be far more accurate.

@dan-blanchard
Member

I completely agree. I'm working on a bunch of SKLL stuff at the moment, but I hope to get a chance to tackle this in the next month or so.

@dan-blanchard
Member

Sorry for the confusion, but because so many people were targeting master instead of develop, I renamed our branches in the following way:

  • master ➡️ stable
  • develop ➡️ master

Unfortunately, your pull request was the only one that correctly targeted develop, so it was closed when that target branch ceased to exist.

If you don't mind, could you please create a new PR with master as the target?

Also, @sigmavirus24, I've looked into the upstream changes a little bit, and there were very few substantial changes that I've found so far (except for them eliminating detection of a number of codecs that we would like to continue to support).
