
[BUG] Regression and change of behaviour between 3.3.0 and 3.3.1 #520

@pombredanne

Describe the bug
Encoding detection changed recently and, IMHO, regressed. I found this in a CI failure ( https://dev.azure.com/nexB/commoncode/_build/results?buildId=14502&view=logs&jobId=ba20146e-138e-5341-c558-bc25972fe2bd&j=ba20146e-138e-5341-c558-bc25972fe2bd&t=18eddfd8-abe5-5f8c-405c-5d0e0bd4c25d ) where we use beautifulsoup4, which in turn uses charset_normalizer.

To Reproduce
Note that I am using bs4's UnicodeDammit to show the side effects. I also added a direct encoding detection call to show the charset_normalizer side:

Up to 3.2.0 the behavior is stable:

$ pip install beautifulsoup4==4.12.3
$ pip install charset-normalizer==3.2.0
$ python -c "from bs4.dammit import UnicodeDammit;print(UnicodeDammit(b'/includes/webform.compon\xd2\xaants.inc/').markup)"
/includes/webform.componŇŞnts.inc/
$ python -c "import charset_normalizer as cn; print(cn.detect(b'/includes/webform.compon\xd2\xaants.inc/')['encoding'])"
windows-1250

Note the small change in 3.3.0:

$ pip install charset-normalizer==3.3.0
$ python -c "from bs4.dammit import UnicodeDammit;print(UnicodeDammit(b'/includes/webform.compon\xd2\xaants.inc/').markup)"
/includes/webform.compon훩nts.inc/
$ python -c "import charset_normalizer as cn; print(cn.detect(b'/includes/webform.compon\xd2\xaants.inc/')['encoding'])"
johab

Note the big change in 3.3.1:

$ pip install charset-normalizer==3.3.1
$ python -c "from bs4.dammit import UnicodeDammit;print(UnicodeDammit(b'/includes/webform.compon\xd2\xaants.inc/').markup)"
⽩湣汵摥猯睥扦潲洮捯浰潮튪湴献楮振

$ python -c "import charset_normalizer as cn; print(cn.detect(b'/includes/webform.compon\xd2\xaants.inc/')['encoding'])"
utf_16_be
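
For reference, the three outputs can be reproduced without charset_normalizer or bs4 by decoding the same bytes with the encodings reported for each version. This is only a minimal sketch using Python's standard codecs: the `reported` mapping is filled in by hand from the results in this report, and `cp1250` is assumed here as Python's codec name for windows-1250.

```python
# Sample bytes from this report; everything except \xd2\xaa is plain ASCII.
payload = b"/includes/webform.compon\xd2\xaants.inc/"

# Encodings reported by cn.detect() for each version (copied by hand from
# the transcripts in this report, not detected by this script).
reported = {
    "3.2.0": "cp1250",     # Python's codec name for windows-1250
    "3.3.0": "johab",
    "3.3.1": "utf_16_be",
}

# Decode the same payload once per reported encoding.
decoded = {version: payload.decode(codec) for version, codec in reported.items()}

# Check whether the ASCII prefix of the path survives each decoding.
ascii_kept = {version: text.startswith("/includes/") for version, text in decoded.items()}

for version, text in decoded.items():
    print(f"{version} ({reported[version]}): {text!r} ascii_kept={ascii_kept[version]}")
```

The first two encodings only disagree on how to read the two non-ASCII bytes, so the ASCII part of the path is preserved. With utf_16_be every pair of bytes becomes one code unit, so the 34-byte input turns into 17 unrelated characters and the ASCII part is lost as well.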

Expected behavior

I would expect the behavior of 3.2.0 or 3.3.0 to be treated as correct. The 3.3.1 result is not correct; if it is intended, then IMHO it is an API-breaking change that should come with a major version bump.

Desktop (please complete the following information):

  • OS: Linux and Windows
  • Python version 3.8 and up
  • Package version 3.2.0 to 3.3.1
