Description
Describe the bug
The encoding detection changed recently and, IMHO, regressed. I found this through a CI failure ( https://dev.azure.com/nexB/commoncode/_build/results?buildId=14502&view=logs&jobId=ba20146e-138e-5341-c558-bc25972fe2bd&j=ba20146e-138e-5341-c558-bc25972fe2bd&t=18eddfd8-abe5-5f8c-405c-5d0e0bd4c25d ) in a project that uses beautifulsoup4, which in turn uses charset_normalizer.
To Reproduce
Note that I am using bs4's UnicodeDammit to show the side effects. I also added a direct encoding detection call to show the charset_normalizer side:
Up to 3.2.0 the behavior is stable:
$ pip install beautifulsoup4==4.12.3
$ pip install charset-normalizer==3.2.0
$ python -c "from bs4.dammit import UnicodeDammit;print(UnicodeDammit(b'/includes/webform.compon\xd2\xaants.inc/').markup)"
/includes/webform.componŇŞnts.inc/
$ python -c "import charset_normalizer as cn; print(cn.detect(b'/includes/webform.compon\xd2\xaants.inc/')['encoding'])"
windows-1250
Note the small change in 3.3.0:
$ pip install charset-normalizer==3.3.0
$ python -c "from bs4.dammit import UnicodeDammit;print(UnicodeDammit(b'/includes/webform.compon\xd2\xaants.inc/').markup)"
/includes/webform.compon훩nts.inc/
$ python -c "import charset_normalizer as cn; print(cn.detect(b'/includes/webform.compon\xd2\xaants.inc/')['encoding'])"
johab
Note the big change in 3.3.1:
$ pip install charset-normalizer==3.3.1
$ python -c "from bs4.dammit import UnicodeDammit;print(UnicodeDammit(b'/includes/webform.compon\xd2\xaants.inc/').markup)"
⽩湣汵摥猯睥扦潲洮捯浰潮튪湴献楮振
$ python -c "import charset_normalizer as cn; print(cn.detect(b'/includes/webform.compon\xd2\xaants.inc/')['encoding'])"
utf_16_be
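To see why the detected encoding matters so much for this payload, here is a minimal sketch that decodes the same bytes under each of the three encodings reported above, using only Python's built-in codecs. It does not call charset_normalizer or bs4. The helper name `decode_candidates` is made up for illustration, and `cp1250` is used as Python's codec name for windows-1250.

```python
# Payload copied from the reproduction steps above.
PAYLOAD = b"/includes/webform.compon\xd2\xaants.inc/"

# Encodings reported by charset_normalizer 3.2.0, 3.3.0 and 3.3.1 respectively.
# "cp1250" is Python's codec name for windows-1250.
CANDIDATES = ("cp1250", "johab", "utf_16_be")


def decode_candidates(payload, encodings=CANDIDATES):
    """Hypothetical helper: decode the same bytes under each encoding.

    Bytes that a codec cannot decode are replaced with U+FFFD instead of raising.
    """
    return {name: payload.decode(name, errors="replace") for name in encodings}


results = decode_candidates(PAYLOAD)
for name, text in results.items():
    print(f"{name:>10}: {text}")
```

The payload is 34 bytes long, so a UTF-16-BE decode pairs every two bytes into one code unit and the ASCII path disappears entirely. The single-byte and Johab readings only differ in how they interpret the two non-ASCII bytes.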
Expected behavior
I would expect the behavior of 3.2.0 or 3.3.0 to be treated as correct. The 3.3.1 behavior is not correct, or if it is, then IMHO it is an API-breaking change that should come with a major version bump.
Desktop (please complete the following information):
- OS: Linux and Windows
- Python version 3.8 and up
- Package version 3.2.0 to 3.3.1