Description
I have files with names in a national language inside a zip archive. The zip format has poor support for non-ASCII names, so I use charset_normalizer.detect() to find the correct charset (a recommendation I found on the Internet).
The issue is with a file name whose raw bytes are b'\x84\xae\xaa\xe3\xac\xa5\xad\xe2 Microsoft Word.docx'.
The original file name is 'Документ Microsoft Word.docx'.
In the latest release, 3.3.2, the charset is detected as 'utf_16_be'
(detect(b'\x84\xae\xaa\xe3\xac\xa5\xad\xe2 Microsoft Word.docx')['encoding'] == 'utf_16_be'),
which is incorrect.
As I understand it, the charset that should be detected for these bytes is 'cp866', because 'Документ Microsoft Word.docx'.encode('cp866') == b'\x84\xae\xaa\xe3\xac\xa5\xad\xe2 Microsoft Word.docx'.
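The cp866 claim can be checked with the standard library alone, without charset_normalizer installed. This is only a minimal sketch; the variable names are chosen for illustration.

```python
# Raw bytes of the file name as stored in the archive (from the report).
raw = b'\x84\xae\xaa\xe3\xac\xa5\xad\xe2 Microsoft Word.docx'
# The file name as it was originally written.
name = 'Документ Microsoft Word.docx'

# Encoding the original name with cp866 should reproduce the raw bytes.
encoded = name.encode('cp866')
print(encoded == raw)

# Decoding the raw bytes with cp866 should give back the original name.
print(raw.decode('cp866') == name)
```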
I tried earlier charset_normalizer releases. In version 3.3.0 the charset is detected as 'cp1125', and 'cp1125' also happens to decode the name correctly to 'Документ Microsoft Word.docx'.
So 3.3.0 still detects the wrong encoding, but gives a usable result.
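To show why 'cp1125' is usable here while 'utf_16_be' is not, the three encodings mentioned in this report can be compared by decoding the same bytes with Python's built-in codecs. This sketch does not call charset_normalizer; it only compares what each reported encoding would yield.

```python
raw = b'\x84\xae\xaa\xe3\xac\xa5\xad\xe2 Microsoft Word.docx'
expected = 'Документ Microsoft Word.docx'

# Decode the same bytes with each encoding mentioned in this report.
# errors='replace' keeps the loop going if a codec rejects some bytes.
results = {
    enc: raw.decode(enc, errors='replace')
    for enc in ('cp866', 'cp1125', 'utf_16_be')
}

for enc, text in results.items():
    status = 'matches' if text == expected else 'differs'
    print(f'{enc}: {status} -> {text!r}')
```

For these particular bytes, cp866 and cp1125 both decode to the expected name, so either result would be usable; the utf_16_be result is not.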
I can use the 3.3.0 release as a workaround for now.
But please try to fix the issue.
OS is Linux - SLES 15-SP5
Python 3.11.5