[BUG] #498

@M-Startc

Description

I have a zip archive containing files whose names are in a national language. The zip format has poor support for non-ASCII names, so I use charset_normalizer.detect() to find the correct charset (a recommendation I found on the Internet).
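For context, here is a minimal sketch of how the raw name bytes can be recovered from a zip member before running detection on them. It assumes the standard-library zipfile behaviour of decoding names as cp437 when the UTF-8 flag (general-purpose bit 11) is not set; `raw_member_name` is a hypothetical helper, not part of zipfile or charset_normalizer.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def raw_member_name(info: zipfile.ZipInfo) -> bytes:
    """Hypothetical helper: return the bytes stored for a member's name."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename.encode("utf-8")
    # Without the flag, zipfile decoded the name as cp437, so reverse that.
    return info.filename.encode("cp437")


# Simulate a legacy (non-UTF-8) entry as zipfile would expose it.
stored = b'\x84\xae\xaa\xe3\xac\xa5\xad\xe2 Microsoft Word.docx'
info = zipfile.ZipInfo(stored.decode("cp437"))
recovered = raw_member_name(info)
```

The recovered bytes are what would then be passed to detect().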

The issue is with a file name whose bytes are b'\x84\xae\xaa\xe3\xac\xa5\xad\xe2 Microsoft Word.docx'.
The original file name is 'Документ Microsoft Word.docx'.
In the latest release, 3.3.2, the charset is detected as 'utf_16_be' (detect(b'\x84\xae\xaa\xe3\xac\xa5\xad\xe2 Microsoft Word.docx')['encoding'] == 'utf_16_be'), which is incorrect.
As I understand it, the charset that should be detected for the bytes above is 'cp866', because 'Документ Microsoft Word.docx'.encode('cp866') == b'\x84\xae\xaa\xe3\xac\xa5\xad\xe2 Microsoft Word.docx'.
I tried previous charset_normalizer releases: in version 3.3.0 the charset is detected as 'cp1125', and 'cp1125' also happens to decode the name correctly into 'Документ Microsoft Word.docx'.
So 3.3.0 detects the encoding incorrectly but still gives a usable result.
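The claims above can be checked with the standard-library codecs alone, without charset_normalizer. This is only a sketch; `decodes_to` is a hypothetical helper introduced for illustration.

```python
raw = b'\x84\xae\xaa\xe3\xac\xa5\xad\xe2 Microsoft Word.docx'
expected = 'Документ Microsoft Word.docx'


def decodes_to(data: bytes, encoding: str, text: str) -> bool:
    """Return True if `data` decodes to exactly `text` under `encoding`."""
    try:
        return data.decode(encoding) == text
    except UnicodeDecodeError:
        return False


# cp866 and cp1125 agree on the Cyrillic letters used here; utf_16_be does not
# reproduce the original name.
results = {enc: decodes_to(raw, enc, expected)
           for enc in ('cp866', 'cp1125', 'utf_16_be')}
print(results)
```

Both cp866 and cp1125 recover the original name, which is why the 3.3.0 result is usable even though it is not the expected charset.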

I can use the 3.3.0 release as a workaround for now, but please try to fix the issue.

OS is Linux - SLES 15-SP5
Python 3.11.5

    Labels

    detection: Related to the charset detection mechanism, chaos/mess/coherence
