[BUG]

I have files with names in a national language inside a zip archive. The zip specification has poor support for non-ASCII names, so I use `charset_normalizer.detect()` to find the correct charset (a recommendation I found on the Internet).
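For context, here is a minimal sketch of how the raw name bytes can be obtained from an archive before passing them to a detector. It assumes the standard `zipfile` behaviour of decoding names as cp437 when the UTF-8 flag (general-purpose bit 11) is not set; `raw_name_bytes` and `iter_raw_names` are hypothetical helper names, not part of any library.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def raw_name_bytes(filename: str, flag_bits: int) -> bytes:
    """Return the bytes an entry name was stored as (hypothetical helper)."""
    if flag_bits & UTF8_FLAG:
        return filename.encode("utf-8")
    # Without the flag, zipfile decoded the stored bytes as cp437,
    # so encoding back with cp437 restores them unchanged.
    return filename.encode("cp437")


def iter_raw_names(path: str):
    """Yield the raw name bytes of every entry in the archive at `path`."""
    with zipfile.ZipFile(path) as archive:
        for info in archive.infolist():
            yield raw_name_bytes(info.filename, info.flag_bits)
```

Each yielded value is a byte string such as the one shown in this report, which can then be handed to `detect()` or decoded with a known codec.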

The issue is with a file name whose raw bytes are `b'\x84\xae\xaa\xe3\xac\xa5\xad\xe2 Microsoft Word.docx'`.
The original file name is `'Документ Microsoft Word.docx'`.
In the latest release, 3.3.2, the charset is detected as `'utf_16_be'` (`detect(b'\x84\xae\xaa\xe3\xac\xa5\xad\xe2 Microsoft Word.docx')['encoding'] == 'utf_16_be'`), which is incorrect.
As I understand it, the charset that should be detected for these bytes is `'cp866'`, because `'Документ Microsoft Word.docx'.encode('cp866') == b'\x84\xae\xaa\xe3\xac\xa5\xad\xe2 Microsoft Word.docx'`.
I tried earlier charset_normalizer releases: version 3.3.0 detects the charset as `'cp1125'`, and `'cp1125'` also happens to decode the name correctly into `'Документ Microsoft Word.docx'`.
So 3.3.0 detects the encoding incorrectly but still gives a usable result.
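The claims above can be checked with the standard library codecs alone, without charset_normalizer installed. This is only a sketch using the byte string and file name from this report; the variable names are illustrative.

```python
name = "Документ Microsoft Word.docx"
raw = b"\x84\xae\xaa\xe3\xac\xa5\xad\xe2 Microsoft Word.docx"

# cp866 reproduces the stored bytes exactly.
cp866_matches = name.encode("cp866") == raw

# cp1125 decodes the same bytes to the same name, which is why the
# 3.3.0 result is usable even though it is not the expected charset.
cp1125_matches = raw.decode("cp1125") == name

# The bytes happen to be decodable as UTF-16 BE, but the text is unrelated.
utf16_text = raw.decode("utf_16_be")
utf16_matches = utf16_text == name

print(cp866_matches, cp1125_matches, utf16_matches)
```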

I can use the 3.3.0 release as a workaround for now.
But please try to fix the issue.

OS is Linux  - SLES 15-SP5
Python 3.11.5


[BUG] #498
