[BUG] Incorrect encoding detected in 3.3.1 #371

@jefferyto

Description

I'm updating the charset-normalizer package in OpenWrt (with Python 3.11.6) and tried the example in https://charset-normalizer.readthedocs.io/en/latest/user/handling_result.html#handling-result:

from charset_normalizer import from_bytes

my_byte_str = 'Bсеки човек има право на образование.'.encode('cp1251')

# Assign return value so we can fully exploit result
result = from_bytes(
    my_byte_str
).best()

print(result.encoding)  # expected: cp1251

In 3.3.0 this printed cp1251, but in 3.3.1 it prints cp1257 (str(result) returns 'Bńåźč ÷īāåź čģą ļšąāī ķą īįšąēīāąķčå.').
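The garbled string is what you get when the cp1251 bytes are decoded as cp1257, which can be checked without charset-normalizer at all. The sketch below uses only the standard codecs; the variable names are illustrative and not part of the library.

```python
# Sample sentence and garbled output, both copied from the report above.
text = 'Bсеки човек има право на образование.'
reported = 'Bńåźč ÷īāåź čģą ļšąāī ķą īįšąēīāąķčå.'

raw = text.encode('cp1251')

# Decoding with the wrong single-byte codec raises no error for these
# bytes; each byte is simply mapped to a different character.
mojibake = raw.decode('cp1257')

# Compare the result with the string quoted in the report.
print(mojibake == reported)

# The bytes are untouched: re-encoding with cp1257 gives them back, so the
# problem lies in which codec was chosen, not in the data.
roundtrip_ok = mojibake.encode('cp1257') == raw
print(roundtrip_ok)
```

Because both codecs accept every byte in this sample, a wrong guess produces no exception, only different text.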

I also tried the French phrase from https://charset-normalizer.readthedocs.io/en/latest/index.html#introduction:

my_byte_str = 'Bonjour, je suis à la recherche d\'une aide sur les étoiles'.encode('cp1252')

and from_bytes(my_byte_str).best() also reports the encoding as cp1257.

I have compiled the package for arm, aarch64 and x86_64 and I get the same results.
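To compare builds or versions, both samples can be run through one small harness. This is a minimal sketch: compare_detection, detect_with_charset_normalizer, SAMPLES and main are hypothetical names, and only from_bytes(...).best() and .encoding are taken from the report. It assumes the detector returns the Python codec name (for example cp1251), as the output quoted above suggests.

```python
import platform
from typing import Callable, Dict, List, Optional, Tuple

# Sample phrases from the report, keyed by the codec used to encode them.
SAMPLES: Dict[str, str] = {
    'cp1251': 'Bсеки човек има право на образование.',
    'cp1252': "Bonjour, je suis à la recherche d'une aide sur les étoiles",
}


def detect_with_charset_normalizer(payload: bytes) -> Optional[str]:
    """Return the best-guess encoding, or None if nothing was detected."""
    # Imported lazily so the rest of the harness works without the package.
    from charset_normalizer import from_bytes
    best = from_bytes(payload).best()
    return None if best is None else best.encoding


def compare_detection(
    samples: Dict[str, str],
    detect: Callable[[bytes], Optional[str]],
) -> List[Tuple[str, Optional[str], bool]]:
    """Encode each sample, run the detector, and record whether it matched."""
    rows = []
    for codec, text in samples.items():
        detected = detect(text.encode(codec))
        rows.append((codec, detected, detected == codec))
    return rows


def main() -> None:
    import charset_normalizer
    # Report version and architecture alongside the detection results.
    print(charset_normalizer.__version__, platform.machine())
    for row in compare_detection(SAMPLES, detect_with_charset_normalizer):
        print(row)
```

Calling main() on each target prints one row per sample as (expected, detected, matched), which makes it easy to paste results for 3.3.0 and 3.3.1 side by side.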

Metadata

Labels: bug (Something isn't working), detection (Related to the charset detection mechanism, chaos/mess/coherence)
