Skip to content

[BUG] Incorrect detections #633

@gaborbernat

Description

@gaborbernat

To Reproduce

from __future__ import annotations

from pathlib import Path
from tempfile import NamedTemporaryFile

from charset_normalizer import from_path

for encoding in ("utf-8", "utf-16", "utf-32"):
    print(f"Testing with encoding: {encoding}")  # noqa: T201
    before = "▶hello"
    with NamedTemporaryFile() as filename:
        path = Path(filename.name)
        path.write_text(before, encoding=encoding)

        print(f"Correct: {before}=={path.read_text(encoding)}")  # noqa: T201

        result = from_path(path)
        print(result)  # noqa: T201
        best = result.best()
        if best is None:
            print("No best encoding found")
            continue
        enc = best.encoding
        print(f"charset_normalizer: {before}=={path.read_text(enc)} [detected {enc}]")  # noqa: T201

Expected behavior
Correct: ▶hello==▶hello

Logs

Testing with encoding: utf-8
Correct: ▶hello==▶hello
<charset_normalizer.models.CharsetMatches object at 0x1053e4d70>
charset_normalizer: ▶hello==離梶汥潬 [detected utf_16_le]
Testing with encoding: utf-16
Correct: ▶hello==▶hello
<charset_normalizer.models.CharsetMatches object at 0x1054e9bd0>
No best encoding found
Testing with encoding: utf-32
Correct: ▶hello==▶hello
<charset_normalizer.models.CharsetMatches object at 0x1053b1a70>
No best encoding found

Desktop (please complete the following information):

  • OS: macos
  • Python version 3.13
  • Package version 3.42

PS. utf-16 or utf-32 returns no matches at all.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinghelp wantedExtra attention is needed

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions