
[BUG] Memory usage increase for big files  #376

@josteinl

Describe the bug
After upgrading from version 3.2.0 to either 3.3.0 or 3.3.1, I noticed a huge increase in memory usage. Running from_bytes() on a 25 MB file now results in almost 3 GB of memory usage.

To Reproduce
Run this file, placed inside the charset_normalizer folder, with the scalene memory profiler (Linux/WSL):

memory_profile_test.py:

"""
Run from the project root:

    poetry run python3 -m scalene charset_normalizer/memory_profile_test.py

or (with an activated virtual environment)

    pip install scalene
    scalene charset_normalizer/memory_profile_test.py
"""

from charset_normalizer.api import from_bytes

file_name = "data/memory_profile_test.txt"

with open(file_name, "rb") as file:
    data = file.read()  # the 25 MB sample is loaded fully into memory
    result = from_bytes(data)  # memory usage spikes here
    best = result.best()
    print(f"{best=}")

Data file used (25 MB), placed in the data folder:
memory_profile_test.txt

Profiler result (download and view in browser):
profile_charset_normalizer_3.3.1.html
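
For a quick check without scalene, the standard library's tracemalloc tells the same story (a rough sketch; tracemalloc only counts Python-level allocations, so the peak will not match scalene's RSS figure exactly):

import tracemalloc

from charset_normalizer.api import from_bytes

with open("data/memory_profile_test.txt", "rb") as file:
    data = file.read()

tracemalloc.start()
result = from_bytes(data)
best = result.best()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"{best=}")
print(f"peak traced memory: {peak / 2**20:.0f} MiB")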

Expected behaviour
Expected the function to use only slightly more memory than the file passed into from_bytes().

Testing Environment

  • OS: Ubuntu on WSL
  • Python version: 3.11.6
  • Package version: 3.3.0 / 3.3.1

Additional context
We use charset-normalizer in a program that runs in containers with strict memory limits. We noticed the change in behaviour after our pods were killed for running Out Of Memory (OOM).

After some debugging, it seems that the increase in memory consumption comes from storing the decoded_payload on each CharsetMatch.
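
A rough way to see this, using only the public API (iterating the CharsetMatches result, where str(match) yields the decoded text for that candidate encoding):

import sys

from charset_normalizer import from_bytes

with open("data/memory_profile_test.txt", "rb") as file:
    data = file.read()

results = from_bytes(data)

# Each retained candidate carries its own decoded copy of the payload, so
# memory grows with both the file size and the number of candidates kept.
# (Calling str() here forces the decode, which is also what inflates memory.)
for match in results:
    decoded = str(match)
    print(f"{match.encoding}: {sys.getsizeof(decoded) / 2**20:.1f} MiB")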

Finally
A big thank you to the authors and maintainers! This library is much needed, used and appreciated!
