Describe the bug
After upgrading from version 3.2.0 to either 3.3.0 or 3.3.1, I noticed a huge increase in memory usage. Running from_bytes() on a 25 MB file now results in almost 3 GB of memory being used.
To Reproduce
Run this file, placed inside the charset_normalizer folder, with the scalene memory profiler (Linux/WSL):
"""
Run from the project root:
poetry run python3 -m scalene charset_normalizer/memory_profile_test.py
or (with an activated virtual environment)
pip install scalene
scalene charset_normalizer/memory_profile_test.py
"""
from charset_normalizer.api import from_bytes
file_name = "data/memory_profile_test.txt"
with open(file_name, "rb") as file:
data = file.read()
result = from_bytes(data)
best = result.best()
print(f"{best=}")
Data file used (25 MB), placed in the data folder: memory_profile_test.txt
Profiler result (download and view in browser):
profile_charset_normalizer_3.3.1.html
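For a quicker sanity check without scalene, the peak can also be observed with the standard library's tracemalloc. This is only a rough sketch: tracemalloc tracks Python-level allocations only, so scalene's numbers are more complete.

# Rough check of peak Python-level allocations during from_bytes().
import tracemalloc

from charset_normalizer.api import from_bytes

with open("data/memory_profile_test.txt", "rb") as file:
    data = file.read()

tracemalloc.start()
result = from_bytes(data)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"peak allocations during from_bytes(): {peak / 1024 ** 2:.0f} MiB")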
Expected behaviour
Expected the function to use only slightly more memory than the file I passed into from_bytes().
Testing Environment
- OS: Ubuntu on WSL
- Python version: 3.11.6
- Package version: 3.3.0 and 3.3.1
Additional context
We use charset-normalizer in our program, which runs in containers with strict memory limits. We noticed the change in behaviour after our pods were killed for running Out Of Memory (OOM).
Doing some debugging, it seems that the increase in memory consumption comes from storing the decoded_payload in the CharsetMatch().
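To illustrate where the memory goes, the sketch below sums up the decoded text held by each candidate match; str(match) returns the decoded payload for that candidate. This only illustrates the suspected cause, it is not a measurement of the library's internals:

# Illustration: how much decoded text the candidate matches keep alive.
import sys

from charset_normalizer.api import from_bytes

with open("data/memory_profile_test.txt", "rb") as file:
    data = file.read()

results = from_bytes(data)
print(f"raw input: {sys.getsizeof(data) / 1024 ** 2:.1f} MiB")
for match in results:
    # str(match) yields the decoded payload for this candidate encoding.
    decoded = str(match)
    print(f"{match.encoding}: decoded payload ~{sys.getsizeof(decoded) / 1024 ** 2:.1f} MiB")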
Finally
A big thank you to the authors and maintainers! This library is much needed, used and appreciated!