-
-
Notifications
You must be signed in to change notification settings - Fork 11.3k
Description
We have a NPZ file storing a very large number of arrays O(100K) to O(1M). We access these arrays using f[key]
syntax. Unfortunately, the code in __item__
first looks up the key in the list containing the items inside the zip file. This makes what should be a O(1) access a O(n) access. If we're iterating through all elements of the file using keys, this is accidentally quadratic. On my decade-old laptop, looking through a list containing 50K elements this way takes more than a minute. For 1M elements, on a more modern computer, the time can be up to 15 hours.
The code is here:
numpy/numpy/lib/_npyio_impl.py
Line 252 in 3363b38
if key in self._files: |
I'm not sure how to address this. files
is part of the external API and a list, so changing that might be impossible. However, _files
is internal and could become a set. In the common(?) case that npz
files mostly contain npy
files, this would ameliorate the situation quite a bit.