Skip to content

Accessing items in NpzFile using [] is accidentally linear #29081

@sree314

Description

@sree314

We have a NPZ file storing a very large number of arrays O(100K) to O(1M). We access these arrays using f[key] syntax. Unfortunately, the code in __item__ first looks up the key in the list containing the items inside the zip file. This makes what should be a O(1) access a O(n) access. If we're iterating through all elements of the file using keys, this is accidentally quadratic. On my decade-old laptop, looking through a list containing 50K elements this way takes more than a minute. For 1M elements, on a more modern computer, the time can be up to 15 hours.

The code is here:

if key in self._files:

I'm not sure how to address this. files is part of the external API and a list, so changing that might be impossible. However, _files is internal and could become a set. In the common(?) case that npz files mostly contain npy files, this would ameliorate the situation quite a bit.

Metadata

Metadata

Assignees

No one assigned

    Labels

    00 - BugsprintableIssue fits the time-frame and setting of a sprint

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions